Implement T9 Text Prediction

I have a T9 dictionary in memory (trie / hash_map). A dictionary contains pairs of word-words, therefore, when a word is selected from the dictionary, its rating increases, and a pair of word-words increases in the list of words.

Let's say that there is a way to select a word from the dictionary. This method also performs some word evaluation procedure.

At the input, I have a string of numbers (1-9, '*' to change the word and '') that were pressed on the phone.

Questions:

  • Is there an algorithm for parsing a string quickly?
  • What data structure will be good there?

UPD:

Full text of the problem (problem D)

Hash_map implementation

Trie implementation

+4
source share
1 answer

One option that I think would be especially effective would be to trie the preprocess into a modified structure specifically optimized for predicting words based on keystrokes.

Intuitively, the new structure is three created from possible numbers that could be pressed anywhere. Each trie node then stores a priority queue of words that could potentially be written using these numbers. You can then predict which words to use using the following algorithm:

  • Start with the trie root.
  • For each digit, follow the pointer corresponding to the digit.
  • If you go away from trie, then there are no offers.
  • Otherwise, look at the priority queue for words that can be formed from these numbers, and then suggest an element in this priority queue with a maximum usage counter.

This algorithm takes O (n + T max ) time, where n is the length of the string of digits, and T max is the time required to search for the most popular word for a given prefix. It could be O (1) for something like a binary heap, but it might be slower for more complex heaps.

To create this data structure, you must pre-process the original word / frequency list as follows. For each word, define a series of numbers corresponding to its letters. For example, the word "monsoon" will be 6667666, and the word "apple" will be 27753. Then insert this sequence into the number trie, creating new nodes as needed. When you reach the final node corresponding to this digital line, insert the word in the corresponding priority queue for this node. The total time for this operation is actually quite good; given the list of words with a total number of n letters in it, this can be done in O (n T insert ) time, where T insert is the time required to insert a word into the priority queue. This is because we need to visit each letter no more than three times - once to convert it to a digit, once to follow the corresponding edge in the trie digit, and once to insert it into the priority queue.

This approach also makes it easy to easily insert new words into the predictor. Whenever a user deletes a trie or selects a word that is not number one, you can simply insert that word into a trie by creating the appropriate series of nodes in trie, then inserting the word into the correct priority of the queue.

If you make priority queues from a balanced binary search tree that stores a pointer to its smallest element, you can search for the fastest word for a series of n digits in O (n) time and then all other words with this prefix can be written off in depreciated O (1) times apiece. You can also easily insert or delete new words this way.

Since the words are stored in trie, you can in O (1) update your assumption about the current word by simply moving from the current trie node down to the child element specified by the current number. When the user presses the space bar, you simply reset back to the root of the trie digit. Finally, when a user clicks on a star, you can use the aforementioned trick to just take a look at the next most popular word for a given trie node.

Hope this helps!

+8
source

Source: https://habr.com/ru/post/1369214/


All Articles