Returning the 10 most frequently used words in a document in O(n)

How can I create an algorithm that returns the 10 most frequently used words in a document in O(n) time? Extra space may be used.

I can parse the document and put the words in a hash map with their counts. But then I would have to sort the values to get the most frequent ones. Also, that would require a reverse mapping of value -> key, which cannot be a plain map since the values can repeat.

So how can I solve this?

+4
5 answers

This can be done in O(n) if you use the correct data structure.

Consider a Node consisting of two things:

  • A counter (initially set to 0).
  • An array of 255 (or however many characters you support) pointers to Node. Initially, all pointers are set to NULL.

Create the root node. Define a "current" Node pointer, initially set to the root node. Then go through all the characters of the document and do the following:

  • If the next character is not a space, pick the appropriate pointer from the current node's array. If it is NULL, allocate a new Node there. Update the "current" Node pointer to that child.
  • If it is a space (or any other word delimiter), increment the counter of the "current" Node, then reset the "current" Node pointer to the root node.

This way you build a tree in O(n). Each element (inner node and leaf alike) designates a specific word along with its counter.

Then traverse the tree to find the node with the largest counter. This is also O(n), since the number of elements in the tree is at most O(n).

Update:

The last step is optional. In fact, the most common word can be tracked while the characters are being processed. Here is the pseudocode:

struct Node {
    size_t m_Counter;      // how many times the word ending at this node occurred
    Node*  m_ppNext[255];  // child pointers, one per character value
    Node*  m_pPrev;        // parent node, used to walk the word back

    Node(Node* pPrev) : m_Counter(0), m_pPrev(pPrev) {
        memset(m_ppNext, 0, sizeof(m_ppNext));
    }
    ~Node() {
        for (size_t i = 0; i < _countof(m_ppNext); i++)
            if (m_ppNext[i])
                delete m_ppNext[i];
    }
};

Node root(NULL);
Node* pPos  = &root;  // the "current" node
Node* pBest = &root;  // node of the most common word seen so far
char c;

while (0 != (c = GetNextDocumentCharacter())) {
    if (c == ' ') {
        if (pPos != &root) {  // ignore runs of delimiters
            pPos->m_Counter++;
            if (pBest->m_Counter < pPos->m_Counter)
                pBest = pPos;
            pPos = &root;
        }
    } else {
        // Cast to unsigned char so negative char values cannot index out of range.
        Node*& pNext = pPos->m_ppNext[(unsigned char)c - 1];
        if (!pNext)
            pNext = new Node(pPos);
        pPos = pNext;
    }
}
// pBest now points to the node of the most common word. Using pBest->m_pPrev
// we iterate through its characters in reverse order.
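
A usage note on that last comment: a Node does not record which character it represents, so reading the word back out of pBest takes one extra step. Below is a minimal sketch under the Node layout above; WordAt is a hypothetical helper, not part of the original answer. It walks the parent links and, at each step, searches the parent's pointer array for the current node.

#include <algorithm>
#include <string>

// Recover the word whose trie node is pNode by walking parent links;
// at each step, find which slot of the parent points at the current node.
std::string WordAt(const Node* pNode, const Node* pRoot) {
    std::string word;
    while (pNode != pRoot) {
        const Node* pParent = pNode->m_pPrev;
        for (size_t i = 0; i < 255; i++) {
            if (pParent->m_ppNext[i] == pNode) {
                word.push_back((char)(i + 1));  // inverse of the (c - 1) indexing
                break;
            }
        }
        pNode = pParent;
    }
    std::reverse(word.begin(), word.end());  // characters were collected last-to-first
    return word;
}

Storing the character inside each Node would avoid the inner scan, at the cost of one extra byte per node.
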
+2

Here's a simple algorithm:

  • Read the document one word at a time. O(n)
  • Build a HashTable from the words. O(n)
    • Use the word as the key. O(1)
    • Use the number of times you have seen the word as the value. O(1)
    • (i.e., when a key is first added to the HashTable, its value is 1; if the key is already in the HashTable, increment its associated value by 1.) O(1)
  • Create a pair of arrays of size 10 (for example, String words[10] / int counts[10], or use Pairs); use this pair to track the 10 most frequent words and their counts in the next step. O(1)
  • Iterate through the completed HashTable once: O(n)
    • If the current word has a higher count than an entry in the pair of arrays, insert it at that entry's position and shift everything below down by one slot (shown concretely in the sketch at the end of this answer). O(1)
  • Print the pair of arrays. O(1)

O(n) runtime.

O(n) storage for the HashTable + arrays.

(Side note: you can think of a HashTable as just a dictionary: a way to store key/value pairs where the keys are unique. Technically, in Java a HashMap allows unsynchronized access, while a Hashtable is synchronized.)
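
A minimal C++ sketch of this recipe, assuming whitespace-delimited words and using std::unordered_map in place of a Java Hashtable (the function and variable names are illustrative):

#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Count words in a hash map, then keep a fixed-size sorted array of the
// 10 best entries while scanning the map once.
std::vector<std::pair<std::string, int>> Top10(const std::string& doc) {
    std::unordered_map<std::string, int> counts;  // word -> occurrences, O(n) to build
    std::istringstream in(doc);
    std::string word;
    while (in >> word)
        ++counts[word];

    std::vector<std::pair<std::string, int>> top;  // at most 10 entries, sorted descending
    for (const auto& kv : counts) {                // one pass over the unique words
        // Find the insertion slot; with a constant capacity of 10,
        // this inner work is O(1) per word.
        auto pos = std::find_if(top.begin(), top.end(),
            [&](const auto& e) { return kv.second > e.second; });
        top.insert(pos, {kv.first, kv.second});
        if (top.size() > 10)
            top.pop_back();                        // drop the 11th entry
    }
    return top;
}

int main() {
    for (const auto& [w, n] : Top10("the quick the lazy the dog the"))
        std::cout << w << ": " << n << '\n';
}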

+5

The fastest approach is to use a radix tree. You can store a word's count in its leaf of the radix tree. Keep a separate list of the 10 most frequent words and their occurrence counts, along with a variable that stores the threshold count needed to enter this list. Refresh this list as items are added to the tree (a sketch of the list maintenance follows below).
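
A minimal sketch of just that list-maintenance step, with the radix tree itself omitted; TopList and its fields are illustrative names, not anything from the answer:

#include <algorithm>
#include <string>
#include <vector>

// Keep the 10 most frequent words, plus the threshold count a word needs
// to beat to enter the list. The tree calls Update() whenever a word's
// counter changes.
struct TopList {
    std::vector<std::pair<std::string, size_t>> entries;  // sorted descending, size <= 10
    size_t threshold = 0;  // smallest count currently needed to enter

    void Update(const std::string& word, size_t count) {
        if (entries.size() == 10 && count <= threshold)
            return;  // cannot enter the list
        // Drop a stale entry for this word, if present.
        entries.erase(std::remove_if(entries.begin(), entries.end(),
            [&](const auto& e) { return e.first == word; }), entries.end());
        entries.emplace_back(word, count);
        std::sort(entries.begin(), entries.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
        if (entries.size() > 10)
            entries.pop_back();
        if (entries.size() == 10)
            threshold = entries.back().second;
    }
};

Since the list holds at most 10 entries, each Update() call is O(1), so the whole pass stays O(n).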

+2

Maintaining a map of (word, count) pairs will be O(n).

Once the map is built, iterate over the keys and pick the ten most frequent ones (as in the sketch below).

O(n) + O(n)
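
A sketch of that second pass, assuming the counts already live in a std::unordered_map; since k = 10 is a constant, std::partial_sort keeps this O(n):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Copy the (word, count) pairs into a vector and partially sort it so
// the ten largest counts come first.
std::vector<std::pair<std::string, int>> TopTen(
        const std::unordered_map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> v(counts.begin(), counts.end());
    size_t k = std::min<size_t>(10, v.size());
    std::partial_sort(v.begin(), v.begin() + k, v.end(),
        [](const auto& a, const auto& b) { return a.second > b.second; });
    v.resize(k);  // keep only the top ten
    return v;
}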

“But I'm not quite happy with this solution because of the extra amount of memory it uses.”

0

I would use an ArrayList and a HashTable.

Here is the algorithm I have in mind:

Loop through all the words in the document:
    if (HashTable.contains(word)) {
        increment count for that word in the HashTable;
    } else {
        ArrayList.add(word);
        HashTable.add(word);  // word count in HashTable = 1
    }

After looping through the whole document:

Loop through ArrayList<word>:
    retrieve the word count for that word from the HashTable;
    keep a running list of the top 10 words;

Runtime should be O(n) to build the HashTable and ArrayList. Building the top-10 list should be O(m), where m is the number of unique words. That gives O(n + m), and since n ≥ m, this is O(n).

0

Source: https://habr.com/ru/post/1442191/

