Returning the 10 most frequently used words in a document in O(n)

How can I create an algorithm that returns the 10 most frequently used words in a document in O(n) time? Extra space may be used.

I can parse the document and put the words in a hash map with their counts. But then I would have to sort the values to get the most frequent ones. Also, that would require a reverse mapping of value -> key, which cannot be a plain map since the values can repeat.

So how can I solve this?

+4
5 answers

This can be done in O(n) if you use the correct data structure.

Consider a Node consisting of two things:

  • A counter (initially set to 0).
  • An array of 255 (or however many characters you support) pointers to Node. Initially, all pointers are set to NULL.

Create the root node. Define a "current" Node pointer, initially set to the root node. Then go through all the characters of the document and do the following:

  • If the next character is not a space, pick the appropriate pointer from the current node's array. If it is NULL, allocate a new Node there. Update the "current" Node pointer to that child.
  • If it is a space (or any other word delimiter), increment the counter of the "current" Node, then reset the "current" Node pointer to the root node.

This way you build a tree in O(n). Each element (inner node and leaf alike) designates a specific word along with its counter.

Then traverse the tree to find the node with the largest counter. This is also O(n), since the number of elements in the tree is at most O(n).

Update:

The last step is optional. In fact, the most common word can be tracked while the characters are being processed. Here is the pseudocode:

struct Node {
    size_t m_Counter;      // how many times the word ending at this node occurred
    Node*  m_ppNext[255];  // child pointers, one per character value
    Node*  m_pPrev;        // parent node, used to walk the word back

    Node(Node* pPrev) : m_Counter(0), m_pPrev(pPrev) {
        memset(m_ppNext, 0, sizeof(m_ppNext));
    }
    ~Node() {
        for (size_t i = 0; i < _countof(m_ppNext); i++)
            if (m_ppNext[i])
                delete m_ppNext[i];
    }
};

Node root(NULL);
Node* pPos  = &root;  // the "current" node
Node* pBest = &root;  // node of the most common word seen so far
char c;

while (0 != (c = GetNextDocumentCharacter())) {
    if (c == ' ') {
        if (pPos != &root) {  // ignore runs of delimiters
            pPos->m_Counter++;
            if (pBest->m_Counter < pPos->m_Counter)
                pBest = pPos;
            pPos = &root;
        }
    } else {
        // Cast to unsigned char so negative char values cannot index out of range.
        Node*& pNext = pPos->m_ppNext[(unsigned char)c - 1];
        if (!pNext)
            pNext = new Node(pPos);
        pPos = pNext;
    }
}
// pBest now points to the node of the most common word. Using pBest->m_pPrev
// we iterate through its characters in reverse order.
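
A usage note on that last comment: a Node does not record which character it represents, so reading the word back out of pBest takes one extra step. Below is a minimal sketch under the Node layout above; WordAt is a hypothetical helper, not part of the original answer. It walks the parent links and, at each step, searches the parent's pointer array for the current node.

#include <algorithm>
#include <string>

// Recover the word whose trie node is pNode by walking parent links;
// at each step, find which slot of the parent points at the current node.
std::string WordAt(const Node* pNode, const Node* pRoot) {
    std::string word;
    while (pNode != pRoot) {
        const Node* pParent = pNode->m_pPrev;
        for (size_t i = 0; i < 255; i++) {
            if (pParent->m_ppNext[i] == pNode) {
                word.push_back((char)(i + 1));  // inverse of the (c - 1) indexing
                break;
            }
        }
        pNode = pParent;
    }
    std::reverse(word.begin(), word.end());  // characters were collected last-to-first
    return word;
}

Storing the character inside each Node would avoid the inner scan, at the cost of one extra byte per node.
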
+2

Here's a simple algorithm:

  • Read the document one word at a time. O(n)
  • Build a HashTable from the words. O(n)
    • Use the word as the key. O(1)
    • Use the number of times you have seen the word as the value. O(1)
    • (i.e., when a key is first added to the HashTable, its value is 1; if the key is already in the HashTable, increment its associated value by 1.) O(1)
  • Create a pair of arrays of size 10 (for example, String words[10] / int counts[10], or use Pairs); use this pair to track the 10 most frequent words and their counts in the next step. O(1)
  • Iterate through the completed HashTable once: O(n)
    • If the current word has a higher count than an entry in the pair of arrays, insert it at that entry's position and shift everything below down by one slot (shown concretely in the sketch at the end of this answer). O(1)
  • Print the pair of arrays. O(1)

O(n) runtime.

O(n) storage for the HashTable + arrays.

(Side note: you can think of a HashTable as just a dictionary: a way to store key/value pairs where the keys are unique. Technically, in Java a HashMap allows unsynchronized access, while a Hashtable is synchronized.)
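
A minimal C++ sketch of this recipe, assuming whitespace-delimited words and using std::unordered_map in place of a Java Hashtable (the function and variable names are illustrative):

#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Count words in a hash map, then keep a fixed-size sorted array of the
// 10 best entries while scanning the map once.
std::vector<std::pair<std::string, int>> Top10(const std::string& doc) {
    std::unordered_map<std::string, int> counts;  // word -> occurrences, O(n) to build
    std::istringstream in(doc);
    std::string word;
    while (in >> word)
        ++counts[word];

    std::vector<std::pair<std::string, int>> top;  // at most 10 entries, sorted descending
    for (const auto& kv : counts) {                // one pass over the unique words
        // Find the insertion slot; with a constant capacity of 10,
        // this inner work is O(1) per word.
        auto pos = std::find_if(top.begin(), top.end(),
            [&](const auto& e) { return kv.second > e.second; });
        top.insert(pos, {kv.first, kv.second});
        if (top.size() > 10)
            top.pop_back();                        // drop the 11th entry
    }
    return top;
}

int main() {
    for (const auto& [w, n] : Top10("the quick the lazy the dog the"))
        std::cout << w << ": " << n << '\n';
}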

+5

The fastest approach is to use a radix tree. You can store a word's count in its leaf of the radix tree. Keep a separate list of the 10 most frequent words and their occurrence counts, along with a variable that stores the threshold count needed to enter this list. Refresh this list as items are added to the tree (a sketch of the list maintenance follows below).
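
A minimal sketch of just that list-maintenance step, with the radix tree itself omitted; TopList and its fields are illustrative names, not anything from the answer:

#include <algorithm>
#include <string>
#include <vector>

// Keep the 10 most frequent words, plus the threshold count a word needs
// to beat to enter the list. The tree calls Update() whenever a word's
// counter changes.
struct TopList {
    std::vector<std::pair<std::string, size_t>> entries;  // sorted descending, size <= 10
    size_t threshold = 0;  // smallest count currently needed to enter

    void Update(const std::string& word, size_t count) {
        if (entries.size() == 10 && count <= threshold)
            return;  // cannot enter the list
        // Drop a stale entry for this word, if present.
        entries.erase(std::remove_if(entries.begin(), entries.end(),
            [&](const auto& e) { return e.first == word; }), entries.end());
        entries.emplace_back(word, count);
        std::sort(entries.begin(), entries.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
        if (entries.size() > 10)
            entries.pop_back();
        if (entries.size() == 10)
            threshold = entries.back().second;
    }
};

Since the list holds at most 10 entries, each Update() call is O(1), so the whole pass stays O(n).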

+2

Maintaining a map of (word, count) pairs will be O(n).

Once the map is built, iterate over the keys and pick the ten most frequent ones (as in the sketch below).

O(n) + O(n)
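
A sketch of that second pass, assuming the counts already live in a std::unordered_map; since k = 10 is a constant, std::partial_sort keeps this O(n):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Copy the (word, count) pairs into a vector and partially sort it so
// the ten largest counts come first.
std::vector<std::pair<std::string, int>> TopTen(
        const std::unordered_map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> v(counts.begin(), counts.end());
    size_t k = std::min<size_t>(10, v.size());
    std::partial_sort(v.begin(), v.begin() + k, v.end(),
        [](const auto& a, const auto& b) { return a.second > b.second; });
    v.resize(k);  // keep only the top ten
    return v;
}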

“But I'm not quite happy with this solution because of the extra amount of memory it uses.”

0

I would use an ArrayList and a HashTable.

Here is the algorithm I have in mind:

Loop through all the words in the document:
    if (HashTable.contains(word)) {
        increment count for that word in the HashTable;
    } else {
        ArrayList.add(word);
        HashTable.add(word);  // word count in HashTable = 1
    }

After looping through the whole document:

Loop through ArrayList<word>:
    retrieve the word count for that word from the HashTable;
    keep a running list of the top 10 words;

Runtime should be O(n) to build the HashTable and ArrayList. Building the top-10 list should be O(m), where m is the number of unique words. That gives O(n + m), and since n ≥ m, this is O(n).

0

Source: https://habr.com/ru/post/1442191/

