Algorithm: the best way to calculate word list frequencies

This question is actually quite simple, but I would like to hear some ideas before jumping into coding. Given a file with one word per line, compute the n most frequent words.

The first and, unfortunately, only thing that comes to my mind is to use std::map . I know that fellow C++ developers will say that unordered_map would be more sensible.

I would like to know whether anything can be added on the algorithm side, or whether it is mostly a matter of choosing the best data structure. I searched the Internet and read that a hash table combined with a priority queue can give an O(n) algorithm, but I assume it would be difficult to implement.

Any ideas?

+6
5 answers

The best data structure to use for this task is a trie:

http://en.wikipedia.org/wiki/Trie

It will outperform a hash table for counting strings.
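
For illustration, a minimal sketch of a counting trie in C++ (the node layout, the lowercase a-z assumption, and the sample words are mine, not part of this answer):

    #include <array>
    #include <memory>
    #include <string>

    // One node per character; 'count' is how many words end at this node.
    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child{};
        int count = 0;
    };

    void insert(TrieNode& root, const std::string& word) {
        TrieNode* node = &root;
        for (char c : word) {
            int i = c - 'a';  // assumes lowercase a-z input
            if (!node->child[i])
                node->child[i] = std::make_unique<TrieNode>();
            node = node->child[i].get();
        }
        ++node->count;  // one more occurrence of this word
    }

    int main() {
        TrieNode root;
        for (std::string w : {"the", "cat", "the"})
            insert(root, w);
        // Walking the trie (not shown) yields each distinct word with its frequency.
    }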

+5

If you are only interested in the top N most popular words, and you do not need the result to be exact, then there is a very clever structure you can use. I heard about it via Udi Manber; it works as follows:

You create an array of N elements, where each element keeps track of a value and a count, and you also keep a cursor (an index) into this array. In addition, you have a map from value to index in the array. Every time you update the structure with a value (for example, a word from a stream of text), you first check the map to see whether that value is already in the array; if it is, you increment the count for that value. If it is not, you decrement the count of the element the cursor points to and then advance the cursor.

It sounds simple, and nothing about the algorithm suggests it would yield anything useful, but for typical real data it tends to do very well. Normally, if you want to keep track of the top N things, you would build this structure with a capacity of 10*N, since it will contain many empty values. Using the King James Bible as input, here is what this structure reports as the most common words (in no particular order):

    0: in
    1: And
    2: shall
    3: of
    4: that
    5: to
    6: he
    7: and
    8: the
    9: I

And here are the ten most common words (in order):

    0: the, 62600
    1: and, 37820
    2: of, 34513
    3: to, 13497
    4: And, 12703
    5: in, 12216
    6: that, 11699
    7: he, 9447
    8: shall, 9335
    9: unto, 8912

You can see that it got 9 of the top 10 words, and it did so using space for only 50 elements. Depending on your use case, the savings in space can be very useful. It is also very fast.

Here is the topN implementation I used, written in Go:

    type Event string

    type TopN struct {
        events  []Event
        counts  []int
        current int
        mapped  map[Event]int
    }

    func makeTopN(N int) *TopN {
        return &TopN{
            counts:  make([]int, N),
            events:  make([]Event, N),
            current: 0,
            mapped:  make(map[Event]int, N),
        }
    }

    func (t *TopN) RegisterEvent(e Event) {
        if index, ok := t.mapped[e]; ok {
            t.counts[index]++
        } else {
            if t.counts[t.current] == 0 {
                t.counts[t.current] = 1
                t.events[t.current] = e
                t.mapped[e] = t.current
            } else {
                t.counts[t.current]--
                if t.counts[t.current] == 0 {
                    delete(t.mapped, t.events[t.current])
                }
            }
        }
        t.current = (t.current + 1) % len(t.counts)
    }
+2

There are many different approaches to this problem. The right one ultimately depends on the scenario and on other factors such as file size (if the file has a billion lines, a HashMap alone would not be an efficient way to do it). Here is what you can do depending on your problem:

  • If you know that the number of unique words is very limited, you can use a TreeMap, or in your case std::map .
  • If the number of words is very large, you can build a trie and keep the counts of the distinct words in a separate data structure. That structure can be a heap (min or max, depending on what you want to do) of size n, so you do not need to store all the words, only the top n; a rough sketch follows below.
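
A sketch of the bounded-heap idea (counting with std::unordered_map instead of a trie purely to keep the example short; the value of n and reading words from standard input are assumptions):

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
        std::unordered_map<std::string, long> freq;
        std::string word;
        while (std::cin >> word)  // one word per line
            ++freq[word];

        const std::size_t n = 10;  // number of top words wanted
        using Entry = std::pair<long, std::string>;
        // Min-heap of size n: the least frequent of the current top n sits on top.
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (const auto& [w, c] : freq) {
            heap.emplace(c, w);
            if (heap.size() > n)
                heap.pop();  // evict the least frequent of the n+1 candidates
        }
        while (!heap.empty()) {  // printed from least to most frequent
            std::cout << heap.top().second << ' ' << heap.top().first << '\n';
            heap.pop();
        }
    }
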
+1

I would not start with std::map (or unordered_map ) if I had much choice (although I do not know what other constraints might apply).

You have two pieces of data here, and part of the time you use one of them as the key, while the rest of the time you use the other. For that, you probably want something like Boost Bimap, or perhaps Boost MultiIndex.

Here's a general idea using Bimap:

    #include <boost/bimap.hpp>
    #include <boost/bimap/list_of.hpp>
    #include <iostream>

    #define elements(array) ((sizeof(array)/sizeof(array[0])))

    class uint_proxy {
        unsigned value;
    public:
        uint_proxy() : value(0) {}
        uint_proxy& operator++() { ++value; return *this; }
        unsigned operator++(int) { return value++; }
        operator unsigned() const { return value; }
    };

    int main() {
        int b[]={2,4,3,5,2,6,6,3,6,4};

        boost::bimap<int, boost::bimaps::list_of<uint_proxy> > a;

        // walk through array, counting how often each number occurs:
        for (int i=0; i<elements(b); i++)
            ++a.left[b[i]];

        // print out the most frequent:
        std::cout << a.right.rbegin()->second;
    }

At the moment this only prints the single most frequent number, but iterating N times to print the N most frequent is fairly trivial.

+1

Given a file with a word on each line, compute the n most frequent words .... I searched the Internet and read that a hash table and a priority queue can provide an O(n) algorithm

If you meant that the two n's are the same (the n in O(n) and the n counting the most frequent words), then no, that is not possible. However, if you simply meant time linear in the size of the input file, then a trivial implementation with a hash table will do what you want.
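
A minimal sketch of that trivial hash-table version (the file-name handling and the choice of the top ten are assumptions):

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main(int argc, char** argv) {
        std::ifstream in(argc > 1 ? argv[1] : "words.txt");  // hypothetical default name
        std::unordered_map<std::string, long> freq;
        std::string word;
        while (in >> word)  // expected O(1) amortized per word, linear overall
            ++freq[word];

        std::vector<std::pair<std::string, long>> v(freq.begin(), freq.end());
        const std::size_t n = std::min<std::size_t>(10, v.size());
        // Order only the first n positions; everything after them stays unordered.
        std::partial_sort(v.begin(), v.begin() + n, v.end(),
                          [](const auto& a, const auto& b) { return a.second > b.second; });
        for (std::size_t i = 0; i < n; ++i)
            std::cout << v[i].first << ' ' << v[i].second << '\n';
    }

Counting is linear in the file size; selecting the top n afterwards only costs extra work proportional to the number of distinct words.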

There may be probabilistic approximate sublinear memory algorithms.

0

Source: https://habr.com/ru/post/913471/

