The most common words in terabytes of data

I ran into a problem where I need to find the 10 most frequent words in a file or stream that is terabytes in size.

One solution I could think of was to use a hash table of (word, count) pairs together with a max-heap. But a problem may arise while inserting all the words if most of them are unique: the table will not fit in memory. I was also thinking of a Map-Reduce solution, splitting the data into chunks across different nodes. Another option would be to build a trie over all the words and update each word's count while scanning the file or stream.
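For illustration, here is a minimal sketch of that first option (hash table plus heap) in Python. It assumes the whole word table fits in memory, which is exactly the part I am worried about; "corpus.txt" is just a placeholder name.

```python
import heapq
from collections import Counter

def top_k_words(lines, k=10):
    # Hash table of (word, count); this is what blows up in memory
    # when the data contains mostly unique words.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # A size-k heap over the (word, count) pairs extracts the k most
    # frequent words without sorting the whole table.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

# Usage sketch:
# with open("corpus.txt") as f:
#     print(top_k_words(f))
```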

Which of the above options would be the best solution? I think the first solution is pretty naive.

+4
source
3 answers

Divide the available memory into two halves. Use one half as a counting Bloom filter with 4-bit buckets, and the other half as a fixed-size hash table with counts. The role of the counting Bloom filter is to filter out rarely occurring words with high memory efficiency.

Run your 1 TB of words through the initially empty Bloom filter; if a word is already in it and all of its buckets are at the maximum value of 15 (this may be partly or wholly a false positive), pass it through to the hash table. If not, add it to the filter (increment its buckets).

Words that pass through get counted; for the majority of words this is every time except for the first 15 times you see them. A small percentage of words will start being counted even sooner, introducing a potential inaccuracy of up to 15 occurrences per word into your results. That is a limitation of Bloom filters.
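A minimal sketch of that first pass, assuming Python. The bucket count and number of hash functions are made-up design parameters, the filter uses one byte per 4-bit counter for readability (a real implementation would pack two counters per byte), and a plain dict stands in for the fixed-size hash table.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 1 << 27   # ~128 MB of one-byte buckets; size to half your RAM in practice
NUM_HASHES = 4          # number of hash functions; also a made-up parameter
MAX_COUNT = 15          # saturation value of a 4-bit counter

buckets = bytearray(NUM_BUCKETS)   # the counting Bloom filter
approx_counts = defaultdict(int)   # the "other half": word -> approximate count

def bucket_indexes(word):
    # Derive NUM_HASHES bucket indexes from one strong hash of the word.
    digest = hashlib.blake2b(word.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * (i + 1)], "big") % NUM_BUCKETS
            for i in range(NUM_HASHES)]

def see_word(word):
    idx = bucket_indexes(word)
    if all(buckets[i] >= MAX_COUNT for i in idx):
        # Every bucket is saturated: the word has (apparently) been seen
        # 15 times already, so from now on it is counted in the hash table.
        approx_counts[word] += 1
    else:
        # Still considered rare: bump its saturating counters instead.
        for i in idx:
            if buckets[i] < MAX_COUNT:
                buckets[i] += 1
```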

When the first pass is over, you can correct the inaccuracy with a second pass if necessary. Deallocate the Bloom filter, and also deallocate all counts that are not within 15 occurrences of the tenth most frequent word. Go through the input again, this time counting words exactly (using a separate hash table), but ignoring words that were not retained as approximate winners from the first pass.
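A sketch of that optional second pass, assuming approx_counts is the dictionary filled during the first pass and that the input can be iterated over a second time.

```python
from collections import Counter

def second_pass(words, approx_counts, k=10, slack=15):
    # A first-pass count can be low by at most `slack` (15) occurrences, so any
    # word more than 15 below the k-th best approximate count cannot be a winner.
    ranked = sorted(approx_counts.values(), reverse=True)
    cutoff = (ranked[k - 1] if len(ranked) >= k else 0) - slack
    candidates = {w for w, c in approx_counts.items() if c >= cutoff}

    # Re-read the input and count only the retained candidates, exactly this time.
    exact = Counter(w for w in words if w in candidates)
    return exact.most_common(k)
```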

Notes

The hash table used in the first pass can theoretically overflow with certain statistical distributions of the input (for example, every word occurring exactly 16 times) or with extremely limited RAM. It is up to you to work out, or try out, whether that can realistically happen to you or not.

Also note that the bucket width (4 bits in the description above) is just a design parameter. A plain Bloom filter (bucket width 1) would nicely filter out words occurring only once, but do nothing to filter out other very rare words. A wider bucket is more prone to cross-talk between words (because there are fewer buckets) and also lowers the guaranteed accuracy level after the first pass (15 occurrences in the 4-bit case). But these downsides are quantitatively insignificant up to a point, while I imagine the more aggressive filtering effect is absolutely crucial for keeping the hash table below a gigabyte in size with non-repetitive natural-language data.

As for the order of magnitude of memory needed for the Bloom filter itself: these people are working below 100 MB, and with a much more demanding application ("full" n-gram statistics, rather than thresholded 1-gram statistics).

+4
source

Sort the terabyte file alphabetically using mergesort. In the initial pass, use quicksort with all available physical memory to pre-sort long runs of words.

While doing so, represent each continuous run of identical words as a single word plus a count. (That is, you add the counts up during the merges.)

Then sort the file a second time, again with mergesort plus quicksort pre-sorting, but this time by the counts rather than alphabetically.

This is slower but easier to implement than my other answer.
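A rough sketch of the run-generation and counting merge, assuming Python, with sorted() standing in for the in-memory quicksort and heapq.merge doing the k-way merge; the second sort by count is only indicated in a comment.

```python
import heapq
import itertools
import tempfile

def make_sorted_runs(words, run_size=10_000_000):
    """Pre-sort fixed-size runs in memory (the 'quicksort' pass) and spill them to disk."""
    words = iter(words)
    runs = []
    while True:
        run = sorted(itertools.islice(words, run_size))
        if not run:
            return runs
        spill = tempfile.TemporaryFile(mode="w+")
        spill.writelines(w + "\n" for w in run)
        spill.seek(0)
        runs.append(spill)

def merge_and_count(runs):
    """K-way merge of the sorted runs, collapsing equal words into (word, count) pairs."""
    streams = [(line.rstrip("\n") for line in run) for run in runs]
    for word, group in itertools.groupby(heapq.merge(*streams)):
        yield word, sum(1 for _ in group)

# The second pass would sort the (count, word) pairs produced above the same
# way (runs + merge), keyed by count, and the top 10 fall out at the end.
```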

+3
source

Best I could think of:

  • Divide the data into parts that you can hold in memory.
  • For each part, take its N most frequent words; that gives you N * partsNumber candidate words in total.
  • Read all the data again, this time counting only the words you collected in the previous step.

It will not always give you the correct answer, but it will work in fixed memory and linear time.
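A sketch of this two-pass idea, assuming the input can be iterated twice (for example by reopening the file) and that chunk_lines is a stand-in for "whatever fits in memory".

```python
from collections import Counter
from itertools import islice

def chunk_candidates(lines, n=10, chunk_lines=1_000_000):
    """Pass 1: collect the n most frequent words of every in-memory chunk."""
    lines = iter(lines)
    candidates = set()
    while True:
        chunk = list(islice(lines, chunk_lines))
        if not chunk:
            return candidates          # roughly n * number_of_parts words
        counts = Counter(w for line in chunk for w in line.split())
        candidates.update(w for w, _ in counts.most_common(n))

def approx_top_n(first_pass_lines, second_pass_lines, n=10):
    """Pass 2: count only the candidate words over the whole input."""
    candidates = chunk_candidates(first_pass_lines, n)
    counts = Counter(w for line in second_pass_lines
                     for w in line.split() if w in candidates)
    return counts.most_common(n)
```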

+2
source

Source: https://habr.com/ru/post/1435440/

