Divide the available memory into two halves. Use one half as a counting Bloom filter with 4-bit buckets, and the other half as a fixed-size hash table with counts. The role of the counting Bloom filter is to filter out rare words in a memory-efficient way.
Stream your 1 TB of words against the initially empty counting Bloom filter; if a word is already present and all of its buckets are at the maximum value of 15 (this may be partly or wholly a false positive), let it pass through. If not, add it, i.e. increment its buckets.
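A minimal sketch of such a counting Bloom filter with 4-bit buckets, two packed per byte. All names and sizes here (CountingBloomFilter, num_hashes, the blake2b-based index derivation) are illustrative assumptions of mine, not something prescribed by the approach above.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter with 4-bit saturating counters (two per byte)."""

    def __init__(self, num_counters, num_hashes=4):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.nibbles = bytearray((num_counters + 1) // 2)

    def _indexes(self, word):
        # Derive num_hashes bucket indexes from one strong hash of the word.
        digest = hashlib.blake2b(word.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_counters

    def _get(self, idx):
        byte = self.nibbles[idx // 2]
        return byte & 0x0F if idx % 2 == 0 else byte >> 4

    def _set(self, idx, value):
        byte = self.nibbles[idx // 2]
        if idx % 2 == 0:
            self.nibbles[idx // 2] = (byte & 0xF0) | value
        else:
            self.nibbles[idx // 2] = (byte & 0x0F) | (value << 4)

    def is_saturated(self, word):
        # True when every bucket for this word has reached the maximum of 15.
        return all(self._get(i) == 15 for i in self._indexes(word))

    def add(self, word):
        # Increment each of the word's buckets, saturating at 15.
        for i in self._indexes(word):
            v = self._get(i)
            if v < 15:
                self._set(i, v + 1)
```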
Words that pass through get counted in the hash table; for most words, that is every occurrence except the first 15 times you see them. A small percentage of words will start being counted even earlier, introducing a potential inaccuracy of up to 15 occurrences per word into your counts. That is a limitation of Bloom filters.
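A first-pass sketch under the same assumptions: `words` stands for an iterator over the input, `bloom` is the filter above, and a plain dict stands in for the fixed-size hash table occupying the other half of memory.

```python
def first_pass(words, bloom):
    approx_counts = {}  # stand-in for the fixed-size hash table
    for word in words:
        if bloom.is_saturated(word):
            # The word has (apparently) been seen 15 times already:
            # count this and all further occurrences in the hash table.
            approx_counts[word] = approx_counts.get(word, 0) + 1
        else:
            # Still rare so far: absorb this occurrence into the filter.
            bloom.add(word)
    return approx_counts
```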
When the first pass is complete, you can correct the inaccuracy with a second pass, if desired. Deallocate the Bloom filter, and also discard all counts that are not within 15 occurrences of the tenth most frequent word. Go through the input again, this time counting words exactly (using a separate hash table), but ignoring words that were not retained as approximate winners from the first pass.
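A sketch of that optional second pass, assuming (as in the question) that the goal is the top 10 words. Since a first-pass count can undercount the true count by at most 15, any word more than 15 occurrences behind the tenth-ranked candidate cannot overtake it, so only the remaining candidates are recounted exactly.

```python
def second_pass(words_again, approx_counts, top_n=10):
    ranked = sorted(approx_counts.values(), reverse=True)
    if len(ranked) > top_n:
        cutoff = ranked[top_n - 1] - 15   # 15 = maximum undercount from pass 1
        candidates = {w for w, c in approx_counts.items() if c >= cutoff}
    else:
        candidates = set(approx_counts)

    exact = {w: 0 for w in candidates}
    for word in words_again:              # re-read the input
        if word in exact:
            exact[word] += 1
    return sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```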
Notes
The hash table used in the first pass can theoretically overflow with certain statistical distributions of the input (for example, every word occurring exactly 16 times) or with extremely limited RAM. It is up to you to work out, or simply try, whether that can realistically happen to you or not.
Also note that the bucket width (4 bits in the description above) is just a parameter of the construction. A plain, non-counting Bloom filter (bucket width of 1) would filter out unique words nicely, but do nothing to filter out other very rare words. A wider bucket size will be more prone to cross-talk between words (because there will be fewer buckets), and it will also reduce the guaranteed accuracy level after the first pass (up to 15 missed occurrences in the 4-bit case). But these downsides will be quantitatively insignificant to some degree, while I imagine the more aggressive filtering effect is quite essential for keeping the hash table down to sub-gigabyte size with non-repetitive natural-language data.
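To make the trade-off concrete, a tiny illustration of the arithmetic only (the 512 MB figure is a made-up assumption, not a recommendation): for a per-bucket width of b bits, the pass-through threshold is 2^b - 1 and the number of buckets you can afford shrinks in proportion to b.

```python
filter_bytes = 512 * 1024 * 1024          # e.g. half of 1 GB of RAM
for b in (1, 2, 4, 8):
    buckets = filter_bytes * 8 // b
    print(f"width={b} bits -> {buckets:,} buckets, threshold={2**b - 1}")
```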
As for the order of magnitude of memory needed by the Bloom filter itself: these people work with well under 100 MB, and with a much more challenging application ("full" n-gram statistics, rather than thresholded 1-gram statistics).