Probabilistic Counting in Python

Question

Probabilistic Counting in Python

I have a text file with a size of 50 GB of random strings from which I want to count the number of occurrences of a substring in this file .. many times for different non-predefined random substrings.

I was wondering if there is another approach to solving the problem.

probabilistic way

Something like a flowering filter, but instead of a probabilistic membership check, we may have a probabilistic count . This data structure will be used for count estimates .

Another statistical method (?)

Any dummy method that I could use to estimate the number of occurrences of a string in a text file? Discover alternatives.

It would be nice if this could be done in <= logarithmic time, since I will do the same task many times.

+5

python string algorithm probability stat

Retrocode Nov 11 '16 at 20:10

source share

2 answers

Ami tavory · Answer 1 · 2016-11-11T20:27:46+0000

Some streaming algorithms sound related to this problem, separately or in combination with each other.

The initial passage through the file may give the approximation of heavy hitters . Depending on your problem, perhaps the distribution of heavy hitters is enough for you, but this set is small enough to store in memory. If so, you can complete the second pass, counting only the heavy hitters from the first pass.
The count-min sketch data structure can do an approximate calculation. You can use this data structure yourself or use it to count the occurrences of heavy hitters.

Since this is tagged as Python:

Peter de Rivaz · Answer 2 · 2016-11-11T20:35:44+0000

You can calculate an array of entities for your file.

This array contains the starting positions of the suffixes in sorted order. With 50 GB of text, you could allocate 5 bytes for each position and end up with an array of suffix 5 * 50 = 250 GB. If this is too much, you can try a compressed array of suffixes .

This array can be calculated in O (n) (perhaps several hours with the appropriate algorithm, mainly limited by the speed of reading / writing to disk).

When you have an array, you can count the number of occurrences of any substring in logarithmic time. In practice, the time will be determined by the search time in different parts of your disk, so this part will be much faster if you save files to a solid state drive.

Probabilistic Counting in Python

More articles: