I am writing a program that counts how many times each line occurs in a huge file. For this I use a Python dictionary with the lines (strings) as keys and their occurrence counts (integers) as values.
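Stripped down, the counting part looks roughly like this (a simplified sketch, not my exact code; the real loop does more work per line):

```python
def count_lines(path):
    """Count how many times each line occurs in the file."""
    counts = {}  # line -> number of occurrences
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            key = line.rstrip("\n")
            counts[key] = counts.get(key, 0) + 1
    return counts
```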
The program works great on small files up to about 10,000 lines. But when I test it on real files of ~2-3 million lines, it slows down to about 50% of its original speed once it gets roughly halfway through the file.
I suspect this is because the built-in dictionary is not designed to handle such large amounts of data and starts getting many more hash collisions. I would like to know if there is an effective way to solve this problem. I have looked at alternative hashmap implementations and even tried splitting the data across a list of hashmaps (which slowed things down further).
Details:
- Lines are not known in advance.
- Line lengths range from about 10 to 200 characters.
- There are many lines that occur only once (and will be discarded at the end)
- I have already implemented concurrency to speed it up (a simplified sketch of how it is structured follows this list).
- It takes about 1 hour to complete one file.
- I also do other calculations; although they take a lot of time, they do not cause any slowdown at smaller file sizes. Therefore, I suspect the problem is with the hashmap or with memory.
- I have plenty of memory; the program uses only 8 GB of the 32 GB available.
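For reference, a stripped-down sketch of how the concurrent counting and the final discard of single-occurrence lines are structured (chunk size, worker count, and file names are illustrative, not my exact code):

```python
from collections import Counter
from multiprocessing import Pool


def count_chunk(lines):
    """Count occurrences within one chunk of lines (runs in a worker process)."""
    return Counter(lines)


def count_file(path, workers=4, chunk_size=100_000):
    """Split the file into chunks of lines, count each in parallel,
    merge the partial counts, then drop lines seen only once."""
    def chunks():
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            buf = []
            for line in f:
                buf.append(line.rstrip("\n"))
                if len(buf) >= chunk_size:
                    yield buf
                    buf = []
            if buf:
                yield buf

    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_chunk, chunks()):
            total.update(partial)  # Counter.update adds the counts together

    # Lines that occur only once are not needed, so discard them at the end.
    return {line: n for line, n in total.items() if n > 1}


if __name__ == "__main__":
    result = count_file("big_input.txt")  # path is illustrative
```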