More efficient HashMap (dictionary) in Python for big data

I am writing a program that counts how often each line appears in a huge file. For this, I use a Python dictionary with the lines as keys and their counts as values.
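For context, here is a minimal sketch of that kind of counting, assuming a plain `collections.Counter` and a hypothetical file path (neither is from the original question):

```python
from collections import Counter

def count_lines(path):
    """Count how many times each line occurs in the file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts[line.rstrip("\n")] += 1
    return counts

# counts = count_lines("huge_file.txt")  # hypothetical path
```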

The program works well for small files up to about 10,000 lines. But when I test it on real files of roughly 2-3 million lines, it starts to slow down around the halfway mark, dropping to about 50% of its original speed.

I suspect this is because the built-in dictionary is not designed to handle such large amounts of data and suffers many more collisions. I would like to know whether there is an effective way to solve this problem. I have looked for alternative hashmap implementations and even tried a list of hashmaps (which slowed it down further).
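One plausible reading of "a list of hashmaps" is sharding the keys across several smaller dicts, as in the sketch below; the shard count and names are assumptions, not details from the question. The extra hash-and-index step on every update adds work per line, which would be consistent with the further slowdown.

```python
NUM_SHARDS = 64  # assumed value, not taken from the question

# Route each line to one of several smaller dicts based on its hash.
shards = [dict() for _ in range(NUM_SHARDS)]

def add_line(line):
    shard = shards[hash(line) % NUM_SHARDS]
    shard[line] = shard.get(line, 0) + 1
```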

Details:

  • The lines are not known in advance.
  • Line lengths range from roughly 10 to 200 characters.
  • Many lines occur only once (and will be discarded at the end).
  • I have already implemented concurrency to speed it up (see the sketch after this list).
  • It takes about 1 hour to process one file.
    • I also do other calculations; although they take a lot of time, they do not cause a slowdown on smaller files. That is why I suspect the dictionary (hashing) or memory is the problem.
  • I have plenty of memory: at startup the program uses only 8 GB of the 32 GB available.
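The concurrency mentioned in the list could take a form like the following: each worker counts one chunk of lines and the partial counts are merged at the end, with singleton lines discarded. The chunking scheme, worker count, and function names are assumptions rather than the original implementation.

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(lines):
    """Count line occurrences within one chunk."""
    return Counter(line.rstrip("\n") for line in lines)

def count_file_parallel(path, workers=4, chunk_size=100_000):
    """Read the file in chunks, count each chunk in parallel, merge the results."""
    chunks, chunk = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                chunks.append(chunk)
                chunk = []
    if chunk:
        chunks.append(chunk)

    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_chunk, chunks):
            total.update(partial)

    # Discard lines that occur only once, as described above.
    return {line: n for line, n in total.items() if n > 1}

if __name__ == "__main__":
    result = count_file_parallel("huge_file.txt")  # hypothetical path
```

Note that passing whole chunks between processes costs pickling time, so this layout only pays off when the per-line work is significant.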

The problem is most likely not the dictionary itself.

Python's built-in dict is a highly optimized hash table, and extra collisions are not what is slowing you down.

What changes as the dictionary grows is where its data lives: once it no longer fits in the CPU's L3 cache (e.g. 6 MB), lookups have to go out to DRAM, which is far slower (see the ExtremeTech latency chart below).

That is enough to explain why the program slows down as more of the file is processed.


Latency chart: https://www.extremetech.com/wp-content/uploads/2015/02/latency.png
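To see this effect in practice, one could time dictionary lookups at increasing sizes; the sizes, probe count, and function name below are illustrative assumptions, not part of the original answer.

```python
import random
import time

def lookup_benchmark(n, probes=1_000_000):
    """Time random lookups in a dict with n string keys (illustrative only)."""
    d = {str(i): i for i in range(n)}
    keys = [str(random.randrange(n)) for _ in range(probes)]
    start = time.perf_counter()
    total = 0
    for k in keys:
        total += d[k]
    return (time.perf_counter() - start) / probes

# Per-lookup time tends to rise once the dict no longer fits in cache,
# even though the algorithmic cost is still O(1) on average.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10} keys: {lookup_benchmark(n) * 1e9:.0f} ns/lookup")
```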


Source: https://habr.com/ru/post/1683492/

