I am writing a program that counts how many times each line occurs in a huge file. For this I use a Python dictionary with the lines (strings) as keys and their occurrence counts (integers) as values.
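Stripped down, the counting part looks roughly like this (a simplified sketch, not my exact code; the real loop does more work per line):

```python
def count_lines(path):
    """Count how many times each line occurs in the file."""
    counts = {}  # line -> number of occurrences
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            key = line.rstrip("\n")
            counts[key] = counts.get(key, 0) + 1
    return counts
```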
The program works great on small files up to about 10,000 lines. But when I test it on real files of ~2-3 million lines, it slows down to about 50% of its original speed once it gets roughly halfway through the file.
I suspect this is because the built-in dictionary is not designed to handle such large amounts of data and starts getting many more hash collisions. I would like to know if there is an effective way to solve this problem. I have looked at alternative hashmap implementations and even tried splitting the data across a list of hashmaps (which slowed things down further).
Details:
- Lines are not known in advance.
- Line lengths range from about 10 to 200 characters.
- There are many lines that occur only once (and will be discarded at the end)
- I have already implemented concurrency to speed it up (a simplified sketch of how it is structured follows this list).
- It takes about 1 hour to complete one file.
- I also do other calculations; although they take a lot of time, they do not cause any slowdown at smaller file sizes. Therefore, I suspect the problem is with the hashmap or with memory.
- I have plenty of memory; the program uses only 8 GB of the 32 GB available.
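For reference, a stripped-down sketch of how the concurrent counting and the final discard of single-occurrence lines are structured (chunk size, worker count, and file names are illustrative, not my exact code):

```python
from collections import Counter
from multiprocessing import Pool


def count_chunk(lines):
    """Count occurrences within one chunk of lines (runs in a worker process)."""
    return Counter(lines)


def count_file(path, workers=4, chunk_size=100_000):
    """Split the file into chunks of lines, count each in parallel,
    merge the partial counts, then drop lines seen only once."""
    def chunks():
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            buf = []
            for line in f:
                buf.append(line.rstrip("\n"))
                if len(buf) >= chunk_size:
                    yield buf
                    buf = []
            if buf:
                yield buf

    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_chunk, chunks()):
            total.update(partial)  # Counter.update adds the counts together

    # Lines that occur only once are not needed, so discard them at the end.
    return {line: n for line, n in total.items() if n > 1}


if __name__ == "__main__":
    result = count_file("big_input.txt")  # path is illustrative
```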