Python: fast string parsing and manipulation

I use Python to parse comma-separated strings and then do some calculations on the data. Each line is about 800 characters long with 120 comma-separated fields, and there are 1.2 million lines to process.

for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l

get_fields uses operator.itemgetter() to retrieve 20 of the 120 fields.
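The question does not show get_fields itself; a minimal sketch of what it might look like, assuming it wraps operator.itemgetter over some fixed field indices (the indices below are purely illustrative):

```python
from operator import itemgetter

# Hypothetical indices of the 20 interesting fields (out of 120);
# the real indices are not shown in the question.
FIELD_INDICES = (0, 3, 7, 12, 15, 20, 25, 30, 35, 40,
                 45, 50, 55, 60, 65, 70, 75, 80, 85, 90)
_getter = itemgetter(*FIELD_INDICES)

def get_fields(fields):
    """Return the 20 selected fields from a list of 120 split values."""
    return list(_getter(fields))
```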

This entire operation takes about 4-5 minutes, excluding the time to read the data in. Later in the program I insert these rows into an in-memory SQLite table for future reference. Either way, 4-5 minutes just for parsing and building the list is too slow for my project.
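For the SQLite step, the fastest approach for bulk loading is usually a single executemany call. A sketch under assumed names (the table name, column count, and schema below are illustrative, not from the question):

```python
import sqlite3

# Illustrative 20-column in-memory table; the real schema is not shown.
conn = sqlite3.connect(":memory:")
ncols = 20
cols = ", ".join("f%d" % i for i in range(ncols))
conn.execute("CREATE TABLE records (%s)" % cols)

# In the real program, `rows` would be the parsed 20-field tuples.
rows = [tuple(str(i) for i in range(ncols))]
placeholders = ", ".join("?" * ncols)
conn.executemany("INSERT INTO records VALUES (%s)" % placeholders, rows)
conn.commit()
```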

I spread this processing across about 6-8 threads.

Would switching to C/C++ help?

2 answers

Your program may be slowing down while trying to allocate enough memory for 1.2M lines. In other words, the speed problem may not come from the parsing/manipulation but from l.extend. To test this hypothesis, you could put a print statement in the loop:

for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))  

If the print statements get slower and slower, you can probably conclude that l.extend is the culprit. In that case, you may see a significant speed improvement if you move the processing of each line into the loop.
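A sketch of what moving the processing into the loop could look like; process_row is a hypothetical stand-in for whatever "# process l" does, and the tiny get_fields/item stubs exist only to make the sketch self-contained:

```python
from operator import itemgetter

# Minimal stand-ins so the sketch runs; the real item and get_fields
# come from the question's program.
get_fields = lambda fields: list(itemgetter(0, 2, 4)(fields))
item = {'k1': '1,2,3,4,5', 'k2': '6,7,8,9,10'}

def process_row(fields):
    # Hypothetical per-row work, standing in for "# process l".
    return sum(float(x) for x in fields)

results = []
for v in item.values():
    # Process each record right after parsing it, instead of first
    # accumulating all 1.2M parsed rows in one big list.
    results.append(process_row(get_fields(v.split(','))))
```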

PS: You should probably use the csv module to take care of the parsing for you in a higher-level way, but I don't think it will affect speed very much.
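A sketch of the csv.reader approach: it yields each record as a list of strings, replacing the manual v.split(',') step and also handling quoted fields correctly (the two-line sample input is illustrative):

```python
import csv
import io

# Illustrative data; the real input has 120 fields per line.
data = io.StringIO("a,b,c,d\ne,f,g,h\n")

# csv.reader yields each record as a list of strings.
rows = [fields for fields in csv.reader(data)]
```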


Why not something like this:

datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# next(datafile)

l = sum((get_fields(v.split(',')) for v in datafile), [])

(This assumes get_fields returns lists, so that sum can concatenate them.)
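One caveat worth noting: sum with a list start value copies the growing accumulator on every step, which is quadratic for 1.2M rows. itertools.chain.from_iterable flattens in linear time; a small sketch with illustrative data:

```python
from itertools import chain

nested = [['a', 'b'], ['c'], ['d', 'e']]

# Linear-time flattening; avoids sum(nested, []), which re-copies the
# accumulator list on every addition.
flat = list(chain.from_iterable(nested))
```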


Source: https://habr.com/ru/post/1752891/
