Python: fast string parsing and manipulation

I use Python to parse comma-separated strings and then do some calculations on the data. Each line is about 800 characters long with 120 comma-separated fields, and there are 1.2 million lines to process.

for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l

get_fields uses operator.itemgetter() to retrieve 20 of the 120 fields.
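The question does not show get_fields itself; a minimal sketch of what it might look like, assuming it wraps operator.itemgetter over some fixed field indices (the indices below are purely illustrative):

```python
from operator import itemgetter

# Hypothetical indices of the 20 interesting fields (out of 120);
# the real indices are not shown in the question.
FIELD_INDICES = (0, 3, 7, 12, 15, 20, 25, 30, 35, 40,
                 45, 50, 55, 60, 65, 70, 75, 80, 85, 90)
_getter = itemgetter(*FIELD_INDICES)

def get_fields(fields):
    """Return the 20 selected fields from a list of 120 split values."""
    return list(_getter(fields))
```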

This entire operation takes about 4-5 minutes, excluding the time to read the data in. Later in the program I insert these rows into an in-memory SQLite table for future reference. Either way, 4-5 minutes just for parsing and building the list is too slow for my project.
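For the SQLite step, the fastest approach for bulk loading is usually a single executemany call. A sketch under assumed names (the table name, column count, and schema below are illustrative, not from the question):

```python
import sqlite3

# Illustrative 20-column in-memory table; the real schema is not shown.
conn = sqlite3.connect(":memory:")
ncols = 20
cols = ", ".join("f%d" % i for i in range(ncols))
conn.execute("CREATE TABLE records (%s)" % cols)

# In the real program, `rows` would be the parsed 20-field tuples.
rows = [tuple(str(i) for i in range(ncols))]
placeholders = ", ".join("?" * ncols)
conn.executemany("INSERT INTO records VALUES (%s)" % placeholders, rows)
conn.commit()
```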

I spread this processing across about 6-8 threads.

Would switching to C/C++ help?

2 answers

Your program may be slowing down while trying to allocate enough memory for 1.2M lines. In other words, the speed problem may not come from the parsing/manipulation but from l.extend. To test this hypothesis, you could put a print statement in the loop:

for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))  

If the print statements get slower and slower, you can probably conclude that l.extend is the culprit. In that case, you may see a significant speed improvement if you move the processing of each line into the loop.
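A sketch of what moving the processing into the loop could look like; process_row is a hypothetical stand-in for whatever "# process l" does, and the tiny get_fields/item stubs exist only to make the sketch self-contained:

```python
from operator import itemgetter

# Minimal stand-ins so the sketch runs; the real item and get_fields
# come from the question's program.
get_fields = lambda fields: list(itemgetter(0, 2, 4)(fields))
item = {'k1': '1,2,3,4,5', 'k2': '6,7,8,9,10'}

def process_row(fields):
    # Hypothetical per-row work, standing in for "# process l".
    return sum(float(x) for x in fields)

results = []
for v in item.values():
    # Process each record right after parsing it, instead of first
    # accumulating all 1.2M parsed rows in one big list.
    results.append(process_row(get_fields(v.split(','))))
```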

PS: You should probably use the csv module to take care of the parsing for you in a higher-level way, but I don't think it will affect speed very much.
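A sketch of the csv.reader approach: it yields each record as a list of strings, replacing the manual v.split(',') step and also handling quoted fields correctly (the two-line sample input is illustrative):

```python
import csv
import io

# Illustrative data; the real input has 120 fields per line.
data = io.StringIO("a,b,c,d\ne,f,g,h\n")

# csv.reader yields each record as a list of strings.
rows = [fields for fields in csv.reader(data)]
```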


Why not something like this:

datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# next(datafile)

l = sum((get_fields(v.split(',')) for v in datafile), [])

(This assumes get_fields returns lists, so that sum can concatenate them.)
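One caveat worth noting: sum with a list start value copies the growing accumulator on every step, which is quadratic for 1.2M rows. itertools.chain.from_iterable flattens in linear time; a small sketch with illustrative data:

```python
from itertools import chain

nested = [['a', 'b'], ['c'], ['d', 'e']]

# Linear-time flattening; avoids sum(nested, []), which re-copies the
# accumulator list on every addition.
flat = list(chain.from_iterable(nested))
```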


Source: https://habr.com/ru/post/1752891/
