Python string management - performance issues

Question

I have the following code snippet, which I execute about 2 million times in my application to analyze this number of records. This part seems to be a bottleneck, and I was wondering if anyone could help me out by suggesting some neat tricks that could make these simple string manipulations faster.

    try:
        data = []
        start = 0
        end = 0
        for info in self.Columns():
            end = start + (info.columnLength)
            slice = line[start:end]
            if slice == '' or len(slice) != info.columnLength:
                raise 'Wrong Input'
            if info.hasSignage:
                if(slice[0:1].strip() != '+' and slice[0:1].strip() != '-'):
                    raise 'Wrong Input'
            if not info.skipColumn:
                data.append(slice)
            start = end
        parsedLine = data
    except:
        parsedLine = False

+6

performance python string

bbekdemir Sep 2 '11 at 18:18


6 answers

    def fubarise(data):
        try:
            if nasty(data):
                raise ValueError("Look, Ma, I'm doing a big fat GOTO ...")  # sheesh #1
            more_of_the_same()
            parsed_line = data
        except ValueError:
            parsed_line = False
            # so it can be a "data" or False -- sheesh #2
        return parsed_line

There is no point in giving different error messages in the raise statements; they are never seen. Sheesh #3.

Update: Here is a proposed improvement that uses struct.unpack to split the input lines quickly. It also illustrates better exception handling, on the assumption that the writer of the code is also the one running it, so stopping at the first error is acceptable. A robust implementation that logs all errors in all columns of all lines for a user audience is another matter. Note that error checking for each column would typically be much more extensive; for example, checking for a leading sign but not checking whether the column contains a valid number seems a little odd.

    import struct

    def unpacked_records(self):
        cols = self.Columns()
        unpack_fmt = ""
        sign_checks = []
        start = 0
        for colx, info in enumerate(cols, 1):
            clen = info.columnLength
            if clen < 1:
                raise ValueError("Column %d: Bad columnLength %r" % (colx, clen))
            if info.skipColumn:
                unpack_fmt += str(clen) + "x"
            else:
                unpack_fmt += str(clen) + "s"
            if info.hasSignage:
                sign_checks.append(start)
            start += clen
        expected_len = start
        unpack = struct.Struct(unpack_fmt).unpack

        for linex, line in enumerate(self.whatever_the_list_of_lines_is, 1):
            if len(line) != expected_len:
                raise ValueError(
                    "Line %d: Actual length %d, expected %d"
                    % (linex, len(line), expected_len))
            if not all(line[i] in '+-' for i in sign_checks):
                raise ValueError("Line %d: At least one column fails sign check" % linex)
            yield unpack(line)  # a tuple
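As a small, self-contained illustration of the kind of format string built above (a sketch only, not part of the original answer): the `Ns` code extracts N bytes and the `Nx` code skips N bytes. The column widths and the sample record are made up, and the input is a `bytes` object, which is what `struct` expects on Python 3.

```python
import struct

# Hypothetical layout: three columns of width 5, the middle one skipped.
record_struct = struct.Struct("5s5x5s")

# Made-up 15-byte record for illustration.
line = b"+1234ABCDE-9876"

# Mirrors the length check against expected_len in the answer above.
assert len(line) == record_struct.size

# unpack returns a tuple with one bytes object per "s" column;
# the "x" column contributes nothing to the result.
fields = record_struct.unpack(line)
```

Because the `Struct` object is compiled once and reused, the per-line work is reduced to one length check and one `unpack` call.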

+3

John machin Sep 03 '11 at 11:31


Something like this (using a few classes to make the example executable):

    class Info(object):
        columnLength = 5
        hasSignage = True
        skipColumn = False

    class Something(object):
        def Columns(self):
            return [Info()]*4

        def bottleneck(self):
            try:
                data = []
                start = 0
                end = 0
                line = '+this-is just a line for testing'
                for info in self.Columns():
                    start = end
                    collength = info.columnLength
                    end = start + collength
                    if info.skipColumn:  # start with this
                        continue
                    elif collength == 0:
                        raise ValueError('Wrong Input')
                    slice = line[start:end]  # only now slicing, because it
                                             # is probably most expensive part
                    if len(slice) != collength:
                        raise ValueError('Wrong Input')
                    elif info.hasSignage and slice[0] not in '+-':  # bit more compact
                        raise ValueError('Wrong Input')
                    else:
                        data.append(slice)
                parsedLine = data
            except:
                parsedLine = False

    Something().bottleneck()

edit: when the length of the slice is 0, slice[0] does not exist, so collength == 0 must be checked first.

edit2: You run this bit of code for many lines, but the column information does not change, right? That allows you to:

pre-compute a list of the starting points of each column (no more need to calculate start and end per line)
since start and end are known in advance, have .Columns() return only the columns that are not skipped and have a columnLength > 0 (or do you really need to raise an error for length == 0 on every row?)
check the length of each row once, before the loop over the column information, because the expected length is known and the same for every row
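The three points above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Info` is a hypothetical stand-in for whatever `self.Columns()` returns, and `build_layout` and `parse_line` are names invented for this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by self.Columns().
Info = namedtuple("Info", "columnLength hasSignage skipColumn")

def build_layout(columns):
    """Compute, once, everything needed to parse a fixed-width line.

    Returns (slices, sign_offsets, expected_len).
    """
    slices = []
    sign_offsets = []
    start = 0
    for info in columns:
        if info.columnLength < 1:
            raise ValueError("bad columnLength %r" % (info.columnLength,))
        end = start + info.columnLength
        if info.hasSignage:
            sign_offsets.append(start)
        if not info.skipColumn:
            slices.append(slice(start, end))
        start = end
    return slices, sign_offsets, start

def parse_line(line, slices, sign_offsets, expected_len):
    """Return the kept column values, or None if the line is malformed."""
    if len(line) != expected_len:
        return None
    if any(line[i] not in "+-" for i in sign_offsets):
        return None
    return [line[s] for s in slices]

# Made-up layout: widths 5, 3 and 4, with the middle column skipped.
columns = [Info(5, True, False), Info(3, False, True), Info(4, True, False)]
layout = build_layout(columns)
```

With the layout computed up front, each call such as `parse_line("+1234abc-567", *layout)` does one length comparison, a few character checks, and one slice per kept column.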

Edit3: I wonder how you know which data index belongs to which column if you use "skipColumn" ...

+2

Remi Sep 2 '11 at 18:56


Do not evaluate start and end every time through this loop.

Compute them exactly once, before using self.Columns() (whatever that is. If "Columns" is a class with static values, that's silly. If it is a function with a name that begins with a capital letter, that's confusing.)

if slice == '' or len(slice) != info.columnLength can only happen if the line is too short compared to the total size required by Columns. Check once, outside the loop.

slice[0:1].strip() != '+' sure looks like .startswith().

if not info.skipColumn: apply this filter before starting the loop. Remove the skipped columns from self.Columns().

+1

S. Lott Sep 2 '11 at 18:27


The first thing I would look at is slice = line[start:end]. Slicing creates new instances; you could try to avoid explicitly constructing line[start:end] and examine its contents manually instead.

Why are you doing slice[0:1]? This should yield a subsequence containing a single element of slice (right?), so it can probably be checked more efficiently.
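To make the second point concrete, here is a minimal sketch comparing the original slice-and-strip test with a direct index lookup. The function names are invented for illustration, and no claim is made about how much time this saves; it would need to be measured.

```python
def sign_ok_sliced(line, start):
    # Original style: build a one-character slice, strip it, compare twice.
    first = line[start:start + 1].strip()
    return first == "+" or first == "-"

def sign_ok_indexed(line, start):
    # Look at the character in place: no strip call and a single
    # membership test. The bounds check avoids IndexError on short lines.
    return start < len(line) and line[start] in "+-"
```

Both functions give the same answer for a sign, a digit, a blank, or a position past the end of the line, so the indexed form can replace the sliced one without changing which lines are rejected.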

+1

phimuemue Sep 2 '11 at 18:38


I want to say that you should use some kind of built-in Python feature to split the string, but I can't think of one. So I am left with just trying to reduce the amount of code you have.

When we are done, end should point to the end of the line; if it does, then all the .columnLength values must have been okay. (Unless one of them was negative or something!)

Since this has a reference to self, it must be an excerpt from a member function. So, instead of raising exceptions, you could simply return False to exit the function early and return an error flag. But I like the debugging potential of changing the except clause so that it no longer catches the exception, and getting a stack trace that shows where the problem came from.

@Remi used slice[0] in '+-', where I used slice.startswith(('+', '-')). I think I like @Remi's code better there, but I left mine unchanged to show you a different way. The .startswith() way will work for prefixes longer than one character, but since the prefix here is just one character, the terse solution works.

    try:
        line = line.strip('\n')
        data = []
        start = 0
        for info in self.Columns():
            end = start + info.columnLength
            slice = line[start:end]
            if info.hasSignage and not slice.startswith(('+', '-')):
                raise ValueError("wrong input")
            if not info.skipColumn:
                data.append(slice)
            start = end

        # after the loop, end should equal the length of the stripped line
        if end != len(line):
            raise ValueError("bad .columnLength")

        parsedLine = data

    except ValueError:
        parsedLine = False
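One practical difference between the spellings of the sign check shows up when the slice is empty, for example on a line that is too short. The snippet below is a small illustration of standard Python string behaviour, not code from any of the answers.

```python
empty = ""

# Substring test: the empty string is "in" every string, so an empty
# slice would wrongly pass a sign check written this way.
passes_in_check = empty[0:1] in "+-"

# startswith with a tuple of prefixes is False for an empty string.
passes_startswith = empty.startswith(("+", "-"))

# Indexing an empty string raises IndexError instead of returning a value.
try:
    empty[0]
    raised_index_error = False
except IndexError:
    raised_index_error = True
```

So `slice[0] in '+-'` is only safe once the length of the slice has been checked, while `.startswith(('+', '-'))` rejects an empty slice on its own.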

0

steveha Sep 2 '11 at 19:13


steveha · Accepted Answer · 2011-09-02T19:33:26+0000

EDIT: I am slightly modifying this answer. I will leave the original answer below.

In my other answer, I commented that it would be best to find a built-in Python module that does the unpacking. I could not think of one, but perhaps I should have searched Google for one. @John Machin provided an answer that showed how to do it: use the Python struct module. Since it is written in C, it should be faster than my pure Python solution. (I haven't actually measured anything, so this is a guess.)

I agree that the logic in the original code is "un-Pythonic". Returning a sentinel value is not the best approach; it is better to either return a valid value or raise an exception. Another way to do it is to return a list of valid values plus another list of invalid values. Since @John Machin offered code that yields the valid values, I thought I would write a version here that returns two lists.

NOTE: Perhaps the best possible answer would be to take @John Machin's answer and modify it to save the invalid values to a file for possible later review. His answer yields records one at a time, so there is no need to build a large list of parsed records; and saving the bad lines to disk means you don't have to build a possibly large list of bad lines either.

    import struct

    def parse_records(self):
        """
        returns a tuple: (good, bad)
        good is a list of valid records (as tuples)
        bad is a list of tuples: (line_num, line, err)
        """
        cols = self.Columns()
        unpack_fmt = ""
        sign_checks = []
        start = 0
        for colx, info in enumerate(cols, 1):
            clen = info.columnLength
            if clen < 1:
                raise ValueError("Column %d: Bad columnLength %r" % (colx, clen))
            if info.skipColumn:
                unpack_fmt += str(clen) + "x"
            else:
                unpack_fmt += str(clen) + "s"
            if info.hasSignage:
                sign_checks.append(start)
            start += clen
        expected_len = start
        unpack = struct.Struct(unpack_fmt).unpack

        good = []
        bad = []
        for line_num, line in enumerate(self.whatever_the_list_of_lines_is, 1):
            if len(line) != expected_len:
                bad.append((line_num, line, "bad length"))
                continue
            if not all(line[i] in '+-' for i in sign_checks):
                bad.append((line_num, line, "sign check failed"))
                continue
            good.append(unpack(line))

        return good, bad
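The variant described in the NOTE above (yield good records one at a time and write bad lines to a file) might look roughly like this on Python 3. It is a sketch under assumptions: the function name and parameters are invented, lines are `bytes` objects without trailing newlines, and `bad_file` is any text-mode file-like object.

```python
import io
import struct

def iter_good_records(lines, record_struct, sign_offsets, bad_file):
    """Yield unpacked records; write rejected lines to bad_file.

    All names here are illustrative, not taken from the answers.
    """
    expected_len = record_struct.size
    signs = b"+-"
    for line_num, line in enumerate(lines, 1):
        if len(line) != expected_len:
            err = "bad length"
        elif not all(line[i] in signs for i in sign_offsets):
            # Indexing bytes gives an int, and "int in bytes" tests
            # whether that byte value occurs in the bytes object.
            err = "sign check failed"
        else:
            yield record_struct.unpack(line)
            continue
        bad_file.write("%d\t%r\t%s\n" % (line_num, line, err))

# Made-up layout and data: widths 5, 3 (skipped) and 4.
demo_struct = struct.Struct("5s3x4s")
bad_log = io.StringIO()  # stands in for a real file opened for writing
sample = [b"+1234abc-567", b"too short", b"+1234abc9567"]
good = list(iter_good_records(sample, demo_struct, [0, 8], bad_log))
```

Neither the good records nor the bad lines accumulate in memory unless the caller chooses to collect them, as the demo does with `list(...)`.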

ORIGINAL ANSWER TEXT: This answer should be much faster if the self.Columns() information is identical for all records. We process the self.Columns() information once and build a couple of lists that contain only what we need to process a record.

This code shows how to compute parsedLine, but does not actually yield it, return it, or do anything with it. Obviously, you will need to change that.

    def parse_records(self):
        cols = self.Columns()

        slices = []
        sign_checks = []
        start = 0
        for info in cols:
            if info.columnLength < 1:
                raise ValueError("bad columnLength")
            end = start + info.columnLength
            if not info.skipColumn:
                tup = (start, end)
                slices.append(tup)
            if info.hasSignage:
                sign_checks.append(start)
            start = end  # advance to the next column

        expected_len = end  # use (end + 1) if each line still ends with a newline

        try:
            for line in self.whatever_the_list_of_lines_is:
                if len(line) != expected_len:
                    raise ValueError("wrong length")
                if not all(line[i] in '+-' for i in sign_checks):
                    raise ValueError("wrong input")
                parsedLine = [line[s:e] for s, e in slices]

        except ValueError:
            parsedLine = False
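Since neither approach was measured in this answer, one way to compare precomputed slicing against a precompiled `struct.Struct` is a small `timeit` harness like the one below. The layout and record are made up, the numbers will vary by machine and Python version, and no particular result is claimed.

```python
import struct
import timeit

# Made-up layout: four 5-byte columns, none skipped.
widths = [5, 5, 5, 5]
line = b"+1234-5678+9012-3456"

# Approach 1: precomputed (start, end) pairs and plain slicing.
bounds = []
start = 0
for width in widths:
    bounds.append((start, start + width))
    start += width

def by_slicing():
    return [line[s:e] for s, e in bounds]

# Approach 2: a precompiled struct.Struct.
unpack = struct.Struct("".join("%ds" % w for w in widths)).unpack

def by_struct():
    return unpack(line)

# Both approaches must agree before timing means anything.
assert by_slicing() == list(by_struct())

if __name__ == "__main__":
    for func in (by_slicing, by_struct):
        seconds = timeit.timeit(func, number=200000)
        print("%-12s %.3f s" % (func.__name__, seconds))
```

Running it against a layout and line length close to the real data would give a more meaningful comparison than this toy record.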
