I created the following function to extract data from a file. It works fine, but for large files it is very slow.
def is_float(f):
    # True if the (stringified) value parses as a float, False otherwise.
    try:
        float(str(f))
    except ValueError:
        return False
    else:
        return True

def get_data(file, indexes, data_start, sieve_first=is_float):
    file_list = list(file)              # read the whole file into memory
    for i in indexes:
        d_line = i + data_start         # first data line for this index
        for line in file_list[d_line:]:
            # keep yielding rows while the first CSV field looks numeric
            if sieve_first(line.strip().split(',')[0]):
                yield file_list[d_line].strip()
                d_line += 1
            else:
                break

with open('my_data') as f:
    data = get_data(f, index_list, 3)
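Note that get_data is a generator, so it only starts reading the file once it is iterated; I consume it while the file is still open, roughly like this (index_list is built elsewhere in my script):

with open('my_data') as f:
    data = list(get_data(f, index_list, 3))   # exhaust the generator before the file closes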
The file may look like this (line numbers added for clarity):
line 1234567: # <-- INDEX
line 1234568: # +1
line 1234569: # +2
line 1234570: 8, 17.0, 23, 6487.6
line 1234571: 8, 17.0, 23, 6487.6
line 1234572: 8, 17.0, 23, 6487.6
line 1234573:
line 1234574:
line 1234575:
In the above example, lines 1234570 through 1234572 will be returned.
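The sieve is what decides where a block of data ends: the first comma-separated field of each line is passed to is_float, so comment/header lines and blank lines fail the test while numeric rows pass. With the illustrative values above:

print(is_float('8'))            # True  -> '8, 17.0, 23, 6487.6' is yielded
print(is_float('# <-- INDEX'))  # False -> a header/comment line fails the sieve
print(is_float(''))             # False -> a blank trailing line ends the block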
Since my files are large, there are a couple of things in my function that I don't like.
- Firstly, it reads the entire file into memory; I do this so that I can use row indexing to parse the data.
- Secondly, the same lines of the file are looped over many times, which is very expensive for a large file.
I tried using iterators to go through the file one line at a time, but I couldn't crack it. Any suggestions?
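For reference, this is roughly the shape of the single-pass version I have been aiming for, though I have not managed to get it right. It is only a sketch and assumes that index_list can be sorted up front and that the data blocks never overlap:

def get_data_streaming(file, indexes, data_start, sieve_first=is_float):
    # Walk the file once, line by line, without materialising it in memory.
    targets = iter(sorted(i + data_start for i in indexes))
    next_target = next(targets, None)          # line number where the next block starts
    for line_no, line in enumerate(file):
        if next_target is None:
            break                              # no more blocks to extract
        if line_no < next_target:
            continue                           # still skipping ahead to the next block
        if sieve_first(line.strip().split(',')[0]):
            yield line.strip()                 # inside a block: stream the row out
        else:
            next_target = next(targets, None)  # block finished: move on to the next index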