Get data from a file without iterating over it multiple times

I created the following function to extract data from a file. It works fine, but for large files it is very slow.

    def is_float(f):
        # Defined before get_data so it can be used as the default sieve.
        try:
            float(str(f))
        except ValueError:
            return False
        else:
            return True

    def get_data(file, indexes, data_start, sieve_first=is_float):
        file_list = list(file)  # reads the whole file into memory
        for i in indexes:
            d_line = i + data_start
            for line in file_list[d_line:]:
                if sieve_first(line.strip().split(',')[0]):
                    yield file_list[d_line].strip()
                    d_line += 1
                else:
                    break

    with open('my_data') as f:
        data = get_data(f, index_list, 3)

The file may look like this (line numbers added for clarity):

    line 1234567: # <-- INDEX
    line 1234568: # +1
    line 1234569: # +2
    line 1234570: 8, 17.0, 23, 6487.6
    line 1234571: 8, 17.0, 23, 6487.6
    line 1234572: 8, 17.0, 23, 6487.6
    line 1234573:
    line 1234574:
    line 1234575:

In the above example, lines 1234570 through 1234572 will be returned.

Since my files are large, there are a couple of things in my function that I don't like.

  • Firstly, it reads the entire file into memory; I do this so that I can index into the rows to parse the data.
  • Secondly, the same lines in the file are visited many times, which is very expensive for a large file.

I tried using iterators to step through the file one line at a time, but couldn't crack it. Any suggestions?

2 answers

If you only want a small part of the file, I would use itertools.islice. This function will not keep any data in memory other than the lines you actually want.

Here is an example:

    from itertools import islice

    def yield_specific_lines_from_file(filename, start, stop):
        with open(filename, 'rb') as ifile:
            for line in islice(ifile, start, stop):
                yield line

    lines = list(yield_specific_lines_from_file('test.txt', 10, 20))

If you are using Python 3.3 or later, you can also simplify this by using yield from:

    from itertools import islice

    def yield_specific_lines_from_file(filename, start, stop):
        with open(filename, 'rb') as ifile:
            yield from islice(ifile, start, stop)

    lines = list(yield_specific_lines_from_file('test.txt', 10, 20))

This will not cache lines you have already read from the file. If you want that, I suggest storing every line you read in a dictionary with the line number as the key, and going back to the file only when a line you need is missing.
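
A rough sketch of that caching idea (the line_cache dictionary and the get_lines helper are illustrative names, not part of the answer above):

    from itertools import islice

    # Lines already read, keyed by line number.
    line_cache = {}

    def get_lines(filename, start, stop):
        missing = [n for n in range(start, stop) if n not in line_cache]
        if missing:
            with open(filename) as f:
                # Read one contiguous block covering every missing line.
                block = islice(f, missing[0], missing[-1] + 1)
                for n, line in enumerate(block, start=missing[0]):
                    line_cache[n] = line
        return [line_cache[n] for n in range(start, stop) if n in line_cache]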


A bit out of left field, but if you have control over your files, you could move the data into an sqlite3 database.
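
For instance, a one-time import along these lines (a sketch only; the database file, table, and column names are invented, and the rows are assumed to be the four-column numeric CSV from the question):

    import csv
    import sqlite3

    con = sqlite3.connect('my_data.db')
    con.execute('CREATE TABLE IF NOT EXISTS rows'
                ' (lineno INTEGER PRIMARY KEY, c0 REAL, c1 REAL, c2 REAL, c3 REAL)')

    # Scan the file once, keeping only the numeric data rows.
    with open('my_data') as f:
        for lineno, row in enumerate(csv.reader(f)):
            try:
                values = [float(x) for x in row]
            except ValueError:
                continue  # header, comment and blank lines are skipped
            if len(values) == 4:
                con.execute('INSERT OR REPLACE INTO rows VALUES (?, ?, ?, ?, ?)',
                            [lineno] + values)
    con.commit()

    # Later queries fetch any block of rows without rescanning the file.
    rows = con.execute('SELECT c0, c1, c2, c3 FROM rows WHERE lineno BETWEEN ? AND ?',
                       (1234570, 1234572)).fetchall()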

Also look at mmap and linecache. I assume these last two are essentially wrappers around random-access files, i.e. you could preprocess your files by scanning them once, building a line number → byte offset lookup table, and then using seek, as sketched below.
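
A minimal sketch of that scan-once, index → offset idea (the helper names are invented; the file is opened in binary mode so the byte offsets are exact):

    def build_line_offsets(filename):
        # One pass over the file, recording the byte offset of each line.
        offsets = []
        with open(filename, 'rb') as f:
            pos = 0
            for line in f:
                offsets.append(pos)
                pos += len(line)
        return offsets

    def read_line(filename, offsets, lineno):
        # Jump straight to any line with seek instead of rescanning.
        with open(filename, 'rb') as f:
            f.seek(offsets[lineno])
            return f.readline().decode().rstrip('\r\n')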

Some of these approaches assume that you have some control over the files you read.

It also depends on whether you read often and write rarely, in which case building an index up front is not such a bad idea.

