Fastest way to import 500 GB of text files, keeping only parts of them

I have about 500 GB of text files spread over several months. In these text files, the first 43 lines are connection information only (not needed). The next 75 lines are the descriptors for an observation. These are followed by 4 lines (not needed), then the next observation, which is again 75 lines.

All I want are these 75 lines (the descriptors are in the same position for every observation), which look like this:

ID: 5523 Date: 20052012 Mixed: <Null> . . 

I want to convert each observation to CSV, e.g. 5523;20052012;;.., so that I end up with much smaller text files. Since the descriptors are always the same, I know that the first position, for example, is always the ID.
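To make the transformation concrete, here is a rough sketch of what I mean for a single observation (assuming every descriptor line has the form "Name: value" and that <Null> should become an empty field; observation_to_csv is just a name I made up):

    def observation_to_csv(lines):
        # 'lines' is the list of 75 descriptor lines of one observation,
        # each assumed to look like "ID: 5523" or "Mixed: <Null>"
        values = []
        for line in lines:
            value = line.split(":", 1)[1].strip()
            if value == "<Null>":
                value = ""          # <Null> becomes an empty CSV field
            values.append(value)
        return ";".join(values)

    # observation_to_csv(["ID: 5523", "Date: 20052012", "Mixed: <Null>"])
    # gives "5523;20052012;"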

When I finish one text file, I open the next one and append to the same output (or would creating a new output file for each be faster?).

What I have done so far is pretty inefficient: I open a file, load it, and extract the observations sequentially. It already takes a fair bit of time on a test sample, so this is clearly not the best method.

Any suggestions would be great.

+6
3 answers

You said that you have "about 500 GB of text files." If I understand correctly, you do not have a fixed length for each observation (note, I'm not talking about the number of lines, I mean the total length in bytes of all the lines of an observation). This means that you have to go through the whole file, because you cannot know in advance at which byte offset each new observation begins.

Now, depending on how large each individual text file is, you may need a different answer. But if each file is small enough (less than 1 GB?), you can use the linecache module, which handles the line-by-line lookups for you.

You could use it, perhaps like this:

    import linecache

    filename = 'observations1.txt'

    # Start at the 44th line
    curline = 44
    lines = []

    # Keep looping until no line is returned.
    # getline() never throws errors, but returns an empty string ''
    # if the line wasn't found (if the line was actually empty, it would
    # have returned the newline character '\n')
    while linecache.getline(filename, curline):
        for i in xrange(75):
            lines.append(linecache.getline(filename, curline).rstrip())
            curline += 1
        # Perform work with the set of observation lines
        add_to_observation_log(lines)
        # Skip the unnecessary section and reset the lines list
        curline += 4
        lines = []

I did a quick test, and it chewed through a 23 MB file in five seconds.

+6

"I open a file, load it, and extract the observations sequentially."

What do you mean by "load"? If you mean reading the whole thing into one string, then yes, that will suck. The natural way to process a file is to use the fact that the file object is an iterator over the lines of the file:

    for line in file:
        if should_use(line):
            do_something_with(line)
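For your layout (43 header lines, then blocks of 75 observation lines separated by 4 unneeded lines), that iterator idea could look roughly like the sketch below. The observations() helper and the hard-coded block sizes are assumptions based on your description, not tested code:

    import itertools

    def observations(path):
        """Yield one observation (a list of 75 stripped lines) at a time."""
        with open(path) as f:
            # skip the 43 connection-information lines
            for _ in itertools.islice(f, 43):
                pass
            while True:
                block = [line.rstrip() for line in itertools.islice(f, 75)]
                if len(block) < 75:
                    break  # end of file (or a truncated last block)
                yield block
                # drop the 4 unneeded lines before the next observation
                for _ in itertools.islice(f, 4):
                    pass

    for obs in observations('observations1.txt'):
        add_to_observation_log(obs)   # placeholder for your own processing

Because the file is read strictly front to back, one line at a time, memory use stays constant no matter how large the file is.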
+2

You should consider writing the information you want to keep into a database. In Python you can use the built-in sqlite3 module; the docs are easy to follow.
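A minimal sketch of what that could look like, assuming a made-up table with only three of your descriptors as columns:

    import sqlite3

    conn = sqlite3.connect('observations.db')
    # placeholder schema: one row per observation, only a few of the
    # 75 descriptors shown
    conn.execute('CREATE TABLE IF NOT EXISTS observations '
                 '(obs_id INTEGER, obs_date TEXT, mixed TEXT)')

    def store(values):
        # 'values' is assumed to be the list of extracted field values
        conn.execute('INSERT INTO observations VALUES (?, ?, ?)',
                     (values[0], values[1], values[2]))

    # ... call store() for each observation, then ...
    conn.commit()
    conn.close()

That also makes it easy to query subsets of observations later instead of re-parsing the text files.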

You say that you know exactly which lines in each file you want to keep, so you could try something like this.

    import csv

    # tab is used as the delimiter so that, as long as the data contains
    # no tabs, each physical line becomes a single-field row
    reader = csv.reader(open("afile.csv", "rb"), delimiter="\t", quotechar='"')

    info_to_keep = []   # one entry per observation
    obs = []            # lines of the observation currently being read

    for i, row in enumerate(reader):
        if i < 43:
            # the first 43 lines are connection information - skip them
            continue
        elif i - 43 < 79 * (len(info_to_keep) + 1) - 4:
            # one of the 75 descriptor lines of the current observation
            obs.append(row)
        elif i - 43 < 79 * (len(info_to_keep) + 1):
            # the 4 unneeded lines after an observation - skip them
            continue
        else:
            # first line of the next observation: store the finished one
            info_to_keep.append(obs)
            obs = [row]

    # keep the final observation as well
    if obs:
        info_to_keep.append(obs)

This way you end up with a list called info_to_keep in which each entry is itself a list of 75 rows (one observation), and each row is the list of fields that csv.reader produced for that line.
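If you then want the semicolon-separated output described in the question, csv.writer can produce it in the same pass; a rough sketch, assuming each row holds the whole original line as its single field in the form "Name: value":

    import csv

    writer = csv.writer(open("observations_out.csv", "wb"), delimiter=";")
    for obs in info_to_keep:
        # row[0] is the whole original line, e.g. "ID: 5523";
        # keep only the value after the first colon ("<Null>" becomes empty)
        values = [row[0].split(":", 1)[1].strip() for row in obs]
        writer.writerow(["" if v == "<Null>" else v for v in values])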

0
