Indexing huge text file

I have one huge text file (over 100 GB) with 6 columns of data (tab as a delimiter). In the first column, I have an integer value (2500 distinct values in the set). I need to split this file into several smaller files depending on the value in the first column (note that the rows are NOT sorted). Each of these smaller files will be used for plotting in Matlab.

I have only 8 GB of RAM.

The problem is how to do this efficiently? Any ideas?

+6
6 answers

Using bash:

cat 100gigfile | while IFS= read -r line; do
    intval="$(echo "$line" | cut -f 1)"
    chunkfile="$(printf '%010u.txt' "$intval")"
    echo "$line" >> "$chunkfile"
done

This will split your 100 gigabyte file into (as you say) 2,500 separate files named by the value of the first field. You may need to adapt the format argument to printf to your liking.

+5

A one-liner with bash + awk:

 awk '{ print $0 >> ($1 ".dat") }' 100gigfile 

This will append each line of your large file to a file named after the value of the first column plus ".dat"; e.g. the line 12 aa bb cc dd ee ff will go to the file 12.dat.

+2

For 64-bit Linux (I'm not sure whether it works on Windows), you can mmap the file and copy blocks to new files. I think that would be the most efficient way to do it.
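For illustration, a minimal sketch of that idea using Python's standard mmap module (the input name 100gigfile and the per-value .dat outputs are borrowed from the other answers; with 2500 output files you may still need to raise the open-file limit):

import mmap

out = {}  # first-column value -> open output file handle
with open('100gigfile', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    start = 0
    while start < len(mm):
        end = mm.find(b'\n', start)
        end = len(mm) if end == -1 else end + 1   # include the trailing newline
        line = mm[start:end]
        key = line.split(b'\t', 1)[0]
        fh = out.get(key)
        if fh is None:
            fh = out[key] = open(key.decode() + '.dat', 'ab')
        fh.write(line)
        start = end
    mm.close()
for fh in out.values():
    fh.close()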

+1

In your shell ...

 $ split -d -l <some number of lines> Foo Foo 

This will split the large file Foo into pieces Foo00 through FooNN, where the number of pieces is the number of lines in the original divided by the value you give to -l. You can then iterate over the pieces in a loop ...

EDIT ... good point in the comments ... this script (below) will read line by line, classify each line by its first field, and append it to the corresponding file ...

 #!/usr/bin/env python
 import csv

 prefix = 'filename'
 reader = csv.reader(open('%s.csv' % prefix, 'r'))
 suffix = 0
 files = {}
 # read one row at a time, classify on first field, and send to a file
 # row[0] assumes csv reader does *not* split the line... if you make it do so,
 # remove the [0] indexing (and strip()s) below
 for row in reader:
     tmp = row[0].split('\t')
     fh = files.get(tmp[0].strip(), False)
     if not fh:
         fh = open('%s%05i.csv' % (prefix, suffix), 'a')
         files[tmp[0].strip()] = fh
         suffix += 1
     fh.write(row[0] + '\n')
 for key in files.keys():
     files[key].close()
0

The most efficient way is to work block by block, keep all output files open simultaneously, and reuse the read buffer for writing. As the information is given, there is no other pattern in the data that could be exploited to speed things up.

You will want each file open on its own file descriptor, to avoid opening and closing one per line. Open them all at the beginning, or lazily as you go, and close them all at the end. On most Linux distributions only 1024 open files are allowed by default, so you will have to raise the limit, say with ulimit -n 2600, if you have permission to do so (see also /etc/security/limits.conf ).

Allocate a buffer, say a few KB, and do raw reads from the source file. Iterate over it, keeping track of where you are. Whenever you reach an end of line or the end of the buffer, write out from the buffer to the correct file descriptor. There are a few edge cases you will have to handle, for example when a read picks up a newline but not enough of the next line to figure out which file it belongs to.

You could run the iteration in reverse to avoid processing the first few bytes of the buffer, if you can determine a minimum line size. It turns out to be a little trickier, but it is a speedup nonetheless.
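As an illustration, here is a rough Python sketch of this block-by-block scheme (the input name 100gigfile and the .dat outputs are assumptions carried over from the other answers; instead of iterating in reverse, it simply carries any partial line over to the next read):

BUF_SIZE = 64 * 1024

out = {}          # first-column value -> open output file
leftover = b''    # partial line carried over from the previous block

with open('100gigfile', 'rb') as src:
    while True:
        block = src.read(BUF_SIZE)
        if not block:
            break
        block = leftover + block
        last_nl = block.rfind(b'\n')
        if last_nl == -1:          # no complete line yet, keep reading
            leftover = block
            continue
        leftover = block[last_nl + 1:]
        for line in block[:last_nl].split(b'\n'):
            key = line.split(b'\t', 1)[0]
            fh = out.get(key)
            if fh is None:
                fh = out[key] = open(key.decode() + '.dat', 'ab')
            fh.write(line + b'\n')

if leftover:                       # final line without a trailing newline
    key = leftover.split(b'\t', 1)[0]
    out.setdefault(key, open(key.decode() + '.dat', 'ab')).write(leftover + b'\n')

for fh in out.values():
    fh.close()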

Interestingly, non-blocking I/O takes care of issues like this.

0

The obvious solution is to open a new file every time you encounter a new value and keep it open until the end. But your OS may not allow you to open 2500 files at the same time. Therefore, if you need to do this only once, you can do it as follows:

  1. Go through the file, building a list of all the values that occur. Sort this list. (You can skip this step if you know in advance what the values will be.)
  2. Set StartIndex to 0.
  3. Open, say, 100 files (however many your OS is comfortable with). These correspond to the next 100 values in the list, from list[StartIndex] to list[StartIndex+99] .
  4. Go through the file, writing out those entries with list[StartIndex] <= value <= list[StartIndex+99] .
  5. Close all the files.
  6. Add 100 to StartIndex and go back to step 3 if you are not done yet.

So, you need 26 passes through the file: one to collect the values, then 25 (2500 values / 100 files per pass) to write the output.
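For reference, a sketch of this multi-pass scheme in Python, under the same assumptions as the answers above (tab-delimited input 100gigfile, one <value>.dat output per distinct first-column value, 100 files open per pass):

BATCH = 100

# Pass 0: collect and sort the distinct first-column values.
values = set()
with open('100gigfile', 'rb') as src:
    for line in src:
        values.add(line.split(b'\t', 1)[0])
values = sorted(values)

# Passes 1..N: each pass writes out the lines for the next 100 values.
for start in range(0, len(values), BATCH):
    batch = set(values[start:start + BATCH])
    out = {v: open(v.decode() + '.dat', 'wb') for v in batch}
    with open('100gigfile', 'rb') as src:
        for line in src:
            key = line.split(b'\t', 1)[0]
            if key in batch:
                out[key].write(line)
    for fh in out.values():
        fh.close()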

0

Source: https://habr.com/ru/post/885642/

