How to quickly parse large (> 10 GB) files?

I need to process text files that are 10-20 GB in size, in the format: field1 field2 field3 field4 field5

I would like to write the data from field2 of each line into one of several files; which file it goes into is determined line by line by the value in field4. There are 25 different possible values in field4, and hence 25 different files the data can end up in.

I have tried Perl (slow) and awk (faster, but still slow). Does anyone have suggestions or pointers to alternative approaches?

FYI, here is the awk code I tried to use. Note that I had to pass over the big file 25 times, because I could not figure out how to keep 25 output files open at once in awk:

chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25)
for chr in ${chromosomes[@]}
do

awk < my_in_file_here -v pat="$chr" '{if ($4 == pat) for (i = $2; i <= $2+52; i++) print i}' >> my_out_file_"$chr".query 

done
+3

Try Python. I don't know whether it will be fast enough for you, but the interpreter's inner loops are written in C, so Python is often surprisingly quick at this kind of line-at-a-time processing; in my experience it is at least competitive with Perl for jobs like this. Here is a script that does the split in a single pass:

import sys

s_usage = """\
Usage: csplit <filename>
Splits input file by columns, writes column 2 to file based on chromosome from column 4."""

if len(sys.argv) != 2 or sys.argv[1] in ("-h", "--help", "/?"):

    sys.stderr.write(s_usage + "\n")
    sys.exit(1)


# replace these with the actual patterns, of course
lst_pat = [
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y'
]


d = {}
for s_pat in lst_pat:
    # build a dictionary mapping each pattern to an open output file
    d[s_pat] = open("my_out_file_" + s_pat, "wt")

if False:
    # if the patterns are unsuitable for filenames (contain '*', '?', etc.) use this:
    for i, s_pat in enumerate(lst_pat):
        # build a dictionary mapping each pattern to an output file
        d[s_pat] = open("my_out_file_" + str(i), "wt")

for line in open(sys.argv[1]):
    # split a line into words, and unpack into variables.
    # use '_' for a variable name to indicate data we don't care about.
    # s_data is the data we want, and s_pat is the pattern controlling the output
    _, s_data, _, s_pat, _ = line.split()
    # use s_pat to get to the file handle of the appropriate output file, and write data.
    d[s_pat].write(s_data + "\n")

# close all the output file handles.
for key in d:
    d[key].close()

EDIT: A few notes on why the code is written the way it is.

First, notice that the dictionary lookup in the main loop is not guarded in any way. The usual Python style is "ask forgiveness, not permission": rather than testing whether a key is valid before using it, you just do the lookup and catch the exception if it ever fails. A try/except costs essentially nothing as long as no exception is actually raised, so the common case stays fast.
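As a concrete illustration of that style, here is a minimal sketch (the dictionary and keys are made up for the example) contrasting a look-before-you-leap check with try/except:

```python
# Hypothetical pattern-to-filename table, standing in for the dict of
# open file handles used in the script above.
d = {"1": "my_out_file_1", "2": "my_out_file_2"}

def lookup_lbyl(key):
    # Look Before You Leap: test first, then act (two dict probes on a hit).
    if key in d:
        return d[key]
    return None

def lookup_eafp(key):
    # Easier to Ask Forgiveness than Permission: act, catch the rare failure.
    try:
        return d[key]
    except KeyError:
        return None

print(lookup_lbyl("1"), lookup_eafp("99"))  # -> my_out_file_1 None
```

When almost every key is present, the try/except version does less work per line, which adds up over hundreds of millions of lines.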

Second, the main loop unpacks each split line into named variables purely for readability. (Using "_" as the name for fields we don't care about is just a Python convention; the underscore is not special to the language.) Python also lets you skip the unpacking and index the split list directly, which makes the loop shorter; if you prefer that, it becomes:

for line in open(sys.argv[1]):
    lst = line.split()
    d[lst[3]].write(lst[1] + "\n")

Here lst is the list returned by split(), and we index it directly instead of unpacking. Since Python counts from 0, field2 is lst[1] and the output-selecting field4 is lst[3]. The two loops do the same work per line; pick whichever reads better to you.

Either way, if an unexpected value ever shows up in field4, the script will stop with a KeyError traceback. That is easy to guard against in Python: wrap the lookup in try/except to report bad lines and keep going:

for line in open(sys.argv[1]):
    lst = line.split()
    try:
        d[lst[3]].write(lst[1] + "\n")
    except KeyError:
        sys.stderr.write("Warning: illegal line seen: " + line)

Now a malformed line produces a warning on stderr instead of killing the whole run.

EDIT: @larelogio points out that this code does not match the AWK code. The AWK version has an inner loop that prints every number from $2 up to $2+52, not field2 itself. Here is Python that does the same:

for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    for i in range(n, n+53):
        d[lst[3]].write(str(i) + "\n")

That should match the AWK output. It may also pay to cut down the number of .write() calls by building each 53-number block as one string and writing it in a single call:

for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    s = "\n".join(str(i) for i in range(n, n+53))
    d[lst[3]].write(s + "\n")

This might well be faster than calling .write() 53 times per line, though I haven't benchmarked either version.
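To measure that on your own machine, a rough timeit harness like the following (with fabricated input lines, and list appends standing in for file writes) compares the two loops:

```python
import timeit

# Fabricated input: 1,000 lines in the "f1 f2 f3 f4 f5" shape from the question.
lines = ["100 2000 x 7 y"] * 1_000

def many_writes(out):
    # One append (stand-in for fh.write) per number: 53 calls per line.
    for line in lines:
        lst = line.split()
        n = int(lst[1])
        for i in range(n, n + 53):
            out.append(str(i) + "\n")

def one_write(out):
    # Build the whole 53-number block first, then a single append per line.
    for line in lines:
        lst = line.split()
        n = int(lst[1])
        out.append("\n".join(str(i) for i in range(n, n + 53)) + "\n")

print("53 writes/line:", timeit.timeit(lambda: many_writes([]), number=20))
print("1 write/line:  ", timeit.timeit(lambda: one_write([]), number=20))
```

Both loops produce byte-identical output; only the number of write-like calls differs, so any timing gap is pure call overhead.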

+7

A Perl version:

#! /usr/bin/perl

use warnings;
use strict;

my @values = (1..25);

my %fh;
foreach my $chr (@values) {
  my $path = "my_out_file_$chr.query";
  open my $fh, ">", $path
    or die "$0: open $path: $!";

  $fh{$chr} = $fh;
}

while (<>) {
  chomp;
  my($a,$b,$c,$d,$e) = split " ", $_, 5;

  print { $fh{$d} } "$_\n"
    for $b .. $b+52;
}
+15

Why do you need to go through the big file 25 times? awk can keep all 25 output files open at once and do it in a single pass!!

awk '
$4 <=25 {
    for (i = $2; i <= $2+52; i++){
        print i >> "my_out_file_"$4".query"
    }
}' bigfile
+7

I'm not convinced awk itself has to be the bottleneck here.

Honestly, this sounds like a job for good old K&R C.

Treat your awk script as the specification: read each line with scanf() in C, keep all 25 output files open at once, and with one pass over the data instead of 25 the C program should beat the awk script comfortably.

+1

There are times when memory-mapped I/O pays off big. I once had to process a .5 GB file in Visual Basic 5 (yes, really); using the CreateFileMapping API (for which VB has no "native" wrapper) made it fast.

Unfortunately that API is Microsoft-specific. If you want the details on MMIO under Windows, start here: MSDN

Good luck!

+1

There are some precalculations that may help.

For example, you can precompute the output block for each value of field2. Assuming field2, like field4, takes only the values 1..25:

my %tx = map {my $tx=''; for my $tx1 ($_ .. $_+52) {$tx.="$tx1\n"}; $_=>$tx} (1..25);

Later: print {$fh{$pat}} $tx{$base};
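The same trick drops straight into the Python script from the earlier answer. Assuming, as here, that field2 only takes the values 1..25, the 53-number blocks can be built once up front (the names tx, d and lst mirror that script; this is a sketch, not tested against real data):

```python
# Precompute the 53-number output block for every possible field2 value;
# the per-line work then shrinks to two dict lookups and one write.
tx = {b: "\n".join(str(i) for i in range(b, b + 53)) + "\n"
      for b in range(1, 26)}

# The main loop then becomes (d being the dict of open output files):
#     lst = line.split()
#     d[lst[3]].write(tx[int(lst[1])])
```

This trades a tiny fixed amount of memory (25 short strings) for skipping the inner number-formatting loop on every line.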

0

Source: https://habr.com/ru/post/1725914/

