Awk associative array is growing rapidly

I have a file that assigns numbers to md5sums, as follows:

 0 0000001732816557DE23435780915F75
 1 00000035552C6F8B9E7D70F1E4E8D500
 2 00000051D63FACEF571C09D98659DC55
 3 0000006D7695939200D57D3FBC30D46C
 4 0000006E501F5CBD4DB56CA48634A935
 5 00000090B9750D99297911A0496B5134
 6 000000B5AEA2C9EA7CC155F6EBCEF97F
 7 00000100AD8A7F039E8F48425D9CB389
 8 0000011ADE49679AEC057E07A53208C1

Another file contains three md5sums in each line, for example:

 00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
 00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857

What I want is to replace the first and third md5sums in the second file with the corresponding integers from the first file. I am currently trying to do this with the following awk script:

 awk '{OFS="\t"}FNR==NR{map[$2]=$1;next} {print map[$1],$2,map[$3]}' mapping.txt relation.txt 

The problem is that the script requires more than 16 GB of memory, although the first file only takes 5.7 GB on disk.
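For scale, a rough back-of-the-envelope estimate (mine, not from the original post): a 5.7 GB mapping file at roughly 35-40 bytes per line is on the order of 150 million entries, and each element of an awk associative array stores the key string, the value string, and hash-table bookkeeping, which in gawk commonly amounts to well over 100 bytes per element:

 150,000,000 entries x (32-byte key + value + ~100 bytes of overhead) ~ 20+ GB

so exceeding 16 GB is plausible even though the file itself is much smaller.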

+6
3 answers

This problem can be solved as follows (file1.txt is the file with integers and md5sums, and file2.txt is the file with three columns of md5sums):

 #!/bin/sh

 # First sort each of file 1 and the first and third columns of file 2 by MD5
 awk '{ print $2 "\t" $1}' file1.txt | sort >file1_n.txt

 # Before we sort the file 2 columns, we number the rows so we can put them
 # back into the original order later
 cut -f1 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_1n.txt
 cut -f3 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_3n.txt

 # Now do a join between them, extract the two columns we want, and put them
 # back in order (the -t argument is a literal tab character)
 join -t'	' file2_1n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_1.txt
 join -t'	' file2_3n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_3.txt

 cut -f2 file2.txt | paste file2_1.txt - file2_3.txt >file2_new1.txt
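One practical note (mine, not from the original answer): sort and join must agree on collation order, or join may complain that its input is not sorted. Running the whole pipeline under the C locale avoids that. The script name below is hypothetical, just to show the invocation:

 # Not part of the original answer: force bytewise collation so that the
 # order produced by sort is the order expected by join.
 export LC_ALL=C
 ./remap.sh     # hypothetical name for the script above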

For the case where file1.txt and file2.txt are 1 million lines long, this solution and Ed Morton's awk-only solution take about the same amount of time on my system. My system would take a very long time to solve the 140-million-line problem regardless of the approach used, but I did check a test case with 10-million-line files.

I had conjectured that a solution based on sort (which automatically uses temporary files when necessary) should be faster for a large number of lines, because its runtime would be O(N log N), whereas a solution that re-reads the mapping file for every line of input would be O(N^2) if the two files are of similar size.

Timing results

My assumption about the relative performance of the two candidate solutions turned out to be wrong for the test cases I tried. On my system, the sort-based solution and the awk-only solution took roughly the same time (within 30%) for both the 1-million-line and the 10-million-line input files, with the awk-only solution being faster in each case. I do not know whether that relationship would still hold if the input file size were increased by another factor of 10.

Strangely, the 10-million-line problem took only about 10 times as long as the 1-million-line problem with both solutions, which puzzled me, since I would have expected a non-linear relationship with file length for both solutions.
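In case anyone wants to repeat the comparison, here is a rough sketch (mine, not part of the answer) for generating synthetic test files of the same shape; the value of N, the file names, and the use of random hex strings instead of real md5sums are all assumptions:

 #!/bin/sh
 # Sketch: build file1.txt ("<index><TAB><32-hex-char sum>") and file2.txt
 # (three sums per line, tab-separated), each with N lines.
 N=1000000

 # file1.txt: sequential index plus a random 32-character hex string
 awk -v n="$N" 'BEGIN {
     srand()
     for (i = 0; i < n; i++) {
         s = ""
         for (j = 0; j < 32; j++)
             s = s sprintf("%X", int(rand() * 16))
         print i "\t" s
     }
 }' > file1.txt

 # file2.txt: columns 1 and 3 are drawn from file1.txt so the lookups actually
 # match; column 2 is an unrelated random hex string.
 # (This generator holds all sums in memory itself, so it is only meant for
 # test sizes such as 1M or 10M lines.)
 awk -F'\t' -v n="$N" '
     { md5[NR] = $2 }
     END {
         srand()
         for (i = 0; i < n; i++) {
             b = ""
             for (j = 0; j < 32; j++)
                 b = b sprintf("%X", int(rand() * 16))
             print md5[int(rand() * NR) + 1] "\t" b "\t" md5[int(rand() * NR) + 1]
         }
     }' file1.txt > file2.txt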

+1

If you don't have enough memory to hold the first file, you need to write something like this, which re-reads the 1st file for every line of the second file:

 awk 'BEGIN{OFS="\t"}
 {
     val1 = val3 = ""
     while ( (getline line < "mapping.txt") > 0 ) {
         split(line,flds)
         if (flds[2] == $1) {
             val1 = flds[1]
         }
         if (flds[2] == $3) {
             val3 = flds[1]
         }
         if ( (val1 != "") && (val3 != "") ) {
             break
         }
     }
     close("mapping.txt")
     print val1,$2,val3
 }' relation.txt

It will be slow. You could add a cache of the N most recently read mapping lines to speed it up if you want.
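As a rough illustration of that caching idea (my own sketch, not Ed Morton's code; CACHE_MAX and the crude flush-when-full eviction are assumptions), the lookups can be wrapped in a function that consults an in-memory cache before rescanning mapping.txt:

 awk -v CACHE_MAX=100000 '
 BEGIN { OFS = "\t" }
 function lookup(md5,    line, flds, val) {
     if (md5 in cache)                   # already looked up: no file scan needed
         return cache[md5]
     val = ""
     while ( (getline line < "mapping.txt") > 0 ) {
         split(line, flds)
         if (flds[2] == md5) { val = flds[1]; break }
     }
     close("mapping.txt")
     if (cached >= CACHE_MAX) {          # crude eviction: throw the whole cache away
         split("", cache)                # portable way to empty an array
         cached = 0
     }
     cache[md5] = val
     cached++
     return val
 }
 { print lookup($1), $2, lookup($3) }
 ' relation.txt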

+2

If the file size causes awk to run out of memory, then either use a different tool or a completely different approach.

The sed command can do the job with much less memory. The idea is to read the index file, build a sed script that performs the remapping, and then run sed with the generated sed script.

The bash script below is an implementation of this idea. It includes some STDERR output to track progress. I like to produce progress output when working with large datasets or other kinds of time-consuming processing.

This script has been tested on a small data set; it should work with your data. Please give it a try.

 #!/bin/bash

 # md5-indexes.txt
 # 0 0000001732816557DE23435780915F75
 # 1 00000035552C6F8B9E7D70F1E4E8D500
 # 2 00000051D63FACEF571C09D98659DC55
 # 3 0000006D7695939200D57D3FBC30D46C
 # 4 0000006E501F5CBD4DB56CA48634A935
 # 5 00000090B9750D99297911A0496B5134
 # 6 000000B5AEA2C9EA7CC155F6EBCEF97F
 # 7 00000100AD8A7F039E8F48425D9CB389
 # 8 0000011ADE49679AEC057E07A53208C1

 # md5-data.txt
 # 00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
 # 00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857

 # Goal: replace field 1 and field 3 with indexes to md5 checksums from md5-indexes

 md5_indexes='md5-indexes.txt'
 md5_data='md5-data.txt'

 talk()  { echo 1>&2 "$*" ; }
 talkf() { printf 1>&2 "$@" ; }

 track() {
   local var="$1" interval="$2"
   local val
   eval "val=\$$var"
   if (( interval == 0 || val % interval == 0 )); then
     shift 2
     talkf "$@"
   fi
   eval "(( $var++ ))"   # increment the counter
 }

 # Build a sedscript to translate all occurrences of the 1st & 3rd MD5 sums into
 # their corresponding indexes

 talk "Building the sedscript from the md5 indexes.."

 sedscript=/tmp/$$.sed
 linenum=0
 lines=`wc -l <$md5_indexes`
 interval=$(( lines / 100 ))

 while read index md5sum ; do
   track linenum $interval "..$linenum"
   echo "s/^[[:space:]]*[[:<:]]$md5sum[[:>:]]/$index/" >>$sedscript
   echo "s/[[:<:]]$md5sum[[:>:]]\$/$index/"            >>$sedscript
 done <$md5_indexes
 talk ''

 sedlength=`wc -l <$sedscript`

 talkf "The sedscript is %d lines\n" $sedlength

 cmd="sed -E -f $sedscript -i .bak $md5_data"
 talk "Invoking: $cmd"
 $cmd

 changes=`diff -U 0 $md5_data.bak $md5_data | tail +3 | grep -c '^+'`

 talkf "%d lines changed in $md5_data\n" $changes

 exit
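One portability note (my observation, not part of the answer): the word-boundary expressions [[:<:]] / [[:>:]] and the "-i .bak" form with a space are BSD sed conventions (e.g. macOS). With GNU sed, the two generated rules and the invocation would look more like this sketch (untested against the original data):

 # Hypothetical GNU sed equivalents of the generated substitutions and the sed call:
 echo "s/^[[:space:]]*\<$md5sum\>/$index/" >>$sedscript
 echo "s/\<$md5sum\>\$/$index/"            >>$sedscript
 sed -E -f "$sedscript" -i.bak "$md5_data"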

Here are two files:

 cat md5-indexes.txt
 0 0000001732816557DE23435780915F75
 1 00000035552C6F8B9E7D70F1E4E8D500
 2 00000051D63FACEF571C09D98659DC55
 3 0000006D7695939200D57D3FBC30D46C
 4 0000006E501F5CBD4DB56CA48634A935
 5 00000090B9750D99297911A0496B5134
 6 000000B5AEA2C9EA7CC155F6EBCEF97F
 7 00000100AD8A7F039E8F48425D9CB389
 8 0000011ADE49679AEC057E07A53208C1

 cat md5-data.txt
 00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
 00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857

Here is an example of execution:

 $ ./md5-reindex.sh
 Building the sedscript from the md5 indexes..
 ..0..1..2..3..4..5..6..7..8
 The sedscript is 18 lines
 Invoking: sed -E -f /tmp/83800.sed -i .bak md5-data.txt
 2 lines changed in md5-data.txt

Finally, the resulting file:

 $ cat md5-data.txt
 1 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
 1 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
+1

Source: https://habr.com/ru/post/984695/

