If the file size causes awk to run out of memory, then either use a different tool or a completely different approach.
The sed command can do the job with much less memory. The idea is to read the index file, generate a sed script containing one pair of substitutions per index entry, and then run sed with that generated script against the data file.
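To make the idea concrete, here is a minimal sketch of the two rules that would be generated for a single index entry, applied to one line of the sample data. It assumes the fields are separated by single spaces, so it matches on those spaces instead of word boundaries; the variable names `line`, `rules`, and `result` are only illustrative.

```bash
#!/bin/bash
# One index entry and one data line, both taken from the sample files.
index=1
md5sum=00000035552C6F8B9E7D70F1E4E8D500
line="00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7"

# Rule 1 rewrites the checksum when it is the first field;
# rule 2 rewrites it when it is the last field.
# Assumption: fields are separated by single spaces.
rules="s/^[[:space:]]*$md5sum /$index /
s/ $md5sum\$/ $index/"

result=$(printf '%s\n' "$line" | sed -e "$rules")
echo "$result"
```

Only the first rule fires on this line, because the third field is not in the index.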
The bash script below implements this idea. It writes progress messages to STDERR; I like to emit progress output when working with large data sets or other long-running jobs, so that problems are easier to track down.
This script has been tested only on a small data set, but it should also work with your data. Please give it a try.
    #!/bin/bash

    # md5-indexes.txt
    # 0 0000001732816557DE23435780915F75
    # 1 00000035552C6F8B9E7D70F1E4E8D500
    # 2 00000051D63FACEF571C09D98659DC55
    # 3 0000006D7695939200D57D3FBC30D46C
    # 4 0000006E501F5CBD4DB56CA48634A935
    # 5 00000090B9750D99297911A0496B5134
    # 6 000000B5AEA2C9EA7CC155F6EBCEF97F
    # 7 00000100AD8A7F039E8F48425D9CB389
    # 8 0000011ADE49679AEC057E07A53208C1

    # md5-data.txt
    # 00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
    # 00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857

    # Goal: replace field 1 and field 3 with indexes to md5 checksums from md5-indexes

    md5_indexes='md5-indexes.txt'
    md5_data='md5-data.txt'

    # Helpers that write to STDERR
    talk()  { echo 1>&2 "$*" ; }
    talkf() { printf 1>&2 "$@" ; }

    # track VARNAME INTERVAL FORMAT [ARGS...]
    # Print a progress message every INTERVAL calls, then increment the counter
    track() {
      local var="$1" interval="$2"
      local val
      eval "val=\$$var"
      if (( interval == 0 || val % interval == 0 )); then
        shift 2
        talkf "$@"
      fi
      eval "(( $var++ ))"   # increment the counter
    }

    # Build a sedscript to translate all occurrences of the 1st & 3rd MD5 sums into their
    # corresponding indexes
    # Note: [[:<:]] / [[:>:]] and "-i .bak" are BSD (macOS) sed syntax;
    # GNU sed uses \< / \> and -i.bak instead

    talk "Building the sedscript from the md5 indexes.."

    sedscript=/tmp/$$.sed

    linenum=0
    lines=`wc -l <$md5_indexes`
    interval=$(( lines / 100 ))

    while read index md5sum ; do
      track linenum $interval "..$linenum"
      echo "s/^[[:space:]]*[[:<:]]$md5sum[[:>:]]/$index/" >>$sedscript
      echo "s/[[:<:]]$md5sum[[:>:]]\$/$index/" >>$sedscript
    done <$md5_indexes
    talk ''

    sedlength=`wc -l <$sedscript`
    talkf "The sedscript is %d lines\n" $sedlength

    cmd="sed -E -f $sedscript -i .bak $md5_data"
    talk "Invoking: $cmd"

    $cmd

    # Count the changed lines, skipping the two diff header lines
    changes=`diff -U 0 $md5_data.bak $md5_data | tail -n +3 | grep -c '^+'`

    talkf "%d lines changed in $md5_data\n" $changes

    exit
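The script above relies on `[[:<:]]`, `[[:>:]]`, and `-i .bak`, which are BSD (macOS) sed conventions. On a system with GNU sed, the core of it might look like the sketch below. This is an untested-at-scale adaptation, not part of the original script, and the helper names `build_sedscript` and `apply_sedscript` are hypothetical.

```bash
#!/bin/bash
# Sketch for GNU sed; build_sedscript and apply_sedscript are hypothetical helper names.

# Write two substitution rules per index entry, using the GNU word
# boundaries \< and \> in place of [[:<:]] and [[:>:]].
build_sedscript() {
  local indexes="$1" script="$2"
  local index md5sum
  : >"$script"                      # start with an empty script file
  while read -r index md5sum ; do
    printf 's/^[[:space:]]*\\<%s\\>/%s/\n' "$md5sum" "$index" >>"$script"
    printf 's/\\<%s\\>$/%s/\n' "$md5sum" "$index" >>"$script"
  done <"$indexes"
}

# Edit the data file in place and keep a .bak copy.
# GNU sed expects the backup suffix attached to -i, with no space.
apply_sedscript() {
  sed -E -f "$1" -i.bak "$2"
}
```

With the file names used above, it could be called as `build_sedscript md5-indexes.txt /tmp/$$.sed` followed by `apply_sedscript /tmp/$$.sed md5-data.txt`.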
Here are two files:
    $ cat md5-indexes.txt
    0 0000001732816557DE23435780915F75
    1 00000035552C6F8B9E7D70F1E4E8D500
    2 00000051D63FACEF571C09D98659DC55
    3 0000006D7695939200D57D3FBC30D46C
    4 0000006E501F5CBD4DB56CA48634A935
    5 00000090B9750D99297911A0496B5134
    6 000000B5AEA2C9EA7CC155F6EBCEF97F
    7 00000100AD8A7F039E8F48425D9CB389
    8 0000011ADE49679AEC057E07A53208C1

    $ cat md5-data.txt
    00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
    00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
Here is an example of execution:
    $ ./md5-reindex.sh
    Building the sedscript from the md5 indexes..
    ..0..1..2..3..4..5..6..7..8
    The sedscript is 18 lines
    Invoking: sed -E -f /tmp/83800.sed -i .bak md5-data.txt
    2 lines changed in md5-data.txt
Finally, the resulting file:
    $ cat md5-data.txt
    1 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
    1 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857