Optimize shell script for multiple serial replacements

Question

I have a file containing a list of replacement pairs (about 100 of them) that sed uses to replace strings in files.

The format is as follows:

 old|new
 tobereplaced|replacement
 (stuffiwant).*(too)|\1\2

and my current code is:

 cat replacement_list | while read i
 do
     old=$(echo "$i" | awk -F'|' '{print $1}')    #due to the need for extended regex
     new=$(echo "$i" | awk -F'|' '{print $2}')
     sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
 done

I cannot help but think that there is a better way to do the replacements. I tried flipping the loop to iterate over the lines of the file first, but that turned out to be much more expensive.

Are there any other ways to speed up this script?

EDIT

Thanks for all the quick answers. Let me try out the various suggestions before choosing an answer.

One thing to clarify: I also need subexpression / group functionality. For example, one replacement I may need is:

 ([0-9])U|\10 #the extra brackets and escapes were required for my original code

Some timings for the suggested improvements (to be updated):

Method: Processing Time
Original script: 0.85s
cut instead of awk : 0.71s
anubhava method: 0.18s
Chthonicdaemon method: 0.01s

+5

bash shell sed

Reuben L. Aug 29 '14 at 6:50

7 answers

I recently benchmarked various string replacement methods, among them a custom program, sed -e , perl -lnpe and a probably not so widely known MySQL command-line utility, replace . Being optimized for string replacement, replace was almost an order of magnitude faster than sed . The results looked something like this (slowest first):

 custom program > sed > LANG=C sed > perl > LANG=C perl > replace

If you need performance, use replace . To have it available on your system, you will need to install a MySQL distribution.

From replace.c :

Replace strings in a text file
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string / to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred over shorter matches.
...
The program builds a DFA state machine of the strings, and the speed does not depend on the number of replacement strings (only on the number of replacements made). A line is assumed to end with \n or \0. There is no limit, except memory, on the length of the strings.

More on sed. You can use multiple cores with sed by splitting your replacements into #cpus groups and then piping them through separate sed commands, something like this:

 $ sed -e 's/A/B/g; ...' file.txt | \
   sed -e 's/B/C/g; ...' | \
   sed -e 's/C/D/g; ...' | \
   sed -e 's/D/E/g; ...' > out

In addition, if you use sed or perl and your system has a UTF-8 locale, placing LANG=C in front of the commands also improves performance:

 $ LANG=C sed ...

+3

miku Aug 29 '14 at 7:00

You can drop the unnecessary awk calls and let Bash split each old|new pair:

 while IFS='|' read -r old new; do
     # echo "$old :: $new"
     sed -i "s~$old~$new~g" file
 done < replacement_list

IFS='|' makes read split each line into the two shell variables old and new .

~ is assumed not to occur in your pairs. If it does, feel free to use a different sed delimiter.
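To see what the read-based splitting produces, here is a small sketch. The helper name split_pairs and the sample pairs are made up for illustration; only IFS='|' read -r comes from the answer above.

```shell
#!/usr/bin/env bash
# split_pairs is a hypothetical helper, not part of the original answer.
# It reads "pattern|replacement" lines on stdin and shows how read splits them.
split_pairs() {
    local old new
    while IFS='|' read -r old new; do
        printf 'old=<%s> new=<%s>\n' "$old" "$new"
    done
}

# Sample pairs (invented); the quoted heredoc keeps the backslash literal.
split_pairs <<'EOF'
old|new
([0-9])U|\10
EOF
```

The -r option matters here: without it, read treats the backslash as an escape character and strips it, so a replacement such as \10 would arrive as 10.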

+1

anubhava Aug 29 '14 at 7:01

Here is what I would try:

store the sed search/replace pairs in a Bash array,
build the sed command from this array using parameter expansion,
run the command.

 patterns=(
   old new
   tobereplaced replacement
 )
 pattern_count=${#patterns[*]}   # number of array elements (two per pair)
 sedArgs=()                      # will hold the list of sed arguments
 for (( i=0 ; i<pattern_count ; i=i+2 )); do   # step by two: search part first...
   search=${patterns[i]}
   replace=${patterns[i+1]}                    # ...then the replacement part
   sedArgs+=( -e "s/$search/$replace/g" )      # append as separate array elements
 done
 sed "${sedArgs[@]}" file

The resulting command:

 sed -e s/old/new/g -e s/tobereplaced/replacement/g file
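The same idea can be fed from the question's replacement_list instead of a hard-coded array. The following is only a sketch: the function build_sed_args, the array sed_args, and the sample data are hypothetical names introduced here, and | is used as the sed delimiter on the assumption that it never appears inside a pattern (it already separates the pairs in the list).

```shell
#!/usr/bin/env bash
# build_sed_args is a hypothetical helper: it reads "pattern|replacement"
# lines from the file named in $1 and fills the global array sed_args.
build_sed_args() {
    sed_args=()
    local old new
    while IFS='|' read -r old new; do
        sed_args+=( -e "s|$old|$new|g" )   # two array elements per pair
    done < "$1"
}

# Invented sample data in a temporary directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' 'old|new' 'tobereplaced|replacement' > "$workdir/replacement_list"
printf '%s\n' 'old text tobereplaced' > "$workdir/file"

build_sed_args "$workdir/replacement_list"
result=$(sed -r "${sed_args[@]}" "$workdir/file")   # one sed process for all pairs
echo "$result"
```

Quoting "${sed_args[@]}" keeps each expression as a single argument even if a pattern contains spaces.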

+1

Édouard Lopez Aug 29 '14 at 8:04

You can try this.

 pattern=''
 while read -r i
 do
     old=$(echo "$i" | awk -F'|' '{print $1}')    #due to the need for extended regex
     new=$(echo "$i" | awk -F'|' '{print $2}')
     pattern=${pattern}"s/${old}/${new}/g;"
 done < replacement_list    # redirect instead of piping from cat, so pattern is still set after the loop
 sed -r "${pattern}" -i file

This runs sed only once on the file, with all the replacements. You can also replace awk with cut . cut may be better optimized than awk , although I'm not sure about that.

 old=`echo "$i" | cut -d"|" -f1`
 new=`echo "$i" | cut -d"|" -f2`
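One thing to watch when a loop accumulates a variable such as pattern : in Bash (with default settings), a while loop fed by a pipe runs in a subshell, so assignments made inside it are gone once the loop ends. A minimal sketch of the difference, using invented counter variables:

```shell
#!/usr/bin/env bash
# Loop fed by a pipe: in Bash's default configuration the loop body runs in
# a subshell, so the parent shell's variable is left untouched.
count_piped=0
printf '%s\n' a b c | while read -r line; do
    count_piped=$((count_piped + 1))
done
echo "piped: $count_piped"

# Loop fed by a redirection (here a heredoc): the loop runs in the current
# shell, so the variable keeps its value after the loop.
count_redirected=0
while read -r line; do
    count_redirected=$((count_redirected + 1))
done <<'EOF'
a
b
c
EOF
echo "redirected: $count_redirected"
```

Reading the list with done < replacement_list rather than cat replacement_list | while ... avoids the problem.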

0

nisargjhaveri Aug 29 '14 at 7:03

You might want to do all this in awk:

 awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file

This builds arrays of the old and new words from the first file. next ensures that the rest of the script is not run on the first file. For the second file, it loops through the list of substitutions and applies them one at a time. The 1 at the end means the line is printed.
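A small run of the same awk program on invented sample data, as a sketch. One caveat worth knowing: gsub treats the patterns as extended regular expressions, but its replacement string does not understand group references such as \1 , so pairs that rely on captured groups need a different tool.

```shell
#!/usr/bin/env bash
# Invented sample data in a temporary directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' 'old|new' 'tobereplaced|replacement' > "$workdir/replacement_list"
printf '%s\n' 'old text' 'tobereplaced here' > "$workdir/file"

result=$(awk -F'|' '
    # First file (NR == FNR): remember each pair, then skip to the next line.
    NR == FNR { old[++n] = $1; new[n] = $2; next }
    # Second file: apply every substitution, in list order, to the current line.
    { for (i = 1; i <= n; ++i) gsub(old[i], new[i]) }
    # Print the (possibly modified) line.
    1
' "$workdir/replacement_list" "$workdir/file")
echo "$result"
```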

0

Tom fenech Aug 29 '14 at 7:40

 { cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
 t again
 :again
    /^-End-³\n/ {s///;b done
       }
    s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
    t again
    s/^[^³]*³\n//
    t again
 :done
   p
   }'

This is more for the fun of coding it in sed. It may still be worth timing, because it starts only one sed process, which works recursively.

This is for POSIX sed (so use --posix with GNU sed).

Explanation:

copy the replacement list in front of the file content, with delimiters ( ³ at the end of each list line and -End- after the list) for easier sed handling (it is difficult to use \n in a character class in POSIX sed)
put all lines into the buffer (the line delimiter for the replacement list and -End- having been added before)
if the buffer starts with -End-³ , delete that line and go to the final print
replace the first pattern (group 1) found in the text with the second pattern (group 2)
if found, restart ( t again )
delete the first line
restart the process ( t again ). t is used rather than b because b does not reset the substitution test, so the next t would always be true.

0

NeronLeVelu Aug 29 '14 at 8:43

chthonicdaemon · Accepted Answer · 2014-08-29T07:02:51+0000

You can use sed to create properly formatted sed input:

 sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
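To make the two stages visible, the sketch below writes the generated script to a file before running it. The temporary file names and the sample data are invented for illustration; the sed expressions are the ones from the answer.

```shell
#!/usr/bin/env bash
# Invented sample data in a temporary directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' 'old|new' '([0-9])U|\10' > "$workdir/replacement_list"
printf '%s\n' 'old value 5U' > "$workdir/file"

# Stage 1: turn each "pattern|replacement" line into "s|pattern|replacement|g".
sed -e 's/^/s|/; s/$/|g/' "$workdir/replacement_list" > "$workdir/script.sed"
generated=$(cat "$workdir/script.sed")
echo "$generated"

# Stage 2: run sed once over the file with the whole generated script.
result=$(sed -r -f "$workdir/script.sed" "$workdir/file")
echo "$result"
```

Because the list already uses | between pattern and replacement, the generated commands use | as the s delimiter, so group references such as \1 pass through unchanged. In the replacement, \10 is read as group 1 followed by a literal 0.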
