Replace each } with }\n in a huge (12 GB) file which consists of 1 line?

I have a log file (from a client), about 18 gigs, and the entire contents of the file are on a single line. I want to read the file into logstash, but I run into memory problems: logstash reads the file line by line, and unfortunately everything is on one line.

I tried to split the file into lines so that logstash could process it (the file contains simple JSON, no nested objects). I wanted each JSON object on its own line, splitting on } by replacing it with }\n:

 sed -i 's/}/}\n/g' NonPROD.log.backup 

But sed got killed, I guess also because of memory. How can I solve this? Can I make sed process the file in chunks other than lines? I know that by default sed reads line by line.

+5
5 answers

The following uses nothing but shell built-ins:

    #!/bin/bash
    # as long as there exists another } in the file, read up to it...
    while IFS= read -r -d '}' piece; do
      # ...and print that content followed by '}' and a newline.
      printf '%s}\n' "$piece"
    done
    # print any trailing content after the last }
    [[ $piece ]] && printf '%s\n' "$piece"

If you have logstash configured to read from a TCP port (using 14321 as an arbitrary example below), you can run thescript <NonPROD.log.backup >"/dev/tcp/127.0.0.1/14321" or similar, and there you go: the file is split without ever needing to have double the input file's space available on disk, as some other answers require.
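As a quick sanity check of the loop's behavior, it can be fed a tiny hypothetical three-record sample (requires bash for read -d and [[ ]]):

```shell
#!/usr/bin/env bash
# Run the same read loop over a small sample instead of the real log file.
out=$(printf 'a}b}c' | {
  while IFS= read -r -d '}' piece; do
    printf '%s}\n' "$piece"
  done
  # trailing content after the last }
  [ -n "$piece" ] && printf '%s\n' "$piece"
})
printf '%s\n' "$out"
```

Each } becomes a record boundary, so the sample comes out as three lines: a}, b}, and c.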

+6

With GNU awk for RT:

    $ printf 'abc}def}ghi\n' | awk -v RS='}' '{ORS=(RT?"}\n":"")}1'
    abc}
    def}
    ghi

With other awks:

    $ printf 'abc}def}ghi\n' | awk -v RS='}' -v ORS='}\n' 'NR>1{print p} {p=$0} END{printf "%s",p}'
    abc}
    def}
    ghi

I decided to test all of the currently posted solutions for functionality and execution time, using an input file generated by this command:

 awk 'BEGIN{for(i=1;i<=1000000;i++)printf "foo}"; print "foo"}' > file1m 
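The same generator can be sanity-checked at a smaller scale first (a hypothetical 10-record run; the /tmp/file10 path is just for illustration). Ten "foo}" records plus a final "foo" and newline should give 10*4 + 3 + 1 = 44 bytes on a single line:

```shell
# Smaller run of the benchmark input generator.
awk 'BEGIN{for(i=1;i<=10;i++)printf "foo}"; print "foo"}' > /tmp/file10
bytes=$(wc -c < /tmp/file10)
lines=$(wc -l < /tmp/file10)
echo "size: $bytes, newlines: $lines"
```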

and here is what I got:

1) awk (both awk scripts above had similar results):

 time awk -v RS='}' '{ORS=(RT?"}\n":"")}1' file1m 

Expected output, time =

    real    0m0.608s
    user    0m0.561s
    sys     0m0.045s

2) shell loop:

    $ cat tst.sh
    #!/bin/bash
    # as long as there exists another } in the file, read up to it...
    while IFS= read -r -d '}' piece; do
      # ...and print that content followed by '}' and a newline.
      printf '%s}\n' "$piece"
    done
    # print any trailing content after the last }
    [[ $piece ]] && printf '%s\n' "$piece"
    $ time ./tst.sh < file1m

Expected output, time =

    real    1m52.152s
    user    1m18.233s
    sys     0m32.604s

3) tr + sed:

 $ time tr '}' '\n' < file1m | sed 's/$/}/' 

Did not produce the expected output (an unwanted } added at the end of the file), time =

    real    0m0.577s
    user    0m0.468s
    sys     0m0.078s

With a tweak to remove that final unwanted }:

    $ time tr '}' '\n' < file1m | sed 's/$/}/; $s/}//'

    real    0m0.718s
    user    0m0.670s
    sys     0m0.108s
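The tweaked pipeline can be spot-checked on the same small abc}def}ghi sample used for the awk versions:

```shell
# tr turns every } into a newline, sed puts a } back at the end of every
# line, and the extra $s/}// strips the one } that sed wrongly appends to
# the trailing "ghi" piece.
out=$(printf 'abc}def}ghi\n' | tr '}' '\n' | sed 's/$/}/; $s/}//')
printf '%s\n' "$out"
```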

4) fold + sed + tr:

 $ time fold -w 1000 file1m | sed 's/}/}\n\n/g' | tr -s '\n' 

Expected output, time =

    real    0m0.811s
    user    0m1.137s
    sys     0m0.076s

5) split + sed + cat:

    $ cat tst2.sh
    mkdir tmp$$
    pwd="$(pwd)"
    cd "tmp$$"
    split -b 1m "${pwd}/${1}"
    sed -i 's/}/}\n/g' x*
    cat x*
    rm -f x*
    cd "$pwd"
    rmdir tmp$$
    $ time ./tst2.sh file1m

Expected output, time =

    real    0m0.983s
    user    0m0.685s
    sys     0m0.167s
+3

You can run it through tr and then tack the closing bracket back onto the end of each line:

    $ cat NonPROD.log.backup | tr '}' '\n' | sed 's/$/}/' > tmp$$
    $ wc -l NonPROD.log.backup tmp$$
           0 NonPROD.log.backup
          43 tmp10528
          43 total

(My test file had only 43 brackets.)

+2

You can:

  • Split the file into, say, 1 MB chunks using split -b 1m file.log
  • Process all of the chunks with sed 's/}/}\n/g' x*
  • ...and redirect sed's output to combine them back into a single piece

The disadvantage of this is the doubled storage space required.
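A minimal sketch of those steps on a toy input (the mktemp directory and 4-byte chunk size are just for illustration). Note that a chunk boundary falling mid-record is harmless here: sed only inserts a newline after each }, and cat rejoins the pieces byte for byte.

```shell
# Split, edit each chunk in place (GNU sed -i), then concatenate.
dir=$(mktemp -d)
printf 'abc}def}ghi' > "$dir/in.log"
(
  cd "$dir"
  split -b 4 in.log          # -b 1m in the real case; 4 bytes here
  sed -i 's/}/}\n/g' x*
  cat x* > out.log
)
out=$(cat "$dir/out.log")
printf '%s\n' "$out"
rm -rf "$dir"
```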

+1

Another alternative, with fold:

 $ fold -w 1000 long_line_file | sed 's/}/}\n\n/g' | tr -s '\n' 
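A small illustration of how that pipeline behaves, with a width of 4 chosen so that every fold break lands right after a } (as 1000 does for the 4-byte "foo}" records above): sed doubles the newline after each }, and tr -s then squeezes every run of newlines down to a single one.

```shell
# fold: break the long line into short lines; sed: add two newlines after
# each }; tr -s: collapse each run of newlines into one.
out=$(printf 'abc}def}ghi' | fold -w 4 | sed 's/}/}\n\n/g' | tr -s '\n')
printf '%s\n' "$out"
```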
0

Source: https://habr.com/ru/post/1269376/

