Bash: Split a file in Linux into 10 pieces, breaking only at empty lines

I am currently analyzing some files with a Scala application. The problem is that the files are too large, so the application always throws a heap-size exception (I tried the maximum heap size I can set, and it still fails).

Now the files look like this:

This is
one paragraph
for Scala
to parse

This is
another paragraph
for Scala
to parse

Yet another
paragraph

And so on. Basically, I would like to split each of these files into 10 or 20 pieces, but I must be sure that no paragraph is split in half in the output. Is there a way to do this?

Thanks!

+4
4 answers

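Here is an awk script that splits each input file into roughly batch_size pieces (every paragraph is written with a trailing blank line). Save it to a file and make it executable: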

#!/usr/bin/awk -f

BEGIN {RS=""; ORS="\n\n"; last_f=""; batch_size=20}

# perform setup whenever the filename changes
FILENAME!=last_f {r_per_f=calc_r_per_f(); fnum=0; incr_out(); last_f=FILENAME}

# write a record to an output file
{print $0 > out}

# after a batch, change the file name
(FNR%r_per_f)==0 {incr_out()}

# function to roll the file name
function incr_out() {close(out); fnum++; out=FILENAME"_"fnum".out"}

# function to get the number of records per file
function calc_r_per_f() {
    cmd=sprintf( "grep \"^$\" %s | wc -l", FILENAME )
    cmd | getline rcnt
    close(cmd)
    n=int(rcnt/batch_size)
    return (n<1 ? 1 : n)   # at least 1, so FNR%r_per_f never divides by zero
    }

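Adjust batch_size to change the number of pieces. If you want different output file names, change the out= assignment in incr_out().

If you save the script as awko, for example, you can run it as awko data1 data2 and get output files with names like data2_7.out.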
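For comparison, here is a minimal two-pass sketch of the same idea, not the script above. It counts paragraphs with awk's paragraph mode (RS="") instead of grep, then writes a fixed number of paragraphs per piece. The file name data.txt, the variable pieces, and the tiny sample input are assumptions made up for illustration.

```shell
# Work in a scratch directory with a hypothetical 5-paragraph sample.
dir=$(mktemp -d) && cd "$dir"
printf 'p1\n\np2\n\np3\n\np4\n\np5\n' > data.txt
pieces=2

# Pass 1: in paragraph mode (RS=""), NR at END is the paragraph count.
total=$(awk 'BEGIN{RS=""} END{print NR}' data.txt)

# Pass 2: write ceil(total/pieces) paragraphs to each output file.
awk -v total="$total" -v pieces="$pieces" '
  BEGIN { RS=""; ORS="\n\n"
          per = int((total + pieces - 1) / pieces); if (per < 1) per = 1 }
  { f = FILENAME "_" (int((NR - 1) / per) + 1) ".out"
    # close the previous piece before starting a new one
    if (f != out) { if (out != "") close(out); out = f }
    print > out }
' data.txt

ls data.txt_*.out
```

With the sample above, the first piece gets three paragraphs and the second gets two. Because whole records are written, a paragraph is never split across pieces.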

+1

csplit file.txt '/^$/' '{*}'

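csplit splits a file into pieces at lines that match a pattern.

/^$/ matches empty lines.

{*} repeats the previous pattern as many times as possible.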
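A small demonstration, assuming GNU csplit (the {*} repeat count is a GNU extension). The sample file and the part_ prefix are made up for illustration. Note that this produces one output file per paragraph, not a fixed number of pieces, so the pieces would still need to be concatenated into groups afterwards.

```shell
# Work in a scratch directory with a hypothetical 3-paragraph sample.
dir=$(mktemp -d) && cd "$dir"
printf 'a\nb\n\nc\nd\n\ne\n' > sample.txt

# -s: do not print piece sizes; -f: prefix for the output file names.
# Each piece after the first begins with the empty line that matched.
csplit -s -f part_ sample.txt '/^$/' '{*}'

ls part_*
```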

+5

To start a new file every 3 paragraphs:

awk 'BEGIN{nParMax=3;npar=0;nFile=0}
     /^$/{npar++;if(npar==nParMax){nFile++;npar=0;next}}
     {print $0 > "foo."nFile}' foo.orig

To start a new file at the first paragraph break after at least 10 lines:

awk 'BEGIN{nLineMax=10;nline=0;nFile=0}
    /^$/{if(nline>=nLineMax){nFile++;nline=0;next}}
    {nline++;print $0 > "foo."nFile}' foo.orig
+1

If "split" cannot do this, you can use an awk script:

awk -v RS="\n\n" 'BEGIN {n=1}{f="file" n++ ".txt"; print $0 > f; close(f)}' yourfile.txt

This writes each paragraph to its own file, named "file1.txt", "file2.txt", etc.

To increment "n" only once every N paragraphs, you can do:

awk -v RS="\n\n" -v ORS="\n\n" 'BEGIN{n=1; i=0; nbp=100}{if (i++ == nbp) {i=1; n++} print $0 > ("file" n ".txt")}' yourfile.txt

Just change the value of "nbp" to set the number of paragraphs per file.

0

Source: https://habr.com/ru/post/1533623/

