What is the best way, in Python or bash, to selectively concatenate a large number of files?

I have about 20,000 files coming from the output of a program, and their names follow the format:

data1.txt
data2.txt
...
data99.txt
data100.txt
...
data999.txt
data1000.txt
...
data20000.txt

I would like to write a script that takes a number N as an input argument and concatenates the files in blocks of N. So if N = 5, it would make the following new files:

data_new_1.txt: it would contain data1.txt through data5.txt, concatenated (as with cat data1.txt data2.txt ... > data_new_1.txt)

data_new_2.txt: it would contain data6.txt through data10.txt, concatenated
...

What do you think is the best way to do this: bash, Python, or something else such as awk or Perl?

The best approach, in my view, is the simplest code.
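To pin down the off-by-one: input file i (counted from 1) should land in output block (i - 1) // N + 1. A quick sketch of that mapping, index arithmetic only, no I/O (block_of is just an illustrative helper, not part of any program here):

```python
def block_of(i, n):
    """Map a 1-based input index to its 1-based output block index."""
    return (i - 1) // n + 1

# With n = 5: inputs 1..5 -> block 1, inputs 6..10 -> block 2, ...
groups = {}
for i in range(1, 13):
    groups.setdefault(block_of(i, 5), []).append("data%d.txt" % i)
```

So data5.txt closes block 1 and data6.txt opens block 2.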

Thanks.

+3
7 answers

How about a one-liner? :)

ls data[0-9]*txt|sort -nk1.5|awk 'BEGIN{rn=5;i=1}{while((getline _<$0)>0){print _ >"data_new_"i".txt"}close($0)}NR%rn==0{i++}'
+1

Python 2.6 (for Python 2.5, first add

from __future__ import with_statement

at the top of the script):

import sys

def main(N):
    rN = range(N)
    for iout, iin in enumerate(xrange(1, 99999, N)):
        with open('data_new_%s.txt' % (iout+1), 'w') as out:
            for di in rN:
                try: fin = open('data%s.txt' % (iin + di), 'r')
                except IOError: return
                out.write(fin.read())
                fin.close()

if __name__ == '__main__':
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    else:
        N = 5
    main(N)

I think a compact Python script like this is preferable to bash here (the only module it needs is sys, and there is no "magic" syntax); unlike a bash solution it does not have to fork/exec cat for every block, so it should also be somewhat faster. Adjust to taste, of course.
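One caveat with the loop above: out.write(fin.read()) slurps each whole input file into memory. If the individual files are big, shutil.copyfileobj copies in fixed-size chunks instead. A sketch of the same loop with that change (same assumed data<i>.txt naming; concat_blocks is a made-up name, not the answer's original code):

```python
import shutil

def concat_blocks(n):
    """Concatenate data1.txt, data2.txt, ... into data_new_1.txt, ...,
    n input files per output.  Stops at the first missing input file."""
    iin, iout = 1, 1
    while True:
        out = None
        for k in range(n):
            try:
                fin = open('data%d.txt' % (iin + k), 'rb')
            except IOError:
                if out:
                    out.close()
                return
            if out is None:
                # open the output lazily, so no empty trailing file is made
                out = open('data_new_%d.txt' % iout, 'wb')
            shutil.copyfileobj(fin, out)   # chunked copy, constant memory
            fin.close()
        out.close()
        iin += n
        iout += 1
```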

+4

Why not Bash? This only takes a short Bash script. Is this what you want?

In Bash:

 declare blocksize=5
 declare i=1
 declare block=1
 declare blockend=$blocksize
 declare -a fileset
 while [ -f data${i}.txt ] ; do
         fileset=("${fileset[@]}" data${i}.txt)
         i=$(($i + 1))
         if [ $i -gt $blockend ] ; then
                  cat "${fileset[@]}" > data_new_${block}.txt
                  fileset=() # clear
                  block=$(($block + 1))
                  blockend=$(($blockend + $blocksize))
         fi
 done
 # write out the last, possibly partial, block
 if [ ${#fileset[@]} -gt 0 ] ; then
         cat "${fileset[@]}" > data_new_${block}.txt
 fi

EDIT: if "simplest" == "fewest lines of code", then Perl, Python, or Awk will beat bash; I read it as "easiest to follow and adapt".

EDIT: as dtmilano points out, this approach forks/execs cat once per block, i.e. about 4000 times for 20,000 files with N = 5.

+1

Simple and straightforward, with one cat invocation per block:

#! /bin/bash

N=5 # block size
S=1 # start
E=20000 # end

for n in $(seq $S $N $E)
do
    CMD="cat "
    i=$n
    while [ $i -lt $((n + N)) ]
    do
        CMD+="data$((i++)).txt "
    done
    $CMD > data_new_$((n / N + 1)).txt
done
+1

Since this is easy to do in any shell, I would just use the shell.

This should do it:

#!/bin/sh
FILES=$1
FILENO=1

# Iterate in numeric order; the glob data[0-9]*.txt would sort
# data10.txt before data2.txt.
i=1
while [ -f "data${i}.txt" ]; do
    cat "data${i}.txt" >> "data_new_${FILENO}.txt"
    FILES=`expr $FILES - 1`
    if [ $FILES -eq 0 ]; then
        FILENO=`expr $FILENO + 1`
        FILES=$1
    fi
    i=`expr $i + 1`
done

Python version:

#!/usr/bin/env python

import os
import sys

if __name__ == '__main__':
    files_per_file = int(sys.argv[1])

    i = 0
    while True:
        i += 1
        source_file = 'data%d.txt' % i
        if os.path.isfile(source_file):
            # files 1..N -> block 1, N+1..2N -> block 2, ...
            dest_file = 'data_new_%d.txt' % ((i - 1) / files_per_file + 1)
            file(dest_file, 'a').write(file(source_file).read())
        else:
            break
0

Say you have a simple script that merges its arguments and keeps a counter for you, for example:

#!/bin/bash
COUNT=0
if [ -f counter ]; then
  COUNT=`cat counter`
fi
COUNT=$(($COUNT + 1))
echo $COUNT > counter
cat "$@" > data_new_${COUNT}.txt

Then this command line does the rest:

ls data[0-9]*.txt | sort -nk1.5 | xargs -n 5 path_to_the_script

(A bare find would feed the files in directory order rather than numeric order, and would also match the script's own output files.)
0

Simple enough?

make_cat.py

limit = 1000  # number of input files (use 20000 for the full set)
n = 5         # files per block
for i in xrange(0, (limit + n - 1) // n):
    # input files are numbered from 1, so block i covers i*n+1 .. i*n+n
    names = ["data{0}.txt".format(j) for j in range(i * n + 1, i * n + n + 1)]
    print "cat {0} > data_new_{1}.txt".format(" ".join(names), i + 1)

Then run the generated commands through the shell:

python make_cat.py | sh
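The same generate-and-run idea can also stay entirely in Python by invoking cat through subprocess instead of piping command text to sh. A sketch under the same naming assumptions (run_cat_blocks is a made-up name):

```python
import subprocess

def run_cat_blocks(limit, n):
    """Concatenate data1.txt .. data<limit>.txt into data_new_1.txt, ...,
    invoking cat once per output file of up to n inputs."""
    for i in range((limit + n - 1) // n):
        names = ["data%d.txt" % j
                 for j in range(i * n + 1, min((i + 1) * n, limit) + 1)]
        with open("data_new_%d.txt" % (i + 1), "w") as out:
            subprocess.check_call(["cat"] + names, stdout=out)
```

This avoids shell quoting issues and stops with an exception if any cat fails.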
0

Source: https://habr.com/ru/post/1736701/

