How do I split a large text file into roughly equal-sized parts without breaking a record?

I have a large text file (about 10 GB) that contains many stories. Each story begins with a $$ marker line. Here is an example file:

$$
AA This is story 1
BB 345

$$

AA This is story 2
BB 456

I want to split this file into parts of about 250 MB each, but no story should be split across two files.

Can someone help me with Unix or Perl code for this?

3 answers
use strict;
use warnings;
use autodie;

$/ = "\$\$\n";   # read one story at a time: the "$$" marker line is the record separator
my $targetsize = 250*1024*1024;
my $fileprefix = 'chunk';
my $outfile = 0;
my $outfh;
my $outsize = 0;
while (my $story = <>) {
    chomp($story);
    next unless $story; # disregard initial empty chunk
    $story = "$/$story";   # put the $$ marker back at the front of the story

    # start a new file if none is open yet, or if adding this story would
    # leave the current file farther from the target size than it is now
    if ( ! $outfile || abs($outsize - $targetsize) < abs($outsize + length($story) - $targetsize) ) {
        ++$outfile;
        open $outfh, '>', "$fileprefix$outfile";
        $outsize = 0;
    }

    $outsize += length($story);
    print $outfh $story;
}
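
This reads the file(s) named on the command line and writes chunk1, chunk2, … in the current directory, so a typical invocation (the script name and input file name here are placeholders) would be:

perl split_stories.pl bigfile.txt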

csplit is what you want. It works like split, but cuts the file at a pattern rather than at a fixed size.
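
For example, GNU csplit can cut the file at every $$ marker line. This is only a sketch, not tested against the original data; bigfile.txt and the chunk prefix are placeholders:

csplit -z -n 5 -f chunk bigfile.txt '/^\$\$$/' '{*}'

Here -z drops the empty piece before the first $$, -f sets the output prefix, -n widens the numeric suffix, and '{*}' repeats the pattern to the end of the file. Note that this produces one file per story rather than 250 MB parts, so the pieces would still have to be concatenated up to the target size.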

An alternative in C++ (not tested):

#include <boost/shared_ptr.hpp>
#include <sstream>
#include <iostream>
#include <fstream>
#include <string>

// Open the next numbered output file: prefix_0, prefix_1, ...
void new_output_file(boost::shared_ptr<std::ofstream> &out, const char *prefix)
{
    static int i = 0;
    std::ostringstream filename;
    filename << prefix << "_" << i++;
    out.reset(new std::ofstream(filename.str().c_str()));
}

int main(int argc, char **argv)
{
    if (argc < 3)
    {
        std::cerr << "usage: " << argv[0] << " <input file> <output prefix>" << std::endl;
        return 1;
    }

    std::ifstream in(argv[1]);
    long size = 0;
    const long max_size = 200 * 1024 * 1024;
    std::string line;
    boost::shared_ptr<std::ofstream> out;
    new_output_file(out, argv[2]);
    while (std::getline(in, line))
    {
        size += line.length() + 1; /* +1 for the line termination char */
        // once the current part is large enough, start a new file at the next $$ marker
        if (size >= max_size && line.length() >= 2 && line[0] == '$' && line[1] == '$')
        {
            new_output_file(out, argv[2]);
            size = line.length() + 1;
        }
        *out << line << '\n'; // '\n' instead of std::endl avoids flushing on every line
    }
    return 0;
}
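
Assuming the Boost headers are installed (shared_ptr is header-only, so nothing extra needs to be linked), it could be built and run with something like the following, where the file names are placeholders:

g++ -O2 splitter.cpp -o splitter
./splitter bigfile.txt chunk

This writes chunk_0, chunk_1, … of roughly max_size bytes each.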

I modified ysth's code and found that it works. Please suggest improvements if you see any.

use strict;
use warnings;

my $targetsize = 50*1024*1024;
my $fileprefix = 'chunk';
my $outfile = 0;
my $outsize = 0;
my $outfh;
my $temp = '';
while (my $line = <>) {
    chomp($line);
    next unless $line;   # skip blank lines (note: they are not copied to the output)
    # a $$ marker line (or the very first line) may start a new output file
    if ($line =~ /^\$\$$/ || $outfile == 0) {
        $outsize += length($temp);   # add the size of the story just finished
        if ($outfile == 0 || ($outsize - $targetsize) > 0) {
            ++$outfile;
            if ($outfh) { close($outfh); }
            open $outfh, '>', "$fileprefix$outfile" or die "cannot open $fileprefix$outfile: $!";
            $outsize = 0;
        }
        $temp = '';
    }
    $temp .= $line;      # accumulate the current story to measure its size
    print $outfh "$line\n";
}
