Given a ~4 GB file, myfile.gz, I need to stream its decompressed contents through a pipe for use with Teradata fastload, and I also need to count the number of lines in the file. Ideally, I want to make only one pass over the file. I use awk to print each whole line ($0) to stdout and, in awk's END clause, write the line count (awk's NR variable) to a separate file (outfile).
I managed to do this with awk, but I would like to know if there is a more Pythonic way.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path

the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)

# Stream every decompressed line to stdout; awk's END block writes
# the final line count (NR) to a separate file.
cmd = 'zcat %s | awk \'{print $0} END {print NR > "%s"}\'' % (the_file, outfile)
zcat_proc = Popen(cmd, stdout=PIPE, shell=True)

(Note: with shell=True the command is passed as a single string; my earlier version wrapped it in a ["-c", ...] list, which is not how Popen expects it.)
The pipe is later consumed by the Teradata fastload call, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I would like to know whether it is possible to skip awk and do this better in pure Python. I am also open to other methods. I have several large files that I need to process this way.
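For comparison, here is a pure-Python sketch of the same one-pass idea, using only the standard gzip module (the function name stream_and_count and its arguments are my own invention, not anything from fastload): decompress, forward each line to a writable stream, and return the count, with no awk process involved.

```python
import gzip

def stream_and_count(gz_path, out_stream):
    """Decompress gz_path in a single pass: copy every line to
    out_stream (e.g. a pipe the loader reads from) and return the
    number of lines seen."""
    count = 0
    with gzip.open(gz_path, "rb") as src:  # streams; never holds 4 GB in RAM
        for line in src:
            out_stream.write(line)
            count += 1
    return count
```

The returned count could then be written to the same "/tmp/&lt;name&gt;.count" file the awk version produces. Reading in binary mode avoids needless decode/encode work on large files.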