Given a ~4 GB file, myfile.gz, I need to stream its decompressed contents through a pipe for use with Teradata fastload, and I also need to count the number of lines in the file. Ideally, I want to make only one pass over the file. I use awk to print each whole line ($0) to stdout and, in awk's END clause, write the line count (awk's NR variable) to a separate file (outfile).
I managed to do this with awk, but I would like to know if there is a more Pythonic way.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path

the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)

# Stream every decompressed line to stdout; awk's END block writes
# the final line count (NR) to a separate file.
cmd = 'zcat %s | awk \'{print $0} END {print NR > "%s"}\'' % (the_file, outfile)
zcat_proc = Popen(cmd, stdout=PIPE, shell=True)

(Note: with shell=True the command is passed as a single string; my earlier version wrapped it in a ["-c", ...] list, which is not how Popen expects it.)
The pipe is later consumed by the Teradata fastload call, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I would like to know whether it is possible to skip awk and do this better in pure Python. I am also open to other methods. I have several large files that I need to process this way.
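For comparison, here is a pure-Python sketch of the same one-pass idea, using only the standard gzip module (the function name stream_and_count and its arguments are my own invention, not anything from fastload): decompress, forward each line to a writable stream, and return the count, with no awk process involved.

```python
import gzip

def stream_and_count(gz_path, out_stream):
    """Decompress gz_path in a single pass: copy every line to
    out_stream (e.g. a pipe the loader reads from) and return the
    number of lines seen."""
    count = 0
    with gzip.open(gz_path, "rb") as src:  # streams; never holds 4 GB in RAM
        for line in src:
            out_stream.write(line)
            count += 1
    return count
```

The returned count could then be written to the same "/tmp/&lt;name&gt;.count" file the awk version produces. Reading in binary mode avoids needless decode/encode work on large files.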