Reading and parsing a TSV file, then manipulating it to save as CSV (efficiently)

My source data is in a TSV file, 6 columns and more than 2 million rows.

Here is what I am trying to accomplish:

  • I need to read the data in 3 of the columns (3, 4, 5) of this source file.
  • The fifth column is an integer. I need to use this integer value to duplicate a row entry built from the data in the third and fourth columns (repeated that many times).
  • I want to write the output of step 2 to an output file in CSV format.
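To make the second step concrete, here is a minimal sketch of the expansion for a single row that has already been split on tabs. The function name `expand_row` is hypothetical and only for illustration, and it assumes the fifth field always parses as an integer.

```python
def expand_row(fields):
    """Return [column 3, column 4] repeated as many times as column 5 says."""
    count = int(fields[4])  # fifth column holds the repeat count
    # range() yields nothing for zero or negative counts, so such rows vanish
    return [fields[2:4] for _ in range(count)]


# One line of the sample data, split on tabs
row = "Row1_Column1\tRow1-Column2\tRow1-Column3\tRow1-Column4\t2\tRow1-Column6".split("\t")
print(expand_row(row))
```

For this row the result is the pair `Row1-Column3`, `Row1-Column4` listed twice, which is what should end up as two lines in the CSV.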

Below is what I came up with.

My question is: is this an efficient way to do it? It looks like it could be expensive when run on 2 million lines.

First, I created a separate tab-delimited sample file to work with and named it “sample.txt”. It is basic and has only four lines:

    Row1_Column1	Row1-Column2	Row1-Column3	Row1-Column4	2	Row1-Column6
    Row2_Column1	Row2-Column2	Row2-Column3	Row2-Column4	3	Row2-Column6
    Row3_Column1	Row3-Column2	Row3-Column3	Row3-Column4	1	Row3-Column6
    Row4_Column1	Row4-Column2	Row4-Column3	Row4-Column4	2	Row4-Column6

Then I have this code:

    import csv

    with open('sample.txt','r') as tsv:
        AoA = [line.strip().split('\t') for line in tsv]

    for a in AoA:
        count = int(a[4])
        while count > 0:
            with open('sample_new.csv','ab') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=',')
                csvwriter.writerow([a[2], a[3]])
            count = count - 1
python file csv tab-delimited-text
Dec 21 '12 at 15:38
1 answer

You should use the csv module to read the tab-separated file as well. Do not read the whole file into memory at once. Each row you read already contains all the information needed to write the corresponding rows to the output CSV file. Keep the output file open for the whole run instead of reopening it for every row.

    import csv

    # Python 2 idioms: binary file modes for the csv module, and xrange
    with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        csvout = csv.writer(csvout)

        for row in tsvin:
            count = int(row[4])
            if count > 0:
                # write columns 3 and 4, repeated count times
                csvout.writerows([row[2:4] for _ in xrange(count)])

Or use itertools.repeat() from the itertools module to do the repetition:

    from itertools import repeat
    import csv

    with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        csvout = csv.writer(csvout)

        for row in tsvin:
            count = int(row[4])
            if count > 0:
                csvout.writerows(repeat(row[2:4], count))
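The snippets in this answer use Python 2 file modes ('rb' and 'wb'). As a rough sketch only, the same streaming approach could look like this on Python 3, where the csv module expects text-mode files opened with newline=''. The function name `tsv_to_repeated_csv` and its parameters are hypothetical, introduced here for illustration.

```python
import csv
from itertools import repeat


def tsv_to_repeated_csv(src_path, dst_path):
    """Stream a TSV file row by row, writing columns 3 and 4 column-5 times each."""
    # newline='' leaves line-ending handling to the csv module (Python 3)
    with open(src_path, newline='') as tsvin, \
         open(dst_path, 'w', newline='') as csvout:
        reader = csv.reader(tsvin, delimiter='\t')
        writer = csv.writer(csvout)
        for row in reader:
            count = int(row[4])  # assumes the fifth column is always an integer
            if count > 0:
                # repeat() yields the same two-column slice count times
                writer.writerows(repeat(row[2:4], count))
```

Calling `tsv_to_repeated_csv('sample.txt', 'new.csv')` would process the sample file from the question. Only one input row is held in memory at a time, and the output file is opened once.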
Dec 21