My source data is in a TSV file, 6 columns and more than 2 million rows.
Here is what I am trying to accomplish:
- I need to read data in 3 of the columns (3, 4, 5) in this source file
- The fifth column is an integer. I need to use this integer value to duplicate a row entry using data in the third and fourth columns (by the number of integer times).
- I want to write output # 2 to an output file in CSV format.
Below I came up with.
My question is: is this an effective way to do this? It looks like this could be intense when trying on 2 million lines.
First, I created a separate sample tab file for working with it and named it “sample.txt”. It is basic and has only four lines:
Row1_Column1 Row1-Column2 Row1-Column3 Row1-Column4 2 Row1-Column6 Row2_Column1 Row2-Column2 Row2-Column3 Row2-Column4 3 Row2-Column6 Row3_Column1 Row3-Column2 Row3-Column3 Row3-Column4 1 Row3-Column6 Row4_Column1 Row4-Column2 Row4-Column3 Row4-Column4 2 Row4-Column6
then I have this code:
import csv with open('sample.txt','r') as tsv: AoA = [line.strip().split('\t') for line in tsv] for a in AoA: count = int(a[4]) while count > 0: with open('sample_new.csv','ab') as csvfile: csvwriter = csv.writer(csvfile, delimiter=',') csvwriter.writerow([a[2], a[3]]) count = count - 1
CJH Dec 21 '12 at 15:38 2012-12-21 15:38
source share