I am trying to write a script that iterates over all files matching a specific pattern and concatenates the 8th column of each file, keeping the first 4 columns, which are common to all files, as the index. The script works if I read each file explicitly:
    import csv
    import pandas as pd

    # Read the header names so I can use them as the index; all three files have the same header
    reader = csv.reader(open("1isoforms.fpkm_tracking.txt", 'rU'), delimiter='\t')
    header_row = reader.next()  # gets the header

    df1 = pd.read_csv("1isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #1, indexed by the first 4 columns
    df2 = pd.read_csv("2isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #2, indexed by the first 4 columns
    df3 = pd.read_csv("3isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #3, indexed by the first 4 columns

    # Concatenates the 8th column of the files and changes the header
    result = pd.concat([df1.ix[:,4], df2.ix[:,4], df3.ix[:,4]], keys=["Header1", "Header2", "Header3"], axis=1)
    result.to_csv("OutputTest.xls", sep="\t")
While this works, it is not practical for me to enter the file names one by one: I sometimes have 100 files, so I cannot type out a separate df... line for each of them. I tried to use a for loop instead, but I could not figure it out. Here is what I have so far:
    k=0
    for geneFile in glob.glob("*_tracking*"):
        while k < 3:
            reader = csv.reader(open(geneFile, 'rU'), delimiter='\t')
            header_row = reader.next()
            key = str(k)
            key = pd.read_csv(geneFile, index_col=header_row[0:1], sep="\t")
            result = pd.concat([key[:,5]], axis=1)
            result.to_csv("test2.xls", sep="\t")
However, this does not work.
The problems I am facing are as follows:
How can I iterate over the input files and generate different variable names for each, which I can then use in the pd.concat function one by one?
How can I use a for loop to create a variable name that combines df with an integer (df1, df2, and so on)?
How can I fix the script above to get the desired output?
A minor issue is how I use the index_col argument: is there a way to use column numbers rather than column names? I know that index_col=0, or any other single number, works, but I could not use integers for more than one index column.
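On the index_col point, a minimal sketch of what should work, assuming a reasonably recent pandas: read_csv accepts a list of integer positions for index_col, so the header does not have to be read separately. The in-memory sample and its column names (id_a, id_b, id_c, id_d, length, value) are made up for illustration and are not taken from the real files.

```python
import io

import pandas as pd

# Hypothetical stand-in for one tab-separated tracking file (made-up names and values).
sample = io.StringIO(
    "id_a\tid_b\tid_c\tid_d\tlength\tvalue\n"
    "T1\tx\ty\tG1\t100\t1.5\n"
    "T2\tx\ty\tG2\t200\t2.5\n"
)

# A list of integer positions selects several index columns at once.
df = pd.read_csv(sample, sep="\t", index_col=[0, 1, 2, 3])

print(list(df.index.names))  # names of the four index columns
print(list(df.columns))      # the remaining data columns
```

The same list could be written as list(range(4)) when the number of index columns is kept in a variable.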
Note that all files have the same structure, and the index columns are the same.
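Because the files share the same structure, one possible approach is to skip the numbered variable names entirely and collect one column per file in a list, then pass that list to pd.concat with the file names as keys. The following is only a sketch under assumptions: combine_column, n_index and value_pos are hypothetical names, the demo files are generated on the fly with made-up columns, and .iloc is used for positional selection in place of .ix.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_column(pattern, n_index=4, value_pos=4):
    """Concatenate one column from every file matching `pattern` (hypothetical helper).

    n_index   -- number of leading columns used as the shared index
    value_pos -- position of the wanted column among the remaining columns
    """
    series, keys = [], []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, sep="\t", index_col=list(range(n_index)))
        series.append(df.iloc[:, value_pos])   # a list replaces df1, df2, ...
        keys.append(os.path.basename(path))    # file name becomes the new header
    return pd.concat(series, axis=1, keys=keys)


# Demo on three small generated files (made-up columns and values).
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2, 3):
        name = os.path.join(tmp, "%disoforms.fpkm_tracking.txt" % i)
        with open(name, "w") as fh:
            fh.write("id_a\tid_b\tid_c\tid_d\tlength\tvalue\n")
            fh.write("T1\tx\ty\tG1\t100\t%d.5\n" % i)
            fh.write("T2\tx\ty\tG2\t200\t%d.0\n" % (i * 10))
    # The demo files have only two data columns, so the wanted one is at position 1.
    result = combine_column(os.path.join(tmp, "*_tracking*"), value_pos=1)

print(result)
```

With the real files, the default value_pos=4 would correspond to the df1.ix[:,4] selection, and result.to_csv("OutputTest.xls", sep="\t") would write the combined table. Note that sorted() orders file names lexicographically, so a name starting with 10 sorts before one starting with 2.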
Your feedback is highly appreciated.