Concatenate pandas dataframe in file loop

I am trying to write a script that iterates over files matching a specific pattern, then concatenates the 8th column of each file while keeping the first 4 columns, which are common to all files. The script works if I use the following code:

    import csv
    import pandas as pd

    # Read the header names so I can use them as the index; all three files share the same header
    reader = csv.reader(open("1isoforms.fpkm_tracking.txt", 'rU'), delimiter='\t')
    header_row = reader.next()  # gets the header

    df1 = pd.read_csv("1isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #1, first 4 columns as index
    df2 = pd.read_csv("2isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #2, first 4 columns as index
    df3 = pd.read_csv("3isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t")  # file #3, first 4 columns as index

    # Concatenates the 8th column of the files and changes the header
    result = pd.concat([df1.ix[:,4], df2.ix[:,4], df3.ix[:,4]], keys=["Header1", "Header2", "Header3"], axis=1)
    result.to_csv("OutputTest.xls", sep="\t")

While this works, it is not practical for me to enter the file names one by one. Sometimes I have 100 files, so I cannot type out df1, df2, ... for each one. Instead, I tried to use a for loop to do this, but I could not figure it out. Here is what I have so far:

    k=0
    for geneFile in glob.glob("*_tracking*"):
        while k < 3:
            reader = csv.reader(open(geneFile, 'rU'), delimiter='\t')
            header_row = reader.next()
            key = str(k)
            key = pd.read_csv(geneFile, index_col=header_row[0:1], sep="\t")
            result = pd.concat([key[:,5]], axis=1)
    result.to_csv("test2.xls", sep="\t")

However, this does not work.

The problems I am facing are as follows:

  • How can I iterate over the input files and generate different variable names for each, which I can then use in the pd.concat function one by one?

  • How can I use a for loop to create a variable name that combines df and an integer (df1, df2, ...)?

  • How can I fix the above script to get my desired output?

  • A minor issue is how I use the index_col argument: is there a way to use column numbers rather than column names? I know this works for index_col=0 or any single number, but I could not use integers for more than one index column.

Note that all files have the same structure, and the index columns are the same.

Your feedback is highly appreciated.

1 answer

Consider using merge with the right_index and left_index arguments:

    import pandas as pd

    numberoffiles = 100

    # FIRST IMPORT (CREATE RESULT DATA FRAME)
    result = pd.read_csv("1isoforms.fpkm_tracking.txt", sep="\t",
                         index_col=[0,1,2,3], usecols=[0,1,2,3,7])

    # ALL OTHER IMPORTS (MERGE TO RESULT DATA FRAME, 8TH COLUMN SUFFIXED ITERATIVELY)
    for i in range(2,numberoffiles+1):
        df = pd.read_csv("{}isoforms.fpkm_tracking.txt".format(i), sep="\t",
                         index_col=[0,1,2,3], usecols=[0,1,2,3,7])
        result = pd.merge(result, df, right_index=True, left_index=True, suffixes=[i-1, i])

    result.to_excel("Output.xlsx")
    result.to_csv("Output.csv")
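
As an alternative to merging, here is a minimal sketch that stays closer to the pd.concat approach from the question. Instead of generating variable names such as df1, df2, ..., it collects one Series per file in a list and passes the whole list to pd.concat. The function name combine_column and its parameters are hypothetical, chosen only for illustration, and the sketch assumes every file is tab-separated with a header row and the same set of index rows.

```python
import os

import pandas as pd


def combine_column(paths, index_cols=(0, 1, 2, 3), value_col=7):
    """Pull one column out of each file and place the results side by side.

    Hypothetical helper: `index_cols` and `value_col` are zero-based column
    positions, so the defaults mean "first 4 columns as index, 8th column as value".
    """
    series_list = []
    names = []
    for path in paths:
        df = pd.read_csv(path, sep="\t")
        # Translate integer positions into the column names found in the header.
        index_names = [df.columns[i] for i in index_cols]
        value_name = df.columns[value_col]
        # One Series per file, indexed by the shared leading columns.
        series_list.append(df.set_index(index_names)[value_name])
        # Label the output column with the file name rather than a counter.
        names.append(os.path.basename(path))
    # `keys` become the column names when concatenating Series along axis=1.
    return pd.concat(series_list, axis=1, keys=names)
```

It could be called on the files from the question with something like combine_column(sorted(glob.glob("*_tracking*"))).to_csv("OutputTest.xls", sep="\t"), after importing glob. Naming the output columns after the files avoids depending on the order in which glob returns them.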

Source: https://habr.com/ru/post/1239564/

