Using pandas in Python to combine CSV files into one

I have n files in a directory that I need to merge into one. They all have the same number of columns. For example, the contents of test1.csv:

test1,test1,test1
test1,test1,test1
test1,test1,test1

Similarly, the contents of test2.csv :

test2,test2,test2
test2,test2,test2
test2,test2,test2

I want final.csv to look like this:

test1,test1,test1
test1,test1,test1
test1,test1,test1
test2,test2,test2
test2,test2,test2
test2,test2,test2

But instead it turns out like this:

test file 1,test file 1.1,test file 1.2,test file 2,test file 2.1,test file 2.2
,,,test file 2,test file 2,test file 2
,,,test file 2,test file 2,test file 2
test file 1,test file 1,test file 1,,,
test file 1,test file 1,test file 1,,,

Can someone help me figure out what's going on here? I pasted my code below:

import csv
import glob
import pandas as pd
import numpy as np

all_data = pd.DataFrame()  # initializes DF which will hold aggregated csv files
for f in glob.glob("*.csv"):  # for all csv files in pwd
    df = pd.read_csv(f)  # create dataframe for reading current csv
    all_data = all_data.append(df)  # appends current csv to final DF
all_data.to_csv("final.csv", index=None)
3 answers

I think there are several problems:

  • I removed import csv and import numpy as np because they are not used in this demo (though they may be needed elsewhere in your full script).
  • I created a list dfs and appended each DataFrame to it with dfs.append(df), then used pd.concat to concatenate the list into the final DataFrame.
  • I added the header=None parameter to read_csv, because the main problem was that read_csv was reading the first row of each file as a header.
  • I added the header=None parameter to to_csv to omit the header row when writing.
  • I wrote the output to a test subfolder, because if you keep glob.glob("*.csv") and write final.csv into the same directory, the next run will read the output file as an input file.

Solution:

import glob
import pandas as pd

# list of all DataFrames
dfs = []
for f in glob.glob("*.csv"):  # for all csv files in pwd
    # header=None so the first row is read as data, not as a header
    df = pd.read_csv(f, header=None)  # create dataframe for the current csv
    dfs.append(df)  # append the current csv's DataFrame to the list

# concatenate the list into the final DataFrame
all_data = pd.concat(dfs, ignore_index=True)
print(all_data)
#        0      1      2
# 0  test1  test1  test1
# 1  test1  test1  test1
# 2  test1  test1  test1
# 3  test2  test2  test2
# 4  test2  test2  test2
# 5  test2  test2  test2

# header=None here omits the header row in the output file
all_data.to_csv("test/final.csv", index=None, header=None)

The following solution is similar: I add the header=None parameter to read_csv and to_csv, and the ignore_index=True parameter to append.

import glob
import pandas as pd

all_data = pd.DataFrame()  # initializes DF which will hold aggregated csv files
for f in glob.glob("*.csv"):  # for all csv files in pwd
    df = pd.read_csv(f, header=None)  # create dataframe for the current csv
    # note: DataFrame.append was removed in pandas 2.0; prefer pd.concat as above
    all_data = all_data.append(df, ignore_index=True)  # appends current csv to final DF

print(all_data)
#        0      1      2
# 0  test1  test1  test1
# 1  test1  test1  test1
# 2  test1  test1  test1
# 3  test2  test2  test2
# 4  test2  test2  test2
# 5  test2  test2  test2

all_data.to_csv("test/final.csv", index=None, header=None)

You can use concat. Let df1 be your first DataFrame and df2 the second; then:

df = pd.concat([df1, df2], ignore_index=True)

ignore_index is optional; set it to True if you don't need to preserve the original indices of the individual DataFrames.
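For example, a minimal sketch applied to the two files from the question (assuming test1.csv and test2.csv have no header row, as in the example data):

import pandas as pd

# read both files; header=None keeps the first row as data
df1 = pd.read_csv("test1.csv", header=None)
df2 = pd.read_csv("test2.csv", header=None)

# stack them vertically and renumber the rows
final = pd.concat([df1, df2], ignore_index=True)
final.to_csv("final.csv", index=False, header=False)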


pandas is not the right tool when all you need to do is create one csv file; you can simply write each csv into the output file as you go:

import glob

with open("out.csv", "w") as out:
    for fle in glob.glob("*.csv"):
        with open(fle) as f:
            out.writelines(f)

Or, using the csv lib if you prefer:

import glob
import csv

# newline="" avoids extra blank lines from csv.writer on Windows
with open("out.csv", "w", newline="") as out:
    wr = csv.writer(out)
    for fle in glob.glob("*.csv"):
        with open(fle) as f:
            wr.writerows(csv.reader(f))

Building a large DataFrame only to write it straight back to disk does not make much sense, and if you had many large files it might not even fit in memory.
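One caveat, echoing the earlier point about glob.glob("*.csv") picking up the output file: if out.csv sits in the same directory as the inputs, a later run will read the output file back in as an input. A small variation (my own sketch, not part of the original answer) that skips the output file explicitly:

import glob
import os

out_path = "out.csv"
with open(out_path, "w") as out:
    for fle in glob.glob("*.csv"):
        # skip the output file itself if it lives in the input directory
        if os.path.abspath(fle) == os.path.abspath(out_path):
            continue
        with open(fle) as f:
            out.writelines(f)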


Source: https://habr.com/ru/post/1237993/

