Combining multiple csv files into one csv with the same header - Python

I am currently using the code below to import 6,000 csv files (with headers) and export them to a single csv file (with one header line).

    import glob
    import pandas as pd

    # Import csv files from folder
    path = r'data/US/market/merged_data'
    allFiles = glob.glob(path + "/*.csv")
    stockstats_data = pd.DataFrame()
    list_ = []
    for file_ in allFiles:
        df = pd.read_csv(file_, index_col=None)
        list_.append(df)
        stockstats_data = pd.concat(list_)
        print(file_ + " has been imported.")

This code works fine, but it is slow: it can take up to 2 days to finish.

I was provided with one line of script for a terminal command line that does the same (but without headers). This script takes 20 seconds.

  for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the Python script? To save time, I thought about skipping the DataFrame step and just concatenating the CSV files directly, but I can't figure out how.

Thanks.

+11
4 answers

If you don't need the CSV data in memory, and are just copying from input files to an output file, it is much cheaper to avoid parsing altogether and copy without accumulating anything in memory:

    import glob
    import shutil

    # Import csv files from folder
    path = r'data/US/market/merged_data'
    allFiles = glob.glob(path + "/*.csv")

    with open('someoutputfile.csv', 'wb') as outfile:
        for i, fname in enumerate(allFiles):
            with open(fname, 'rb') as infile:
                if i != 0:
                    infile.readline()  # Throw away header on all but first file
                # Block copy rest of file from input to output without parsing
                shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

shutil.copyfileobj copies the data in blocks, which eliminates almost all of the Python-level work of parsing each row and re-serializing it on output.

This assumes that all the CSV files have the same format, encoding, line endings, etc., and that the header contains no embedded newline characters, but under those conditions it is much faster than the alternatives.
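A minimal end-to-end sketch of the header-skipping binary merge above, using a temporary directory and toy file names (both are illustrative, not from the answer) so it can be run standalone:

```python
import glob
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()

# Create three small CSV files with identical headers.
for i in range(3):
    with open(os.path.join(tmpdir, f"part{i}.csv"), "w") as f:
        f.write("symbol,price\n")
        f.write(f"AAA,{i}\nBBB,{i + 10}\n")

merged = os.path.join(tmpdir, "merged.csv")
all_files = sorted(glob.glob(os.path.join(tmpdir, "part*.csv")))

with open(merged, "wb") as outfile:
    for i, fname in enumerate(all_files):
        with open(fname, "rb") as infile:
            if i != 0:
                infile.readline()  # Skip header on all but the first file
            # Block copy the rest of the file without parsing it
            shutil.copyfileobj(infile, outfile)

with open(merged) as f:
    lines = f.read().splitlines()

print(len(lines))  # 7: one header line plus 2 data rows from each of 3 files
print(lines[0])    # symbol,price
```

The merged file keeps exactly one copy of the header, which is easy to verify by counting its occurrences in the output.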

+12

Do you need to do this in Python at all? If you're open to doing it entirely in the shell, all you need to do is first cat the header line from a randomly selected input .csv file into merged.csv before running your one-liner:

    head -n1 a-randomly-selected-csv-file.csv > merged.csv
    for f in *.csv; do tail -n +2 "`pwd`/$f" >> merged.csv; done
+6

You don't need pandas for this; the plain csv module works fine.

    import csv
    import glob

    allFiles = glob.glob(r'data/US/market/merged_data' + "/*.csv")
    df_out_filename = 'df_out.csv'
    write_headers = True
    with open(df_out_filename, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for filename in allFiles:
            with open(filename, newline='') as fin:
                reader = csv.reader(fin)
                headers = next(reader)
                if write_headers:
                    write_headers = False  # Only write headers once.
                    writer.writerow(headers)
                writer.writerows(reader)  # Write all remaining rows.
+1

Here's a simpler approach using pandas (although I'm not sure how it will help with RAM usage):

    import glob
    import pandas as pd

    path = r'data/US/market/merged_data'
    allFiles = glob.glob(path + "/*.csv")
    stockstats_data = pd.DataFrame()
    for file_ in allFiles:
        df = pd.read_csv(file_)
        stockstats_data = pd.concat((df, stockstats_data), axis=0)
0

Source: https://habr.com/ru/post/1269272/
