Combining multiple csv files into one csv with the same header - Python

I am currently using the code below to import 6,000 csv files (with headers) and export them to a single csv file (with one header line).

    import glob
    import pandas as pd

    # Import csv files from folder
    path = r'data/US/market/merged_data'
    allFiles = glob.glob(path + "/*.csv")
    stockstats_data = pd.DataFrame()
    list_ = []
    for file_ in allFiles:
        df = pd.read_csv(file_, index_col=None)
        list_.append(df)
        stockstats_data = pd.concat(list_)
        print(file_ + " has been imported.")

This code works fine, but it is slow: it can take up to 2 days to finish.

I was provided with one line of script for a terminal command line that does the same (but without headers). This script takes 20 seconds.

  for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the Python script? To save time, I thought about skipping the DataFrame step and just concatenating the CSV files directly, but I can't figure out how.

Thanks.

+11
4 answers

If you don't need the CSV data in memory, and are just copying from input files to an output file, it is much cheaper to avoid parsing altogether and copy without accumulating anything in memory:

    import glob
    import shutil

    # Import csv files from folder
    path = r'data/US/market/merged_data'
    allFiles = glob.glob(path + "/*.csv")

    with open('someoutputfile.csv', 'wb') as outfile:
        for i, fname in enumerate(allFiles):
            with open(fname, 'rb') as infile:
                if i != 0:
                    infile.readline()  # Throw away header on all but first file
                # Block copy rest of file from input to output without parsing
                shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

shutil.copyfileobj copies the data in blocks, which eliminates almost all of the Python-level work of parsing each row and re-serializing it on output.

This assumes that all the CSV files have the same format, encoding, line endings, etc., and that the header contains no embedded newline characters, but under those conditions it is much faster than the alternatives.
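A minimal end-to-end sketch of the header-skipping binary merge above, using a temporary directory and toy file names (both are illustrative, not from the answer) so it can be run standalone:

```python
import glob
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()

# Create three small CSV files with identical headers.
for i in range(3):
    with open(os.path.join(tmpdir, f"part{i}.csv"), "w") as f:
        f.write("symbol,price\n")
        f.write(f"AAA,{i}\nBBB,{i + 10}\n")

merged = os.path.join(tmpdir, "merged.csv")
all_files = sorted(glob.glob(os.path.join(tmpdir, "part*.csv")))

with open(merged, "wb") as outfile:
    for i, fname in enumerate(all_files):
        with open(fname, "rb") as infile:
            if i != 0:
                infile.readline()  # Skip header on all but the first file
            # Block copy the rest of the file without parsing it
            shutil.copyfileobj(infile, outfile)

with open(merged) as f:
    lines = f.read().splitlines()

print(len(lines))  # 7: one header line plus 2 data rows from each of 3 files
print(lines[0])    # symbol,price
```

The merged file keeps exactly one copy of the header, which is easy to verify by counting its occurrences in the output.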

+12

Do you need to do this in Python at all? If you're open to doing it entirely in the shell, all you need to do is first cat the header line from a randomly selected input .csv file into merged.csv before running your one-liner:

    head -n1 a-randomly-selected-csv-file.csv > merged.csv
    for f in *.csv; do tail -n +2 "`pwd`/$f" >> merged.csv; done
+6

You don't need pandas for this; the plain csv module works fine.

    import csv
    import glob

    allFiles = glob.glob(r'data/US/market/merged_data' + "/*.csv")
    df_out_filename = 'df_out.csv'
    write_headers = True
    with open(df_out_filename, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for filename in allFiles:
            with open(filename, newline='') as fin:
                reader = csv.reader(fin)
                headers = next(reader)
                if write_headers:
                    write_headers = False  # Only write headers once.
                    writer.writerow(headers)
                writer.writerows(reader)  # Write all remaining rows.
+1

Here's a simpler approach using pandas (although I'm not sure how it will help with RAM usage):

    import glob
    import pandas as pd

    path = r'data/US/market/merged_data'
    allFiles = glob.glob(path + "/*.csv")
    stockstats_data = pd.DataFrame()
    for file_ in allFiles:
        df = pd.read_csv(file_)
        stockstats_data = pd.concat((df, stockstats_data), axis=0)
0

Source: https://habr.com/ru/post/1269272/
