Reading a huge .csv file

I am currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (file sizes range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but as soon as I go above that, I get memory errors. My code looks like this:

    def getdata(filename, criteria):
        data = []
        for criterion in criteria:
            data.append(getstuff(filename, criterion))
        return data

    def getstuff(filename, criterion):
        import csv
        data = []
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            for row in datareader:
                if row[3] == "column header":
                    data.append(row)
                elif len(data) < 2 and row[3] != criterion:
                    pass
                elif row[3] == criterion:
                    data.append(row)
                else:
                    return data

The reason for the else clause in getstuff is that all the rows matching the criterion are listed together in the csv file, so I leave the loop once I have passed them, to save time.

My questions:

  1. How can I get this to work with large files?

  2. Is there any way to make this faster?

My computer has 8 GB of RAM, runs on 64-bit Windows 7, and the processor has a frequency of 3.40 GHz (I don’t know what information you need).

+86
python file csv
Jul 03 '13 at 9:44
8 answers

You are reading all the rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

    import csv

    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            yield next(datareader)  # yield the header row
            count = 0
            for row in datareader:
                if row[3] == criterion:
                    yield row
                    count += 1
                elif count:
                    # done when having read a consecutive series of matching rows
                    return

I also simplified your filter test; the logic is the same, but more concise.

Since you are only matching a single consecutive run of rows for each criterion, you could also use:

    import csv
    from itertools import dropwhile, takewhile

    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            yield next(datareader)  # yield the header row
            # first matching row, plus any subsequent rows that match,
            # then stop reading altogether.
            # Python 2: use 'for row in takewhile(...): yield row'
            # instead of 'yield from takewhile(...)'.
            yield from takewhile(
                lambda r: r[3] == criterion,
                dropwhile(lambda r: r[3] != criterion, datareader))
            return

Now you can loop over getstuff() directly. Do the same in getdata():

    def getdata(filename, criteria):
        for criterion in criteria:
            for row in getstuff(filename, criterion):
                yield row

Now loop directly over getdata() in your code:

    for row in getdata(somefilename, sequence_of_criteria):
        # process row

You now hold only one row in memory, instead of thousands of rows per criterion.

yield makes a function a generator function, which means it will not do any work until you start iterating over it.
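For example (a minimal sketch, not part of the answer above; the function and printed text are made up for illustration): calling a generator function only builds a generator object, and its body does not run until the first next():

    def lazy_rows():
        print("opening the file now")  # runs only when iteration starts
        for i in range(3):
            yield i

    gen = lazy_rows()   # nothing printed yet, no work done
    first = next(gen)   # now "opening the file now" is printed and 0 is returned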

+134
Jul 03 '13 at 9:50

Martijn's answer is probably the best, but here is a more intuitive way to handle large csv files, aimed at beginners. It lets you process groups of rows, or chunks, at a time.

    import pandas as pd

    chunksize = 10 ** 8
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        process(chunk)
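Note that chunksize counts rows per chunk, not bytes, and process() above is a placeholder. As a rough sketch of what it could do for this question (a hedged example; the file name, column position, and criterion value are assumptions, not from the answer):

    import pandas as pd

    criterion = "some_value"   # placeholder criterion, as in the question
    chunksize = 10 ** 5        # 100,000 rows per chunk

    matches = []
    for chunk in pd.read_csv("huge_file.csv", chunksize=chunksize):
        # keep only rows whose 4th column equals the criterion,
        # so only the matches accumulate in memory
        matches.append(chunk[chunk.iloc[:, 3] == criterion])

    result = pd.concat(matches, ignore_index=True)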
+27
Apr 7 '17 at 19:51

I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(), and genfromtxt() to be 3 times faster than numpy.loadtxt(). It seems you want pandas for large data sets.

I posted the code and datasets I used in this testing on a blog discussing MATLAB and Python for vibration analysis.
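A comparison along those lines could be sketched like this (not the blog's benchmark code; the file name is a placeholder and the sketch assumes a numeric csv with a header row):

    import timeit

    import numpy as np
    import pandas as pd

    fname = "big_numeric_data.csv"  # placeholder

    for label, load in [
        ("pandas.read_csv", lambda: pd.read_csv(fname)),
        ("numpy.genfromtxt", lambda: np.genfromtxt(fname, delimiter=",", skip_header=1)),
        ("numpy.loadtxt", lambda: np.loadtxt(fname, delimiter=",", skiprows=1)),
    ]:
        seconds = timeit.timeit(load, number=3)
        print("%-17s %.2f s" % (label, seconds))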

+12
Aug 23 '16 at 0:48

What worked for me, and is super fast, is the following:

    import pandas as pd
    import dask.dataframe as dd
    import time

    t = time.clock()
    df_train = dd.read_csv('../data/train.csv', usecols=[col1, col2])
    df_train = df_train.compute()
    print("load train: ", time.clock() - t)

Another working solution:

    import pandas as pd
    from tqdm import tqdm

    PATH = '../data/train.csv'
    chunksize = 500000
    traintypes = {'col1': 'category', 'col2': 'str'}
    cols = list(traintypes.keys())

    df_list = []  # list to hold the batch dataframes

    for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
        # Can process each chunk of the dataframe here,
        # e.g. clean_data(), feature_engineer(), fit()

        # Alternatively, append the chunk to a list and merge them all afterwards
        df_list.append(df_chunk)

    # Merge all dataframes into one dataframe
    X = pd.concat(df_list)

    # Delete the dataframe list to release memory
    del df_list
    del df_chunk
+4
May 31 '18 at 12:42

Here is another solution for Python 3:

    import csv

    with open(filename, "r") as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        for row in datareader:
            if row[3] in ("column header", criterion):
                doSomething(row)
                count += 1
            elif count > 2:
                break

Here datareader is a lazy iterator, so rows are only read as they are consumed.

+1
Jul 09 '18 at 13:30

For those who land on this question: using pandas with "chunksize" and "usecols" helped me read a huge zipped file faster than the other suggested options.

    import pandas as pd

    sample_cols_to_keep = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']

    # First set up the dataframe iterator; 'usecols' filters the columns and
    # 'chunksize' sets the number of rows per chunk of the csv
    # (you can change these parameters as you wish).
    df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip',
                          chunksize=20000, usecols=sample_cols_to_keep)

    # this list will store the filtered dataframes for later concatenation
    df_lst = []

    # Iterate over the file based on the criteria and append to the list
    for df_ in df_iter:
        tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})
                  # filter e.g. rows where the 'col_1' value is greater than zero
                  .pipe(lambda x: x[x.col_1 > 0]))
        df_lst += [tmp_df.copy()]

    # And finally combine the filtered df_lst into the final, larger output, say the 'df_final' dataframe
    df_final = pd.concat(df_lst)
+1
Jun 01 '19 at 16:43

I recently tried to solve the same problem and found that the Python pandas package handles it quite efficiently.

You can check it out here: http://pandas.pydata.org/

Pandas is a high performance data analysis library for big data.

0
Aug 16 '15 at 0:49

Use pytables. "PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data."
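As a hedged sketch of how that could apply here (using pandas' HDFStore, which sits on top of PyTables; the file, key, and column names are placeholders): convert the csv to an HDF5 table once, in chunks, then query rows by criterion without loading the whole file.

    import pandas as pd

    # One-time conversion: stream the csv into an HDF5 table in chunks
    # (requires the tables/PyTables package to be installed).
    with pd.HDFStore("data.h5", mode="w") as store:
        for chunk in pd.read_csv("huge_file.csv", chunksize=100000):
            store.append("rows", chunk, data_columns=["col_4"], index=False)

    # Later: select only the rows matching a criterion; the filter runs on disk.
    with pd.HDFStore("data.h5", mode="r") as store:
        matches = store.select("rows", where="col_4 == 'some_criterion'")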

-1
Jul 03 '13 at 9:54


