How to read a 6 GB csv file using pandas

I am trying to read a large CSV file (approx. 6 GB) in pandas and I get the following memory error:

    MemoryError                               Traceback (most recent call last)
    <ipython-input-58-67a72687871b> in <module>()
    ----> 1 data=pd.read_csv('aphro.csv',sep=';')

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
        450                     infer_datetime_format=infer_datetime_format)
        451
    --> 452     return _read(filepath_or_buffer, kwds)
        453
        454     parser_f.__name__ = name

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
        242         return parser
        243
    --> 244     return parser.read()
        245
        246 _parser_defaults = {

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        693                 raise ValueError('skip_footer not supported for iteration')
        694
    --> 695         ret = self._engine.read(nrows)
        696
        697         if self.options.get('as_recarray'):

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
       1137
       1138         try:
    -> 1139             data = self._reader.read(nrows)
       1140         except StopIteration:
       1141             if nrows is None:

    C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader.read (pandas\parser.c:7145)()
    C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7369)()
    C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_rows (pandas\parser.c:8194)()
    C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_column_data (pandas\parser.c:9402)()
    C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_tokens (pandas\parser.c:10057)()
    C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10361)()
    C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser._try_int64 (pandas\parser.c:17806)()

    MemoryError:

Any help on this?

+153
python numpy pandas memory chunks csv
Sep 21 '14 at 17:46
14 answers

The error indicates that the machine does not have enough memory to read the entire CSV into a DataFrame at once. Assuming you do not need the whole dataset in memory at the same time, one way to avoid the problem is to process the CSV in chunks by specifying the chunksize parameter:

    chunksize = 10 ** 6
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows.)
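For example, here is a minimal sketch of the pattern (process above is a placeholder; the filter and the column name 'rf' below are hypothetical, just to show reducing each chunk before recombining):

    import pandas as pd

    chunksize = 10 ** 6
    reduced_chunks = []
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # keep only what you need from each chunk (hypothetical filter)
        reduced_chunks.append(chunk[chunk['rf'] > 0])

    # combine the reduced chunks once, after the loop
    df = pd.concat(reduced_chunks, ignore_index=True)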

+204
Sep 21 '14 at 17:54

Chunking does not always have to be the first port of call for this problem.

  1. Is the file large because of repeated non-numeric data or unwanted columns?

    If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the columns you need via the usecols parameter of pd.read_csv (see the sketch after this list).

  2. Does your workflow require slicing, manipulating and exporting?

    If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. The chunking is performed silently by dask, which also supports a subset of the pandas API.

  3. If all else fails, read line by line via chunks.

    Chunk via pandas or the csv library as a last resort.
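A minimal sketch of point 1 (the column names and dtypes are hypothetical placeholders, not from the question):

    import pandas as pd

    # read only the columns that are actually needed, and store repetitive
    # strings as categories to cut memory usage
    df = pd.read_csv('large.csv',
                     usecols=['id', 'station', 'value'],   # hypothetical column names
                     dtype={'station': 'category',          # repeated strings -> category
                            'value': 'float32'})            # downcast numerics where safe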

+57
Jan 23 '18 at 17:45

I proceeded like this:

    chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                           names=['lat', 'long', 'rf', 'date', 'slno'],
                           index_col='slno', header=None, parse_dates=['date'])
    df = pd.DataFrame()
    %time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
+32
Sep 24 '14 at 12:46

For big data, I recommend using the dask library, e.g.:

    # Dataframes implement the Pandas API
    import dask.dataframe as dd
    df = dd.read_csv('s3://.../2018-*-*.csv')
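One point the snippet does not show: dask builds a lazy task graph, so nothing is actually read until a result is requested. A minimal usage sketch (the column name is a hypothetical placeholder):

    # operations are lazy; .compute() runs them chunk-wise and returns a pandas object
    counts = df['some_column'].value_counts().compute()   # 'some_column' is a placeholder

    # or materialise everything as a pandas DataFrame (only if it fits in RAM)
    pdf = df.compute()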
+24
Apr 17 '18 at 11:21

The answers above already cover the topic. Anyway, if you need all the data in memory, take a look at bcolz. It compresses the data in memory. I have had a really good experience with it, but it is missing a number of pandas features.

Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. Important missing features were aggregates.
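For illustration only, a rough sketch of moving pandas data into a compressed in-memory bcolz ctable and back; it assumes the ctable.fromdataframe / todataframe API, so verify against the bcolz docs before relying on it:

    import bcolz
    import pandas as pd

    df = pd.read_csv('aphro.csv', sep=';', nrows=10 ** 6)   # start with a manageable slice

    # assumption: fromdataframe builds a compressed in-memory ctable from the frame
    ct = bcolz.ctable.fromdataframe(df, cparams=bcolz.cparams(clevel=9))

    # assumption: todataframe converts back to pandas when the data is needed again
    df2 = ct.todataframe()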

+9
Sep 23 '14 at 8:44

You can read data as chunks and save each chunk as a pickle.

    import pandas as pd
    import pickle

    in_path = ""          # Path where the large file is
    out_path = ""         # Path to save the pickle files to
    chunk_size = 400000   # size of chunks relies on your available memory
    separator = "~"

    reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size, low_memory=False)

    for i, chunk in enumerate(reader):
        out_file = out_path + "/data_{}.pkl".format(i + 1)
        with open(out_file, "wb") as f:
            pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)

In the next step, you read the pickles back in and append each one to the desired dataframe.

    import glob
    import pandas as pd

    pickle_path = ""   # Same path as out_path, i.e. where the pickle files are

    data_p_files = []
    for name in glob.glob(pickle_path + "/data_*.pkl"):
        data_p_files.append(name)

    df = pd.DataFrame([])
    for i in range(len(data_p_files)):
        df = df.append(pd.read_pickle(data_p_files[i]), ignore_index=True)
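A possibly faster variant of that second step, since appending to a DataFrame inside a loop copies the accumulated data on every iteration (a sketch assuming the same file layout as above):

    import glob
    import pandas as pd

    pickle_path = ""   # same path as out_path, i.e. where the pickle files are

    # build the frame in a single concat instead of repeated appends
    df = pd.concat((pd.read_pickle(name) for name in sorted(glob.glob(pickle_path + "/data_*.pkl"))),
                   ignore_index=True)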
+5
Oct 24 '18 at 8:40

Solution 1:

Using pandas with big data

Solution 2:

    TextFileReader = pd.read_csv(path, chunksize=1000)   # the number of rows per chunk

    dfList = []
    for df in TextFileReader:
        dfList.append(df)

    df = pd.concat(dfList, sort=False)
+4
Dec 05 '18 at 8:25

The read_csv and read_table functions are almost the same, but read_table defaults to a tab delimiter, so you must pass the ',' delimiter explicitly when you use read_table on a comma-separated file.

    import pandas as pd

    def get_from_action_data(fname, chunk_size=100000):
        reader = pd.read_csv(fname, header=0, iterator=True)
        chunks = []
        loop = True
        while loop:
            try:
                chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
                chunks.append(chunk)
            except StopIteration:
                loop = False
                print("Iteration is stopped")

        df_ac = pd.concat(chunks, ignore_index=True)
        return df_ac   # return added so the caller actually gets the DataFrame
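A minimal sketch of the delimiter point above (read_table defaults to a tab separator, so the comma has to be given explicitly for a CSV):

    import pandas as pd

    # both calls read the same comma-separated file
    df1 = pd.read_csv('data.csv')              # sep=',' is the default
    df2 = pd.read_table('data.csv', sep=',')   # default sep is '\t', so override it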
+3
Apr 26 '17 at 15:02

You can try sframe, which has the same syntax as pandas, but allows you to manipulate files that are larger than your RAM.
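As a rough sketch only, assuming the SFrame API of the standalone sframe package / GraphLab Create (read_csv, logical filtering and to_dataframe); verify the exact names against the sframe docs:

    import sframe

    # SFrame keeps the data on disk, so the file does not have to fit in RAM
    sf = sframe.SFrame.read_csv('aphro.csv', delimiter=';')

    # assumption: a filtered subset can be converted back to pandas once it is small enough
    small_df = sf[sf['rf'] > 0].to_dataframe()   # 'rf' is a hypothetical column name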

+2
Jan 07 '17 at 13:22

If you use pandas, read the large file in chunks and then yield it row by row. Here is what I did:

    import pandas as pd

    def chunck_generator(filename, header=False, chunk_size=10 ** 5):
        for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                                 chunksize=chunk_size, parse_dates=[1]):
            yield chunk

    def _generator(filename, header=False, chunk_size=10 ** 5):
        chunk = chunck_generator(filename, header=False, chunk_size=10 ** 5)
        for row in chunk:
            yield row

    if __name__ == "__main__":
        filename = r'file.csv'
        generator = _generator(filename=filename)
        while True:
            print(next(generator))   # raises StopIteration once the file is exhausted
+1
Nov 13 '17 at 5:34

Here is an example:

    chunkTemp = []
    queryTemp = []
    query = pd.DataFrame()

    for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

        # REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
        chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

        # YOU CAN EITHER:
        # 1) BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET
        chunkTemp.append(chunk)

        # 2) DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
        query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]

        # BUFFERING PROCESSED DATA
        queryTemp.append(query)

    # ! NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
    print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
    chunk = pd.concat(chunkTemp)
    print("Database: LOADED")

    # CONCATENATING PROCESSED DATA
    query = pd.concat(queryTemp)
    print(query)
+1
May 27 '19 at 6:12

In addition to the answers above, for those who want to process a CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files, and it handles data schema changes (added/removed columns). Chunked processing support is already built in.

    import glob
    import d6tstack.combine_csv

    def apply(dfg):
        # do stuff
        return dfg

    c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)

    # or
    c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

    # output to various formats, automatically chunked to reduce memory consumption
    c.to_csv_combine(filename='out.csv')
    c.to_parquet_combine(filename='out.pq')
    c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')    # fast for postgres
    c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')  # fast for mysql
    c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')     # slow but flexible
0
Oct. 14 '18 at 22:44

In case someone is still looking for something like this, I discovered that this new library called modin might help. It uses distributed computing that can help with reading. Here's a good article comparing its functionality with pandas. It essentially uses the same functions as pandas.

    import modin.pandas as pd
    pd.read_csv(CSV_FILE_NAME)
0
Apr 11 '19 at 4:57

I also have this memory issue when reading CSV data. The file has approx. 10,000,000 rows and 5 columns (about 0.5 GB). I tried several variations of the code that can be found on the Internet, but I always get a memory error. The error always occurs when the process is at approx. 932 MB of memory use; it never goes over 1 GB, even though total system memory usage is at 70% or less. If I manually reduce the file to 1,019,000 lines, there is no problem. But I need the whole file in order to parse it.

Here is my latest implementation of reading a CSV file:

    fileName = r'data.csv'
    chunks = pd.read_csv(fileName, chunksize=1000)
    dsf = pd.concat(chunks)

... and the error I get:

    Traceback (most recent call last):
      File "C:\Python\lib\site-packages\flask\app.py", line 2311, in wsgi_app
        response = self.full_dispatch_request()
      File "C:\Python\lib\site-packages\flask\app.py", line 1834, in full_dispatch_request
        rv = self.handle_user_exception(e)
      File "C:\Python\lib\site-packages\flask\app.py", line 1737, in handle_user_exception
        reraise(exc_type, exc_value, tb)
      File "C:\Python\lib\site-packages\flask\_compat.py", line 36, in reraise
        raise value
      File "C:\Python\lib\site-packages\flask\app.py", line 1832, in full_dispatch_request
        rv = self.dispatch_request()
      File "C:\Python\lib\site-packages\flask\app.py", line 1818, in dispatch_request
        return self.view_functions[rule.endpoint](**req.view_args)
      File "FirstDashboard.py", line 149, in change_features
        graphJSON= create_plot(feature)
      File "FirstDashboard.py", line 113, in create_plot
        name = 'ch4'
      File "C:\Python\lib\site-packages\plotly\graph_objs\__init__.py", line 37501, in __init__
        self['x'] = x if x is not None else _v
      File "C:\Python\lib\site-packages\plotly\basedatatypes.py", line 3166, in __setitem__
        self._set_prop(prop, value)
      File "C:\Python\lib\site-packages\plotly\basedatatypes.py", line 3402, in _set_prop
        val = validator.validate_coerce(val)
      File "C:\Python\lib\site-packages\_plotly_utils\basevalidators.py", line 372, in validate_coerce
        v = copy_to_readonly_numpy_array(v)
      File "C:\Python\lib\site-packages\_plotly_utils\basevalidators.py", line 127, in copy_to_readonly_numpy_array
        new_v = np.ascontiguousarray(v.copy())
    MemoryError

Are there any restrictions in Windows or Python (I am using the 64-bit version)? It is true that my computer is not very powerful, but Octave has no problem loading such data (it may take 3 minutes).

What could be the solution to this problem?

0
Jun 19 '19 at 11:56


