Sequentially reading a huge CSV file in Python

I have a 10 GB CSV file containing some information that I need to use.

Since I have limited memory on my PC, I cannot read the entire file into memory in one batch. Instead, I would like to iteratively read only a few lines of the file at a time.

Let's say that in the first iteration I want to read the first 100 lines, in the second lines 101 to 200, and so on.

Is there an efficient way to accomplish this task in Python? Does Pandas provide something useful for this? Or are there better methods in terms of memory and speed?

+11
4 answers

Here is a short answer.

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
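If you also want to keep what each chunk produces, one common pattern is to filter every chunk and concatenate the small results at the end. A minimal sketch (the column name and filter condition here are made up for illustration):

import pandas as pd

chunksize = 10 ** 6
filtered_parts = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # keep only the rows we care about; 'value' and the threshold are hypothetical
    filtered_parts.append(chunk[chunk['value'] > 0])

# the filtered pieces are small enough to combine into one in-memory DataFrame
result = pd.concat(filtered_parts, ignore_index=True)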

Here is a very long answer.

To get started, you need to import pandas and sqlalchemy. The commands are below.

import pandas as pd
from sqlalchemy import create_engine

Then set a variable pointing to your CSV file. This is not strictly necessary, but it helps with reuse.

 file = '/path/to/csv/file' 

With these three lines of code, we are ready to begin analyzing our data. Let's look at the head of the CSV file to see what the content looks like.

print(pd.read_csv(file, nrows=5))

This command uses the pandas read_csv function to read only 5 rows (nrows=5) and then prints those rows to the screen. This lets you understand the structure of the CSV file and make sure that the data is formatted in a way that makes sense for your work.
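You can also use that small preview to check the column types pandas guessed and pass explicit, memory-saving dtypes to the full chunked read. A minimal sketch (the column names and dtype choices below are only an example):

preview = pd.read_csv(file, nrows=5)
print(preview.dtypes)  # what pandas inferred for each column

# hypothetical dtype choices to shrink memory use during the chunked read
# chunks = pd.read_csv(file, dtype={'id': 'int32', 'category': 'category'}, chunksize=100000)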

Before we can really work with the data, we need to put it somewhere that lets us filter it and work with subsets. I usually use a pandas DataFrame, but with large data files we need to store the data somewhere else. In this case, a local SQLite database works well: read the CSV file in chunks and then write those chunks to SQLite.

To do this, you first need to create the sqlite database using the following command.

 csv_database = create_engine('sqlite:///csv_database.db') 

Next, we need to iterate through the CSV file in chunks and save the data to SQLite.

chunksize = 100000
i = 0
j = 1
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1

With this code, we set chunksize to 100,000 to keep the chunks a manageable size, initialize a pair of counters (i = 0, j = 1), and then run a for loop. The loop reads a chunk of data from the CSV file, removes the spaces from the column names, and then saves the chunk to the SQLite database (df.to_sql(...)).

This may take a while if your CSV file is large, but the time spent is worth it, because now you can use pandas' SQL tools to pull data from the database without worrying about memory limitations.

To access the data now, you can run commands such as:

 df = pd.read_sql_query('SELECT * FROM table', csv_database) 

Of course, using SELECT * ... will load all the data into memory, which is exactly the problem we are trying to avoid, so you should add filters to your SELECT statements to restrict the data. For example:

 df = pd.read_sql_query('SELECT COL1, COL2 FROM table WHERE COL1 = SOMEVALUE', csv_database) 
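Note that read_sql_query also accepts a chunksize argument, so even the query result can be streamed in pieces instead of being loaded at once. A rough sketch (process() stands in for whatever you do with each piece):

for chunk in pd.read_sql_query('SELECT COL1, COL2 FROM table', csv_database, chunksize=100000):
    # each chunk is a DataFrame of at most 100,000 rows
    process(chunk)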
+9

You can use pandas.read_csv() with the chunksize parameter:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # each chunk_df contains a part of the whole CSV
    ...
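Each iteration hands you exactly the window described in the question, so a sketch of "first 100 rows, then 101 to 200, ..." could look like this (the file name is a placeholder):

for i, chunk_df in enumerate(pd.read_csv('yourfile.csv', chunksize=100)):
    # iteration 0 holds data rows 1-100, iteration 1 holds rows 101-200, and so on
    print('processing rows', i * 100 + 1, 'to', i * 100 + len(chunk_df))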
+6

The method of transferring a huge CSV into a database is good, because then we can easily use SQL queries. But we must also consider two things.

FIRST POINT: SQL is not elastic either; it will not stretch your memory.

For example, take this dataset converted to a .db file:

https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

For this database file, the SQL query:

 pd.read_sql_query("SELECT * FROM 'table'LIMIT 600000", Mydatabase) 

It can read at most about 0.6 million records on a PC with no more than 16 GB of RAM (run time up to 15.8 seconds). It is perhaps mischievous to add that loading directly from the CSV file is more efficient:

giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)

(run time 16.5 seconds)
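If you only need a specific window of rows, both routes can ask for just that slice. A rough sketch of the two equivalents, reusing the names above (and simplifying the header handling for skiprows):

# from the SQLite database: the 100,000 rows after the first 100,000
part_sql = pd.read_sql_query("SELECT * FROM 'table' LIMIT 100000 OFFSET 100000", Mydatabase)

# straight from the CSV: skip the first 100,000 data rows (keeping the header), read the next 100,000
part_csv = pd.read_csv(giga_plik, skiprows=range(1, 100001), nrows=100000)

# note: without an ORDER BY, SQLite does not guarantee the two slices line up row for row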

SECOND POINT: To use SQL effectively on data converted from CSV, we need to store the dates in a suitable form. So I suggest adding this to ryguy72's code:

 df['ColumnWithQuasiDate'] = pd.to_datetime(df['ColumnWithQuasiDate']) 
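As an alternative, read_csv can parse the dates while it reads each chunk via parse_dates, which avoids a separate conversion step. A minimal sketch, assuming the raw column names are 'Created Date' and 'Closed Date' (the rename turns them into CreatedDate and ClosedDate), with plikcsv, chunksize and WM_csv_datab7 as in the full code below:

for df in pd.read_csv(plikcsv, chunksize=chunksize, parse_dates=['Created Date', 'Closed Date']):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.to_sql('table', WM_csv_datab7, if_exists='append')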

The full code for the 311 file, with the changes described above, looks roughly like this:

import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()
### sqlalchemy create_engine
plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')
# ----------------------------------------------------------------------
chunksize = 100000
i = 0
j = 1
# ----------------------------------------------------------------------
for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    # ------------------------------------------------------------------
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    # ------------------------------------------------------------------
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1
print(time.time() - start_time)

Finally, I would like to add: converting a CSV file directly from the Internet into a database seems like a bad idea to me. I suggest downloading the data first and doing the conversion locally.
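For the download step itself, the standard library is enough. A minimal sketch (the export URL is a placeholder; the dataset page above offers a CSV export link):

import urllib.request

# download the CSV once, then do all the chunked conversion on the local copy
url = 'https://example.com/export/rows.csv'  # placeholder for the dataset's CSV export link
urllib.request.urlretrieve(url, 'c:/1/311_Service_Requests_from_2010_to_Present.csv')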

0

This code can help you with this task. It moves through a large CSV file chunk by chunk without taking up much memory, so you can run it on a standard laptop.

import pandas as pd
import os

Here chunksize indicates the number of rows of the CSV file that you want to read at a time.

chunksize2 = 2000
path = './'
data2 = pd.read_csv('ukb35190.csv', chunksize=chunksize2, encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2
start_chunk = 0
data2 = pd.read_csv('ukb35190.csv', chunksize=chunksize2, encoding="ISO-8859-1",
                    skiprows=chunksize2 * start_chunk)

headers = []

for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)
        # Access chunks within data
        # for chunk in data:

        # You can now export all outcomes in new csv files
        file_name = 'export_csv_' + str(start_chunk + i) + '.csv'
        save_path = os.path.abspath(os.path.join(path, file_name))
        print('saving ...')
        df2.to_csv(save_path, index=False)  # write the current chunk to its own CSV file
    except Exception:
        print('reach the end')
        break
0

Source: https://habr.com/ru/post/1265698/

