Here is a short answer.
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # filename and process() are placeholders for your file and your own processing
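If it helps to see that end to end, here is a minimal sketch of what a process() function could look like; the file name, the column name, and the aggregation are placeholders made up for illustration.

import pandas as pd

def process(chunk):
    # Placeholder processing: return the row count and the sum of one column.
    # 'some_numeric_column' is a made-up name; use a real column from your file.
    return len(chunk), chunk['some_numeric_column'].sum()

chunksize = 10 ** 6
total_rows = 0
total_sum = 0
for chunk in pd.read_csv('my_large_file.csv', chunksize=chunksize):  # placeholder file name
    rows, chunk_sum = process(chunk)
    total_rows += rows
    total_sum += chunk_sum
print(total_rows, total_sum)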
Here is a very long answer.
To get started, you need to import pandas and sqlalchemy. The commands below will do that.
import pandas as pd
from sqlalchemy import create_engine
Then set a variable pointing to your CSV file. This isn't strictly necessary, but it makes the code easier to reuse.
file = '/path/to/csv/file'
With these three lines of code, we are ready to start analyzing our data. Let's look at the head of the CSV file to see what the content looks like.
print(pd.read_csv(file, nrows=5))
This command uses pandas' read_csv function to read in only 5 rows (nrows=5) and then prints those rows to the screen. This lets you understand the structure of the CSV file and make sure the data is formatted in a way that makes sense for your work.
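If you want to dig a little deeper into that preview, the same nrows trick lets you inspect the column names and the dtypes pandas infers. This is just a sketch that reuses the file variable from above.

preview = pd.read_csv(file, nrows=5)
print(preview.columns.tolist())  # column names
print(preview.dtypes)            # the types pandas inferred for each column
print(preview.head())            # the first few rows themselves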
Before we can really work with the data, we need to do something with it so that we can start filtering it down to subsets. I would normally use a pandas DataFrame for this, but with large data files we need to store the data somewhere else. In this case, we'll set up a local SQLite database, read the CSV file in chunks, and then write those chunks to SQLite.
To do this, you first need to create the SQLite database using the following command.
csv_database = create_engine('sqlite:///csv_database.db')
Next, we need to iterate through the CSV file in chunks and store the data in SQLite.
chunksize = 100000
i = 0
j = 1
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})  # strip spaces from column names
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')  # append this chunk to the database
    j = df.index[-1] + 1
With this code, we set chunksize to 100,000 to keep the chunks a manageable size, initialize a pair of counters (i = 0, j = 1), and then run through the CSV file with a for loop. Each iteration reads a chunk of data from the CSV file, removes the spaces from the column names, and then appends the chunk to the SQLite database (df.to_sql(...)). The j counter shifts each chunk's index so that row indices keep increasing across chunks, while i simply counts how many chunks have been processed.
This may take some time if your CSV file is large, but the time spent is worth it, because you can now use pandas' SQL tools to pull data out of the database without worrying about memory constraints.
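As a quick sanity check (a sketch, assuming the 'table' name and the csv_database engine created above), you can ask SQLite how many rows were written without pulling the data itself back into memory:

row_count = pd.read_sql_query('SELECT COUNT(*) AS n FROM "table"', csv_database)
print(row_count['n'][0])  # total number of rows loaded into the database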
To access the data now, you can run commands such as the following (the table name is quoted because table is a reserved word in SQLite):
df = pd.read_sql_query('SELECT * FROM "table"', csv_database)
Of course, using 'SELECT *...' will load all of the data into memory, which is exactly the problem we're trying to get away from, so you should add filters to your SELECT statements to narrow the data down. For instance:
df = pd.read_sql_query('SELECT Col1, Col2 FROM "table" WHERE Col1 = SOMEVALUE', csv_database)
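If even a filtered query could return more rows than comfortably fit in memory, read_sql_query also accepts a chunksize argument, so you can stream the result the same way we streamed the CSV. A sketch, with Col1, Col2, and SOMEVALUE again as placeholders:

query = 'SELECT Col1, Col2 FROM "table" WHERE Col1 = SOMEVALUE'
for partial_df in pd.read_sql_query(query, csv_database, chunksize=10000):
    process(partial_df)  # replace with your own per-chunk processing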