I have a very large .csv (which was originally from a SAS dataset) that has the following columns:
```
target_series  metric_series  month  metric_1  metric_2  target_metric
1              1              1      #float    #float    #float
1              1              2      #float    #float    #float
...
1              1              60     #float    #float    #float
1              2              1      #float    #float    #float
1              2              2      #float    #float    #float
...
1              80000          60     #float    #float    #float
2              1              1      #float    #float    #float
...
50             80000          60     #float    #float    #float
```
As you can see, the file holds 60 months of values for each of 80,000 independent `metric_series` within each of 50 `target_series`, and takes up more than 10 GB of space saved as .csv. I need to compute and record the correlation of `metric_1` and `metric_2` with `target_metric` for every series pair.
I wrote the following code:
```python
import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv")
```
The reason I import `datetime` is timing code I added to find out why this process has been taking so long. The timer is wrapped around the `current_df` selection line below, which is by far the slowest part (I moved the `datetime` calls around to pin down the bottleneck).
I found that selecting a piece of the data with this line of code:

```python
current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]
```

takes about 1.5 seconds each time. This is far too slow; clearly something needs to change!
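One variant I have considered but not yet benchmarked is to group by the two series columns instead of re-filtering, so the frame is scanned once rather than once per pair; a minimal sketch, assuming the whole frame fits in memory:

```python
# Sketch (not benchmarked): iterate all (target_series, metric_series) pairs
# in a single pass over `data` instead of repeating the boolean scan per pair.
for (target_num, metric_number), current_df in data.groupby(['target_series', 'metric_series']):
    pass  # correlation code would go here
```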
I decided to try a different approach. Since I know that I want to iterate through the data set 60 rows at a time (one block per pair of `target_series` and `metric_series`), I tried one of two things:
1. Read the top 60 lines from `data.csv`, compute the correlations, and then read the next 60 lines with code like `data = pd.read_csv('data.csv', nrows=60, skiprows=60)` (see the sketch after this list). Although this is fast at the start of the data set, it becomes unbearably slow as more and more rows have to be skipped: reading the last 60 lines of the dataset took more than 10 minutes on my PC!
2. Read the top 60 lines from the `DataFrame` already in memory with something like `data.head(60)`, and then drop those rows with `data = data.drop(data.head(60).index)`, but that was even slower!
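For clarity, this is roughly the loop I had in mind for approach 1 (correlation code omitted; I pass `skiprows` a range of row numbers here so the header line is kept, which my one-liner above did not do):

```python
# Roughly the approach-1 loop: skiprows grows on every iteration, so pandas
# still has to parse everything it skips, which is why it slows to a crawl.
n_pairs = 50 * 80000  # target_series x metric_series pairs
for i in range(n_pairs):
    chunk = pd.read_csv('data.csv', nrows=60, skiprows=range(1, 1 + i * 60))
    # ... correlate chunk here ...
```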
At this point I am learning to use `HDFStore` or `h5py` to move the dataset from `.csv` to `.h5`, but I'm not sure how to do this. The computer on which I am doing this analysis has only 16 GB of memory, and in the future I can expect to work with data even larger than this file.
What is the best way to solve this problem and how can I prepare to work with even bigger data in Python / Pandas?
UPDATE
Thanks to filmor, I rewrote my code as follows:
```python
import pandas as pd

data = pd.read_csv("data.csv", chunksize=60)  # data is now an iterable of 60-row chunks
output = []

for chunk in data:
    # Correlate metric_1 with target_metric
    results_amount_target = chunk[['metric_1', 'target_metric']].corr()
    # Correlate metric_2 with target_metric
    results_count_target = chunk[['metric_2', 'target_metric']].corr()
    # Record the series identifiers and correlations in a Python list
    # that will be converted to a DataFrame later
    output.append([chunk['target_series'].iloc[0],
                   chunk['metric_series'].iloc[0],
                   results_amount_target.iat[0, 1],
                   results_count_target.iat[0, 1]])
```
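For completeness, the conversion of `output` into a `DataFrame` that the comment above refers to would be something along these lines (the column names here are my own choice):

```python
# Turn the accumulated list into a DataFrame; column names are placeholders.
output_df = pd.DataFrame(
    output,
    columns=['target_series', 'metric_series', 'metric_1_corr', 'metric_2_corr'],
)
```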
Now it is super fast and memory-light! I would still appreciate it if someone could help me with how to do this using `HDFStore` or `.h5` files.
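For context, this is what I have pieced together from the pandas docs so far; the file name, key name, and chunk size are placeholders of my own, and I'm not sure this is the recommended pattern:

```python
import pandas as pd

# One-off conversion: stream the CSV into an HDF5 table in chunks so it never
# has to fit in memory all at once (requires PyTables to be installed).
with pd.HDFStore('data.h5', mode='w') as store:
    for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
        store.append('metrics', chunk,
                     data_columns=['target_series', 'metric_series'])

# Afterwards, a single series pair can be pulled with a query instead of
# scanning the whole file:
with pd.HDFStore('data.h5', mode='r') as store:
    current_df = store.select(
        'metrics',
        where='(target_series == 1) & (metric_series == 1)',
    )
```

Is that the right direction, or is there a better way to set this up?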