I have a very large .csv (which was originally from a SAS dataset) that has the following columns:
```
target_series  metric_series  month  metric_1  metric_2  target_metric
1              1              1      #float    #float    #float
1              1              2      #float    #float    #float
...
1              1              60     #float    #float    #float
1              2              1      #float    #float    #float
1              2              2      #float    #float    #float
...
1              80000          60     #float    #float    #float
2              1              1      #float    #float    #float
...
50             80000          60     #float    #float    #float
```
As you can see, the file holds 60 months of values for each of 80,000 independent `metric_series` within each of 50 `target_series`, and takes up more than 10 GB of space saved as .csv. I need to compute and record the correlation of `metric_1` and `metric_2` with `target_metric` for every series pair.
I wrote the following code:
```python
import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv")
```
The reason I import `datetime` is timing code I added to find out why this process has been taking so long. The timer is wrapped around the `current_df` selection line below, which is by far the slowest part (I moved the `datetime` calls around to pin down the bottleneck).
I found that selecting a piece of the data with this line of code:

```python
current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]
```

takes about 1.5 seconds each time. This is far too slow; clearly something needs to change!
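One variant I have considered but not yet benchmarked is to group by the two series columns instead of re-filtering, so the frame is scanned once rather than once per pair; a minimal sketch, assuming the whole frame fits in memory:

```python
# Sketch (not benchmarked): iterate all (target_series, metric_series) pairs
# in a single pass over `data` instead of repeating the boolean scan per pair.
for (target_num, metric_number), current_df in data.groupby(['target_series', 'metric_series']):
    pass  # correlation code would go here
```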
I decided to try a different approach. Since I know that I want to iterate through the data set 60 rows at a time (one block per pair of `target_series` and `metric_series`), I tried one of two things:
1. Read the top 60 lines from `data.csv`, compute the correlations, and then read the next 60 lines with code like `data = pd.read_csv('data.csv', nrows=60, skiprows=60)` (see the sketch after this list). Although this is fast at the start of the data set, it becomes unbearably slow as more and more rows have to be skipped: reading the last 60 lines of the dataset took more than 10 minutes on my PC!
2. Read the top 60 lines from the `DataFrame` already in memory with something like `data.head(60)`, and then drop those rows with `data = data.drop(data.head(60).index)`, but that was even slower!
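For clarity, this is roughly the loop I had in mind for approach 1 (correlation code omitted; I pass `skiprows` a range of row numbers here so the header line is kept, which my one-liner above did not do):

```python
# Roughly the approach-1 loop: skiprows grows on every iteration, so pandas
# still has to parse everything it skips, which is why it slows to a crawl.
n_pairs = 50 * 80000  # target_series x metric_series pairs
for i in range(n_pairs):
    chunk = pd.read_csv('data.csv', nrows=60, skiprows=range(1, 1 + i * 60))
    # ... correlate chunk here ...
```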
At this point I am learning to use `HDFStore` or `h5py` to move the dataset from `.csv` to `.h5`, but I'm not sure how to do this. The computer on which I am doing this analysis has only 16 GB of memory, and in the future I can expect to work with data even larger than this file.
What is the best way to solve this problem and how can I prepare to work with even bigger data in Python / Pandas?
UPDATE
Thanks to filmor, I rewrote my code as follows:
```python
import pandas as pd

data = pd.read_csv("data.csv", chunksize=60)  # data is now an iterable of 60-row chunks
output = []

for chunk in data:
    # Correlate metric_1 with target_metric
    results_amount_target = chunk[['metric_1', 'target_metric']].corr()
    # Correlate metric_2 with target_metric
    results_count_target = chunk[['metric_2', 'target_metric']].corr()
    # Record the series identifiers and correlations in a Python list
    # that will be converted to a DataFrame later
    output.append([chunk['target_series'].iloc[0],
                   chunk['metric_series'].iloc[0],
                   results_amount_target.iat[0, 1],
                   results_count_target.iat[0, 1]])
```
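For completeness, the conversion of `output` into a `DataFrame` that the comment above refers to would be something along these lines (the column names here are my own choice):

```python
# Turn the accumulated list into a DataFrame; column names are placeholders.
output_df = pd.DataFrame(
    output,
    columns=['target_series', 'metric_series', 'metric_1_corr', 'metric_2_corr'],
)
```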
Now it is super fast and memory-light! I would still appreciate it if someone could help me with how to do this using `HDFStore` or `.h5` files.
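For context, this is what I have pieced together from the pandas docs so far; the file name, key name, and chunk size are placeholders of my own, and I'm not sure this is the recommended pattern:

```python
import pandas as pd

# One-off conversion: stream the CSV into an HDF5 table in chunks so it never
# has to fit in memory all at once (requires PyTables to be installed).
with pd.HDFStore('data.h5', mode='w') as store:
    for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
        store.append('metrics', chunk,
                     data_columns=['target_series', 'metric_series'])

# Afterwards, a single series pair can be pulled with a query instead of
# scanning the whole file:
with pd.HDFStore('data.h5', mode='r') as store:
    current_df = store.select(
        'metrics',
        where='(target_series == 1) & (metric_series == 1)',
    )
```

Is that the right direction, or is there a better way to set this up?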