I have a very large .csv (which was originally from a SAS dataset) that has the following columns:
```
target_series  metric_series  month  metric_1  metric_2  target_metric
1              1              1      #float    #float    #float
1              1              2      #float    #float    #float
...
1              1              60     #float    #float    #float
1              2              1      #float    #float    #float
1              2              2      #float    #float    #float
...
1              80000          60     #float    #float    #float
2              1              1      #float    #float    #float
...
50             80000          60     #float    #float    #float
```
As you can see, the file contains 60 months of data for each of 80,000 metric series within each of 50 target series, and it takes up more than 10 GB when saved as a .csv. I need to compute and record the correlation of metric_1 and metric_2 with target_metric for every (target_series, metric_series) pair.
I wrote the following code:
```python
import pandas as pd
from datetime import datetime

data = pd.read_csv("data.csv")
```
The datetime import is there to time the code and find out why this process takes so long. The timer is wrapped around the line that builds current_df, which is by far the slowest part (I moved the timing calls around until I found the bottleneck).
I found that selecting a piece of data using this line of code:
```python
current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)]
```
takes about 1.5 seconds each time. That is far too slow: with 50 × 80,000 series pairs to select, the selections alone would take roughly two months. Clearly, something needs to change!
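For reference, the timed loop looked roughly like this (a sketch; the loop structure and variable names are reconstructions, not my exact code):

```python
for target_num in range(1, 51):            # 50 target series
    for metric_number in range(1, 80001):  # 80,000 metric series
        start = datetime.now()
        current_df = data[(data['target_series'] == target_num) &
                          (data['metric_series'] == metric_number)]
        print(datetime.now() - start)      # ~1.5 s per selection
        # ... correlations computed on current_df here ...
```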
I decided to try a different approach. Since I know I want to iterate through the data set 60 rows at a time (one chunk per pair of target_series and metric_series), I tried two things:
1. Read the top 60 lines of data.csv, compute the correlations, and then read the next 60 lines with data = pd.read_csv('data.csv', nrows=60, skiprows=60), and so on (see the sketch after this list). Although this is fast for the early parts of the data set, it becomes unbearably slow as the number of rows to skip grows: it took more than 10 minutes to read the last 60 lines of the file on my PC!
2. Read the top 60 lines from the DataFrame already in memory with data.head(60), and then drop those rows with data = data.drop(data.head(60).index), but this was even slower!
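For the first approach, the loop looked roughly like this (a sketch; the chunk counter and the explicit column-name handling are my own additions, needed because skipping rows also skips the header line):

```python
n_series = 50 * 80000          # total (target_series, metric_series) pairs
col_names = ['target_series', 'metric_series', 'month',
             'metric_1', 'metric_2', 'target_metric']

for i in range(n_series):
    # skiprows grows every iteration, so read_csv re-scans the file from the
    # top each time -- this is what makes the later chunks so painfully slow
    chunk = pd.read_csv('data.csv', nrows=60, skiprows=1 + i * 60,
                        header=None, names=col_names)
    # ... correlations computed on chunk here ...
```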
At this point I am looking into HDFStore or h5py to move the dataset from .csv to .h5, but I am not sure how to go about it. The computer I am running this analysis on has only 16 GB of memory, and in the future I expect to work with data even larger than this file.
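To frame the question, this is roughly the direction I have been looking at (an untested sketch based on the pandas HDFStore docs; the file name, key name, and chunk size are placeholders):

```python
import pandas as pd

# One-time conversion: stream the .csv into an HDF5 table in chunks so the
# full 10 GB file never has to fit in memory.
with pd.HDFStore('data.h5', mode='w', complib='blosc', complevel=9) as store:
    for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
        store.append('data', chunk,
                     data_columns=['target_series', 'metric_series'])

# Later, pull a single 60-row series directly from disk.
with pd.HDFStore('data.h5', mode='r') as store:
    current_df = store.select(
        'data', where='target_series == 1 & metric_series == 1')
```

Is something like this the right way to go, and does it stay fast as the file grows?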
What is the best way to solve this problem and how can I prepare to work with even bigger data in Python / Pandas?
UPDATE
Thanks to filmor, I rewrote my code as follows:
import pandas as pd from datetime import datetime data = pd.read_csv("data.csv", chunksize=60) # data is now an iterable output = [] for chunk in data: results_amount_target = chunk[['metric_1','target_metric']].corr() # Perform metric_1 correlation results_count_target = chunk[['metric_2','target_metric']].corr() # Perform metric_2 correlation output.append([chunk['target_series'][0], chunk['independent_series'][0], results_amount_target.iat[0,1], results_count_target.iat[0,1]]) # Record the correlation in a Python list will be converted to a DataFrame later
Now it is super fast and memory-light! I would still appreciate it if anyone could show me how to do this with HDFStore or .h5 files.