Working with a 10+ GB dataset in Python Pandas

I have a very large .csv (which was originally from a SAS dataset) that has the following columns:

    target_series  metric_series  month  metric_1  metric_2  target_metric
    1              1              1      #float    #float    #float
    1              1              2      #float    #float    #float
    ...
    1              1              60     #float    #float    #float
    1              2              1      #float    #float    #float
    1              2              2      #float    #float    #float
    ...
    1              80000          60     #float    #float    #float
    2              1              1      #float    #float    #float
    ...
    50             80000          60     #float    #float    #float

As you can see, the file contains 60 monthly rows for each of 80,000 metric_series within each of 50 target_series (50 × 80,000 × 60 = 240 million rows) and takes up more than 10 GB saved as .csv. I need to compute and record the correlation of metric_1 and metric_2 with target_metric for each (target_series, metric_series) pair.

I wrote the following code:

    import pandas as pd
    from datetime import datetime

    data = pd.read_csv("data.csv")  # approximately 10 GB
    output = []

    for target_num in range(1, 51):            # 50 target series
        for metric_number in range(1, 80001):  # 80000 metric series
            startTime = datetime.now()  # Begin the timer
            # Select the current 60-month period that we want to perform the correlation on
            current_df = data[(data['target_series'] == target_num) &
                              (data['metric_series'] == metric_number)]
            print('The process took: ' + str(datetime.now() - startTime)
                  + ' hours:minutes:seconds to complete.')  # Stop the timer
            results_amount_target = current_df[['metric_1', 'target_metric']].corr()  # Perform metric_1 correlation
            results_count_target = current_df[['metric_2', 'target_metric']].corr()   # Perform metric_2 correlation
            # Record the correlations in a Python list; will be converted to a DataFrame later
            output.append([target_num, metric_number,
                           results_amount_target.iat[0, 1],
                           results_count_target.iat[0, 1]])

The datetime code is there to find out why this process was taking so long. The timer is wrapped around the current_df selection line, which is by far the slowest step (I moved the timer around the code to isolate the bottleneck).

I found that selecting a piece of data using this line of code:

 current_df = data[(data['target_series'] == target_num) & (data['metric_series'] == metric_number)] 

takes 1.5 seconds each time. This is too slow for me! Clearly, something needs to be changed!
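In hindsight, I suspect the root cause is that each boolean mask scans the entire ~240-million-row frame. Indexing the frame once should turn each lookup into a sorted-index probe instead; a rough sketch (untested, and the whole frame still has to fit in memory):

    # Index once up front; each subsequent lookup is a sorted-index probe
    # instead of a full-table boolean scan.
    data = data.set_index(['target_series', 'metric_series']).sort_index()
    current_df = data.loc[(target_num, metric_number)]  # the 60 rows for this pair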

I decided to try a different approach. Since I know that I want to iterate through the data set 60 rows at a time (one block for each pair of target_series and metric_series ), I tried two things:

  • Read in the top 60 lines from data.csv , do the correlation, and then read in the next 60 lines with something like data = pd.read_csv('data.csv', nrows=60, skiprows=60) , incrementing skiprows by 60 on each pass (see the sketch after this list). Although this is fast for the first part of the data set, it becomes unbearably slow as the number of rows to skip grows. It took more than 10 minutes to read the last 60 lines in the dataset on my PC!

  • Read the top 60 lines from the DataFrame stored in memory with data.head(60) , and then remove those rows from the DataFrame using data = data.drop(data.head(60).index) , but that was even slower!
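For reference, the first approach in a loop looks something like this (a sketch; the ever-growing skiprows list is what makes the full pass quadratic):

    import pandas as pd

    n_pairs = 50 * 80000  # one 60-row block per (target_series, metric_series) pair
    for i in range(n_pairs):
        # Skip the data rows already read; row 0 (the header) is kept.
        block = pd.read_csv('data.csv', nrows=60,
                            skiprows=range(1, i * 60 + 1))
        # ... correlate block as before ...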

At this point, I am looking into HDFStore or h5py to move the dataset from .csv to .h5 , but I'm not sure how to do this. The computer on which I am doing this analysis has only 16 GB of memory, and in the future I can expect to work with data that is even larger than this file.
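From what I've gathered in the pandas docs so far, the conversion might look something like this (an untested sketch: the file and key names are my own, the append-in-chunks pattern means the whole CSV never has to fit in memory, and it needs the tables package installed):

    import pandas as pd

    # Build the HDF5 file incrementally so no more than one chunk
    # is ever held in memory at a time.
    with pd.HDFStore('data.h5', mode='w') as store:
        for chunk in pd.read_csv('data.csv', chunksize=1000000):
            # data_columns makes these columns queryable with 'where' later
            store.append('data', chunk,
                         data_columns=['target_series', 'metric_series'])

    # Pull one 60-row block without reading the rest of the file:
    current_df = pd.read_hdf('data.h5', 'data',
                             where='target_series == 1 & metric_series == 2')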

What is the best way to solve this problem and how can I prepare to work with even bigger data in Python/pandas?

UPDATE

Thanks to filmor, I rewrote my code as follows:

    import pandas as pd

    data = pd.read_csv("data.csv", chunksize=60)  # data is now an iterable of 60-row DataFrames
    output = []

    for chunk in data:
        results_amount_target = chunk[['metric_1', 'target_metric']].corr()  # Perform metric_1 correlation
        results_count_target = chunk[['metric_2', 'target_metric']].corr()   # Perform metric_2 correlation
        # Record the correlations in a Python list; will be converted to a DataFrame later.
        # .iloc[0] is needed because each chunk keeps its original row labels.
        output.append([chunk['target_series'].iloc[0],
                       chunk['metric_series'].iloc[0],
                       results_amount_target.iat[0, 1],
                       results_count_target.iat[0, 1]])
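For completeness, this is roughly how I plan to convert output into the final DataFrame (the column names here are my own):

    results = pd.DataFrame(output, columns=['target_series', 'metric_series',
                                            'corr_metric_1', 'corr_metric_2'])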

Now it is super fast and memory-light! I would still appreciate it if anyone could help me with how to do this using HDFStore or .h5 files.
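If the chunked pattern carries over to HDF5, I imagine the loop would look something like this (again an untested sketch, assuming the table-format data.h5 built above):

    import pandas as pd

    output = []
    # Iterate over the table-format HDF5 file 60 rows at a time,
    # mirroring the read_csv(chunksize=60) loop above.
    for chunk in pd.read_hdf('data.h5', 'data', chunksize=60):
        results_amount_target = chunk[['metric_1', 'target_metric']].corr()
        results_count_target = chunk[['metric_2', 'target_metric']].corr()
        output.append([chunk['target_series'].iloc[0],
                       chunk['metric_series'].iloc[0],
                       results_amount_target.iat[0, 1],
                       results_count_target.iat[0, 1]])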
