Pandas: applying Lambda to multiple data frames

I am trying to understand how to simultaneously apply a lambda function to multiple data frames, without merging data frames together. I work with large data sets (> 60MM records) and I need to be more careful with memory management.

My hope is that there is a way to apply the lambda directly to the underlying data frames, so that I can avoid the cost of first stitching them together and then having to drop that intermediate data frame from memory before moving on to the next step in the process.

I have experience working around memory issues with HDF5-backed data frames, but I would prefer to try something else first.

I have put together a toy problem to demonstrate what I'm talking about.

    import numpy as np
    import pandas as pd

    # Here is an arbitrary function to use with lambda
    def someFunction(input1, input2, input3, input4):
        theSum = input1 + input2
        theAverage = (input1 + input2 + input3 + input4) / 4
        theProduct = input2 * input3 * input4
        return pd.Series({'Sum': theSum, 'Average': theAverage, 'Product': theProduct})

    # Cook up some dummy dataframes
    df1 = pd.DataFrame(np.random.randn(6, 2), columns=list('AB'))
    df2 = pd.DataFrame(np.random.randn(6, 1), columns=list('C'))
    df3 = pd.DataFrame(np.random.randn(6, 1), columns=list('D'))

    # Currently, I merge the dataframes together and then apply the lambda function
    dfConsolodated = pd.concat([df1, df2, df3], axis=1)

    # This works just fine, but merging the dataframes seems like an extra step
    dfResults = dfConsolodated.apply(lambda x: someFunction(x['A'], x['B'], x['C'], x['D']), axis=1)

    # I want to avoid the concat completely in order to be more efficient with memory.
    # I am hoping for something like this.
    # I am COMPLETELY making this syntax up for conceptual purposes, my apologies
    # (it is not valid pandas, so it is left commented out):
    # dfResultsWithoutConcat = [df1, df2, df3].apply(lambda x: someFunction(df1['A'], df1['B'], df2['C'], df3['D']), axis=1)
2 answers

One option is to explicitly create the desired aggregations:

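    theSum = df1.A + df1.B
    theAverage = (df1.A + df1.B + df2.C + df3.D) / 4.
    theProduct = df1.B * df2.C * df3.D

    # axis=1 places the three Series side by side as columns;
    # without it they would be stacked into one long Series
    theResult = pd.concat([theSum, theAverage, theProduct], axis=1)
    theResult.columns = ['Sum', 'Average', 'Product']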
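As a quick sanity check, the sketch below compares this column-wise arithmetic with the concat-and-apply approach from the question on the same toy data. The fixed seed, the name `some_function`, and the use of `np.allclose` are my own choices for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
df1 = pd.DataFrame(rng.standard_normal((6, 2)), columns=list('AB'))
df2 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('C'))
df3 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('D'))

# Column-wise arithmetic aligns on the shared index;
# the input frames are never concatenated
vectorized = pd.DataFrame({
    'Sum': df1['A'] + df1['B'],
    'Average': (df1['A'] + df1['B'] + df2['C'] + df3['D']) / 4,
    'Product': df1['B'] * df2['C'] * df3['D'],
})

# Reference result: the concat + row-wise apply from the question
def some_function(a, b, c, d):
    return pd.Series({'Sum': a + b,
                      'Average': (a + b + c + d) / 4,
                      'Product': b * c * d})

reference = pd.concat([df1, df2, df3], axis=1).apply(
    lambda x: some_function(x['A'], x['B'], x['C'], x['D']), axis=1)

# Compare the two results column by column
cols = ['Sum', 'Average', 'Product']
same = np.allclose(vectorized[cols].to_numpy(), reference[cols].to_numpy())
print(same)
```

Note that this relies on all three frames sharing the same index. If the indexes differ, pandas aligns on labels and fills unmatched rows with NaN.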

Another possibility is to use query, but it really depends on your use case and how you intend to aggregate your data. Here is an example from the docs that may apply to you.

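    # expr is a query string that is evaluated against each frame in turn.
    # On Python 3, map is lazy, so wrap it in list() to get the results.
    map(lambda frame: frame.query(expr), [df, df2])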
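To make the query idea concrete, here is a minimal sketch that applies one expression to two separate frames without combining them. The frames, the column names `a`, `b`, `c`, and the filter expression are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
df = pd.DataFrame(rng.random((10, 3)), columns=list('abc'))
df2 = pd.DataFrame(rng.random((12, 3)), columns=list('abc'))

# Hypothetical filter: keep rows where a <= c and c <= 0.5
expr = '(a <= c) & (c <= 0.5)'

# Evaluate the same expression against each frame independently;
# the result is one filtered frame per input frame
filtered = [frame.query(expr) for frame in (df, df2)]

for frame in filtered:
    print(len(frame))
```

Each frame is filtered on its own, so no combined frame is ever materialized. This only helps when the expression refers to columns within a single frame, which is not the case for the toy function in the question.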

I know this question is old, but here is an approach I came up with. It is not pretty, but it works.

The main idea is to query the second data frame inside the applied function. Using the name of the Series that is passed in, you can identify the column / index label and use it to retrieve the required value from the other data frame.

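    def func(x, other):
        # x.name is the label of the Series currently being processed
        other_value = other.loc[x.name]
        # your_actual_method is a placeholder for the real computation
        return your_actual_method(x, other_value)

    result = df1.apply(lambda x: func(x, df2))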
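Applied to the toy problem from the question, the same lookup idea might look like the sketch below. It assumes row-wise application (`axis=1`), so that `row.name` is the index label of the current row, and it assumes all three frames share that index. The helper name `row_func` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # fixed seed, for reproducibility only
df1 = pd.DataFrame(rng.standard_normal((6, 2)), columns=list('AB'))
df2 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('C'))
df3 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('D'))

def row_func(row, other_c, other_d):
    # With axis=1, row.name is the index label of the current df1 row;
    # use it to fetch the matching values from the other frames
    c = other_c.loc[row.name, 'C']
    d = other_d.loc[row.name, 'D']
    return pd.Series({
        'Sum': row['A'] + row['B'],
        'Average': (row['A'] + row['B'] + c + d) / 4,
        'Product': row['B'] * c * d,
    })

# Only df1 is iterated; df2 and df3 are consulted by label, never concatenated
result = df1.apply(lambda row: row_func(row, df2, df3), axis=1)
print(result.shape)
```

Row-wise apply calls the Python function once per row, so on data of the size mentioned in the question it is likely to be much slower than plain column arithmetic, even though it avoids the intermediate frame.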

Source: https://habr.com/ru/post/989821/

