How to create lazy_evaluated dataframe columns in Pandas

I often have a large dataframe df that stores the main data, and I need to create many more columns to hold derived data computed from the underlying data columns.

I can do it in Pandas as:

    df['derivative_col1'] = df['basic_col1'] + df['basic_col2']
    df['derivative_col2'] = df['basic_col1'] * df['basic_col2']
    ....
    df['derivative_coln'] = func(list_of_basic_cols)

and so on. Pandas computes all of the derived columns and allocates memory for them immediately.

What I want instead is a lazy evaluation mechanism that defers the computation and memory allocation for the derived columns until they are actually needed. The lazy-evaluated columns would be defined somewhat like this:

    df['derivative_col1'] = pandas.lazy_eval(df['basic_col1'] + df['basic_col2'])
    df['derivative_col2'] = pandas.lazy_eval(df['basic_col1'] * df['basic_col2'])

This would save time and memory, much like a Python 'yield' generator, because accessing df['derivative_col2'] would trigger only that one calculation and its memory allocation.

So how can I do something like lazy_eval() in Pandas? Any feedback, thoughts, or references are welcome.

+9
python pandas lazy-evaluation
Oct 26 '13 at 10:20
2 answers

Starting in 0.13 (to be released soon), you can do something like this. It uses a generator to evaluate formulas dynamically. In-line assignment via eval will be an additional feature in 0.13, see here

    In [19]: df = DataFrame(randn(5, 2), columns=['a', 'b'])

    In [20]: df
    Out[20]:
              a         b
    0 -1.949107 -0.763762
    1 -0.382173 -0.970349
    2  0.202116  0.094344
    3 -1.225579 -0.447545
    4  1.739508 -0.400829

    In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]

Create a generator that evaluates each formula using eval, assigns the result to a new column, and then yields the frame.

    In [22]: def lazy(x, formulas):
       ....:     for col, f in formulas:
       ....:         x[col] = x.eval(f)
       ....:         yield x
       ....:

In action

    In [23]: gen = lazy(df,formulas)

    In [24]: next(gen)
    Out[24]:
              a         b         c
    0 -1.949107 -0.763762 -2.712869
    1 -0.382173 -0.970349 -1.352522
    2  0.202116  0.094344  0.296459
    3 -1.225579 -0.447545 -1.673123
    4  1.739508 -0.400829  1.338679

    In [25]: next(gen)
    Out[25]:
              a         b         c         d
    0 -1.949107 -0.763762 -2.712869  5.287670
    1 -0.382173 -0.970349 -1.352522  0.516897
    2  0.202116  0.094344  0.296459  0.059919
    3 -1.225579 -0.447545 -1.673123  2.050545
    4  1.739508 -0.400829  1.338679  2.328644
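For anyone who wants to run the session above as a plain script, here is a self-contained sketch with the imports spelled out. The fixed seed and the use of `numpy.random.default_rng` are assumptions added here for reproducibility, so the numbers will differ from the output shown above. The generator is advanced with the built-in `next()`.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, used only to make this sketch reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 2)), columns=["a", "b"])

# Each entry is (new column name, expression over existing columns).
formulas = [("c", "a + b"), ("d", "a * c")]

def lazy(frame, formulas):
    """Evaluate one formula per step, assign it, and yield the frame."""
    for col, expr in formulas:
        frame[col] = frame.eval(expr)
        yield frame

gen = lazy(df, formulas)

step1 = next(gen)                        # computes 'c' only
cols_after_first = list(step1.columns)   # snapshot, since the frame is mutated in place

step2 = next(gen)                        # computes 'd', which depends on 'c'
cols_after_second = list(step2.columns)
```

Note that the generator yields the same frame object each time, so the column lists are captured right after each step rather than compared afterwards.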

So the order of evaluation is determined by the user; it is not on demand. In theory numba is going to support this, so pandas may eventually support it as a backend for eval (which currently uses numexpr for immediate evaluation).
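If evaluation really has to be on demand, one possible approach is to keep the formulas in a dict and compute a column only the first time it is requested. This is a minimal sketch, not a pandas feature: the `LazyColumns` wrapper, its `get` method, and the `computed` list are hypothetical names introduced here. Formulas are plain callables rather than eval strings so that each one can pull in its own dependencies, and the sketch assumes the formulas contain no cycles.

```python
import pandas as pd

class LazyColumns:
    """Hypothetical wrapper: derived columns are computed on first access."""

    def __init__(self, frame, formulas):
        self.frame = frame
        # Maps a derived column name to a function that takes this wrapper.
        self.formulas = dict(formulas)
        # Order in which derived columns were actually materialized.
        self.computed = []

    def get(self, name):
        # Compute and store the column only if it is missing and has a formula.
        if name not in self.frame.columns and name in self.formulas:
            self.frame[name] = self.formulas[name](self)
            self.computed.append(name)
        return self.frame[name]

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
lazy_cols = LazyColumns(df, {
    "c": lambda lz: lz.get("a") + lz.get("b"),
    "d": lambda lz: lz.get("a") * lz.get("c"),   # depends on 'c'
    "e": lambda lz: lz.get("b") ** 2,            # never requested below
})

d = lazy_cols.get("d")   # triggers 'c' first, then 'd'; 'e' stays uncomputed
```

After the last line, `df` holds the columns a, b, c and d, while e has used neither time nor memory. Once stored, a derived column is not refreshed when its inputs change.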

my 2c.

Lazy evaluation is nice, but it can easily be achieved with Python's own continuation / generator features, so building it into pandas, while possible, is quite tricky and would need a strong use case to be generally useful.

+8
Oct 26 '13 at 20:39

You can subclass DataFrame and add the column as a property. For example,

    import pandas as pd

    class LazyFrame(pd.DataFrame):
        @property
        def derivative_col1(self):
            self['derivative_col1'] = result = self['basic_col1'] + self['basic_col2']
            return result

    x = LazyFrame({'basic_col1':[1,2,3], 'basic_col2':[4,5,6]})
    print(x)
    #    basic_col1  basic_col2
    # 0           1           4
    # 1           2           5
    # 2           3           6

Accessing the property (via x.derivative_col1, below) calls the derivative_col1 function defined in LazyFrame. This function computes the result and adds the derived column to the LazyFrame instance:

    print(x.derivative_col1)
    # 0    5
    # 1    7
    # 2    9

    print(x)
    #    basic_col1  basic_col2  derivative_col1
    # 0           1           4                5
    # 1           2           5                7
    # 2           3           6                9

Note that if you change the base column:

 x['basic_col1'] *= 10 

the derived column is not automatically updated:

    print(x['derivative_col1'])
    # 0    5
    # 1    7
    # 2    9

But if you access the property, the values are recalculated:

    print(x.derivative_col1)
    # 0    14
    # 1    25
    # 2    36

    print(x)
    #    basic_col1  basic_col2  derivative_col1
    # 0          10           4               14
    # 1          20           5               25
    # 2          30           6               36
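With many derived columns, writing one property per column by hand gets repetitive. One way to cut that down, sketched here under the assumption that each derived column can be written as a function of the frame, is a small property factory. `lazy_column` is a hypothetical helper introduced for illustration, not part of pandas.

```python
import pandas as pd

def lazy_column(name, func):
    """Hypothetical helper: build a property that computes and stores a column."""
    def getter(self):
        # Same pattern as the hand-written property: store the column, return it.
        self[name] = result = func(self)
        return result
    return property(getter)

class LazyFrame(pd.DataFrame):
    derivative_col1 = lazy_column(
        "derivative_col1", lambda f: f["basic_col1"] + f["basic_col2"])
    derivative_col2 = lazy_column(
        "derivative_col2", lambda f: f["basic_col1"] * f["basic_col2"])

x = LazyFrame({"basic_col1": [1, 2, 3], "basic_col2": [4, 5, 6]})
before = list(x.columns)      # only the two basic columns

col2 = x.derivative_col2      # computes and stores derivative_col2 only
after = list(x.columns)       # derivative_col1 is still absent
```

The behavior matches the hand-written property: every access through the attribute recomputes the column, while `x['derivative_col2']` returns whatever was stored last.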
+5
Feb 05 '14 at 11:35


