Most Pythonic way to create many new columns in Pandas

I have a large dataframe df (~100 columns and ~7 million rows), and I need to create ~50 new variables / columns, which are simple transformations of the existing columns. One way to proceed would be with many .apply calls (I use transform* below as a placeholder for simple transforms such as taking the max or squaring):

    df['new_var1'] = df['old_var1'].apply(lambda x: transform1(x))
    ...
    df['new_var50'] = df['old_var50'].apply(lambda x: transform50(x))

Another way would be to create a dictionary

    transform_dict = {
        'new_var1': lambda row: transform1(row),
        ...,
        'new_var50': lambda row: transform50(row)
    }

and then write a single .apply in conjunction with .concat:

    df = pd.concat([df, df.apply(lambda r: pd.Series({var: transform_dict[var](r) for var in transform_dict.keys()}),
                                 axis=1)], axis=1)

Is one method preferable to the other, both in terms of how "Pythonic" it is and in terms of efficiency, scalability, and flexibility?
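For concreteness, here is a minimal sketch of a third variant I have in mind: keep the dictionary, but let each entry work on a whole column via df.assign instead of row by row (transform1, transform50 and the variable names are the same placeholders as above):

    # Sketch only: transform1/transform50 and the old_var*/new_var* names
    # are placeholders, as in the examples above.
    col_transforms = {
        'new_var1': lambda d: d['old_var1'].apply(transform1),
        # ...,
        'new_var50': lambda d: d['old_var50'].apply(transform50),
    }
    df = df.assign(**col_transforms)  # adds all new columns in one call

This keeps the bookkeeping of the dictionary approach but avoids a Python-level loop over rows.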

2 answers

Beginning with:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.random((1000, 100)))

Adding individual columns:

    def cols_via_apply(df):
        for i in range(100, 150):
            df[i] = df[i - 100].apply(lambda x: x * i)
        return df

    %timeit cols_via_apply(df)
    10 loops, best of 3: 29.6 ms per loop

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000 entries, 0 to 999
    Columns: 150 entries, 0 to 149
    dtypes: float64(150)
    memory usage: 1.2 MB
    None

seems quite a bit more efficient than using pd.concat, presumably because the latter loops over the rows of the DataFrame. So this difference will get worse as the DataFrame gets longer:

    def cols_via_concat(df):
        df = pd.concat([df, df.apply(lambda row: pd.Series({i: i * row[i - 100] for i in range(100, 150)}),
                                     axis=1)], axis=1)
        return df

    %timeit cols_via_concat(df)
    1 loops, best of 3: 450 ms per loop

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000 entries, 0 to 999
    Columns: 150 entries, 0 to 149
    dtypes: float64(150)
    memory usage: 1.2 MB
    None

Continuing @Stefan's experiment, but with a 100k x 100 DataFrame and a new method that first concatenates a block of NaN columns onto the DataFrame and then uses .iloc to fill in the calculation for each column:

    def cols_via_iloc(df):
        df = pd.concat([df, pd.DataFrame(np.tile(np.nan, [len(df), 50]))], axis=1)
        for i in range(100, 150):
            df.iloc[:, i] = i * df.iloc[:, i - 100]
        return df

    def cols_via_apply(df):
        for i in range(100, 150):
            df[i] = df[i - 100].apply(lambda x: x * i)
        return df

    def cols_via_concat(df):
        df = pd.concat([df, df.apply(lambda row: pd.Series({i: i * row[i - 100] for i in range(100, 150)}),
                                     axis=1)], axis=1)
        return df

    >>> %%timeit
    df = pd.DataFrame(np.random.randn(100000, 100))
    cols_via_iloc(df)
    1 loops, best of 3: 540 ms per loop

    >>> %%timeit
    df = pd.DataFrame(np.random.randn(100000, 100))
    cols_via_apply(df)
    1 loops, best of 3: 2.91 s per loop

    >>> %%timeit
    df = pd.DataFrame(np.random.randn(100000, 100))
    cols_via_concat(df)
    1 loops, best of 3: 55.8 s per loop
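As a further point of comparison, not included in the timings above: for an element-wise transform this simple, .apply can be dropped entirely in favour of plain column arithmetic. Here is a minimal sketch along the same lines as the functions above; I would expect it to be at least as fast as the iloc version, but it is not benchmarked here.

    def cols_via_vectorized(df):
        # Each new column is a single vectorized operation on an existing
        # column; no per-element .apply and no Python-level loop over rows.
        for i in range(100, 150):
            df[i] = i * df[i - 100]
        return df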

Source: https://habr.com/ru/post/1242212/

