I have a number of similar DataFrames, and I would like to standardize the NaNs across all of them. For example, if a NaN exists in df1.loc[0, 'a'], then ALL other DataFrames should be set to NaN at the same index location.
I know that I could concatenate the DataFrames into one large multi-indexed DataFrame, but sometimes it's easier for me to work with a group of DataFrames that share the same structure.
Here is an example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
print(df1)
print(' ')
print(df2)
print(' ')
print(df3)
Output:
     a   b   c
0  0.0   1   2
1  3.0   4   5
2  6.0   7   8
3  NaN  10  11

   a     b   c
0  0   1.0   2
1  3   NaN   5
2  6   7.0   8
3  9  10.0  11

   a   b     c
0  0   1   NaN
1  3   4   5.0
2  6   7   8.0
3  9  10  11.0
However, I would like df1, df2 and df3 to have NaNs in the same places:
print(df1)
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
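For frames that already share the same index and columns, the desired result can be sketched as follows. This is only an illustration under that assumption, not necessarily the approach of any particular answer; the helper names `sync_nans` and `make_frame` are hypothetical, and the frames are built as floats so that assigning NaN does not require a dtype change.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd


def sync_nans(frames):
    """Hypothetical helper: assumes every frame has the same index and columns."""
    # OR together the missing-value masks of all frames.
    combined = reduce(operator.or_, (frame.isna() for frame in frames))
    # DataFrame.mask replaces values with NaN wherever the mask is True.
    return [frame.mask(combined) for frame in frames]


def make_frame():
    # Float data, so setting NaN below does not change the column dtype.
    return pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                        columns=['a', 'b', 'c'])


df1, df2, df3 = make_frame(), make_frame(), make_frame()
df1.loc[3, 'a'] = np.nan
df2.loc[1, 'b'] = np.nan
df3.loc[0, 'c'] = np.nan

df1, df2, df3 = sync_nans([df1, df2, df3])
print(df1)
```

Because the mask is computed once and applied to every frame, all returned frames end up with NaNs in identical positions.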
Building on the answer provided by piRSquared, I was able to extend it to DataFrames of different shapes. Here is the function:
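def set_nans_over_every_df(df_list):
    # Collect the union of all row labels and all column labels.
    complete_index = sorted(set([idx for df in df_list for idx in df.index]))
    complete_columns = sorted(set([idx for df in df_list for idx in df.columns]))
    # Align every frame to the full index/columns; missing cells become NaN.
    df_list = [df.reindex(index=complete_index, columns=complete_columns) for df in df_list]
    # A cell is masked if it is NaN in at least one of the aligned frames.
    mask = np.isnan(np.stack([df.values for df in df_list])).any(0)
    # Set the masked cells to NaN in every frame.
    df_list = [df.mask(mask) for df in df_list]
    return df_list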
And an example using DataFrames of different shapes:
df1 = pd.DataFrame(np.reshape(np.arange(15), (5,3)), index=[0,1,2,3,4], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), index=[0,1,2,3], columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(16), (4,4)), index=[0,1,2,3], columns=['a', 'b', 'c', 'd'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
df1, df2, df3 = set_nans_over_every_df([df1, df2, df3])
print(df1)
     a     b     c    d
0  0.0   1.0   NaN  NaN
1  3.0   NaN   5.0  NaN
2  6.0   7.0   8.0  NaN
3  NaN  10.0  11.0  NaN
4  NaN   NaN   NaN  NaN
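One limitation worth noting: `np.isnan` raises a `TypeError` on object-dtype arrays, so the function above assumes purely numeric frames. A possible variant, sketched here under the assumption that some columns may hold strings, builds the mask with `DataFrame.isna` instead. The name `set_nans_over_every_df_isna` and the sample frames `left` and `right` are hypothetical.

```python
from functools import reduce

import numpy as np
import pandas as pd


def set_nans_over_every_df_isna(df_list):
    """Hypothetical variant that also tolerates non-numeric columns."""
    # Index.union gives the combined row and column labels.
    full_index = reduce(lambda a, b: a.union(b), (df.index for df in df_list))
    full_columns = reduce(lambda a, b: a.union(b), (df.columns for df in df_list))
    # Align every frame to the combined labels; missing cells become NaN.
    aligned = [df.reindex(index=full_index, columns=full_columns) for df in df_list]
    # isna works for any dtype, unlike np.isnan.
    mask = reduce(lambda a, b: a | b, (df.isna() for df in aligned))
    return [df.mask(mask) for df in aligned]


left = pd.DataFrame({'a': [1.0, np.nan], 'label': ['x', 'y']})
right = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'label': ['p', None, 'r']})

left_out, right_out = set_nans_over_every_df_isna([left, right])
print(left_out)
print(right_out)
```

Only row 0 survives in both outputs, since row 1 has a missing value in at least one frame and row 2 does not exist in `left`.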