How about using Index.intersection()?
In [69]: df[df.columns.intersection(col_list)]
Out[69]:
     b    c
0  1.0  1.0
1  1.0  1.0
In [70]: df.columns
Out[70]: Index(['a', 'b', 'c'], dtype='object')
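For reference, here is a minimal, self-contained version of the snippet above; the toy df is an assumption, reconstructed from the Out[] blocks (two rows of 1.0 in columns a, b, c):

```python
import pandas as pd

# Toy frame reconstructed from the output above (an assumption)
df = pd.DataFrame({'a': [1.0, 1.0], 'b': [1.0, 1.0], 'c': [1.0, 1.0]})

# 'd' is deliberately not a column of df
col_list = list('bcd')

# Index.intersection() silently drops the labels that don't exist,
# so this never raises a KeyError for the missing 'd'
result = df[df.columns.intersection(col_list)]
print(result.columns.tolist())  # ['b', 'c']
```

This is the main appeal over plain df[col_list], which raises a KeyError when any label in the list is missing.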
Timing:
In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True)
In [22]: df_.shape
Out[22]: (200000, 3)
In [23]: df.columns
Out[23]: Index(['a', 'b', 'c'], dtype='object')
In [24]: col_list = list('bcd')
In [28]: %timeit df_[df_.columns.intersection(col_list)]
100 loops, best of 3: 6.24 ms per loop
In [29]: %timeit df_[[col for col in col_list if col in df_.columns]]
100 loops, best of 3: 5.69 ms per loop
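The benchmark above can be reproduced with the stdlib timeit module; a sketch, with the frame sizes taken from the transcript (absolute numbers will differ by machine):

```python
import timeit
import pandas as pd

# Rebuild the benchmark frame: 200,000 rows x 3 columns, as in In [21]
df = pd.DataFrame({'a': 1.0, 'b': 1.0, 'c': 1.0}, index=[0, 1])
df_ = pd.concat([df] * 10**5, ignore_index=True)
col_list = list('bcd')

# Sanity check: both selections keep the same columns, only speed differs
a = df_[df_.columns.intersection(col_list)]
b = df_[[col for col in col_list if col in df_.columns]]
assert list(a.columns) == list(b.columns) == ['b', 'c']

# Timing harness (number=20 keeps the run short)
t_intersection = timeit.timeit(
    lambda: df_[df_.columns.intersection(col_list)], number=20)
t_listcomp = timeit.timeit(
    lambda: df_[[col for col in col_list if col in df_.columns]], number=20)
```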
Let's check it on a transposed DF (3 rows, 200K columns):
In [30]: t = df_.T
In [31]: t.shape
Out[31]: (3, 200000)
In [32]: t
Out[32]:
     0    1    2    3    4  ...  199995  199996  199997  199998  199999
a  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
b  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
c  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
[3 rows x 200000 columns]
In [33]: col_list = [-10, -20, 10, 20, 100]
In [34]: %timeit t[t.columns.intersection(col_list)]
10 loops, best of 3: 52.8 ms per loop
In [35]: %timeit t[[col for col in col_list if col in t.columns]]
10 loops, best of 3: 103 ms per loop
Conclusion: the list comprehension almost always wins when the column axis is small, while the Pandas / NumPy approach (Index.intersection) wins once the number of columns gets large ...