Select a DataFrame subset with a list of column names, where not all columns are in the DataFrame

Question

Consider the DataFrame df:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df

col_list = list('bcd')

df[col_list]

raises an error:

KeyError: "['d'] not in index"

How do I select as many of the listed columns as actually exist in df?

+4

python numpy pandas

piRSquared Oct 25 '16 at 17:29

3 answers

What about:

df[[col for col in list('bcd') if col in df.columns]]

This gives:

     b    c
0  1.0  1.0
1  1.0  1.0
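
The same filter can be wrapped in a small reusable function. This is only a minimal sketch: the name `select_existing` is hypothetical and is not part of pandas. It keeps the columns in the order given by the requested list and silently skips names that are missing.

```python
import numpy as np
import pandas as pd


def select_existing(frame, wanted):
    """Return the columns of `frame` named in `wanted`, skipping missing names.

    Hypothetical helper for illustration; result order follows `wanted`.
    """
    present = [col for col in wanted if col in frame.columns]
    return frame[present]


df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
result = select_existing(df, list('bcd'))
print(result.columns.tolist())
```

Because the comprehension iterates over the requested names, asking for `['c', 'a', 'z']` would return the columns in the order `c`, `a`.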

+5

Vidhya g Oct 25 '16 at 17:40

Use Index.isin:

In [4]:
col_list = list('bcd')
df.loc[:, df.columns.isin(col_list)]  # .loc replaces .ix, which was removed in pandas 1.0

Out[4]:
   b  c
0  1  1
1  1  1
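
One difference from the list comprehension is worth illustrating: a boolean mask built with `isin` selects columns in the DataFrame's own order, not in the order of the requested list. The sketch below uses `.loc`, which works on current pandas versions, and a deliberately shuffled `col_list` chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = ['c', 'd', 'b']  # deliberately not in df's column order

# Boolean mask with one entry per column of df: True where the name is wanted.
mask = df.columns.isin(col_list)

by_mask = df.loc[:, mask]  # columns come out in df's order
by_list = df[[c for c in col_list if c in df.columns]]  # order of col_list

print(by_mask.columns.tolist())
print(by_list.columns.tolist())
```

If the order of the requested list matters, the comprehension is the safer choice; if only membership matters, the mask is enough.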

Timings:

In [5]:
df_ = pd.concat([df] * 10**5, ignore_index=True)
%timeit df_[df_.columns.intersection(col_list)]
%timeit df_[[col for col in col_list if col in df_.columns]]
%timeit df_.ix[:,df_.columns.isin(col_list)]  # .ix was removed in pandas 1.0; use .loc on current versions

100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 18.6 ms per loop
10 loops, best of 3: 26.6 ms per loop

+1

EdChum 25 . '16 22:48

Maxu · Accepted Answer · 2016-10-25T17:30:42+0000

How about using Index.intersection()?

In [69]: df[df.columns.intersection(col_list)]
Out[69]:
     b    c
0  1.0  1.0
1  1.0  1.0

In [70]: df.columns
Out[70]: Index(['a', 'b', 'c'], dtype='object')  # <---------- Index
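
As a self-contained sketch of the same approach outside an IPython session: `df.columns` is an `Index`, so `intersection` returns another `Index` holding only the labels present on both sides, and that result can be passed straight to `df[...]`. The variable names `common` and `subset` are chosen for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# Labels that appear both in df.columns and in col_list; 'd' is dropped.
common = df.columns.intersection(col_list)

# Indexing with an Index of existing labels cannot raise a KeyError.
subset = df[common]

print(list(common))
print(subset.shape)
```

When none of the requested names exist, `intersection` returns an empty `Index` rather than raising.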

Timing:

In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True)

In [22]: df_.shape
Out[22]: (200000, 3)

In [23]: df.columns
Out[23]: Index(['a', 'b', 'c'], dtype='object')

In [24]: col_list = list('bcd')

In [28]: %timeit df_[df_.columns.intersection(col_list)]
100 loops, best of 3: 6.24 ms per loop

In [29]: %timeit df_[[col for col in col_list if col in df_.columns]]
100 loops, best of 3: 5.69 ms per loop

Now check it on the transposed DataFrame (3 rows, 200K columns):

In [30]: t = df_.T

In [31]: t.shape
Out[31]: (3, 200000)

In [32]: t
Out[32]:
   0       1       2       3       4        ...    199995  199996  199997  199998  199999
a     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0
b     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0
c     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0

[3 rows x 200000 columns]

In [33]: col_list=[-10, -20, 10, 20, 100]

In [34]: %timeit t[t.columns.intersection(col_list)]
10 loops, best of 3: 52.8 ms per loop

In [35]: %timeit t[[col for col in col_list if col in t.columns]]
10 loops, best of 3: 103 ms per loop

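Conclusion: the list comprehension almost always wins for smaller lists, and pandas/NumPy wins for larger datasets. In the timings above, the comprehension is slightly faster with 3 columns (5.69 ms vs 6.24 ms), while Index.intersection is about twice as fast with 200K columns (52.8 ms vs 103 ms).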
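
To repeat the comparison without IPython's `%timeit`, the standard-library `timeit` module can be used. This is a rough sketch: the frame is much smaller than the one in the answer (2,000 columns instead of 200K) so that it runs quickly, the function names are made up for this example, and absolute numbers will vary with the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
# Transposed frame: 3 rows, 2000 integer-labelled columns (0..1999).
wide = pd.concat([df] * 1000, ignore_index=True).T
col_list = [-10, -20, 10, 20, 100]  # two of these labels do not exist


def with_intersection():
    return wide[wide.columns.intersection(col_list)]


def with_comprehension():
    return wide[[col for col in col_list if col in wide.columns]]


runs = 100
for func in (with_intersection, with_comprehension):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```

Increasing the multiplier in `pd.concat` makes it possible to see where, on a given setup, one approach starts to overtake the other.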
