Select pandas DataFrame columns from a list

In one of my scripts, I select multiple columns of a dataframe using a list of column names. The following code works:

data = df[lst]

It works fine as long as all list items are included in the dataframe. If this is not the case, it raises a KeyError: "... not in index".

Is it possible to still select all columns whose column names are included in this list, even if not all list items are included in the dataframe?
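For reference, the failure described above can be reproduced with a small frame. This is only an illustrative sketch: the frame contents and the missing label 'R' are assumptions chosen for the example, not the asker's actual data.

```python
import pandas as pd

# Illustrative data; 'R' is deliberately not a column of df.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
lst = ['A', 'R', 'B']

error = None
try:
    data = df[lst]  # list selection requires every label to exist
except KeyError as exc:
    error = exc
    print(f"KeyError: {exc}")
```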

4 answers

I think you need Index.intersection:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})

print (df)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

lst = ['A','R','B']

print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')

data = df[df.columns.intersection(lst)]
print (data)
   A  B
0  1  4
1  2  5
2  3  6
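The same idea can be wrapped in a small helper. This is a minimal sketch: the name select_existing is hypothetical (it is not a pandas function), and it relies only on Index.intersection as used above.

```python
import pandas as pd


def select_existing(df, names):
    """Return the columns of df whose labels appear in names.

    Hypothetical helper, not part of pandas. Labels that are missing
    from df are silently ignored instead of raising a KeyError.
    """
    common = df.columns.intersection(names)
    return df[common]


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
data = select_existing(df, ['A', 'R', 'B'])
print(data)
```

Silently dropping missing names is convenient, but it can also hide typos in the list, so it may be worth logging the names that were skipped.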

Another solution with numpy.intersect1d:

data = df[np.intersect1d(df.columns, lst)]
print (data)
   A  B
0  1  4
1  2  5
2  3  6
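One thing to keep in mind with numpy.intersect1d is that it returns the common values sorted, so the resulting column order follows neither the dataframe nor the list. A small sketch with illustrative data (the column order here is an assumption chosen to make the effect visible):

```python
import numpy as np
import pandas as pd

# Columns deliberately not in alphabetical order.
df = pd.DataFrame({'C': [7, 8, 9], 'A': [1, 2, 3], 'B': [4, 5, 6]})
lst = ['C', 'R', 'A']

common = np.intersect1d(df.columns, lst)  # sorted, unique common labels
data = df[common]
print(list(data.columns))  # ['A', 'C']
```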

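A few other ways; the list comprehension is much faster. Note that using & on an Index as a set intersection was deprecated in later pandas releases, so df.columns.intersection(lst) is the safer spelling there.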

In [1357]: df[df.columns & lst]
Out[1357]:
   A  B
0  1  4
1  2  5
2  3  6

In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
   A  B
0  1  4
1  2  5
2  3  6
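The list comprehension also makes it easy to control the column order and to see which names were skipped. A minimal sketch with illustrative data; the variable names wanted, in_frame_order, in_list_order and missing are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
lst = ['B', 'R', 'A']

# A set avoids rescanning the whole list for every column.
wanted = set(lst)

in_frame_order = [c for c in df.columns if c in wanted]  # ['A', 'B']
in_list_order = [c for c in lst if c in df.columns]      # ['B', 'A']
missing = [c for c in lst if c not in df.columns]        # ['R']

data = df[in_list_order]
print(data)
print("skipped:", missing)
```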

Timings

In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop

In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop

In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop

In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop

More details

In [1365]: df
Out[1365]:
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

In [1366]: lst
Out[1366]: ['A', 'R', 'B']
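The timings above come from IPython's %timeit. Outside IPython, a rough equivalent with the standard-library timeit module might look like the sketch below. The candidates dictionary and the choice of 1000 repetitions are assumptions for illustration, and the numbers depend on the machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})
lst = ['A', 'R', 'B']

# Each entry computes the common column labels in a different way.
candidates = {
    'list comprehension': lambda: [c for c in df.columns if c in lst],
    'Index.intersection': lambda: df.columns.intersection(lst),
    'np.intersect1d': lambda: np.intersect1d(df.columns, lst),
}

results = {}
for name, func in candidates.items():
    # Total seconds for 1000 calls, converted to microseconds per call.
    seconds = timeit.timeit(func, number=1000)
    results[name] = seconds / 1000 * 1e6
    print(f"{name}: {results[name]:.2f} µs per loop")
```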

Given a dataframe a:

a

Out[5]: 
    a  b   c
0   1  2   3
1  12  3  44

Define a list of column names x:

x = ['a','b']

Select the columns:

a[x]

Out[7]: 
    a  b
0   1  2
1  12  3

Timing:

%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

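Use * with the list:

data = df[[*lst]]

Note that [*lst] only unpacks the list into a new list, so this is equivalent to df[lst]: it gives the desired result only when every name in the list is a column of the dataframe, and it still raises a KeyError otherwise.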


Source: https://habr.com/ru/post/1660970/
