Find the set of column labels of the non-zero values in each row of a pandas DataFrame

Is there a good way to find the set of column labels of the non-zero values in each row of a pandas DataFrame? Or do I have to iterate over the rows one by one?

For example, given the data frame

 c1  c2  c3  c4  c5  c6  c7  c8  c9
  1   1   0   0   0   0   0   0   0
  1   0   0   0   0   0   0   0   0
  0   1   0   0   0   0   0   0   0
  1   0   0   0   0   0   0   0   0
  0   1   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0
  0   2   1   1   1   1   1   0   2
  1   5   5   0   0   1   0   4   6
  4   3   0   1   1   1   1   5  10
  3   5   2   4   1   2   2   1   3
  6   4   0   1   0   0   0   0   0
  3   9   1   0   1   0   2   1   0

the expected output would be

 ['c1', 'c2']
 ['c1']
 ['c2']
 ...
3 answers

It seems you do need to go through the DataFrame row by row:

 cols = df.columns
 bt = df.apply(lambda x: x > 0)
 bt.apply(lambda x: list(cols[x.values]), axis=1)

which gives:

 0                                 [c1, c2]
 1                                     [c1]
 2                                     [c2]
 3                                     [c1]
 4                                     [c2]
 5                                       []
 6             [c2, c3, c4, c5, c6, c7, c9]
 7                 [c1, c2, c3, c6, c8, c9]
 8         [c1, c2, c4, c5, c6, c7, c8, c9]
 9     [c1, c2, c3, c4, c5, c6, c7, c8, c9]
 10                            [c1, c2, c4]
 11                [c1, c2, c3, c5, c7, c8]
 dtype: object
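The two-step version above can also be collapsed into a single apply. A minimal sketch, using a shortened version of the example frame (the full 12-row frame from the question behaves the same way):

```python
import pandas as pd

# A small stand-in for the question's frame (first four rows, three columns).
df = pd.DataFrame(
    [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]],
    columns=['c1', 'c2', 'c3'],
)

# For each row, keep the column labels where the value is non-zero.
result = df.apply(lambda row: list(df.columns[(row > 0).values]), axis=1)
print(result.tolist())  # [['c1', 'c2'], ['c1'], ['c2'], []]
```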

If performance matters, try passing raw=True when creating the boolean DataFrame, as shown below:

 %timeit df.apply(lambda x: x > 0, raw=True).apply(lambda x: list(cols[x.values]), axis=1)
 1000 loops, best of 3: 812 µs per loop

This gives noticeably better performance. For comparison, here is the result with raw=False (the default):

 %timeit df.apply(lambda x: x > 0).apply(lambda x: list(cols[x.values]), axis=1)
 100 loops, best of 3: 2.59 ms per loop
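The first apply can even be skipped entirely, since comparison against a scalar is already vectorized in pandas. A sketch, again assuming a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 0], [0, 2, 0]], columns=['c1', 'c2', 'c3'])

# df > 0 builds the whole boolean frame in one vectorized step,
# so only the row-wise apply that collects the labels remains.
bt = df > 0
out = bt.apply(lambda x: list(df.columns[x.values]), axis=1)
print(out.tolist())  # [['c1', 'c2'], ['c2']]
```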

How about this approach?

 import numpy as np

 # create a True/False data frame
 df_boolean = df > 0

 # a little helper that uses boolean slicing internally
 def bar(x, columns):
     return ','.join(list(columns[x]))

 # apply along the column axis
 df_boolean['result'] = df_boolean.apply(lambda x: bar(x, df_boolean.columns), axis=1)

 # filter out the empty "rows" and grab the result column
 df_result = df_boolean[df_boolean['result'] != '']['result']

 # append an axis, just so each line will output a list
 lst_result = df_result.values[:, np.newaxis]
 print('\n'.join(str(myelement) for myelement in lst_result))

and it gives:

 ['c1,c2']
 ['c1']
 ['c2']
 ['c1']
 ['c2']
 ['c2,c3,c4,c5,c6,c7,c9']
 ['c1,c2,c3,c6,c8,c9']
 ['c1,c2,c4,c5,c6,c7,c8,c9']
 ['c1,c2,c3,c4,c5,c6,c7,c8,c9']
 ['c1,c2,c4']
 ['c1,c2,c3,c5,c7,c8']
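Note that this answer produces comma-joined strings rather than real lists. If actual lists are wanted, they can be recovered with a string split. A sketch, assuming a Series shaped like `df_result` above:

```python
import pandas as pd

# Stand-in for df_result: one comma-joined string per surviving row.
s = pd.Series(['c1,c2', 'c1', 'c2'])
lists = s.str.split(',')
print(lists.tolist())  # [['c1', 'c2'], ['c1'], ['c2']]
```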

A potentially better data structure (rather than a Series of lists) is the stacked form:

 In [11]: res = df[df != 0].stack()

 In [12]: res
 Out[12]:
 0  c1    1
    c2    1
 1  c1    1
 2  c2    1
 3  c1    1
 ...

and you can look up the original rows directly:

 In [13]: res.loc[0]
 Out[13]:
 c1    1
 c2    1
 dtype: float64

 In [14]: res.loc[0].index
 Out[14]: Index(['c1', 'c2'], dtype='object')
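If you do still want one list per row, the stacked form can be grouped back together. A sketch, assuming a small frame; `level_0` and `level_1` are simply the default names `reset_index` gives the two unnamed index levels (row label and column label):

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 0], [1, 0, 0], [0, 0, 0]], columns=['c1', 'c2', 'c3'])

# Zeros become NaN and are dropped by stack, leaving (row, column) pairs.
res = df[df != 0].stack()

# Group the column labels (inner level) by the original row (outer level).
lists = res.reset_index().groupby('level_0')['level_1'].apply(list)
print(lists.to_dict())  # {0: ['c1', 'c2'], 1: ['c1']}
```

Note that all-zero rows (like row 2 here) simply disappear, matching the stacked representation.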

Note: I thought you could return a list from apply (to build a Series of list elements), but that no longer seems to be the case.


Source: https://habr.com/ru/post/1232210/
