Count in each row of data, then create the column with the most frequent

Question

I am trying to compare three floats in each row of a data frame (500000x3). I expect the three values to be the same, or at least two of them to match. I want to choose the value that occurs most often in each row, assuming the three are never all different. My current attempt with a toy example is as follows:

mydf
   a  b  c
0  1  1  2
1  3  3  3
2  1  3  3
3  4  5  4
3  4  5  5



mydft = mydf.transpose()
counts = []
# after transposing, each column of mydft is one row of mydf
for col in mydft:
    counts.append(mydft[col].value_counts())

I then thought about iterating over the per-row counts and picking the value with the highest count for each row, but that is very slow and feels un-pandas-like. I also tried this:

truth = mydf['a'] == mydf['b']

with the intention of keeping the rows that evaluate to True and doing something else with those that don't, but I have 1000 NaN values in the real data, and apparently NaN == NaN evaluates to False there. Any suggestions?

+4

python vectorization pandas

seanysull Dec 14 '17 at 14:34

2 answers

You can use pd.get_dummies, as suggested by @coldspeed, i.e.

dummies = pd.get_dummies(df.astype(str)).groupby(by=lambda x: x.split('_')[1], axis=1).sum()

df['new'] = dummies.idxmax(1)

   a  b  c new
0  1  1  2   1
1  3  3  3   3
2  1  3  3   3
3  4  5  4   4
3  4  5  5   5

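Explanation:

Since the values are numeric, first convert them to strings and pass them to pd.get_dummies; get_dummies then creates one indicator column per column/value pair.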

pd.get_dummies(df.astype(str))

   a_1  a_3  a_4  b_1  b_3  b_5  c_2  c_3  c_4  c_5
0    1    0    0    1    0    0    1    0    0    0
1    0    1    0    0    1    0    0    1    0    0
2    1    0    0    0    1    0    0    1    0    0
3    0    0    1    0    0    1    0    0    1    0
3    0    0    1    0    0    1    0    0    0    1

Then group the indicator columns by the value after the underscore and sum each group, i.e.

   1  2  3  4  5
0  2  1  0  0  0
1  0  0  3  0  0
2  1  0  2  0  0
3  0  0  0  2  1
3  0  0  0  1  2

idxmax(axis=1) then returns, for each row, the column label with the largest count, which is the most frequent value.

0    1
1    3
2    3
3    4
3    5
dtype: object
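In recent pandas releases, groupby with axis=1 is deprecated, so the steps above can also be sketched with a transpose. This is only a minimal sketch: it assumes a toy frame named df with a default 0..4 index, and the names dummies and counts are just illustrative.

```python
import pandas as pd

# Toy data matching the question (a default 0..4 index is assumed here).
df = pd.DataFrame({"a": [1, 3, 1, 4, 4],
                   "b": [1, 3, 3, 5, 5],
                   "c": [2, 3, 3, 4, 5]})

# One indicator column per column/value pair, e.g. "a_1", "b_3".
dummies = pd.get_dummies(df.astype(str), dtype=int)

# Group the indicator columns by the value after the underscore and sum.
# Transposing first avoids calling groupby with axis=1.
counts = dummies.T.groupby(lambda label: label.split("_")[1]).sum().T

# Label of the largest count in each row; the labels are strings here.
df["new"] = counts.idxmax(axis=1)
print(df)
```

Note that the result holds strings, so a final astype(float) (or similar) would be needed to get numbers back.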

Note:

Besides get_dummies, the row-wise most frequent value can also be computed with scipy mode or pandas mode.
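The pandas mode mentioned above can be sketched as follows. This is a minimal sketch on a toy frame with one NaN added for illustration (the NaN is an assumption, not part of the original example). DataFrame.mode drops NaN by default, which may help with the NaN problem from the question, though it works row by row and may be slow on 500000 rows.

```python
import numpy as np
import pandas as pd

# Toy float data; the NaN in the last row is added for illustration only.
df = pd.DataFrame({"a": [1.0, 3.0, 1.0, 4.0, np.nan],
                   "b": [1.0, 3.0, 3.0, 5.0, 5.0],
                   "c": [2.0, 3.0, 3.0, 4.0, 5.0]})

# Row-wise mode; NaN values are ignored because dropna=True is the default.
row_modes = df.mode(axis=1)

# Column 0 holds the first mode of each row (the smallest one if there is a tie).
df["new"] = row_modes[0]
print(df)
```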

+2

Dark Dec 14 '17 at 14:47

Wen · Accepted Answer · 2017-12-14T15:07:45+0000

Use mode from scipy.stats:

from scipy import stats


value,count=stats.mode(df.values,axis=1)
value
Out[180]: 
array([[1],
       [3],
       [3],
       [4],
       [5]], dtype=int64)


count
Out[181]: 
array([[2],
       [3],
       [2],
       [2],
       [2]])

df['new']=value
df
Out[183]: 
   a  b  c  new
0  1  1  2    1
1  3  3  3    3
2  1  3  3    3
3  4  5  4    4
3  4  5  5    5
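Depending on the SciPy version, stats.mode may return the values as a column array (as in the output above) or as a flat array. A minimal sketch that flattens the result either way, assuming the same toy frame with a default index:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data matching the question (a default 0..4 index is assumed here).
df = pd.DataFrame({"a": [1, 3, 1, 4, 4],
                   "b": [1, 3, 3, 5, 5],
                   "c": [2, 3, 3, 4, 5]})

# Most frequent value and its count for each row.
value, count = stats.mode(df.to_numpy(), axis=1)

# np.ravel gives a 1-D array whether the result has shape (n, 1) or (n,).
df["new"] = np.ravel(value)
print(df)
```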
