Probability of a value in pandas

I am trying to find the probability of a given word within a DataFrame, but my current setup raises AttributeError: 'Series' object has no attribute 'columns'. I hope you can help me find where the error is.

I start with a DataFrame that looks like the one below and transform it to get a total count for each individual word using the following function.

query          count
foo bar        10
super          8 
foo            4
super foo bar  2

Function below:

def _words(df):
    return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

The result is the Series below (note that 'foo' is 16 because it appears 16 times across df):

bar      12
foo      16
super    10
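
For anyone who wants to reproduce this, here is a minimal, self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the variable name `word_counts` is my own, not from the original post.

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

def _words(df):
    # One 0/1 indicator column per word, then weight each row by its count.
    return df["query"].str.get_dummies(sep=" ").T.dot(df["count"])

word_counts = _words(df)  # hypothetical name, used only in this sketch
print(word_counts)
```

The result is a Series indexed by word, so a single word's total is looked up with `word_counts['foo']` rather than through a column.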

The problem occurs when I try to find the probability of a given keyword within df, which currently has no column names. Below is what I'm working with right now, but it throws AttributeError: 'Series' object has no attribute 'columns'.

def _probability(df, query):
  return df[query] / df.groupby['count'].sum()

Ideally, _probability(df, 'foo') would return 0.421052632 (16/(12 + 16 + 10)). Thanks!


You can chain a pipe call that divides by the total:

df['query'].str.get_dummies(sep=' ').T.dot(df['count']).pipe(lambda x: x / x.sum())

bar      0.315789
foo      0.421053
super    0.263158
dtype: float64

Or, using NumPy:

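import numpy as np
import pandas as pd

q = df['query'].values.astype(str)
# repeat each row's count once per word in that row
c = df['count'].values.repeat(np.char.count(q, ' ') + 1)
# integer codes and unique words, in order of first appearance
f, u = pd.factorize(np.array(' '.join(q.tolist()).split()))
b = np.bincount(f, c)  # weighted count per word
pd.Series(b / b.sum(), u)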

foo      0.421053
bar      0.315789
super    0.263158
dtype: float64

IIUC:

In [111]: w = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

In [112]: w
Out[112]:
bar      12
foo      16
super    10
dtype: int64

In [113]: w/df['count'].sum()
Out[113]:
bar      0.500000
foo      0.666667
super    0.416667
dtype: float64

Or, as per-row shares joined back onto the original DataFrame:

In [135]: df.join(df['query'].str.get_dummies(sep=' ') \
            .mul(df['count'], axis=0).div(df['count'].sum()))
Out[135]:
           query  count       bar       foo     super
0        foo bar     10  0.416667  0.416667  0.000000
1          super      8  0.000000  0.000000  0.333333
2            foo      4  0.000000  0.166667  0.000000
3  super foo bar      2  0.083333  0.083333  0.083333
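
One thing worth noting about dividing by df['count'].sum(): that total is 24, the number of queries, so the result is the share of queries containing each word and the values do not add up to 1. Dividing by the summed word counts (38) gives the 16/38 figure the question asks for. A small sketch comparing the two, with the example table rebuilt by hand and the names `per_query` and `per_word` being my own:

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})
w = df["query"].str.get_dummies(sep=" ").T.dot(df["count"])

# Share of queries containing each word: denominator is 10 + 8 + 4 + 2 = 24.
per_query = w / df["count"].sum()

# Share of all word occurrences: denominator is 12 + 16 + 10 = 38.
per_word = w / w.sum()

print(per_query["foo"])  # 16 / 24
print(per_word["foo"])   # 16 / 38
```

Which denominator is right depends on what "probability" is supposed to mean; the question's expected value corresponds to `per_word`.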

Why not pass the new data structure to the function?

df1 = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

def _probability(df, query):
    return df[query] / df.sum()

_probability(df1, 'foo')

You get

0.42105263157894735
df['query']=df['query'].str.split(' ')    
df.set_index('count')['query'].apply(pd.Series).stack().reset_index().groupby(0)['count'].sum()
Out[491]: 
0
bar      12
foo      16
super    10
Name: count, dtype: int64
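
On pandas 0.25 or later, the same split-and-sum idea can probably be written more directly with `explode`. This is a sketch of an alternative, not the answerer's code; the column name `word` and the variable names are hypothetical, and the example table is rebuilt by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# Split each query into a list of words, then give every word its own row.
exploded = df.assign(word=df["query"].str.split(" ")).explode("word")

# Sum the counts per word and normalise to probabilities.
totals = exploded.groupby("word")["count"].sum()
probabilities = totals / totals.sum()
print(probabilities)
```

Because it uses `assign`, this version leaves the original `query` column untouched instead of overwriting it with lists.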

I think the mistake is in groupby: it is a method, so it has to be called with parentheses rather than indexed with square brackets.

Try:

def _probability(df, query): 
    return df[query] / df.groupby('count').sum()
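
To illustrate the point about brackets: since `groupby` is a method, indexing it with square brackets fails before any grouping happens. The sketch below, with the example table rebuilt by hand, captures the exception raised in that case (a TypeError; the exact message may vary between Python versions). The names `error` and `grouped` are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

error = None
try:
    df.groupby["count"]  # square brackets index the method object itself
except TypeError as exc:
    error = exc
    print(type(exc).__name__, exc)

# With parentheses the call succeeds and returns a GroupBy object.
grouped = df.groupby("count")
print(grouped.size())
```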

Source: https://habr.com/ru/post/1687175/

