Pivot a data frame to a frequency list with two column variables in Python

I have a data frame consisting of three columns: a node, a component, and a preceding word. The node column contains many identical values (in alphabetical order), the component column also contains many identical values but in scrambled order, and the preceding word can be any kind of word, though some repeat.

Now I want to create a kind of crosstab / frequency table that shows the frequency of the component and of the preceding word associated with each node.

Let's say this is my df:

node    precedingWord comp
banana  the           lel
banana  a             lel
banana  a             lal
coconut some          lal
coconut few           lil
coconut the           lel

I expect a frequency list showing, for each unique node, the number of times a value meeting the matching criteria is found in the other columns, for example

det1 = a
det2 = the
comp1 = lel
comp2 = lil
comp3 = lal

expected output:

node    det1  det2 unspecified comp1 comp2 comp3
banana  2     1    0           2     0     1
coconut 0     1    0           1     1     1

This is what I have so far for the determiner part (comp would work the same way):

det1 = ["a"]
det2 = ["the"]

# Tag each row with its determiner class.
# NOTE(review): the sample frame above uses the column name `precedingWord`;
# `preceding_word` here must match the real column name — confirm.
df.loc[df.preceding_word.isin(det1), "determiner"] = "det1"
df.loc[df.preceding_word.isin(det2), "determiner"] = "det2"
# Rows matching neither list fall into the catch-all bucket; `~mask` is the
# idiomatic boolean negation (comparing the mask to 0 works but is obscure).
df.loc[~df.preceding_word.isin(det1 + det2), "determiner"] = "unspecified"

# Create crosstab of the node and determiner
freqDf = pd.crosstab(df.node, df.determiner)

This works, but it feels clumsy — three separate loc assignments — and I am not sure it is the right approach.


In my real data the column is called "precedingWord" and the derived category is "gender": neuter, non_neuter, or unspecified.

def frequency_list():
    """Tag each row with a gender class derived from `preceding_word`,
    then write a node-by-category frequency table to CSV.

    Relies on module-level `df`, `pd` and `np`; writes
    dataset/py-frequencies.csv as a side effect and returns None.
    """
    # Define content of gender classes
    neuter = ["het"]
    non_neuter = ["de"]

    # Add `gender` column to df
    df.loc[df.preceding_word.isin(neuter), "gender"] = "neuter"
    df.loc[df.preceding_word.isin(non_neuter), "gender"] = "non_neuter"
    # `~mask` is the idiomatic negation (the original compared the boolean
    # Series to 0, which works but is obscure).
    df.loc[~df.preceding_word.isin(neuter + non_neuter), "gender"] = "unspecified"

    g = df.groupby("node")

    # Stack the two per-node count tables; fill the holes with 0 so the
    # output contains counts, not NaN.
    # NOTE(review): this concatenates along the rows (each node appears
    # twice); pass axis=1 to pd.concat to put the tables side by side.
    freqDf = pd.concat(
        [g["component"].value_counts().unstack(1),
         g["gender"].value_counts().unstack(1)]
    ).fillna(0)

    # Re-number the rows starting from 1 instead of the default 0.
    # (Crosstab-like frames carry `node` as the index, so move it into a
    # column first with `reset_index`, then overwrite the fresh index.)
    freqDf.reset_index(inplace=True)
    freqDf.index = np.arange(1, len(freqDf) + 1)

    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

When I run this, the result looks like this:

enter image description here

  • The two tables are stacked below one another instead of merged side by side: first the comp (component) columns, then the gender columns.
  • NaN values should be shown as 0.
  • The columns should get generic names (det1, det2, unspecified, comp1, ...).

What I expect instead:

enter image description here

I hope someone can point me in the right direction. Thanks!

+4
2

Answer: you can use crosstab:

In [11]: df1 = pd.crosstab(df['node'], df['precedingWord'])

In [12]: df1
Out[12]:
precedingWord  a  few  some  the
node
banana         2    0     0    1
coconut        0    1     1    1

In [13]: df2 = pd.crosstab(df['node'], df['comp'])

(and likewise for comp).

Then concat them along axis=1 (i.e. side by side, not stacked):

In [14]: pd.concat([df1, df2], axis=1, keys=['precedingWord', 'comp'])
Out[14]:
        precedingWord              comp
                    a few some the  lal lel lil
node
banana              2   0    0   1    1   2   0
coconut             0   1    1   1    1   1   1

If you don't want the hierarchical columns (the MultiIndex), you can drop the keys argument, but then column names from the two tables can collide (which is why the keys are useful):

In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
         a  few  some  the  lal  lel  lil
node
banana   2    0     0    1    1    2    0
coconut  0    1     1    1    1    1    1

You can then rename the resulting columns afterwards (concat itself has no kwarg for that), ...


An alternative, using value_counts:

In [21]: g = df.groupby("node")

In [22]: g["comp"].value_counts()
Out[22]:
node     comp
banana   lel     2
         lal     1
coconut  lal     1
         lel     1
         lil     1
dtype: int64

In [23]: g["precedingWord"].value_counts()
Out[23]:
node     precedingWord
banana   a                2
         the              1
coconut  few              1
         some             1
         the              1
dtype: int64

Concatenating these (note this stacks them, leaving NaN holes):

In [24]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)])
Out[24]:
          a  few  lal  lel  lil  some  the
node
banana  NaN  NaN    1    2  NaN   NaN  NaN
coconut NaN  NaN    1    1    1   NaN  NaN
banana    2  NaN  NaN  NaN  NaN   NaN    1
coconut NaN    1  NaN  NaN  NaN     1    1

In [25]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)]).fillna(0)
Out[25]:
         a  few  lal  lel  lil  some  the
node
banana   0    0    1    2    0     0    0
coconut  0    0    1    1    1     0    0
banana   2    0    0    0    0     0    1
coconut  0    1    0    0    0     1    1

To get the det1, det2, etc. names, rename the columns before the concat, for example:

In [31]: res = g["comp"].value_counts().unstack(1)

In [32]: res
Out[32]:
comp     lal  lel  lil
node
banana     1    2  NaN
coconut    1    1    1

In [33]: res.columns = res.columns.map({"lal": "det1", "lel": "det2", "lil": "det3"}.get)

In [34]: res
Out[34]:
         det1  det2  det3
node
banana      1     2   NaN
coconut     1     1     1

(or, generating the names mechanically):

In [41]: res = g["comp"].value_counts().unstack(1)

In [42]: res.columns = ['det%s' % i for i, _ in enumerate(df.columns)]
+3

Another answer. The question really has three parts:

  • How do I count the (node, value) pairs?
  • How do I pivot the counts into a table?
  • Do I really need loc?

Pandas has built-in tools for each of these (see the options below).

1. Counting in pandas:

# Sample data. (The last det was "the" in the original post, which
# contradicted the expected output below — fixed to "a".)
df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"],
                   "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})
# groupby(...).size() counts the rows per (det, word) pair directly,
# so no dummy `counts` column is needed.
df_counts = df.groupby(["det", "word"]).size().reset_index(name="counts")
#   det word  counts
#0    a  cat       2
#1    a  pet       3
#2  the  pet       1
# and pivot it
pivoted = df_counts.pivot(index="word", columns="det", values="counts").fillna(0)
#det   a  the
#word
#cat   2    0
#pet   3    1

With your data:

# The constructor already fills `counts` with 1, so the extra
# df["counts"] = 1 assignment in the original was redundant.
df = pd.DataFrame([['idee', 'het', 'lel', 1],
                   ['idee', 'het', 'lel', 1],
                   ['idee', 'de', 'lal', 1],
                   ['functie', 'de', 'lal', 1],
                   ['functie', 'de', 'lal', 1],
                   ['functie', 'en', 'lil', 1],
                   ['functie', 'de', 'lel', 1],
                   ['functie', 'de', 'lel', 1]],
                  columns=['node', 'precedingWord', 'comp', 'counts'])
df_counts = df.groupby(["node", "precedingWord", "comp"]).agg("count").reset_index()

df_counts
# (output corrected — the original post's comment listed wrong counts)
#      node precedingWord comp  counts
#0  functie            de  lal       2
#1  functie            de  lel       2
#2  functie            en  lil       1
#3     idee            de  lal       1
#4     idee           het  lel       2

2. Using collections.Counter

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"],
                   "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})
# .as_matrix() was removed from pandas; itertuples iterates the rows
# directly without the detour through a NumPy object array.
acounter = Counter(tuple(row) for row in df.itertuples(index=False))
# Counter({('a', 'pet'): 3, ('a', 'cat'): 2, ('the', 'pet'): 1})
# (output corrected — the original post's comment listed wrong counts)
df_counts = pd.DataFrame(
    [(det, word, count) for (det, word), count in acounter.items()],
    columns=["det", "word", "counts"])
#   det word  counts
#0    a  cat       2
#1  the  pet       1
#2    a  pet       3
df_counts.pivot(index="word", columns="det", values="counts").fillna(0)
#det   a  the
#word
#cat   2    0
#pet   3    1

In my timing, this was faster than the pandas version (52.6 ms vs 92.9 ms, roughly).

3. If you treat each row as a small text, you can also use CountVectorizer from sklearn with an ngram_range. Something like this:

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"], "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})

from sklearn.feature_extraction.text import CountVectorizer

# Build one "det word" string per row (a comprehension replaces the
# iterrows-and-append loop).
listofpairs = [" ".join(row) for _, row in df.iterrows()]

# ngram_range=(2, 2) keeps only the bigrams, i.e. the (det, word) pairs;
# the token_pattern keeps single-character tokens like "a".
countvect = CountVectorizer(ngram_range=(2, 2), min_df=0.0, token_pattern=r'(?u)\b\w+\b')
sparse_counts = countvect.fit_transform(listofpairs)

print("* input list:\n",listofpairs)
print("* array of counts:\n",sparse_counts.toarray())
print("* vocabulary [order of columns in the sparse array]:\n",countvect.vocabulary_)

# vocabulary_ maps "det word" -> column index; recover the pairs in
# column order, then sum each column for the per-pair counts.
counter_keys = [k.split(" ") for k, _ in sorted(countvect.vocabulary_.items(), key=lambda kv: kv[1])]
counter_values = np.sum(sparse_counts.toarray(), 0)

df_counts = pd.DataFrame([(x[0], x[1], y) for x, y in zip(counter_keys, counter_values)], columns=["det", "word", "counts"])

To merge the results: 1. use concat after aligning on the index: df1.set_index("node"), df2.set_index("node"), then dfout = pd.concat([df1, df2], axis=1)

2. or use merge.

About loc:

loc selects rows and columns by label (row, column) and assigns to that subset of the frame in place. There is nothing wrong with using it here.

If the isin call bothers you, you can spell out the boolean mask explicitly:

df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"

# Build the boolean mask once under a descriptive name, then assign.
indices_neutral = df["precedingWord"] == "de"
df.loc[indices_neutral, "gender"] = "neuter"  # fixed: was `indices` (undefined name)

or, in one line,

df.loc[df["precedingWord"]=="de", "gender"] = "neuter"
+1

Source: https://habr.com/ru/post/1615057/


All Articles