Pandas groupby with missing data

In a pandas DataFrame, I have a column that looks like this:

    0         M
    1         E
    2         L
    3       M.1
    4       M.2
    5       M.3
    6       E.1
    7       E.2
    8       E.3
    9       E.4
    10      L.1
    11      L.2
    12    M.1.a
    13    M.1.b
    14    M.1.c
    15    M.2.a
    16    M.3.a
    17    E.1.a
    18    E.1.b
    19    E.1.c
    20    E.2.a
    21    E.3.a
    22    E.3.b
    23    E.4.a

I need to group the values by their first element (E, M, or L). Within each group I need a subgroup for the index 1, 2, or 3, and each subgroup should contain an entry for every lowercase letter (a, b, c, ...). Ideally, the solution should work for any number of levels joined by the separator (here there are 3 levels, for example A.1.a). The desired result is:

    0  1  2
    E  1  a
          b
          c
       2  a
       3  a
          b
       4  a
    L  1
       2
    M  1  a
          b
          c
       2  a
       3  a

I tried:

 df.groupby([0,1,2]).count() 

But the result has no L level, since L has no entries at the last sublevel.
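The behaviour described above can be reproduced with a small sketch. It assumes the three-column frame was obtained by splitting the column on "."; the names `s`, `parts`, `default_groups` and `all_groups` are illustrative and not from the original post. By default `groupby` excludes rows whose key is missing, which is why L disappears. For comparison, the `dropna=False` option (available from pandas 1.1) keeps those rows as their own groups.

```python
import pandas as pd

# Sample column from the question; `s` and `parts` are illustrative names.
s = pd.Series([
    "M", "E", "L", "M.1", "M.2", "M.3", "E.1", "E.2", "E.3", "E.4",
    "L.1", "L.2", "M.1.a", "M.1.b", "M.1.c", "M.2.a", "M.3.a",
    "E.1.a", "E.1.b", "E.1.c", "E.2.a", "E.3.a", "E.3.b", "E.4.a",
])

# One column per level; shorter entries are padded with missing values.
parts = s.str.split(".", expand=True)

# Default: any row with a missing key is excluded, so every L row is dropped.
default_groups = parts.groupby([0, 1, 2]).size()

# dropna=False: missing keys form their own groups, so L is kept.
all_groups = parts.groupby([0, 1, 2], dropna=False).size()

print("L" in default_groups.index.get_level_values(0))
print("L" in all_groups.index.get_level_values(0))
```

Note that `dropna=False` also keeps the one-level rows (M, E, L) and the two-level parents (M.1, E.1, ...), so the result may still need filtering to match the desired layout.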

A workaround is to add a dummy value and remove it later, for example:

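    import numpy as np

    # Mark second-level L rows that have no third level with a dummy 'x'
    mask = (df[0] == 'L') & (df[2].isnull()) & (df[1].notnull())
    df.loc[mask, 2] = 'x'  # .loc avoids chained assignment
    df = df.replace(np.nan, ' ', regex=True)
    df.sort_values(0, ascending=False, inplace=True)
    newdf = df.groupby([0, 1, 2]).count()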

which gives:

    0  1  2
    E  1  a
          b
          c
       2  a
       3  a
          b
       4  a
    L  1  x
       2  x
    M  1  a
          b
          c
       2  a
       3  a

Then I process the dummy x entry later in my code ...

How can I avoid this hacky way of using groupby?

2 answers

Assuming the column in question is a Series s, we can:

  • Split on the separator "." with expand=True to produce an expanded DataFrame.

  • Define fnc: it checks whether every element of a group is None and, if so, replaces them with a dummy empty string "" using a list comprehension. The Series constructor is then called on the resulting list, and any None values that remain are removed with dropna.

  • Group by columns 0 and 1 and apply fnc to column 2.


    split_str = s.str.split(".", expand=True)
    fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
    split_str.groupby([0, 1])[2].apply(fnc)

gives:

    0  1
    E  1  1    a
          2    b
          3    c
       2  1    a
       3  1    a
          2    b
       4  1    a
    L  1  0
       2  0
    M  1  1    a
          2    b
          3    c
       2  1    a
       3  1    a
    Name: 2, dtype: object

To get a flattened DataFrame, reset the index levels that were used for grouping and drop the remaining index:

 split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True) 

gives:

        0  1  2
    0   E  1  a
    1   E  1  b
    2   E  1  c
    3   E  2  a
    4   E  3  a
    5   E  3  b
    6   E  4  a
    7   L  1
    8   L  2
    9   M  1  a
    10  M  1  b
    11  M  1  c
    12  M  2  a
    13  M  3  a
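The question also asks for something that works with any number of levels, while the code above hard-codes columns 0, 1 and 2. One possible sketch, which is a different technique from the groupby/apply approach in this answer, keeps only the "leaf" entries (those that are not a prefix of any other entry) and then splits them. The names `is_leaf`, `leaves` and `result` are illustrative, and the sketch assumes "." is the only separator and that every parent entry appears as an exact prefix of its children.

```python
import pandas as pd

# Sample column from the question.
s = pd.Series([
    "M", "E", "L", "M.1", "M.2", "M.3", "E.1", "E.2", "E.3", "E.4",
    "L.1", "L.2", "M.1.a", "M.1.b", "M.1.c", "M.2.a", "M.3.a",
    "E.1.a", "E.1.b", "E.1.c", "E.2.a", "E.3.a", "E.3.b", "E.4.a",
])

values = set(s)

# An entry is a leaf if no other entry extends it with ".<something>".
is_leaf = [not any(other.startswith(v + ".") for other in values) for v in s]

# Split the leaves into one column per level; short leaves get "" padding.
leaves = s[is_leaf].str.split(".", expand=True).fillna("")

# Sort by every level and renumber the rows.
result = leaves.sort_values(list(leaves.columns)).reset_index(drop=True)
print(result)
```

Because nothing here refers to a fixed number of columns, the same code should handle deeper hierarchies such as A.1.a.x. The prefix check is quadratic in the number of entries, which is fine for small columns but may need a smarter lookup for large ones.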

You could also approach this with a regex:

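    import pandas as pd

    # Read the sample from the clipboard and keep the value column
    df = pd.read_clipboard(header=None).iloc[:, 1]
    # One capture group per level; the second and third levels are optional
    df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
    print(df2.set_index([0, 1]))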

and the result:

         2
    0 1
    M
    E
    L
    M 1
      2
      3
    E 1
      2
      3
      4
    L 1
      2
    M 1  a
      1  b
      1  c
      2  a
      3  a
    E 1  a
      1  b
      1  c
      2  a
      3  a
      3  b
      4  a

Source: https://habr.com/ru/post/1264142/

