Take the union of two columns, Python + Pandas

I have df as shown below:

   x    y    z
0  a   jj  Nan
1  b   ii   mm
2  c   kk   nn
3  d   ii  NaN
4  e  Nan   oo
5  f   jj   mm
6  g  Nan   nn

Required Result:

   x    y    z   w
0  a   jj  Nan   a
1  b   ii   mm   a
2  c   kk   nn   c
3  d   ii  NaN   a
4  e  Nan   oo   e
5  f   jj   mm   a
6  g  Nan   nn   c

Logic:

  • take the union of columns y and z: ii == jj, since at indices 1 and 5 they both have mm in column z

  • group this union: indices 0, 1, 3, 5 form one group, indices 2, 6 form another group

  • inside a group, randomly take one cell in column x and assign it to column w for the whole group

I have no idea how to approach this problem. Can someone help me?

EDITNOTE:

At first I posted a version in which columns y and z were perfectly sorted, as follows:

   x    y    z   w
0  a   ii  NaN   a
1  b   ii   mm   a
2  c   jj   mm   a
3  d   jj  Nan   a
4  e   kk   nn   e
5  f  Nan   nn   e
6  g  Nan   oo   g

In this case, piRSquared's solution works perfectly.

EDITNOTE2:

Nickil Maveli's solution works great for my problem. However, I noticed one situation that it cannot cope with:

   x   y   z
0  a  ii  mm
1  b  ii  nn
2  c  jj  nn
3  d  jj  oo
4  e  kk  oo

With that solution, the result is:

   0   1   2  w
0  a  ii  mm  a
1  b  ii  mm  a
2  c  jj  nn  c
3  d  jj  nn  c
4  e  kk  oo  e

The desired result is w = ['a', 'a', 'a', 'a', 'a'].

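In the general case this is a set-consolidation / connected-components problem: rows that share a y value or a z value belong to the same group, and the groups are the connected components of that relation.

scipy has a connected-components function that can be used here after a little preparation: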

import pandas as pd
import scipy.sparse
import scipy.sparse.csgraph

def via_cc(df_in):
    df = df_in.copy()

    # work with ranked version
    dfr = df[["y","z"]].rank(method='dense')
    # give nans their own temporary rank
    dfr = dfr.fillna(dfr.max().fillna(0) + dfr.isnull().cumsum(axis=0))
    # don't let y and z get mixed up; have separate nodes per column
    dfr["z"] += dfr["y"].max()
    # ranks are floats; node ids must be integers
    dfr = dfr.astype(int)

    # build the adjacency matrix: one edge per row, from its y node to its z node
    size = int(dfr.max().max()) + 1
    m = scipy.sparse.coo_matrix(([1]*len(dfr), (dfr["y"], dfr["z"])),
                                shape=(size, size))

    # do the work to find the groups
    _, cc = scipy.sparse.csgraph.connected_components(m)

    # get the group codes
    group = pd.Series(cc[dfr["y"].values], index=dfr.index)
    # fill in w from x appropriately
    df["w"] = df["x"].groupby(group).transform("min")

    return df

In [230]: via_cc(df0)
Out[230]: 
   x    y    z  w
0  a   jj  NaN  a
1  b   ii   mm  a
2  c   kk   nn  c
3  d   ii  NaN  a
4  e  NaN   oo  e
5  f   jj   mm  a
6  g  NaN   nn  c

In [231]: via_cc(df1)
Out[231]: 
   x   y   z  w
0  a  ii  mm  a
1  b  ii  nn  a
2  c  jj  nn  a
3  d  jj  oo  a
4  e  kk  oo  a

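(Note that in my df0 the "Nan" entries are real NaNs. If you have the string "Nan" (as opposed to NaN), replace it with a real NaN first, otherwise every "Nan" is treated as one shared value.)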
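
For readers without scipy, the same grouping can be sketched with a small union-find over row positions. This is only an illustrative alternative, not the method used above: the helper name `assign_w` and its parameters are hypothetical, missing values are assumed to be real NaN/None rather than the string "Nan", and w takes the x of the first row in each group rather than the minimum.

```python
import pandas as pd

def assign_w(df, cols=("y", "z"), src="x", out="w"):
    """Hypothetical helper: group rows that share a value in any of `cols`."""
    parent = list(range(len(df)))  # each row position starts as its own group

    def find(i):
        # follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for col in cols:
        first_seen = {}  # value -> first row position holding it
        for pos, val in enumerate(df[col]):
            if pd.isna(val):
                continue  # missing values never link rows
            if val in first_seen:
                a, b = find(first_seen[val]), find(pos)
                parent[max(a, b)] = min(a, b)  # smallest position stays the root
            else:
                first_seen[val] = pos

    roots = [find(i) for i in range(len(df))]
    result = df.copy()
    result[out] = df[src].to_numpy()[roots]  # x of the first row in each group
    return result

df0 = pd.DataFrame({
    "x": list("abcdefg"),
    "y": ["jj", "ii", "kk", "ii", None, "jj", None],
    "z": [None, "mm", "nn", None, "oo", "mm", "nn"],
})
df1 = pd.DataFrame({
    "x": list("abcde"),
    "y": ["ii", "ii", "jj", "jj", "kk"],
    "z": ["mm", "nn", "nn", "oo", "oo"],
})

print(assign_w(df0)["w"].tolist())  # ['a', 'a', 'c', 'a', 'e', 'a', 'c']
print(assign_w(df1)["w"].tolist())  # ['a', 'a', 'a', 'a', 'a']
```

Because the root of every group is its smallest row position, the representative is deterministic; the question allows any member of the group, so this is one choice among several.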

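Check whether each value in 'y' equals the value in the previous row, and do the same for 'z'. A new group starts wherever neither 'y' nor 'z' matches the previous row.
This relies on equal values sitting in consecutive rows.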

y_chk = df.y.eq(df.y.shift())
z_chk = df.z.eq(df.z.shift())
grps = (~y_chk & ~z_chk).cumsum()
df['w'] = df.groupby(grps).x.transform('first')
df
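
To make the order dependence concrete, here is a small sketch that applies the same consecutive-run grouping to the sorted frame from the question's first edit and to the unsorted original. The names `df_sorted`, `df_unsorted` and `first_of_run` are made up for this illustration, and "Nan" is assumed to mean a real NaN.

```python
import numpy as np
import pandas as pd

def first_of_run(df):
    """Hypothetical helper: label consecutive runs that share y or z."""
    y_same = df["y"].eq(df["y"].shift())  # NaN never equals NaN, so gaps break runs
    z_same = df["z"].eq(df["z"].shift())
    run_id = (~y_same & ~z_same).cumsum()  # new run where neither column repeats
    return df.groupby(run_id)["x"].transform("first")

df_sorted = pd.DataFrame({
    "x": list("abcdefg"),
    "y": ["ii", "ii", "jj", "jj", "kk", np.nan, np.nan],
    "z": [np.nan, "mm", "mm", np.nan, "nn", "nn", "oo"],
})
df_unsorted = pd.DataFrame({
    "x": list("abcdefg"),
    "y": ["jj", "ii", "kk", "ii", np.nan, "jj", np.nan],
    "z": [np.nan, "mm", "nn", np.nan, "oo", "mm", "nn"],
})

print(first_of_run(df_sorted).tolist())    # ['a', 'a', 'a', 'a', 'e', 'e', 'g']
print(first_of_run(df_unsorted).tolist())  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

On the unsorted frame no value repeats in adjacent rows, so every row becomes its own group, which is why this approach only fits the sorted case.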

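Replace the "Nan" strings with real NaN values. Then group by "y" and fill each group's "z" with the first valid "z" value in that group.

Next, group by 'z' and take the first 'x' value of each group (that is, the element at position 0).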

Convert it to a dictionary to create a mapping, and finally assign it back to the new “w” column, as shown:

import numpy as np

df_new = df.replace('Nan', np.nan)
df_new['z'] = df_new.groupby('y')['z'].transform(lambda x: x.loc[x.first_valid_index()])
df['w'] = df_new['z'].map(df_new.groupby('z')['x'].first().to_dict())
df

Source: https://habr.com/ru/post/1656410/

