Filter a pandas DataFrame by requiring multiple items in a MultiIndex level

I have a multi-indexed DataFrame. The first index level is the name of a (DNA) sequence, and the second level is the type of sequence variant: wt, m1, m2, m3 in the example below. Not every wt sequence has all variant types (see seqA and seqC below).

import pandas as pd

df = pd.DataFrame(data={'A':range(1,9), 'B':range(1,9), 'C': range(1,9)},
     index=pd.MultiIndex.from_tuples([('seqA', 'wt'), ('seqA', 'm1'),
     ('seqA', 'm2'),  ('seqB', 'wt'), ('seqB', 'm1'), ('seqB', 'm2'),
     ('seqB', 'm3'), ('seqC', 'wt') ]))

df.index.rename(['seq_name','type'], inplace=True)
print(df)

               A  B  C
seq_name type         
seqA     wt    1  1  1
         m1    2  2  2
         m2    3  3  3
seqB     wt    4  4  4
         m1    5  5  5
         m2    6  6  6
         m3    7  7  7
seqC     wt    8  8  8

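I want to analyze only the sequences that have specific variant types (m1 and m2 in this example). In other words, I want to filter the DataFrame so that each remaining seq_name has every variant type given in a list.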

My current solution is fairly clunky, IMO.

var_l = ['wt', 'm1', 'm2']
df1 = df[df.index.get_level_values('type').isin(var_l)] # Drop variants not of interest

set_l = []
for v in var_l: # Filter for each variant individually and store its seq_names
    df2 = df[df.index.get_level_values('type').isin([v])]
    set_l.append(set(df2.index.get_level_values('seq_name')))

seq_s = set.intersection(*set_l) # seq_names that have all three variants
df3 = df1[df1.index.get_level_values('seq_name').isin(seq_s)] # Filter based on seq_name
print(df3)

               A  B  C
seq_name type         
seqA     wt    1  1  1
         m1    2  2  2
         m2    3  3  3
seqB     wt    4  4  4
         m1    5  5  5
         m2    6  6  6

I suspect there is a simpler way to do this, something like:

var_l = ['wt', 'm1', 'm2']
filtered_df = filterDataframe(df1, var_l)
print(filtered_df)

               A  B  C
seq_name type         
seqA     wt    1  1  1
         m1    2  2  2
         m2    3  3  3
seqB     wt    4  4  4
         m1    5  5  5
         m2    6  6  6

So far I have only found answers that filter by any item in a list, not by all of them.


Use query together with filter:

var_l = ['wt', 'm1', 'm2']

filtered_df=df.query('type in @var_l').groupby(level=0).filter(lambda x: len(x)==len(var_l))
print (filtered_df)
               A  B  C
seq_name type         
seqA     wt    1  1  1
         m1    2  2  2
         m2    3  3  3
seqB     wt    4  4  4
         m1    5  5  5
         m2    6  6  6
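
The one-liner above can be wrapped in a reusable function along the lines of the `filterDataframe` call the question asks for. This is only a sketch: the name `filter_complete_groups` and its parameters are hypothetical, and it assumes the index levels are named `seq_name` and `type` as in the question. It compares sets rather than group lengths, so duplicated (seq_name, type) rows would not distort the result.

```python
import pandas as pd


def filter_complete_groups(df, required, group_level="seq_name", item_level="type"):
    """Keep rows whose item is in `required`, for groups holding every required item.

    Hypothetical helper; level names default to those used in the question.
    """
    required = set(required)
    # Step 1: drop rows whose item is not of interest.
    subset = df[df.index.get_level_values(item_level).isin(required)]
    # Step 2: keep only groups in which every required item is present.
    return subset.groupby(level=group_level).filter(
        lambda g: required <= set(g.index.get_level_values(item_level))
    )


index = pd.MultiIndex.from_tuples(
    [("seqA", "wt"), ("seqA", "m1"), ("seqA", "m2"),
     ("seqB", "wt"), ("seqB", "m1"), ("seqB", "m2"),
     ("seqB", "m3"), ("seqC", "wt")],
    names=["seq_name", "type"],
)
df = pd.DataFrame({"A": range(1, 9), "B": range(1, 9), "C": range(1, 9)}, index=index)

result = filter_complete_groups(df, ["wt", "m1", "m2"])
print(result)
```

With the sample data this should keep the wt, m1 and m2 rows of seqA and seqB and drop seqC entirely.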

Or use transform with size and then filter by boolean indexing:

filtered_df = df.query('type in @var_l')
filtered_df = filtered_df[filtered_df.groupby(level=0)['A']
                                     .transform('size')
                                     .eq(len(var_l))
                                     .rename(None)]

print (filtered_df)
               A  B  C
seq_name type         
seqA     wt    1  1  1
         m1    2  2  2
         m2    3  3  3
seqB     wt    4  4  4
         m1    5  5  5
         m2    6  6  6

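To see how it works, here are the intermediate steps, computed on the result of query before the boolean filter is applied:

print (df.query('type in @var_l').groupby(level=0)['A'].transform('size'))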
seq_name  type
seqA      wt      3
          m1      3
          m2      3
seqB      wt      3
          m1      3
          m2      3
seqC      wt      1
Name: A, dtype: int32

print (df.query('type in @var_l').groupby(level=0)['A']
         .transform('size')
         .eq(len(var_l))
         .rename(None))
seq_name  type
seqA      wt       True
          m1       True
          m2       True
seqB      wt       True
          m1       True
          m2       True
seqC      wt      False
dtype: bool

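Option 1: query + stack
As in @jezrael's answer, start with query. Then unstack produces NaN for every missing variant, dropna removes the sequences that have any, and stack restores the original shape.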

df.query('type in @var_l').unstack().dropna().stack()

                 A    B    C
seq_name type               
seqA     m1    2.0  2.0  2.0
         m2    3.0  3.0  3.0
         wt    1.0  1.0  1.0
seqB     m1    5.0  5.0  5.0
         m2    6.0  6.0  6.0
         wt    4.0  4.0  4.0

To restore the original dtypes:

df.query('type in @var_l').unstack().dropna().stack().astype(df.dtypes)

               A  B  C
seq_name type         
seqA     m1    2  2  2
         m2    3  3  3
         wt    1  1  1
seqB     m1    5  5  5
         m2    6  6  6
         wt    4  4  4
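
The float values in the first output come from the NaN placeholders that unstack has to insert for seqC. A small sketch of that round trip, rebuilding the question's sample data (the variable names `stacked` and `restored` are only illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("seqA", "wt"), ("seqA", "m1"), ("seqA", "m2"),
     ("seqB", "wt"), ("seqB", "m1"), ("seqB", "m2"),
     ("seqB", "m3"), ("seqC", "wt")],
    names=["seq_name", "type"],
)
df = pd.DataFrame({"A": range(1, 9), "B": range(1, 9), "C": range(1, 9)}, index=index)
var_l = ["wt", "m1", "m2"]

# Wide layout: one row per seq_name, one column per (column, type) pair.
# seqC has no m1/m2, so those cells become NaN.
wide = df.query("type in @var_l").unstack()
print(wide.isna().any(axis=1))

# Dropping incomplete rows and stacking back leaves float columns ...
stacked = wide.dropna().stack()
print(stacked.dtypes)

# ... which astype converts back to the dtypes of the original frame.
restored = stacked.astype(df.dtypes)
print(restored.dtypes)
```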

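Option 2: filter
Pass filter a function that checks whether the intersection of a group's variant types with var_l equals var_l. Note that this keeps every row of the qualifying sequences, including m3.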

def correct_vars(df, v):
    x = set(v)
    n = df.name
    y = set(df.xs(n).index.intersection(v))
    return x == y

df.groupby(level=0).filter(correct_vars, v=var_l)

               A  B  C
seq_name type         
seqA     wt    1  1  1
         m1    2  2  2
         m2    3  3  3
seqB     wt    4  4  4
         m1    5  5  5
         m2    6  6  6
         m3    7  7  7
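
If the extra m3 row is not wanted, one possible follow-up is to apply query after the group filter. The sketch below rebuilds the sample data and uses a hypothetical predicate `has_all_vars` written with a plain set comparison; it is meant as an illustration, not as the answer's own code.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("seqA", "wt"), ("seqA", "m1"), ("seqA", "m2"),
     ("seqB", "wt"), ("seqB", "m1"), ("seqB", "m2"),
     ("seqB", "m3"), ("seqC", "wt")],
    names=["seq_name", "type"],
)
df = pd.DataFrame({"A": range(1, 9), "B": range(1, 9), "C": range(1, 9)}, index=index)
var_l = ["wt", "m1", "m2"]


def has_all_vars(group, required):
    """True when the group contains every variant type in `required`."""
    present = set(group.index.get_level_values("type"))
    return set(required) <= present


# Keeps all rows of seqA and seqB, including seqB's m3.
kept_all_rows = df.groupby(level=0).filter(has_all_vars, required=var_l)

# Restrict the surviving sequences to the variants of interest.
only_required = kept_all_rows.query("type in @var_l")
print(only_required)
```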

Source: https://habr.com/ru/post/1015847/

