How to extract specific sequences from a data frame?

Question

This question is related to my previous question. Given the following DataFrame:

df = 
    ID   TYPE   VD_0   VD_1   VD_2   VD_3   VD_4   VD_5
    1    ABC    V1234  aaa    bbb    456    123    564
    2    DBC    456    A45    aaa    V1234  bbb    564
    3    ABD    456    V1234  bbb    ccc    456    123
    4    ABD    ccc    aaa    123    V1234  SSW    123

The following is a list of target values to look for in columns VD_0 through VD_5:

myList = ['V1234', '456', 'A45']
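
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the table above. It assumes every VD_ value is stored as a string (so that '456' in myList matches the cell '456'); the actual dtypes in my data may differ.

```python
import pandas as pd

# Assumption: all VD_ cells are strings; an integer 456 would NOT match '456'.
df = pd.DataFrame({
    'ID':   [1, 2, 3, 4],
    'TYPE': ['ABC', 'DBC', 'ABD', 'ABD'],
    'VD_0': ['V1234', '456', '456', 'ccc'],
    'VD_1': ['aaa', 'A45', 'V1234', 'aaa'],
    'VD_2': ['bbb', 'aaa', 'bbb', '123'],
    'VD_3': ['456', 'V1234', 'ccc', 'V1234'],
    'VD_4': ['123', 'bbb', '456', 'SSW'],
    'VD_5': ['564', '564', '123', '123'],
})
myList = ['V1234', '456', 'A45']

# Number of target values found in the VD_ columns of each row.
hits_per_row = df.filter(like='VD_').isin(myList).sum(axis=1)
print(hits_per_row)
```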

I want to get only those rows of df in which 2 or more values from myList appear "sequentially" in the columns VD_0 through VD_5. Other values (any that do not belong to myList) are allowed to appear between them; for example, aaa, bbb, ccc, etc. are permissible in between.

The result should be as follows:

result = 
    ID   TYPE   Col_0   Col_1   Col_2 
    1    ABC    V1234   456
    2    DBC    456     A45     V1234
    3    ABD    456     V1234   456

In result I want to display only the values from myList in the Col_ columns, ignoring the rest of the values.

This is my attempt so far; it only detects values from myList that sit in directly adjacent columns:

subset = df.filter(like='VD_')

df[subset.isin(myList).rolling(2, axis=1).sum().max(axis=1)>=2]

python pandas

Dinosaurius, Feb 16 '17 at 10:08

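One approach is to collect, for each row, the values that appear in myList, and then keep only the rows that contain at least two of them: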

In [903]: df.apply(lambda x: [y for y in x if y in myList], axis=1)
Out[903]:
0         [V1234, 456]
1    [456, A45, V1234]
2    [456, V1234, 456]
3              [V1234]
dtype: object

In [904]: s = df.apply(lambda x: [y for y in x if y in myList], axis=1)

In [905]: s[s.apply(len) >= 2]
Out[905]:
0         [V1234, 456]
1    [456, A45, V1234]
2    [456, V1234, 456]
dtype: object

In [906]: s[s.apply(len) >= 2].apply(pd.Series)
Out[906]:
       0      1      2
0  V1234    456    NaN
1    456    A45  V1234
2    456  V1234    456
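
Note that the output above drops ID and TYPE and scans every column rather than only VD_0 through VD_5. A minimal sketch of one way to get closer to the layout requested in the question, assuming all VD_ values are strings; the helper name extract_matches and its parameters are hypothetical and not part of the original answer:

```python
import pandas as pd

# Assumption: all VD_ cells are strings, as in the question's table.
df = pd.DataFrame({
    'ID':   [1, 2, 3, 4],
    'TYPE': ['ABC', 'DBC', 'ABD', 'ABD'],
    'VD_0': ['V1234', '456', '456', 'ccc'],
    'VD_1': ['aaa', 'A45', 'V1234', 'aaa'],
    'VD_2': ['bbb', 'aaa', 'bbb', '123'],
    'VD_3': ['456', 'V1234', 'ccc', 'V1234'],
    'VD_4': ['123', 'bbb', '456', 'SSW'],
    'VD_5': ['564', '564', '123', '123'],
})
myList = ['V1234', '456', 'A45']


def extract_matches(frame, targets, prefix='VD_', min_hits=2):
    """Hypothetical helper: keep rows with at least min_hits target values."""
    subset = frame.filter(like=prefix)
    # One list of matching values per row, in left-to-right column order.
    hits = pd.Series(
        [[v for v in row if v in targets] for row in subset.to_numpy()],
        index=frame.index,
    )
    hits = hits[hits.map(len) >= min_hits]
    # Ragged lists are padded with missing values on the right.
    cols = pd.DataFrame(hits.tolist(), index=hits.index).add_prefix('Col_')
    return frame.loc[hits.index, ['ID', 'TYPE']].join(cols)


result = extract_matches(df, myList)
print(result)
```

Rows with fewer matches than the widest row are padded with missing values in the trailing Col_ columns.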

Zero, Feb 16 '17 at 10:26

jezrael · Accepted Answer · 2017-02-16T10:20:38+0000

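You can count the True values by calling sum on the boolean DataFrame returned by isin.

Then, to get a column mask that covers all of df.columns, use reindex with fill_value=True: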

myList = ['V1234','456','A45']
subset = df.filter(like='VD_')

subset1 = subset.isin(myList)
mask1 = subset1.sum(axis=1) >= 2
print (mask1)

0     True
1     True
2     True
3    False
dtype: bool

mask2 = subset1.sum() >= 2
print (mask2)
VD_0     True
VD_1     True
VD_2    False
VD_3     True
VD_4    False
VD_5    False
dtype: bool

print (mask2.reindex(df.columns, fill_value=True))
ID       True
TYPE     True
VD_0     True
VD_1     True
VD_2    False
VD_3     True
VD_4    False
VD_5    False
dtype: bool

print (df.loc[mask1, mask2.reindex(df.columns, fill_value=True)])
   ID TYPE   VD_0   VD_1   VD_3
0   1  ABC  V1234    aaa    456
1   2  DBC    456    A45  V1234
2   3  ABD    456  V1234    ccc
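
The column mask keeps whole columns, so values outside myList (aaa, ccc) still appear in the output above. A minimal sketch of one possible way to keep only the target values and shift them to the left, assuming all VD_ cells are strings; the NumPy-based compaction is an illustrative addition, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Assumption: all VD_ cells are strings, as in the question's table.
df = pd.DataFrame({
    'ID':   [1, 2, 3, 4],
    'TYPE': ['ABC', 'DBC', 'ABD', 'ABD'],
    'VD_0': ['V1234', '456', '456', 'ccc'],
    'VD_1': ['aaa', 'A45', 'V1234', 'aaa'],
    'VD_2': ['bbb', 'aaa', 'bbb', '123'],
    'VD_3': ['456', 'V1234', 'ccc', 'V1234'],
    'VD_4': ['123', 'bbb', '456', 'SSW'],
    'VD_5': ['564', '564', '123', '123'],
})
myList = ['V1234', '456', 'A45']

subset = df.filter(like='VD_')
subset1 = subset.isin(myList)          # boolean DataFrame, as in the answer
mask1 = subset1.sum(axis=1) >= 2       # rows with at least two target values

is_hit = subset1.to_numpy()
# A stable argsort on the negated mask moves target cells to the left of
# each row while preserving their original left-to-right order.
order = np.argsort(~is_hit, axis=1, kind='stable')
values = np.take_along_axis(subset.to_numpy(), order, axis=1)
keep = np.take_along_axis(is_hit, order, axis=1)
values = np.where(keep, values, None)  # blank out the non-target cells

width = int(is_hit.sum(axis=1).max())  # largest number of hits in any row
cols = pd.DataFrame(values[:, :width], index=df.index).add_prefix('Col_')
result = df.loc[mask1, ['ID', 'TYPE']].join(cols)
print(result)
```

This avoids a Python-level loop over rows, at the cost of being less readable than a list comprehension.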
