Pandas: select rows by matching a column entry against entries in several other columns

In Python 2.7, I create a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101']
})

So it looks like this:

     ID  sps1  sps2  sps3  sps4
0     1  1001  1001  1001  1001
1     2  1111  0001   NaN  1101
2     3  1000   NaN  1000  0101

Each row contains data on a different biological sequence, which has a unique identifier (1, 2, 3, etc.). Each sequence is present in up to 4 different species (sps1-4). The presence (1) or absence (0) of 4 different characters in each sequence is encoded as a 4-digit code. The sequence is absent in some species, so NaN is recorded there.

From this frame, I want to select the rows where the code for sps1 does not match the code for any of the other species.

That is, not row 0 (1001 in every sps column) and not row 2 (1000 in both sps1 and sps3), but row 1 (where the sps1 code, 1111, is unique).

Here is my attempt with pandas:

matches = df.loc[( (df['sps1'] != df['sps2']) & (df['sps1'] != df['sps3']) )].index
df_match = df.iloc[matches]

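This means writing out a comparison of sps1 with every spsX, and with 12 of them that gets unwieldy. Is there a better way to do this?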


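You can use filter to select the sps columns and eq to compare each of them with sps1, specifying axis="rows". Since sps1 always matches itself, the rows you want are those with exactly one match: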

df[(df.filter(regex = "^sps").eq(df.sps1, axis="rows")).sum(axis=1) == 1]

#  ID   sps1    sps2    sps3    sps4
#1  2   1111    0001     NaN    1101
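To see why the comparison is against 1, it may help to look at the intermediate values. The sketch below rebuilds the frame from the question and splits the one-liner into steps; the names sps, mask and match_count are only illustrative, and axis="index" is used as the equivalent spelling of axis="rows".

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101'],
})

# Keep only the columns whose name starts with "sps" (sps1 included).
sps = df.filter(regex="^sps")

# Compare every sps column with sps1, aligning on the row index.
mask = sps.eq(df["sps1"], axis="index")

# sps1 always equals itself, so a count of 1 means no other column matched.
match_count = mask.sum(axis=1)
result = df[match_count == 1]

print(match_count.tolist())
print(result)
```

For this data the per-row counts are 4, 1 and 2, so only the row with ID 2 is kept.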

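Similar to Psidom's answer, but the regex leaves sps1 out of the filtered columns, so any() can be used to detect a match and the result negated: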

df[~df.filter(regex=r"^sps(?!1$)\d+$").eq(df.sps1, axis='rows').any(axis=1)]
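The negative lookahead (?!1$) only rejects a name that ends right after the 1, so names such as sps10 or sps12 would still be selected. A small sketch of that, followed by an equivalent selection that avoids the regex by dropping sps1 explicitly; the column names in names are hypothetical, and the drop/ne/all variant is my own rewording, not part of the answer above.

```python
import re

import pandas as pd

# Which names does the pattern select? (hypothetical column names)
pattern = re.compile(r"^sps(?!1$)\d+$")
names = ["ID", "sps1", "sps2", "sps10", "sps12"]
selected = [n for n in names if pattern.search(n)]
print(selected)

df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101'],
})

# Every sps column except the reference column sps1.
others = df.filter(regex="^sps").drop(columns="sps1")

# True where sps1 differs from every one of the other columns.
differs_from_all = others.ne(df["sps1"], axis="index").all(axis=1)
result = df[differs_from_all]
print(result)
```

Both forms keep only the row with ID 2 for the sample data; which one reads better is a matter of taste.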

You can also try this, which keeps the rows in which no code is duplicated:

df.loc[[df.iloc[i,1:].duplicated().sum() == 0 for i in df.index]]

Result:

  ID  sps1  sps2 sps3  sps4
1  2  1111  0001  NaN  1101
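One thing to keep in mind: duplicated() looks at the whole row, so a row is also dropped when two of the other species share a code, even if neither matches sps1. A minimal sketch with a made-up row (ID 4 is hypothetical and not part of the question's data) showing where this differs from comparing against sps1 only:

```python
import pandas as pd

# Hypothetical row: sps2 and sps3 share a code, but neither matches sps1.
df = pd.DataFrame({
    'ID': ['4'],
    'sps1': ['1111'],
    'sps2': ['0001'],
    'sps3': ['0001'],
    'sps4': ['1101'],
})

# Row-wise duplicate check across all sps columns.
no_dupes = [df.iloc[i, 1:].duplicated().sum() == 0 for i in range(len(df))]

# Comparison against sps1 only.
others = df.filter(regex="^sps").drop(columns="sps1")
unique_sps1 = others.ne(df["sps1"], axis="index").all(axis=1)

print(no_dupes)
print(unique_sps1.tolist())
```

Here the duplicate check rejects the row while the sps1-only comparison keeps it, so the two approaches agree only when the other species never repeat a code among themselves.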

Source: https://habr.com/ru/post/1670763/

