In Python 2.7, I create a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101']
})
So it looks like this:
  ID  sps1  sps2  sps3  sps4
0  1  1001  1001  1001  1001
1  2  1111  0001   NaN  1101
2  3  1000   NaN  1000  0101
Each row contains data on a different biological sequence, which has a unique identifier (1, 2, 3, etc.). Each sequence is present in 4 different species (sps1-4). The presence (1) or absence (0) of 4 different characters in each sequence is encoded as a 4-digit code. The sequence is absent in some species, so NaN is recorded instead.
From this DataFrame, I want to select the rows where the code for sps1 does not match the code for any of the other species.
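So row 0 should not be selected (all sps have 1001), nor row 2 (1000 in both sps1 and sps3), but row 1 should be (sps1 has the unique code 1111).
Here is what I tried with pandas: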
# Keep rows where sps1 differs from each of the other species columns
matches = df.loc[(df['sps1'] != df['sps2']) & (df['sps1'] != df['sps3']) & (df['sps1'] != df['sps4'])].index
df_match = df.loc[matches]
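I need to compare sps1 with every spsX column, and there are 12 of them, so chaining the conditions by hand gets unwieldy. Is there a more concise way to do this?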
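For illustration, here is a minimal sketch of one way the comparison might be generalized to any number of species columns. It assumes every species column name starts with 'sps' and that sps1 is the reference; the names `others` and `differs` are only illustrative. It uses `DataFrame.ne` with `axis=0` to compare each column against sps1 row by row.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101'],
})

# Every species column except the reference one
# (assumption: species columns share the 'sps' prefix).
others = [c for c in df.columns if c.startswith('sps') and c != 'sps1']

# Boolean frame: True where a species' code differs from sps1 in that row.
differs = df[others].ne(df['sps1'], axis=0)

# Keep only the rows where sps1 differs from every other species.
df_match = df[differs.all(axis=1)]
print(df_match)
```

With the sample data this keeps only row 1. Note that a 'NaN' entry counts as "different" from sps1 here, which matches the behaviour of the chained != comparisons.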