Removing rows by multiple key columns in pandas

I have two data frames, df1 and df2.

df1:

 contig  position  tumor_f   t_ref_count  t_alt_count
 1       14599     0.000000  1            0
 1       14653     0.400000  3            2
 1       14907     0.333333  6            3
 1       14930     0.363636  7            4

df2:

 contig  position
 1       14599
 1       14653

I would like to remove the rows of df1 whose contig and position values appear in df2. I tried something like df1[df1[['contig','position']].isin(df2[['contig','position']])], but this does not work.

3 answers

Version 0.13 adds an isin method to DataFrame, which will do this. If you are using the current master, you can try:

 In [46]: df1[['contig', 'position']].isin(df2.to_dict(outtype='list'))
 Out[46]:
   contig position
 0   True     True
 1   True     True
 2   True    False
 3   True    False

To keep only the rows that are not contained, negate the mask with ~ and use it as an index:

 In [45]: df1.ix[~df1[['contig', 'position']].isin(df2.to_dict(outtype='list')).all(axis=1)]
 Out[45]:
    contig  position   tumor_f  t_ref_count  t_alt_count
 2       1     14907  0.333333            6            3
 3       1     14930  0.363636            7            4
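The session above reflects the pandas API of that era. In recent releases, to_dict takes orient instead of outtype and .ix is no longer available, so a rough modern equivalent might look like the sketch below. It assumes a recent pandas release and rebuilds the question's data by hand.

```python
import pandas as pd

# Data copied from the question
df1 = pd.DataFrame({
    "contig": [1, 1, 1, 1],
    "position": [14599, 14653, 14907, 14930],
    "tumor_f": [0.000000, 0.400000, 0.333333, 0.363636],
    "t_ref_count": [1, 3, 6, 7],
    "t_alt_count": [0, 2, 3, 4],
})
df2 = pd.DataFrame({"contig": [1, 1], "position": [14599, 14653]})

# Column-wise membership test: each key column is checked against
# the list of values df2 holds for the column of the same name
mask = df1[["contig", "position"]].isin(df2.to_dict(orient="list")).all(axis=1)

# Keep the rows where not every key column matched
remaining = df1.loc[~mask]
print(remaining)
```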

You can do this by using the Series isin method twice (works in 0.12):

 In [21]: df1['contig'].isin(df2['contig']) & df1['position'].isin(df2['position'])
 Out[21]:
 0     True
 1     True
 2    False
 3    False
 dtype: bool

 In [22]: ~(df1['contig'].isin(df2['contig']) & df1['position'].isin(df2['position']))
 Out[22]:
 0    False
 1    False
 2     True
 3     True
 dtype: bool

 In [23]: df1[~(df1['contig'].isin(df2['contig']) & df1['position'].isin(df2['position']))]
 Out[23]:
    contig  position   tumor_f  t_ref_count  t_alt_count
 2       1     14907  0.333333            6            3
 3       1     14930  0.363636            7            4

Perhaps we can get a neat solution in 0.13 (using the DataFrame isin, as in Tom's answer).
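One caveat with testing each column separately: it matches any combination of a contig from df2 with a position from df2, not only the pairs that actually occur together. With a single contig, as in the question, this makes no difference, but with several contigs it can flag too many rows. The sketch below uses made-up data (not from the question) to show the difference, and compares whole (contig, position) pairs through a MultiIndex as one possible alternative.

```python
import pandas as pd

keys = ["contig", "position"]

# Hypothetical data: the pair (1, 200) is NOT present in df2
df1 = pd.DataFrame({
    "contig": [1, 1, 2],
    "position": [100, 200, 200],
    "tumor_f": [0.1, 0.2, 0.3],
})
df2 = pd.DataFrame({"contig": [1, 2], "position": [100, 200]})

# Column-wise test: flags (1, 200) because 1 and 200 each occur somewhere in df2
columnwise = df1["contig"].isin(df2["contig"]) & df1["position"].isin(df2["position"])

# Pair-wise test: compare whole (contig, position) tuples
pairwise = df1.set_index(keys).index.isin(df2.set_index(keys).index)

# Rows of df1 whose key pair does not occur in df2
kept = df1[~pairwise]
print(kept)
```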

There also seems to be a neat way to find the matching rows using an inner merge:

 In [31]: pd.merge(df1, df2, how="inner")
 Out[31]:
    contig  position  tumor_f  t_ref_count  t_alt_count
 0       1     14599      0.0            1            0
 1       1     14653      0.4            3            2

Here is a verbose approach:

 keys = ['contig', 'position']
 # index=False leaves the row index out of the tuples, so only the key values are compared
 other_rows = set(df2[keys].itertuples(index=False, name=None))
 is_in_other_df = []
 for row in df1[keys].itertuples(index=False, name=None):
     is_in_other_df.append(row in other_rows)
 df1["InOtherDF"] = is_in_other_df

Then just drop the rows where "InOtherDF" is True.

I think this is a cleaner way, using merge:

 df2["FromDF2"] = True
 df1 = pandas.merge(df1, df2, on=["contig", "position"], how="left")
 # Rows without a match in df2 get NaN in FromDF2, so test for null rather than using ~
 df1[df1.FromDF2.isnull()]
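Newer pandas releases can produce such a marker column themselves through the indicator parameter of merge, which avoids modifying df2. This is a swapped-in variant rather than the answer's own code; the sketch assumes a recent pandas release and uses the data from the question.

```python
import pandas as pd

# Data copied from the question
df1 = pd.DataFrame({
    "contig": [1, 1, 1, 1],
    "position": [14599, 14653, 14907, 14930],
    "tumor_f": [0.000000, 0.400000, 0.333333, 0.363636],
    "t_ref_count": [1, 3, 6, 7],
    "t_alt_count": [0, 2, 3, 4],
})
df2 = pd.DataFrame({"contig": [1, 1], "position": [14599, 14653]})

# indicator=True adds a "_merge" column saying where each row came from
merged = df1.merge(df2, on=["contig", "position"], how="left", indicator=True)

# "left_only" marks rows of df1 with no matching key pair in df2
result = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(result)
```

Note that merge builds a fresh index for its result, so the original row labels of df1 are not preserved in general.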

Source: https://habr.com/ru/post/1494567/

