Finding rows in a Pandas DataFrame with columns that break a one-to-one mapping

I have a DataFrame like this:

| index | col_1 | col_2 |
|-------|-------|-------|
| 0     | A     | 11    |
| 1     | B     | 12    |
| 2     | B     | 12    |
| 3     | C     | 13    |
| 4     | C     | 13    |
| 5     | C     | 14    |

where col_1 and col_2 may not always be one-to-one due to corrupted data.

How can I use Pandas to determine which rows have col_1 and col_2 entries that violate this one-to-one relationship?

In this case, it would be the last three rows, since C maps to both 13 and 14.

4 answers

You can use transform, counting the number of unique values in each group. First take the subset containing only these columns, then group by one of them:

    In [11]: g = df[['col_1', 'col_2']].groupby('col_1')

    In [12]: counts = g.transform(lambda x: len(x.unique()))

    In [13]: counts
    Out[13]:
       col_2
    0      1
    1      1
    2      1
    3      2
    4      2
    5      2

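counts has one column for each of the remaining columns (every column other than the grouping key, if you did not take a subset). A row respects the mapping only when every count is 1: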

    In [14]: (counts == 1).all(axis=1)
    Out[14]:
    0     True
    1     True
    2     True
    3    False
    4    False
    5    False
    dtype: bool
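
Putting the two steps together on the question's sample data selects the violating rows directly. This is only a minimal sketch: it uses the built-in "nunique" aggregation in place of the lambda, which should produce the same counts, and the name violations is purely illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "col_1": ["A", "B", "B", "C", "C", "C"],
    "col_2": [11, 12, 12, 13, 13, 14],
})

# For every row, the number of distinct col_2 values in its col_1 group.
counts = df.groupby("col_1")["col_2"].transform("nunique")

# Rows whose col_1 value maps to more than one col_2 value.
violations = df[counts > 1]
print(violations)
```

Note that this checks only the col_1 to col_2 direction; a col_2 value shared by two different col_1 values would need the same check with the columns swapped.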

I tested g.transform(lambda x: len(x.unique())). It works well but is slow, especially when there are many groups. The code below is much faster, so I am posting it here.

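    # Count the rows for each distinct (col_1, col_2) pair.
    df2 = df.groupby(['col_1', 'col_2']).size().reset_index(name='count')
    # Count how many distinct col_2 values each col_1 value has.
    df3 = df2.groupby('col_1').size().to_frame('count')
    # Keep the col_1 values that map to more than one col_2 value.
    df4 = df3[df3['count'] > 1]
    # Select the offending rows (.loc replaces the removed .ix indexer).
    df_copy = df.set_index('col_1')
    df_outlier = df_copy.loc[df4.index]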
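
A strict one-to-one mapping also requires that each col_2 value belongs to a single col_1 value, which the code above does not check. The following is a rough sketch of one way to test both directions without a Python-level lambda, assuming the question's column names; bad_col_1, bad_col_2 and mask are illustrative names, and no timing has been measured here.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "col_1": ["A", "B", "B", "C", "C", "C"],
    "col_2": [11, 12, 12, 13, 13, 14],
})

# Distinct (col_1, col_2) pairs; exact repeats of a row are not violations.
pairs = df[["col_1", "col_2"]].drop_duplicates()

# A value that still occurs more than once among the distinct pairs
# is paired with several different partners.
bad_col_1 = pairs.loc[pairs["col_1"].duplicated(keep=False), "col_1"]
bad_col_2 = pairs.loc[pairs["col_2"].duplicated(keep=False), "col_2"]

# Flag rows that break the mapping in either direction.
mask = df["col_1"].isin(bad_col_1) | df["col_2"].isin(bad_col_2)
df_outlier = df[mask]
print(df_outlier)
```

Unlike the set_index variant, this keeps the original row index, so the result can be compared against the input row by row.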

I would use collections.Counter because more than one instance of each item in a column breaks the one-to-one mapping:

    >>> import pandas
    >>> import numpy
    >>> import collections
    >>> df = pandas.DataFrame(numpy.array([['a', 1], ['b', 2], ['b', 3], ['c', 3]]))
    >>> df
       0  1
    0  a  1
    1  b  2
    2  b  3
    3  c  3
    >>> collections.Counter(df[0])
    Counter({'b': 2, 'a': 1, 'c': 1})
    >>> violations1 = [k for k, v in collections.Counter(df[0]).items() if v > 1]
    >>> violations2 = [k for k, v in collections.Counter(df[1]).items() if v > 1]
    >>> violations1
    ['b']
    >>> violations2
    ['3']
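
On the question's own data, counting raw column values would also flag B, because the row (B, 12) simply appears twice. If identical repeated rows should not count as violations (an assumption about the asker's intent), one possible adjustment is to reduce the data to distinct pairs before counting. A minimal sketch, with col_1_counts and col_2_counts as illustrative names:

```python
import collections

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "col_1": ["A", "B", "B", "C", "C", "C"],
    "col_2": [11, 12, 12, 13, 13, 14],
})

# Distinct (col_1, col_2) pairs as plain tuples.
pairs = set(zip(df["col_1"], df["col_2"]))

# Count how many distinct partners each value has.
col_1_counts = collections.Counter(a for a, _ in pairs)
col_2_counts = collections.Counter(b for _, b in pairs)

# Values paired with more than one partner break the mapping.
violations1 = [k for k, v in col_1_counts.items() if v > 1]
violations2 = [k for k, v in col_2_counts.items() if v > 1]
print(violations1, violations2)
```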

I'm super new to Python, but I found a way to do this by collecting all the col_1/col_2 combinations into a list and filtering out those that are not uniquely mapped:

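    import pandas as pd

    data = pd.DataFrame({'Col_1': ['A', 'B', 'B', 'C', 'C', 'C'],
                         'Col_2': [11, 12, 12, 13, 13, 14]})

    # Build one 'Col_1_Col_2' string per row, e.g. 'A_11'.
    combos = ['%s_%s' % (a, b) for a, b in zip(data['Col_1'], data['Col_2'])]

    data.index = data['Col_1']
    for item in combos:
        key, value = item.split('_')
        same_value = [comb for comb in combos if comb.split('_')[1] == value]
        same_key = [comb for comb in combos if comb.split('_')[0] == key]
        # Drop a key once; a second drop of the same label would raise KeyError.
        if len(same_value) != len(same_key) and key in data.index:
            data = data.drop(key)
    # Keep the result: reset_index returns a new DataFrame.
    data = data.reset_index(drop=True)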

Source: https://habr.com/ru/post/970214/

