Finding duplicates matching specific conditions in Python

Question

The following are examples of the data I'm working on.

sender  receiver    date    id
salman  akhtar  20161201    1111
akhtar  salman  20161201    1112
nabeel  ahmed   20161201    1113
salman  akhtar  20161201    1114
salman  akhtar  20161202    1115
nabeel  ahmed   20161202    1116
ahmed   nabeel  20161202    1117
nabeel  ahmed   20161202    1118
nabeel  ahmed   20161202    1119

What I'm trying to achieve is to find duplicate entries, where a duplicate means the same sender and the same receiver on the same date.

For this, I wrote the following code.

import pandas as pd
import xlsxwriter

print 'Script for Finding duplicate entries\n'

path = raw_input('Enter file name: ')
print 'Loading file. Please wait...'

xlsx = pd.ExcelFile(path+'.xlsx')

print 'File loaded successfully.\n'
sheet = raw_input('Enter Sheet Name: ')
df = pd.read_excel(xlsx, sheet)

df['is_duplicated'] = df.duplicated(['sender', 'receiver','date'],keep=False)

df_dup = df.loc[df['is_duplicated'] == True]

print 'Found Below Duplicates'
print df_dup

writer = pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter')
df_dup.to_excel(writer, sheet_name='Sheet1')

writer.save()

print 'File created successfully.'

Now I want to incorporate fuzzywuzzy into it, because the current code returns only EXACT duplicates, and I want all POSSIBLE duplicate rows based on the specified conditions.

Can anyone help?

python python-2.7 pandas

Salman akhtar Jan 4 '17 at 16:25

1 answer

Roman Pekar · Answer 1 · 2017-01-04T18:50:06+0000

Something like this?

>>> from fuzzywuzzy import fuzz
>>> fuzz_ratio = 50
>>> df_rem = df[~df.duplicated(['sender', 'receiver', 'date'], keep=False)]
>>> df_possible_dup = pd.merge(df_rem, df, on='date', suffixes=['', '_j'])
>>> df_possible_dup.apply(lambda x: fuzz.ratio(x['sender'], x['sender_j']) >= fuzz_ratio and x['id'] != x['id_j'], axis=1)

You can then refine the condition to whatever you need, for example:

def worker(x, fuzz_ratio):
    # A row joined with itself is not a duplicate.
    if x['id'] == x['id_j']:
        return False

    # Same sender, similar receiver.
    if x['sender'] == x['sender_j'] and fuzz.ratio(x['receiver'], x['receiver_j']) > fuzz_ratio:
        return True

    # Same receiver, similar sender.
    if x['receiver'] == x['receiver_j'] and fuzz.ratio(x['sender'], x['sender_j']) > fuzz_ratio:
        return True

    return False

>>> df_possible_dup.apply(lambda x: worker(x, fuzz_ratio), axis=1)
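The underlying idea (compare only rows that share a date, and flag pairs whose names are close enough) can also be sketched without pandas or fuzzywuzzy. This is just an illustration under a few assumptions: `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio` (its scores are not guaranteed to match fuzzywuzzy's), the names `similarity` and `possible_duplicates` and the threshold of 80 are made up for the example, and a pair is flagged only when both sender and receiver are similar, which is one possible reading of "possible duplicate".

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Sample rows from the question: (sender, receiver, date, id)
ROWS = [
    ("salman", "akhtar", 20161201, 1111),
    ("akhtar", "salman", 20161201, 1112),
    ("nabeel", "ahmed", 20161201, 1113),
    ("salman", "akhtar", 20161201, 1114),
    ("salman", "akhtar", 20161202, 1115),
    ("nabeel", "ahmed", 20161202, 1116),
    ("ahmed", "nabeel", 20161202, 1117),
    ("nabeel", "ahmed", 20161202, 1118),
    ("nabeel", "ahmed", 20161202, 1119),
]


def similarity(a, b):
    """Similarity on a 0-100 scale (assumed stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def possible_duplicates(rows, threshold=80):
    """Return sorted (id, id) pairs that share a date and have similar names."""
    # Group rows by date so only same-day rows are ever compared.
    by_date = defaultdict(list)
    for row in rows:
        by_date[row[2]].append(row)

    pairs = []
    for same_day in by_date.values():
        # combinations() yields each unordered pair once and never pairs
        # a row with itself, so no explicit id check is needed.
        for a, b in combinations(same_day, 2):
            sender_ok = similarity(a[0], b[0]) >= threshold
            receiver_ok = similarity(a[1], b[1]) >= threshold
            if sender_ok and receiver_ok:
                pairs.append((a[3], b[3]))
    return sorted(pairs)
```

On the sample data, `possible_duplicates(ROWS)` flags only the exact repeats, because the sample contains no near-miss spellings. Adding a hypothetical row such as `("salmaan", "akhtar", 20161201, 9999)` would also pair it with ids 1111 and 1114. Swapped sender/receiver rows such as 1111 and 1112 are not flagged; whether they should be is a decision this sketch leaves open.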
