Finding duplicates matching specific conditions in python

The following are examples of the data I'm working on.

sender  receiver    date    id
salman  akhtar  20161201    1111
akhtar  salman  20161201    1112
nabeel  ahmed   20161201    1113
salman  akhtar  20161201    1114
salman  akhtar  20161202    1115
nabeel  ahmed   20161202    1116
ahmed   nabeel  20161202    1117
nabeel  ahmed   20161202    1118
nabeel  ahmed   20161202    1119

What I'm trying to achieve is to find duplicate entries based on the condition the same sender and the same receiver within the same date.

For this, I wrote the following code.

import pandas as pd
import xlsxwriter

print 'Script for Finding duplicate entries\n'

path = raw_input('Enter file name: ')
print 'Loading file. Please wait...'

xlsx = pd.ExcelFile(path+'.xlsx')

print 'File loaded successfully.\n'
sheet = raw_input('Enter Sheet Name: ')
df = pd.read_excel(xlsx, sheet)

df['is_duplicated'] = df.duplicated(['sender', 'receiver','date'],keep=False)

df_dup = df.loc[df['is_duplicated'] == True]

print 'Found Below Duplicates'
print df_dup

writer = pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter')
df_dup.to_excel(writer, sheet_name='Sheet1')

writer.save()

print 'File created successfully.'

Now I want to include fuzzywuzzyit because the current code returns only EXACT duplicates, and I want all POSSIBLE duplicate rows based on the specified conditions.

Can anyone help?

+4
source share
1 answer

Something like that?

>>> fuzz_ratio = 50
>>> df_rem = df[~df.duplicated(['sender', 'receiver','date'],keep=False)]
>>> df_possible_dup = pd.merge(df_rem, df, on='date', suffixes=['', '_j'])
>>> df_possible_dup.apply(lambda x: fuzz.ratio(x['sender'], x['sender_j']) >= 50 and x['id'] != x['id_j'], axis=1)

, , , , , . :

def worker(x, fuzz_ratio):
    if x['id'] != x['id_j']:
        return False

    if x['sender'] == x['sender_j'] and fuzz.ratio(x['receiver'], x['receiver_j']) > fuzz_ratio:
        return True

    if x['receiver'] == x['receiver_j'] and fuzz.ratio(x['sender'], x['sender_j']) > fuzz_ratio:
        return True

    return False

>>> df_possible_dup.apply(lambda x: worker(x, fuzz_ratio))
0

Source: https://habr.com/ru/post/1665765/


All Articles