The following is a sample of the data I'm working with.
sender receiver date id
salman akhtar 20161201 1111
akhtar salman 20161201 1112
nabeel ahmed 20161201 1113
salman akhtar 20161201 1114
salman akhtar 20161202 1115
nabeel ahmed 20161202 1116
ahmed nabeel 20161202 1117
nabeel ahmed 20161202 1118
nabeel ahmed 20161202 1119
I'm trying to find duplicate entries, where a duplicate is a row that has the same sender and the same receiver on the same date as another row.
For this, I wrote the following code.
import pandas as pd
import xlsxwriter
print 'Script for Finding duplicate entries\n'
path = raw_input('Enter file name: ')
print 'Loading file. Please wait...'
xlsx = pd.ExcelFile(path+'.xlsx')
print 'File loaded successfully.\n'
sheet = raw_input('Enter Sheet Name: ')
df = pd.read_excel(xlsx, sheet)
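# Flag every row that shares sender, receiver and date with at least one other row
df['is_duplicated'] = df.duplicated(['sender', 'receiver', 'date'], keep=False)
# Keep only the flagged rows
df_dup = df[df['is_duplicated']]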
print 'Found Below Duplicates'
print df_dup
writer = pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter')
df_dup.to_excel(writer, sheet_name='Sheet1')
writer.save()
print 'File created successfully.'
Now I want to add fuzzywuzzy to it, because the current code returns only exact duplicates and I want all possible duplicate rows (for example, rows where a name is slightly misspelled) based on the same conditions.
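To illustrate what I mean by possible duplicates, here is a rough sketch of the kind of matching I have in mind. It uses the standard library's difflib as a stand-in for fuzzywuzzy (I assume fuzz.ratio could replace the similarity helper). The function names, the threshold of 80 and the misspelled row with id 9999 are made up for illustration and are not part of my real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 similarity score between two strings (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_possible_duplicates(rows, threshold=80):
    """Return the sorted ids of rows whose sender and receiver both score
    at least `threshold` against another row with the same date."""
    # Group rows by date so that only same-day rows are compared.
    by_date = {}
    for row in rows:
        by_date.setdefault(row[2], []).append(row)

    flagged = set()
    for same_day in by_date.values():
        # Compare every pair of rows within one date.
        for r1, r2 in combinations(same_day, 2):
            if (similarity(r1[0], r2[0]) >= threshold
                    and similarity(r1[1], r2[1]) >= threshold):
                flagged.add(r1[3])
                flagged.add(r2[3])
    return sorted(flagged)


# Rows are (sender, receiver, date, id) tuples.
rows = [
    ("salman", "akhtar", 20161201, 1111),
    ("akhtar", "salman", 20161201, 1112),
    ("nabeel", "ahmed", 20161201, 1113),
    ("salmaan", "akhtar", 20161201, 9999),  # hypothetical misspelled row
]
possible_duplicates = find_possible_duplicates(rows)
```

With these rows, 1111 and 9999 should be flagged as possible duplicates of each other. This compares every pair of rows within a date, so it would probably be slow on large sheets.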
Can anyone help?