Counting duplicate values in a Pandas DataFrame

Question

Counting duplicate values in a Pandas DataFrame

There should be an easy way to do this, but I could not find an elegant solution for SO or figure it out myself.

I am trying to count the number of duplicate values based on a set of columns in a DataFrame.

Example:

print df

    Month   LSOA code   Longitude   Latitude    Crime type
0   2015-01 E01000916   -0.106453   51.518207   Bicycle theft
1   2015-01 E01000914   -0.111497   51.518226   Burglary
2   2015-01 E01000914   -0.111497   51.518226   Burglary
3   2015-01 E01000914   -0.111497   51.518226   Other theft
4   2015-01 E01000914   -0.113767   51.517372   Theft from the person

My workaround:

counts = dict()
for i, row in df.iterrows():
    key = (
            row['Longitude'],
            row['Latitude'],
            row['Crime type']
        )

    if counts.has_key(key):
        counts[key] = counts[key] + 1
    else:
        counts[key] = 1

And I get the counts:

{(-0.11376700000000001, 51.517371999999995, 'Theft from the person'): 1,
 (-0.111497, 51.518226, 'Burglary'): 2,
 (-0.111497, 51.518226, 'Other theft'): 1,
 (-0.10645299999999999, 51.518207000000004, 'Bicycle theft'): 1}

Besides the fact that this code could be improved (feel free to comment how), how can this be done using Pandas?

For those who are interested, I am working on a dataset from https://data.police.uk/

+4

python pandas duplicates count

tales Nov 30 '15 at 7:38

source share

1 answer

jezrael · Accepted Answer · 2015-11-30T07:47:28+0000

groupby size. reset index 0 count.

print df
  Month LSOA       code  Longitude   Latitude             Crime type
0    2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1    2015-01  E01000914  -0.111497  51.518226               Burglary
2    2015-01  E01000914  -0.111497  51.518226               Burglary
3    2015-01  E01000914  -0.111497  51.518226            Other theft
4    2015-01  E01000914  -0.113767  51.517372  Theft from the person

df = df.groupby(['Longitude', 'Latitude', 'Crime type']).size().reset_index(name='count')
print df
   Longitude   Latitude             Crime type  count
0  -0.113767  51.517372  Theft from the person      1
1  -0.111497  51.518226               Burglary      2
2  -0.111497  51.518226            Other theft      1
3  -0.106453  51.518207          Bicycle theft      1

print df['count']
0    1
1    2
2    1
3    1
Name: count, dtype: int64

Counting duplicate values ​​in a Pandas DataFrame

More articles:

Counting duplicate values in a Pandas DataFrame