Grouping and comparing groups with pandas

I have data that looks like this:

Identifier  Category1  Category2  Category3  Category4  Category5
1000        foo        bat        678        a.x        ld
1000        foo        bat        78         l.o        op
1000        coo        cat        678        p.o        kt
1001        coo        sat        89         a.x        hd
1001        foo        bat        78         l.o        op
1002        foo        bat        678        a.x        ld
1002        foo        bat        78         l.o        op
1002        coo        cat        678        p.o        kt

What I'm trying to do is compare identifier 1000 to 1001, then to 1002, and so on. The result I want the code to produce is: 1000 is the same as 1002. The approach I had in mind was as follows:

  • Split the rows for each identifier into a separate data frame (maybe?). For example, df1 would hold all rows for identifier 1000, and df2 all rows for identifier 1002. (**Note that I want the code to do this for me, since there are millions of rows and I can't write the comparison for each identifier by hand.**) I tried pandas' groupby function; it handles the grouping well, but then I don't know how to compare the groups.
  • Compare each of the groups / sub-data frames.

One of the methods I was thinking about is reading each identifier's rows into an array/vector and comparing the arrays/vectors with a comparison metric (Manhattan distance, cosine similarity, etc.); a rough sketch of that idea follows.
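For illustration, a minimal sketch of that idea (assuming scipy is installed; the factorize encoding and the per-identifier mean are only my illustration choices, and averaging rows is lossy, so this shows the mechanics rather than an exact comparison):

import pandas as pd
from scipy.spatial.distance import cosine

df = pd.read_csv("input.csv")

# Turn each categorical column into integer codes so rows become numeric vectors
encoded = df.drop(columns=["Identifier"]).apply(lambda col: pd.factorize(col)[0])

# Collapse each identifier's rows into a single vector (here: the mean of its rows)
vectors = encoded.groupby(df["Identifier"]).mean()

# A cosine distance of 0.0 suggests (but does not prove) that two identifiers match
print(cosine(vectors.loc[1000], vectors.loc[1002]))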

Any help is appreciated, I am very new to Python. Thanks in advance!


One approach: group by Identifier and use each group's rows, converted to a tuple, as dictionary keys; identical groups then land on the same key:

import pandas as pd

input_file = pd.read_csv("input.csv")
columns = ['Category1', 'Category2', 'Category3', 'Category4', 'Category5']

duplicate_entries = {}

for identifier, group in input_file.groupby('Identifier'):
    # Convert the group's rows to tuples so they can be used as dict keys
    # (note: row order within a group matters for the comparison)
    lines = [tuple(row) for row in group.loc[:, columns].values.tolist()]
    key = tuple(lines)

    # Identifiers whose groups have identical rows end up in the same list
    if key not in duplicate_entries:
        duplicate_entries[key] = []
    duplicate_entries[key].append(identifier)

list(duplicate_entries.values())
> [[1000, 1002], [1001]]

EDIT:

And, to keep only the identifiers that actually have duplicates:

all_dup = [ids for ids in duplicate_entries.values() if len(ids) > 1]
> [[1000, 1002]]

(For clarity: df.groupby yields tuples whose first element is the group key, the "name", and whose second element is the sub-DataFrame; the loop above unpacks these into identifier and group. The lists in duplicate_entries therefore collect the identifiers of groups whose rows are identical.)
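As a side note, collections.defaultdict would remove the need for the membership check; a minimal sketch of the same loop:

from collections import defaultdict

duplicate_entries = defaultdict(list)

for identifier, group in input_file.groupby('Identifier'):
    key = tuple(tuple(row) for row in group.loc[:, columns].values.tolist())
    # defaultdict creates the missing empty list on first access
    duplicate_entries[key].append(identifier)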


Another approach: use groupby to split the frame into one sub-frame per "Identifier", normalize each sub-frame, and compare them directly with equals.

Assuming columns = ["Identifier", "Category1", "Category2", "Category3", "Category4", "Category5"]:

First, build the normalized groups (Identifier column dropped, rows sorted so row order doesn't matter, index reset):

groups = []
pure_groups = []
for name, group in df.groupby("Identifier"):
    pure_groups += [group]
    # Drop the Identifier column, then sort the rows and reset the index so
    # that identical groups compare equal regardless of original row order
    g_idfless = group[group.columns.difference(["Identifier"])]
    groups += [g_idfless.sort_values(columns[1:]).reset_index(drop=True)]

Then compare them pairwise:

for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        id1 = str(pure_groups[i]["Identifier"].iloc[0])
        id2 = str(pure_groups[j]["Identifier"].iloc[0])
        print(id1 + " and " + id2 + " equal?: " + str(groups[i].equals(groups[j])))

#-->1000 and 1001 equal?: False
#-->1000 and 1002 equal?: True
#-->1001 and 1002 equal?: False
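With millions of identifiers, this pairwise loop does a quadratic number of comparisons. A sketch of a one-pass alternative that reuses the groups and pure_groups lists built above, bucketing identical groups under a hashable key (the name buckets is my own):

from collections import defaultdict

buckets = defaultdict(list)
for pure, norm in zip(pure_groups, groups):
    # A tuple of row-tuples is hashable, so identical normalized groups share a key
    key = tuple(map(tuple, norm.values))
    buckets[key].append(pure["Identifier"].iloc[0])

print([ids for ids in buckets.values() if len(ids) > 1])
#--> [[1000, 1002]]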



Source: https://habr.com/ru/post/1678966/

