Python: removing duplicates based on a unique combination of two columns and a condition on a third column

The task is as follows: I have two datasets that I want to combine into one. The datasets do not share a key column. I would like to exclude duplicates based on unique combinations of column 1 and column 2 and on the similarity of column 3. By similarity, I mean that the values of column 3 in dataset A may be slightly smaller or larger than in dataset B; for example, for the value 20, values in the range [18, 22] count as similar. Here is an example:

Dataset A:

 Col1 | Col2 | Col3 |
1 A   | A    | 10   |
2 B   | A    | 20   |
3 A   | B    | 10   |
4 B   | B    | 20   |

Dataset B:

 Col1 | Col2 | Col3 |
1 A   | A    | 10   |
2 B   | A    | 21   |
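3 A   | B    | 100  |

  • Row 1 is exactly the same in both datasets, so I want to include only one row in my final dataset.
  • Row 2 in dataset A and row 2 in dataset B have the same values in columns 1 and 2. The column 3 values are similar: 20 and 21. I want to treat these rows as duplicates and keep only the row from dataset A.
  • Row 3 in dataset A and row 3 in dataset B have the same values in columns 1 and 2, but the column 3 values are far apart: 10 and 100. I want to keep both rows.
  • Row 4 exists only in dataset A, not in dataset B, so I want to keep it.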

Expected result:

 Col1 | Col2 | Col3 |
1 A   | A    | 10   |
2 B   | A    | 20   |
3 A   | B    | 10   |
4 A   | B    | 100  |
5 B   | B    | 20   |

Is there a way to do this faster than O(n^2)?

Answer:

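Build a mapping (Col1, Col2) -> list of Col3 values from A. Then iterate over B and, for each row, check whether its Col3 is close to any Col3 value from A with the same Col1 and Col2.

Pseudocode, since I am not familiar with pandas: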

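from collections import defaultdict

some_value = 2  # tolerance; assumed from the question's range [18, 22] around 20

def is_close(a, b):
    # <= so that the ends of the range (18 and 22 for 20) count as close
    return abs(a - b) <= some_value

# A and B are lists of (col1, col2, col3) rows.
# Map each (col1, col2) pair to the col3 values seen in A.
d = defaultdict(list)
for col1, col2, col3 in A:
    d[(col1, col2)].append(col3)

result = list(A)  # every row of A is kept
for col1, col2, col3 in B:
    # Keep a row of B only if no row of A with the same key has a close col3.
    if not any(is_close(col3, x) for x in d[(col1, col2)]):
        result.append((col1, col2, col3))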
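To check the idea against the sample data from the question, here is a minimal runnable sketch. The function name `merge_without_near_duplicates`, the tuple representation of rows, and the tolerance of 2 (taken from the [18, 22] range around 20) are assumptions made for illustration, not part of the original answer.

```python
from collections import defaultdict

TOLERANCE = 2  # assumed from the question: 18..22 counts as close to 20


def is_close(a, b, tolerance=TOLERANCE):
    # Inclusive comparison, so both ends of the range match.
    return abs(a - b) <= tolerance


def merge_without_near_duplicates(rows_a, rows_b):
    """Return all rows of A plus the rows of B that have no close match in A."""
    # Group the col3 values of A by their (col1, col2) key.
    col3_by_key = defaultdict(list)
    for col1, col2, col3 in rows_a:
        col3_by_key[(col1, col2)].append(col3)

    result = list(rows_a)
    for col1, col2, col3 in rows_b:
        # Keys that occur only in B have no candidates, so the row is kept.
        candidates = col3_by_key.get((col1, col2), ())
        if not any(is_close(col3, x) for x in candidates):
            result.append((col1, col2, col3))
    return result


dataset_a = [("A", "A", 10), ("B", "A", 20), ("A", "B", 10), ("B", "B", 20)]
dataset_b = [("A", "A", 10), ("B", "A", 21), ("A", "B", 100)]

merged = merge_without_near_duplicates(dataset_a, dataset_b)
# Sort by Col2, then Col1, then Col3 to match the order in the question.
for row in sorted(merged, key=lambda r: (r[1], r[0], r[2])):
    print(row)
```

On this input the sketch prints the five rows of the expected result: the rows (A, A, 10) and (B, A, 21) from dataset B are dropped as duplicates, while (A, B, 100) is kept.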

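Using a defaultdict means that a (Col1, Col2) combination that appears in B but not in A needs no special handling: the lookup simply returns an empty list. The whole procedure is O(n), as long as each row of B has only a few rows in A with the same Col1 and Col2.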
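If many rows of A share the same Col1 and Col2, the linear scan over the candidates is no longer cheap. One possible refinement, which is not part of the original answer, is to keep each list of Col3 values sorted and find a close value by binary search. The helper names `build_index` and `has_close_value` and the `tolerance` parameter below are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(rows_a):
    """Map each (col1, col2) key to a sorted list of the col3 values in A."""
    index = defaultdict(list)
    for col1, col2, col3 in rows_a:
        index[(col1, col2)].append(col3)
    for values in index.values():
        values.sort()
    return index


def has_close_value(sorted_values, target, tolerance):
    """True if some v in sorted_values satisfies |v - target| <= tolerance."""
    # Position of the first value that is >= target - tolerance.
    i = bisect_left(sorted_values, target - tolerance)
    # That value is the only candidate worth checking: everything before it
    # is too small, and everything after it is at least as large.
    return i < len(sorted_values) and sorted_values[i] <= target + tolerance


index = build_index([("A", "B", 10), ("A", "B", 50), ("A", "B", 90)])
print(has_close_value(index[("A", "B")], 51, 2))   # close to 50
print(has_close_value(index[("A", "B")], 100, 2))  # nothing within 2 of 100
```

Each lookup then costs O(log k) for a key with k values in A, at the price of sorting the lists once up front.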


Source: https://habr.com/ru/post/1679642/

