How to determine if a record in each source represents the same person

I have several sources of tables with personal data, for example:

SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...

SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...

SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...

So, assuming the records with ID 1 in sources 1 and 2 are the same person, my problem is how to determine whether records in different sources represent the same person. In addition, not all records exist in all sources. Most of the names are Spanish.

In this case, exact matching has to be relaxed, since we assume the data sources were not strictly validated against the country's official identification office. We also have to assume that typos are common, given how the data was collected. Moreover, each source contains about 2 to 3 million records.

Our team's plan is as follows. First, apply exact matching on selected fields, such as ID number and names, to gauge how hard the problem is. Second, relax the matching criteria and count how many additional records can be matched. This is where the problem arises: how can we relax the criteria without generating too much noise, yet without being too restrictive?
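To make the two-pass idea concrete, here is a minimal sketch in Python. It assumes records are reduced to (first_name, last_name) tuples and uses the standard library's difflib.SequenceMatcher as the similarity measure. The function names and the 0.8 threshold are illustrative assumptions, not a recommendation; the threshold is the knob that trades noise against missed matches.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio for two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def classify_pair(rec_a, rec_b, threshold=0.8):
    """Classify two (first_name, last_name) tuples.

    Pass 1: exact agreement on both fields.
    Pass 2: relaxed agreement, where the weaker of the two field
            similarities must still reach `threshold` (an assumed value).
    """
    first_a, last_a = (s.strip().lower() for s in rec_a)
    first_b, last_b = (s.strip().lower() for s in rec_b)

    if first_a == first_b and last_a == last_b:
        return "exact"

    # Using min() means one badly mismatched field vetoes the pair.
    score = min(similarity(first_a, first_b), similarity(last_a, last_b))
    return "fuzzy" if score >= threshold else "no_match"


if __name__ == "__main__":
    print(classify_pair(("jhon", "gates"), ("jon", "gate")))
    print(classify_pair(("jhon", "gates"), ("jhon", "ballmer")))
```

Counting how many pairs move from "no_match" to "fuzzy" as the threshold is lowered, and spot-checking a sample at each step, is one way to see where the noise starts to dominate.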





SSIS,


, Postgresql 8.3


You can try normalizing the names by comparing them against a dictionary.
This will let you identify some typical typos and correct them.
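A rough sketch of the dictionary idea, using difflib.get_close_matches from the Python standard library. The KNOWN_FIRST_NAMES list, the normalize_name helper and the 0.7 cutoff are hypothetical; a real dictionary would have to come from a trusted list of Spanish given names and surnames.

```python
from difflib import get_close_matches

# Hypothetical reference dictionary; in practice this would be a
# curated list of valid names, not four hard-coded entries.
KNOWN_FIRST_NAMES = ["john", "juan", "jose", "maria"]


def normalize_name(name, dictionary=KNOWN_FIRST_NAMES, cutoff=0.7):
    """Map a possibly misspelled name to its closest dictionary entry.

    If nothing in the dictionary is at least `cutoff` similar,
    the cleaned input is returned unchanged.
    """
    cleaned = name.strip().lower()
    matches = get_close_matches(cleaned, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


if __name__ == "__main__":
    for raw in ["jhon", "jon", "xyz"]:
        print(raw, "->", normalize_name(raw))
```

Normalizing both sources this way before matching lets some records fall back into the exact-match pass. A cutoff that is too low will silently merge distinct names, so corrections are worth logging and reviewing.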


It sounds to me like you have a record linkage problem. You can use the references in the link.
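For a feel of what record linkage looks like at this scale, here is a small sketch of one common ingredient, blocking: only records that share a cheap key are compared in detail, which avoids comparing every record with every other one across sources of millions of rows. The blocking key (first letter of the last name), the field names and the 0.8 threshold are all assumptions made for illustration. Note that a typo in that first letter would hide a pair, so real systems usually combine several blocking keys.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    """Hypothetical blocking key: first letter of the last name."""
    return record["last_name"].strip().lower()[:1]


def name_score(rec_a, rec_b):
    """Similarity (0..1) of the full names of two records."""
    full_a = f"{rec_a['first_name']} {rec_a['last_name']}".lower()
    full_b = f"{rec_b['first_name']} {rec_b['last_name']}".lower()
    return SequenceMatcher(None, full_a, full_b).ratio()


def link(source_a, source_b, threshold=0.8):
    """Return (id_a, id_b, score) for candidate pairs scoring at least `threshold`."""
    # Index source_b by blocking key so each record of source_a is
    # compared only against records in the same block.
    blocks = defaultdict(list)
    for rec in source_b:
        blocks[block_key(rec)].append(rec)

    links = []
    for rec_a in source_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            score = name_score(rec_a, rec_b)
            if score >= threshold:
                links.append((rec_a["id"], rec_b["id"], round(score, 3)))
    return links


if __name__ == "__main__":
    source1 = [{"id": 1, "first_name": "jhon", "last_name": "gates"}]
    source2 = [{"id": 1, "first_name": "jon", "last_name": "gate"}]
    source3 = [{"id": 2, "first_name": "jhon", "last_name": "ballmer"}]
    print(link(source1, source2))
    print(link(source1, source3))
```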


Source: https://habr.com/ru/post/1697408/
