Data Matching Algorithm

I am currently working on a project in which I have to implement a data matching algorithm. An external system passes over all the data it knows about a customer, and the system I am designing should return the matching customer. The external system then knows the correct customer identifier, and it can also receive additional data or update its own data for that customer.

The following fields are transferred:

  • Name
  • Name2
  • Street
  • City
  • Zipcode
  • BankAccountNumber
  • Bankname
  • Bankcode
  • Email
  • Phone
  • Fax
  • Web

The data can be of high quality with a lot of information available, but often the data is crappy: only the name and address are available, and they may contain spelling errors.

I am implementing the project in .NET. What I'm doing right now looks something like this:

public bool IsMatch(Customer customer)
{
    // CanIdentifyBy... just checks that the info is provided and has a minimum length (e.g. > 1)
    if (CanIdentifyByStreet() && CanIdentifyByBankAccountNumber())
    {
        // some parsing of strings is done before this point (substring, etc.)
        if (Street == customer.Street && AccountNumber == customer.BankAccountNumber)
            return true;
    }
    if (CanIdentifyByStreet() && CanIdentifyByZipCode() && CanIdentifyByName())
    {
        ...
    }

    // No rule matched.
    return false;
}

An alternative I am considering is a scoring approach, in which each field that matches adds to a total score:

public bool IsMatch(Customer customer)
{
    int matchingScore = 0;
    if (CanIdentifyByStreet())
    {
        if (....)
            matchingScore += 10;
    }
    if (CanIdentifyByName())
    {
        if (....)
            matchingScore += 10;
    }
    if (CanIdentifyByBankAccountNumber())
    {
        if (....)
            matchingScore += 10;
    }

    // iDontKnow: the threshold has not been determined yet
    return matchingScore > iDontKnow;
}
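As a rough illustration of the scoring idea, here is a minimal Java sketch. The `CustomerRecord` type, the three weights, and the case-insensitive comparison are all assumptions made for this example and are not taken from the question; in particular, the threshold is left as a parameter because the question does not settle on a value.

```java
// Hypothetical minimal customer record; only the fields used in this sketch.
final class CustomerRecord {
    final String name;
    final String street;
    final String bankAccountNumber;

    CustomerRecord(String name, String street, String bankAccountNumber) {
        this.name = name;
        this.street = street;
        this.bankAccountNumber = bankAccountNumber;
    }
}

final class ScoreMatcher {
    // Illustrative weights (assumptions): a bank account number is treated
    // as stronger evidence than a name or a street.
    static final int NAME_WEIGHT = 10;
    static final int STREET_WEIGHT = 10;
    static final int BANK_ACCOUNT_WEIGHT = 30;

    // Mirrors the "CanIdentifyBy..." checks: value present and longer than 1 character.
    static boolean canIdentifyBy(String value) {
        return value != null && value.trim().length() > 1;
    }

    // Two values agree only if both are usable and equal, ignoring case and outer whitespace.
    static boolean sameValue(String a, String b) {
        return canIdentifyBy(a) && canIdentifyBy(b)
                && a.trim().equalsIgnoreCase(b.trim());
    }

    // Sums the weights of all fields that agree.
    static int score(CustomerRecord a, CustomerRecord b) {
        int score = 0;
        if (sameValue(a.name, b.name)) score += NAME_WEIGHT;
        if (sameValue(a.street, b.street)) score += STREET_WEIGHT;
        if (sameValue(a.bankAccountNumber, b.bankAccountNumber)) score += BANK_ACCOUNT_WEIGHT;
        return score;
    }

    // The threshold is supplied by the caller; choosing it is the open question.
    static boolean isMatch(CustomerRecord a, CustomerRecord b, int threshold) {
        return score(a, b) > threshold;
    }
}
```

Unequal weights let one strong identifier outweigh several weak ones, which a flat "+10 per field" scheme cannot express.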



Levenshtein distance.
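For reference, a minimal sketch of the Levenshtein (edit) distance in Java, using the standard two-row dynamic programming formulation. The `similarity` helper, which normalizes the distance to a value between 0 and 1, is an addition for this example and is only one of several possible normalizations.

```java
final class Levenshtein {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,   // insertion
                                 prev[j] + 1),      // deletion
                        prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: 1.0 for identical strings, 0.0 for completely different ones.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(a, b) / max;
    }
}
```

A field comparison could then accept two names when, say, `Levenshtein.similarity(name1, name2)` exceeds some cutoff instead of requiring exact equality; the cutoff would have to be chosen from real data.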

, . , , 1920 . - , 192 East Pine Road, .



, , - , . , , , , (.. ), . , , ( matchPhoneNumber ..), .

interface Match
{
    boolean matches(Customer c1, Customer c2);
}

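class BankAccountMatch implements Match
{
    public boolean matches(Customer c1, Customer c2)
    {
        // Assuming the account number is a String: compare with equals(),
        // because == on objects compares references in Java.
        return c1.getBankAccountNumber().equals(c2.getBankAccountNumber());
    }
}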

static Match BANK_ACCOUNT_MATCH = new BankAccountMatch();

Match[][] validMatches = new Match[][] {
        {BANK_ACCOUNT_MATCH, NAME_MATCH},
        {NAME_MATCH, ADDRESS_MATCH, FAX_MATCH}, ...
};

, , validMatches , , . . .
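One way the `validMatches` table could be evaluated is sketched below. It assumes that two customers are considered the same when every `Match` in at least one row succeeds; that reading, the simplified `Customer` stand-in with public fields, and the null handling are assumptions made for this example.

```java
// Minimal stand-in for the Customer type; the fields are assumptions.
final class Customer {
    final String name;
    final String bankAccountNumber;
    final String fax;

    Customer(String name, String bankAccountNumber, String fax) {
        this.name = name;
        this.bankAccountNumber = bankAccountNumber;
        this.fax = fax;
    }
}

interface Match {
    boolean matches(Customer c1, Customer c2);
}

final class RuleMatcher {
    // A missing value never counts as a match.
    static boolean same(String a, String b) {
        return a != null && a.equals(b);
    }

    static final Match BANK_ACCOUNT_MATCH =
            (c1, c2) -> same(c1.bankAccountNumber, c2.bankAccountNumber);
    static final Match NAME_MATCH = (c1, c2) -> same(c1.name, c2.name);
    static final Match FAX_MATCH = (c1, c2) -> same(c1.fax, c2.fax);

    // Each row is one combination of checks that is sufficient on its own.
    static final Match[][] VALID_MATCHES = {
            {BANK_ACCOUNT_MATCH, NAME_MATCH},
            {NAME_MATCH, FAX_MATCH},
    };

    // True if all checks of at least one row succeed.
    static boolean isMatch(Customer c1, Customer c2) {
        for (Match[] combination : VALID_MATCHES) {
            boolean allMatch = true;
            for (Match match : combination) {
                if (!match.matches(c1, c2)) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }
}
```

With this layout, adding a new rule means adding a row to the table rather than another nested `if` block.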


, , . trie , , 10.



The per-field distances become your input space. Build a training set of known correct matches based on these distances. Run your favorite learning algorithm on it. Get your parameters for the decision function that reflects the strength of the match. Tune. Apply to new cases. Go to the bank.
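To make this concrete, here is a small sketch that learns a decision function from distance features. It uses logistic regression trained by stochastic gradient descent, which is a choice made for this example, since the answer does not name a specific algorithm. Each feature vector is assumed to hold one normalized distance per field (0 = identical, 1 = completely different), and the label is 1 for a known match and 0 otherwise.

```java
final class MatchClassifier {
    private final double[] weights; // one weight per distance feature
    private double bias;

    MatchClassifier(int featureCount) {
        this.weights = new double[featureCount];
    }

    // Estimated probability that the feature vector describes a true match.
    double probability(double[] distances) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * distances[i];
        }
        return 1.0 / (1.0 + Math.exp(-z)); // logistic (sigmoid) function
    }

    // Plain stochastic gradient descent on the logistic loss.
    void train(double[][] samples, int[] labels, int epochs, double learningRate) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int n = 0; n < samples.length; n++) {
                double error = labels[n] - probability(samples[n]);
                for (int i = 0; i < weights.length; i++) {
                    weights[i] += learningRate * error * samples[n][i];
                }
                bias += learningRate * error;
            }
        }
    }

    // The cutoff is a tuning parameter; 0.5 is only a common starting point.
    boolean isMatch(double[] distances, double cutoff) {
        return probability(distances) >= cutoff;
    }
}
```

The learned weights then play the role of the hand-picked "+10" scores, and the cutoff replaces the unknown threshold, both fitted to labeled examples rather than guessed.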


Source: https://habr.com/ru/post/1736710/

