What is the fastest way to find similar records from different (large) data sources

I work at a public health agency that has many different demographic datasets, stored in SQL Server, Access, and Excel. I wrote an application that lets people find "matches" across these datasets based on various criteria configured through a graphical interface. For example, one match might be where first name, last name, and DOB all agree in both datasets, but the SSN is "off by one" (determined with the Levenshtein algorithm).

These are large datasets, and the matching criteria can become very complex. Right now I find matches by pulling both datasets into in-memory data tables and then iterating row by row through the first table, checking whether any rows in the second table match (using LINQ). So my code looks something like this:

    For Each table1Row In TableOne ' datasource one
        ' first name and last name each allowed to be "off by one"
        table2Options = From l In table2Rows
                        Where Levenshtein(table1Row.first, l.first) < 2 _
                          And Levenshtein(table1Row.last, l.last) < 2
        If table2Options.Count > 0 Then
            ' this row in table1 "matches" table2
        End If
    Next

The code produces the correct result (it finds the matches), but it is SLOW. I know the row-by-row looping will be slow, but writing it as one big LINQ query over both tables is even slower:

    From l In table1, k In table2
    Where Levenshtein(l.first, k.first) < 2 And Levenshtein(l.last, k.last) < 2
    Select l
    ' this takes forever, because it evaluates the function for (rows in table1) * (rows in table2) pairs

Any ideas on how to make this code faster?

+4
4 answers

The Wikipedia article on Levenshtein distance lists a number of possible improvements that should help. In your case, since you are only interested in distances below a threshold, note that if the strings are of equal length (for example, Social Security numbers) you can use the Hamming distance instead. For strings of unequal length, you only need to compute a diagonal band of width 2k + 1 in the matrix, where k is your threshold.
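
A minimal sketch of that banded, threshold-limited idea (the method name and signature are mine, not from the answer): only cells with |i - j| <= k are filled in, and the computation bails out as soon as a whole row exceeds the threshold.

    using System;

    static class FuzzyDistance
    {
        // Levenshtein distance capped at threshold k. Only the diagonal band
        // |i - j| <= k of the DP matrix is computed; any value above k is
        // reported as k + 1, meaning "too far apart to be a match".
        public static int BoundedLevenshtein(string s, string t, int k)
        {
            if (Math.Abs(s.Length - t.Length) > k) return k + 1; // lengths alone rule it out

            var prev = new int[t.Length + 1];
            var curr = new int[t.Length + 1];
            for (int j = 0; j <= t.Length; j++) prev[j] = j;

            for (int i = 1; i <= s.Length; i++)
            {
                for (int j = 0; j <= t.Length; j++) curr[j] = k + 1; // cells outside the band count as "over threshold"
                curr[0] = i;

                int lo = Math.Max(1, i - k);
                int hi = Math.Min(t.Length, i + k);
                int rowMin = k + 1;

                for (int j = lo; j <= hi; j++)
                {
                    int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                    curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
                    rowMin = Math.Min(rowMin, curr[j]);
                }

                if (rowMin > k) return k + 1; // early exit: nothing in this row can still lead to a match

                var tmp = prev; prev = curr; curr = tmp;
            }
            return Math.Min(prev[t.Length], k + 1);
        }
    }

For the matching in the question, a call like BoundedLevenshtein(table1Row.first, l.first, 1) < 2 does the same job as the unrestricted function, but stops working on a pair of names as soon as they are clearly too different.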

Matching two lists of strings against each other also opens up optimizations inside the Levenshtein algorithm itself. For example, the work done for a word starting with "a" can be reused for every other word starting with "a", because the matrix rows for a shared prefix are identical. I will see if I can find a reference implementation that already exploits this.
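
A sketch of that prefix-sharing idea, assuming the second table's names have been sorted first so that consecutive candidates share prefixes (all names and the helper are hypothetical):

    using System;
    using System.Collections.Generic;

    static class PrefixSharedLevenshtein
    {
        // Finds every candidate within edit distance k of the query. Rows of the
        // DP matrix are indexed by candidate characters, so the rows belonging to
        // a prefix shared with the previous (sorted) candidate can be kept as-is.
        public static IEnumerable<string> MatchesWithin(string query, IReadOnlyList<string> sortedCandidates, int k)
        {
            var rows = new List<int[]>();
            var row0 = new int[query.Length + 1];
            for (int j = 0; j <= query.Length; j++) row0[j] = j;
            rows.Add(row0);

            string prev = "";
            foreach (var cand in sortedCandidates)
            {
                // how many leading characters this candidate shares with the previous one
                int keep = 0, limit = Math.Min(prev.Length, cand.Length);
                while (keep < limit && prev[keep] == cand[keep]) keep++;
                rows.RemoveRange(keep + 1, rows.Count - keep - 1); // rows for the shared prefix stay valid

                for (int i = keep + 1; i <= cand.Length; i++)      // recompute only the rest
                {
                    var above = rows[i - 1];
                    var row = new int[query.Length + 1];
                    row[0] = i;
                    for (int j = 1; j <= query.Length; j++)
                    {
                        int cost = cand[i - 1] == query[j - 1] ? 0 : 1;
                        row[j] = Math.Min(Math.Min(above[j] + 1, row[j - 1] + 1), above[j - 1] + cost);
                    }
                    rows.Add(row);
                }

                if (rows[cand.Length][query.Length] <= k) yield return cand;
                prev = cand;
            }
        }
    }

The more duplicated prefixes there are in the sorted list (very common with surnames), the more rows get skipped.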

+2

Add an If check so that FirstName is not compared at all when LastName does not match.

99% of last names will not match, so you will rarely have to run the first-name check at all. That alone should cut the processing time almost in half.


Another thing you can do is add extra rules (as long as they preserve the intent of the fuzzy match). For example, if you add a rule that says "the first letter of LastName must match", you can filter on that first, which gives you a huge boost in efficiency.
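
A sketch combining both suggestions, reusing the hypothetical names from the question's pseudocode (the Levenshtein helper is assumed, and last names are assumed non-empty):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Group table2 once by the first letter of the last name, so each table1 row is
    // only compared against a small bucket instead of the whole second table.
    var buckets = table2Rows
        .GroupBy(r => char.ToUpperInvariant(r.last[0]))
        .ToDictionary(g => g.Key, g => g.ToList());

    foreach (var row1 in table1Rows)
    {
        if (!buckets.TryGetValue(char.ToUpperInvariant(row1.last[0]), out var candidates))
            continue; // nothing in table2 even starts with the same letter

        var matches = candidates.Where(r =>
            Levenshtein(row1.last, r.last) < 2 &&   // fails ~99% of the time, so it runs first
            Levenshtein(row1.first, r.first) < 2);  // only evaluated when the last name survives

        if (matches.Any())
        {
            // row1 "matches" table2
        }
    }

Because && short-circuits, the first-name Levenshtein call is skipped whenever the last-name check has already failed.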

+3

Could you collect all the data needed into memory up front and then do the analysis there? I replaced code like this:

    for each day in months:
        for each customer in customers:
            computed = computeSalesInDatabase(customer, day)
            saveComputed(computed)

With this code:

    for each day in months:
        dailyData = getFullDay(day)
        for each customer in customers:
            computed = computeSalesInMemory(customer, day, dailyData)
            saveComputed(computed)

And I found that it performed much better.

+1

I would look at porting this to some kind of NoSQL implementation. You can look at this StackOverflow article on NoSQL and LINQ for more information.

If you cannot do a full NoSQL implementation, you may still want to borrow some ideas from it, for example working with key-value pairs to form the connections between datasets.
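
A minimal sketch of the key-value idea (field and helper names are hypothetical; DOB is used as the key only because the question treats it as an exact-match field): index the second dataset once under a cheap exact key and look candidates up instead of scanning everything.

    using System.Collections.Generic;
    using System.Linq;

    // Build a key-value index over table2: DOB -> rows sharing that DOB.
    var table2ByDob = table2Rows
        .GroupBy(r => r.dob)
        .ToDictionary(g => g.Key, g => g.ToList());

    foreach (var row1 in table1Rows)
    {
        if (!table2ByDob.TryGetValue(row1.dob, out var sameDob))
            continue; // no key-value connection for this row

        // the expensive fuzzy comparisons only run inside this small bucket
        var matches = sameDob.Where(r =>
            Levenshtein(row1.last, r.last) < 2 &&
            Levenshtein(row1.first, r.first) < 2);
    }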

+1

Source: https://habr.com/ru/post/1400018/

