I work at a public health agency that has many different demographic datasets, stored across SQL Server, Access and Excel. I wrote an application that lets people find "matches" across these datasets based on criteria configured through a graphical interface. For example, one match might be rows where first name, last name and DOB agree in both datasets, but the SSN is "off by one" (as determined by Levenshtein distance).
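(For reference, the Levenshtein() call in the code below is just a plain edit-distance routine. A minimal VB.NET sketch, with an assumed signature, would be something like:

Function Levenshtein(a As String, b As String) As Integer
    ' standard dynamic-programming edit distance
    Dim d(a.Length, b.Length) As Integer
    For i = 0 To a.Length : d(i, 0) = i : Next
    For j = 0 To b.Length : d(0, j) = j : Next
    For i = 1 To a.Length
        For j = 1 To b.Length
            Dim cost = If(a(i - 1) = b(j - 1), 0, 1)
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next
    Return d(a.Length, b.Length)
End Function

Any standard implementation works the same way; this is only here so the queries below are readable.)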
These are large datasets, and the matching criteria can get very complex. Right now I find matches by pulling both datasets into in-memory DataTables and then iterating row by row through the first table, using LINQ to see whether any rows in the second table match. So my code looks something like this:
For Each table1Row In table1Rows
    ' first and last name each at most one edit away ("off by one")
    Dim table2Options = From k In table2Rows
                        Where Levenshtein(table1Row.first, k.first) < 2 AndAlso
                              Levenshtein(table1Row.last, k.last) < 2
    If table2Options.Count() > 0 Then
        ' this table1 row "matches" table 2
    End If
Next
The code produces the correct results (it finds the matches), but it is SLOW. I expected the row-by-row loop to be slow, but doing everything in a single LINQ cross-join query turned out to be even slower:
From l In table1, k In table2
Where Levenshtein(l.first, k.first) < 2 AndAlso Levenshtein(l.last, k.last) < 2
Select l
Any ideas on how to make this code faster?