Let's say I have an MDM (Master Data Management) system whose main purpose is detecting and preventing duplicate records.
Each time a sales representative enters a new customer into the system, my MDM platform checks it against existing records: it calculates the Levenshtein, Jaccard, or some other distance between pairs of words, phrases, or attributes, applies weights and coefficients, produces a similarity score, and so on.
Typical fuzzy match scenario.
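To make the setup concrete, here is a minimal, simplified sketch of the kind of scoring I mean (the attribute names, weights, and threshold idea are made up for illustration, not my actual configuration):

```python
# Simplified weighted fuzzy scoring, roughly what the MDM platform does today.
from difflib import SequenceMatcher

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def edit_sim(a: str, b: str) -> float:
    """Normalized edit-distance-style similarity via difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hand-tuned weights per attribute -- the part that is configured manually.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

def similarity(rec_a: dict, rec_b: dict) -> float:
    return (
        WEIGHTS["name"] * edit_sim(rec_a["name"], rec_b["name"])
        + WEIGHTS["address"] * jaccard(rec_a["address"], rec_b["address"])
        + WEIGHTS["phone"] * edit_sim(rec_a["phone"], rec_b["phone"])
    )  # flag as a duplicate if this exceeds a manually chosen threshold

a = {"name": "John Smith", "address": "12 Main Street Springfield", "phone": "555-0134"}
b = {"name": "Jon Smith", "address": "12 Main St Springfield", "phone": "555-0134"}
print(similarity(a, b))
```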
I would like to know whether it makes sense to apply machine learning methods at all to improve the matching output, i.e. to find duplicates with maximum accuracy.
And if so, where exactly would it make sense:
- optimizing the attribute weights (see the sketch after this list)?
- increasing confidence in the result by predicting the outcome of a match?
- learning the matching rules that I would otherwise have to configure in the algorithm by hand?
- something else?
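To illustrate what I imagine for the first two points, here is a hypothetical sketch using scikit-learn: the per-attribute similarities become features, pairs previously reviewed by data stewards become training data, and a logistic regression learns the weights and returns a match probability (all names and numbers are invented, not from my system):

```python
# Hypothetical: learn attribute weights from labeled duplicate / non-duplicate pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [name_sim, address_sim, phone_sim]; label 1 = duplicate, 0 = distinct.
# Toy data standing in for steward-reviewed pairs.
X = np.array([
    [0.95, 0.90, 1.00],
    [0.90, 0.70, 1.00],
    [0.40, 0.20, 0.00],
    [0.30, 0.50, 0.00],
    [0.85, 0.95, 0.00],
    [0.20, 0.10, 1.00],
])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(X, y)

# The learned coefficients play the role of my hand-tuned attribute weights...
print("learned attribute weights:", clf.coef_[0])

# ...and predict_proba gives a probability instead of a raw score.
candidate = np.array([[0.92, 0.80, 1.00]])
print("P(duplicate):", clf.predict_proba(candidate)[0, 1])
```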
There is also this excellent answer on the topic, but I could not quite tell whether the author actually used ML or not.
Also, my understanding is that a weighted fuzzy comparison is already a good enough solution, perhaps even from a financial point of view, since whenever you deploy such an MDM system you need to do some analysis and preprocessing anyway, whether that means manually coding the matching rules or training an ML algorithm.
So I'm not sure that adding ML would bring essential value.
Any thoughts are appreciated.