I have 20,000 company addresses from various documents, and they are all formatted differently. For instance:
Company A 12345 USA Street
CompanyA, Inc box2, 12345 street WA, US
Company B Company Ltd 123 happy street of Great Britain
Company B, Ltd 123, Happy Street, London, S1 1AA
I would like to be able to combine the entries for each company (i.e. the examples above would be split into 2 groups, one per company).
I have no idea how to do this. I assume that any clustering will be probabilistic in nature: it probably works well for the more obvious matches, but then requires a manual review for the less likely / more uncertain ones.
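To make the idea concrete, here is a rough sketch of the kind of workflow I am imagining, not a method I know to be right for this. It normalizes each entry, scores every pair with the standard library's `difflib.SequenceMatcher`, and sorts the pairs into "match", "review", and "different" buckets. The function names and both thresholds are my own placeholders, picked arbitrarily for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Assumed cut-offs, chosen only for illustration; they would need tuning.
AUTO_MATCH = 0.85    # at or above this, treat the pair as the same company
NEEDS_REVIEW = 0.50  # between the two cut-offs, send the pair to manual review


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized entries."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(score):
    """Turn a similarity score into one of three buckets."""
    if score >= AUTO_MATCH:
        return "match"
    if score >= NEEDS_REVIEW:
        return "review"
    return "different"


def compare_all(records):
    """Score every unordered pair of records and label it."""
    return [
        (a, b, classify(similarity(a, b)))
        for a, b in combinations(records, 2)
    ]


if __name__ == "__main__":
    sample = [
        "Company A 12345 USA Street",
        "CompanyA, Inc box2, 12345 street WA, US",
        "Company B Company Ltd 123 happy street of Great Britain",
        "Company B, Ltd 123, Happy Street, London, S1 1AA",
    ]
    for a, b, label in compare_all(sample):
        print(f"{label:9} | {a} <-> {b}")
```

Comparing every pair would not scale to my real data (20,000 entries give roughly 200 million pairs), so I assume a practical method would first restrict comparisons to plausible candidates.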
Can anyone name any methods suitable for this type of task?
Many thanks!