Probabilistic clustering methods for similar text data?

I have 20,000 company addresses taken from various documents, and they are all formatted differently. For instance:

  • Company A 12345 USA Street

  • CompanyA, Inc box2, 12345 street WA, US

  • Company B Company Ltd 123 happy street of Great Britain

  • Company B, Ltd 123, Happy Street, London, S1 1AA

I would like to be able to combine the entries for each company (i.e. in the example above, split them into 2 groups, one per company).

I have no idea how to do this. I assume that any clustering will be probabilistic in nature: it will probably work well for the more obvious matches, but require manual review for the less likely / more uncertain ones.

Can anyone name any methods suitable for this type of task?

Many thanks!

1 answer

Perhaps automatic grammar induction would produce results here. You could try to induce a grammar for each of your text entries, and then use some kind of similarity metric to cluster the resulting grammars.
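As a rough illustration of the "compare, then cluster" step only, here is a minimal Python sketch. It does not do grammar induction: it swaps in plain string similarity from the standard library's difflib, applied directly to normalized address text. The function names and both thresholds (AUTO_MERGE, NEEDS_REVIEW) are assumptions made up for this example and would need tuning on real data.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds, chosen only for illustration; tune them on real data.
AUTO_MERGE = 0.75    # at or above this score, merge without asking
NEEDS_REVIEW = 0.50  # between the two thresholds, queue for manual review


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Similarity score in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster(records):
    """Group records whose pairwise score is high; list uncertain pairs."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    review = []
    for i, j in combinations(range(len(records)), 2):
        score = similarity(records[i], records[j])
        if score >= AUTO_MERGE:
            parent[find(j)] = find(i)  # confident match: merge the two groups
        elif score >= NEEDS_REVIEW:
            review.append((records[i], records[j], round(score, 2)))

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values()), review


if __name__ == "__main__":
    addresses = [
        "Company A 12345 USA Street",
        "CompanyA, Inc box2, 12345 street WA, US",
        "Company B Company Ltd 123 happy street of Great Britain",
        "Company B, Ltd 123, Happy Street, London, S1 1AA",
    ]
    groups, review = cluster(addresses)
    for group in groups:
        print(group)
    print("To review:", review)
```

The two thresholds give the split described in the question: confident pairs are merged automatically, and borderline pairs are set aside for a person to check. Comparing every pair is quadratic, so for 20,000 records (roughly 200 million pairs) you would probably need to restrict comparisons to candidate pairs first, for example records that share a postcode or a name token.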


Source: https://habr.com/ru/post/1759786/

