Probabilistic clustering methods for similar text data?

Question

I have 20,000 company addresses taken from various documents, and they are all formatted differently. For instance:

Company A 12345 USA Street
CompanyA, Inc box2, 12345 street WA, US
Company B Company Ltd 123 happy street of Great Britain
Company B, Ltd 123, Happy Street, London, S1 1AA

I would like to be able to combine the entries for each company (i.e. in the example above, split them into 2 categories, one per company).

I have no idea how to do this. I assume that any clustering will be probabilistic in nature: it will probably work well for the more obvious matches, but will require a manual review of the less likely / more uncertain ones.

Can anyone name any methods suitable for this type of task?

Many thanks!

text-processing cluster-analysis

Airtiger Aug 15 '10 at 18:04

1 answer

Gian · Answer 1 · 2010-08-15T18:08:19+0000

Perhaps automatic grammar induction is a method that would produce results here. You could try to induce a grammar for each text, and then use some kind of comparison metric to cluster the resulting grammars.
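
To make the "comparison metric, then cluster" step concrete, here is a minimal sketch in Python. Note that it does not do grammar induction: as a much simpler stand-in it compares character trigrams with Jaccard similarity. The function names and both thresholds (`accept`, `review`) are assumptions made for illustration and would need tuning on real data. Pairs above `accept` are merged automatically, and pairs in the uncertain band are returned for the manual review mentioned in the question.

```python
import re
from itertools import combinations


def trigrams(text):
    """Character trigrams of the text, lowercased with non-alphanumerics removed."""
    s = re.sub(r"[^a-z0-9]", "", text.lower())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster(entries, accept=0.5, review=0.25):
    """Group entries by pairwise similarity.

    Pairs scoring >= accept are merged (union-find).
    Pairs scoring in [review, accept) are returned for manual review.
    Both thresholds are illustrative guesses, not recommended values.
    """
    parent = list(range(len(entries)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    to_review = []
    for i, j in combinations(range(len(entries)), 2):
        score = similarity(entries[i], entries[j])
        if score >= accept:
            parent[find(i)] = find(j)
        elif score >= review:
            to_review.append((i, j, score))

    groups = {}
    for i in range(len(entries)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values()), to_review


if __name__ == "__main__":
    addresses = [
        "Company A 12345 USA Street",
        "CompanyA, Inc box2, 12345 street WA, US",
        "Company B Company Ltd 123 happy street of Great Britain",
        "Company B, Ltd 123, Happy Street, London, S1 1AA",
    ]
    groups, to_review = cluster(addresses)
    print("groups:", groups)
    print("needs review:", to_review)
```

Comparing every pair is quadratic in the number of entries, so for 20,000 addresses you would probably want to compare only within candidate blocks (for example, entries that share a postcode or a leading name token) rather than across the whole set.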
