Best distance metric besides Levenshtein for ordered lists of words, and subsequent clustering

I am trying to solve a problem that involves comparing a large number of word lists for similarity and then grouping them into clusters. Each list is an ordered sequence of words drawn from a shared vocabulary of roughly 600+ words, so the dimensionality is very large. The solution should be as unsupervised as possible.

The data look like this:

[Apple, Banana, Orange ...]
[Apple, Banana, Grape ...]
[Jelly, Anise, Orange ...]
[Strawberry, Banana, Orange ...]
... etc.

Word order within each list matters: [Apple, Banana, Orange] is different from [Apple, Orange, Banana].

The approach I have used so far is to compute Levenshtein distance (capped at a distance threshold) in a Python script, treating each word as a single unique token, derive a similarity matrix from the distances, and feed that matrix into k-medoids in KNIME for the grouping.
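For concreteness, here is a minimal sketch of that distance computation in plain Python. This is my illustration, not the original script: levenshtein() and distance_matrix() are hypothetical names, and treating each word as one token follows the description above.

    from itertools import combinations

    def levenshtein(a, b):
        """Edit distance between two word lists, one word = one token."""
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, start=1):
            curr = [i]
            for j, wb in enumerate(b, start=1):
                cost = 0 if wa == wb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def distance_matrix(sequences):
        """Symmetric matrix of pairwise distances between word lists."""
        n = len(sequences)
        dist = [[0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            dist[i][j] = dist[j][i] = levenshtein(sequences[i], sequences[j])
        return dist

On this metric, [Apple, Banana, Orange] versus [Apple, Orange, Banana] gives a distance of 2 (two substitutions), which is exactly what makes it order-sensitive.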

My questions:

  • Is Levenshtein the most suitable distance metric to use for this problem?
  • Is a mean/medoid prototype cluster the best way to do the grouping? (See the clustering sketch after this list.)
  • I have not thought much about how to validate the choice of "k" for the clustering. Would evaluating an SSE curve of the clusterings be the best way to do this?
  • Are there any flaws in my methodology?
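For the clustering step itself, the pipeline above uses KNIME; as a rough Python stand-in (my assumption, not the setup described in the question), scikit-learn-extra's KMedoids accepts a precomputed distance matrix directly:

    import numpy as np
    from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

    sequences = [
        ["Apple", "Banana", "Orange"],
        ["Apple", "Banana", "Grape"],
        ["Jelly", "Anise", "Orange"],
        ["Strawberry", "Banana", "Orange"],
    ]

    # Reuses distance_matrix() from the earlier sketch.
    dist = np.array(distance_matrix(sequences), dtype=float)

    model = KMedoids(n_clusters=2, metric="precomputed",
                     method="pam", random_state=0)
    labels = model.fit_predict(dist)
    print(labels)  # one cluster index per word list

Passing metric="precomputed" is the important part: it keeps the clustering on the Levenshtein matrix itself instead of asking the algorithm to re-embed the word lists in a vector space.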

Any thoughts or suggestions are appreciated.

Thanks!


Overall, your approach sounds reasonable. Since all you have is a pairwise distance matrix rather than points in a vector space, a "prototype"-based clustering whose prototypes are actual data points (k-medoids rather than k-means) is a sensible choice.

One caveat: raw Levenshtein distance is not normalized for length, so ([Apple, Banana], [Carrot, Banana]) looks "closer" than ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]), even though relative to their lengths the two pairs are equally similar. Consider normalizing the distance by the length of the longer list.
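A minimal way to apply that, reusing the levenshtein() sketch from earlier and assuming the common convention of dividing by the longer list's length (the normalization scheme is my choice, not specified above):

    def normalized_levenshtein(a, b):
        """Scale the edit distance into [0, 1] by the longer list's length."""
        if not a and not b:
            return 0.0
        return levenshtein(a, b) / max(len(a), len(b))

    # Raw distances are 1 and 2; normalized, both pairs come out at 0.5.
    normalized_levenshtein(["Apple", "Banana"], ["Carrot", "Banana"])
    normalized_levenshtein(["Apple", "Banana", "Widget", "Xylophone"],
                           ["Carrot", "Banana", "Yam", "Xylophone"])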

, " " : , ( ) , , . , , . ( , , , , .)


Take a look at the SimMetrics library on SourceForge, which implements a wide range of string similarity and distance metrics that you could swap in for Levenshtein and compare.

For the k-means side of things, see K-Now.co.uk.


Source: https://habr.com/ru/post/1778008/

