Best distance metric besides Levenshtein for ordered lists of words, and subsequent clustering

I am trying to solve a problem that involves comparing a large number of word lists for similarity and then grouping them into clusters. Each list is an ordered sequence of words drawn from a shared vocabulary of roughly 600+ words, so the dimensionality is very large. The solution should be as unsupervised as possible.

The data look like this:

[Apple, Banana, Orange ...]
[Apple, Banana, Grape ...]
[Jelly, Anise, Orange ...]
[Strawberry, Banana, Orange ...]
... etc.

Word order within each list matters: [Apple, Banana, Orange] is different from [Apple, Orange, Banana].

The approach I have used so far is to compute Levenshtein distance (capped at a distance threshold) in a Python script, treating each word as a single unique token, derive a similarity matrix from the distances, and feed that matrix into k-medoids in KNIME for the grouping.
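For concreteness, here is a minimal sketch of that distance computation in plain Python. This is my illustration, not the original script: levenshtein() and distance_matrix() are hypothetical names, and treating each word as one token follows the description above.

    from itertools import combinations

    def levenshtein(a, b):
        """Edit distance between two word lists, one word = one token."""
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, start=1):
            curr = [i]
            for j, wb in enumerate(b, start=1):
                cost = 0 if wa == wb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def distance_matrix(sequences):
        """Symmetric matrix of pairwise distances between word lists."""
        n = len(sequences)
        dist = [[0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            dist[i][j] = dist[j][i] = levenshtein(sequences[i], sequences[j])
        return dist

On this metric, [Apple, Banana, Orange] versus [Apple, Orange, Banana] gives a distance of 2 (two substitutions), which is exactly what makes it order-sensitive.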

My questions:

  • Is Levenshtein the most suitable distance metric to use for this problem?
  • Is a mean/medoid prototype cluster the best way to do the grouping? (See the clustering sketch after this list.)
  • I have not thought much about how to validate the choice of "k" for the clustering. Would evaluating an SSE curve of the clusterings be the best way to do this?
  • Are there any flaws in my methodology?
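For the clustering step itself, the pipeline above uses KNIME; as a rough Python stand-in (my assumption, not the setup described in the question), scikit-learn-extra's KMedoids accepts a precomputed distance matrix directly:

    import numpy as np
    from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

    sequences = [
        ["Apple", "Banana", "Orange"],
        ["Apple", "Banana", "Grape"],
        ["Jelly", "Anise", "Orange"],
        ["Strawberry", "Banana", "Orange"],
    ]

    # Reuses distance_matrix() from the earlier sketch.
    dist = np.array(distance_matrix(sequences), dtype=float)

    model = KMedoids(n_clusters=2, metric="precomputed",
                     method="pam", random_state=0)
    labels = model.fit_predict(dist)
    print(labels)  # one cluster index per word list

Passing metric="precomputed" is the important part: it keeps the clustering on the Levenshtein matrix itself instead of asking the algorithm to re-embed the word lists in a vector space.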

Any thoughts or suggestions are appreciated.

Thanks!


Overall, your approach sounds reasonable. Since all you have is a pairwise distance matrix rather than points in a vector space, a "prototype"-based clustering whose prototypes are actual data points (k-medoids rather than k-means) is a sensible choice.

One caveat: raw Levenshtein distance is not normalized for length, so ([Apple, Banana], [Carrot, Banana]) looks "closer" than ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]), even though relative to their lengths the two pairs are equally similar. Consider normalizing the distance by the length of the longer list.
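A minimal way to apply that, reusing the levenshtein() sketch from earlier and assuming the common convention of dividing by the longer list's length (the normalization scheme is my choice, not specified above):

    def normalized_levenshtein(a, b):
        """Scale the edit distance into [0, 1] by the longer list's length."""
        if not a and not b:
            return 0.0
        return levenshtein(a, b) / max(len(a), len(b))

    # Raw distances are 1 and 2; normalized, both pairs come out at 0.5.
    normalized_levenshtein(["Apple", "Banana"], ["Carrot", "Banana"])
    normalized_levenshtein(["Apple", "Banana", "Widget", "Xylophone"],
                           ["Carrot", "Banana", "Yam", "Xylophone"])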

, " " : , ( ) , , . , , . ( , , , , .)


Take a look at the SimMetrics library on SourceForge, which implements a wide range of string similarity and distance metrics that you could swap in for Levenshtein and compare.

For the k-means side of things, see K-Now.co.uk.


Source: https://habr.com/ru/post/1778008/

