I am trying to solve a problem that involves comparing a large number of word sets for similarity and then grouping them into separate clusters. Each set is an ordered list of words drawn from a vocabulary of roughly 600+ words (very high dimensionality!). The solution should be as unsupervised as possible.
The data look like this:
[Apple, Banana, Orange ...]
[Apple, Banana, Grape ...]
[Jelly, Anise, Orange ...]
[Strawberry, Banana, Orange ...]
... etc.
Word order in each set matters: [Apple, Banana, Orange] is different from [Apple, Orange, Banana].
My approach so far: treat each word as a unique token, compute Levenshtein distance between the word sequences (capped at a distance threshold) in a Python script, build a similarity matrix from those distances, and feed that matrix into KNIME's k-Medoids node for the grouping.
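For reference, here is a minimal sketch of that pipeline in plain Python. It assumes a word-level (token-level) Levenshtein distance and uses scikit-learn-extra's `KMedoids` on a precomputed distance matrix as a stand-in for the KNIME k-Medoids node; the data and the cluster count are just placeholders.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

def token_levenshtein(a, b):
    """Levenshtein distance where each *word* (not character) is one symbol."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

word_sets = [
    ["Apple", "Banana", "Orange"],
    ["Apple", "Banana", "Grape"],
    ["Jelly", "Anise", "Orange"],
    ["Strawberry", "Banana", "Orange"],
]

# Pairwise distance matrix (symmetric, zero diagonal).
n = len(word_sets)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = token_levenshtein(word_sets[i], word_sets[j])
        dist[i, j] = dist[j, i] = d

# Cluster directly on the precomputed distance matrix.
model = KMedoids(n_clusters=2, metric="precomputed", random_state=0)
labels = model.fit_predict(dist)
print(labels)
```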
My questions:
- Is Levenshtein the most suitable distance metric to use for this problem?
- Is mean/medoid prototype-based clustering the best way to do the grouping?
- I have not thought much about how to validate the choice of "k" for the clustering. Would evaluating an SSE (elbow) curve be the best way to do this? (See the sketch after this list.)
- Are there any flaws in my methodology?
- As an alternative, would it make more sense to group sets directly by overlap, e.g. putting two sets together when one is 80% identical to the other?
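To illustrate the "k" question, here is a hedged sketch of how one might compare candidate values: it runs k-medoids on a precomputed distance matrix for a range of k and reports both the total medoid cost (an SSE-style elbow curve) and the silhouette score. The random matrix is only a placeholder for the real Levenshtein matrix, and scikit-learn-extra's `KMedoids` plus scikit-learn's `silhouette_score` are assumptions rather than part of the KNIME workflow itself.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

def evaluate_k(dist, k_values):
    """Return (k, total medoid cost, silhouette score) for each candidate k."""
    results = []
    for k in k_values:
        model = KMedoids(n_clusters=k, metric="precomputed", random_state=0)
        labels = model.fit_predict(dist)
        cost = model.inertia_  # sum of distances to assigned medoids (SSE analogue)
        sil = silhouette_score(dist, labels, metric="precomputed")
        results.append((k, cost, sil))
    return results

# Placeholder: a random symmetric distance matrix standing in for the
# Levenshtein matrix built from the real word sets.
rng = np.random.default_rng(0)
raw = rng.integers(1, 10, size=(30, 30)).astype(float)
dist = np.triu(raw, 1)
dist = dist + dist.T  # symmetric, zero diagonal

for k, cost, sil in evaluate_k(dist, range(2, 8)):
    print(f"k={k}: cost={cost:.1f}, silhouette={sil:.3f}")
```

Looking for an "elbow" in the cost curve and a peak in the silhouette score gives two complementary views on a reasonable k, rather than relying on SSE alone.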
Any advice is appreciated, thanks in advance!