Clustering, at its simplest, scores each element of the data array according to some metric, sorts the array by that score, then slices it into a number of segments and assigns a label to each. In other words, you can cluster any data for which you can formulate a meaningful function that calculates the similarity of each data point with respect to the others; such a function is commonly called a similarity metric.
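To make the score/sort/slice idea concrete, here is a minimal base-R sketch; the data vector and the scoring function (string length via nchar) are made up for illustration and stand in for a real similarity metric:

data <- c('cat', 'hat', 'catwalk', 'sidewalk')
scores <- nchar(data)                # score each element by some metric
sorted <- data[order(scores)]        # sort the data by that score
segments <- cut(seq_along(sorted), breaks = 2, labels = c('A', 'B'))
split(sorted, segments)              # slice into labeled segments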
There are many such metrics, but only a few of them are useful for comparing strings. Of these, perhaps the most commonly used is Levenshtein distance (also known as edit distance).
This metric is expressed as an integer, and it increases by one unit (+1) for each "edit" - an insertion, deletion, or substitution of a single character - needed to transform one string into the other. The minimum total number of such edits is the Levenshtein distance.
The R vwr package includes an implementation:
> library(vwr)
> levenshtein.distance('cat', 'hat')
hat
  1
> levenshtein.distance('cat', 'catwalk')
catwalk
      4
> levenshtein.distance('catwalk', 'sidewalk')
sidewalk
       4
>
While Levenshtein distance can be used to cluster your data, whether it should be used for your data is a question I will leave to you (the main use case for Levenshtein distance is clearly plain-text data). A sketch of one way to do this follows.
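Here is a minimal sketch of clustering strings on Levenshtein distance in base R, using adist() to build the pairwise distance matrix and hclust() for hierarchical clustering; the word vector and the choice of two clusters are made up for illustration:

words <- c('cat', 'hat', 'catwalk', 'sidewalk', 'boardwalk')
d <- adist(words)                  # pairwise Levenshtein distance matrix
dimnames(d) <- list(words, words)  # label rows/columns for readability
hc <- hclust(as.dist(d))           # hierarchical clustering on those distances
cutree(hc, k = 2)                  # cut the tree into 2 labeled clusters

cutree() returns a named vector of cluster labels, one per word, which is exactly the "assign a label to each segment" step described above.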
(Perhaps the next most common similarity metric that works on strings is the Hamming distance. Unlike Levenshtein, the Hamming distance requires the two strings to be the same length, since it counts only position-by-position substitutions, so it won't work for your data.)
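For completeness, a minimal base-R sketch of the Hamming distance; the helper name hamming is made up here, and it simply counts the positions at which two equal-length strings differ:

hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))  # Hamming is only defined for equal lengths
  sum(strsplit(a, '')[[1]] != strsplit(b, '')[[1]])
}
hamming('cat', 'hat')  # 1, same as Levenshtein here
# hamming('cat', 'catwalk') would stop with an error: lengths differ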