Clustering strings in R (is this possible?)

I have a dataset with a column that is currently stored as a factor with 1000+ levels. I would like to clean this data. Some values are strings like "-18 + 5 = -13" and "5 - 18 = -13", and I would like clustering to group those differently from values like "R3no4".

Is this possible in R? I looked at the Natural Language Processing task view ( http://cran.r-project.org/web/views/NaturalLanguageProcessing.html ), but I need a nudge in the right direction.

The dataset is from the KDD Cup 2010. I would like to create meaningful new columns from this column to help build a predictive model. For example, it would be useful to know whether a string contains a specific arithmetic operation, or contains no operations at all and instead describes the problem.

My data frame looks like this:

    > str(data1)
    'data.frame':   809694 obs. of  19 variables:
     $ Row              : int  1 2 3 4 5 6 7 8 9 10 ...
     $ Anon.Student.Id  : Factor w/ 574 levels "02i5jCrfQK","02ZjVTxC34",..: 7 7 7 7 7 7 7 7 7 7 ...
     $ Problem.Hierarchy: Factor w/ 138 levels "Unit CTA1_01, Section CTA1_01-1",..: 80 80 80 80 80 80 80 80 80 80 ...
     $ Problem.Name     : Factor w/ 1084 levels "1PTB02","1PTB03",..: 377 377 378 378 378 378 378 378 378 378 ...
     $ Problem.View     : int  1 1 1 1 2 2 3 3 4 4 ...
     $ Step.Name        : Factor w/ 187539 levels "-(-0.24444444-y) = -0.93333333",..: 116742 177541 104443 64186 58776 58892 153246 153078 45114 163923 ...

The Step.Name column is the most interesting to me, since it contains the largest number of unique factor values.

And here are some sample values of Step.Name:

    [97170] (1+7)/4 = x
    [97171] (1-sqrt(1^2-4*2*-6))/4 = x
    [97172] (1-sqrt(1^2-(-48)))/4 = x
    [97173] (1-sqrt(1-(-48)))/4 = x
    [97174] (1-sqrt(49))/4 = x
    [97175] (1-7)/4 = x
    [97176] x^2+15x+44 = 0
    [97177] a-factor-node
    [97178] b-factor-node
    [97179] c-factor-node
    [97180] num1-factor-node
    [97181] num2-factor-node
    [97182] den1-factor-node
    [97183] (-15±sqrt((-15)^2-4*1*44))/2 = x
    [97184] (-15+sqrt((-15)^2-4*1*44))/2 = x
    [97185] (-15+sqrt((-15)^2-176))/2 = x
    [97186] (-15+sqrt(225-176))/2 = x
    [97187] (-15+sqrt(49))/2 = x
    [97188] (-15+7)/2 = x
    [97189] (-15-sqrt((-15)^2-4*1*44))/2 = x
    [97190] (-15-sqrt((-15)^2-176))/2 = x
    [97191] (-15-sqrt(225-176))/2 = x
    [97192] (-15-sqrt(49))/2 = x
    [97193] (-15-7)/2 = x
    [97194] 2x^2+x = 0
    [97195] a-factor-node
    [97196] b-factor-node
    [97197] c-factor-node
    [97198] num1-factor-node
    [97199] num2-factor-node
    [97200] den1-factor-node
    [97201] (-1±sqrt((-1)^2-4*2*0))/4 = x
    [97202] (-1+sqrt((-1)^2-4*2*0))/4 = x
    [97203] (-1+sqrt((-1)^2-0))/4 = x
    [97204] (-1+sqrt((-1)^2))/4 = x
    [97205] (-1+1)/4 = x
    [97206] (-1-sqrt((-1)^2-4*2*0))/4 = x
    [97207] (-1-sqrt((-1)^2-0))/4 = x
    [97208] (-1-sqrt((-1)^2))/4 = x
    [97209] (-1-1)/4 = x
    [97210] x^2-6x = 0
    [97211] a-factor-node
    [97212] b-factor-node
2 answers

Clustering simply means scoring each instance in the data set according to some similarity metric, sorting the data by that computed score, then slicing it into some number of segments and assigning a label to each.

In other words, you can cluster any data for which you can formulate a meaningful function that computes the similarity of each data point with respect to the others; such a function is commonly called a similarity metric.

There are many such metrics, but only a small fraction of them are useful for comparing strings. Of those, perhaps the most commonly used is the Levenshtein distance (aka edit distance).

This metric is expressed as an integer that increases by one (+1) for each "edit" (an insertion, deletion, or substitution of a single character) needed to transform one string into another. Summing these individual edits gives you the Levenshtein distance.
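As a quick sanity check of that definition, note that base R already ships an edit-distance function, adist() (in the utils package), so you can verify small examples without installing anything:

    # adist() computes the generalized Levenshtein distance in base R
    adist("cat", "hat")      # one substitution (c -> h), so distance 1
    adist("cat", "catwalk")  # four insertions ("walk"), so distance 4

adist() returns a matrix of pairwise distances, which turns out to be convenient for clustering later on.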

The R vwr package includes an implementation:

    > library(vwr)
    > levenshtein.distance('cat', 'hat')
    hat 
      1 
    > levenshtein.distance('cat', 'catwalk')
    catwalk 
          4 
    > levenshtein.distance('catwalk', 'sidewalk')
    sidewalk 
           4 
    > # using a data set supplied with the vwr package
    > EW = english.words
    > ew1 = sample(EW, 20)   # randomly select 20 words from EW
    > # the second argument is a vector of words; returns a vector of distances
    > dx = levenshtein.distance('cat', ew1)
    > dx
        furriers       graves      crooned    cursively       gabled   caparisons   drainpipes 
               8            5            6            8            5            8            9 
      patricians     medially     beholder   chirpiness    fluttered     bobolink   lamentably 
               8            7            8            9            8            8            8 
    depredations      alights    unearthed     thimbles    supersede   dissembler 
              10            6            7            8            9           10 

While Levenshtein distance can be used to cluster your data, whether it should be used for your data is a question I'll leave to you (i.e., the main use case for Levenshtein distance is clearly plain-text data).

(Perhaps the next most common similarity metric that works on strings is the Hamming distance. Unlike Levenshtein, the Hamming distance requires the two strings to be the same length, so it won't work for your data.)
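To make the clustering step concrete, here is a minimal sketch using only base R: adist() builds the pairwise edit-distance matrix and hclust() does hierarchical clustering on it. The sample strings and the choice of k = 3 are purely illustrative; with 187,539 unique levels the full distance matrix would be far too large, so in practice you would sample or deduplicate first.

    # Illustrative only: cluster a handful of step names by edit distance
    steps <- c("-18 + 5 = -13", "5 - 18 = -13", "(1+7)/4 = x",
               "(1-7)/4 = x", "R3no4", "a-factor-node", "b-factor-node")

    d      <- adist(steps)                          # pairwise edit-distance matrix
    hc     <- hclust(as.dist(d), method = "average")# hierarchical clustering
    groups <- cutree(hc, k = 3)                     # cut the tree into 3 clusters
    split(steps, groups)                            # list the strings by cluster

With these inputs the two "-factor-node" strings end up in the same cluster, since they differ by a single character, while they sit far (in edit distance) from the arithmetic strings.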


Maybe:

    > grepl("[[:alpha:]]", c("-18 + 5 = -13", "5 - 18 = -13", "R3no4"))
    [1] FALSE FALSE  TRUE
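Building on that test, here is a hedged sketch of how such regular-expression checks might be turned into new feature columns. The column names (has.letters, has.operator, has.sqrt) are invented for illustration, and the patterns would need tuning (e.g., the hyphens in "a-factor-node" also match the operator pattern):

    step <- c("-18 + 5 = -13", "5 - 18 = -13", "R3no4",
              "(1-sqrt(49))/4 = x", "a-factor-node")

    features <- data.frame(
      step         = step,
      has.letters  = grepl("[[:alpha:]]", step),       # any alphabetic character
      has.operator = grepl("[-+*/^]", step),           # an arithmetic-operator character
      has.sqrt     = grepl("sqrt", step, fixed = TRUE) # mentions a square root
    )
    features

Columns like these could then be fed into a predictive model in place of the raw 187,539-level factor.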

Source: https://habr.com/ru/post/1385813/
