How to find similar sentences / phrases in R?

Question

How to find similar sentences / phrases in R?

For example, I have billions of short phrases, and I want to group the similar ones into clusters.

    strings.to.cluster <- c(
      "Best Toyota dealer in bay area. Drive out with a new car today",
      "Largest Selection of Furniture. Stock updated everyday",
      " Unique selection of Handcrafted Jewelry",
      "Free Shipping for orders above $60. Offer Expires soon",
      "XXXX is where smart men buy anniversary gifts",
      "2012 Camrys on Sale. 0% APR for select customers",
      "Closing Sale on office desks. All Items must go"
    )

Suppose this vector contains hundreds of thousands of lines. Is there a package in R for grouping these phrases by meaning? Or can someone suggest a way to rank phrases by how similar in meaning they are to a given phrase?

+6

r statistics nlp

sgt pepper Jan 26 '12 at 5:35


2 answers

Perhaps this document can help: http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment. It uses R and looks at consumer sentiment toward airlines using Twitter.

+1

aatrujillob Jan 26 '12 at 5:42


Vincent Zoonekynd · Accepted Answer · 2012-01-26T06:19:58+0000

You can view your phrases as "bags of words", i.e. build a matrix (a "document-term" matrix, the transpose of a term-document matrix) with one row per phrase and one column per word, containing 1 if the word appears in the phrase and 0 otherwise. (You can replace the 1 with a weight that takes into account the length of the phrase and the frequency of the word.) You can then apply any clustering algorithm to the rows. The tm package can help you build this matrix.

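    library(tm)
    library(Matrix)
    # Term-document matrix: one row per word, one column per phrase
    x <- TermDocumentMatrix( Corpus( VectorSource( strings.to.cluster ) ) )
    # Convert tm's triplet representation to a sparse Matrix, keeping the full dimensions
    y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dims = dim(x), dimnames = dimnames(x) )
    # Transpose so that phrases are rows, then cluster them hierarchically
    plot( hclust(dist(t(y))) )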
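For the ranking part of the question, the same bag-of-words matrix can be weighted and compared with cosine similarity. The following is only a minimal base-R sketch, not part of the original answer: the helper names (`tokenize`, `rank_similar`), the smoothed inverse-document-frequency weight, and the choice of cosine similarity are all assumptions made for illustration. It uses a shortened version of the example vector.

```r
# Base-R sketch: weighted bag of words plus cosine-similarity ranking.
# tokenize() and rank_similar() are hypothetical helper names.
phrases <- c(
  "Best Toyota dealer in bay area. Drive out with a new car today",
  "Largest Selection of Furniture. Stock updated everyday",
  "2012 Camrys on Sale. 0% APR for select customers",
  "Closing Sale on office desks. All Items must go"
)

# Lower-case, split on anything that is not a letter or digit, drop empties.
tokenize <- function(s) {
  words <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
  unique(words[nzchar(words)])
}

tokens <- lapply(phrases, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# Binary matrix: one row per phrase, one column per word.
m <- t(vapply(tokens, function(w) as.numeric(vocab %in% w),
              numeric(length(vocab))))
colnames(m) <- vocab

# Assumed weighting: smoothed inverse document frequency,
# so words shared by many phrases count for less.
idf <- log(1 + nrow(m) / colSums(m))
w   <- sweep(m, 2, idf, `*`)

# Cosine similarity between every pair of phrases.
norms   <- sqrt(rowSums(w^2))
cos_sim <- tcrossprod(w) / (norms %o% norms)

# Indices of the other phrases, most similar to phrase i first.
rank_similar <- function(i) {
  others <- setdiff(seq_len(nrow(cos_sim)), i)
  others[order(cos_sim[i, others], decreasing = TRUE)]
}

phrases[rank_similar(3)[1]]
```

Note that this measures shared vocabulary rather than meaning: the Toyota and Camrys phrases have no word in common, so their similarity here is zero. The dense matrix is also only practical for small inputs; for hundreds of thousands of phrases a sparse representation such as the one built above would be needed.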
