How to find similar sentences / phrases in R?

For example, I have billions of short phrases, and I want to cluster the ones that are similar.

    strings.to.cluster <- c(
      "Best Toyota dealer in bay area. Drive out with a new car today",
      "Largest Selection of Furniture. Stock updated everyday",
      " Unique selection of Handcrafted Jewelry",
      "Free Shipping for orders above $60. Offer Expires soon",
      "XXXX is where smart men buy anniversary gifts",
      "2012 Camrys on Sale. 0% APR for select customers",
      "Closing Sale on office desks. All Items must go"
    )

Suppose this vector contains hundreds of thousands of entries. Is there a package in R for clustering these phrases by meaning? Or could someone suggest a way to rank phrases by how "similar" in meaning they are to a given phrase?

2 answers

You can treat your phrases as "bags of words", i.e. build a document-term matrix with one row per phrase and one column per word, containing 1 if the word appears in the phrase and 0 otherwise. (You can replace the 1 with a weight that takes into account the length of the phrase and the frequency of the word.) Then you can apply any clustering algorithm. The tm package can help you build this matrix.

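 library(tm)
 library(Matrix)
 # Terms are rows and phrases are columns here
 x <- TermDocumentMatrix(Corpus(VectorSource(strings.to.cluster)))
 y <- sparseMatrix(i = x$i, j = x$j, x = x$v, dims = dim(x), dimnames = dimnames(x))
 # Transpose so that each phrase is a row, then cluster hierarchically
 plot(hclust(dist(t(y))))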
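
To make the weighting idea and the "rank similar phrases" part of the question more concrete, here is a minimal base-R sketch. It is only an illustration under several assumptions: the tokenizer is deliberately crude, the smoothed inverse-document-frequency weight is one possible choice of weight, and cosine similarity is swapped in for the Euclidean `dist()` used above. The names `phrases`, `bow`, `idf` and `rank_similar` are made up for this example. It also builds a dense phrase-by-phrase similarity matrix, so it would not scale as written to hundreds of thousands of phrases.

```r
# Toy subset of the phrases from the question.
phrases <- c(
  "Best Toyota dealer in bay area. Drive out with a new car today",
  "Largest Selection of Furniture. Stock updated everyday",
  "2012 Camrys on Sale. 0% APR for select customers",
  "Closing Sale on office desks. All Items must go"
)

# Crude tokenizer (an assumption): lowercase, split on anything
# that is not a letter or a digit, keep each word once per phrase.
tokens <- lapply(
  strsplit(tolower(phrases), "[^a-z0-9]+"),
  function(w) unique(w[nzchar(w)])
)
vocab <- sort(unique(unlist(tokens)))

# Binary bag-of-words matrix: one row per phrase, one column per word.
bow <- t(vapply(
  tokens,
  function(w) as.numeric(vocab %in% w),
  numeric(length(vocab))
))
colnames(bow) <- vocab

# Down-weight words that occur in many phrases. The "1 +" smoothing is
# an assumption that keeps every weight strictly positive.
idf <- log(1 + nrow(bow) / colSums(bow))
weighted <- sweep(bow, 2, idf, `*`)

# Scale each row to unit length, so that the cross product of rows
# is the cosine similarity between phrases.
unit <- weighted / sqrt(rowSums(weighted^2))
cosine <- tcrossprod(unit)

# Hypothetical helper: indices of the other phrases, most similar first.
rank_similar <- function(i) {
  s <- cosine[i, ]
  s[i] <- NA                      # exclude the phrase itself
  order(s, decreasing = TRUE, na.last = NA)
}

# Hierarchical clustering on cosine distance, cut into 3 groups.
hc <- hclust(as.dist(1 - cosine), method = "average")
groups <- cutree(hc, k = 3)

print(phrases[rank_similar(3)])
print(groups)
```

In this toy data the two "Sale" phrases are the only pair that shares any words, so they are the only pair with a non-zero similarity. For real data you would probably also want to remove stop words and stem the words, which tm can do, before building the matrix.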

Perhaps this article can help: http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment (it uses R to analyze consumer sentiment toward airlines using Twitter data).


Source: https://habr.com/ru/post/906910/

