Using Cosine Similarity in a String Vector to Filter Similar Strings

I have a character vector. Some of its elements (possibly more than two) are similar to each other in terms of the words they contain. I want to filter out strings whose cosine similarity with any other string in the vector is more than 30%. Of two similar strings, I want to keep the one with more words. That is, the result should contain only strings that have less than 30% similarity with any other string in the original vector. My goal is to filter out similar strings so that only approximately unique strings remain.

Example vector:

x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")

The result should give (provided the similarity is less than 30%):

c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")

The above result is not verified.

Cosine code I use:

library(tm)
library(lsa)

CSString_vector <- c("String One", "String Two")
corp <- VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
# cosine similarity between the term-frequency vectors of the two strings
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])

I work in RStudio.


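The approach below works in two steps:

  • compute the pairwise cosine similarities between all strings and binarize them with a threshold
  • treat the binarized matrix as an adjacency matrix and find its maximal cliques (groups of mutually similar strings), using igraph

NB: the threshold used below is 0.4 rather than 0.3, so that the example works.

The first step uses tidyverse together with tm and lsa.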

library(tm)
library(lsa)
library(tidyverse)

get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise similarity
  sim <- NULL
  for (i in seq_len(nrow(tfm))) {
    sim_i <- apply(
      X = tfm, 
      MARGIN = 1, 
      FUN = lsa::cosine, 
      tfm[i,])
    sim <- rbind(sim, sim_i)
  }
  # set identity diagonal to zero
  diag(sim) <- 0
  # label and return
  rownames(sim) <- corpus
  return(sim)
}

# example corpus
strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food")

# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold to make your example work)
sim <- sim > .4  
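
To make the entries of `sim` concrete, here is a base-R sketch that computes the cosine similarity for a single pair of strings by hand. The helper name `cos_sim` and the simple tokenization (lower-case, split on spaces) are assumptions made for illustration; they are not part of tm or lsa, and the tm pipeline above may tokenize slightly differently.

```r
# Cosine similarity of two numeric vectors: dot product divided by the product of norms.
# `cos_sim` is a hypothetical helper for illustration, not a tm or lsa function.
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

s1 <- "a good man is rare"
s2 <- "rare are man who can be trusted"

# Tokenize (assumption: lower-case, split on single spaces)
tok <- strsplit(tolower(c(s1, s2)), " ", fixed = TRUE)

# Count each vocabulary word per string: one column per string
vocab <- unique(unlist(tok))
counts <- sapply(tok, function(w) table(factor(w, levels = vocab)))

similarity <- cos_sim(counts[, 1], counts[, 2])
similarity
```

The two strings share two words ("man" and "rare") out of 5 and 7 respectively, so the value is 2 / sqrt(35), roughly 0.34. That is above the 0.3 threshold from the question but below the 0.4 used here.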

Next, build a graph from the binarized matrix and find its maximal cliques with igraph (for background, see Chalermsook and Chuzhoy: Maximum Independent Rectangles).

library(igraph)

# create graph from adjacency matrix
cliques <- sim %>% 
  tibble::as_tibble() %>%
  mutate(from = row_number()) %>% 
  gather(key = 'to', value = 'edge', -from) %>% 
  filter(edge) %>%
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()

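Finally, for the vertices of each clique, keep the longest string. Caveat: strings that are not similar to any other string have no edges, so igraph does not place them in any clique; they are collected separately and added back below.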
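
The next block calls `find_longest`, whose definition does not appear in this text. Here is a minimal sketch, assuming it receives the indices of one clique plus the full string vector and returns the string with the most characters. That is a guess, though it agrees with the output shown further down (for example, "Alex can be trusted with anything" is kept over "Rare are man who can be trusted", which has more words but fewer characters).

```r
# Hypothetical helper (its definition is not in the original text): given the
# indices of one clique and the full vector of strings, return the longest
# string measured by number of characters.
find_longest <- function(indices, strings) {
  candidates <- strings[indices]
  candidates[which.max(nchar(candidates))]
}

strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food")

# Strings 3 and 5 form one clique in the example
longest <- find_longest(c(3, 5), strings)
longest
```

If "longest" should mean the most words, as the question asks, `nchar(candidates)` could be swapped for a word count such as `lengths(strsplit(candidates, " "))`.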

# get the string indices per vertex clique first
string_cliques_index <- cliques %>% 
  unlist %>%
  names %>%
  as.numeric
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric
# get a list with all indices
all_distinct <- cliques %>% 
  lapply(names) %>% 
  lapply(as.numeric) %>%
  c(string_uniques_index)
# get a list of distinct strings
lapply(all_distinct, find_longest, strings)  

Here is a larger example corpus:

strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")

The binarized similarity matrix for this corpus:

Dan is a good man and very smart                      FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare                                     TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything                     FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food                           FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted                       FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food                                     FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation                     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Hurricanes are a tragedy                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat                         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
I like to eat tasty food                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
The thief was caught by the FBI                       FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Which strings are included and which are omitted:

# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat

# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI

The code above returns the following list:

[[1]]
[1] "The FBI organisation is part of the government of USA"

[[2]]
[1] "Dan is a good man and very smart"

[[3]]
[1] "Alex can be trusted with anything"

[[4]]
[1] "Dan likes to share his food"

[[5]]
[1] "Mangoes are very tasty to eat "

[[6]]
[1] "NASA is a government organisation"

[[7]]
[1] "Hurricanes are a tragedy"
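
As a cross-check that does not depend on igraph, here is a simple greedy filter in base R. It is a different technique from the clique approach above, not a reimplementation of it: strings are visited from longest to shortest, and a string is kept only if it is not similar to any string already kept. The function name `greedy_filter` and the hand-written adjacency matrix (the similar pairs among the first six strings of the matrix above) are assumptions for illustration.

```r
# Hypothetical greedy alternative to the clique approach.
# strings: character vector; adj: logical matrix, adj[i, j] is TRUE when
# strings i and j are considered similar.
greedy_filter <- function(strings, adj) {
  ord <- order(nchar(strings), decreasing = TRUE)  # longest strings first
  kept <- integer(0)
  for (i in ord) {
    # keep string i only if it is not similar to anything kept so far
    if (!any(adj[i, kept])) kept <- c(kept, i)
  }
  strings[sort(kept)]  # return in original order
}

strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food")

# Similar pairs at threshold 0.4, entered by hand: (1,2), (3,5), (4,6)
adj <- matrix(FALSE, 6, 6)
pairs <- rbind(c(1, 2), c(3, 5), c(4, 6))
adj[pairs] <- TRUE
adj[pairs[, 2:1]] <- TRUE  # make the matrix symmetric

kept <- greedy_filter(strings, adj)
kept
```

On this small example the greedy filter keeps strings 1, 3, and 4, the same as the clique approach. The two methods can differ when similarities form chains (a similar to b, b similar to c, but a not similar to c).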



Source: https://habr.com/ru/post/1696305/

