How to remove duplicate terms

Question

My problem is this:

I have a string of terms, sorted by their value and separated by commas:

text = "light, device emitting, light emitting, optical, light emitting, diode, electrode, photodetector, semiconductor, device emitter, device photodetector, resin, seal, device light, semiconductor device, light emitting device, compact light emitting device, compact light emitting device , compact device for lighting a light-emitting device, LED diode of a device, device for a photocell of a device, tightness of a device, emitting type, emitting light t, emitting light-emitting light-emitting light, light emission, sealing of the light device, optical transmitter, package assembly, photocell device, photosensitive, semiconductor electrode device, semiconductor photocell device, transmitting, transmitter, type of light,type of light emitting, light emitting diode "

The terms in the variable text can be split with the strsplit function or with the str_split function of the stringr package.

library(stringr)
str_split = strsplit(text[1], ", ")

As we can see, the object str_split consists of 40 split terms.
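
Note that the string is not perfectly regular: some terms carry a stray space before the comma, and one comma has no space after it. A more tolerant split could look roughly like the sketch below. It uses a shortened, made-up sample in the same format; sample_text and terms are illustrative names, not part of the original post.

```r
# Shortened, hypothetical sample with the same irregularities as `text`
sample_text <- "light, device emitting,light emitting , optical"

# Split on the comma alone, then strip surrounding whitespace from each term
terms <- trimws(strsplit(sample_text, ",")[[1]])

terms
length(terms)
```

Splitting on the bare comma and trimming afterwards avoids terms that differ only by a leading or trailing space.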

Now I would like to extract the first 10 non-duplicate terms.

Let pocket = {light, device emitting, emitting light, optical, light, diode, electrode, photodetector, semiconductor}

In the 1st iteration: light, device emitting, emitting light, optical, light, diode, electrode, photodetector, semiconductor.

The term “light” is a subset of “light emitting”, so we remove the term “light” from the pocket and add the 11th term of the variable text in its place.

Iterations 2 to 6 proceed in the same way: in each one, a term of the pocket that is a subset of another term is removed, and the next term of text (the 12th, 13th, 14th, 15th and 16th, respectively) is added to the pocket.

How can this be done in R?
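
For reference, the procedure described above can be sketched as follows. This is only one reading of it, under two assumptions: "subset" is taken to mean that every word of one term also occurs in the other term, and exact duplicates are dropped before the pocket is filled. The names is_subset, fill_pocket and demo are hypothetical.

```r
# Assumed meaning of "subset": every word of `a` also occurs in `b`
is_subset <- function(a, b) {
  words_a <- strsplit(a, " ", fixed = TRUE)[[1]]
  words_b <- strsplit(b, " ", fixed = TRUE)[[1]]
  all(words_a %in% words_b)
}

# Hypothetical helper: keep a pocket of up to `n` terms and refill it from
# the remaining terms whenever a pocket term is a subset of another one
fill_pocket <- function(terms, n = 10) {
  terms  <- unique(terms)                  # exact duplicates count only once
  pocket <- terms[seq_along(terms) <= n]
  rest   <- terms[seq_along(terms) > n]
  repeat {
    drop <- NA
    for (i in seq_along(pocket)) {
      in_other <- vapply(pocket[-i],
                         function(b) is_subset(pocket[i], b),
                         logical(1))
      if (any(in_other)) { drop <- i; break }
    }
    if (is.na(drop)) break                 # no pocket term is a subset of another
    pocket <- pocket[-drop]
    if (length(rest) > 0) {                # refill from the remaining terms
      pocket <- c(pocket, rest[1])
      rest   <- rest[-1]
    }
  }
  pocket
}

# Small made-up example with a pocket of size 3
demo <- c("light", "device emitting", "light emitting", "optical", "diode")
fill_pocket(demo, n = 3)
```

In the small example, "light" is dropped because it is contained in "light emitting", and "optical" takes its place. The loop always stops, since every pass removes one term from the pocket and takes at most one from the remaining terms.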

algorithm r

vvkid · Jan 11 '11 at 13:26

Joris Meys · Answer 1 · 2011-01-11T14:13:14+0000

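You can use grepl for this. Be aware that it matches substrings, not whole words: "light" also matches "lightemitting". To avoid that, the function below appends a space to each term before matching.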

Remove <- function(x){
    tmp <- paste(x,"")
    id <- colSums(sapply(tmp,grepl,tmp))==1
    x[id]
}

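# The string is wrapped over several lines for readability
Txt <- "light, device, emitting, light emitting, optical, lightemitting, diode, 
        electrode, photocoupler, semiconductor, device light emitting, 
        device photocoupler, resin, sealing, device light, semiconductor device,
        lightemitting device, device electrode, compact lightemitting"

# Split on a comma followed by any whitespace, so that the line breaks and
# indentation inside the string do not end up in the terms
Txt_split <- unlist(strsplit(Txt[1], ",\\s*"))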

> Remove(Txt_split)
 [1] "optical"               "diode"                 "device light emitting" "device photocoupler"
 [5] "resin"                 "sealing"               "semiconductor device"  "lightemitting device"
 [9] "device electrode"      "compact lightemitting"
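
To see why the column sums pick out the right terms, it may help to look at the matrix that sapply builds for a tiny example. The sketch below is illustrative only: x, tmp and m are made-up names, and fixed = TRUE is an addition here so that the terms are matched literally rather than as regular expressions.

```r
x   <- c("light", "light emitting", "lightemitting")
tmp <- paste(x, "")   # "light ", "light emitting ", "lightemitting "

# One column per pattern: m[i, j] is TRUE if term i contains term j
m <- sapply(tmp, grepl, tmp, fixed = TRUE)

colSums(m)            # in how many terms each pattern occurs
x[colSums(m) == 1]    # terms that occur only in themselves
```

Here "light " is found in "light emitting " but not in "lightemitting ", so only "light" is dropped. The trailing space guards the end of a term only: a term such as "emitting " would still be found at the end of "lightemitting ".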

Richie Cotton · Answer 2 · 2011-01-11T13:43:59+0000

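One option: go through the terms in order and keep a term only if it does not contain any of the terms that have already been kept.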

text <- "light, device, emitting, light emitting, optical, lightemitting, diode, electrode, photocoupler, semiconductor, device light emitting, device photocoupler, resin, sealing, device light, semiconductor device, lightemitting device, device electrode, compact lightemitting"

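library(stringr)  # provides str_split() and str_detect()

vars <- str_split(text, ", ")[[1]]

# Start with a placeholder so that `matches` is never empty
matches <- "__something_not_in_your_list_"
for(i in seq_along(vars))
{
  # Keep vars[i] only if it contains none of the terms kept so far
  if(!any(str_detect(vars[i], matches))) matches <- c(matches, vars[i])
}
matches[-1]  # drop the placeholder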
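
The same idea can be written in base R without the placeholder entry. The following is a rough sketch, not a drop-in replacement: keep_first is a hypothetical name, and fixed = TRUE is an assumption that the terms should be matched literally.

```r
# Hypothetical helper: keep each term unless it contains an already kept term
keep_first <- function(vars) {
  kept <- character(0)
  for (v in vars) {
    # Does `v` contain any of the terms kept so far (literal match)?
    contains_kept <- vapply(kept,
                            function(k) grepl(k, v, fixed = TRUE),
                            logical(1))
    if (!any(contains_kept)) kept <- c(kept, v)
  }
  kept
}

keep_first(c("light", "device", "light emitting", "optical", "lightemitting"))
```

As with the loop above, the result depends on the order of the terms: a short term that comes first blocks every later term that contains it, including partial-word matches such as "lightemitting".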

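Note that str_detect treats each pattern as a regular expression, so terms containing special characters would need to be wrapped in fixed().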

Alternatively, if you want the unique individual words, split each term on spaces:

vars <- str_split(text, ", ")[[1]]
all_words <- unlist(str_split(vars, " "))
unique(all_words)
