Extract the difference ("relative complement") between two strings in R

I cannot find a way to do this ...

    raw_string <- "\"+001\", la bonne surprise de M. Jenn M. Ayache http://goo.gl/3EXxy6 via @MYTF1News"
    clean_string <- "+001, la bonne surprise de Jenn Ayache"
    desired_string <- "\"\"MM http://goo.gl/3EXxy6 via @MYTF1News"

I am not sure what to call this transformation. I would say "difference" (as in set theory, alongside "union" and "intersection"). A better name might be "relative complement" (http://en.wikipedia.org/wiki/Complement_(set_theory)#Relative_complement).

My desired string contains all and only the characters that are not in clean_string, in their original order, once for each time they appear, including spaces, punctuation, and everything else.

The best I managed to do is not good enough:

    > a <- paste(Reduce(setdiff, strsplit(c(raw_string, clean_string), split = " ")), collapse = " ")
    > a
    [1] "\"+001\", M. http://goo.gl/3EXxy6 via @MYTF1News"
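One likely reason the word-level attempt falls short is that setdiff compares whole tokens and returns each distinct token only once, so repeated pieces such as "M." collapse into a single entry and characters inside a token are never compared. A small illustration on a shortened input (the variable names raw_words and clean_words are made up for this sketch):

```r
# Shortened, illustrative inputs (not the full strings from the question)
raw_words   <- strsplit("M. Jenn M. Ayache", " ")[[1]]
clean_words <- strsplit("Jenn Ayache", " ")[[1]]

# setdiff() treats its arguments as sets, so duplicates are discarded
setdiff(raw_words, clean_words)
# [1] "M."
```

"M." appears twice in the input but only once in the result, which is why a character-level, order-aware approach is needed.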
3 answers

I do not know whether any of the string processing packages already implements a function for this (I have not come across one). Here is an implementation that (I think) works:

    raw_string <- "\"+001\", la bonne surprise de M. Jenn M. Ayache http://goo.gl/3EXxy6 via @MYTF1News"
    clean_string <- "+001, la bonne surprise de Jenn Ayache"

    raw <- strsplit(raw_string, "")[[1]]
    clean <- strsplit(clean_string, "")[[1]]
    dif <- vector("list")
    j <- 1
    while (length(clean) > 0) {
      i <- match(clean[1], raw)
      if (i > 1) {
        dif[[j]] <- raw[seq_len(i - 1)]
        j <- j + 1
      }
      clean <- clean[-1]
      raw <- raw[-seq_len(i)]
    }
    dif[[j]] <- raw
    paste(unlist(dif), collapse = "")
    #[1] "\"\"MM http://goo.gl/3EXxy6 via @MYTF1News"

I would use a loop too:

    x <- strsplit(raw_string, "")[[1]]
    y <- strsplit(clean_string, "")[[1]]
    res <- character(length(x))
    j <- 1
    for (i in seq_along(x)) {
      if (j > length(y)) {
        res[i:length(x)] <- x[i:length(x)]
        break
      }
      if (x[i] != y[j]) {
        res[i] <- x[i]
      } else {
        j <- j + 1
      }
    }
    paste(res, collapse = "")
    #[1] "\"\"MM http://goo.gl/3EXxy6 via @MYTF1News"

Note the extra space compared with the expected result. I think you just missed it.

If it is too slow, it should be easy to implement using Rcpp.
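Before reaching for Rcpp, the same two-pointer idea can be wrapped in a plain R function. The following is only a sketch: the name relative_complement is hypothetical, and it assumes that clean is meant to be a subsequence of raw (it warns when that does not hold).

```r
# Hypothetical helper (name not from the answers above): returns the
# characters of `raw` left over after matching `clean` against it in order.
relative_complement <- function(raw, clean) {
  x <- strsplit(raw, "")[[1]]
  y <- strsplit(clean, "")[[1]]
  keep <- logical(length(x))   # TRUE marks characters that go into the result
  j <- 1
  for (i in seq_along(x)) {
    if (j <= length(y) && x[i] == y[j]) {
      j <- j + 1               # matched: consume one character of `clean`
    } else {
      keep[i] <- TRUE          # unmatched: part of the difference
    }
  }
  # Assumption of this sketch: leftover characters in `clean` mean bad input
  if (j <= length(y)) warning("`clean` is not a subsequence of `raw`")
  paste(x[keep], collapse = "")
}

relative_complement("a-b-c!", "abc")
# [1] "--!"
```

Using a logical mask instead of a pre-filled character vector keeps the loop body to a single comparison per character.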


Here is a slightly more concise way using sub. It works character by character, and punctuation characters have to be escaped so that they are not interpreted as regex syntax.

    str_relative_complement <- function(raw_string, clean_string){
      words <- strsplit(clean_string, "")[[1]]
      cur_str <- raw_string
      for(i in words){
        cur_str <- sub(ifelse(grepl("[[:punct:]]", i), paste0("\\", i), i), "", cur_str)
      }
      return(cur_str)
    }

    raw_string <- '\"+001\", la bonne surprise de M. Jenn M. Ayache http://goo.gl/3EXxy6 via @MYTF1News'
    clean_string <- "+001, la bonne surprise de Jenn Ayache"
    str_relative_complement(raw_string, clean_string)
    #[1] "\"\"MM http://goo.gl/3EXxy6 via @MYTF1News"
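The escaping step can probably be avoided by passing fixed = TRUE to sub, which makes it treat the pattern as a literal string rather than a regular expression. A minimal sketch, with the hypothetical name str_relative_complement_fixed:

```r
# Hypothetical variant (name is an assumption): same idea as above, but
# fixed = TRUE makes sub() match each character literally, so no escaping.
str_relative_complement_fixed <- function(raw_string, clean_string) {
  cur_str <- raw_string
  for (ch in strsplit(clean_string, "")[[1]]) {
    # Remove the first remaining occurrence of this character
    cur_str <- sub(ch, "", cur_str, fixed = TRUE)
  }
  cur_str
}

str_relative_complement_fixed("(a+b)*c", "a+b")
# [1] "()*c"
```

Note that each call to sub removes the first remaining occurrence of the character anywhere in the string, so this approach does not by itself enforce that the characters of clean_string are matched in order.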

Source: https://habr.com/ru/post/985354/

