Removing similar but longer duplicates from a vector

To clean the database, I have a vector of, say, dishes, and I want to delete all variants of the “base” dish, keeping only the base dish. For example, if I have ...

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

... I want to delete every entry that already has a shorter matching version in the vector. The resulting vector should therefore contain only: "DAL BHAT", "HAMBURGER", "PIZZA".

Using a nested for loop and checking every entry against every other entry would work for this example, but it would take a long time on the large dataset at hand and, I would say, makes for ugly code.

It can be assumed that all entries are in uppercase and that the vector is already sorted. It cannot be assumed that the first element of the next base dish is always shorter than the previous one.

Any suggestions for solving this efficiently?

BONUS QUESTION: Ideally, I only want to remove elements from the initial vector if they are more than 3 characters longer than their shorter counterpart. In the above case, this would mean that "HAMBURGER2" would also be kept in the resulting vector.

+4
4 answers

Here's the approach I would take. I would create a function that covers the conditions I need to consider and apply it to the input. I added comments to explain what happens in the function.

The function has 4 arguments:

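  • invec: the input character vector.
  • thresh: how many leading characters are used to identify a "base" dish. Default = 5.
  • minlen: how many extra characters an entry may have compared with its base dish and still be kept (the "BONUS" question). Default = 3.
  • strict: logical. If FALSE, thresh is reduced to the length of the shortest entry whenever that entry has fewer than thresh characters. Default = FALSE. See the last example for the effect of strict.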

myfun <- function(invec, thresh = 5, minlen = 3, strict = FALSE) {
  # Bookkeeping -- sort, unique, all upper case
  invec <- sort(unique(toupper(invec)))
  # More bookkeeping -- `thresh` should not be longer
  # than the shortest dish unless strict = TRUE
  thresh <- if (isTRUE(strict)) thresh else min(min(nchar(invec)), thresh)
  # Use `thresh` to get the `stubs`
  stubs <- invec[!duplicated(substr(invec, 1, thresh))]
  # loop through the stubs and do two things:
  #   - Match the dish with the stub
  #   - Return the base dish and any dishes within the minlen
  unlist(
    lapply(stubs, function(x) {
      temp <- grep(x, invec, value = TRUE, fixed = TRUE)
      temp[temp == x | nchar(temp) <= nchar(x) + minlen]
      }), 
    use.names = FALSE)
}
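The stub step inside the function can be looked at in isolation. The following is only a small sketch on a shortened, already sorted vector; the names x and prefixes are made up for the illustration and are not part of the function above.

```r
# Shortened, already sorted example vector (illustrative only)
x <- c("DAL BHAT", "DAL BHAT-(SPICY)", "HAMBURGER", "HAMBURGER2",
       "PIZZA", "PIZZA_BOLOGNESE")

# First 5 characters of every entry (5 is the default `thresh`)
prefixes <- substr(x, 1, 5)

# Keep the first entry seen for each distinct prefix
stubs <- x[!duplicated(prefixes)]
stubs
# [1] "DAL BHAT"  "HAMBURGER" "PIZZA"
```

Because the vector is sorted, the first entry for each prefix is the shortest one that starts with it, which is why it can serve as the base dish.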

Sample data:

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")    

Sample runs:

myfun(dishes, minlen = 0)
# [1] "DAL BHAT"  "HAMBURGER" "PIZZA" 

myfun(dishes)
# [1] "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA" 

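Here is some more sample data. In "dishes2" there is an additional entry, "DAL", and "dishes3" also contains a lowercase entry, "pizza!!".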

dishes2 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL")

dishes3 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL", "pizza!!")

Results:

myfun(dishes2, 4)
# [1] "DAL"        "HAMBURGER"  "HAMBURGER2" "PIZZA"   

myfun(dishes3)
# [1] "DAL"        "HAMBURGER"  "HAMBURGER2" "PIZZA"      "PIZZA!!"  

myfun(dishes3, strict = TRUE)
# [1] "DAL"        "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA"      "PIZZA!!"  
+5

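The OP asked for an efficient solution that avoids nested loops. In addition, the OP only wants to remove entries that are more than 3 characters longer than their shorter counterpart.

Checking every entry against every other entry requires n x (n-1) comparisons.

A non-equi join in data.table reduces that number. Each dish is joined only to entries that are more than 3 characters longer, and only those candidates are tested with grepl().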

library(data.table)
# prepare data
DT <- data.table(dish = dishes)[, len := nchar(dish)][order(len)]
DT
                        dish len
 1:                      NAN   3
 2:                    PIZZA   5
 3:                 DAL BHAT   8
 4:                HAMBURGER   9
 5:               HAMBURGER2  10
 6:            HAMBURGER-BIG  13
 7:           SLICE OF PIZZA  14
 8:          PIZZA_BOLOGNESE  15
 9:         DAL BHAT-(SPICY)  16
10:        PIZZA (PROSCIUTO)  17
11: DAL BHAT WITH EXTRA RICE  24
# use non-equi join to find row numbers of "duplicate" entries
tmp <- DT[.(len + 3L, dish), on = .(len > V1), nomatch = 0L, allow = TRUE,
          by = .EACHI, .I[grepl(V2, dish)]]
tmp
   len V1
1:   8  7
2:   8  8
3:   8 10
4:  11  9
5:  11 11
6:  12  6
# anti-join to remove "duplicates"
DT[!tmp$V1, dish]
[1] "NAN"        "PIZZA"      "DAL BHAT"   "HAMBURGER"  "HAMBURGER2"

A more compact version that does not sort DT and keeps the length difference in a variable:

delta_len <- 3L
DT <- data.table(dish = dishes)[, len := nchar(dish)]
DT[!DT[.(len + delta_len, dish), on = .(len > V1), nomatch = 0L, allow = TRUE,
       by = .EACHI, .I[grepl(V2, dish)]]$V1, dish]
[1] "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA"      "NAN"
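As a cross-check of the rule applied by the join, here is a base R sketch of the same idea. It is not the data.table method itself: it still compares pairs directly, so it is meant for verifying results on small inputs rather than for speed. It also matches literally with fixed = TRUE, which is an assumption that differs from the regex matching used above. The names len and is_variant are made up for this sketch.

```r
# Extended sample data used in this answer
dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "NAN", "SLICE OF PIZZA")

delta_len <- 3L
len <- nchar(dishes)

# An entry is a variant if some entry that is more than `delta_len`
# characters shorter occurs inside it as a literal substring
is_variant <- vapply(seq_along(dishes), function(i) {
  shorter <- dishes[len + delta_len < len[i]]
  any(vapply(shorter, function(s) grepl(s, dishes[i], fixed = TRUE),
             logical(1)))
}, logical(1))

dishes[!is_variant]
# [1] "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA"      "NAN"
```

"SLICE OF PIZZA" is dropped because "PIZZA" occurs inside it, even though it does not start with "PIZZA"; this is the same behaviour as the grepl() call in the join.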

Note that dishes was extended with two additional entries ("NAN" and "SLICE OF PIZZA") for the examples above:

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "NAN", "SLICE OF PIZZA")


+2

A base R approach using sapply, grepl and colSums:

dishes[colSums(sapply(dishes, function(x) grepl(x, setdiff(dishes, x)))) > 0]

which gives:

[1] "DAL BHAT"  "HAMBURGER" "PIZZA"

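What this does:

  • sapply(dishes, function(x) grepl(x, setdiff(dishes, x))) compares each dish with all the other dishes using grepl, which checks whether the dish name occurs inside any of the other names.
  • This returns a logical matrix in which TRUE marks a dish name (column) that is part of another dish name (row):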

         DAL BHAT DAL BHAT-(SPICY) DAL BHAT WITH EXTRA RICE HAMBURGER HAMBURGER-BIG HAMBURGER2 PIZZA PIZZA (PROSCIUTO) PIZZA_BOLOGNESE
    [1,]     TRUE            FALSE                    FALSE     FALSE         FALSE      FALSE FALSE             FALSE           FALSE
    [2,]     TRUE            FALSE                    FALSE     FALSE         FALSE      FALSE FALSE             FALSE           FALSE
    [3,]    FALSE            FALSE                    FALSE     FALSE         FALSE      FALSE FALSE             FALSE           FALSE
    [4,]    FALSE            FALSE                    FALSE      TRUE         FALSE      FALSE FALSE             FALSE           FALSE
    [5,]    FALSE            FALSE                    FALSE      TRUE         FALSE      FALSE FALSE             FALSE           FALSE
    [6,]    FALSE            FALSE                    FALSE     FALSE         FALSE      FALSE FALSE             FALSE           FALSE
    [7,]    FALSE            FALSE                    FALSE     FALSE         FALSE      FALSE  TRUE             FALSE           FALSE
    [8,]    FALSE            FALSE                    FALSE     FALSE         FALSE      FALSE  TRUE             FALSE           FALSE
    
  • Taking colSums of that matrix gives, for each dish, the number of other names that contain it:

    DAL BHAT   DAL BHAT-(SPICY) DAL BHAT WITH EXTRA RICE  HAMBURGER  HAMBURGER-BIG    HAMBURGER2    PIZZA    PIZZA (PROSCIUTO)   PIZZA_BOLOGNESE 
           2                  0                        0          2              0             0        2                    0                 0 
    
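  • Finally, only the dishes with a count greater than zero are selected. These are the base dishes, because each of them occurs inside at least one other name.

  • As an alternative to > 0, you can also put a double negation (!!) before colSums. It likewise selects the elements with a nonzero count: dishes[!!colSums(sapply(dishes, function(x) grepl(x, setdiff(dishes, x))))].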
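One thing to keep in mind with the steps above is that grepl treats its first argument as a regular expression by default, so characters such as parentheses in a dish name are not matched literally. The sketch below shows the difference and a variant of the same sapply/colSums idea with fixed = TRUE. The names contained_in_other and base_dishes are made up for the illustration.

```r
dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

# As a regex, the parentheses form a group, so the name does not match itself
grepl("PIZZA (PROSCIUTO)", "PIZZA (PROSCIUTO)")
# [1] FALSE

# With fixed = TRUE the pattern is taken literally
grepl("PIZZA (PROSCIUTO)", "PIZZA (PROSCIUTO)", fixed = TRUE)
# [1] TRUE

# Same approach as above, but with literal matching
contained_in_other <- sapply(dishes, function(x) {
  grepl(x, setdiff(dishes, x), fixed = TRUE)
})
base_dishes <- dishes[colSums(contained_in_other) > 0]
base_dishes
# [1] "DAL BHAT"  "HAMBURGER" "PIZZA"
```

For this sample the result is the same either way; literal matching only matters once a name containing regex metacharacters is itself a base dish.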

If you want to allow for a small difference in characters, you can use agrepl instead of grepl. Its max.distance parameter sets the maximum distance allowed for an approximate match; given as an integer, it is the maximum number of character edits:

dishes[colSums(sapply(dishes, function(x) agrepl(x, setdiff(dishes, x), max.distance = 3))) > 0]

which gives:

[1] "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA"
+2
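# Group the dishes by their first 5 characters; within each group keep the
# entries that are fewer than 3 characters longer than the group's first
# entry (relies on the vector being sorted so the base dish comes first)
unlist(sapply(split(dishes, substr(dishes, 1, 5)), function(x){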
    N = nchar(x)
    x[(N - N[1]) < 3]
}))
#       DAL B       HAMBU1       HAMBU2        PIZZA 
#  "DAL BHAT"  "HAMBURGER" "HAMBURGER2"      "PIZZA"
+1

Source: https://habr.com/ru/post/1691065/

