R: How to choose which row dplyr::distinct() keeps based on the value in another variable?

Real-life problem: I have subjects with MRI scan data. Some of them were scanned several times (one row per scan). Some of them were scanned under a different protocol each time. I want to keep one unique row per subject ID, and if a subject was scanned under two different protocols, I want to prefer one protocol over the other.

Toy example:

library(dplyr)  
df <- tibble(
        id = c("A", "A", "B", "C", "C", "D"), 
        protocol = c("X", "Y", "X", "X", "X", "Y"),
        date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"), 
                 by="days")),
        var = 1:6)

I want to return a data frame with one row per unique id. When an id is duplicated, instead of automatically keeping the first record, I want to keep the record with "Y" as the protocol if there is one, but otherwise not drop the rows with "X".

In this example, the result should keep rows 2, 3, 4, and 6.

I would prefer a dplyr solution, if possible.

Here is what I have tried, without success:

df %>% distinct(id, .keep_all = TRUE) #Nope! 

df %>% distinct(id, protocol == "Y", .keep_all = TRUE) #Nope!  

df$protocol <- factor(df$protocol, levels = c("Y", "X"))
df %>% distinct(id, .keep_all = TRUE) #Nope!  

df %>% group_by(id) %>% filter(protocol == "Y") #Nope!

Edit: this works, thanks to @RobJensen:

df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)  

And another option, from @joran:

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% slice(1)  

Thanks!


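The idea is to sort by a protocol preference (call it protocol_preference) so that, within each id, the row with protocol Y comes first whenever "Y" is available, and then keep the first row per id.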

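For more than two protocols, use the approach from @davechilders and @Nathan Werth: convert protocol to a factor whose levels are listed in order of importance.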

order_of_importance <- c("Y", "Z", "X")

# df2 is the three-protocol example data defined in the last answer below
df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  arrange(id, protocol) %>%
  distinct(id, .keep_all = TRUE)

If you only care about preferring "Y", and want the first record when "Y" is not available:

df %>% 
    arrange(id, desc(protocol == 'Y')) %>% 
    distinct(id, .keep_all = TRUE)
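
A minimal sketch of why this works, using the toy data from the question: `protocol == "Y"` is a logical column, and `desc()` sorts `TRUE` before `FALSE`, so within each id any "Y" row moves to the top before `distinct()` keeps the first row it sees. The intermediate name `sorted` is only for illustration.

```r
library(dplyr)

# Toy data from the question
df <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "X", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)

# Within each id, rows where protocol == "Y" is TRUE sort first;
# ties (e.g. the two "X" rows for C) keep their original order
sorted <- df %>% arrange(id, desc(protocol == "Y"))

# distinct() keeps the first row per id, which is now the preferred one
result <- sorted %>% distinct(id, .keep_all = TRUE)
result$var
```

The `var` column of `result` identifies which original rows survived, so it can be compared against the rows the question asks for.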

There may be better ways (for example with data.table), but here is a dplyr version:

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% do(head(.,1))

As Gregor points out (in the comments), slice(1) is probably the more idiomatic choice here than do(head(.,1)).


You can achieve this without using group_by() if you want the result to be a plain tibble rather than a grouped_df.

df %>% arrange(id, desc(protocol)) %>% distinct(id, .keep_all = TRUE)
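
To make the difference concrete, here is a small sketch comparing the two approaches on the question's toy data. The names `grouped` and `ungrouped` are just for illustration; `is_grouped_df()` is dplyr's own check for a grouped_df.

```r
library(dplyr)

# Toy data from the question
df <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "X", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)

# group_by() + slice(1): the result is still grouped by id
grouped <- df %>%
  group_by(id) %>%
  arrange(desc(protocol), var) %>%
  slice(1)

# arrange() + distinct(): the result is a plain, ungrouped tibble
ungrouped <- df %>%
  arrange(id, desc(protocol)) %>%
  distinct(id, .keep_all = TRUE)

is_grouped_df(grouped)
is_grouped_df(ungrouped)
```

Both pick the same rows here; only the grouping attribute differs. Note that `desc(protocol)` relies on "Y" sorting after "X" alphabetically, so it would not carry over to arbitrary protocol names.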

You can break the process into steps: take the rows with the preferred protocol, take the rows for the remaining ids, and combine the two.

distinct_y <- df %>%
  filter(protocol == "Y") %>%
  distinct(id, .keep_all = TRUE)

distinct_other <- df %>%
  anti_join(distinct_y, "id") %>%
  distinct(id, .keep_all = TRUE)

distinct_combined <- rbind(distinct_y, distinct_other)

If you want to generalize from "one protocol above all others" to a full ordering, I suggest making protocol a factor.

For example, suppose there are three protocols: X, Y, and Z. Y is the best, Z is better than X, and you only need X if there is nothing better.

# Only difference is the best protocol for C will now be Z.
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"),
               by="days")),
  var = 1:6
)

order_of_importance <- c("Y", "Z", "X")

df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  group_by(id) %>%
  arrange(protocol) %>%
  slice(1)
# # A tibble: 4 x 4
# # Groups: id [4]
#   id    protocol date         var
#   <chr> <fctr>   <date>     <int>
# 1 A     Y        2018-01-02     2
# 2 B     X        2018-01-03     3
# 3 C     Z        2018-01-05     5
# 4 D     Y        2018-01-06     6
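
As a variation on the same idea, here is a sketch that skips the factor conversion and ranks each protocol by its position in `order_of_importance` with `match()`. It assumes dplyr 1.0.0 or later for `slice_min()`; the name `result` is only for illustration, and this is one possible alternative rather than the method from the answer above.

```r
library(dplyr)

# Same three-protocol example data as above
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)

order_of_importance <- c("Y", "Z", "X")

# match() gives 1 for "Y", 2 for "Z", 3 for "X"; slice_min() keeps the
# lowest-ranked (most preferred) row per id. Assumes dplyr >= 1.0.0.
# with_ties = FALSE returns a single row even if a protocol repeats.
result <- df2 %>%
  group_by(id) %>%
  slice_min(match(protocol, order_of_importance), n = 1, with_ties = FALSE) %>%
  ungroup()

result
```

This leaves `protocol` as a character column, which may be convenient if the factor is not needed downstream.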

Source: https://habr.com/ru/post/1692549/

