Remove least complete duplicate rows in R or SQL

Question

I have a dataset like this:

id_1 <- c(1, 1, 1)
id_2 <- c(2, NA, NA)
day <- c("Mon", "Mon", "Mon")
month <- c("May", NA, "May")
year <- c("2017", NA, NA)

df <- cbind(id_1, id_2, day, month, year)

These lines are repeated observations in my data. I would like to keep only the most complete line (i.e. line 1). My real data has 15 columns, so using

duplicated(df[, <some combination of columns>])

seems complicated. Is there a function for this? Or some simple answer that I am missing? Answers in R are preferred, but SQL is also an option. Thank you in advance!

EDIT: id_1 and id_2 are both identifiers for an observation. id_1 should definitely be unique in this data, but it is fine for id_2 to be NA or repeated for some lines. In the end, I will merge this data table with another data table using id_2. Therefore, I would like to delete lines that repeat information already captured by a line that includes id_2.

+4

sql r duplicates

Maddie May 26, '17 at 16:58

3 answers

An alternative to eipi10's answer, in base R:

df[apply(df, 1, function(x) length(na.omit(x))) ==
     max(apply(df, 1, function(x) length(na.omit(x)))), ]
 #---------------- 
  id_1   id_2    day  month   year 
   "1"    "2"  "Mon"  "May" "2017"

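To do this within each id_1, you could use either eipi10's group_by approach or lapply(split(df, df$id_1), ...function). @MikeH. mentioned rowSums(!is.na(df)) as a way to count the non-missing values in each row.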
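As a rough sketch of the per-group idea, here is one way to combine split() with the rowSums(!is.na(df)) count in base R. The second id_1 group is hypothetical, added only so the grouping has something to do, and df is assumed to be a data.frame rather than the cbind matrix from the question.

```r
# Question's example as a data.frame, plus a hypothetical second id_1 group
df <- data.frame(
  id_1  = c(1, 1, 1, 2, 2),
  id_2  = c(2, NA, NA, NA, 5),
  day   = c("Mon", "Mon", "Mon", "Tue", "Tue"),
  month = c("May", NA, "May", NA, "Jun"),
  year  = c("2017", NA, NA, NA, "2018"),
  stringsAsFactors = FALSE
)

# Number of non-missing fields in each row
n_complete <- rowSums(!is.na(df))

# For each id_1, take the index of the first row with the highest count
keep <- unlist(lapply(
  split(seq_len(nrow(df)), df$id_1),
  function(rows) rows[which.max(n_complete[rows])]
))

result <- df[keep, ]
```

which.max returns only the first maximum, so ties within an id_1 are resolved by row order.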

+1

42- May 26, '17 at 17:21

If df is a data.frame, we can use Reduce with data.table:

library(data.table)
setDT(df)[, .SD[which.min(Reduce(`+`, lapply(.SD, is.na)))], id_1]
#   id_1 id_2 day month year
#1:    1    2 Mon   May 2017

The data, built as a data.frame rather than with cbind:

df <- data.frame(id_1, id_2, day, month, year, stringsAsFactors=FALSE)
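To unpack the Reduce step, a small base R sketch (no data.table needed) shows that summing the per-column is.na() vectors gives the number of missing values in each row, which is what which.min then ranks. Unlike .SD in the grouped call, this sketch also includes the id_1 column in the count; the variable names na_flags and na_per_row are illustrative only.

```r
# Question's example as a data.frame
df <- data.frame(
  id_1  = c(1, 1, 1),
  id_2  = c(2, NA, NA),
  day   = c("Mon", "Mon", "Mon"),
  month = c("May", NA, "May"),
  year  = c("2017", NA, NA),
  stringsAsFactors = FALSE
)

# One logical vector per column, TRUE where the value is missing
na_flags <- lapply(df, is.na)

# Element-wise sum across columns: missing values per row
na_per_row <- Reduce(`+`, na_flags)

# The same counts computed with rowSums
same_as_rowsums <- all(na_per_row == rowSums(is.na(df)))

# Row with the fewest missing values
most_complete <- df[which.min(na_per_row), ]
```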

0

akrun May 26, '17 at 17:33

eipi10 · Accepted Answer · 2017-05-26T17:08:42+0000

If id_1 is an identifier for each "subject", you can do this:

library(tidyverse)

df %>% 
  group_by(id_1) %>%
  filter(rowSums(is.na(.)) == min(rowSums(is.na(.))))

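Note that filter keeps every row that ties for the fewest missing values. To keep a single row per id_1, as suggested by @docendodiscimus: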

df %>% 
  group_by(id_1) %>%
  slice(which.min(rowSums(is.na(.))))
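One thing to watch, offered as a hedged note rather than something stated in the answer: in a magrittr pipeline, `.` stands for the whole piped data frame, not the current group, so with more than one id_1 value the NA counts may not line up with the rows of each group. A minimal sketch that counts NAs per group instead, assuming dplyr 1.1.0 or later (for pick()) and a hypothetical second id_1 group added to the question's data:

```r
library(dplyr)

# Question's example as a data.frame, plus a hypothetical second id_1 group
df <- data.frame(
  id_1  = c(1, 1, 1, 2, 2),
  id_2  = c(2, NA, NA, NA, 5),
  day   = c("Mon", "Mon", "Mon", "Tue", "Tue"),
  month = c("May", NA, "May", NA, "Jun"),
  year  = c("2017", NA, NA, NA, "2018"),
  stringsAsFactors = FALSE
)

most_complete <- df %>%
  group_by(id_1) %>%
  # pick(everything()) is the current group's non-grouping columns
  slice(which.min(rowSums(is.na(pick(everything()))))) %>%
  ungroup()
```

Swapping slice(which.min(...)) for filter(... == min(...)) with the same pick() expression would keep ties instead of a single row per group.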
