View all column names with any NA in R

Question

I need to get the names of the columns that have at least 1 NA.

df<-data.frame(a=1:3,b=c(NA,8,6), c=c('t',NA,7))

I need to get "b, c".

I found this code:

 sapply(df, function(x) any(is.na(x)))

But I need only variables that have NA.

I tried this:

 sapply(df, function(x) colnames(df[,any(is.na(x))]))

But I get all the column names.

+6

r sapply

Gabylp Sep 28 '14 at 13:32

5 answers

You were very close. Your first attempt gives a boolean vector, which you can use to index names(df):

 contains_any_na = sapply(df, function(x) any(is.na(x)))
 names(df)[contains_any_na]
 # [1] "b" "c"

January 14, 2017, anyNA(): Since R version 3.1.0, anyNA() can be used as an alternative to any(is.na(.)), and the code above can be simplified to:

 names(df)[sapply(df, anyNA)]
 # [1] "b" "c"
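
For repeated use, the same idea can be wrapped in a small helper. This is only a minimal sketch and not part of the answer above: the function name na_columns is hypothetical, and vapply is swapped in for sapply so that the index is always a logical vector, even for a data frame with zero columns.

```r
# Hypothetical helper; the name na_columns is an assumption for illustration
na_columns <- function(data) {
  # vapply enforces exactly one logical value per column
  has_na <- vapply(data, anyNA, logical(1))
  names(data)[has_na]
}

df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))
na_columns(df)
# [1] "b" "c"
```

For a data frame without any NA, the helper returns character(0).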

+7

Paul Hiemstra Sep 28 '14 at 13:37

 names(df)[!!colSums(is.na(df))]
 # [1] "b" "c"

Explanation

 colSums(is.na(df)) # gives the number of missing values in each column
 # a b c
 # 0 1 1

Using ! we create a logical index:

 !colSums(is.na(df)) # here the value `0` becomes `TRUE` and all values `>0` become `FALSE`
 #    a     b     c
 # TRUE FALSE FALSE

But we need to select the columns that have at least one NA, so we negate again with a second !:

 !!colSums(is.na(df))
 #     a    b    c
 # FALSE TRUE TRUE

and use this boolean index to get the column names that have at least one NA.
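
If the double negation reads as too terse, an explicit comparison builds the same index. The sketch below is an illustration rather than part of the original answer; the variable names are made up for the example.

```r
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

na_counts <- colSums(is.na(df))    # number of NAs per column
idx_double_bang <- !!na_counts     # coercion to logical via double negation
idx_explicit    <- na_counts > 0   # the same index, stated explicitly

all(idx_double_bang == idx_explicit)
# [1] TRUE
names(df)[idx_explicit]
# [1] "b" "c"
```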

Benchmarks

 set.seed(49)
 df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
 library(microbenchmark)
 f1 <- function() {
   contains_any_na = sapply(df1, function(x) any(is.na(x)))
   names(df1)[contains_any_na]
 }
 f2 <- function() { colnames(df1)[!complete.cases(t(df1))] }
 f3 <- function() { names(df1)[!!colSums(is.na(df1))] }
 microbenchmark(f1(), f2(), f3(), unit="relative")
 # Unit: relative
 #  expr      min       lq   median       uq      max neval
 #  f1() 1.000000 1.000000 1.000000 1.000000 1.000000   100
 #  f2() 8.921109 7.289053 6.852122 6.210826 4.889684   100
 #  f3() 3.248072 3.105798 2.984453 2.774513 2.599745   100

EDIT: Explanation of the efficiency

Perhaps surprisingly, the sapply-based solution is the winner here because, as noted in @flodel's comment below, the other two solutions create a matrix behind the scenes (through t(df) and is.na(df)).
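
The intermediate matrices can be checked directly on the small example data frame from the question. This is a quick sketch added for illustration, not a benchmark:

```r
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

is.matrix(is.na(df))  # is.na() on a data frame returns a logical matrix
# [1] TRUE
is.matrix(t(df))      # t() on a data frame converts it to a matrix first
# [1] TRUE

# sapply visits the columns one at a time, with no intermediate matrix
sapply(df, function(x) any(is.na(x)))
#     a     b     c
# FALSE  TRUE  TRUE
```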

+4

akrun Sep 28 '14 at 13:32

Try the data.table version:

 library(data.table)
 setDT(df)
 names(df)[df[,sapply(.SD, function(x) any(is.na(x))),]]
 # [1] "b" "c"

Microbenchmarking using @akrun code:

 set.seed(49)
 df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
 setDT(df1)
 f1 <- function() {
   contains_any_na = sapply(df1, function(x) any(is.na(x)))
   names(df1)[contains_any_na]
 }
 f2 <- function() { colnames(df1)[!complete.cases(t(df1))] }
 f3 <- function() { names(df1)[!!colSums(is.na(df1))] }
 f4 <- function() { names(df1)[df1[,sapply(.SD, function(x) any(is.na(x))),]] }
 microbenchmark(f1(), f2(), f3(), f4(), unit="relative")
 # Unit: relative
 #  expr       min        lq    median       uq      max neval
 #  f1()  1.000000  1.000000  1.000000 1.000000 1.000000   100
 #  f2() 10.459124 10.928821 10.955986 9.858967 7.069066   100
 #  f3()  3.323144  3.805183  4.159624 3.775549 2.797329   100
 #  f4() 10.108998 10.242207 10.121022 9.117067 6.576976   100

@agstudy: this solution is similar in speed to colnames(df1)[!complete.cases(t(df1))].

+4

rnso Sep 28 '14 at 14:59

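A simple one-liner for this:

 colnames(df[, sapply(df, function(x) any(is.na(x))), drop = FALSE])

Explanation:

 sapply(df, function(x) any(is.na(x)))

returns TRUE/FALSE for each column, indicating whether it has at least 1 NA. df[, sapply(df, function(x) any(is.na(x))), drop = FALSE] gets the subset of the data containing only the columns with at least 1 NA, and colnames gives the names of these columns. The drop = FALSE argument keeps the result a data frame when only one column matches; without it, the single column is returned as a plain vector and colnames returns NULL.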
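
One caveat with subsetting the data frame before taking colnames: when exactly one column contains an NA, the [ , ] subset without drop = FALSE collapses to a plain vector, which has no column names. The sketch below illustrates this with a made-up two-column data frame, df2, which is not from the question.

```r
df2 <- data.frame(a = 1:3, b = c(NA, 8, 6))  # only column b has an NA
has_na <- sapply(df2, function(x) any(is.na(x)))

colnames(df2[, has_na])                # the single column is dropped to a vector
# NULL
colnames(df2[, has_na, drop = FALSE])  # the subset stays a data frame
# [1] "b"
names(df2)[has_na]                     # indexing the names avoids the issue entirely
# [1] "b"
```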

0

Abhimanu kumar Jan 14 '17 at 19:07

agstudy · Accepted Answer · 2014-09-28T13:38:48+0000

Another acrobatic solution (just for fun):

 colnames(df)[!complete.cases(t(df))]
 # [1] "b" "c"

Idea: Retrieving the columns of A that have at least 1 NA is equivalent to retrieving the rows of t(A) that have at least 1 NA. complete.cases (very efficient, as it is just a call to a C function) by definition identifies the rows without any missing value.
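
The intermediate steps of this idea can be inspected on the example data frame from the question. This is a small illustrative sketch; the variable m is introduced only for the example.

```r
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

m <- t(df)         # 3 x 3 character matrix; the original columns are now rows
rownames(m)
# [1] "a" "b" "c"
complete.cases(m)  # TRUE where a row (an original column) has no NA
# [1]  TRUE FALSE FALSE
colnames(df)[!complete.cases(m)]
# [1] "b" "c"
```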
