How to find columns with missing data in sparklyr

Sample data examples

Si      K       Ca      Ba      Fe      Type
71.78   0.06    8.75    0       0       1
72.73   0.48    7.83    0       0       1
72.99   0.39    7.78    0       0       1
72.61   0.57    na      0       0       na
73.08   0.55    8.07    0       0       1
72.97   0.64    8.07    0       na      1
73.09   na      8.17    0       0       1
73.24   0.57    8.24    0       0       1
72.08   0.56    8.3     0       0       1
72.99   0.57    8.4     0       0.11    1
na      0.67    8.09    0       0.24    1

We can load the data into sparklyr with the following code:

sdf_copy_to(sc, sampledata)

I am looking for a query that returns the columns with NA values, e.g.:

si k ca fe
1  1  1 2
1 answer

This problem is actually a bit complicated, due to the implementation of tbl_spark and the incompatible semantics of Spark and R. Even if colSums could be applied, Spark SQL does not allow implicit conversions between Boolean and numeric types. This means that you must explicitly apply as.numeric:

library(dplyr)

sampledata <- copy_to(sc, data.frame(x=c(1, NA, 2), y=c(NA, 2, NA), z=42))

sampledata %>% 
  mutate_all(is.na) %>% 
  mutate_all(as.numeric) %>%
  summarize_all(sum)
# Source:   lazy query [?? x 3]
# Database: spark_connection
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     0
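
For comparison, here is a minimal sketch of the same count on a local (non-Spark) data frame, where R converts logical values to numbers implicitly, so no as.numeric step is needed. The name local_data is hypothetical and only used for illustration; the values mirror the example above.

```r
# Local (non-Spark) data frame with the same values as the example above.
# `local_data` is a hypothetical name used only for this illustration.
local_data <- data.frame(x = c(1, NA, 2), y = c(NA, 2, NA), z = 42)

# is.na() returns a logical matrix; colSums() coerces TRUE/FALSE to 1/0,
# which is the implicit conversion that Spark SQL does not perform.
na_counts <- colSums(is.na(local_data))
print(na_counts)

# Keep only the columns that contain at least one NA.
cols_with_na <- na_counts[na_counts > 0]
print(cols_with_na)
```

The same final filter could presumably be applied to the Spark result after calling collect() on the one-row summary, since that row is small enough to bring into R.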

Source: https://habr.com/ru/post/1689980/

