How to delete rows with too many null values?

I want to do some preprocessing of my data and delete rows that are sparse (based on some threshold value).

For example, I have a DataFrame with 10 features (columns), and if a row has 8 null values, I want to delete it.

I found some related topics, but I cannot find useful information for my purpose.

stack overflow

Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically: I cannot hard-code column names and handle each one individually.

So, is there a way to delete such rows without referring to column names explicitly, in Apache Spark with Scala?

+5
4 answers

Test data:

    import sqlContext.implicits._

    case class Document(a: String, b: String, c: String)

    val df = sc.parallelize(Seq(
      Document(null, null, null),
      Document("a", null, null),
      Document("a", "b", null),
      Document("a", "b", "c"),
      Document(null, null, "c"))).toDF()

With UDF

Echoing David's answer and my RDD version below, you can do this with a UDF that takes a whole Row:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{udf, struct}

    // keep rows that have fewer than 2 nulls
    def nullFilter = udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) < 2)

    df.filter(nullFilter(struct(df.columns.map(df(_)): _*))).show
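Since the question asks for a configurable threshold rather than the hard-coded 2, a small variation is to pass the maximum number of allowed nulls as a parameter. This is just a sketch along the same lines; nullFilterMax and maxNulls are illustrative names, not part of the original answer:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{udf, struct}

    // keep rows that have at most `maxNulls` null columns
    def nullFilterMax(maxNulls: Int) =
      udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) <= maxNulls)

    // e.g. drop rows with 8 or more nulls out of 10 columns
    df.filter(nullFilterMax(7)(struct(df.columns.map(df(_)): _*))).show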

With RDD

You can also convert it to an RDD, loop over the columns of each row, and count how many of them are null:

    sqlContext.createDataFrame(
      df.rdd.filter(x => Range(0, x.length).count(x.isNullAt(_)) < 2),
      df.schema
    ).show
+3

Cleaner with UDF:

    import org.apache.spark.sql.functions.udf

    def countNulls = udf((v: Any) => if (v == null) 1 else 0)

    // the UDF also has to be registered by name to be usable inside the SQL string
    sqlContext.udf.register("countNulls", (v: Any) => if (v == null) 1 else 0)

    df.registerTempTable("foo")
    sqlContext.sql(
      "select " + df.columns.mkString(", ") + ", " +
        df.columns.map(c => "countNulls(" + c + ")").mkString(" + ") +
        " as nullCount from foo"
    ).filter($"nullCount" > 8).show

If the query string makes you nervous, you can try the following:

    import org.apache.spark.sql.functions.col

    var countCol: org.apache.spark.sql.Column = null
    df.columns.foreach(c => {
      if (countCol == null) countCol = countNulls(col(c))
      else countCol = countCol + countNulls(col(c))
    })

    df.select(Seq(countCol as "nullCount") ++ df.columns.map(c => col(c)): _*)
      .filter($"nullCount" > 8)
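As a variant of the same idea (a sketch, assuming the countNulls UDF defined above), the summed column can also be built with map/reduce instead of a mutable var, and the comparison flipped so that only the dense rows are kept:

    import org.apache.spark.sql.functions.col

    // sum countNulls over every column without mutating a var
    val nullCount = df.columns.map(c => countNulls(col(c))).reduce(_ + _)

    df.withColumn("nullCount", nullCount)
      .filter(col("nullCount") < 8)   // keep rows with fewer than 8 nulls
      .drop("nullCount")
      .show()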
+2

Here is an alternative in Spark 2.0:

    val df = Seq((null, "A"), (null, "B"), ("1", "C"))
      .toDF("foo", "bar")
      .withColumn("foo", 'foo.cast("Int"))

    df.show()
    +----+---+
    | foo|bar|
    +----+---+
    |null|  A|
    |null|  B|
    |   1|  C|
    +----+---+

    df.where('foo.isNull).groupBy('foo).count().show()
    +----+-----+
    | foo|count|
    +----+-----+
    |null|    2|
    +----+-----+
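The snippet above only counts the nulls in a single column. If the goal is still to drop rows that are mostly null without naming columns by hand, a Spark 2.x sketch (not part of the original answer; the "half the columns" threshold is just an example) could look like this:

    import org.apache.spark.sql.functions.{col, when}

    // per-row null count: sum a 0/1 expression over every column
    val nullCount = df.columns
      .map(c => when(col(c).isNull, 1).otherwise(0))
      .reduce(_ + _)

    // keep rows where at most half of the columns are null
    df.filter(nullCount <= df.columns.length / 2).show()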
+1

I am surprised that none of the answers mentioned that Spark SQL comes with a few standard functions that meet the requirement:

For example, I have a DataFrame with 10 features (columns), and if a row has 8 null values, I want to delete it.

You can use one of the variants of the DataFrameNaFunctions.drop method, with minNonNulls set appropriately, say to 2:

drop(minNonNulls: Int, cols: Seq[String]): DataFrame
Returns a new DataFrame that drops rows containing less than minNonNulls non-null and non-NaN values in the specified columns.

And to cope with column names not being known in advance, as in the requirement:

I cannot hard-code column names and handle each one individually.

You can simply use Dataset.columns:

columns: Array[String]
Returns all column names as an array.


Let's say you have the following dataset with 5 features (columns) and a few rows that are almost all nulls.

    val ns: String = null
    val features = Seq(
      ("0", "1", "2", ns, ns),
      (ns, ns, ns, ns, ns),
      (ns, "1", ns, "2", ns)).toDF

    scala> features.show
    +----+----+----+----+----+
    |  _1|  _2|  _3|  _4|  _5|
    +----+----+----+----+----+
    |   0|   1|   2|null|null|
    |null|null|null|null|null|
    |null|   1|null|   2|null|
    +----+----+----+----+----+

    // drop rows with more than (5 columns - 2) = 3 nulls
    scala> features.na.drop(2, features.columns).show
    +----+---+----+----+----+
    |  _1| _2|  _3|  _4|  _5|
    +----+---+----+----+----+
    |   0|  1|   2|null|null|
    |null|  1|null|   2|null|
    +----+---+----+----+----+
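For the original scenario (10 columns, delete rows with 8 or more nulls), the threshold can be derived from the column count rather than hard-coded. A sketch, assuming your DataFrame is called df:

    // keep rows with at most 7 nulls, i.e. with at least (10 - 7) = 3 non-null values
    val maxNulls = 7
    val minNonNulls = df.columns.length - maxNulls

    df.na.drop(minNonNulls, df.columns).show()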
+1

Source: https://habr.com/ru/post/1245237/

