How to delete rows with too many null values?

I want to do some preprocessing of my data and delete rows that are sparse (based on some threshold value).

For example, I have a DataFrame with 10 features (columns), and if a row has 8 null values, I want to delete it.

I found some related topics, but I cannot find useful information for my purpose.

stack overflow

Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically: I cannot hard-code column names and handle each one individually.

So, is there a way to delete such rows without referring to column names explicitly, in Apache Spark with Scala?

+5
4 answers

Test data:

    import sqlContext.implicits._

    case class Document(a: String, b: String, c: String)

    val df = sc.parallelize(Seq(
      Document(null, null, null),
      Document("a", null, null),
      Document("a", "b", null),
      Document("a", "b", "c"),
      Document(null, null, "c"))).toDF()

With UDF

Echoing David's answer and my RDD version below, you can do this with a UDF that takes a whole Row:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{udf, struct}

    // keep rows that have fewer than 2 nulls
    def nullFilter = udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) < 2)

    df.filter(nullFilter(struct(df.columns.map(df(_)): _*))).show
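Since the question asks for a configurable threshold rather than the hard-coded 2, a small variation is to pass the maximum number of allowed nulls as a parameter. This is just a sketch along the same lines; nullFilterMax and maxNulls are illustrative names, not part of the original answer:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{udf, struct}

    // keep rows that have at most `maxNulls` null columns
    def nullFilterMax(maxNulls: Int) =
      udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) <= maxNulls)

    // e.g. drop rows with 8 or more nulls out of 10 columns
    df.filter(nullFilterMax(7)(struct(df.columns.map(df(_)): _*))).show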

With RDD

You can also convert it to an RDD, loop over the columns of each row, and count how many of them are null:

    sqlContext.createDataFrame(
      df.rdd.filter(x => Range(0, x.length).count(x.isNullAt(_)) < 2),
      df.schema
    ).show
+3

Cleaner with UDF:

    import org.apache.spark.sql.functions.udf

    def countNulls = udf((v: Any) => if (v == null) 1 else 0)

    // the UDF also has to be registered by name to be usable inside the SQL string
    sqlContext.udf.register("countNulls", (v: Any) => if (v == null) 1 else 0)

    df.registerTempTable("foo")
    sqlContext.sql(
      "select " + df.columns.mkString(", ") + ", " +
        df.columns.map(c => "countNulls(" + c + ")").mkString(" + ") +
        " as nullCount from foo"
    ).filter($"nullCount" > 8).show

If the query string makes you nervous, you can try the following:

    import org.apache.spark.sql.functions.col

    var countCol: org.apache.spark.sql.Column = null
    df.columns.foreach(c => {
      if (countCol == null) countCol = countNulls(col(c))
      else countCol = countCol + countNulls(col(c))
    })

    df.select(Seq(countCol as "nullCount") ++ df.columns.map(c => col(c)): _*)
      .filter($"nullCount" > 8)
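As a variant of the same idea (a sketch, assuming the countNulls UDF defined above), the summed column can also be built with map/reduce instead of a mutable var, and the comparison flipped so that only the dense rows are kept:

    import org.apache.spark.sql.functions.col

    // sum countNulls over every column without mutating a var
    val nullCount = df.columns.map(c => countNulls(col(c))).reduce(_ + _)

    df.withColumn("nullCount", nullCount)
      .filter(col("nullCount") < 8)   // keep rows with fewer than 8 nulls
      .drop("nullCount")
      .show()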
+2

Here is an alternative in Spark 2.0:

    val df = Seq((null, "A"), (null, "B"), ("1", "C"))
      .toDF("foo", "bar")
      .withColumn("foo", 'foo.cast("Int"))

    df.show()
    +----+---+
    | foo|bar|
    +----+---+
    |null|  A|
    |null|  B|
    |   1|  C|
    +----+---+

    df.where('foo.isNull).groupBy('foo).count().show()
    +----+-----+
    | foo|count|
    +----+-----+
    |null|    2|
    +----+-----+
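The snippet above only counts the nulls in a single column. If the goal is still to drop rows that are mostly null without naming columns by hand, a Spark 2.x sketch (not part of the original answer; the "half the columns" threshold is just an example) could look like this:

    import org.apache.spark.sql.functions.{col, when}

    // per-row null count: sum a 0/1 expression over every column
    val nullCount = df.columns
      .map(c => when(col(c).isNull, 1).otherwise(0))
      .reduce(_ + _)

    // keep rows where at most half of the columns are null
    df.filter(nullCount <= df.columns.length / 2).show()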
+1

I am surprised that none of the answers mentioned that Spark SQL comes with a few standard functions that meet the requirement:

For example, I have a DataFrame with 10 features (columns), and if a row has 8 null values, I want to delete it.

You can use one of the variants of the DataFrameNaFunctions.drop method, with minNonNulls set appropriately, say to 2:

drop(minNonNulls: Int, cols: Seq[String]): DataFrame
Returns a new DataFrame that drops rows containing less than minNonNulls non-null and non-NaN values in the specified columns.

And to cope with column names not being known in advance, as in the requirement:

I cannot hard-code column names and handle each one individually.

You can simply use Dataset.columns:

columns: Array[String]
Returns all column names as an array.


Let's say you have the following dataset with 5 features (columns) and a few rows that are almost all nulls.

    val ns: String = null
    val features = Seq(
      ("0", "1", "2", ns, ns),
      (ns, ns, ns, ns, ns),
      (ns, "1", ns, "2", ns)).toDF

    scala> features.show
    +----+----+----+----+----+
    |  _1|  _2|  _3|  _4|  _5|
    +----+----+----+----+----+
    |   0|   1|   2|null|null|
    |null|null|null|null|null|
    |null|   1|null|   2|null|
    +----+----+----+----+----+

    // drop rows with more than (5 columns - 2) = 3 nulls
    scala> features.na.drop(2, features.columns).show
    +----+---+----+----+----+
    |  _1| _2|  _3|  _4|  _5|
    +----+---+----+----+----+
    |   0|  1|   2|null|null|
    |null|  1|null|   2|null|
    +----+---+----+----+----+
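For the original scenario (10 columns, delete rows with 8 or more nulls), the threshold can be derived from the column count rather than hard-coded. A sketch, assuming your DataFrame is called df:

    // keep rows with at most 7 nulls, i.e. with at least (10 - 7) = 3 non-null values
    val maxNulls = 7
    val minNonNulls = df.columns.length - maxNulls

    df.na.drop(minNonNulls, df.columns).show()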
+1

Source: https://habr.com/ru/post/1245237/

