Removing columns by data type in Scala Spark

df1.printSchema() displays the column names and the type of data they possess.

df1.drop($"colName") drops a column by name.

Is there a way to adapt this command to delete by data type?

2 answers

If you want to drop specific columns from a DataFrame based on their type, the snippet below will help. In this example, I have a DataFrame with two fields, one of type String and one of type Int, and I drop every field of type String from the schema.

import sqlContext.implicits._

val df = sc.parallelize(('a' to 'j').map(_.toString) zip (1 to 10)).toDF("c1", "c2")

val fields = df.schema.fields
  .filter(_.dataType match {
    case org.apache.spark.sql.types.StringType => true
    case _                                     => false
  })
  .map(_.name)

val newDf = fields.foldLeft(df){ case(dframe,field) => dframe.drop(field) }

The resulting schema, as the REPL reports it: newDf: org.apache.spark.sql.DataFrame = [c2: int]
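Since running a full Spark session isn't possible here, below is a plain-Scala sketch of the same filter-and-fold pattern, modeling the schema as (name, typeName) pairs instead of Spark's StructField. The object DropByType and its type alias are hypothetical names introduced for illustration only.

```scala
object DropByType {
  // Hypothetical stand-in for Spark's StructType: (columnName, typeName) pairs.
  type Schema = List[(String, String)]

  // Collect the names of columns whose type matches, mirroring the
  // df.schema.fields filter ... map { _.name } step from the answer above.
  def columnsOfType(schema: Schema, tpe: String): List[String] =
    schema.filter { case (_, t) => t == tpe }.map(_._1)

  // Remove each matching column one at a time, mirroring
  // fields.foldLeft(df) { case (dframe, field) => dframe.drop(field) }.
  def dropByType(schema: Schema, tpe: String): Schema =
    columnsOfType(schema, tpe).foldLeft(schema) { (s, name) =>
      s.filterNot { case (n, _) => n == name }
    }

  def main(args: Array[String]): Unit = {
    val schema = List("c1" -> "StringType", "c2" -> "IntegerType")
    println(dropByType(schema, "StringType")) // List((c2,IntegerType))
  }
}
```

As a side note, recent Spark versions also let you drop several columns at once with the varargs overload, e.g. df.drop(fields: _*), which avoids the fold entirely.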


Here is a more compact example in Scala:

val categoricalFeatColNames = df.schema.fields
  .filter(_.dataType.isInstanceOf[org.apache.spark.sql.types.StringType])
  .map(_.name)

Source: https://habr.com/ru/post/1668165/
