I ran into a problem while simply trying to read a CSV file with Spark. After the read I would like to make sure that:
- data types are correct (using the provided schema)
- the headers are correct with respect to the provided schema
Here is the code I use and have problems with:
import org.apache.spark.sql.Encoders

// Build the schema from the case class T and apply it while reading the CSV
val schema = Encoders.product[T].schema
val df = spark.read
  .schema(schema)
  .option("header", "true")
  .csv(fileName)
The type T is a Product, i.e. a case class. This works, but it does not validate the column names, so I can provide a different file and, as long as the data types are correct, no error occurs; I never find out that the user supplied the wrong file which by coincidence has the correct data types in the proper order.
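To make the problem concrete, here is a minimal sketch (the Person case class and wrong_file.csv are made up for illustration, not my real code):

import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val schema = Encoders.product[Person].schema

// wrong_file.csv has the header "city,zip", but its columns happen to be a
// String followed by an Int, so the types line up with Person by coincidence.
val df = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("wrong_file.csv")

// No error is raised: the header row is simply skipped, the columns are named
// "name" and "age" from the schema, and I never learn the wrong file was used.
df.show()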
I also tried using the option that infers the schema and then calling the .as[T] method on the Dataset, but if any column other than a String column contains only nulls, Spark infers it as String, while in my schema it is Integer. A cast exception therefore occurs, even though the column names have been checked correctly.
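Roughly, that second attempt looks like the sketch below (again with the made-up Person class and people.csv; the interesting case is a column that contains only empty values):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Let Spark infer the schema instead of supplying one, then cast to the case class.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")

// If every value in the "age" column is empty, Spark infers it as StringType,
// so this cast fails ("cannot up cast `age` from string to int") even though
// the file is valid and the column names have already been resolved correctly.
val ds = inferred.as[Person]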
To summarize: I have one approach that enforces the correct data types but does not check the headers, and another that checks the headers but has problems with the data types. Is there a way to achieve both, i.e. a full validation of both the headers and the types?
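The closest workaround I can think of is to read the file twice: once without a schema just to compare the header names against the case class fields, and once with the explicit schema to enforce the types. A rough sketch of that idea (my own workaround, not a built-in Spark option; people.csv is again a placeholder):

import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val expected = Encoders.product[Person].schema

// First pass: read without a schema only to pick up the actual header names.
val actualHeader = spark.read
  .option("header", "true")
  .csv("people.csv")
  .schema
  .fieldNames

require(actualHeader.sameElements(expected.fieldNames),
  s"Unexpected header: ${actualHeader.mkString(", ")}")

// Second pass: the header matches, so apply the explicit schema for the types.
val df = spark.read
  .schema(expected)
  .option("header", "true")
  .csv("people.csv")

I would prefer something that does not require reading the file twice, ideally built into the reader itself.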
I am using Spark 2.2.0.