Reading types correctly from a file in PySpark

I have a tab delimited file containing lines like

id1 name1 ['a', 'b'] 3.0 2.0 0.0 1.0 

i.e. an id, a name, a list of strings, and a series of 4 float attributes. I read this file as

    rdd = sc.textFile('myfile.tsv') \
        .map(lambda row: row.split('\t'))

    df = sqlc.createDataFrame(rdd, schema)

where I give the schema as

    schema = StructType([
        StructField('id', StringType(), True),
        StructField('name', StringType(), True),
        StructField('list', ArrayType(StringType()), True),
        StructField('att1', FloatType(), True),
        StructField('att2', FloatType(), True),
        StructField('att3', FloatType(), True),
        StructField('att4', FloatType(), True)
    ])

The problem is that neither the list nor the attributes are read correctly, judging by a collect() on the DataFrame. In fact, I get None for all of them:

    Row(id=u'id1', name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)

What am I doing wrong?

1 answer

It reads correctly, it just doesn't work as you expect. The schema argument declares the types in order to avoid expensive schema inference; it does not cast the data. Providing input that conforms to the declared schema is your responsibility.
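
A minimal sketch of the same mismatch (a hypothetical session, not from the original answer): a str value is handed to a column declared as FloatType, reproducing the symptom from the question. On the Spark 1.x setup the question uses this collects to None; newer Spark versions may instead raise a TypeError at createDataFrame time because of schema verification.

    from pyspark.sql.types import StructType, StructField, FloatType

    toy_schema = StructType([StructField('x', FloatType(), True)])
    # '3.0' is a string, not a float, so it does not survive the declared type
    sqlc.createDataFrame(sc.parallelize([('3.0',)]), toy_schema).collect()
    # -> [Row(x=None)] here; recent versions raise a TypeError instead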

Casting can also be handled by a data source (see spark-csv and its inferSchema option), but that will not handle complex types such as arrays.
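
For example, a sketch of the data-source route, assuming the spark-csv package is on the classpath and the file is tab delimited with no header; inferSchema takes care of the float columns, but the list column still arrives as a plain string:

    df = sqlc.read \
        .format('com.databricks.spark.csv') \
        .options(header='false', delimiter='\t', inferSchema='true') \
        .load('myfile.tsv')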

Since your schema is mostly flat and you know the types, you can try something like this:

    from pyspark.sql.functions import col

    df = rdd.toDF([f.name for f in schema.fields])

    exprs = [
        # You should exclude casting
        # on other complex types as well
        col(f.name).cast(f.dataType) if f.dataType.typeName() != "array"
        else col(f.name)
        for f in schema.fields
    ]

    df.select(*exprs)

and handle the complex types separately, using string processing or a UDF. Alternatively, since you are reading the data in Python anyway, you can simply build the required types before creating the DataFrame, as sketched below.
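
A minimal sketch of that second option, assuming every line has exactly seven tab-separated fields and the third field is a Python-style list literal; the parse helper is hypothetical, not part of the original answer:

    from ast import literal_eval

    def parse(fields):
        # keep id and name as strings, turn the list literal into a real list,
        # and cast the four attributes to float before the DataFrame is built
        return (fields[0], fields[1], literal_eval(fields[2]),
                float(fields[3]), float(fields[4]),
                float(fields[5]), float(fields[6]))

    rdd = sc.textFile('myfile.tsv') \
        .map(lambda row: row.split('\t')) \
        .map(parse)

    df = sqlc.createDataFrame(rdd, schema)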


Source: https://habr.com/ru/post/1245027/

