I have a tab-delimited file containing lines like
id1 name1 ['a', 'b'] 3.0 2.0 0.0 1.0
i.e. an id, a name, a list of strings, and a series of 4 float attributes. I read this file as
rdd = sc.textFile('myfile.tsv') \
    .map(lambda row: row.split('\t'))
df = sqlc.createDataFrame(rdd, schema)
where I give the schema as
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, FloatType)

schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('list', ArrayType(StringType()), True),
    StructField('att1', FloatType(), True),
    StructField('att2', FloatType(), True),
    StructField('att3', FloatType(), True),
    StructField('att4', FloatType(), True)
])
The problem is that neither the list nor the attributes are read correctly, judging by the output of collect() on the DataFrame. In fact, I get None for all of them:
Row(id=u'id1', name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)
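To narrow this down, here is a minimal sketch in plain Python (no Spark) of what the split step produces for the sample line, assuming the fields really are separated by tab characters. Every field comes back as a str, including the list and the four floats, which may be related to the None values. The parse_fields helper is hypothetical, not something in my current code; it only illustrates one conversion I could apply in an extra map step before createDataFrame.

```python
import ast

# Sample line, assuming the fields are separated by real tab characters.
line = "id1\tname1\t['a', 'b']\t3.0\t2.0\t0.0\t1.0"

# This mirrors the lambda in the map step: the result is a list of str values.
fields = line.split('\t')
print([type(f).__name__ for f in fields])


def parse_fields(fields):
    """Hypothetical helper: convert the split strings to the schema's types."""
    return (
        fields[0],                    # id stays a string
        fields[1],                    # name stays a string
        ast.literal_eval(fields[2]),  # "['a', 'b']" becomes a Python list
        float(fields[3]),
        float(fields[4]),
        float(fields[5]),
        float(fields[6]),
    )


parsed = parse_fields(fields)
print(parsed)
```

With such a helper, the read would become sc.textFile('myfile.tsv').map(lambda row: parse_fields(row.split('\t'))), but I have not confirmed that this is the actual cause.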
What am I doing wrong?