RDD to DataFrame in pyspark (columns from the first rdd element)

Question

I created an RDD from a CSV file whose first line is the header. Now I want to create a DataFrame from this RDD, taking the column names from the RDD's first element.

The problem is that I can create the DataFrame with column names from rdd.first(), but the resulting DataFrame then also contains the header as its first data row. How can I remove it?

lines = sc.textFile('/path/data.csv')
rdd = lines.map(lambda x: x.split('#####'))  ### the separator is multi-character (can be #### or #@#), so I can't read the csv directly into a dataframe
#rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']]  ### first element is the header
df = rdd.toDF(rdd.first())  ### keeping the column names from rdd.first()
df.show()
#mailid  age  address
 mailid  age  address   #### I don't want this row in the dataframe data
 satya    23  Mumbai
 abc      27  Goa

How do I keep the first element from ending up in the DataFrame data? Is there an option I can pass to rdd.toDF(rdd.first()) to do this?

Note: I can't collect the RDD into a list, remove the first element from that list, turn the list back into an RDD, and then call toDF().

Please suggest a solution. Thanks!

python-2.7 apache-spark pyspark rdd pyspark-sql

Satya · asked Oct 26 '16 at 6:26

eliasah · Accepted Answer · 2016-10-26T06:54:05+0000

Take the header from the RDD, then filter it out of the RDD before converting to a DataFrame:

>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# |   abc| 27|    Goa|
# +------+---+-------+
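
One caveat with the filter above: it drops every row that equals the header, not only the first one. That is usually fine (and even useful when several input files each repeat the header), but if a data row could legitimately match the header, dropping by position may be safer. Below is a minimal sketch of that idea. It is not part of the original answer, and the helper name drop_first_element is hypothetical; zipWithIndex, filter and keys are standard RDD methods.

```python
def drop_first_element(rdd):
    """Return a new RDD without the element at position 0.

    Hypothetical helper (not from the original answer). It pairs each
    element with its index via zipWithIndex(), keeps only the pairs whose
    index is greater than 0, and strips the index again with keys().
    """
    indexed = rdd.zipWithIndex()                       # (element, index) pairs
    without_first = indexed.filter(lambda pair: pair[1] > 0)
    return without_first.keys()                        # back to plain elements


# Intended use with the `rdd` from the question (assumes an active SparkContext):
#   header = rdd.first()
#   data = drop_first_element(rdd).toDF(header)
```

Note that zipWithIndex triggers an extra Spark job when the RDD has more than one partition, so the filter shown above is the cheaper choice when duplicate header rows are not a concern.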
