Pyspark error: DataFrame object does not have 'map' attribute

Question

I am using PySpark 2.0 and create a DataFrame by reading a CSV file:

data = spark.read.csv('data.csv', header=True)

I check the object's type with

type(data)

Result

pyspark.sql.dataframe.DataFrame

I am trying to convert some of the columns to LabeledPoint objects so that I can run a classifier on them.

from pyspark.sql.types import *
from pyspark.sql.functions import col
from pyspark.mllib.regression import LabeledPoint

data.select(['label', 'features']).map(lambda row: LabeledPoint(row.label, row.features))

I ran into this problem:

AttributeError: 'DataFrame' object has no attribute 'map'

Any idea what causes this? Is there a way to generate LabeledPoint objects from a DataFrame so that I can do the classification?

apache-spark spark-dataframe apache-spark-2.0

Xi liang Sep 08 '16 at 1:26

1 answer

user6022341 · Answer 1 · 2016-09-08T01:29:04+0000

Use .rdd.map:

>>> data.select(...).rdd.map(...)

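In PySpark, DataFrame.map was removed in Spark 2.0. The .rdd attribute exposes the DataFrame's underlying RDD of Row objects, and that RDD still has map.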
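
To make that concrete, here is a minimal sketch of the full conversion. It assumes the features are spread across several numeric CSV columns; the helper names row_to_label_and_features and to_labeled_points, and the column names f1 and f2 in the usage note, are hypothetical and not from the question. Because spark.read.csv without a schema returns every column as a string, the sketch casts each value to float before building the LabeledPoint.

```python
def row_to_label_and_features(row, label_col, feature_cols):
    """Turn one row (anything indexable by column name) into (label, features).

    Values are cast to float because CSV columns read without a schema
    arrive as strings.
    """
    label = float(row[label_col])
    features = [float(row[c]) for c in feature_cols]
    return label, features


def to_labeled_points(df, label_col, feature_cols):
    """Return an RDD of LabeledPoint built from a PySpark DataFrame."""
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.mllib.regression import LabeledPoint

    def convert(row):
        label, features = row_to_label_and_features(row, label_col, feature_cols)
        return LabeledPoint(label, features)

    # DataFrame has no .map in PySpark 2.x, so go through the underlying RDD.
    return df.select([label_col] + list(feature_cols)).rdd.map(convert)
```

With the DataFrame from the question, a call might look like to_labeled_points(data, 'label', ['f1', 'f2']), where f1 and f2 stand in for whatever feature columns the CSV actually contains. The result is an RDD, which is what the pyspark.mllib classifiers expect as training input.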
