Pyspark decision trees (Spark 2.0.0)

I am new to Spark (using PySpark). I tried to run the decision tree tutorial from here (link). I am executing the following code:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils

# Load and parse the data file, converting it to a DataFrame.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Now this line fails
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

I get the error message: IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'

While searching for this error, I found an answer that says:

use from pyspark.ml.linalg import Vectors, VectorUDT 
instead of 
from pyspark.mllib.linalg import Vectors, VectorUDT

which is odd, since my code does not use that import anywhere. Adding it does not help either; I still get the same error.

I do not quite understand how to debug this situation. When I view the raw data, I see:

data.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(692,[127,128,129...|  0.0|
|(692,[158,159,160...|  1.0|
|(692,[124,125,126...|  1.0|
|(692,[152,153,154...|  1.0|


Most likely you are following the tutorial for Spark 1.5.2 while actually running 2.0.0 (judging by the title, 2.0).

You are mixing two different libraries: spark.ml and spark.mllib.
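You can see the mismatch directly in the schema. A quick check (a sketch, using the data DataFrame from the question) prints which concrete vector class backs each column:

# Print the concrete (ml vs. mllib) type behind each column;
# for data built via MLUtils.loadLibSVMFile(...).toDF() the
# "features" column carries the old pyspark.mllib.linalg.VectorUDT.
for field in data.schema.fields:
    print(field.name, type(field.dataType))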

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

See: http://spark.apache.org/docs/latest/ml-guide.html
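If you want to keep the DataFrame-based spark.ml pipeline from your code, the simplest fix is to load the file with the DataFrame reader instead of MLUtils, so the features column holds the new pyspark.ml.linalg vectors. A minimal sketch (assuming the SparkSession spark that the 2.0 PySpark shell provides):

from pyspark.ml.feature import StringIndexer, VectorIndexer

# The libsvm reader yields a DataFrame whose "features" column contains
# new-style pyspark.ml.linalg vectors, which is what VectorIndexer expects.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=4).fit(data)

With the data loaded this way, the rest of the spark.ml tutorial should run unchanged.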

Alternatively, if you prefer the RDD-based API, here is the decision tree example from the Spark 2.0.0 documentation (https://spark.apache.org/docs/2.0.0/mllib-decision-tree.html):

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

"examples/src/main/python/mllib/decision_tree_classification_example.py" Spark repo.

