I am new to Spark (using PySpark). I tried to run the decision tree tutorial from here (link). I am executing the following code:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
I get the error message:

IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'
While searching for this error, I found an answer that says:
use from pyspark.ml.linalg import Vectors, VectorUDT
instead of
from pyspark.mllib.linalg import Vectors, VectorUDT
which is odd, since I never use that import anywhere. Also, adding this import to my code doesn't change anything; I still get the same error.
I do not quite understand how to debug this. When I view the raw data, I see:
data.show()
+--------------------+-----+
| features|label|
+--------------------+-----+
|(692,[127,128,129...| 0.0|
|(692,[158,159,160...| 1.0|
|(692,[124,125,126...| 1.0|
|(692,[152,153,154...| 1.0|
|                 ...|  ...|
+--------------------+-----+