Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> to vector

Given my PySpark Row object:

    >>> row
    Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
    >>> row.clicked
    0
    >>> row.features
    SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
    >>> type(row.features)
    <class 'pyspark.ml.linalg.SparseVector'>

However, row.features fails the isinstance(row.features, Vector) test:

    >>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
    True
    >>> isinstance(row.features, Vector)
    False
    >>> isinstance(deepcopy(row.features), Vector)
    False

This strange error has caused me huge problems: without passing the isinstance(row.features, Vector) check, I cannot generate LabeledPoint objects with a map function. I would be very grateful if anyone can solve this problem.
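
Roughly, what the failing map looks like; the RDD name rows_rdd is a stand-in for my RDD of such Rows, not code copied from above:

    from pyspark.mllib.regression import LabeledPoint

    # rows_rdd is a hypothetical RDD of Rows like the one shown above.
    # This raises "Cannot convert type <class 'pyspark.ml.linalg.SparseVector'>
    # into Vector", because row.features is a pyspark.ml vector while
    # LabeledPoint expects the old pyspark.mllib kind.
    labeled = rows_rdd.map(lambda row: LabeledPoint(row.clicked, row.features))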

2 answers

This is most likely a mistake on your side. You did not provide the code needed to reproduce the problem, but most likely you are using Spark 2.0 with ML transformers and comparing against the wrong Vector class.

Let's illustrate this with an example. First, some simple data:

    from pyspark.ml.feature import OneHotEncoder

    row = OneHotEncoder(inputCol="x", outputCol="features").transform(
        sc.parallelize([(1.0, )]).toDF(["x"])
    ).first()
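
Inspecting row shows the same kind of object as in the question (output reconstructed under the assumption of Spark 2.0's default OneHotEncoder settings):

    >>> row.features
    SparseVector(1, {})
    >>> type(row.features)
    <class 'pyspark.ml.linalg.SparseVector'>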

Now you can import various vector classes:

    from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
    from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
    from pyspark.mllib.regression import LabeledPoint

and run the tests:

    isinstance(row.features, MLLibVector)
    False

    isinstance(row.features, MLVector)
    True

As you can see, we have a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and the former is incompatible with the old MLlib API:

    LabeledPoint(0.0, row.features)

    TypeError                                 Traceback (most recent call last)
    ...
    TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

You can convert an ML object to an MLlib one:

    from pyspark.ml import linalg as ml_linalg

    def as_mllib(v):
        if isinstance(v, ml_linalg.SparseVector):
            return MLLibVectors.sparse(v.size, v.indices, v.values)
        elif isinstance(v, ml_linalg.DenseVector):
            return MLLibVectors.dense(v.toArray())
        else:
            raise TypeError("Unsupported type: {0}".format(type(v)))

    LabeledPoint(0, as_mllib(row.features))
    LabeledPoint(0.0, (1,[],[]))

or simply (Vectors.fromML is available since Spark 2.0):

    LabeledPoint(0, MLLibVectors.fromML(row.features))

    LabeledPoint(0.0, (1,[],[]))
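
Either conversion can then be used inside a map to build LabeledPoints from a DataFrame like the one in the question; a minimal sketch, with df and its columns assumed:

    # df is an assumed DataFrame with 'clicked' and 'features' columns
    labeled = df.rdd.map(
        lambda r: LabeledPoint(r.clicked, MLLibVectors.fromML(r.features))
    )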

but, generally speaking, you should avoid situations where this conversion is necessary in the first place.


If you just want to convert SparseVectors from pyspark.ml to pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column with SparseVectors is called "features". Then the conversion takes just a couple of lines:

    from pyspark.mllib.util import MLUtils

    df = MLUtils.convertVectorColumnsFromML(df, "features")
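
MLUtils also provides the reverse conversion if you ever need to go back to pyspark.ml vectors (the column name "features" is assumed, as above):

    # convert mllib vectors in 'features' back to pyspark.ml vectors
    df = MLUtils.convertVectorColumnsToML(df, "features")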

This problem arose for me because, when using the CountVectorizer from pyspark.ml.feature, I could not create LabeledPoints due to the incompatibility with the SparseVector from pyspark.ml.
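
A minimal sketch of that situation and the MLUtils workaround; the toy data, column names, and the spark session variable are my assumptions:

    from pyspark.ml.feature import CountVectorizer
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    # toy DataFrame with a numeric label and a tokenized text column
    df = spark.createDataFrame(
        [(0.0, ["a", "b", "c"]), (1.0, ["a", "a", "b"])], ["label", "words"])

    # CountVectorizer emits pyspark.ml SparseVectors...
    vectorized = CountVectorizer(inputCol="words", outputCol="features") \
        .fit(df).transform(df)

    # ...so convert the column before building mllib LabeledPoints
    converted = MLUtils.convertVectorColumnsFromML(vectorized, "features")
    labeled = converted.rdd.map(lambda r: LabeledPoint(r.label, r.features))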

I wonder why the latest CountVectorizer documentation doesn't use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, that makes no sense to me...

UPDATE: I had misunderstood this. The ml library is designed for DataFrame objects, while the mllib library is for RDD objects. The DataFrame data structure is recommended from Spark 2.0 on, because SparkSession is more compatible than SparkContext (though it keeps a SparkContext internally) and delivers DataFrames instead of RDDs. I found this post that gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).

To use, for example, NaiveBayes from mllib, I had to convert my DataFrame into LabeledPoint objects.

But it's easier to use NaiveBayes from ml, because then you don't need LabeledPoints at all: you can just specify a features column and a label column for your DataFrame.
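
A minimal sketch of the ml-style API; train_df and test_df are assumed DataFrames with "features" (ml vectors) and "label" columns:

    from pyspark.ml.classification import NaiveBayes

    # no LabeledPoints needed: just point the estimator at the columns
    nb = NaiveBayes(featuresCol="features", labelCol="label")
    model = nb.fit(train_df)
    predictions = model.transform(test_df)  # adds a 'prediction' column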

PS: I struggled with these problems for hours, so I felt I needed to post it here :)

