What does Spark write for vector values when writing to CSV?

Here is some code that writes predictions from a LogisticRegression model to JSON:

    from pyspark.sql import Row
    from pyspark.ml.linalg import DenseVector

    (predictions
        .drop(feature_col)
        .rdd
        .map(lambda x: Row(weight=x.weight,
                           target=x[target],
                           label=x.label,
                           prediction=x.prediction,
                           probability=DenseVector(x.probability)))
        .coalesce(1)  # single output file
        .toDF()
        .write
        .json("{}/{}/summary/predictions".format(path, self._model.bestModel.uid)))

Here is one example JSON object:

    {"label":1.0,"prediction":0.0,"probability":{"type":1,"values":[0.5835784358591029,0.4164215641408972]},"target":"Male","weight":99}

I would like to output the same data to a CSV file (ideally writing only probability.values[0], the first element of the values array). However, when I use the same code as above but replace .json with .csv, I get the following result:

    1.0,0.0,"[6,1,0,0,280000001c,c00000002,af154d3100000014,a1d5659f3fe2acac,3fdaa6a6]",Male,99

What is happening to the third column (the quoted string that looks like an array of values)?

1 answer

"" , , json, , , - .

    from pyspark.sql.functions import col

    predictions = predictions.withColumn("probability", col("probability").cast("string"))
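Since the goal is to write only probability.values[0], another option is to extract the first element as a plain double before writing. A minimal sketch using a Python UDF; it assumes the column holds pyspark.ml.linalg vectors, and output_path is a placeholder for the destination directory:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Pull the first probability out of the vector as a plain double,
    # a type the CSV writer can handle natively.
    first_prob = udf(lambda v: float(v[0]), DoubleType())

    (predictions
        .withColumn("probability", first_prob("probability"))
        .write
        .csv(output_path))  # output_path: hypothetical destination, e.g. the same layout as the JSON write

On Spark 3.0+ the Python UDF can be avoided entirely: pyspark.ml.functions.vector_to_array converts the vector into an array column, so vector_to_array(col("probability"))[0] yields the same value.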

Source: https://habr.com/ru/post/1652099/

