What does Spark write for vector values when writing to CSV?

Here is some code that writes predictions from a LogisticRegression model to JSON:

    from pyspark.sql import Row
    from pyspark.ml.linalg import DenseVector

    (predictions
        .drop(feature_col)
        .rdd
        .map(lambda x: Row(weight=x.weight,
                           target=x[target],
                           label=x.label,
                           prediction=x.prediction,
                           probability=DenseVector(x.probability)))
        .coalesce(1)  # single output file
        .toDF()
        .write
        .json("{}/{}/summary/predictions".format(path, self._model.bestModel.uid)))

Here is one example JSON object:

    {"label":1.0,"prediction":0.0,"probability":{"type":1,"values":[0.5835784358591029,0.4164215641408972]},"target":"Male","weight":99}

I would like to output the same data to a CSV file (ideally writing only probability.values[0], the first element of the values array). However, when I use the same code as above but replace .json with .csv, I get the following result:

    1.0,0.0,"[6,1,0,0,280000001c,c00000002,af154d3100000014,a1d5659f3fe2acac,3fdaa6a6]",Male,99

What is happening to the third column (the quoted string that looks like an array of values)?

1 answer

"" , , json, , , - .

    from pyspark.sql.functions import col

    predictions = predictions.withColumn("probability", col("probability").cast("string"))
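Since the goal is to write only probability.values[0], another option is to extract the first element as a plain double before writing. A minimal sketch using a Python UDF; it assumes the column holds pyspark.ml.linalg vectors, and output_path is a placeholder for the destination directory:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Pull the first probability out of the vector as a plain double,
    # a type the CSV writer can handle natively.
    first_prob = udf(lambda v: float(v[0]), DoubleType())

    (predictions
        .withColumn("probability", first_prob("probability"))
        .write
        .csv(output_path))  # output_path: hypothetical destination, e.g. the same layout as the JSON write

On Spark 3.0+ the Python UDF can be avoided entirely: pyspark.ml.functions.vector_to_array converts the vector into an array column, so vector_to_array(col("probability"))[0] yields the same value.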

Source: https://habr.com/ru/post/1652099/

