Is it possible to store a numpy array in a Spark DataFrame column?

I have a dataframe, and I apply a function to it. This function returns a numpy array; the code is as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

create_vector_udf = udf(create_vector, ArrayType(FloatType()))
dataframe = dataframe.withColumn('vector', create_vector_udf('text'))
dmoz_spark_df.select('lang', 'url', 'vector').show(20)

Spark, however, does not seem satisfied with this and does not accept ArrayType(FloatType()). The following error message appears:

net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I could just call numpyarray.tolist() and return a list version of it, but then obviously I would always need to recreate the array whenever I want to use it with numpy.
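The round trip described above can be sketched as follows (the vector values here are made up for illustration):

```python
import numpy as np

vector = np.array([0.1, 0.2, 0.3], dtype=np.float32)

# Store: ndarray -> plain list of Python floats,
# which is what ArrayType(FloatType()) expects
as_list = vector.tolist()

# Use with numpy again: the ndarray must be recreated each time
restored = np.array(as_list, dtype=np.float32)

assert np.array_equal(vector, restored)
```

This is the overhead the question is about: the conversion is cheap for small vectors but has to happen on every read.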

So is there a way to store a numpy array in a dataframe column?

2 answers

The source of the problem is that the object returned from the UDF does not conform to the declared type. create_vector returns a numpy.ndarray, which is a NumPy type, not one supported by the DataFrame API.

A simple workaround is to convert the result to a list inside the UDF:

udf(lambda x: create_vector(x).tolist(), ArrayType(FloatType()))
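Outside Spark, the effect of this wrapper can be sketched like this (create_vector below is a hypothetical stand-in for the asker's function, which is not shown in the question):

```python
import numpy as np

def create_vector(text):
    # Hypothetical stand-in: returns a numpy array, as in the question
    return np.array([float(len(w)) for w in text.split()], dtype=np.float32)

# What the raw UDF would hand to Spark: a numpy.ndarray, which the
# pickler cannot map to ArrayType(FloatType())
raw = create_vector("some example text")
print(type(raw))    # <class 'numpy.ndarray'>

# What the wrapped UDF hands to Spark: a plain list of Python floats
wrapped = create_vector("some example text").tolist()
print(type(wrapped))    # <class 'list'>
```

The lambda changes nothing about the computation; it only converts the return value into a type the pickler can serialize for Spark.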

If you really need to keep the numpy object itself, another option is to define a custom UDT (user-defined type), similar to _SparkSklearnEstimatorUDT in spark_sklearn.


Source: https://habr.com/ru/post/1680941/
