Is it possible to store a numpy array in a Spark DataFrame column?

I have a dataframe, and I apply a function to it. This function returns a numpy array; the code is as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

create_vector_udf = udf(create_vector, ArrayType(FloatType()))
dataframe = dataframe.withColumn('vector', create_vector_udf('text'))
dmoz_spark_df.select('lang', 'url', 'vector').show(20)

Spark, however, does not seem satisfied with this and does not accept ArrayType(FloatType()). The following error message appears:

net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I could just call numpyarray.tolist() and return a list version of it, but then obviously I would always need to recreate the array whenever I want to use it with numpy.
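The round trip described above can be sketched as follows (the vector values here are made up for illustration):

```python
import numpy as np

vector = np.array([0.1, 0.2, 0.3], dtype=np.float32)

# Store: ndarray -> plain list of Python floats,
# which is what ArrayType(FloatType()) expects
as_list = vector.tolist()

# Use with numpy again: the ndarray must be recreated each time
restored = np.array(as_list, dtype=np.float32)

assert np.array_equal(vector, restored)
```

This is the overhead the question is about: the conversion is cheap for small vectors but has to happen on every read.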

So is there a way to store a numpy array in a dataframe column?

2 answers

The source of the problem is that the object returned from the UDF does not conform to the declared type. create_vector returns a numpy.ndarray, which is a NumPy type, not one supported by the DataFrame API.

A simple workaround is to convert the result to a list inside the UDF:

udf(lambda x: create_vector(x).tolist(), ArrayType(FloatType()))
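Outside Spark, the effect of this wrapper can be sketched like this (create_vector below is a hypothetical stand-in for the asker's function, which is not shown in the question):

```python
import numpy as np

def create_vector(text):
    # Hypothetical stand-in: returns a numpy array, as in the question
    return np.array([float(len(w)) for w in text.split()], dtype=np.float32)

# What the raw UDF would hand to Spark: a numpy.ndarray, which the
# pickler cannot map to ArrayType(FloatType())
raw = create_vector("some example text")
print(type(raw))    # <class 'numpy.ndarray'>

# What the wrapped UDF hands to Spark: a plain list of Python floats
wrapped = create_vector("some example text").tolist()
print(type(wrapped))    # <class 'list'>
```

The lambda changes nothing about the computation; it only converts the return value into a type the pickler can serialize for Spark.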

If you really need to keep the numpy object itself, another option is to define a custom UDT (user-defined type), similar to _SparkSklearnEstimatorUDT in spark_sklearn.


Source: https://habr.com/ru/post/1680941/
