PySpark: changing column type from date to string

I have the following dataframe:

corr_temp_df
[('vacationdate', 'date'),
 ('valueE', 'string'),
 ('valueD', 'string'),
 ('valueC', 'string'),
 ('valueB', 'string'),
 ('valueA', 'string')]

Now I would like to change the data type of the vacationdate column to string, so that the dataframe takes on this new type and the data type is overwritten for all records. For example, after running:

corr_temp_df.dtypes

the vacationdate column should be reported as string.

I have already tried functions like cast, StringType and astype, but without success. Do you know how to do this?

1 answer

Let's create some dummy data:

import datetime
from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("vacationdate")

df = sc.parallelize([
    row(datetime.date(2015, 10, 7)),
    row(datetime.date(1971, 1, 1))
]).toDF()
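
As an aside, if all you need is for the column type to change (the default cast renders dates as yyyy-MM-dd), a plain cast should be enough; a minimal sketch against the dummy df above:

from pyspark.sql.types import StringType

# Replace the column with its string representation; dtypes then reports 'string'.
df_casted = df.withColumn("vacationdate", col("vacationdate").cast(StringType()))
df_casted.dtypes  # [('vacationdate', 'string')]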

If you are on Spark >= 1.5.0, you can use the date_format function:

from pyspark.sql.functions import date_format

(df
    .select(date_format(col("vacationdate"), "dd-MM-yyyy").alias("date_string"))
    .show())
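
If the formatted string should replace the original column, so that dtypes afterwards reports string, the same expression can be wrapped in withColumn; a sketch, assuming the df above:

# Overwrite vacationdate with its formatted string representation.
df_fmt = df.withColumn(
    "vacationdate", date_format(col("vacationdate"), "dd-MM-yyyy"))
df_fmt.dtypes  # [('vacationdate', 'string')]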

In Spark < 1.5.0 this can be done using a Hive UDF:

df.registerTempTable("df")
sqlContext.sql(
    "SELECT date_format(vacationdate, 'dd-MM-yyyy') AS date_string FROM df")

It is, of course, still available in Spark >= 1.5.0.

If you don't use a HiveContext, you can mimic date_format with a simple UDF:

from pyspark.sql.functions import udf, lit
my_date_format = udf(lambda d, fmt: d.strftime(fmt))

df.select(
    my_date_format(col("vacationdate"), lit("%d-%m-%Y")).alias("date_string")
).show()

Note that the UDF uses the C strftime format, not the Java SimpleDateFormat format (here "%d-%m-%Y" corresponds to the "dd-MM-yyyy" pattern used above).
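
If you also want this fallback available from SQL, the same lambda can be registered on the SQLContext (a sketch, assuming the df temp table registered above; registerFunction's default return type is StringType):

sqlContext.registerFunction(
    "my_date_format", lambda d, fmt: d.strftime(fmt))
sqlContext.sql(
    "SELECT my_date_format(vacationdate, '%d-%m-%Y') AS date_string FROM df").show()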


Source: https://habr.com/ru/post/1610487/

