PySpark: changing column type from date to string

I have the following dataframe:

corr_temp_df
[('vacationdate', 'date'),
 ('valueE', 'string'),
 ('valueD', 'string'),
 ('valueC', 'string'),
 ('valueB', 'string'),
 ('valueA', 'string')]

Now I would like to change the data type of the vacationdate column to string, so that the dataframe takes on this new type and the data type is overwritten for all records. For example, after running:

corr_temp_df.dtypes

the vacationdate column should be reported as string.

I have already tried functions like cast, StringType and astype, but without success. Do you know how to do this?

1 answer

Let's create some dummy data:

import datetime
from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("vacationdate")

df = sc.parallelize([
    row(datetime.date(2015, 10, 7)),
    row(datetime.date(1971, 1, 1))
]).toDF()
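
As an aside, if all you need is for the column type to change (the default cast renders dates as yyyy-MM-dd), a plain cast should be enough; a minimal sketch against the dummy df above:

from pyspark.sql.types import StringType

# Replace the column with its string representation; dtypes then reports 'string'.
df_casted = df.withColumn("vacationdate", col("vacationdate").cast(StringType()))
df_casted.dtypes  # [('vacationdate', 'string')]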

If you are on Spark >= 1.5.0, you can use the date_format function:

from pyspark.sql.functions import date_format

(df
    .select(date_format(col("vacationdate"), "dd-MM-yyyy").alias("date_string"))
    .show())
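
If the formatted string should replace the original column, so that dtypes afterwards reports string, the same expression can be wrapped in withColumn; a sketch, assuming the df above:

# Overwrite vacationdate with its formatted string representation.
df_fmt = df.withColumn(
    "vacationdate", date_format(col("vacationdate"), "dd-MM-yyyy"))
df_fmt.dtypes  # [('vacationdate', 'string')]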

In Spark < 1.5.0 this can be done using a Hive UDF:

df.registerTempTable("df")
sqlContext.sql(
    "SELECT date_format(vacationdate, 'dd-MM-yyyy') AS date_string FROM df")

It is, of course, still available in Spark >= 1.5.0.

If you don't use a HiveContext, you can mimic date_format with a simple UDF:

from pyspark.sql.functions import udf, lit
my_date_format = udf(lambda d, fmt: d.strftime(fmt))

df.select(
    my_date_format(col("vacationdate"), lit("%d-%m-%Y")).alias("date_string")
).show()

Note that the UDF uses the C strftime format, not the Java SimpleDateFormat format (here "%d-%m-%Y" corresponds to the "dd-MM-yyyy" pattern used above).
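
If you also want this fallback available from SQL, the same lambda can be registered on the SQLContext (a sketch, assuming the df temp table registered above; registerFunction's default return type is StringType):

sqlContext.registerFunction(
    "my_date_format", lambda d, fmt: d.strftime(fmt))
sqlContext.sql(
    "SELECT my_date_format(vacationdate, '%d-%m-%Y') AS date_string FROM df").show()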


Source: https://habr.com/ru/post/1610487/

