You can use a simple map over the underlying RDD, just as with any other RDD:
import datetime
from pyspark.sql import Row

elevDF = sqlContext.createDataFrame(sc.parallelize([
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)
]))

# Extract (year, month, day) from the timestamp column
(elevDF
    .rdd
    .map(lambda row: (row.date.year, row.date.month, row.date.day))
    .collect())
and the result:
[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
Btw: datetime.datetime stores the hour anyway, so keeping it in a separate column looks like a waste of memory.
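For example, the hour is already accessible on the datetime objects themselves (a minimal sketch using the elevDF defined above; note that the sample datetimes were all created with hour=0, so this yields zeros rather than the values in the separate hour column):

# The hour component lives on the datetime object itself
(elevDF
    .rdd
    .map(lambda row: row.date.hour)
    .collect())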
Starting with Spark 1.5, you can use built-in date processing functions such as year, month, and dayofmonth:
import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day")
).show()

# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# +----+-----+---+
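If you'd rather keep the original columns and attach the extracted parts alongside them, withColumn works as well; a minimal sketch assuming the same elevDF:

from pyspark.sql.functions import year, month, dayofmonth

# Add year/month/day as extra columns next to date, hour and value
(elevDF
    .withColumn("year", year("date"))
    .withColumn("month", month("date"))
    .withColumn("day", dayofmonth("date"))
    .show())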