You can use a simple map over the underlying RDD, just as with any other RDD:
import datetime
from pyspark.sql import Row

elevDF = sqlContext.createDataFrame(sc.parallelize([
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)
]))

# Extract (year, month, day) from the timestamp column
(elevDF
    .rdd
    .map(lambda row: (row.date.year, row.date.month, row.date.day))
    .collect())
and the result:
[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
Btw: datetime.datetime stores the hour anyway, so keeping it in a separate column looks like a waste of memory.
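For example, the hour is already accessible on the datetime objects themselves (a minimal sketch using the elevDF defined above; note that the sample datetimes were all created with hour=0, so this yields zeros rather than the values in the separate hour column):

# The hour component lives on the datetime object itself
(elevDF
    .rdd
    .map(lambda row: row.date.hour)
    .collect())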
Starting with Spark 1.5, you can use built-in date processing functions such as year, month, and dayofmonth:
import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day")
).show()

# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# +----+-----+---+
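If you'd rather keep the original columns and attach the extracted parts alongside them, withColumn works as well; a minimal sketch assuming the same elevDF:

from pyspark.sql.functions import year, month, dayofmonth

# Add year/month/day as extra columns next to date, hour and value
(elevDF
    .withColumn("year", year("date"))
    .withColumn("month", month("date"))
    .withColumn("day", dayofmonth("date"))
    .show())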