How do you convert timestamp data between Spark (PySpark) and Pandas, and back again? I am reading data from a Hive table in Spark, doing some calculations in Pandas, and writing the results back to Hive. Only the last step fails: converting the Pandas timestamps back into a Spark DataFrame timestamp column.
import datetime
import pandas as pd
dates = [
('today', '2017-03-03 11:30:00')
, ('tomorrow', '2017-03-04 08:00:00')
, ('next Thursday', '2017-03-09 20:00:00')
]
string_date_rdd = sc.parallelize(dates)
timestamp_date_rdd = string_date_rdd.map(lambda t: (t[0], datetime.datetime.strptime(t[1], "%Y-%m-%d %H:%M:%S")))
timestamp_df = sqlContext.createDataFrame(timestamp_date_rdd, ['Day', 'Date'])
timestamp_pandas_df = timestamp_df.toPandas()
roundtrip_df = sqlContext.createDataFrame(timestamp_pandas_df)
roundtrip_df.printSchema()
roundtrip_df.show()
root
|-- Day: string (nullable = true)
|-- Date: long (nullable = true)
+-------------+-------------------+
| Day| Date|
+-------------+-------------------+
| today|1488540600000000000|
| tomorrow|1488614400000000000|
|next Thursday|1489089600000000000|
+-------------+-------------------+
At this point, the round-tripped Spark DataFrame has a long data type for the date column: the pandas datetime64[ns] values come back as nanoseconds since the epoch. In plain Python this is easy to convert back to a datetime object, e.g. datetime.datetime.fromtimestamp(1489089600000000000 / 1000000000), although the time of day is then off by several hours because fromtimestamp applies the local timezone (datetime.datetime.utcfromtimestamp avoids that). How do I do this conversion on the Spark DataFrame column itself?
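For reference, here is a minimal sketch of the kind of column-level fix I have in mind, assuming Spark's cast from a numeric value to timestamp interprets the number as seconds since the epoch (fixed_df is my name; the other names match the example above):

from pyspark.sql.functions import col

# The long column holds nanoseconds since the epoch; dividing by
# 1,000,000,000 yields seconds, which the timestamp cast expects.
fixed_df = roundtrip_df.withColumn(
    'Date', (col('Date') / 1000000000).cast('timestamp'))
fixed_df.printSchema()  # Date should now be timestamp (nullable = true)

Is this cast the right approach, or is there a way to make createDataFrame infer TimestampType directly (for example by converting the pandas column to Python datetime objects before the round trip)?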
Python 3.4.5, Spark 1.6.0
Thanks, John