PySpark 1.5: How to truncate a timestamp to the minute (dropping the seconds)

I am using PySpark. I have a column ('dt') in a DataFrame ('canon_evt') that is a timestamp, and I am trying to remove the seconds from its DateTime value. It is initially read from Parquet as a String, which I then try to convert to Timestamp via

    canon_evt = canon_evt.withColumn('dt', to_date(canon_evt.dt))
    canon_evt = canon_evt.withColumn('dt', canon_evt.dt.astype('Timestamp'))

Then I would like to drop the seconds. I have tried trunc and date_format, and even tried to assemble the pieces myself as shown below. I think this requires some kind of map and lambda combination, but I'm not sure whether Timestamp is a suitable format and whether the seconds can be removed from it.

    canon_evt = canon_evt.withColumn('dyt', year('dt') + '-' + month('dt') + '-' +
        dayofmonth('dt') + ' ' + hour('dt') + ':' + minute('dt'))

    [Row(dt=datetime.datetime(2015, 9, 16, 0, 0), dyt=None)]
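
I suspect the + here is doing numeric addition on the columns rather than string concatenation, which would explain why dyt comes back as None. For reference, a formatting-based sketch like the one below (assuming date_format is available in 1.5) at least gives a minute-precision string, but I would still prefer a real Timestamp.

    from pyspark.sql.functions import date_format

    # Sketch: format 'dt' down to minute precision; note this yields a String
    # column, not a Timestamp.
    canon_evt = canon_evt.withColumn('dyt', date_format('dt', 'yyyy-MM-dd HH:mm'))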
2 answers

Converting to Unix timestamps and basic arithmetic should do the trick:

    from pyspark.sql import Row
    from pyspark.sql.functions import col, unix_timestamp, round

    df = sc.parallelize([
        Row(dt='1970-01-01 00:00:00'),
        Row(dt='2015-09-16 05:39:46'),
        Row(dt='2015-09-16 05:40:46'),
        Row(dt='2016-03-05 02:00:10'),
    ]).toDF()

    ## unix_timestamp converts a string to a Unix timestamp (bigint / long)
    ## in seconds. Divide by 60, round, multiply by 60 and cast, and it
    ## should work just fine.
    dt_truncated = ((round(unix_timestamp(col("dt")) / 60) * 60)
        .cast("timestamp"))

    df.withColumn("dt_truncated", dt_truncated).show(10, False)
    ## +-------------------+---------------------+
    ## |dt                 |dt_truncated         |
    ## +-------------------+---------------------+
    ## |1970-01-01 00:00:00|1970-01-01 00:00:00.0|
    ## |2015-09-16 05:39:46|2015-09-16 05:40:00.0|
    ## |2015-09-16 05:40:46|2015-09-16 05:41:00.0|
    ## |2016-03-05 02:00:10|2016-03-05 02:00:00.0|
    ## +-------------------+---------------------+
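
Note that round snaps to the nearest minute, which is why 05:39:46 becomes 05:40:00 in the output above. If the goal is strictly to drop the seconds rather than round, a minimal variant of the same idea using floor should behave that way:

    from pyspark.sql.functions import col, floor, unix_timestamp

    ## floor instead of round: 05:39:46 -> 05:39:00 rather than 05:40:00.
    dt_floored = (floor(unix_timestamp(col("dt")) / 60) * 60).cast("timestamp")
    df.withColumn("dt_floored", dt_floored).show(10, False)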

I think zero323 has a better answer. It is annoying that Spark does not support this natively, given how easy it is to implement. For posterity, here is the function I use:

    def trunc(date, format):
        """Wraps Spark's trunc function to support day, minute, and hour"""
        import re
        import pyspark.sql.functions as func

        # Hack to get the column name from a Column object or a string:
        try:
            colname = re.match(r"Column<.?'(.*)'>", str(date)).groups()[0]
        except AttributeError:
            colname = date

        alias = "trunc(%s, %s)" % (colname, format)

        if format in ('year', 'YYYY', 'yy', 'month', 'mon', 'mm'):
            return func.trunc(date, format).alias(alias)
        elif format in ('day', 'DD'):
            return func.date_sub(date, 0).alias(alias)
        elif format in ('min', ):
            return ((func.round(func.unix_timestamp(date) / 60) * 60)
                    .cast("timestamp")).alias(alias)
        elif format in ('hour', ):
            return ((func.round(func.unix_timestamp(date) / 3600) * 3600)
                    .cast("timestamp")).alias(alias)
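
For example, with a DataFrame like the df above, the helper could be called along these lines (a usage sketch, assuming the same 'dt' column):

    import pyspark.sql.functions as func

    # Usage sketch: truncate 'dt' to the minute and to the hour with the helper above.
    df.select(
        trunc(func.col("dt"), "min"),
        trunc(func.col("dt"), "hour"),
    ).show(10, False)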
