I use Spark to load data with a date column and add three new columns holding the time in days, weeks, and months between the date in that column and today.
My problem is that I use SimpleDateFormat, which is not thread-safe. Normally, without Spark, this would be fine, since it is a local variable. But with lazy evaluation, does Spark share the one SimpleDateFormat instance across several UDFs, and could that cause a problem?
    def calcTimeDifference(...){
      val sdf = new SimpleDateFormat(dateFormat)

      val dayDifference = udf{(x: String) => math.abs(Days.daysBetween(new DateTime(sdf.parse(x)), presentDate).getDays)}
      output = output.withColumn("days", dayDifference(myCol))

      val weekDifference = udf{(x: String) => math.abs(Weeks.weeksBetween(new DateTime(sdf.parse(x)), presentDate).getWeeks)}
      output = output.withColumn("weeks", weekDifference(myCol))

      val monthDifference = udf{(x: String) => math.abs(Months.monthsBetween(new DateTime(sdf.parse(x)), presentDate).getMonths)}
      output = output.withColumn("months", monthDifference(myCol))
    }
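For comparison, here is a minimal sketch of how the concern could be sidestepped by not sharing a mutable formatter at all. It is not taken from the code above: it swaps SimpleDateFormat and Joda-Time for the JDK's java.time classes, whose DateTimeFormatter is immutable and thread-safe. The object name `DateDiff`, its method names, and the `yyyy-MM-dd` pattern are all hypothetical and chosen only for illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper, not part of the original code.
object DateDiff {
  // DateTimeFormatter is immutable and thread-safe, unlike SimpleDateFormat.
  // It is not Serializable, so it lives in a singleton object rather than
  // being captured in a closure. The pattern is an assumption.
  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  // Parse a date string with the shared, thread-safe formatter.
  private def parse(date: String): LocalDate = LocalDate.parse(date, formatter)

  // Absolute number of whole days, weeks, or months between `date` and `today`.
  def daysBetween(date: String, today: LocalDate): Long =
    math.abs(ChronoUnit.DAYS.between(parse(date), today))

  def weeksBetween(date: String, today: LocalDate): Long =
    math.abs(ChronoUnit.WEEKS.between(parse(date), today))

  def monthsBetween(date: String, today: LocalDate): Long =
    math.abs(ChronoUnit.MONTHS.between(parse(date), today))
}
```

In a Spark job, each of these could presumably be wrapped with `udf`, for example `udf((x: String) => DateDiff.daysBetween(x, LocalDate.now()))`. If SimpleDateFormat has to stay, another option would be to construct it inside each UDF body so that no instance is shared, at the cost of creating one formatter per row.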