Suppose that I have a DataFrame named transactions with the following integer columns: year, month, day, timestamp, transaction_id.
In [1]: transactions = ctx.createDataFrame(
   ...:     [(2017, 12, 1, 10000, 1),
   ...:      (2017, 12, 2, 10001, 2),
   ...:      (2017, 12, 3, 10003, 3),
   ...:      (2017, 12, 4, 10004, 4),
   ...:      (2017, 12, 5, 10005, 5),
   ...:      (2017, 12, 6, 10006, 6)],
   ...:     ['year', 'month', 'day', 'timestamp', 'transaction_id'])
In [2]: transactions.show()
+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  1|    10000|             1|
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
|2017|   12|  5|    10005|             5|
|2017|   12|  6|    10006|             6|
+----+-----+---+---------+--------------+
I want to define a function filter_date_range that returns a DataFrame consisting of the transaction rows that fall within a given date range.
>>> filter_date_range(
...     df=transactions,
...     start_date=datetime.date(2017, 12, 2),
...     end_date=datetime.date(2017, 12, 4)).show()
+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
+----+-----+---+---------+--------------+
Assuming the data is stored as a Hive table partitioned by year, month, and day, what is the most efficient way to implement such a filter, including the date arithmetic? I'm looking for a purely DataFrame-based way to do this, without dropping down to transactions.rdd, so that Spark can work out that it only needs to read a subset of the partitions.
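For concreteness, here is a rough, untested sketch of the kind of implementation I have in mind; the lexicographic comparison over the raw year/month/day columns is my own guess at a pruning-friendly formulation, and the helper names at_least/at_most are made up:

import datetime

import pyspark.sql.functions as F
from pyspark.sql import Column, DataFrame

def filter_date_range(df: DataFrame,
                      start_date: datetime.date,
                      end_date: datetime.date) -> DataFrame:
    # Keep rows whose (year, month, day) lies in [start_date, end_date],
    # phrased entirely in terms of the partition columns so that Catalyst
    # can (hopefully) push the predicates down and prune partitions.

    def at_least(d: datetime.date) -> Column:
        # (year, month, day) >= (d.year, d.month, d.day), spelled out
        # lexicographically.
        return ((F.col('year') > d.year)
                | ((F.col('year') == d.year) & (F.col('month') > d.month))
                | ((F.col('year') == d.year) & (F.col('month') == d.month)
                   & (F.col('day') >= d.day)))

    def at_most(d: datetime.date) -> Column:
        # (year, month, day) <= (d.year, d.month, d.day).
        return ((F.col('year') < d.year)
                | ((F.col('year') == d.year) & (F.col('month') < d.month))
                | ((F.col('year') == d.year) & (F.col('month') == d.month)
                   & (F.col('day') <= d.day)))

    return df.filter(at_least(start_date) & at_most(end_date))

This produces the expected output above on the toy data, but I don't know whether this boolean blow-up is the idiomatic way to get partition pruning, or whether there is something cleaner (e.g. building a proper date column) that Spark can still prune on.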