I have data that I would like to run a lot of analytical queries against, and I'm trying to find a storage mechanism that Spark can connect to efficiently. I have a working solution using RedShift, but ideally I would prefer something based on files in S3 instead of keeping a whole RedShift cluster running 24/7.
Introduction to Data
This is a simplified example. We have 2 initial CSV files.
- Person Records
- Event Records
The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with a person: many event rows can refer to the same person_id.
Purpose
I would like to understand how to lay out the data so that I can efficiently execute queries like the following. I will need to run many queries of this kind (every query aggregates per person):
The query should produce a data frame with three columns and one row for each person:
- person_id - person_id for each person in the data set
- age - field "age" from the person’s record
- cost - the sum of the “cost” field for all event records for the person where the “date” is during the month of 6/2013.
In other words, I need a data layout that lets Spark run this kind of query efficiently (the month filter is just an example; the date range will vary from query to query). Below is what I have today, and the alternatives I am considering.
RedShift
My current solution uses RedShift:
I load both tables with DISTKEY person_id and SORTKEY person_id. This distributes each person's rows onto the same node and keeps them sorted, so the join below is evaluated locally without shuffling data between nodes. The query is:
select person_id, age, e.cost from person
left join (select person_id, sum(cost) as cost from events
where date between '2013-06-01' and '2013-06-30'
group by person_id) as e using (person_id)
Spark/Parquet
With Spark and Parquet files on S3, I'm not sure how to achieve the same thing. Options I've considered:
- Spark Dataset 'bucketBy'. Read the CSVs, then write them back out bucketed on person_id with "bucketBy". This seems to be the closest analogue of the RedShift DISTKEY approach, but I'm not sure the bucketing metadata survives when the data lives as plain files on S3 rather than as a catalog-backed table, since bucketBy appears to require a metastore.
- Parquet partitioning. Partitioning directly on person_id would create one directory per key, which is far too many for a high-cardinality key like person_id. Perhaps I could derive a "partition_key" column from "person_id" (e.g. a hash modulo some bucket count) and partition on that, so each person's rows land in the same file. Still, I'm not sure this would actually be faster than just scanning the CSV files.
- Sorting the files on person_id when writing. If both datasets were stored sorted, a sort-merge join might avoid the shuffle, but I don't know whether Spark can take advantage of the ordering of plain files.
- - , ( "", ) . , :
- **explode** - explode the nested array back into one row per event, filter by date, and aggregate. This is straightforward, but it recreates the large flat table, which may defeat the purpose of nesting.
- **UDF** - write a UDF that takes a person's array of events and computes the sum of cost within the date range directly, so everything stays one row per person (no shuffle).
In short, I would like to understand how to achieve in Spark what DISTKEY/SORTKEY gives me in RedShift, or whether there is some other, more Spark-idiomatic way to lay out the data. Any advice would be appreciated.