I have data that I would like to run a lot of analytical queries against, and I'm trying to find a storage mechanism that Spark can connect to efficiently. I have a working solution using RedShift, but ideally I would prefer something based on files in S3 instead of keeping a whole RedShift cluster running 24/7.
Introduction to Data
This is a simplified example. We have 2 initial CSV files.
- Person Records
- Event Records
The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with a person: many event rows can refer to the same person_id.
Purpose
I would like to understand how to lay out the data so that I can efficiently execute queries like the following. I will need to run many queries of this kind (every query aggregates per person):
The query should produce a data frame with three columns and one row for each person:
- person_id - person_id for each person in the data set
- age - field "age" from the person’s record
- cost - the sum of the “cost” field for all event records for the person where the “date” is during the month of 6/2013.
In other words, I need a data layout that lets Spark run this kind of query efficiently (the month filter is just an example; the date range will vary from query to query). Below is what I have today, and the alternatives I am considering.
RedShift
My current solution uses RedShift:
I load both tables with DISTKEY person_id and SORTKEY person_id. This distributes each person's rows onto the same node and keeps them sorted, so the join below is evaluated locally without shuffling data between nodes. The query is:
select person_id, age, e.cost from person
left join (select person_id, sum(cost) as cost from events
where date between '2013-06-01' and '2013-06-30'
group by person_id) as e using (person_id)
Spark/Parquet
With Spark and Parquet files on S3, I'm not sure how to achieve the same thing. Options I've considered:
- Spark Dataset 'bucketBy'. Read the CSVs, then write them back out bucketed on person_id with "bucketBy". This seems to be the closest analogue of the RedShift DISTKEY approach, but I'm not sure the bucketing metadata survives when the data lives as plain files on S3 rather than as a catalog-backed table, since bucketBy appears to require a metastore.
- Parquet partitioning. Partitioning directly on person_id would create one directory per key, which is far too many for a high-cardinality key like person_id. Perhaps I could derive a "partition_key" column from "person_id" (e.g. a hash modulo some bucket count) and partition on that, so each person's rows land in the same file. Still, I'm not sure this would actually be faster than just scanning the CSV files.
- Sorting the files on person_id when writing. If both datasets were stored sorted, a sort-merge join might avoid the shuffle, but I don't know whether Spark can take advantage of the ordering of plain files.
- - , ( "", ) . , :
- **explode** - explode the nested array back into one row per event, filter by date, and aggregate. This is straightforward, but it recreates the large flat table, which may defeat the purpose of nesting.
- **UDF** - write a UDF that takes a person's array of events and computes the sum of cost within the date range directly, so everything stays one row per person (no shuffle).
In short, I would like to understand how to achieve in Spark what DISTKEY/SORTKEY gives me in RedShift, or whether there is some other, more Spark-idiomatic way to lay out the data. Any advice would be appreciated.