Is there a data architecture for efficient joins in Spark (a la RedShift)?

I have data that I would like to run a lot of analytical queries against, and I'm trying to figure out whether there is a way to store it so that Spark can join it efficiently. I have a solution using RedShift, but ideally I would prefer something based on files in S3 rather than keeping a whole RedShift cluster running 24/7.

Introduction to Data

This is a simplified example. We have 2 initial CSV files.

  • Person records
  • Event records

The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with a person (one person can have many event records).
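For concreteness, a minimal sketch of loading the two CSV files into Spark dataframes; the S3 paths, and any columns other than person_id, age, date and cost, are just placeholders. The later snippets assume these person and events dataframes / temp views:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("person-events").getOrCreate()

// Placeholder locations; only person_id and age (person) plus person_id,
// date and cost (events) are real field names from the example.
val person = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("s3://my-bucket/person/")
val events = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("s3://my-bucket/events/")

person.createOrReplaceTempView("person")
events.createOrReplaceTempView("events")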

Goal

I would like to understand how to set up the data so that I can efficiently run the following query. I will need to run many queries like this (all aggregates are computed on a per-person basis):

The query should produce a data frame with one row for every person and the following columns:

  • person_id - person_id for each person in the data set
  • age - the "age" field from the person's record
  • cost - the sum of the "cost" field over all of the person's event records where "date" falls in June 2013 (6/2013).

Below is the solution I have today in RedShift, followed by the options I have considered for doing the same thing in Spark, none of which seem right.

RedShift

Here is my current RedShift solution:

The RedShift tables are created with DISTKEY person_id and SORTKEY person_id. This co-locates all of a person's event records with the person record on the same node and keeps them sorted by person_id, so the query below can run without shuffling data between nodes:

select person_id, age, e.cost from person 
    left join (select person_id, sum(cost) as cost from events 
       where date between '2013-06-01' and '2013-06-30' 
       group by person_id) as e using (person_id)

Spark/Parquet

I have considered several ways of setting this up in Spark, but none of them seems right. The options I have looked at are:

  • Spark Dataset 'bucketBy'. Read the CSV data and rewrite it as tables bucketed by person_id with "bucketBy" (see the first sketch after this list). Queries against the bucketed tables could then join without a shuffle. This looks like the closest analogue to RedShift's DISTKEY, but I have found little detail on how well bucketBy works for this.
  • Parquet partitioning. Adding a "partition_key" column equal to "person_id" and partitioning the Parquet output by it would put each person in its own partition directory (second sketch after this list). With millions of persons that means millions of tiny files, which does not look workable, and it still requires rewriting all of the CSV data.
  • Storing the event records nested inside each person record (i.e. "denormalizing" the data into one table). The per-person aggregation could then be computed in one of two ways (third sketch after this list):
    • explode - explode the nested events back into rows and aggregate them, but that essentially recreates the original events table, so nothing is gained.
    • UDF - write a UDF that filters and sums the nested events for the requested date range; this works, but every new kind of query needs its own UDF, and UDFs are slower than built-in operations (and harder to maintain).
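To make the first option concrete, a minimal sketch of the bucketBy idea, using the person and events dataframes from the loading sketch above. The bucket count and table names are placeholders, and bucketBy only takes effect with saveAsTable (it needs a metastore), not with a plain file write:

// Rewrite both inputs bucketed (and sorted) by person_id so that a later
// join on person_id can avoid a full shuffle.
events.write
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "person_id")
  .sortBy("person_id")
  .saveAsTable("events_bucketed")

person.write
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "person_id")
  .sortBy("person_id")
  .saveAsTable("person_bucketed")

// Later queries hit the bucketed tables instead of the raw CSV.
val eventsB = spark.table("events_bucketed")
val personB = spark.table("person_bucketed")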
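The second option, for comparison, would look roughly like this; the sketch mainly shows why it breaks down, since every distinct partition_key value becomes its own output directory (the output path is a placeholder):

import org.apache.spark.sql.functions.col

// One output directory per partition_key value, i.e. one per person -
// unmanageable with millions of persons.
events.withColumn("partition_key", col("person_id"))
  .write
  .mode("overwrite")
  .partitionBy("partition_key")
  .parquet("s3://my-bucket/events_by_person/")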
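And a sketch of the nested/denormalized option; here, instead of a custom UDF, the per-array filter and sum use Spark 2.4+ built-in higher-order functions, and the table name is a placeholder:

import org.apache.spark.sql.functions._

// One row per person, with that person's events nested as an array of structs.
val nested = person.join(
  events.groupBy("person_id")
    .agg(collect_list(struct(col("date"), col("cost"))).as("events")),
  Seq("person_id"), "left")
nested.write.mode("overwrite").saveAsTable("person_nested")

// Per-person cost for June 2013, computed inside the array - no explode needed.
val result = spark.table("person_nested").select(
  col("person_id"),
  col("age"),
  expr("""aggregate(
            filter(events, e -> e.date between '2013-06-01' and '2013-06-30'),
            cast(0.0 as double),
            (acc, e) -> acc + e.cost)""").as("cost"))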

I understand that Spark and RedShift are very different systems, but I would like to know the equivalent way to lay out the data for Spark. Any advice on how to store and organize it so that queries like this run efficiently would be much appreciated.

Answer

You can do this with a straightforward aggregate-then-join in Spark: aggregate the events for the month first, cache the result, and then join it to the person table:

val eventAgg = spark.sql("""select person_id, sum(cost) as cost
                            from events
                            where date between '2013-06-01' and '2013-06-30'
                            group by person_id""")
eventAgg.cache.count   // count() materializes the cached aggregation
val personDF = spark.sql("""SELECT person_id, age from person""")
personDF.cache.count   // cache is less important here, so feel free to omit
personDF.join(eventAgg, Seq("person_id"), "left")   // person on the left keeps every person, even those with no events

For scale, I just ran this on a cluster of 9 nodes (140 vCPUs, ~600GB RAM) against the following data:

  • 27,000,000,000 "event" records (covering 14,331,487 "persons")
  • 64,000,000 "person" records (~20 columns each)

Building and caching the event aggregation took ~3 minutes, and the join itself ran in ~30 seconds.

The "cache" step is optional, but since many queries like this will be run against the same data, it is worth caching the dataframes once; repeated queries against the cached data then complete in about 1 minute.


Source: https://habr.com/ru/post/1673003/

