repartition() is used to partition data in memory, and partitionBy is used to partition data on disk. They are often used together, as described in this blog post.
Both repartition() and partitionBy can be used to “split data based on a data column”, but repartition() partitions the data in memory, and partitionBy partitions the data on disk.
repartition()
Let's play around with some code to better understand partitioning. Suppose you have the following CSV data.
first_name,last_name,country
Ernesto,Guevara,Argentina
Vladimir,Putin,Russia
Maria,Sharapova,Russia
Bruce,Lee,China
Jack,Ma,China
df.repartition(col("country")) repartitions the data by country in memory.
Let's write out the data so that we can inspect the contents of each memory partition.
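import org.apache.spark.sql.functions.col

// Resolve the output directory to an absolute path
val outputPath = new java.io.File("./tmp/partitioned_by_country/").getCanonicalPath
// Repartition in memory by country, then write one file per non-empty partition
df.repartition(col("country"))
  .write
  .csv(outputPath)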
Here's how the data is written to disk:
partitioned_by_country/
  part-00002-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
  part-00044-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
  part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
Each file contains the data for a single country. For example, the file part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv contains the data for China:
Bruce,Lee,China
Jack,Ma,China
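The gaps in the part numbers above (00002, 00044, 00059) are what you would expect from hash partitioning: each country is hashed to one of many partitions, and a partition with no rows typically produces no file. The following plain Scala sketch models that idea without Spark. It is a simplification under stated assumptions: the object HashPartitionSketch and its members are hypothetical names, the partition count of 200 mirrors the default value of spark.sql.shuffle.partitions, and String.hashCode stands in for the Murmur3-based hash Spark actually uses, so the partition ids it computes will not match the file names above.

```scala
// Simplified, Spark-free model of hash partitioning by a column.
// Assumption: String.hashCode replaces Spark's Murmur3-based hash,
// so the partition ids here differ from the ones Spark would pick.
object HashPartitionSketch {
  final case class Person(firstName: String, lastName: String, country: String)

  // Assumed partition count (the default of spark.sql.shuffle.partitions is 200).
  val NumPartitions: Int = 200

  // The sample rows from the CSV data above.
  val people: Seq[Person] = Seq(
    Person("Ernesto", "Guevara", "Argentina"),
    Person("Vladimir", "Putin", "Russia"),
    Person("Maria", "Sharapova", "Russia"),
    Person("Bruce", "Lee", "China"),
    Person("Jack", "Ma", "China")
  )

  // Map a key to a partition id in the range [0, numPartitions).
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Group rows by the partition their country hashes to.
  // Only non-empty partitions appear as keys in the result.
  def repartitionByCountry(rows: Seq[Person], numPartitions: Int): Map[Int, Seq[Person]] =
    rows.groupBy(p => partitionFor(p.country, numPartitions))
}
```

Calling HashPartitionSketch.repartitionByCountry(HashPartitionSketch.people, HashPartitionSketch.NumPartitions) yields at most three non-empty partitions out of 200, and all rows for a given country always land in the same one.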
partitionBy()
Let's write the data to disk with partitionBy and see how the filesystem output differs.
Here's the code to write the data out to disk partitions.
val outputPath = new java.io.File("./tmp/partitionedBy_disk/").getCanonicalPath
df
  .write
  .partitionBy("country")
  .csv(outputPath)
Here's what the data looks like on disk:
partitionedBy_disk/
  country=Argentina/
    part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000.csv
  country=China/
    part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000
  country=Russia/
    part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000
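To make the directory layout above concrete, here is a small Spark-free sketch of how rows map to column=value directories. It is only an illustration: PartitionByLayoutSketch and its members are hypothetical names, and the sketch ignores details such as the escaping of special characters in partition values. It does reflect one real behavior worth knowing: the partition column is stored in the directory name, so it is left out of the lines written to the data files.

```scala
// Simplified, Spark-free model of the directory layout partitionBy produces.
// Assumption: partition values contain no characters that need escaping.
object PartitionByLayoutSketch {
  final case class Person(firstName: String, lastName: String, country: String)

  // The sample rows from the CSV data above.
  val people: Seq[Person] = Seq(
    Person("Ernesto", "Guevara", "Argentina"),
    Person("Vladimir", "Putin", "Russia"),
    Person("Maria", "Sharapova", "Russia"),
    Person("Bruce", "Lee", "China"),
    Person("Jack", "Ma", "China")
  )

  // Directory name for one partition value, e.g. "country=China".
  def partitionDir(column: String, value: String): String = s"$column=$value"

  // Group rows under their partition directory. The country lives in the
  // path only, so each CSV line holds just the remaining columns.
  def layout(rows: Seq[Person]): Map[String, Seq[String]] =
    rows
      .groupBy(p => partitionDir("country", p.country))
      .map { case (dir, ps) => dir -> ps.map(p => s"${p.firstName},${p.lastName}") }
}
```

With the sample rows, PartitionByLayoutSketch.layout(PartitionByLayoutSketch.people) has exactly three keys, one per country, which matches the three directories in the listing.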
Why partition data on disk?
Partitioning data on disk can make certain queries run much faster, as described in this blog post.
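One reason for the speedup is that a query filtering on the partition column only needs to read the matching directories and can skip the rest. The sketch below models that selection step in plain Scala. It is not how Spark implements it: PartitionPruningSketch and its members are hypothetical names, and the directory names are copied from the partitionBy output shown earlier.

```scala
// Simplified, Spark-free model of skipping partition directories
// that cannot match a filter on the partition column.
object PartitionPruningSketch {
  // Directory names as they appear in the partitionBy output.
  val partitionDirs: Seq[String] =
    Seq("country=Argentina", "country=China", "country=Russia")

  // Extract the value from a "column=value" directory name,
  // or None if the name does not belong to the given column.
  def partitionValue(dir: String, column: String): Option[String] =
    dir.split("=", 2) match {
      case Array(c, value) if c == column => Some(value)
      case _                              => None
    }

  // Keep only the directories whose partition value satisfies the predicate.
  def prune(dirs: Seq[String], column: String)(keep: String => Boolean): Seq[String] =
    dirs.filter(dir => partitionValue(dir, column).exists(keep))
}
```

For a filter such as country equal to China, PartitionPruningSketch.prune(PartitionPruningSketch.partitionDirs, "country")(_ == "China") keeps a single directory, so the files for Argentina and Russia would never need to be opened.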