Scala: How do I split data into multiple csv files based on line count

Question

I have a DataFrame, say df1, with 10M rows. I want to split it into several CSV files of 1M rows each. Any suggestions for doing this in Scala?

scala csv dataframe apache-spark rdd

Nitish Apr 23 '17 at 3:48

1 answer

Steffen Schmitz · Answer 1 · 2017-04-23T09:10:21+0000

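You can use the randomSplit method for DataFrames. Note that the split is random: the weights are normalized and the resulting parts are only approximately equal in size, so this does not guarantee an exact row count per file.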

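import scala.util.Random
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

val df = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9).toDF
// Five equal weights give five splits of roughly (not exactly) equal size
val splitted = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))
splitted.foreach { a => a.write.format("csv").save("path" + Random.nextInt) }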

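I used Random.nextInt to get a distinct name for each output path. It can return negative numbers and is not strictly guaranteed to be unique, so substitute your own naming logic if necessary.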
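As a sketch of the "other logic" mentioned above, the weights and output names can be derived from the target row count instead of being hard-coded. The helpers equalWeights and partPath are hypothetical names introduced here for illustration; they are plain Scala and are not part of the Spark API.

```scala
// Hypothetical helper: one equal weight per output file, enough files
// to hold totalRows at roughly rowsPerFile rows each (rounded up).
def equalWeights(totalRows: Long, rowsPerFile: Long): Array[Double] = {
  require(totalRows >= 0 && rowsPerFile > 0, "row counts must be non-negative / positive")
  val numFiles = math.max(1L, (totalRows + rowsPerFile - 1) / rowsPerFile).toInt
  Array.fill(numFiles)(1.0)
}

// Hypothetical helper: deterministic, zero-padded output path per split index.
def partPath(base: String, index: Int): String = f"$base/part-$index%05d"
```

In a Spark shell this could be combined with the answer's approach as df1.randomSplit(equalWeights(df1.count, 1000000L)).zipWithIndex.foreach { case (part, i) => part.write.format("csv").save(partPath("path", i)) }. The splits produced by randomSplit remain approximate, so each file would hold about 1M rows rather than exactly 1M.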

Source:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

How to save a Spark DataFrame as csv on disk?

https://forums.databricks.com/questions/8723/how-can-i-split-a-spark-dataframe-into-n-equal-dat.html

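For splits with a fixed number of rows, you can take limit rows at a time and remove them from the remaining input with except: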

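import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

var input = List(1, 2, 3, 4, 5, 6, 7, 8, 9).toDF
val limit = 2 // rows per split

var newFrames = List[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]()
var size = input.count

// Repeatedly peel off `limit` rows and remove them from the remaining input.
// Caveat: except has set semantics, so duplicate rows would be dropped.
while (size > 0) {
    newFrames = input.limit(limit) :: newFrames
    input = input.except(newFrames.head)
    size = size - limit
}

newFrames.foreach(_.show)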
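The loop above re-plans the remaining input on every iteration, which may become slow for many splits. One possible alternative, sketched here under assumptions, is to number the rows once with the RDD method zipWithIndex and derive a chunk id from the index. The helper withChunkId, the column name chunk_id, and the output path chunks_out are hypothetical names chosen for this example; the local SparkSession is only there to make the sketch self-contained.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: append a chunk_id column equal to rowIndex / rowsPerChunk.
def withChunkId(spark: SparkSession, df: DataFrame, rowsPerChunk: Long): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ (idx / rowsPerChunk))
  }
  val schema = StructType(df.schema.fields :+ StructField("chunk_id", LongType, nullable = false))
  spark.createDataFrame(indexed, schema)
}

// Assumption: a local session for illustration; in spark-shell, reuse the existing spark.
val spark = SparkSession.builder.master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

val input = List(1, 2, 3, 4, 5, 6, 7, 8, 9).toDF
val chunked = withChunkId(spark, input, rowsPerChunk = 2)

// Repartitioning by chunk_id puts each chunk in a single partition, so
// partitionBy writes one CSV file per chunk_id sub-directory.
chunked.repartition($"chunk_id")
  .write.mode("overwrite")
  .partitionBy("chunk_id")
  .format("csv")
  .save("chunks_out")
```

With this layout the chunk_id value ends up in the directory name (for example chunk_id=0) rather than in the CSV contents. Every chunk except possibly the last holds exactly rowsPerChunk rows, at the cost of a round trip through the RDD API.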
