Replicating a Spark Row N times

I want to duplicate a row in a DataFrame; how can I do this?

For example, I have a DataFrame consisting of 1 row, and I want to create a DataFrame with 100 identical rows. I came up with the following solution:

var data: DataFrame = singleRowDF

for (i <- 1 to 100 - 1) {
  data = data.unionAll(singleRowDF)
}

But this builds a long chain of transformations (one unionAll per copy), and my subsequent actions become very slow. Is there another way to do this?

2 answers

You can add a column containing a literal array of size 100, then use explode to create one row per array element, and finally drop the dummy column:

import org.apache.spark.sql.functions._

// Build a 100-element literal array, explode it so each element yields a
// copy of the row, then select only the original columns to drop "dummy".
// Note: (1 to 100) gives 100 elements; (1 until 100) would give only 99.
val result = singleRowDF
  .withColumn("dummy", explode(array((1 to 100).map(lit): _*)))
  .selectExpr(singleRowDF.columns: _*)
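
On Spark 2.4 or later, a shorter variant of the same idea is array_repeat, which builds the repeated array directly instead of mapping lit over a range (a sketch, assuming the same singleRowDF):

import org.apache.spark.sql.functions.{array_repeat, explode, lit}

// array_repeat(lit(1), 100) yields a 100-element array; exploding it
// produces one copy of the row per element, and drop removes the helper.
val replicated = singleRowDF
  .withColumn("dummy", explode(array_repeat(lit(1), 100)))
  .drop("dummy")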

Alternatively, take the single row, build a list of n copies of it, and convert that list back into a DataFrame with the original schema:

import org.apache.spark.sql.DataFrame

val testDf = sc.parallelize(Seq(
    (1, 2, 3), (4, 5, 6)
)).toDF("one", "two", "three")

// Take the first row of df, parallelize a list of n copies of it,
// and rebuild a DataFrame with the original schema.
def replicateDf(n: Int, df: DataFrame) = sqlContext.createDataFrame(
    sc.parallelize(List.fill(n)(df.take(1)(0))),
    df.schema)

val replicatedDf = replicateDf(100, testDf)
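
A quick sanity check (assuming a Spark shell where sc and sqlContext are in scope, as in the snippet above):

// testDf has two rows; replicateDf copies only its first row, so all
// 100 rows of replicatedDf should be identical.
replicatedDf.count()          // 100
replicatedDf.distinct.count() // 1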

Source: https://habr.com/ru/post/1686599/

