Java & Spark: Add Unique Incremental Identifier to Dataset

With Spark and Java, I am trying to add an Integer id column to an existing Dataset&lt;Row&gt; that already has n columns.

I have successfully added an id with zipWithUniqueId(), with zipWithIndex, and even with monotonically_increasing_id(). But none of them gives what I need.

For example, I have a dataset with 195 rows. When I use any of these three methods, I get identifiers such as 1584156487 or 12036. On top of that, the identifiers are not consecutive.
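
For context, this is roughly what the monotonically_increasing_id() attempt looks like (a sketch, assuming the Dataset&lt;Row&gt; is called dataset):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.monotonically_increasing_id;

// The generated ids are unique, but they depend on the partition layout,
// so they come out like 1584156487 instead of 1, 2, 3, ...
Dataset<Row> withId = dataset.withColumn("id", monotonically_increasing_id());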

What I need is pretty simple: an Integer id column whose values run from 1 to dataset.count(), with id = 1 followed by id = 2, and so on.

How can I do this in Java / Spark?

3 answers

You can try using the row_number function:

In Java:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.expressions.Window;

// row_number() numbers the rows consecutively starting from 1 in the window's order
Dataset<Row> withId = df.withColumn("id",
    functions.row_number().over(Window.orderBy("a column")));

Or in Scala:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val withId = df.withColumn("id", row_number().over(Window.orderBy("a column")))
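
If there is no natural column to order by, one possible variation (a sketch, not part of the answer above) is to first tag the rows with monotonically_increasing_id() to freeze their current order, and then let row_number() renumber them consecutively; the helper column name mono_id is just a placeholder:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import static org.apache.spark.sql.functions.row_number;

// Without a partitionBy clause the window moves all rows to a single partition,
// which is fine for small datasets such as the 195-row example
Dataset<Row> withId = df
    .withColumn("mono_id", monotonically_increasing_id())
    .withColumn("id", row_number().over(Window.orderBy(col("mono_id"))))
    .drop("mono_id");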

In Scala, you can do this as shown below.

import org.apache.spark.sql.Row

// Collect the rows to the driver and pair each row with a 0-based index
val a = dataframe.collect().zipWithIndex
for (b <- a) {
  println(b._2)
}

Here b._2 gives you a unique number running from 0 to count - 1.
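
A rough Java equivalent of the same collect-and-index idea (a sketch, assuming the dataset is small enough to collect to the driver and is called df):

import java.util.List;
import org.apache.spark.sql.Row;

// Collect the rows to the driver and print a 0-based position for each row
List<Row> rows = df.collectAsList();
for (int i = 0; i < rows.size(); i++) {
    System.out.println(i + " -> " + rows.get(i));
}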

You can also create a unique, incrementing identifier as shown below.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Zip each row with its index, append the index to the row's values,
// and rebuild the DataFrame with an extra non-nullable "id" column
val df1 = spark.sqlContext.createDataFrame(
  df.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  StructType(df.schema.fields :+ StructField("id", LongType, false)))
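
For the Java side of the question, a rough sketch of the same zipWithIndex approach (assuming df is the existing Dataset&lt;Row&gt; and spark is the SparkSession; the names are placeholders):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Pair every row with its 0-based index, then append the index as a new field
JavaRDD<Row> indexed = df.toJavaRDD()
    .zipWithIndex()
    .map(pair -> {
        Row row = pair._1();
        long index = pair._2();
        Object[] values = new Object[row.size() + 1];
        for (int i = 0; i < row.size(); i++) {
            values[i] = row.get(i);
        }
        values[row.size()] = index;
        return RowFactory.create(values);
    });

// Extend the original schema with a non-nullable LongType "id" column
StructType schemaWithId = df.schema().add("id", DataTypes.LongType, false);
Dataset<Row> df1 = spark.createDataFrame(indexed, schemaWithId);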

Hope this helps!
