Java & Spark: Add Unique Incremental Identifier to Dataset

With Spark and Java, I am trying to add an Integer id column to an existing Dataset&lt;Row&gt; that already has n columns.

I have successfully added an id with zipWithUniqueId(), with zipWithIndex, and even with monotonically_increasing_id(). But none of them gives what I need.

For example, I have a dataset with 195 rows. When I use any of these three methods, I get identifiers such as 1584156487 or 12036. On top of that, the identifiers are not consecutive.
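
For context, this is roughly what the monotonically_increasing_id() attempt looks like (a sketch, assuming the Dataset&lt;Row&gt; is called dataset):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.monotonically_increasing_id;

// The generated ids are unique, but they depend on the partition layout,
// so they come out like 1584156487 instead of 1, 2, 3, ...
Dataset<Row> withId = dataset.withColumn("id", monotonically_increasing_id());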

What I need is pretty simple: an Integer id column whose values run from 1 to dataset.count(), with id = 1 followed by id = 2, and so on.

How can I do this in Java / Spark?

3 answers

You can try using the row_number function:

In Java:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.expressions.Window;

// row_number() numbers the rows consecutively starting from 1 in the window's order
Dataset<Row> withId = df.withColumn("id",
    functions.row_number().over(Window.orderBy("a column")));

Or in Scala:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val withId = df.withColumn("id", row_number().over(Window.orderBy("a column")))
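
If there is no natural column to order by, one possible variation (a sketch, not part of the answer above) is to first tag the rows with monotonically_increasing_id() to freeze their current order, and then let row_number() renumber them consecutively; the helper column name mono_id is just a placeholder:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import static org.apache.spark.sql.functions.row_number;

// Without a partitionBy clause the window moves all rows to a single partition,
// which is fine for small datasets such as the 195-row example
Dataset<Row> withId = df
    .withColumn("mono_id", monotonically_increasing_id())
    .withColumn("id", row_number().over(Window.orderBy(col("mono_id"))))
    .drop("mono_id");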

In Scala, you can do this as shown below.

import org.apache.spark.sql.Row

// Collect the rows to the driver and pair each row with a 0-based index
val a = dataframe.collect().zipWithIndex
for (b <- a) {
  println(b._2)
}

Here b._2 gives you a unique number running from 0 to count - 1.
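
A rough Java equivalent of the same collect-and-index idea (a sketch, assuming the dataset is small enough to collect to the driver and is called df):

import java.util.List;
import org.apache.spark.sql.Row;

// Collect the rows to the driver and print a 0-based position for each row
List<Row> rows = df.collectAsList();
for (int i = 0; i < rows.size(); i++) {
    System.out.println(i + " -> " + rows.get(i));
}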

You can also create a unique, incrementing identifier as shown below.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Zip each row with its index, append the index to the row's values,
// and rebuild the DataFrame with an extra non-nullable "id" column
val df1 = spark.sqlContext.createDataFrame(
  df.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  StructType(df.schema.fields :+ StructField("id", LongType, false)))
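
For the Java side of the question, a rough sketch of the same zipWithIndex approach (assuming df is the existing Dataset&lt;Row&gt; and spark is the SparkSession; the names are placeholders):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Pair every row with its 0-based index, then append the index as a new field
JavaRDD<Row> indexed = df.toJavaRDD()
    .zipWithIndex()
    .map(pair -> {
        Row row = pair._1();
        long index = pair._2();
        Object[] values = new Object[row.size() + 1];
        for (int i = 0; i < row.size(); i++) {
            values[i] = row.get(i);
        }
        values[row.size()] = index;
        return RowFactory.create(values);
    });

// Extend the original schema with a non-nullable LongType "id" column
StructType schemaWithId = df.schema().add("id", DataTypes.LongType, false);
Dataset<Row> df1 = spark.createDataFrame(indexed, schemaWithId);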

Hope this helps!
