I have a DataFrame that looks like this:
scala> data.show
+-----+---+---------+
|label| id| features|
+-----+---+---------+
|  1.0|  1|[1.0,2.0]|
|  0.0|  2|[5.0,6.0]|
|  1.0|  1|[3.0,4.0]|
|  0.0|  2|[7.0,8.0]|
+-----+---+---------+

I want to regroup the features based on "id" to get the following:

scala> data.show
+---------+---+-----------------+
|    label| id|         features|
+---------+---+-----------------+
|  1.0,1.0|  1|[1.0,2.0,3.0,4.0]|
|  0.0,0.0|  2|[5.0,6.0,7.0,8.0]|
+---------+---+-----------------+
This is the code I use to create the DataFrame above:

// Assumption: MLlib vectors; on Spark 2.x+ this may be org.apache.spark.ml.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(List(
  (1.0, 1, Vectors.dense(1.0, 2.0)),
  (0.0, 2, Vectors.dense(5.0, 6.0)),
  (1.0, 1, Vectors.dense(3.0, 4.0)),
  (0.0, 2, Vectors.dense(7.0, 8.0))))
val data = rdd.toDF("label", "id", "features")
I have tried different things with both RDDs and DataFrames. The most promising approach so far has been filtering by "id":
data.filter($"id".equalTo(1))
+-----+---+---------+
|label| id| features|
+-----+---+---------+
|  1.0|  1|[1.0,2.0]|
|  1.0|  1|[3.0,4.0]|
+-----+---+---------+
But I am stuck on two points:
1) How can I automate the filtering for all the distinct values that "id" can have?
The following generates an error:
data.select("id").distinct.foreach(x => data.filter($"id".equalTo(x)))
2) How can I combine the "features" that share the same "id"? I have not tried much here since I am still stuck on 1).
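Regarding 1), the failing line has two likely problems: inside `foreach` the variable `x` is a `Row` rather than an `Int`, and the closure runs on the executors, where a DataFrame such as `data` cannot be used. A minimal sketch of one way around this, assuming a spark-shell session with `data` defined as above and few enough distinct ids to collect to the driver (the names `ids` and `byId` are hypothetical):

```scala
// Assumes spark-shell: `data` exists and the SQL implicits ($"...") are in scope.

// Bring the distinct ids back to the driver; each Row holds one Int column.
// Assumption: the number of distinct ids is small enough to collect.
val ids: Array[Int] = data.select("id").distinct.collect().map(_.getInt(0))

// Build one filtered DataFrame per id, on the driver rather than in an executor closure.
val byId: Map[Int, org.apache.spark.sql.DataFrame] =
  ids.map(i => i -> data.filter($"id" === i)).toMap

// Show each per-id subset.
byId.foreach { case (i, df) =>
  println(s"id = $i")
  df.show()
}
```

This launches one filter per id, so it is only reasonable when "id" has few distinct values.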
Any suggestion is more than welcome.
Note: To clarify, the "label" is always the same for every occurrence of an "id". Sorry for the confusion; a simple extension of my task would also group the "labels" (example updated).
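For reference, the desired output can also be sketched without any per-id filtering by grouping on "id" at the RDD level. This is only a sketch under assumptions: it presumes a spark-shell session with `data` defined as above and MLlib vectors (on Spark 2.x+ the import may need to be `org.apache.spark.ml.linalg` instead), and `mergeGroup` and `merged` are hypothetical names.

```scala
// Assumes spark-shell: `data` exists and the implicits needed by toDF are in scope.
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Pure helper: join the labels with commas and concatenate the feature arrays.
def mergeGroup(rows: Iterable[(Double, Array[Double])]): (String, Array[Double]) = {
  val (labels, arrays) = rows.unzip
  (labels.mkString(","), arrays.flatMap(_.toSeq).toArray)
}

val merged = data.rdd
  // Key each row by id (column 1); keep the label (column 0) and features (column 2).
  .map(r => (r.getInt(1), (r.getDouble(0), r.getAs[Vector](2).toArray)))
  .groupByKey()
  .map { case (id, rows) =>
    val (labels, values) = mergeGroup(rows)
    (labels, id, Vectors.dense(values))
  }
  .toDF("label", "id", "features")

merged.show()
```

Note that `groupByKey` does not guarantee the order of rows within a group, so the order in which the vectors are concatenated may vary between runs unless an explicit ordering column is added.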