Get a range of Spark RDD columns

I now have 300+ columns in my RDD, and I need to dynamically select a range of those columns and put them into the LabeledPoint data type. As a newbie to Spark, I am wondering whether there is any index-based way to select a range of columns in an RDD, something like temp_data = data[, 101:211] in R. Is there something like val temp_data = data.filter(_.column_index in range(101:211)) ... ?

Any thought is welcome and appreciated.

3 answers

If it's a DataFrame, then something like this should work:

val df = rdd.toDF
// select takes a first column name plus varargs, so split off the head
val cols = df.columns.slice(101, 211)
df.select(cols.head, cols.tail : _*)
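If the end goal is MLlib LabeledPoint rows (as the question mentions), a minimal sketch along these lines should work on top of that DataFrame, assuming the columns are Doubles and that, purely for illustration, column 100 holds the label and columns 101..210 hold the features:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// illustrative assumption: label in column 100, features in columns 101..210
val labeled = df.rdd.map { row =>
  val label = row.getDouble(100)
  val features = (101 until 211).map(row.getDouble).toArray
  LabeledPoint(label, Vectors.dense(features))
}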

Assuming you have an RDD of Array (or any other Scala collection, e.g. List), you can do something like this:

import org.apache.spark.rdd.RDD

val data: RDD[Array[Int]] = sc.parallelize(Array(Array(1,2,3), Array(4,5,6)))
// keep only the first two elements of each row (indices 0 and 1)
val sliced: RDD[Array[Int]] = data.map(_.slice(0,2))

sliced.collect()
> Array[Array[Int]] = Array(Array(1, 2), Array(4, 5))
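Applied to the range from the question (note that slice's upper bound is exclusive, so 211 keeps columns 101 through 210), and assuming each row really is an Array, that would be:

val temp_data = data.map(_.slice(101, 211))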

A variant of the same idea, on a table with 200 columns: select everything except the last column.

Spark 1.4.1, Scala 2.10.4

val df = hiveContext.sql("SELECT * FROM foobar")
// all column names except the last; select needs the head plus a varargs tail
val cols = df.columns.slice(0, df.columns.length - 1)
val new_df = df.select(cols.head, cols.tail:_*)
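A hedged alternative, if you prefer passing Column objects instead of splitting head and tail: map the column names through Spark's functions.col (new_df2 is just an illustrative name):

import org.apache.spark.sql.functions.col
val new_df2 = df.select(cols.map(col): _*)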

Source: https://habr.com/ru/post/1599426/

