How to extract a specific item from a column for each row?

I have the following DataFrame in Spark 2.2.0 and Scala 2.11.8.

+----------+-------------------------------+
|item      |        other_items            |
+----------+-------------------------------+
|  111     |[[444,1.0],[333,0.5],[666,0.4]]|
|  222     |[[444,1.0],[333,0.5]]          |
|  333     |[]                             |
|  444     |[[111,2.0],[555,0.5],[777,0.2]]|

I want to get the following DataFrame:

+----------+-------------+
|item      | other_items |
+----------+-------------+
|  111     | 444         |
|  222     | 444         |
|  444     | 111         |

So basically, I need to extract the first item of other_items for each row. In addition, I need to ignore rows that have an empty array [] in other_items.

How can I do this?

I tried this approach, but it does not give the expected result.

val result = df.withColumn("other_items", $"other_items"(0))

printSchema outputs the following result:

 |-- item: string (nullable = true)
 |-- other_items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: double (nullable = true)
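
For context on why the attempt above falls short: indexing the array with $"other_items"(0) returns the whole struct (e.g. [444,1.0]), not the item id, and rows with an empty array become null instead of being dropped. A minimal sketch of the missing steps, assuming the schema printed above (this is an illustration, not the accepted answer below):

import org.apache.spark.sql.functions.size

val result = df
  .filter(size($"other_items") > 0)                    // skip rows with []
  .withColumn("other_items", $"other_items"(0)("_1"))  // first struct's item id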
1 answer

Like this:

// assumes: import spark.implicits._ (for toDF and the $ column syntax)
val df = Seq(
  ("111", Seq(("111", 1.0), ("333", 0.5), ("666", 0.4))), ("333", Seq())
).toDF("item", "other_items")


df.select($"item", $"other_items"(0)("_1").alias("other_items"))
  .na.drop(Seq("other_items")).show

Where the first apply ($"other_items"(0)) selects the first element of the array, the second apply (_("_1")) selects the field _1, and na.drop removes the nulls produced by the empty arrays.

+----+-----------+
|item|other_items|
+----+-----------+
| 111|        111|
+----+-----------+
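
For readers who prefer named methods over the apply shorthand, an equivalent formulation uses the Column methods getItem and getField, both available in Spark 2.2:

df.select($"item", $"other_items".getItem(0).getField("_1").alias("other_items"))
  .na.drop(Seq("other_items"))
  .show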

Source: https://habr.com/ru/post/1689700/