Starting with Spark 2.4, you can use the slice function. In Python:
pyspark.sql.functions.slice(x, start, length)
Collection function: returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
...
New in version 2.4.
from pyspark.sql.functions import slice

df = spark.createDataFrame([
    (10, "Finance", ["Jon", "Snow", "Castle", "Black", "Ned"]),
    (20, "IT", ["Ned", "is", "no", "more"])
], ("dept_id", "dept_nm", "emp_details"))

df.select(slice("emp_details", 1, 3).alias("empt_details")).show()
+-------------------+
|       empt_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+
In Scala:
def slice(x: Column, start: Int, length: Int): Column
Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
import org.apache.spark.sql.functions.slice
import spark.implicits._  // needed for toDF and the $ column syntax outside spark-shell

val df = Seq(
  (10, "Finance", Seq("Jon", "Snow", "Castle", "Black", "Ned")),
  (20, "IT", Seq("Ned", "is", "no", "more"))
).toDF("dept_id", "dept_nm", "emp_details")

df.select(slice($"emp_details", 1, 3) as "empt_details").show
+-------------------+
|       empt_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+
The same can, of course, be done in SQL:
SELECT slice(emp_details, 1, 3) AS emp_details FROM df
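To run that statement from code, here is a minimal PySpark sketch, assuming the df defined above and registering it under the view name df so it matches the table referenced in the query:

df.createOrReplaceTempView("df")

# slice works the same way in Spark SQL as in the DataFrame API
spark.sql("SELECT slice(emp_details, 1, 3) AS emp_details FROM df").show()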
Important:
Note that, unlike Seq.slice, values are indexed from one, and the second argument is the length, not the end position.
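Because start is 1-based and may be negative, you can also slice from the end without knowing the array length. A minimal sketch reusing the df from above (the alias last_two is just illustrative):

from pyspark.sql.functions import slice

# start = -2 counts from the end: keep the last two elements of each array
df.select(slice("emp_details", -2, 2).alias("last_two")).show()

For the first row this yields [Black, Ned], and [no, more] for the second.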