This does not work because:

- the second argument to withColumn should be a Column, not a collection; np.array will not work here
- when you pass "index in indexes" as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier
PySpark >= 1.4.0
You can add row numbers using the appropriate window function and then query using the Column.isin method or a properly formatted query string:
```python
from pyspark.sql.functions import col, rowNumber  # in newer Spark versions: row_number
from pyspark.sql.window import Window

# An empty window: no partitioning, no ordering
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))
```
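To tie this back to the question, a minimal filtering sketch follows; `indexes` here is a hypothetical Python list of row numbers supplied by the caller, and `indexed` is the DataFrame built above:

```python
# Hypothetical list of row numbers to keep
indexes = [2, 5, 7]

# Filter with Column.isin
indexed.where(col("index").isin(indexes)).show()

# Or with an equivalent SQL expression string
indexed.where("index IN ({0})".format(", ".join(str(i) for i in indexes))).show()
```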
It seems that window functions called without a PARTITION BY clause move all the data into a single partition, so the above may not be the best solution after all.
Any faster and easier way to handle this?
Not really. Spark DataFrames do not support random row access.
A PairRDD can be accessed using the lookup method, which is relatively fast if the data is partitioned using a HashPartitioner. There is also the indexed-rdd project, which supports efficient lookups.
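For illustration, a minimal sketch of lookup on a partitioned key-value RDD; the data and partition count are made up, and sc is assumed to be the existing SparkContext, as in the snippets below:

```python
# Arbitrary key-value data for illustration
pairs = sc.parallelize([(i, chr(97 + i)) for i in range(15)])

# partitionBy uses hash partitioning by default, so lookup only has to
# scan the single partition that holds the requested key
partitioned = pairs.partitionBy(4).cache()

partitioned.lookup(3)  # ['d']
```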
Edit
Regardless of the version of PySpark, you can try something like this:
```python
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType

row = Row("char")
row_with_index = Row("char", "index")

df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)
```
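The snippet above defines row_with_index but stops short of using it. A plausible continuation, sketched here under the assumption that the goal is to attach a stable index without window functions, is to go through RDD.zipWithIndex:

```python
# A sketch, not verbatim from the original answer: zipWithIndex assigns a
# sequential index per row on the RDD side, avoiding a single-partition window.
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

schema = StructType([
    StructField("char", StringType(), False),
    StructField("index", LongType(), False),
])

df_indexed = (df.rdd
    .zipWithIndex()                                            # -> (Row(char=...), i)
    .map(lambda ri: row_with_index(*(list(ri[0]) + [ri[1]])))  # -> Row(char=..., index=i)
    .toDF(schema))

df_indexed.where(col("index").isin([1, 3, 5])).show()
```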
zero323 Sep 24 '15 at 12:18