Getting a specific field from a selected row in a PySpark DataFrame

I have a Spark DataFrame created via pyspark from a JSON file as

 sc = SparkContext()
 sqlc = SQLContext(sc)
 users_df = sqlc.read.json('users.json')

Now I want to access the data for a selected user, where chosen_user is the value of its _id field. I can do

 users_df[users_df._id == chosen_user].show()

and that gives me the complete user row. But suppose I just want one specific field of that row, say the user's gender. How would I get it?

1 answer

Just filter and select:

 result = users_df.where(users_df._id == chosen_user).select("gender") 

or with col:

 from pyspark.sql.functions import col
 result = users_df.where(col("_id") == chosen_user).select(col("gender"))

Finally, a PySpark Row is just a tuple with some extensions, so you can, for example, flatMap:

 result.rdd.flatMap(list).first() 

or map with something like this:

 result.rdd.map(lambda x: x.gender).first() 

Source: https://habr.com/ru/post/1244200/

