Getting a specific field from a selected row in a PySpark DataFrame

I have a Spark DataFrame created via pyspark from a JSON file as

 sc = SparkContext()
 sqlc = SQLContext(sc)
 users_df = sqlc.read.json('users.json')

Now I want to access the data for a selected user, where chosen_user is the value of its _id field. I can do

 users_df[users_df._id == chosen_user].show()

and that gives me the complete user row. But suppose I just want one specific field of that row, say the user's gender. How would I get it?

1 answer

Just filter and select:

 result = users_df.where(users_df._id == chosen_user).select("gender") 

or with col:

 from pyspark.sql.functions import col
 result = users_df.where(col("_id") == chosen_user).select(col("gender"))

Finally, a PySpark Row is just a tuple with some extensions, so you can, for example, flatMap:

 result.rdd.flatMap(list).first() 

or map with something like this:

 result.rdd.map(lambda x: x.gender).first() 

Source: https://habr.com/ru/post/1244200/

