How to select and order multiple columns in a PySpark DataFrame after a join

I want to select multiple columns from an existing DataFrame (created after joins) and order those fields to match the structure of my target table. How can I do that? The approach I used is given below. It selects the necessary columns, but it does not put them in the required order.

Required (target table structure):

    hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                    "eff_start_date", "eff_end_date", "eff_flag")

    account_sk_df = hist_process_df.join(broadcast(df_sk_lkp), 'acct_nbr', 'inner')
    account_sk_df_ld = account_sk_df.select([c for c in account_sk_df.columns if c in hist_columns])

    >>> account_sk_df
    DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, hash_sk_id: string, account_sk_id: int]
    >>> account_sk_df_ld
    DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, account_sk_id: int]

The column account_sk_id should be in the second position, as in hist_columns. What is the best way to do this?

1 answer

Try selecting the columns by passing the names directly, rather than filtering the existing columns with a comprehension. select() returns the columns in the order its arguments are given, whereas the comprehension over account_sk_df.columns preserves the join output's column order instead:

    account_sk_df_ld = account_sk_df.select(*hist_columns)
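
To make this concrete, here is a minimal, self-contained sketch of the same pattern. The SparkSession setup and the sample rows are illustrative assumptions (they are not part of the original question); the point is only that select(*hist_columns) yields the columns in exactly the order listed:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("column-order-demo").getOrCreate()

    # Target-table column order, as in the question.
    hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                    "eff_start_date", "eff_end_date", "eff_flag")

    # Hypothetical sample data standing in for the real inputs.
    hist_process_df = spark.createDataFrame(
        [("A100", "NY", "10001", "2020-01-01", "9999-12-31", "Y")],
        ["acct_nbr", "primary_state", "zip_code",
         "eff_start_date", "eff_end_date", "eff_flag"])
    df_sk_lkp = spark.createDataFrame(
        [("A100", "abc123", 1)],
        ["acct_nbr", "hash_sk_id", "account_sk_id"])

    account_sk_df = hist_process_df.join(broadcast(df_sk_lkp), "acct_nbr", "inner")

    # Unpacking the tuple passes the names in target-table order, so the
    # result schema follows hist_columns rather than the join output order.
    account_sk_df_ld = account_sk_df.select(*hist_columns)
    print(account_sk_df_ld.columns)
    # ['acct_nbr', 'account_sk_id', 'zip_code', 'primary_state',
    #  'eff_start_date', 'eff_end_date', 'eff_flag']

Note that select() also accepts a single list of names, so account_sk_df.select(list(hist_columns)) is equivalent; unpacking with * simply works for both lists and tuples.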

Source: https://habr.com/ru/post/1259412/

