How to select and order multiple columns in a PySpark DataFrame after a join

I want to select multiple columns from an existing DataFrame (created after joins) and order those fields to match the structure of my target table. How can I do that? The approach I used is given below. It selects the necessary columns, but it does not put them in the required order.

Required (target table structure):

    hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                    "eff_start_date", "eff_end_date", "eff_flag")

    account_sk_df = hist_process_df.join(broadcast(df_sk_lkp), 'acct_nbr', 'inner')
    account_sk_df_ld = account_sk_df.select([c for c in account_sk_df.columns if c in hist_columns])

    >>> account_sk_df
    DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, hash_sk_id: string, account_sk_id: int]
    >>> account_sk_df_ld
    DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, account_sk_id: int]

The column account_sk_id should be in the second position, as in hist_columns. What is the best way to do this?

1 answer

Try selecting the columns by passing the names directly, rather than filtering the existing columns with a comprehension. select() returns the columns in the order its arguments are given, whereas the comprehension over account_sk_df.columns preserves the join output's column order instead:

    account_sk_df_ld = account_sk_df.select(*hist_columns)
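
To make this concrete, here is a minimal, self-contained sketch of the same pattern. The SparkSession setup and the sample rows are illustrative assumptions (they are not part of the original question); the point is only that select(*hist_columns) yields the columns in exactly the order listed:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("column-order-demo").getOrCreate()

    # Target-table column order, as in the question.
    hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                    "eff_start_date", "eff_end_date", "eff_flag")

    # Hypothetical sample data standing in for the real inputs.
    hist_process_df = spark.createDataFrame(
        [("A100", "NY", "10001", "2020-01-01", "9999-12-31", "Y")],
        ["acct_nbr", "primary_state", "zip_code",
         "eff_start_date", "eff_end_date", "eff_flag"])
    df_sk_lkp = spark.createDataFrame(
        [("A100", "abc123", 1)],
        ["acct_nbr", "hash_sk_id", "account_sk_id"])

    account_sk_df = hist_process_df.join(broadcast(df_sk_lkp), "acct_nbr", "inner")

    # Unpacking the tuple passes the names in target-table order, so the
    # result schema follows hist_columns rather than the join output order.
    account_sk_df_ld = account_sk_df.select(*hist_columns)
    print(account_sk_df_ld.columns)
    # ['acct_nbr', 'account_sk_id', 'zip_code', 'primary_state',
    #  'eff_start_date', 'eff_end_date', 'eff_flag']

Note that select() also accepts a single list of names, so account_sk_df.select(list(hist_columns)) is equivalent; unpacking with * simply works for both lists and tuples.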

Source: https://habr.com/ru/post/1259412/

