Rename multiple columns dynamically in PySpark DataFrame

I have a dataframe in pyspark that has 15 columns.

Column names: id, name, emp.dno, emp.sal, state, emp.city, zip .....

Now I want to replace the '.' in those column names with '_'

e.g. 'emp.dno' becomes 'emp_dno'

I would like to do it dynamically, without listing each column by hand.

How can I achieve this in pyspark?

1 answer

You can use something similar to this great solution from @zero323:

 df.toDF(*(c.replace('.', '_') for c in df.columns)) 
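The renaming expression itself can be sanity-checked without a Spark session; the column list below mimics the one from the question:

```python
# Sample column names, mirroring df.columns from the question
columns = ['id', 'name', 'emp.dno', 'emp.sal', 'state', 'emp.city', 'zip']

# The same expression that is passed to df.toDF(...)
new_columns = [c.replace('.', '_') for c in columns]
print(new_columns)
# ['id', 'name', 'emp_dno', 'emp_sal', 'state', 'emp_city', 'zip']
```

Columns without a '.' pass through unchanged, so the rename is safe to apply to the whole DataFrame.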

as an alternative:

 from pyspark.sql.functions import col

 replacements = {c: c.replace('.', '_') for c in df.columns if '.' in c}

 df.select([col(c).alias(replacements.get(c, c)) for c in df.columns])

The replacement dictionary will then look like this:

 {'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'} 

UPDATE:

If the column names also contain spaces, the same approach works with a regex that replaces both '.' and whitespace with '_':

 import re

 df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))
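As above, the regex substitution can be verified without Spark. The column names below are hypothetical, chosen to exercise dots, spaces, and a mixed run of both (which `[\.\s]+` collapses into a single underscore):

```python
import re

# Hypothetical column names containing dots, spaces, and both combined
columns = ['emp.dno', 'emp sal', 'emp. city', 'zip']

cleaned = [re.sub(r'[\.\s]+', '_', c) for c in columns]
print(cleaned)
# ['emp_dno', 'emp_sal', 'emp_city', 'zip']
```

Note the `+` quantifier: a dot followed by a space ('emp. city') produces one '_' rather than two.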

Source: https://habr.com/ru/post/1262810/
