Rename multiple columns dynamically in PySpark DataFrame

I have a dataframe in pyspark that has 15 columns.

Column names: id, name, emp.dno, emp.sal, state, emp.city, zip .....

Now I want to replace the '.' in those column names with '_'

e.g. 'emp.dno' becomes 'emp_dno'

I would like to do it dynamically, without listing each column by hand.

How can I achieve this in pyspark?

1 answer

You can use something similar to this great solution from @zero323:

 df.toDF(*(c.replace('.', '_') for c in df.columns)) 
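The renaming expression itself can be sanity-checked without a Spark session; the column list below mimics the one from the question:

```python
# Sample column names, mirroring df.columns from the question
columns = ['id', 'name', 'emp.dno', 'emp.sal', 'state', 'emp.city', 'zip']

# The same expression that is passed to df.toDF(...)
new_columns = [c.replace('.', '_') for c in columns]
print(new_columns)
# ['id', 'name', 'emp_dno', 'emp_sal', 'state', 'emp_city', 'zip']
```

Columns without a '.' pass through unchanged, so the rename is safe to apply to the whole DataFrame.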

as an alternative:

 from pyspark.sql.functions import col

 replacements = {c: c.replace('.', '_') for c in df.columns if '.' in c}

 df.select([col(c).alias(replacements.get(c, c)) for c in df.columns])

The replacement dictionary will then look like this:

 {'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'} 

UPDATE:

If the column names also contain spaces, the same approach works with a regex that replaces both '.' and whitespace with '_':

 import re

 df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))
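As above, the regex substitution can be verified without Spark. The column names below are hypothetical, chosen to exercise dots, spaces, and a mixed run of both (which `[\.\s]+` collapses into a single underscore):

```python
import re

# Hypothetical column names containing dots, spaces, and both combined
columns = ['emp.dno', 'emp sal', 'emp. city', 'zip']

cleaned = [re.sub(r'[\.\s]+', '_', c) for c in columns]
print(cleaned)
# ['emp_dno', 'emp_sal', 'emp_city', 'zip']
```

Note the `+` quantifier: a dot followed by a space ('emp. city') produces one '_' rather than two.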

Source: https://habr.com/ru/post/1262810/
