How to assign and use column headers in Spark?

I am reading a dataset as shown below.

f = sc.textFile("s3://test/abc.csv") 

My file contains 50+ fields, and I want to assign column headings for each of the fields for reference later in my script.

How do I do this in PySpark? Is a DataFrame the way to go here?

PS - I'm new to Spark.

2 answers

Here's how to add column names using a DataFrame:

Suppose your CSV uses ',' as the delimiter. Before converting the data to a DataFrame, prepare it as follows:

f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))
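
Note that a bare split(',') will mis-parse any field that contains a quoted comma. Here is a minimal sketch using Python's csv module for more robust line parsing (the parse_line helper is introduced here just for illustration):

import csv

def parse_line(line):
    # csv.reader handles quoted fields containing commas,
    # which a plain line.split(',') would break apart
    return next(csv.reader([line]))

data_rdd = f.map(parse_line)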

Suppose the data has 3 columns:

data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]

Now you can specify the column names when passing this RDD to a DataFrame using toDF():

df_withcol = data_rdd.toDF(['height','color','width'])
df_withcol.printSchema()

root
 |-- height: string (nullable = true)
 |-- color: string (nullable = true)
 |-- width: string (nullable = true)
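
Once the names are assigned, later steps in the script can refer to the columns by name, for example (a short usage sketch with the names above):

df_withcol.select('color').show()
df_withcol.groupBy('color').count().show()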

If you do not specify column names, you will get a DataFrame with the default column names "_1", "_2", ...:

df_default = data_rdd.toDF()
df_default.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)
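
If you do end up with the default names, they can still be renamed afterwards, for example with withColumnRenamed (a sketch assuming the three columns above):

df_renamed = (df_default
    .withColumnRenamed('_1', 'height')
    .withColumnRenamed('_2', 'color')
    .withColumnRenamed('_3', 'width'))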

The right solution really depends on which version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV directly as a DataFrame and assign the column names with toDF, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.

filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
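
With 50+ fields, listing every name by hand in toDF gets tedious. If the first line of the file already contains the headers, Spark 2.0+ can pick them up for you; header and inferSchema are standard options of the CSV reader (schema inference costs an extra pass over the data):

df = spark.read.csv(filename, header=True, inferSchema=True)
df.printSchema()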
