How to assign and use column headers in Spark?

I am reading a dataset as shown below.

f = sc.textFile("s3://test/abc.csv") 

My file contains 50+ fields, and I want to assign column headings for each of the fields for reference later in my script.

How do I do this in PySpark? Is a DataFrame the way to go here?

PS - I'm new to Spark.

2 answers

Here's how to add column names using a DataFrame:

Suppose your CSV uses ',' as the delimiter. Before converting the data to a DataFrame, prepare it as follows:

f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))
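
Note that a bare split(',') will mis-parse any field that contains a quoted comma. Here is a minimal sketch using Python's csv module for more robust line parsing (the parse_line helper is introduced here just for illustration):

import csv

def parse_line(line):
    # csv.reader handles quoted fields containing commas,
    # which a plain line.split(',') would break apart
    return next(csv.reader([line]))

data_rdd = f.map(parse_line)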

Suppose the data has 3 columns:

data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]

Now you can specify the column names when passing this RDD to a DataFrame using toDF():

df_withcol = data_rdd.toDF(['height','color','width'])
df_withcol.printSchema()

root
 |-- height: string (nullable = true)
 |-- color: string (nullable = true)
 |-- width: string (nullable = true)
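
Once the names are assigned, later steps in the script can refer to the columns by name, for example (a short usage sketch with the names above):

df_withcol.select('color').show()
df_withcol.groupBy('color').count().show()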

If you do not specify column names, you will get a DataFrame with the default column names "_1", "_2", ...:

df_default = data_rdd.toDF()
df_default.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)
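
If you do end up with the default names, they can still be renamed afterwards, for example with withColumnRenamed (a sketch assuming the three columns above):

df_renamed = (df_default
    .withColumnRenamed('_1', 'height')
    .withColumnRenamed('_2', 'color')
    .withColumnRenamed('_3', 'width'))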

The right solution really depends on which version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV directly as a DataFrame and assign the column names with toDF, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.

filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
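
With 50+ fields, listing every name by hand in toDF gets tedious. If the first line of the file already contains the headers, Spark 2.0+ can pick them up for you; header and inferSchema are standard options of the CSV reader (schema inference costs an extra pass over the data):

df = spark.read.csv(filename, header=True, inferSchema=True)
df.printSchema()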
