Here's how to add column names using a DataFrame:
Suppose your CSV file uses ',' as the delimiter. Before converting the data to a DataFrame, split each line into a list of fields:
    f = sc.textFile("s3://test/abc.csv")
    data_rdd = f.map(lambda line: [x for x in line.split(',')])
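Note that a plain split(',') will also split inside quoted fields that contain commas. If your file may have such fields, here is a minimal sketch using Python's standard csv module instead. The helper name parse_csv_line is hypothetical (it is not part of Spark), and this still assumes that no field contains an embedded newline, since textFile reads the file line by line:

```python
import csv

def parse_csv_line(line, delimiter=','):
    """Split one CSV line into fields, honouring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so we hand it a one-item list.
    # The [] default covers the case where the reader yields nothing.
    return next(csv.reader([line], delimiter=delimiter), [])

# Local check without Spark:
print(parse_csv_line('1.2,"red, dark",55.6'))

# With the RDD above it would replace the lambda (not run here):
# data_rdd = f.map(parse_csv_line)
```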
Suppose the data has 3 columns:
    data_rdd.take(1)
    [[u'1.2', u'red', u'55.6']]
Now you can specify the column names when converting this RDD to a DataFrame with toDF():
    df_withcol = data_rdd.toDF(['height','color','width'])
    df_withcol.printSchema()
    root
     |-- height: string (nullable = true)
     |-- color: string (nullable = true)
     |-- width: string (nullable = true)
If you do not specify column names, you will get a DataFrame with the default column names "_1", "_2", ...:
    df_default = data_rdd.toDF()
    df_default.printSchema()
    root
     |-- _1: string (nullable = true)
     |-- _2: string (nullable = true)
     |-- _3: string (nullable = true)
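Both schemas report every column as string, because split() only ever returns text. If you want numeric columns, one option is to convert the fields before calling toDF(). A minimal sketch, assuming the three-column layout shown above; to_typed_row is a hypothetical helper name, and it will raise ValueError on a row whose height or width is not a valid number:

```python
def to_typed_row(fields):
    """Convert [height, color, width] strings to a (float, str, float) tuple."""
    height, color, width = fields
    return (float(height), color, float(width))

# Local check without Spark:
print(to_typed_row(['1.2', 'red', '55.6']))  # (1.2, 'red', 55.6)

# With the RDD above it would be used like this (not run here):
# df_typed = data_rdd.map(to_typed_row).toDF(['height', 'color', 'width'])
```

With Python floats in the rows, Spark should infer height and width as double rather than string.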