Spark: load data and add the file name as a data column

I load some data into Spark using a wrapper function:

    import os
    from pyspark.sql.functions import lit

    def load_data(filename):
        df = sqlContext.read.format("com.databricks.spark.csv") \
            .option("delimiter", "\t") \
            .option("header", "false") \
            .option("mode", "DROPMALFORMED") \
            .load(filename)
        # add the filename base as hostname: strip the directory,
        # then strip both extensions (.gz, then .txt)
        (hostname, _) = os.path.splitext(os.path.basename(filename))
        (hostname, _) = os.path.splitext(hostname)
        df = df.withColumn('hostname', lit(hostname))
        return df

In particular, I use a glob to load multiple files at once:

    df = load_data('/scratch/*.txt.gz')

The matching files are:

    /scratch/host1.txt.gz
    /scratch/host2.txt.gz
    ...

I would like the hostname column to contain the real name of each file being loaded (that is, host1, host2, etc., not *).
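
To see why the column ends up as *, here is a minimal illustration (not part of the original post) of what the wrapper computes when it is handed the glob pattern instead of a single file path:

    import os

    path = '/scratch/*.txt.gz'
    (base, _) = os.path.splitext(os.path.basename(path))  # base == '*.txt'
    (base, _) = os.path.splitext(base)                    # base == '*'
    print(base)  # prints '*' -- the literal glob pattern, not a hostname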

How can I do this?

2 answers

You can use input_file_name, which:

Creates a string column for the file name of the current Spark task.

    from pyspark.sql.functions import input_file_name

    df.withColumn("filename", input_file_name())
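
Note that input_file_name() gives the full path/URI of the file backing each row (e.g. file:/scratch/host1.txt.gz), so the bare hostname still has to be extracted from it. A minimal sketch of one way to do that with regexp_extract; the regex and column names here are illustrative, not part of the original answer:

    from pyspark.sql.functions import input_file_name, regexp_extract

    df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("delimiter", "\t")
          .option("header", "false")
          .option("mode", "DROPMALFORMED")
          .load("/scratch/*.txt.gz")
          .withColumn("filename", input_file_name())
          # keep the path segment just before ".txt.gz", e.g. "host1"
          .withColumn("hostname",
                      regexp_extract("filename", r"([^/]+)\.txt\.gz$", 1)))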

I get the error below. Any suggestions on how to avoid it?

    Py4JJavaError: An error occurred while calling o679.load.
    : org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3
    in stage 1.0 (TID 7, 169.92.25.14, executor 0):
    java.lang.ArrayIndexOutOfBoundsException: 63

TIA!

