Spark: load data and add the file name as a data column

I load some data into Spark using a wrapper function:

    import os
    from pyspark.sql.functions import lit

    def load_data(filename):
        df = sqlContext.read.format("com.databricks.spark.csv") \
            .option("delimiter", "\t") \
            .option("header", "false") \
            .option("mode", "DROPMALFORMED") \
            .load(filename)
        # add the filename base as hostname: strip the directory,
        # then strip both extensions (.gz, then .txt)
        (hostname, _) = os.path.splitext(os.path.basename(filename))
        (hostname, _) = os.path.splitext(hostname)
        df = df.withColumn('hostname', lit(hostname))
        return df

In particular, I use a glob to load multiple files at once:

    df = load_data('/scratch/*.txt.gz')

The matching files are:

    /scratch/host1.txt.gz
    /scratch/host2.txt.gz
    ...

I would like the hostname column to contain the real name of each file being loaded (that is, host1, host2, etc., not *).
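
To see why the column ends up as *, here is a minimal illustration (not part of the original post) of what the wrapper computes when it is handed the glob pattern instead of a single file path:

    import os

    path = '/scratch/*.txt.gz'
    (base, _) = os.path.splitext(os.path.basename(path))  # base == '*.txt'
    (base, _) = os.path.splitext(base)                    # base == '*'
    print(base)  # prints '*' -- the literal glob pattern, not a hostname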

How can I do this?

2 answers

You can use input_file_name, which:

Creates a string column for the file name of the current Spark task.

    from pyspark.sql.functions import input_file_name

    df.withColumn("filename", input_file_name())
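
Note that input_file_name() gives the full path/URI of the file backing each row (e.g. file:/scratch/host1.txt.gz), so the bare hostname still has to be extracted from it. A minimal sketch of one way to do that with regexp_extract; the regex and column names here are illustrative, not part of the original answer:

    from pyspark.sql.functions import input_file_name, regexp_extract

    df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("delimiter", "\t")
          .option("header", "false")
          .option("mode", "DROPMALFORMED")
          .load("/scratch/*.txt.gz")
          .withColumn("filename", input_file_name())
          # keep the path segment just before ".txt.gz", e.g. "host1"
          .withColumn("hostname",
                      regexp_extract("filename", r"([^/]+)\.txt\.gz$", 1)))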

I get the error below. Any suggestions on how to avoid it?

    Py4JJavaError: An error occurred while calling o679.load.
    : org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3
    in stage 1.0 (TID 7, 169.92.25.14, executor 0):
    java.lang.ArrayIndexOutOfBoundsException: 63

TIA!

