I load some data into Spark using a wrapper function:
def load_data(filename):
    # Read tab-delimited, header-less CSV files, dropping malformed rows
    df = sqlContext.read.format("com.databricks.spark.csv")\
        .option("delimiter", "\t")\
        .option("header", "false")\
        .option("mode", "DROPMALFORMED")\
        .load(filename)
    return df
In particular, I use a glob to load multiple files at once:
df = load_data( '/scratch/*.txt.gz' )
The files are:
/scratch/host1.txt.gz
/scratch/host2.txt.gz
...
I would like the hostname column to contain the real name of the file each row was loaded from (that is, host1, host2, etc.), not the glob pattern (*).
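To make the goal concrete, here is a plain-Python sketch of the value I expect for each file. The helper name hostname_from_path is only illustrative, and it assumes the host name is everything before the first dot in the file's base name; it does not involve Spark at all.

```python
import os


def hostname_from_path(path):
    """Return the part of the file's base name before the first dot."""
    base = os.path.basename(path)          # strip the directory part
    hostname, _, _ = base.partition('.')   # keep text before the first '.'
    return hostname


# Example paths matching the glob '/scratch/*.txt.gz'
paths = ['/scratch/host1.txt.gz', '/scratch/host2.txt.gz']
hostnames = [hostname_from_path(p) for p in paths]
print(hostnames)
```

What I am missing is a way to get the per-row source path inside Spark so that something like this can be applied to it.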
How can I do this?