How to read multiple Excel files and combine them into one Apache Spark DataFrame?

I recently wanted to work through the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.

The lab dataset can be downloaded from the UCI Machine Learning Repository. It contains readings from various sensors in a gas-fired power plant. The data comes as a single xlsx file with five sheets.

To use the data in the lab, I had to read all the sheets from the Excel file and combine them into one Spark DataFrame. In the training they use a Databricks notebook, but I used IntelliJ IDEA with Scala and evaluated the code in the Scala console.

The first step was to save each Excel sheet as a separate xlsx file named sheet1.xlsx, sheet2.xlsx, etc. and put them all in a directory called sheets.

How to read all Excel files and combine them into one Apache Spark DataFrame?

2 answers

For this I used the spark-excel package. It can be added to the build.sbt file as:

libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
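For completeness, here is roughly what the surrounding build.sbt could look like; the project name, Scala version and Spark version below are my assumptions, not something stated in the original post:

// Hypothetical build.sbt sketch; versions are assumptions, adjust to your environment
name := "excel-to-dataframe"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.0.2",
  "com.crealytics"   %% "spark-excel" % "0.8.2"
)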

Code to run in IntelliJ IDEA Scala Console:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read one xlsx file into a DataFrame using spark-excel.
// The "trailing dot" formatting lets the whole block be pasted into the IntelliJ Scala Console at once.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "false").
  load()

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)

Console output:

+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows 
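A side note that is not part of the original lab: DataFrame.union matches columns by position, so the reduce above only works because every sheet carries the same header (AT, V, AP, RH, PE). A quick sanity check could be added before the union, for example:

// Optional check (my addition, not in the original code): all sheets should expose the
// same schema, since union combines DataFrames by column position, not by column name.
require(dfs.map(_.schema).distinct.length == 1, "Excel sheets have differing schemas")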

We need the spark-excel dependency library for this; it can be obtained from

https://github.com/crealytics/spark-excel#scala-api

  • clone the git project from the GitHub link above and build it using "sbt package"
  • use Spark 2 and start spark-shell with the built jar, e.g.:

spark-shell --jars ./spark-excel_2.11-0.8.3.jar --master ...

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

// sc is the SparkContext already provided by spark-shell
val sqlContext = new SQLContext(sc)

  1. Read the Excel document into a DataFrame:

val document = "path to excel doc"

val dataDF = sqlContext.read
                          .format("com.crealytics.spark.excel")
                          .option("sheetName", "Sheet Name")
                          .option("useHeader", "true")
                          .option("treatEmptyValuesAsNulls", "false")
                          .option("inferSchema", "false")
                          .option("location", document)
                          .option("addColorColumns", "false")
                          .load(document)

That's it! The resulting DataFrame is in dataDF.
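Coming back to the original question of combining several sheets into one DataFrame: the same reader can be invoked once per sheet and the results unioned. A minimal sketch, assuming the sheets are named Sheet1 through Sheet5 and share the same columns (the sheet names are my assumption, not from the original workbook):

// Sketch only: sheet names are assumptions; replace them with the workbook's actual names.
def readSheet(sheet: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", sheet)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .load(document)

// Read all five sheets and stack them into a single DataFrame.
val allSheets = (1 to 5).map(i => readSheet(s"Sheet$i"))
val combined  = allSheets.reduce(_.union(_))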


Source: https://habr.com/ru/post/1677970/

