How to build a DataFrame from an Excel file (xls, xlsx) in Scala Spark?

I have a large Excel file (xlsx and xls) with several sheets, and I need to convert it to an RDD or DataFrame so that it can later be joined to another DataFrame. I thought about using Apache POI to save it as CSV and then reading the CSV into a DataFrame. But if there are any libraries or APIs that can help with this process, it would be easy. Any help is appreciated.

+8
4 answers

The solution to your problem is to use the Spark Excel dependency in your project.

Spark Excel has flexible options to play with.

I have tested the following code to read from Excel and convert it to a DataFrame, and it just works fine:

import org.apache.spark.sql.DataFrame

// sqlContext is assumed to be in scope (e.g. the spark-shell default)
def readExcel(file: String): DataFrame = sqlContext.read
    .format("com.crealytics.spark.excel")
    .option("location", file)                  // path to the workbook
    .option("useHeader", "true")               // first row holds column names
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "true")             // infer column types
    .option("addColorColumns", "false")
    .load()

val data = readExcel("path to your excel file")

data.show(false)
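
Since the goal is to connect the result to another DataFrame, the loaded data can then be joined as usual. A minimal sketch, assuming a hypothetical otherDf that shares an "id" column:

// otherDf and the "id" join key are assumptions for illustration
val otherDf = sqlContext.read.format("csv").option("header", "true").load("other.csv")
val joined = data.join(otherDf, Seq("id"))
joined.show(false)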

You can specify sheetName as an option if your Excel file contains several sheets:

.option("sheetName", "Sheet2")

I hope it's useful.

+19

Here are read and write examples for Excel with a full set of options.

Source: spark-excel from crealytics (https://github.com/crealytics/spark-excel).

Scala API, Spark 2.0+:

Create DataFrame from Excel File

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.crealytics.spark.excel")
    .option("sheetName", "Daily") // Required
    .option("useHeader", "true") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "true") // Optional, default: false
    .option("startColumn", 0) // Optional, default: 0
    .option("endColumn", 99) // Optional, default: Int.MaxValue
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
    .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
    .load("Worktime.xlsx")

Writing DataFrame to Excel File

df.write
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Daily")
  .option("useHeader", "true")
  .option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
  .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
  .mode("overwrite")
  .save("Worktime2.xlsx")

The library can be consumed in either of two ways:

  • With the Spark shell: the package can be added to Spark using the --packages command-line option. For example, to include it when starting the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.9.8
  • Or add it to your build (Maven coordinates below; see the sbt sketch after this list):
groupId: com.crealytics
artifactId: spark-excel_2.11
version: 0.9.8
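
For sbt users the same coordinates translate to the following (a sketch; the %% operator appends the Scala version suffix, here _2.11, automatically):

libraryDependencies += "com.crealytics" %% "spark-excel" % "0.9.8"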

Tip: this approach is particularly useful for writing Maven test cases, where you can place Excel sheets with sample data under src/main/resources and access them in your unit tests (scala/java), creating DataFrame[s] out of the Excel sheets...
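
A minimal sketch of that testing approach, assuming a hypothetical sample.xlsx under src/test/resources and a ScalaTest dependency:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.FunSuite

class ExcelReadSuite extends FunSuite {
    test("creates a DataFrame from a sample Excel sheet") {
        val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("excel-test"))
        val sqlContext = new SQLContext(sc)
        // sample.xlsx is an assumed test fixture on the classpath
        val path = getClass.getResource("/sample.xlsx").getPath
        val df = sqlContext.read
            .format("com.crealytics.spark.excel")
            .option("sheetName", "Sheet1")
            .option("useHeader", "true")
            .load(path)
        assert(df.count() > 0)
        sc.stop()
    }
}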

Another option is the Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1; however, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following format of the HadoopOffice library:

Excel datasource format: org.zuinnote.spark.office.excel. It loads and saves old Excel (.xls) and new Excel (.xlsx) files. It is available on Spark-packages.org and on Maven Central.
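
A minimal read sketch with that datasource; the read.locale.bcp47 option (the locale used when parsing cell values) is taken from the HadoopOffice documentation, so treat the exact option name as an assumption:

val df = sqlContext.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us") // assumed option name: locale for cell parsing
    .load("Worktime.xlsx")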

+6

Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which also supports encrypted Excel documents and linked workbooks, among other features. Of course, Spark is also supported.

+2

I have used the com.crealytics.spark.excel-0.11 version jar with Spark Java; it would be the same in Scala too, just change javaSparkContext to SparkContext.

Dataset<Row> tempTable = new SQLContext(javaSparkContext).read()
    .format("com.crealytics.spark.excel")
    .option("sheetName", "sheet1")
    .option("useHeader", "false") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "false") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .schema(schema) // schema is assumed to be defined elsewhere
    .load("hdfs://localhost:8020/user/tester/my.xlsx");
0
