You can use org.zuinnote.spark.office.excel to read and write an Excel file with a Dataset. Examples are provided at https://github.com/ZuInnoTe/spark-hadoopoffice-ds/ . However, there is one problem when you read an Excel file into a Dataset and try to write it back out to another Excel file. The problem and a workaround (in Scala) are described at https://github.com/ZuInnoTe/hadoopoffice/issues/12 .
I wrote a sample program in Java using org.zuinnote.spark.office.excel and the workaround given at that link. See if this helps you.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkExcel {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkExcel")
                .master("local[*]")
                .getOrCreate();

        // Read the Excel file; each Dataset row holds an array of cell structs.
        Dataset<Row> df = spark
                .read()
                .format("org.zuinnote.spark.office.excel")
                .option("read.locale.bcp47", "de")
                .load("c:\\temp\\test1.xlsx");

        df.show();
        df.printSchema();

        // Workaround from the issue above: flatten each row's array of cell
        // structs (five string fields per cell in this data source) into a
        // plain String[] so the data source can write it back out.
        FlatMapFunction<Row, String[]> flatMapFunc = new FlatMapFunction<Row, String[]>() {
            @Override
            public Iterator<String[]> call(Row row) throws Exception {
                ArrayList<String[]> rowList = new ArrayList<String[]>();
                List<Row> spreadSheetRows = row.getList(0);
                for (Row srow : spreadSheetRows) {
                    ArrayList<String> arr = new ArrayList<String>();
                    arr.add(srow.getString(0));
                    arr.add(srow.getString(1));
                    arr.add(srow.getString(2));
                    arr.add(srow.getString(3));
                    arr.add(srow.getString(4));
                    rowList.add(arr.toArray(new String[] {}));
                }
                return rowList.iterator();
            }
        };

        Dataset<String[]> df2 = df.flatMap(flatMapFunc, spark.implicits().newStringArrayEncoder());

        // Write the flattened rows to a new Excel file.
        df2.write()
                .mode(SaveMode.Overwrite)
                .format("org.zuinnote.spark.office.excel")
                .option("write.locale.bcp47", "de")
                .save("c:\\temp\\test2.xlsx");
    }
}
I tested this code with Java 8 and Spark 2.1.0. I use Maven and added a dependency for org.zuinnote.spark.office.excel from https://mvnrepository.com/artifact/com.github.zuinnote/spark-hadoopoffice-ds_2.11/1.0.3
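If you also use Maven, the dependency can be declared in your pom.xml roughly like this (coordinates taken from the Maven Central link above; adjust the version if you need a newer release):

```xml
<dependency>
    <groupId>com.github.zuinnote</groupId>
    <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
    <version>1.0.3</version>
</dependency>
```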