Spark job in Java: how to access files from "resources" when running on a cluster

I wrote a Spark job in Java. The job is packaged as a shaded jar and submitted with:

spark-submit my-jar.jar 

The code uses some files (Freemarker templates) located in src/main/resources/templates . When running locally, I can access them with:

 File[] files = new File("src/main/resources/templates/").listFiles(); 

When the job runs on the cluster, a NullPointerException is thrown while executing the line above.

If I run jar tf my-jar.jar , I can see that the files are packaged under the templates/ folder:

 [...]
 templates/
 templates/my_template.ftl
 [...]

I just can't read them; I suspect that .listFiles() tries to access the local file system on the cluster node, where those files do not exist.
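For illustration, here is a minimal standalone sketch (not part of the actual job) of why the exception appears: File.listFiles() returns null when the path does not exist or is not a directory, which is the case on a cluster node where src/main/resources/ is not on the local file system.

 import java.io.File;

 public class ListFilesNullDemo {
     public static void main(String[] args) {
         // On a cluster node this relative path does not exist locally,
         // so listFiles() returns null instead of an array.
         File dir = new File("src/main/resources/templates/");
         File[] files = dir.listFiles();
         if (files == null) {
             // Any dereference of the result (files.length, a for-each loop, ...)
             // throws a NullPointerException.
             System.err.println("Not a readable directory: " + dir.getAbsolutePath());
         } else {
             System.out.println(files.length + " template file(s) found locally");
         }
     }
 }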

I would like to know how I should package files that are meant to be used by a self-contained Spark job. I would prefer not to copy them to HDFS outside of the job, because that becomes painful to maintain.

+5
2 answers

Your existing code references them as files, and files are not packaged up and shipped to the Spark nodes. But since they are inside your jar, you should be able to reference them via Foo.getClass().getResourceAsStream("/templates/my_template.ftl") . Learn more about Java resource streams here: http://www.javaworld.com/article/2077352/java-se/smartly-load-your-properties.html
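A minimal sketch of reading one of the packaged templates through the classpath (the class name and the stream-reading code are illustrative; only the resource path comes from the question):

 import java.io.BufferedReader;
 import java.io.InputStream;
 import java.io.InputStreamReader;
 import java.nio.charset.StandardCharsets;
 import java.util.stream.Collectors;

 public class ResourceStreamDemo {
     public static void main(String[] args) throws Exception {
         // The leading "/" makes the path absolute relative to the classpath root,
         // so it resolves to templates/my_template.ftl inside the shaded jar.
         try (InputStream in = ResourceStreamDemo.class
                 .getResourceAsStream("/templates/my_template.ftl")) {
             if (in == null) {
                 throw new IllegalStateException("Template not found on the classpath");
             }
             String template = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))
                     .lines()
                     .collect(Collectors.joining("\n"));
             System.out.println(template);
         }
     }
 }

If the templates are rendered with Freemarker, its Configuration class can also be pointed at the classpath directly (for example via setClassForTemplateLoading), which avoids handling the stream yourself; check the Freemarker documentation for the exact setup for your version.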

+6

It seems that Scala (2.11) code running on Spark cannot access resources packaged in shaded jars this way.

Running this code:

 var path = getClass.getResource(fileName)
 println("#### Resource: " + path.getPath())

prints the expected path when run outside of Spark.

When run inside Spark, a java.lang.NullPointerException is raised because path is null.

+4

Source: https://habr.com/ru/post/1247314/

