Run Spark as a Java web application

I used Spark ML and was able to get reasonable accuracy in forecasting for my business problem.

The data is not huge; I was able to preprocess the input (mainly a CSV file) using Stanford NLP and run Naive Bayes to predict on my local machine.

I want to run this prediction service as a simple main Java program or together with a simple MVC web application.

Currently I am running my prediction using the spark-submit command. Instead, can I create a Spark context and DataFrames from my servlet / controller class?
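For illustration, this is roughly what I have in mind. A minimal sketch with Spark 2.x's SparkSession (the class name and file path are placeholders; on Spark 1.x it would be JavaSparkContext / SQLContext instead):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical entry point -- the same code could live in a servlet
// or MVC controller instead of main().
public class PredictionService {
    public static void main(String[] args) {
        // Embedded Spark: local mode, no cluster or spark-submit needed.
        SparkSession spark = SparkSession.builder()
                .appName("PredictionService")
                .master("local[*]")
                .getOrCreate();

        // Placeholder path: the CSV input mentioned above.
        Dataset<Row> input = spark.read()
                .option("header", "true")
                .csv("/path/to/input.csv");

        input.show();
        spark.stop();
    }
}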

I could not find any documentation for such scenarios.

Please let me know whether this is possible.

+4
1 answer

Spark provides a REST API for submitting jobs; you invoke it against the Spark master's REST URL (port 6066 by default on a standalone cluster).

Submit an application:

curl -X POST http://spark-cluster-ip:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "myAppArgument1" ],
  "appResource" : "file:/myfilepath/spark-job-1.0.jar",
  "clientSparkVersion" : "1.5.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.mycompany.MyJob",
  "sparkProperties" : {
    "spark.jars" : "file:/myfilepath/spark-job-1.0.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled": "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://spark-cluster-ip:6066"
  }
}'

Submission response:

{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20151008145126-0000",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151008145126-0000",
  "success" : true
}

Get the status of a submitted application:

curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20151008145126-0000

Status response:

{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FINISHED",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151008145126-0000",
  "success" : true,
  "workerHostPort" : "192.168.3.153:46894",
  "workerId" : "worker-20151007093409-192.168.3.153-46894"
}
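If you want to call this API from your servlet / controller instead of curl, here is a minimal sketch using Java 11's built-in HttpClient. The host and port are the same placeholders as in the curl examples, and the JSON body is the one shown above:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: submit a job to the Spark standalone master's REST endpoint
// and poll its status. MASTER is a placeholder for your cluster's URL.
public class SparkRestClient {
    private static final String MASTER = "http://spark-cluster-ip:6066";
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Returns the raw JSON response (CreateSubmissionResponse).
    static String submit(String jsonBody) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(MASTER + "/v1/submissions/create"))
                .header("Content-Type", "application/json;charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Returns the raw JSON response (SubmissionStatusResponse).
    static String status(String submissionId) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(MASTER + "/v1/submissions/status/" + submissionId))
                .GET()
                .build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}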

Now, the Spark application you submit should do all the operations and save its output to a data source; your web application can then access that data via the Thrift server. Since you don't have much data to transfer, this should be enough (you can look at Sqoop if you ever need to move data between your MVC application's database and the Hadoop cluster).
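As a sketch of the read side: assuming the job saved its output to a table called predictions and the Thrift server runs on its default port 10000 (both are assumptions, not from the question), the web application could read the results over Hive JDBC like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: read the job's saved output from the Spark Thrift Server.
// Requires the hive-jdbc driver on the classpath; host, table and
// column names are placeholders.
public class ThriftReader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://spark-cluster-ip:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, prediction FROM predictions")) {
            while (rs.next()) {
                System.out.println(rs.getString("id") + " -> "
                        + rs.getDouble("prediction"));
            }
        }
    }
}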

Credits: link1, link2


+5

Source: https://habr.com/ru/post/1657669/

