Org.apache.spark.sql.SQLContext cannot load file

I have a simple Spark job that reads values from a pipe-delimited file, applies some business logic to them, and writes the processed values to our database.

To load the file, I use org.apache.spark.sql.SQLContext. Here is the code I use to load the file as a DataFrame:

 DataFrame df = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("header", "false")
            .option("comment", null)
            .option("delimiter", "|")
            .option("quote", null)
            .load(pathToTheFile);

Now, the problem:

1. The load function fails to load the file.
2. It gives very few details (no exception) about the failure; all I get in my console is:

WARN  2017-11-07 17:26:40,108 akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@172.17.0.2:35359] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
ERROR 2017-11-07 17:26:40,134 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 0

and it keeps retrying like this.

I am sure the file is present in the expected folder and has the correct format. But I don't understand what this log means, or why SQLContext is unable to load the file.

Here is my build.gradle dependency section:

dependencies {

    provided(
            [group: 'org.apache.spark', name: 'spark-core_2.10', version: '1.4.0'],
            [group: 'org.apache.spark', name: 'spark-sql_2.10', version: '1.4.0'],
            [group: 'com.datastax.spark', name: 'spark-cassandra-connector-java_2.10', version: '1.4.0']
    )

    compile([
            [group: 'com.databricks', name: 'spark-csv_2.10', version: '1.4.0'],
    ])
}

And I run this job inside a Docker container.

Any help would be appreciated


Judging by the log, the root cause here is networking, not the file itself:

The "Disassociated" message comes from akka remoting, which is known to misbehave when Spark runs inside Docker, because the container sits behind NAT and its hostname does not resolve via DNS from the outside. The usual workaround is to start the container with --net=host.

If --net=host is not an option, set the SPARK_LOCAL_IP environment variable to an IP address that AKKA can bind to and advertise, so that the rest of the Spark cluster can reach the driver.
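The two networking workarounds can be sketched as `docker run` invocations. This is a minimal sketch only: the image name (`my-spark-app`), job class, jar path, and IP address are placeholders for your own setup, not part of the original question.

```shell
# Option 1: share the host's network stack so akka advertises an address
# that executors can actually reach (no NAT between driver and cluster).
docker run --net=host my-spark-app \
    spark-submit --class com.example.MyJob /app/my-job.jar

# Option 2: keep the default bridge network, but tell Spark which IP
# to bind to and advertise. Replace 192.168.1.10 with an address that
# is reachable from the executors.
docker run -e SPARK_LOCAL_IP=192.168.1.10 my-spark-app \
    spark-submit --class com.example.MyJob /app/my-job.jar
```

With Option 2, the container's mapped ports must also be reachable from the cluster, which is why `--net=host` is usually the simpler fix.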

For a known-good starting point, there are ready-made Spark Docker images, for example P7h/docker-spark with Spark 2.2.0, that you can use as a reference for the container setup.


Source: https://habr.com/ru/post/1688900/
