No hive-site.xml when using YARN cluster mode

Using HDP 2.5.3, I have been trying to debug some classpath issues inside the YARN containers.

Since HDP includes both Spark 1.6 and 2.0.0, there have been some conflicting versions to deal with.

The users that I support can successfully run Spark2 with Hive queries in YARN client mode, but not in cluster mode, where they get errors like "table not found" because the Metastore connection is never established.

I assume that setting --driver-class-path /etc/spark2/conf:/etc/hive/conf or passing --files /etc/spark2/conf/hive-site.xml to spark-submit would work, but why isn't hive-site.xml loaded from the conf folder already?
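Something like the following is what I have in mind as a workaround (a sketch only; the class and jar names are placeholders, not my actual job):

    # either ship the file with the job ...
    spark-submit --master yarn --deploy-mode cluster \
      --files /etc/spark2/conf/hive-site.xml \
      --class com.example.MyHiveJob my-hive-job.jar

    # ... or point the driver at the client conf dirs
    spark-submit --master yarn --deploy-mode cluster \
      --driver-class-path /etc/spark2/conf:/etc/hive/conf \
      --class com.example.MyHiveJob my-hive-job.jar

But that is exactly the manual step I would like to avoid.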

According to the Hortonworks docs, the hive-site should be placed in $SPARK_HOME/conf, and that ...

I do see hdfs-site.xml, core-site.xml, and the other files that are part of HADOOP_CONF_DIR, for example; this listing is from the YARN UI container information.

    2232355  4 drwx------ 2 yarn hadoop  4096 Aug 2 21:59 ./__spark_conf__
    2232379  4 -r-x------ 1 yarn hadoop  2358 Aug 2 21:59 ./__spark_conf__/topology_script.py
    2232381  8 -r-x------ 1 yarn hadoop  4676 Aug 2 21:59 ./__spark_conf__/yarn-env.sh
    2232392  4 -r-x------ 1 yarn hadoop   569 Aug 2 21:59 ./__spark_conf__/topology_mappings.data
    2232398  4 -r-x------ 1 yarn hadoop   945 Aug 2 21:59 ./__spark_conf__/taskcontroller.cfg
    2232356  4 -r-x------ 1 yarn hadoop   620 Aug 2 21:59 ./__spark_conf__/log4j.properties
    2232382 12 -r-x------ 1 yarn hadoop  8960 Aug 2 21:59 ./__spark_conf__/hdfs-site.xml
    2232371  4 -r-x------ 1 yarn hadoop  2090 Aug 2 21:59 ./__spark_conf__/hadoop-metrics2.properties
    2232387  4 -r-x------ 1 yarn hadoop   662 Aug 2 21:59 ./__spark_conf__/mapred-env.sh
    2232390  4 -r-x------ 1 yarn hadoop  1308 Aug 2 21:59 ./__spark_conf__/hadoop-policy.xml
    2232399  4 -r-x------ 1 yarn hadoop  1480 Aug 2 21:59 ./__spark_conf__/__spark_conf__.properties
    2232389  4 -r-x------ 1 yarn hadoop  1602 Aug 2 21:59 ./__spark_conf__/health_check
    2232385  4 -r-x------ 1 yarn hadoop   913 Aug 2 21:59 ./__spark_conf__/rack_topology.data
    2232377  4 -r-x------ 1 yarn hadoop  1484 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-audit.xml
    2232383  4 -r-x------ 1 yarn hadoop  1020 Aug 2 21:59 ./__spark_conf__/commons-logging.properties
    2232357  8 -r-x------ 1 yarn hadoop  5721 Aug 2 21:59 ./__spark_conf__/hadoop-env.sh
    2232391  4 -r-x------ 1 yarn hadoop   281 Aug 2 21:59 ./__spark_conf__/slaves
    2232373  8 -r-x------ 1 yarn hadoop  6407 Aug 2 21:59 ./__spark_conf__/core-site.xml
    2232393  4 -r-x------ 1 yarn hadoop   812 Aug 2 21:59 ./__spark_conf__/rack-topology.sh
    2232394  4 -r-x------ 1 yarn hadoop  1044 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-security.xml
    2232395  8 -r-x------ 1 yarn hadoop  4956 Aug 2 21:59 ./__spark_conf__/metrics.properties
    2232386  8 -r-x------ 1 yarn hadoop  4221 Aug 2 21:59 ./__spark_conf__/task-log4j.properties
    2232380  4 -r-x------ 1 yarn hadoop    64 Aug 2 21:59 ./__spark_conf__/ranger-security.xml
    2232372 20 -r-x------ 1 yarn hadoop 19975 Aug 2 21:59 ./__spark_conf__/yarn-site.xml
    2232397  4 -r-x------ 1 yarn hadoop  1006 Aug 2 21:59 ./__spark_conf__/ranger-policymgr-ssl.xml
    2232374  4 -r-x------ 1 yarn hadoop    29 Aug 2 21:59 ./__spark_conf__/yarn.exclude
    2232384  4 -r-x------ 1 yarn hadoop  1606 Aug 2 21:59 ./__spark_conf__/container-executor.cfg
    2232396  4 -r-x------ 1 yarn hadoop  1000 Aug 2 21:59 ./__spark_conf__/ssl-server.xml
    2232375  4 -r-x------ 1 yarn hadoop     1 Aug 2 21:59 ./__spark_conf__/dfs.exclude
    2232359  8 -r-x------ 1 yarn hadoop  7660 Aug 2 21:59 ./__spark_conf__/mapred-site.xml
    2232378 16 -r-x------ 1 yarn hadoop 14474 Aug 2 21:59 ./__spark_conf__/capacity-scheduler.xml
    2232376  4 -r-x------ 1 yarn hadoop   884 Aug 2 21:59 ./__spark_conf__/ssl-client.xml

As you can see, hive-site.xml is not there, although I definitely have a conf/hive-site.xml for spark-submit to pick up:

    [spark@asthad006 conf]$ pwd && ls -l /usr/hdp/2.5.3.0-37/spark2/conf
    total 32
    -rw-r--r-- 1 spark spark   742 Mar 6 15:20 hive-site.xml
    -rw-r--r-- 1 spark spark   620 Mar 6 15:20 log4j.properties
    -rw-r--r-- 1 spark spark  4956 Mar 6 15:20 metrics.properties
    -rw-r--r-- 1 spark spark   824 Aug 2 22:24 spark-defaults.conf
    -rw-r--r-- 1 spark spark  1820 Aug 2 22:24 spark-env.sh
    -rwxr-xr-x 1 spark spark   244 Mar 6 15:20 spark-thrift-fairscheduler.xml
    -rw-r--r-- 1 hive  hadoop  918 Aug 2 22:24 spark-thrift-sparkconf.conf

Now, I don't think I'm supposed to put the hive-site into HADOOP_CONF_DIR, since HIVE_CONF_DIR is kept separate, but my question is: how do we get Spark2 to pick up hive-site.xml without having to manually pass it as a parameter at runtime?

EDIT: Naturally, since I'm on HDP, I use Ambari. The previous cluster administrator installed the Spark2 clients on all of the machines, so all of the YARN NodeManagers that could be potential Spark drivers should have the same configuration files.

+5
3 answers

You can use the Spark property spark.yarn.dist.files and specify the path to your hive-site.xml.
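For example (a sketch only; the class and jar names are placeholders, and the property can just as well live in spark-defaults.conf so nobody has to type it):

    # one-off, on the command line:
    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.dist.files=/etc/spark2/conf/hive-site.xml \
      --class com.example.MyHiveJob my-hive-job.jar

    # or permanently, one line in spark-defaults.conf:
    # spark.yarn.dist.files   /etc/spark2/conf/hive-site.xml

The file then gets localized into the YARN container's working directory, next to __spark_conf__.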

+2

The way I understand it, in local mode or in yarn-client mode ...

  • the Launcher checks whether it needs Kerberos tokens for HDFS, YARN, Hive, HBase
    > hive-site.xml is searched for on the CLASSPATH by the Hive/Hadoop client libraries (including via driver.extraClassPath, because the driver runs inside the Launcher and the merged CLASSPATH is already built at that point)
  • the driver checks which kind of metastore to use for its internal purposes: a standalone metastore backed by a volatile Derby instance, or a regular Hive Metastore
    > that's $SPARK_CONF_DIR/hive-site.xml
  • when using the Hive interface, a Metastore connection is opened to read/write Hive metadata in the driver
    > hive-site.xml is searched for on the CLASSPATH by the Hive/Hadoop client libraries (and the Kerberos token is used, if there is one)

So you can have one hive-site.xml that tells Spark to use an embedded, in-memory Derby instance as a sandbox (in-memory meaning "stop leaving all these temp files behind you"), while another hive-site.xml gives the actual Hive Metastore URI. And all is well.
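To make that concrete, a minimal sketch (the values are illustrative, the Metastore host and port are placeholders; in practice each file would of course be named hive-site.xml inside its own conf directory):

    mkdir -p sandbox real

    # "sandbox" variant: embedded Derby kept in memory, so no metastore_db litter on disk
    cat > sandbox/hive-site.xml <<'EOF'
    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:memory:metastore_db;create=true</value>
      </property>
    </configuration>
    EOF

    # "real" variant: just point at the actual Hive Metastore service
    cat > real/hive-site.xml <<'EOF'
    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://metastore-host.example.com:9083</value>
      </property>
    </configuration>
    EOF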


Now, in yarn-cluster mode, that whole mechanism pretty much explodes into a nasty, undocumented mess.

The Launcher needs its own CLASSPATH settings to create the Kerberos tokens, otherwise it fails. You'd better go to the source code to find out which undocumented env variable to use.
It may also be necessary to override some properties, because the hard-coded defaults are suddenly (and silently) no longer the defaults.

The driver cannot use the original $SPARK_CONF_DIR; it has to rely on what the Launcher made available for upload. Does that include a copy of $SPARK_CONF_DIR/hive-site.xml? It doesn't look like it. So you are probably left with a Derby instance as a stub.

And the driver has to deal with whatever YARN injected into the container's CLASSPATH, in whatever order.
In addition, the driver.extraClassPath additions do NOT take precedence by default; for that you have to force spark.yarn.user.classpath.first=true (which translates into a standard Hadoop property whose exact name I can't remember right now, especially since there are several props with similar names that may be deprecated and/or not working in Hadoop 2.x).
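If you do ship your own config and need it to win, the idea looks roughly like this (a sketch; only the two --conf lines matter, the class and jar are placeholders):

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.driver.extraClassPath=/etc/spark2/conf:/etc/hive/conf \
      --conf spark.yarn.user.classpath.first=true \
      --class com.example.MyHiveJob my-hive-job.jar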


Think that's bad? Try connecting to a Kerberized HBase in yarn-cluster mode: the connection is made in the executors, which is yet another layer of nastiness. But I digress.

Bottom line: start your diagnostics over again.

A. Are you really sure that the mysterious "Metastore connection errors" are caused by missing properties, and specifically by a missing Metastore URI?

B. By the way, are your users explicitly using a HiveContext???

C. What exactly is the CLASSPATH that YARN presents to the driver JVM, and what is the CLASSPATH that the driver presents to the Hadoop libraries when opening the Metastore connection? (one possible way to check is sketched below)

D. If the CLASSPATH built by YARN turns out to be messed up for some reason, what would be the minimal fix: changing the precedence rules? an addition? how?
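A possible way to dig into point C, as a sketch (all paths are assumptions: /hadoop/yarn/local is the usual HDP default for yarn.nodemanager.local-dirs, and the files only stick around after the job if yarn.nodemanager.delete.debug-delay-sec is raised):

    # run on the NodeManager that hosted the driver container
    APP_ID=application_XXXXXXXXXXXXX_XXXX    # placeholder application id

    # what did YARN actually localize next to the driver?
    ls -lR /hadoop/yarn/local/usercache/*/appcache/${APP_ID}/

    # what CLASSPATH was the container launched with?
    grep -h CLASSPATH /hadoop/yarn/local/usercache/*/appcache/${APP_ID}/container_*/launch_container.sh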

+2

In cluster mode, the configuration is read from the conf directory of the machine that runs the driver container, not the one used for spark-submit.
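So every NodeManager host that could end up running the driver needs the same hive-site.xml. A quick consistency check, as a sketch (hostnames are placeholders; the path is the usual HDP client-conf symlink, adjust to your layout):

    for h in nodemanager01 nodemanager02 nodemanager03; do
      ssh "$h" md5sum /usr/hdp/current/spark2-client/conf/hive-site.xml
    done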

+1

Source: https://habr.com/ru/post/1270518/

