How to load an Impala table directly into Spark using JDBC?

Question

I am trying to write a Spark job in Python that opens a JDBC connection to Impala and loads a view directly from Impala into a DataFrame. This question is pretty close, but it is in Scala: JDBC call for impala / hive from a spark job and table creation

How should I do it? There are many examples for other data sources such as MySQL and PostgreSQL, but I have not seen one for Impala + Python + Kerberos. An example would be a great help. Thanks!

I tried this with information from the Internet, but it did not work.

Spark notebook launch script

#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'

# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH

# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30

# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false

pyspark --master yarn-client \
        --driver-memory 4G \
        --executor-memory 2G \
        # --num-executors 10 \
        --jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
        --driver-class-path $JDBC_PATH/*.jar

Python code

properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver",
    "AuthMech": "1",
#     "KrbRealm": "EXAMPLE.COM",
#     "KrbHostFQDN": "impala.example.com",
    "KrbServiceName": "impala"
}

# imp_env is the hostname of the db, works with other impala queries ran inside python
url = "jdbc:impala:imp_env;auth=noSasl"

db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)

Error message:
Py4JJavaError: o42.jdbc. : java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver

jdbc kerberos apache-spark pyspark impala

alfredox 08 . '16 20:58

Ram Ghadiyaram · Answer 1 · 2016-09-21T17:48:20+0000

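--jars expects a comma-separated list of jars, not a space-separated one. You can build that list with

--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')

instead of

--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar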
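
The same comma-separated list can be built from Python, which may be handy when the jars are configured inside a notebook rather than in a launch script. This is a minimal sketch, not a tested recipe for this cluster: jars_argument is a hypothetical helper, the paths are the ones from the question, and it assumes PYSPARK_SUBMIT_ARGS is set before the SparkContext is created.

```python
import glob
import os


def jars_argument(jar_dir, extra_jars=()):
    """Return a comma-separated jar list in the form --jars expects."""
    found = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))
    return ",".join(list(extra_jars) + found)


if __name__ == "__main__":
    # Paths copied from the question; adjust them to your own layout.
    jdbc_dir = "/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30"
    jars = jars_argument(jdbc_dir, ["/home/anave/spark-csv_2.11-1.4.0.jar"])

    # --jars takes commas, while --driver-class-path is a normal Java
    # classpath and takes colons on Linux.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--jars {0} --driver-class-path {1} pyspark-shell".format(
            jars, jars.replace(",", ":"))
    )
    print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

The driver jar has to be visible both to the driver (--driver-class-path) and to the executors (--jars); passing only one of the two is a common cause of the ClassNotFoundException shown in the question.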

Joey Van Halen · Answer 2 · 2017-06-20T02:28:17+0000

An example in Scala, using spark-shell:

spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar 

val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"

val connectionProperties = new java.util.Properties()

val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)
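
Since the question asks for Python, here is a rough PySpark equivalent of the Scala snippet above. Treat it as a sketch under assumptions: impala_jdbc_url and load_impala_table are hypothetical helpers, the URL shape jdbc:impala://host:port;Key=Value follows the Scala example, and the Kerberos settings are the property names already used in the question (AuthMech=1 plus KrbRealm, KrbHostFQDN, KrbServiceName). It also assumes a valid Kerberos ticket is available wherever the JDBC connection is opened.

```python
def impala_jdbc_url(host, port=21050, schema="", **settings):
    """Build a URL of the form jdbc:impala://host:port/schema;Key=Value."""
    url = "jdbc:impala://{0}:{1}".format(host, port)
    if schema:
        url += "/" + schema
    # Sort the keys so the resulting URL is deterministic.
    for key in sorted(settings):
        url += ";{0}={1}".format(key, settings[key])
    return url


def load_impala_table(sql_context, url, table):
    """Read one Impala table or view into a DataFrame over JDBC."""
    # The class name must match the jar passed via --jars / --driver-class-path.
    properties = {"driver": "com.cloudera.impala.jdbc41.Driver"}
    return sql_context.read.jdbc(url=url, table=table, properties=properties)


# Same connection as the Scala example (no authentication).
plain_url = impala_jdbc_url("192.168.56.101", AuthMech=0)

# Kerberos variant; realm and host are the placeholders from the question.
kerberos_url = impala_jdbc_url(
    "impala.example.com",
    AuthMech=1,
    KrbRealm="EXAMPLE.COM",
    KrbHostFQDN="impala.example.com",
    KrbServiceName="impala",
)

# Inside a pyspark session where sqlContext already exists:
# db_df = load_impala_table(sqlContext, kerberos_url, "summary")
```

Note that the URL in the question, jdbc:impala:imp_env;auth=noSasl, does not follow this shape: it has no // before the host and no port, and auth=noSasl looks like a Hive JDBC option rather than one of the AuthMech settings used here.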
