I am trying to write a Spark job in Python that opens a JDBC connection to Impala and loads a view directly from Impala into a DataFrame. This question is pretty close, but it is in Scala: JDBC call for impala / hive from a spark job and table creation
How should I do this? There are many examples for other data sources such as MySQL and PostgreSQL, but I have not found one for Impala + Python + Kerberos. An example would be a great help. Thanks!
I tried the following, based on information from the Internet, but it did not work.
Spark notebook launch script
#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30
pyspark --master yarn-client \
--driver-memory 4G \
--executor-memory 2G \
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
--driver-class-path $JDBC_PATH/*.jar
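One thing that may be worth checking in the script above: `--jars` expects a single comma-separated list, `--driver-class-path` expects a classpath joined with the platform separator (`:` on Linux), and the `--jars` line has no trailing backslash, so the `--driver-class-path` line is not passed to `pyspark`. A shell glob such as `$JDBC_PATH/*.jar` expands to space-separated paths, which fits neither option. The following is a minimal sketch, not a confirmed fix, of building both values in Python; the helper names `find_jars` and `build_jar_options` are hypothetical.

```python
import glob
import os


def find_jars(jar_dir):
    """Return the sorted list of .jar files directly inside jar_dir."""
    return sorted(glob.glob(os.path.join(jar_dir, "*.jar")))


def build_jar_options(driver_jars, extra_jars=()):
    """Build the values for --jars and --driver-class-path.

    driver_jars: paths of the JDBC driver jars (needed by driver and executors).
    extra_jars:  other jars that only need to be shipped with --jars.
    """
    # --jars takes one comma-separated list of jar paths.
    jars_value = ",".join(list(extra_jars) + list(driver_jars))
    # --driver-class-path is a regular Java classpath (":" on Linux).
    classpath_value = os.pathsep.join(driver_jars)
    return jars_value, classpath_value
```

Calling `build_jar_options(find_jars(JDBC_PATH), ["/home/anave/spark-csv_2.11-1.4.0.jar"])` would give two strings that can be passed to `--jars` and `--driver-class-path` respectively, each as a single argument.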
Python code
properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver",
    "AuthMech": "1",
    "KrbServiceName": "impala"
}
url = "jdbc:impala:imp_env;auth=noSasl"
db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)
Error message:
Py4JJavaError: o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver
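Separately from the missing driver class, the URL in the Python code above may need attention: the Cloudera Impala JDBC driver uses URLs of the form `jdbc:impala://host:port/schema;Key=Value`, and `auth=noSasl` appears to conflict with `AuthMech=1` (Kerberos). The sketch below only assembles such a URL string. The helper name `build_impala_jdbc_url`, the host name, and the realm are hypothetical placeholders; the option names `KrbRealm`, `KrbHostFQDN` and `KrbServiceName` are taken from the driver's documented settings as I understand them and should be checked against the installed driver version.

```python
def build_impala_jdbc_url(host, port=21050, database="default",
                          krb_realm=None, krb_host_fqdn=None,
                          krb_service_name="impala"):
    """Assemble a Kerberos (AuthMech=1) JDBC URL for the Cloudera Impala driver.

    All argument values are assumptions supplied by the caller; nothing here
    contacts the cluster or validates the settings.
    """
    base = "jdbc:impala://{0}:{1}/{2}".format(host, port, database)
    options = [
        ("AuthMech", "1"),                        # 1 selects Kerberos
        ("KrbServiceName", krb_service_name),     # service principal name
        ("KrbHostFQDN", krb_host_fqdn or host),   # FQDN of the Impala daemon
    ]
    if krb_realm:
        options.append(("KrbRealm", krb_realm))
    # Driver settings are appended as ;Key=Value pairs.
    return base + "".join(";{0}={1}".format(k, v) for k, v in options)
```

The resulting string could then be passed as `url` to `sqlContext.read.jdbc`, keeping only `"driver"` in `properties`; a valid Kerberos ticket would still be needed on the driver and executors.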