I want to use Spark to process some data from a JDBC source. But instead of reading the original tables over JDBC, I first want to run some queries on the JDBC side to filter the columns and join the tables, and then load the result of that query into Spark SQL as a table.
The following syntax for loading the original JDBC table works for me:
df_table1 = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:3306",
    dbtable="mydb.table1",
    user="me",
    password="******",
    driver="com.mysql.jdbc.Driver"
).load()
df_table1.show()
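For completeness, once loaded, the DataFrame can be exposed to Spark SQL with the 1.6-era API (a minimal sketch; the temp table name is arbitrary):

# Register the DataFrame as a temp table so it can be queried via Spark SQL
# (registerTempTable is the PySpark 1.6 API; later versions renamed it):
df_table1.registerTempTable("table1")
sqlContext.sql("SELECT COUNT(*) FROM table1").show()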
According to the Spark documentation (I am using PySpark 1.6.3):
dbtable: The JDBC table to be read. Note that you can use everything that is valid in the FROM clause of the SQL query. For example, instead of a full table, you can also use a subquery in parentheses.
So, just as an experiment, I tried something simple:
df_table1 = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:3306",
    dbtable="(SELECT * FROM mydb.table1) AS table1",
    user="me",
    password="******",
    driver="com.mysql.jdbc.Driver"
).load()
It failed with the following exception:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'table1 WHERE 1=0' at line 1
(/ , "", ..) . , ? ? , "WHERE 1 = 0" ? !