Spark SQL and Apache Drill integration via JDBC

I would like to create a Spark SQL DataFrame from the results of a query executed on CSV data (on HDFS) using Apache Drill. I have successfully configured Spark SQL to connect to Drill via JDBC:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.sql.DataFrame;

    // "sqlc" is an existing SQLContext
    Map<String, String> connectionOptions = new HashMap<String, String>();
    connectionOptions.put("url", args[0]);      // Drill JDBC URL
    connectionOptions.put("dbtable", args[1]);  // table/view to read
    connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");

    DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();
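For reference, here is a hedged sketch of what the two arguments might contain. The ZooKeeper address is a placeholder, not from the original post; the view name matches the queries below:

    // Hypothetical example values -- adjust to your cluster:
    String[] args = new String[] {
        "jdbc:drill:zk=localhost:2181", // Drill JDBC URL (ZooKeeper quorum)
        "dfs.output.`my_view`"          // table/view to expose as a DataFrame
    };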

Spark SQL performs two queries: the first to get the schema, and the second to get the actual data:

    SELECT * FROM (SELECT * FROM dfs.output.`my_view`) WHERE 1=0
    SELECT "field1","field2","field3" FROM (SELECT * FROM dfs.output.`my_view`)

The first query succeeds, but in the second Spark wraps the field names in double quotes, which Drill does not support, so the query fails.

Has anyone gotten this integration working?

Thanks!

1 answer

You can fix this by adding a custom JDBC dialect and registering it before using the JDBC connector:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    case object DrillDialect extends JdbcDialect {
      def canHandle(url: String): Boolean = url.startsWith("jdbc:drill:")

      // Return column names as-is so Spark does not double-quote them
      override def quoteIdentifier(colName: java.lang.String): java.lang.String = {
        colName
      }

      def instance = this
    }

    JdbcDialects.registerDialect(DrillDialect)
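Since the question's code is Java, the same fix can be sketched there too. This is a minimal sketch, assuming Spark's developer API org.apache.spark.sql.jdbc.JdbcDialect can be subclassed directly from Java; the class name JavaDrillDialect is just illustrative:

    import org.apache.spark.sql.jdbc.JdbcDialect;
    import org.apache.spark.sql.jdbc.JdbcDialects;

    public class JavaDrillDialect extends JdbcDialect {
        @Override
        public boolean canHandle(String url) {
            return url.startsWith("jdbc:drill:");
        }

        // Return identifiers unquoted: Drill rejects double-quoted column names
        @Override
        public String quoteIdentifier(String colName) {
            return colName;
        }
    }

    // Register once, before sqlc.read().format("jdbc")...load():
    JdbcDialects.registerDialect(new JavaDrillDialect());

With the dialect registered, the data query Spark generates becomes SELECT field1,field2,field3 FROM ... without the double quotes, so Drill accepts it.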
