Spark SQL JDBC Support

We are currently building a reporting platform as a data warehouse, for which we have been using Shark. Since development of Shark has been discontinued, we are now evaluating Spark SQL. Based on our use cases, we have several questions.

1) We have data from different sources (MySQL, Oracle, Cassandra, Mongo). We would like to know how we can get this data into Spark SQL. Is there any utility we can use? Does it support continuous updates, i.e. synchronizing every new insert / update / delete into the data warehouse in Spark SQL?

2) Is there a way to create multiple databases in Spark SQL?

3) We use Jasper for the user interface and would like to connect from Jasper to Spark SQL. In our initial research we learned that there is currently no support for connecting to Spark SQL via JDBC, but that this is planned for future releases. We would like to know when Spark SQL will have a stable version with JDBC support. In the meantime we took the source code from https://github.com/amplab/shark/tree/sparkSql , but it was difficult for us to build and evaluate it locally. It would be great if you could help us with installation instructions. (I can share the issues we ran into; please let me know where I can post the error logs.)

4) We will also need an SQL prompt in which we can execute queries. Currently the Spark shell provides a Scala prompt where Scala code can be executed, and SQL queries can only be run from that Scala code. Like Shark, we would like an SQL prompt in Spark SQL. In our research we found that this would be added in a future version of Spark. It would be great if you could tell us which release of Spark will include it.

+7
6 answers

As for your questions:

3) Spark 1.1 provides better support for the Spark SQL Thrift JDBC server, which you can use to interact with Spark SQL over JDBC. Hive JDBC clients that support v0.12.0 can connect to and interact with such a server.
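To make this concrete, here is a minimal sketch of querying the Thrift JDBC server from Scala through the Hive JDBC driver. It assumes the server has already been started (e.g. via sbin/start-thriftserver.sh) on the default port; the host, credentials, and the table name "my_table" are placeholders.

    import java.sql.DriverManager

    object ThriftServerJdbcExample {
      def main(args: Array[String]): Unit = {
        // Register the Hive JDBC driver (ships with hive-jdbc 0.12.x and later)
        Class.forName("org.apache.hive.jdbc.HiveDriver")

        // The Thrift JDBC server listens on port 10000 by default
        val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()

        // "my_table" is a placeholder for a table registered in the metastore
        val rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")
        while (rs.next()) {
          println(rs.getString(1))
        }
        rs.close()
        stmt.close()
        conn.close()
      }
    }

Since Jasper supports generic JDBC data sources, it should be able to connect through the same Hive JDBC driver.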

4) Spark 1.1 also provides the Spark SQL CLI (launched with bin/spark-sql), which you can use to type queries interactively, much like the Hive CLI or the Impala shell.

Please provide more details on what you are trying to achieve in 1 and 2.

+2

I can answer (1):

Apache Sqoop was designed specifically to solve this problem for relational databases. The tool was built around HDFS, HBase, and Hive, so it can be used to make data available to Spark via HDFS and the Hive metastore.

http://sqoop.apache.org/

I believe Cassandra is accessible from a SparkContext through this connector from DataStax: https://github.com/datastax/spark-cassandra-connector - though I have never used it myself.
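For reference, this is a minimal sketch of how the DataStax connector is typically used from Scala; the contact host, the keyspace "my_keyspace", and the table "my_table" are placeholders, and the connector version has to match your Spark version.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._   // adds cassandraTable() to SparkContext

    object CassandraReadExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cassandra-read-example")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

        val sc = new SparkContext(conf)

        // "my_keyspace" and "my_table" are placeholders
        val rows = sc.cassandraTable("my_keyspace", "my_table")
        println(rows.count())

        sc.stop()
      }
    }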

I do not know of any connector for MongoDB.

+1

1) We have data from different sources (MySQL, Oracle, Cassandra, Mongo)

You need a different driver for each case. For Cassandra there is the DataStax driver (though I ran into some compatibility issues with Spark SQL). For any SQL database you can use JdbcRDD. Usage is simple; see the Scala example below:

test("basic functionality") { sc = new SparkContext("local", "test") val rdd = new JdbcRDD( sc, () => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") }, "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?", 1, 100, 3, (r: ResultSet) => { r.getInt(1) } ).cache() assert(rdd.count === 100) assert(rdd.reduce(_+_) === 10100) } 

The point is that the result is just an RDD, so you work with this data through the regular RDD API (map, reduce, etc.), not through SQLContext.

Is there any utility we can use?

There is the Apache Sqoop project, but it is still under active development. The current stable version cannot even save files in Parquet format.

+1

Spark SQL is a component of the Spark framework. It cannot be directly compared with Shark, because Shark is a service. (Recall that with Shark you start a ThriftServer, which your Thrift application, or even ODBC, can connect to.)

Can you elaborate on what you mean by "get this data into Spark SQL"?

0

There are several Spark connectors for MongoDB, e.g. the MongoDB Connector for Hadoop (despite the name, you don't really need Hadoop!): https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
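Following the approach described in that blog post, a minimal sketch of reading a MongoDB collection into an RDD via the Hadoop connector might look like this; the connection URI, database, and collection names are placeholders.

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson.BSONObject
    import com.mongodb.hadoop.MongoInputFormat

    object MongoReadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("mongo-read-example"))

        // Placeholder URI pointing at the database and collection to read
        val mongoConfig = new Configuration()
        mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection")

        // Each record is a (document id, BSON document) pair
        val documents = sc.newAPIHadoopRDD(
          mongoConfig,
          classOf[MongoInputFormat],
          classOf[Object],
          classOf[BSONObject])

        println(documents.count())
        sc.stop()
      }
    }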

0

If your data is huge and requires a lot of transformation, Spark SQL can be used for ETL; otherwise Presto could solve all your problems. Addressing your questions one at a time:

  1. Since your data lives in MySQL, Oracle, Cassandra, and Mongo, all of them can be integrated through Presto, which has connectors for all of these databases: https://prestodb.github.io/docs/current/connector.html

  2. Once Presto is installed in cluster mode, you can query all of these databases together from a single platform, and even join a table from Cassandra with a table from Mongo (see the sketch after this list); this flexibility is unparalleled.

  3. Presto can be connected to Apache Superset ( https://superset.incubator.apache.org/ ), which is open source and provides a full dashboarding toolkit. Presto can also be connected to Tableau.

  4. You can also set up MySQL Workbench with the relevant connection details to get a user interface for all your databases in one place.
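To illustrate point 2, here is a minimal sketch of a cross-catalog join submitted from Scala through the Presto JDBC driver; the coordinator address, user, catalog names, and table/column names are all placeholders.

    import java.sql.DriverManager

    object PrestoCrossCatalogExample {
      def main(args: Array[String]): Unit = {
        // Presto JDBC driver; the coordinator is assumed to run on localhost:8080
        Class.forName("com.facebook.presto.jdbc.PrestoDriver")
        val conn = DriverManager.getConnection("jdbc:presto://localhost:8080", "report_user", null)

        // Join a Cassandra table with a MongoDB collection; all names are placeholders
        val sql =
          """SELECT c.user_id, m.last_login
            |FROM cassandra.ks.users c
            |JOIN mongodb.app.sessions m ON c.user_id = m.user_id
            |LIMIT 10""".stripMargin

        val rs = conn.createStatement().executeQuery(sql)
        while (rs.next()) {
          println(s"${rs.getString(1)}\t${rs.getString(2)}")
        }
        conn.close()
      }
    }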

0

Source: https://habr.com/ru/post/971916/

