I am investigating Titan (on HBase) as a candidate for a large, distributed graph database. We need both OLTP access (fast, multi-hop queries over the graph) and OLAP access (loading all, or at least most, of the graph into Spark for analytics).
From what I understand, I can use Gremlin Server to handle the OLTP-style queries, where my result sets will be small. Since my queries will be generated by a user interface, I can use an API to interact with Gremlin Server. So far, so good.
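For context, this is the kind of client-side interaction I have in mind for the OLTP path. A minimal sketch using the TinkerPop 3 Gremlin driver; the host, port, and the "saturn" lookup (from the Graph of the Gods sample) are placeholders:

```java
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;

import java.util.Collections;
import java.util.List;

public class OltpQuery {
    public static void main(String[] args) throws Exception {
        // Connect to a Gremlin Server fronting Titan (placeholder host/port).
        Cluster cluster = Cluster.build("gremlin-server-host").port(8182).create();
        Client client = cluster.connect();
        try {
            // Submit a small, parameterized traversal; the result set is small,
            // so pulling it back to the client is cheap.
            List<Result> results = client
                    .submit("g.V().has('name', n).out('knows').values('name')",
                            Collections.singletonMap("n", "saturn"))
                    .all()
                    .get();
            results.forEach(r -> System.out.println(r.getString()));
        } finally {
            cluster.close();
        }
    }
}
```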
The problem concerns the OLAP use case. Since the data in HBase will be co-located with the Spark executors, it would make sense to read the data into Spark using an HDFSInputFormat. It would be inefficient (infeasible, in fact, given the projected size of the graph) to run a Gremlin query from the driver and then distribute the data back out to the executors.
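To make the InputFormat idea concrete, here is a sketch of the raw scan I can already do, using HBase's stock TableInputFormat with Spark's newAPIHadoopRDD. The table name and column family are assumptions on my part: Titan's default table is "titan", and with the default short column-family names the edgestore should be in "e". This only yields opaque, Titan-serialized rows, which is exactly where I get stuck:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RawTitanScan {
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("raw-titan-scan"));

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zookeeper-host"); // if hbase-site.xml is not on the classpath
        conf.set(TableInputFormat.INPUT_TABLE, "titan");      // Titan's default table name (assumed)
        conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "e");   // edgestore CF under Titan's default short names (assumed)

        // Each row key is a Titan vertex id; the cells are Titan-serialized
        // properties and edges, with no obvious way to deserialize them.
        JavaPairRDD<ImmutableBytesWritable, Result> rows =
                sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                                   ImmutableBytesWritable.class, Result.class);

        System.out.println("rows: " + rows.count());
        sc.stop();
    }
}
```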
The best guidance I have found is an unresolved discussion in the Titan GitHub repo (https://github.com/thinkaurelius/titan/issues/1045), which suggests that (for the Cassandra backend, at least) the standard TitanCassandraInputFormat should work for reading Titan tables. Nothing is claimed about the HBase backend.
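By analogy with that Cassandra discussion, the approach I would hope to take on HBase looks something like the following. The class name and the "titanmr.ioformat.conf." configuration prefix are guesses based on the Titan 1.0 titan-hadoop module; I have not verified them, so treat the whole block as an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;

import com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat; // unverified class name

public class TitanVertexScan {
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("titan-vertex-scan"));

        // Pass ordinary Titan storage options through to the input format.
        // The "titanmr.ioformat.conf." prefix is taken from the Titan 1.0 docs;
        // double-check it against your distribution.
        Configuration conf = new Configuration();
        conf.set("titanmr.ioformat.conf.storage.backend", "hbase");
        conf.set("titanmr.ioformat.conf.storage.hostname", "zookeeper-host");
        conf.set("titanmr.ioformat.conf.storage.hbase.table", "titan");

        // If this works as I hope, each record is a fully deserialized
        // TinkerPop vertex rather than raw HBase bytes.
        JavaPairRDD<NullWritable, VertexWritable> vertices =
                sc.newAPIHadoopRDD(conf, HBaseInputFormat.class,
                                   NullWritable.class, VertexWritable.class);

        System.out.println("vertex count: " + vertices.count());
        sc.stop();
    }
}
```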
However, after reading about the underlying Titan data model (http://s3.thinkaurelius.com/docs/titan/current/data-model.html), it appears that parts of the raw graph data are serialized, with no explanation of how to reconstruct the property graph from its contents.
So, I have two questions:
1) Is everything that I stated above correct, or have I missed / misunderstood something?
2) Has anyone been able to read the raw Titan graph data out of HBase and reconstruct it in Spark (whether in GraphX, or as DataFrames, RDDs, etc.)? If so, can you give me any pointers? A sketch of the sort of thing I am after follows.
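Concretely, assuming the input format sketched above really does yield TinkerPop VertexWritables, I would want to flatten them into edge tuples that could feed GraphX's Graph.fromEdgeTuples on the Scala side. Continuing from the `vertices` RDD in the previous sketch (Spark 1.x flatMap semantics, all names unverified):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.tinkerpop.gremlin.structure.Direction;

import scala.Tuple2;

import java.util.ArrayList;
import java.util.List;

// Flatten each vertex's outgoing edges into (srcId, dstId) pairs.
// Titan vertex ids are longs, so the casts should be safe.
JavaRDD<Tuple2<Long, Long>> edges = vertices.values().flatMap(vw -> {
    List<Tuple2<Long, Long>> out = new ArrayList<>();
    vw.get().edges(Direction.OUT).forEachRemaining(e ->
            out.add(new Tuple2<>((Long) e.outVertex().id(),
                                 (Long) e.inVertex().id())));
    return out; // Spark 1.x flatMap expects an Iterable
});
```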