How does Titan achieve continuous time search using HBase / Cassandra?

In O'Reilly's book Graph Databases, Chapter 6, which describes how Neo4j stores a graph database, the authors say:

To understand why native graph processing is so much more efficient than graphs that rely on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.
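To make the quoted complexity argument concrete, here is a minimal sketch (not from the book; the chain graph and helper names are my own) contrasting an m-step walk over direct references with an m-step walk that does a binary-search index lookup at every hop:

```python
import bisect

# Hypothetical miniature graph: each node keeps a direct reference to its
# neighbour (index-free adjacency), versus a sorted edge list we must search.

class Node:
    def __init__(self, name):
        self.name = name
        self.next = None  # direct "pointer" to the next node

# Build a chain a -> b -> c -> d -> e with direct pointers.
names = ["a", "b", "c", "d", "e"]
nodes = [Node(n) for n in names]
for cur, nxt in zip(nodes, nodes[1:]):
    cur.next = nxt

def walk_pointers(start, m):
    # Index-free traversal: m steps cost O(m) pointer dereferences.
    node = start
    for _ in range(m):
        node = node.next
    return node.name

# Index-based traversal: a sorted (src, dst) edge list searched with bisect,
# so each of the m steps costs an O(log n) lookup -> O(m log n) overall.
edges = sorted((a.name, b.name) for a, b in zip(nodes, nodes[1:]))

def walk_index(start_name, m):
    name = start_name
    for _ in range(m):
        i = bisect.bisect_left(edges, (name,))
        name = edges[i][1]
    return name

print(walk_pointers(nodes[0], 3))  # -> d
print(walk_index("a", 3))          # -> d
```

Both walks return the same node; the difference is purely in the per-step cost, which is the point the book is making.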

The book then explains that Neo4j achieves this constant-time traversal by storing all nodes and relationships as fixed-size records:

With fixed-size records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs a couple of cheap ID computations (these computations are much cheaper than searching global indexes, as we'd have to do if we were faking a graph in a non-graph database).
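The "cheap ID computation" is just arithmetic: with fixed-size records, a record's byte offset is its ID times the record size. A hedged sketch (this is not Neo4j's actual record format; the field layout is invented for illustration):

```python
import struct

# Hypothetical fixed-size node record: 4-byte "first relationship id"
# plus a 4-byte payload field.
RECORD_SIZE = 8

store = bytearray(RECORD_SIZE * 100)  # a flat node store

def write_record(node_id, rel_id, payload):
    struct.pack_into("<ii", store, node_id * RECORD_SIZE, rel_id, payload)

def read_record(node_id):
    # O(1): the offset is pure arithmetic, no index lookup needed.
    return struct.unpack_from("<ii", store, node_id * RECORD_SIZE)

write_record(7, rel_id=42, payload=99)
print(read_record(7))  # -> (42, 99)
```

Because `node_id * RECORD_SIZE` locates the record directly, following a relationship never touches an index.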

This last sentence prompts my question: how does Titan, using Cassandra or HBase as its storage backend, achieve these gains, or otherwise compensate for their absence?

+5
2 answers

Neo4j is only O(1) when the data is in memory in a single JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (it has a poor on-disk representation).

Titan is only O(1) when the data is in memory in a single JVM. When the data is on disk, Titan is faster than Neo4j because it has a better on-disk representation.

Please see the following blog post for a quantitative explanation of the above: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/

Therefore, when people say O(1), it is important to understand which part of the memory hierarchy they are talking about. When you are in a single JVM (a single machine), it is easy to be fast, as both Neo4j and Titan demonstrate with their respective caching mechanisms. When you cannot fit the entire graph into memory, you have to rely on smart disk layouts, distributed caches, and so on.
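One example of such a "smart disk layout" is storing a vertex's entire adjacency list together. The sketch below illustrates the general idea behind Titan's layout on Cassandra as I understand it (a vertex per row, edges as sorted columns); the encoding here is invented and greatly simplified, not Titan's real one:

```python
from collections import defaultdict

# vertex_id -> sorted list of (edge_label, neighbour_id) "columns".
rows = defaultdict(list)

def add_edge(src, label, dst):
    rows[src].append((label, dst))
    rows[src].sort()  # Cassandra keeps columns sorted by column name

def neighbours(src, label):
    # One row read, then a sequential scan over co-located columns,
    # instead of scattered pointer chasing across the disk.
    return [dst for lbl, dst in rows[src] if lbl == label]

add_edge(1, "knows", 2)
add_edge(1, "knows", 3)
add_edge(1, "lives_in", 9)
print(neighbours(1, "knows"))  # -> [2, 3]
```

Because all edges of a vertex are co-located and sorted, fetching them is one sequential read, which is what makes the on-disk traversal competitive.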

For more information, see the following two blog posts:

http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/

+12

OrientDB uses a similar approach, where relationships are managed without indexes (index-free adjacency), but rather with direct pointers (LINKs) between vertices. It is like memory pointers, but on disk. This way, OrientDB achieves O(1) traversal both in memory and on disk.

But if you have a "City" vertex with thousands of edges to "Man" vertices, and you are looking for all people with age > 18, then OrientDB uses indexes, because a lookup query is involved, so in this case it is O(log N).
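The distinction the answer draws can be sketched as follows (this is an illustration of the complexity claim, not OrientDB's actual index; the sorted-pairs index and the `adults` helper are hypothetical):

```python
import bisect

# Hypothetical property index on Man.age: sorted (age, vertex_id) pairs.
age_index = sorted([(15, "m1"), (17, "m2"), (19, "m3"), (33, "m4")])

def adults(min_age_exclusive=18):
    # Binary search for the first entry with age > min_age_exclusive:
    # O(log N), versus walking every City->Man edge and filtering.
    i = bisect.bisect_right(age_index, (min_age_exclusive, "\uffff"))
    return [vid for _, vid in age_index[i:]]

print(adults())  # -> ['m3', 'm4']
```

The traversal path (LINKs) answers "who is connected to this city" in O(1) per hop; the index path answers "who matches this predicate" in O(log N), and each is used where it is cheapest.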

+1

Source: https://habr.com/ru/post/1203276/
