Gremlin - Giraph - GraphX? On TitanDb

I need some help to confirm my choice ... and to find out whether you can give me some information. My storage database is TitanDb with Cassandra. I have a very big graph. My goal is to eventually use MLlib on this graph.

My first idea was to use Titan with GraphX, but I haven't found anything, or only things still in development ... TinkerPop is not ready yet. So I had a look at Giraph. Through TinkerPop, Titan can communicate with TinkerPop's Rexster.

My question is: what is the benefit of using Giraph? Gremlin seems to do the same thing and is also distributed.

Thanks very much for explaining. I think I do not really understand the difference between Gremlin and Giraph (or GraphX).

Have a good day.

+6
2 answers

Interesting question. I am on the same path.

First, your question about MLlib. I assume you mean Apache Spark MLlib, the machine learning (ML) library on top of Apache Spark. So my conclusion is that you want to run ML algorithms for purposes such as clustering and classification on the data in Titan / Cassandra. Note that you can also use graph processing algorithms, such as the PageRank mentioned by spidy, to do things like clustering on top of your Titan / Cassandra graph database. In other words: you do not necessarily need ML to perform clustering when your starting point is a graph database.

Apache Spark MLlib seems to be future-proof and widely supported; its latest announcements have been about new ML algorithms. Apache Mahout, another Apache ML project, is more mature in terms of the number of ML algorithms supported. Apache Mahout can also run on top of Apache Spark, which is why I mention it in this post. In addition to in-memory computation, Apache Spark offers the MLlib mentioned above for machine learning; Spark SQL, which is similar to Hive on Spark; GraphX, which is a graph processing system, as explained by spidy; and Spark Streaming for processing streaming data.

I see Apache Spark as a logical data layer, exposed as RDDs (Resilient Distributed Datasets), on top of storage layers such as Cassandra, Hadoop / HCatalog and HBase. There is a Spark connector for Cassandra. Please note that RDDs are immutable: you cannot change the data in an RDD in place, you can only process and analyze it in Spark. Regarding this logical data layer: you can compare an RDD with a view from the good old SQL days. An RDD gives you a view of, for example, a table in Cassandra or HBase. Also note that Apache Spark offers APIs for three languages: Scala, Java, and Python.

Apache Giraph is also a graph processing toolset, the functional equivalent of Apache Spark GraphX. Apache Giraph uses Hadoop as its storage tier. You are using Titan / Cassandra, so you are likely to run into data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with an ML question about MLlib, and Apache Giraph is not an ML solution.

Your conclusion regarding Giraph and Gremlin is incorrect: they are not the same, although both work on graph data. Giraph is a graph processing solution, as explained. With Giraph you can run graph analysis algorithms such as PageRank (e.g. who has the most followers), while Gremlin is for traversing (e.g. querying) the graph database, using the complex relationships (edges) between entities (vertices) and returning result sets of vertex and edge properties.

+8
  • Gremlin is a graph traversal language, while Giraph and GraphX are graph processing systems.

I believe you are asking about the difference between Giraph or GraphX and Titan. To be more specific: why should you use a graph processing system when you already have your data in a graph database?

So essentially this is about the difference between a graph database and a graph processing system.

  • A graph database is your friend when your application needs to query the data frequently. For instance, in a Facebook-type application: given a user, return all of their friends. This suits a graph database, and you can use Gremlin to write the query.

  • Now, if you want to calculate the rank of every user on Facebook, you need to run the PageRank algorithm over the whole graph. In other words, the PageRank algorithm processes your entire graph and returns a map. This is a suitable application for a graph processing system. Yes, you can write Gremlin queries to do this, but 1. it will not be as convenient as the underlying Pregel model used by Giraph or GraphX, and 2. it will not be efficient.
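To make the two cases concrete, here is a minimal sketch in plain Python. It does not use the Gremlin, Giraph or GraphX APIs; the toy graph, the function names `friends_of` and `pagerank`, and the damping factor of 0.85 are all assumptions made up for illustration. The first function is the query-style case (it touches one vertex), the second is a simplified Pregel-style PageRank where every vertex sends messages along its edges in each superstep (it touches the whole graph every round).

```python
# Toy directed "follows" graph; the data and names are illustrative only.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}


def friends_of(graph, user):
    """Query-style access: reads a single vertex and its outgoing edges."""
    return list(graph.get(user, []))


def pagerank(graph, damping=0.85, supersteps=30):
    """Simplified Pregel-style PageRank over the whole graph.

    In every superstep each vertex sends rank / out_degree to its
    neighbours, then each vertex recomputes its rank from the messages
    it received. Vertices without outgoing edges simply send nothing
    in this sketch (there are none in the toy graph).
    """
    n = len(graph)
    rank = {vertex: 1.0 / n for vertex in graph}
    for _ in range(supersteps):
        inbox = {vertex: 0.0 for vertex in graph}
        for vertex, neighbours in graph.items():
            if neighbours:
                share = rank[vertex] / len(neighbours)
                for neighbour in neighbours:
                    inbox[neighbour] += share
        rank = {
            vertex: (1.0 - damping) / n + damping * inbox[vertex]
            for vertex in graph
        }
    return rank


if __name__ == "__main__":
    print(friends_of(GRAPH, "alice"))
    ranks = pagerank(GRAPH)
    for user in sorted(ranks, key=ranks.get, reverse=True):
        print(user, round(ranks[user], 4))
```

The lookup cost depends only on how many friends one user has, while each PageRank superstep walks every edge in the graph. That is the part a system like Giraph or GraphX distributes across machines.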

To summarize, it really depends on your application. If you think your application is query-like, do not bother loading the data into any graph processing system. If you think your application is more like PageRank (it needs the whole graph), and you have a large graph (at least 1M edges), go for Giraph or GraphX.

Both Giraph and GraphX have graph input formats. You can dump your data in such a format to a file and feed it into one of these systems, or you can write your own input format.
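As a rough sketch of the "dump to a file" route, the Python below writes a plain edge list: one `source_id target_id` pair per line, separated by whitespace, with `#` comment lines. As far as I know this is the layout GraphX's `GraphLoader.edgeListFile` reads; Giraph's input formats are different, so check the one you pick. How you get the edges out of Titan is not shown: the `edges` argument and the names `dump_edge_list` and `read_edge_list` are hypothetical, and vertex ids are assumed to be integers.

```python
import os
import tempfile


def dump_edge_list(edges, path):
    """Write (source_id, target_id) pairs, one whitespace-separated pair per line.

    `edges` is any iterable of integer pairs; exporting them from the
    graph database is assumed to happen elsewhere.
    Returns the number of edges written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        out.write("# source_id target_id\n")  # comment line for readers of the file
        for source_id, target_id in edges:
            out.write(f"{int(source_id)} {int(target_id)}\n")
            count += 1
    return count


def read_edge_list(path):
    """Read the file back, skipping blank lines and '#' comments."""
    edges = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            source_id, target_id = line.split()
            edges.append((int(source_id), int(target_id)))
    return edges


if __name__ == "__main__":
    sample = [(1, 2), (1, 3), (2, 3), (3, 1)]
    target = os.path.join(tempfile.mkdtemp(), "edges.txt")
    print(dump_edge_list(sample, target), "edges written to", target)
    print(read_edge_list(target))
```

For a really big graph you would write many part files in parallel instead of a single file, but the line format stays the same.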

P.S. It would be nice to have an input format added to Giraph / GraphX that accepts data stored in Titan.

+9

Source: https://habr.com/ru/post/983902/

