Interesting question. I am on the same path.
First, your question about MLlib. I assume you mean Apache Spark MLlib, the machine learning (ML) library built on top of Apache Spark. So my understanding is that you want to run ML algorithms for purposes such as clustering and classification on the data in Titan / Cassandra. Note that you can also use graph processing algorithms, such as the PageRank mentioned by spidy, to do things like clustering on top of your Titan / Cassandra graph database. In other words: you do not need ML to perform clustering when the starting point is a graph database.
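To illustrate grouping vertices without any ML, here is a minimal sketch in plain Python. PageRank itself ranks vertices rather than clustering them, so as a stand-in I use connected components, which is my own choice and not something from the question. The edge list and names are hypothetical; in practice the edges would come from Titan / Cassandra.

```python
from collections import defaultdict

def connected_components(edges):
    """Group vertices into clusters: two vertices share a cluster if a path connects them."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    clusters = []
    seen = set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Walk outwards from an unvisited vertex to collect its whole cluster.
        cluster, frontier = set(), [start]
        while frontier:
            vertex = frontier.pop()
            if vertex in seen:
                continue
            seen.add(vertex)
            cluster.add(vertex)
            frontier.extend(neighbours[vertex] - seen)
        clusters.append(cluster)
    return clusters

# Hypothetical "knows" edges, standing in for data read from the graph database.
edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
clusters = connected_components(edges)
```

Here the structure of the graph alone decides which vertices end up in the same cluster; no model is trained.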
Apache Spark MLlib seems to be future-proof and widely supported; their latest announcements have been about new ML algorithms. Apache Mahout, another Apache ML project, is more mature in terms of the number of ML algorithms supported. Apache Mahout has also adopted Apache Spark as its data storage tier, which is why I am mentioning it in this post. In addition to in-memory computation, Apache Spark offers the mentioned MLlib for machine learning; Spark SQL, which is similar to Hive on Spark; GraphX, which is a graph processing system, as explained by spidy; and Spark Streaming for processing streaming data.
I see Apache Spark as a logical data layer, presented as RDDs (Resilient Distributed Datasets), on top of storage layers such as Cassandra, Hadoop / HCatalog and HBase. Apache Spark offers a connector for Cassandra. Please note that RDDs are immutable: you cannot change an RDD in place, you can only derive new RDDs from it as you process and analyze the data in Spark. Regarding Apache Spark RDDs as a logical data layer: you can compare an RDD to a view in the good old days of SQL. An RDD gives you a view on, for example, a table in Cassandra or HBase. Also note that Apache Spark offers APIs for three languages: Scala, Java, and Python.
Apache Giraph is also a graph processing toolset, the functional equivalent of Apache Spark GraphX. Apache Giraph uses Hadoop as its storage tier. You are using Titan / Cassandra, so you are likely to run into data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with an ML question about MLlib, and Apache Giraph is not an ML solution.
Your conclusion regarding Giraph and Gremlin is incorrect: they do not overlap, although both work with graph data. Giraph is a graph processing solution, as explained. Using Giraph, you can run graph analysis algorithms such as PageRank, e.g. to find out who has the most followers. Gremlin, on the other hand, is for traversing, i.e. querying, the graph database: it follows complex relationships (edges) between objects (vertices) and returns result sets of vertices and edge properties.
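The difference can be sketched in plain Python. This is neither Giraph nor Gremlin code, just a toy model under my own assumptions: `pagerank` stands for a whole-graph analysis of the kind Giraph runs, and `followers_of` stands for a traversal-style query of the kind you would write in Gremlin. The function names and the `follows` data are hypothetical.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Whole-graph analysis: iterate over every vertex to score all of them."""
    vertices = sorted(out_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            # A vertex without outgoing edges spreads its rank over all vertices.
            targets = out_links[v] or vertices
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

def followers_of(out_links, person):
    """Traversal-style query: find the vertices with a 'follows' edge into one vertex."""
    return sorted(v for v, targets in out_links.items() if person in targets)

# Hypothetical 'follows' edges: each key follows the vertices in its list.
follows = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}
ranks = pagerank(follows)
top = max(ranks, key=ranks.get)               # answer from the global analysis
carol_followers = followers_of(follows, "carol")  # answer from a local query
```

The first function has to touch the entire graph repeatedly before it can say anything; the second starts at one vertex and only looks at the edges around it. That is the split between graph processing and graph querying.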