Is a database query faster than LinkedIn type 2 connection search algorithms on a single server?

I have a table with friends id, u1, u2 and about < 500,000 records on one mysql server

and I want to take userA and userB and check if they have common friends.

Is it faster to do

 select u2 from friends where u1 = userA and u2 IN (select u2 from friends where u1 = userB) 

how to run the shortest path algorithm on a graph (on a single server)?

What is the standard way for large networks like LinkedIn and Facebook to do this?

Thanks!

+4
source share
5 answers

If table friends are indexed by both u1 and u2, then the SQL query should cross 2 subsets and pretty quickly. This is because indexing is already done. If you perform in-memory calculations, the time depends on whether you have pre-built indexes: if you do, you will be faster due to the lack of additional database connections. If indexing is included in the computational time and the database is warming up (all data in memory), you may lose.

I'm talking about indexing, not the shortest path, because the shortest path algorithm calculates more data than you need.

+2
source

In MySQL, the query you wrote will be slower than any other way to find this information. Perhaps slower than asking each person individually. Your request:

 select u2 from friends where u1 = userA and u2 IN (select u2 from friends where u1 = userB) 

Has a subquery in an IN clause. MySQL evaluates the query for each row encountered. The best way to write this:

 select u2 from friends where u1 = userA and exists (select 1 from friends where u1 = userB limit 1) 

If your data all fits on the same server and fits into memory, the performance of an optimized MySQL query should be good. Sites like LinkedIn and FaceBook have many problems - constant network updates, significantly large amounts of data, various types of links, etc. Your simple example does not reflect what they are doing. But many of their analyzes use Hadoop or Hadoop in conjunction with relational databases.

+2
source

In the graph database, you can write your query in gremlin as:

 gV('username','userB').out('friend').retain(gV('username','userA').out('friend').gather) 

Most graph databases should run fast.

If you use Titan, you can additionally use that Titan supports neighboring vertices in sorting order, which means that you can calculate the intersection of two friend lists using only one iteration over the data and without creating additional data structures. It will probably be faster than MySQL, and much faster if the average number of friends is large.

+2
source

You really need to try and compare your own data. Take a look at cassovary , flockdb , neo4j, etc.

Personally, I would do this in my memory, since you do not have many records. For instance. try BitSet where you can use fast bit operations (AND).

0
source

Here's another trick for second-degree joins using a simple inner join :

 select fA.u2 from friends fA inner join friends fB on fA.u2 = fB.u2 where fA.u1 = userA and fB.u1 = userB 

This is the same approach as a query of type many to many. You do not need to use the shortest path for this level of relationship.

If you want to look for more relationships, you should look at adjacency lists, but this is not easy to implement using MySQL. There are some issues that need to be done in this setup:

  • disjoint graphs (can be processed by preserving transitive closures on subgraphs and combining them when necessary),
  • directed against an undirected graph,
  • data distribution (another answer mentioned hadoop as a way to speed up processing, but this requires a good partitioning scheme)

to name a few.

0
source

Source: https://habr.com/ru/post/1434387/


All Articles