Optimizing a simple query on two large tables

I am trying to build a feature that displays the pages most viewed by a user's friends. My friendships table has 5.7M rows, and the views table has 5.3M rows. For now, I just want to run a query across these two tables and find the 20 page ids most viewed by a given person's friends.

Here is the query that I have now:

SELECT page_id
FROM `views`
INNER JOIN `friendships` ON friendships.receiver_id = views.user_id
WHERE `friendships`.`creator_id` = 143416
GROUP BY page_id
ORDER BY COUNT(views.user_id) DESC
LIMIT 20

And here is the explanation:

+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-------------------------+------+----------------------------------------------+
| id | select_type | table       | type | possible_keys                           | key                             | key_len | ref                     | rows | Extra                                        |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | friendships | ref  | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4       | const                   |  271 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | views       | ref  | PRIMARY                                 | PRIMARY                         | 4       | friendships.receiver_id |   11 | Using index                                  |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-------------------------+------+----------------------------------------------+

The views table has a primary key on (user_id, page_id), and you can see that it is being used. The friendships table has a primary key on (receiver_id, creator_id) and a secondary index on (creator_id).

If I run this query without the grouping and the limit, it returns about 25,000 rows for this particular user, which is typical.

On the most recent real run, this query took 7 seconds, which is far too long for a decent response time in a web application.

One thing I'm curious about is adding a secondary index on (creator_id, receiver_id). I'm not sure how much performance it would gain. I will probably try it today, depending on the answers to this question.
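For reference, creating that covering index would look something like this (a sketch only; the index name is invented, and the column types must match your schema):

```sql
-- Covering index: the WHERE on creator_id and the join on receiver_id
-- can both be satisfied from the index, without reading table rows.
CREATE INDEX index_friendships_on_creator_and_receiver
  ON friendships (creator_id, receiver_id);
```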

Can you see a way to rewrite the query to make it lightning fast?

Update: I need to run more tests on this, but it seems my nasty query performs better if I skip the grouping and sorting in the database and do them in Ruby afterwards. The total time is much shorter - by about 80%, it seems. My early testing may have been flawed, so this definitely needs more investigation, but if it's true - then WTF is MySQL doing?

3 answers

As far as I know, the best way to make a query like this "lightning fast" is to create a summary table that tracks, for each creator, the number of friend views per page.

You probably want to maintain it with triggers. Then the aggregation is already done for you, and fetching the most-viewed pages is a very simple query. You can make sure you have the appropriate indexes on the summary table, so the database doesn't even need to sort to return the most-viewed pages.

Summary tables are the key to maintaining good performance for aggregation queries in read-mostly environments. You do the work up front when updates happen (infrequently), and then the queries (which are frequent) require no extra work.
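As a rough illustration (the table and column names here are invented, not from the original post), a summary table keyed for exactly this query might look like:

```sql
-- Hypothetical summary table: one row per (creator, page), with a
-- maintained count of views by that creator's friends.
CREATE TABLE friend_page_views (
  creator_id INT NOT NULL,
  page_id    INT NOT NULL,
  view_count INT NOT NULL DEFAULT 0,
  PRIMARY KEY (creator_id, page_id),
  -- Lets the top-20 query read the index in order, with no filesort.
  KEY idx_creator_count (creator_id, view_count)
) ENGINE=InnoDB;

-- The hot-path query then becomes a simple index range scan:
SELECT page_id
FROM friend_page_views
WHERE creator_id = 143416
ORDER BY view_count DESC
LIMIT 20;
```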

If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can queue the view records and process them in the background, so that your users don't take the hit of updating the summary table as they browse pages. That approach also reduces contention in the database (fewer processes updating the summary table).


You should definitely look into denormalizing this data. If you create a separate table that maintains, per user id, an exact view count for each page, your query becomes much simpler.

You can easily maintain this table with a trigger on your views table that updates the views_summary table whenever an insert occurs in views.
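A minimal sketch of such a trigger (assuming a views_summary table with a unique key on (user_id, page_id); all names here are illustrative, not from the original post):

```sql
-- Hypothetical trigger: keep views_summary in step with views.
DELIMITER //
CREATE TRIGGER views_after_insert
AFTER INSERT ON views
FOR EACH ROW
BEGIN
  INSERT INTO views_summary (user_id, page_id, view_count)
  VALUES (NEW.user_id, NEW.page_id, 1)
  -- If the (user_id, page_id) row already exists, bump its counter.
  ON DUPLICATE KEY UPDATE view_count = view_count + 1;
END //
DELIMITER ;
```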

You might even denormalize it further by taking the actual friend relationships into account, or simply storing the top X pages per person.

Hope this helps,

Evert


Your indexes look correct, although if the friendships table has very wide rows, you might want an index on (creator_id, receiver_id) so the query doesn't have to read whole rows.

However, something is wrong: why is there a filesort for just 271 rows? Make sure your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size . That should make the GROUP BY faster.

sort_buffer_size should also be set to a sane value.
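For example, you could test session-level values like these before putting them in my.cnf (the numbers are illustrative starting points, not tuned recommendations):

```sql
-- Raise per-session buffers and re-run the query to see the effect.
SET SESSION tmp_table_size      = 67108864;  -- 64 MB
SET SESSION max_heap_table_size = 67108864;  -- 64 MB
SET SESSION sort_buffer_size    = 4194304;   -- 4 MB
```

Note that tmp_table_size only helps up to max_heap_table_size, so the two are usually raised together.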


Source: https://habr.com/ru/post/1286322/
