What happens in this MYSQL query with the left join and group (on the wrong column)?

I have the following tables:

create temporary table Items (item_id int, item_name varchar(10)); create temporary table ItemRating (item_id int, rating int); 

With the following data:

 insert into Items (item_id, item_name) values (1,'Item 1'),(2,'Item 2'),(3,'Item 3'),(4,'Item 4'),(5,'Item 5'); insert into ItemRating values (1,9),(1,6),(3,10); 

And I run the following query:

 select i.item_id, i.item_name, avg(ir.rating) from Items i left join ItemRating ir ON ir.item_id = i.item_id group by ir.item_id; 

As a result, I get:

 +---------+-----------+----------------+ | item_id | item_name | avg(ir.rating) | +---------+-----------+----------------+ | 2 | Item 2 | NULL | | 1 | Item 1 | 7.5000 | | 3 | Item 3 | 10.0000 | +---------+-----------+----------------+ 

Now I fully understand that the request is spelled incorrectly, I want the group to execute on i.item_id. But I do not understand this behavior. Why does MYSQL display item_id 2 in the results, but not 4 or 5? I would really expect to see only items 1 and 3, because they are the only ones that have the corresponding entry in ItemRating.

So, can anyone explain to me what MYSQL is doing here?

+4
source share
3 answers

This is what happens. Consider the query in parts and what MySQL processes when it goes.

First you choose from the elements ( select i.item_id, i.item_name, avg(ir.rating) from Items i ):

 +---------+-----------+ | item_id | item_name | +---------+-----------+ | 1 | Item 1 | | 2 | Item 2 | | 3 | Item 3 | | 4 | Item 4 | | 5 | Item 5 | +---------+-----------+ 

Then you join the ratings ( left join ItemRating ir ON ir.item_id = i.item_id ). Note that element 1 appears in two rows after joining, because the way JOIN is defined to work is that it returns one row for each matching join condition (and LEFT basically means "return every row in the first table at least once, even if there are no matching conditions on this line).

 +---------+-----------+-----------+------------+ | item_id | item_name | ir.rating | ir.item_id | +---------+-----------+-----------+------------+ | 1 | Item 1 | 9 | 1 | | 1 | Item 1 | 6 | 1 | | 2 | Item 2 | NULL | NULL | | 3 | Item 3 | 10 | 3 | | 4 | Item 4 | NULL | NULL | | 5 | Item 5 | NULL | NULL | +---------+-----------+-----------+------------+ 

Finally, you are grouped by rating ( group by ir.item_id ). This will return one line for each unique ir.item_id file. There are three unique ir.item_ids (as you can see in the last column): 1 and NULL and 3 . For each of them, it returns one row and averages the rating.

So, for 1 we have:

 +---------+-----------+-----------+------------+ | item_id | item_name | ir.rating | ir.item_id | +---------+-----------+-----------+------------+ | 1 | Item 1 | 9 | 1 | | 1 | Item 1 | 6 | 1 | +---------+-----------+-----------+------------+ 

Collapse:

 +---------+-----------+----------------+------------+ | item_id | item_name | avg(ir.rating) | ir.item_id | +---------+-----------+----------------+------------+ | 1 | Item 1 | 7.5 | 1 | +---------+-----------+----------------+------------+ 

For NULL we have:

 +---------+-----------+-----------+------------+ | item_id | item_name | ir.rating | ir.item_id | +---------+-----------+-----------+------------+ | 2 | Item 2 | NULL | NULL | | 4 | Item 4 | NULL | NULL | | 5 | Item 5 | NULL | NULL | +---------+-----------+-----------+------------+ 

Collapse:

 +---------+-----------+----------------+------------+ | item_id | item_name | avg(ir.rating) | ir.item_id | +---------+-----------+----------------+------------+ | 2| Item 2 | NULL | NULL | +---------+-----------+----------------+------------+ 

For 3 we have:

 +---------+-----------+-----------+------------+ | item_id | item_name | ir.rating | ir.item_id | +---------+-----------+-----------+------------+ | 3 | Item 3 | 10 | 3 | +---------+-----------+-----------+------------+ 

Collapse:

 +---------+-----------+----------------+------------+ | item_id | item_name | avg(ir.rating) | ir.item_id | +---------+-----------+----------------+------------+ | 3 | Item 3 | 10 | 3 | +---------+-----------+----------------+------------+ 

The combination of three collapsed results gives:

 +---------+-----------+----------------+------------+ | item_id | item_name | avg(ir.rating) | ir.item_id | +---------+-----------+----------------+------------+ | 1 | Item 1 | 7.5 | 1 | | 3 | Item 3 | 10 | 3 | | 2 | Item 2 | NULL | NULL | +---------+-----------+----------------+------------+ 

What did you get.

One tricky part is the way to collapse NULL strings. Recall that these were zero lines:

 +---------+-----------+-----------+------------+ | item_id | item_name | ir.rating | ir.item_id | +---------+-----------+-----------+------------+ | 2 | Item 2 | NULL | NULL | | 4 | Item 4 | NULL | NULL | | 5 | Item 5 | NULL | NULL | +---------+-----------+-----------+------------+ 

When you execute a group, most database systems will not even let you select columns that are not part of the group. MySQL is an exception. Since you are only grouping on ir.rating, only one of them will allow you to choose, because there is no clear way to collapse three lines in a non-aggressive way. What MySQL does is simply select the first one it encounters and use the values ​​in this row as a collapsed value. So, (2,4,5) => (2) and (Clause 2, Clause 4, Clause 5) => Clause 2 and (NULL, NULL, NULL) => NULL. This is why you only see line 2 (in fact, you see three folded lines that look like line 2).

To really see this in action and launch a point at home, consider this query:

 select group_concat(i.item_id), group_concat(i.item_name), avg(ir.rating) from Items i left join ItemRating ir ON ir.item_id = i.item_id group by ir.item_id; 

This is similar to your original query, except that all three selected columns now have group aggregate functions. I use GROUP_CONCAT , which simply concatenates strings to form a minimized version (this would be true on other SQL systems besides MySQL). This returns the following:

 +-------------------------+---------------------------+----------------+ | group_concat(i.item_id) | group_concat(i.item_name) | avg(ir.rating) | +-------------------------+---------------------------+----------------+ | 2,4,5 | Item 2,Item 4,Item 5 | NULL | | 1,1 | Item 1,Item 1 | 7.5000 | | 3 | Item 3 | 10.0000 | +-------------------------+---------------------------+----------------+ 
+3
source

Here are your results after combining and before the group

 +---------+-----------+----------------+-----------+ | i.item_id | i.item_name | ir.rating | ir.item_id | +---------+-----------+----------------+-----------+ | 1 | Item 1 | 9 | 1 | | 1 | Item 1 | 6 | 1 | | 2 | Item 2 | null | null | | 3 | Item 3 | 10 | 3 | | 4 | Item 4 | null | null | | 5 | Item 5 | null | null | +---------+-----------+----------------+-----------+ 

You are grouping the column ir.item_id, which has only 3 different values ​​... 1,3 and null.

Apparently he takes the first name item_name, although I doubt that he has documented what he is doing, which cannot be relied on. The bottom line should be an error.

What you really need is the i.item_id group, i.item_name

+2
source

Left union brings all values, but you group in the item_id the ItemRating form so that you get only 3 values

0
source

Source: https://habr.com/ru/post/1384214/


All Articles