SQL: select distinct, but keep first?

According to another SO post (SQL: how to keep row order with DISTINCT?), DISTINCT has rather undefined sorting behavior.

I have a query:

select col_1 from table order by col_2 

This may return values such as:

 3 5 3 2 

I need to select the distinct values from these while keeping the order, that is, I want

 select distinct(col_1) from table order by col_2 

to return

 3 5 2 

but not

 5 3 2 

Here is what I am actually trying to do. col_1 is the user ID and col_2 is the timestamp of a login event for that user, so the same user (col_1) can have many logins. I am trying to build a historical list of users in the order in which they were first seen on the system. I would like to be able to say "this was our first user ever, this was our second user ever", and so on.

That post seems to suggest using GROUP BY, but GROUP BY is not meant to return rows in any particular order, so I do not see how or why it applies here, since it does not look like GROUP BY preserves order. In fact, another SO post gives an example in which GROUP BY destroys the ordering I am looking for: see Peter's answer in "What is the difference between GROUP BY and ORDER BY in SQL". Is there any way to guarantee the result I want? The strange thing is that if I were implementing the DISTINCT clause myself, I would surely do the ORDER BY first, then take the results and do a linear scan of the list, which would naturally preserve the order, so I am not sure why the behavior is so undefined.

EDIT:

Thanks everyone! I accepted IMSoP's answer because it not only gave an example I could play with (thanks for introducing me to SQL Fiddle), but also explained why things work the way they do instead of just saying "do this". In particular, it was not obvious to me that GROUP BY does not destroy the values in other columns outside the grouping (rather, it keeps them in some kind of internal list), and that those values can still be used in the ORDER BY clause.

+7
4 answers

This all comes down to the "logical order" of SQL statements. Although a DBMS may actually retrieve the data using all sorts of clever strategies, it has to behave according to some predictable logic. We can therefore think of the different parts of an SQL query as being processed "before" or "after" one another in terms of how that logic behaves.

As it happens, ORDER BY is the very last step in that logical sequence, so it cannot change the behavior of "earlier" steps.

If you use GROUP BY, the rows have already been grouped by the time SELECT is evaluated, let alone ORDER BY, so you can only look at the columns that were grouped by, or at "aggregate" values calculated across all the values in a group. (MySQL implements a controversial extension to GROUP BY where you can mention a column in SELECT that logically cannot be there, and it will pick a value from an arbitrary row in the group.)

If you use DISTINCT, it is logically processed after SELECT, but ORDER BY still comes afterwards. So only once DISTINCT has thrown away the duplicates are the remaining results put in a particular order, and the rows that were thrown away cannot be used to determine that order.


As for how to get the result you need, the key is to find a value to sort by that is still valid after GROUP BY / DISTINCT has (logically) run. Remember that if you use GROUP BY, any aggregate values are still valid: an aggregate function can look at all the values in a group. This includes MIN() and MAX(), which are ideal for ordering by, because "the lowest number" (MIN) is the same as "the first number if I sort them in ascending order", and vice versa for MAX.

So to order a set of distinct foo_number values based on the lowest applicable bar_number for each, you could use this:

 SELECT foo_number FROM some_table GROUP BY foo_number ORDER BY MIN(bar_number) ASC 

Here is a live demo with some arbitrary data .
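
The same idea can also be tried locally. The following is a minimal sketch using Python's built-in sqlite3 module with the sample values from the question; the table name logins and the timestamps 1 to 4 are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; "logins" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (col_1 INTEGER, col_2 INTEGER)")
conn.executemany(
    "INSERT INTO logins (col_1, col_2) VALUES (?, ?)",
    [(3, 1), (5, 2), (3, 3), (2, 4)],  # (user id, made-up timestamp)
)

# Every login, oldest first: user 3 shows up twice.
plain = [row[0] for row in conn.execute(
    "SELECT col_1 FROM logins ORDER BY col_2")]

# One row per user, ordered by that user's earliest login.
first_seen = [row[0] for row in conn.execute(
    "SELECT col_1 FROM logins GROUP BY col_1 ORDER BY MIN(col_2) ASC")]

print(plain)       # [3, 5, 3, 2]
print(first_seen)  # [3, 5, 2]
```

Because the sort key is an aggregate, it stays well defined after the duplicates have been collapsed into groups.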


UPDATE: The comments discussed why, if ordering were applied before grouping / de-duplication, that order could not simply carry over to the groups. Even if that were the case, you would still need a strategy for which row is kept in each group: the first or the last.

As an analogy, imagine the original set of rows as a set of playing cards picked from a deck and then sorted by face value, low to high. Now go through the sorted cards and deal them into a separate pile for each suit. Which card should "represent" each pile?

If you deal the cards face up, the cards showing at the end will be those with the highest face value (a "keep last" strategy); if you deal them face down and then flip each pile over, you will reveal the lowest face value (a "keep first" strategy). Both obey the original order of the cards, and the instruction "deal the cards into piles by suit" does not automatically tell the dealer (who represents the DBMS) which strategy was intended.

If the final piles of cards are the groups from a GROUP BY, then MIN() and MAX() represent picking up each pile and looking for the lowest or highest value, regardless of the order the cards are in. But because you can look inside the groups, you can do other things too, such as adding up the total value of each pile (SUM) or counting how many cards it holds (COUNT), which makes GROUP BY much more powerful than an "ordered DISTINCT" could be.
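
To make the "keep first" versus "keep last" distinction concrete, here is a small sketch, again using Python's sqlite3 module and the question's sample values (the table name logins and the timestamps are hypothetical). Ordering by MAX() instead of MIN() produces the other ordering mentioned in the question, and the same groups can be summarised with COUNT() as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (col_1 INTEGER, col_2 INTEGER)")
conn.executemany(
    "INSERT INTO logins (col_1, col_2) VALUES (?, ?)",
    [(3, 1), (5, 2), (3, 3), (2, 4)],  # (user id, made-up timestamp)
)

# "Keep last": order users by their most recent login.
keep_last = [row[0] for row in conn.execute(
    "SELECT col_1 FROM logins GROUP BY col_1 ORDER BY MAX(col_2)")]

# Looking inside each group: first login, last login, number of logins.
summary = conn.execute(
    "SELECT col_1, MIN(col_2), MAX(col_2), COUNT(*) "
    "FROM logins GROUP BY col_1 ORDER BY MAX(col_2)"
).fetchall()

print(keep_last)  # [5, 3, 2]
print(summary)    # [(5, 2, 2, 1), (3, 1, 3, 2), (2, 4, 4, 1)]
```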

+10

I would go for something like

 select col1 from ( select col1, rank () over(order by col2) pos from table ) group by col1 order by min(pos) 

In the subquery I calculate each row's position; in the main query I group by col1 and order by the smallest position within each group.
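
As a rough way to experiment with this without an Oracle instance, the sketch below runs the same shape of query through Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (required for window functions); the table name events and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (col1 INTEGER, col2 INTEGER)")
conn.executemany(
    "INSERT INTO events (col1, col2) VALUES (?, ?)",
    [(3, 1), (5, 2), (3, 3), (2, 4)],  # made-up sample rows
)

# Inner step: give every row its position when sorted by col2.
positions = conn.execute(
    "SELECT col1, RANK() OVER (ORDER BY col2) AS pos "
    "FROM events ORDER BY pos"
).fetchall()

# Outer step: one row per col1, ordered by its smallest position.
ordered = [row[0] for row in conn.execute("""
    SELECT col1
    FROM (SELECT col1, RANK() OVER (ORDER BY col2) AS pos FROM events)
    GROUP BY col1
    ORDER BY MIN(pos)
""")]

print(positions)  # [(3, 1), (5, 2), (3, 3), (2, 4)]
print(ordered)    # [3, 5, 2]
```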

Here's a demo in SQLFiddle (it was for Oracle; the MySQL information was added later).

Edit for MySQL:

 select col1 from ( select col1 col1, @curRank := @curRank + 1 AS pos from table1, (select @curRank := 0) p ) sub group by col1 order by min(pos) 

And here is the demo for MySQL.

+1

The GROUP BY in the referenced answer is not trying to do the ordering; it is just picking a single associated value for the column we want distinct values of.

As @bluefeet states, if you want a guaranteed order you must use ORDER BY.

Why can't we specify a value in ORDER BY that is not included in SELECT DISTINCT?

Consider the following values for col_1 and col_2:

 create table yourTable (
   col_1 int,
   col_2 int
 );
 insert into yourTable (col_1, col_2) values (1, 1);
 insert into yourTable (col_1, col_2) values (1, 3);
 insert into yourTable (col_1, col_2) values (2, 2);
 insert into yourTable (col_1, col_2) values (2, 4);

With this data, what should SELECT DISTINCT col_1 FROM yourTable ORDER BY col_2 return?

To get this to work, you need GROUP BY and an aggregate function to decide which of the multiple values of col_2 to order by: maybe MIN(), maybe MAX(), maybe even some other function such as AVG() would make sense in some cases. It all depends on the specific scenario, which is why you need to be explicit:

 select col_1 from yourTable group by col_1 order by min(col_2) 

SQL Fiddle Here

+1

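In MySQL only, when you select columns that are not in the GROUP BY, their values come from one row of each group, in practice often the first one read. MySQL documents that choice as nondeterministic, the ORDER BY in a derived table may be ignored by the optimizer, and the query is rejected when ONLY_FULL_GROUP_BY is enabled (the default since MySQL 5.7.5), so the result is not guaranteed. Where it does work, you can use this behavior to choose which record is returned from each group, for example: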

 SELECT foo_number, bar_number FROM ( SELECT foo_number, bar_number FROM some_table ORDER BY bar_number ) AS t GROUP BY foo_number ORDER BY bar_number DESC; 

This is more flexible because it lets you order the records within each group using expressions that aggregates cannot express. In my case, I wanted to return the row with the shortest string in another column.

For completeness, my query looks like this:

 SELECT s.NamespaceId, s.Symbol, s.EntityName
 FROM (
   SELECT m.NamespaceId, i.Symbol, i.EntityName
   FROM ImportedSymbols i
   JOIN ExchangeMappings m ON i.ExchangeMappingId = m.ExchangeMappingId
   WHERE i.Symbol NOT IN (
     SELECT Symbol FROM tmp_EntityNames WHERE NamespaceId = m.NamespaceId
   )
   AND i.EntityName IS NOT NULL
   ORDER BY LENGTH(i.RawSymbol), i.RawSymbol
 ) AS s
 GROUP BY s.NamespaceId, s.Symbol;

This returns a distinct list of symbols in each namespace and, for duplicated symbols, returns the one with the shortest RawSymbol. When the RawSymbol lengths are equal, it returns the one whose RawSymbol comes first alphabetically.

0

Source: https://habr.com/ru/post/956121/

