PostgreSQL - column value changed - select query optimization

Let's say we have a table:

CREATE TABLE p ( id serial NOT NULL, val boolean NOT NULL, PRIMARY KEY (id) ); 

Populated with several rows:

 insert into p (val) values (true),(false),(false),(true),(true),(true),(false); 
  ID VAL
 1 1
 2 0
 3 0
 4 1
 5 1
 6 1
 7 0

I want to determine when the value was changed. Therefore, the result of my query should be:

  ID VAL
 2 0
 4 1
 7 0

I have a solution with joins and subqueries:

 select min(id) id, val
 from (
     select p1.id, p1.val, max(p2.id) last_prev
     from p p1
     join p p2 on p2.id < p1.id and p2.val != p1.val
     group by p1.id, p1.val
 ) tmp
 group by val, last_prev
 order by id;

But it is very inefficient and will be slow on tables with many rows.
Is there a more efficient solution, perhaps using PostgreSQL window functions?

+6
5 answers

I would do this with analytic (window) functions:

 SELECT id, val
 FROM (
     SELECT id, val,
            LAG(val) OVER (ORDER BY id) AS prev_val
     FROM p
 ) x
 WHERE val <> COALESCE(prev_val, val)
 ORDER BY id;

Update (some explanation):

Analytic functions work as a post-processing step. The query result is divided into groups ( partition by ), and the analytic function is evaluated within each group.

In this case, the query is a select from p , with the LAG analytic function applied. Since there is no partition by clause, there is only one group: the whole result set. This group is ordered by id . LAG returns the value from the previous row in the group, using the specified order. As a result, each row gets an additional column (aliased prev_val) that holds the val of the previous row. This is the subquery.

Then we look for rows where val does not match the val of the previous row (prev_val). COALESCE handles the special case of the first row, which has no previous value.
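To make the two steps concrete, the sketch below runs the subquery on its own and then the full query against the sample data. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL. That is an assumption made only to keep the example self-contained (window functions need SQLite 3.25 or later, and booleans are stored as 1/0). The SQL is the query from this answer.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (id INTEGER PRIMARY KEY, val BOOLEAN NOT NULL)")
conn.executemany(
    "INSERT INTO p (val) VALUES (?)",
    [(v,) for v in (1, 0, 0, 1, 1, 1, 0)],
)

# Step 1: the subquery alone, showing the extra prev_val column.
with_prev = conn.execute(
    "SELECT id, val, LAG(val) OVER (ORDER BY id) AS prev_val FROM p ORDER BY id"
).fetchall()

# Step 2: the full query, keeping only rows whose val differs from prev_val.
changes = conn.execute(
    """
    SELECT id, val
    FROM (
        SELECT id, val, LAG(val) OVER (ORDER BY id) AS prev_val
        FROM p
    ) x
    WHERE val <> COALESCE(prev_val, val)
    ORDER BY id
    """
).fetchall()

for row in with_prev:
    print(row)      # (id, val, prev_val); prev_val is None for the first row
print(changes)
```

The first row has no predecessor, so its prev_val is NULL (None in Python). COALESCE replaces it with the row's own val, which is why row 1 is never reported as a change.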

Analytic functions may seem a little strange at first, but a search for analytic functions turns up many examples that walk through how they work. For example: http://www.cs.utexas.edu/~cannata/dbms/Analytic%20Functions%20in%20Oracle%208i%20and%209i.htm Just remember that this is a post-processing step: you cannot filter on the value of an analytic function unless you wrap it in a subquery.

+4

Window function

Instead of calling COALESCE , you can supply a default value directly as the third parameter of the window function lag() . It is a minor detail in this case, since all columns are defined NOT NULL . But it can matter when you need to distinguish "no previous row" from "NULL in the previous row".

 SELECT id, val
 FROM (
     SELECT id, val,
            lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
     FROM p
 ) sub
 WHERE changed
 ORDER BY id;

Compute the result of the comparison directly, since the previous value is not of interest in itself, only whether it changed. Shorter, and possibly a little faster.

If you consider the first row "changed" (unlike your demo output), you need to account for NULL values, even though your columns are defined NOT NULL . Basic lag() returns NULL if there is no previous row:

 SELECT id, val
 FROM (
     SELECT id, val,
            lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
     FROM p
 ) sub
 WHERE changed
 ORDER BY id;

Or, again, use the additional parameters of lag() :

 SELECT id, val
 FROM (
     SELECT id, val,
            lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
     FROM p
 ) sub
 WHERE changed
 ORDER BY id;
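The effect of the third lag() parameter on the first row can be checked with a small sketch. As an assumption for the sake of a self-contained example, it runs the queries through Python's built-in sqlite3 module (SQLite 3.25 or later) rather than PostgreSQL. The QUERY template and variable names are illustrative only.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (id INTEGER PRIMARY KEY, val BOOLEAN NOT NULL)")
conn.executemany(
    "INSERT INTO p (val) VALUES (?)",
    [(v,) for v in (1, 0, 0, 1, 1, 1, 0)],
)

# Same query shape as above; only the default passed to lag() varies.
QUERY = """
    SELECT id, val
    FROM (
        SELECT id, val,
               lag(val, 1, {default}) OVER (ORDER BY id) <> val AS changed
        FROM p
    ) sub
    WHERE changed
    ORDER BY id
"""

# Default = val: the first row compares equal to itself, so it is skipped.
first_row_skipped = conn.execute(QUERY.format(default="val")).fetchall()

# Default = NOT val: the first row always differs, so it is reported.
first_row_reported = conn.execute(QUERY.format(default="NOT val")).fetchall()

print(first_row_skipped)
print(first_row_reported)
```

The default expression is evaluated against the current row, which is what lets `val` and `NOT val` act as "same as me" and "opposite of me" for the row that has no predecessor.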

Recursive CTE

As a proof of concept. :) Performance will not keep up with the alternatives already posted.

 WITH RECURSIVE cte AS (
     SELECT id, val
     FROM p
     WHERE NOT EXISTS (SELECT 1 FROM p p0 WHERE p0.id < p.id)
     UNION ALL
     SELECT p.id, p.val
     FROM cte
     JOIN p ON p.id > cte.id AND p.val <> cte.val
     WHERE NOT EXISTS (
         SELECT 1 FROM p p0
         WHERE p0.id > cte.id AND p0.val <> cte.val AND p0.id < p.id
     )
 )
 SELECT * FROM cte;

With an improvement from @wildplasser.

+4

This can even be done without window functions.

 SELECT *
 FROM p p0
 WHERE EXISTS (
     SELECT * FROM p ex
     WHERE ex.id < p0.id
       AND ex.val <> p0.val
       AND NOT EXISTS (
           SELECT * FROM p nx
           WHERE nx.id < p0.id AND nx.id > ex.id
       )
 );

UPDATE: self-join with non-recursive CTE (can also be a subquery instead of CTE)

 WITH drag AS (
     SELECT id,
            rank() OVER (ORDER BY id) AS rnk,
            val
     FROM p
 )
 SELECT d1.*
 FROM drag d1
 JOIN drag d0 ON d0.rnk = d1.rnk - 1
 WHERE d1.val <> d0.val;

This non-recursive CTE approach is surprisingly fast, although it does need an implicit sort.
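One reason to join on rank() rather than on id is that the rank stays consecutive even when id has gaps. The sketch below illustrates this on hypothetical data with gaps (the ids 10, 20, 35, 36, 50 are made up for the example). It runs through Python's built-in sqlite3 module as a stand-in for PostgreSQL, which is an assumption and requires SQLite 3.25 or later.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (id INTEGER PRIMARY KEY, val BOOLEAN NOT NULL)")

# Hypothetical rows with gaps in id, e.g. after some rows were deleted.
conn.executemany(
    "INSERT INTO p (id, val) VALUES (?, ?)",
    [(10, 1), (20, 1), (35, 0), (36, 0), (50, 1)],
)

# rank() over a unique id yields 1, 2, 3, ... regardless of the gaps.
ranked = conn.execute(
    "SELECT id, rank() OVER (ORDER BY id) AS rnk, val FROM p ORDER BY id"
).fetchall()

# The self-join from this answer, with an ORDER BY added for stable output.
changes = conn.execute(
    """
    WITH drag AS (
        SELECT id, rank() OVER (ORDER BY id) AS rnk, val
        FROM p
    )
    SELECT d1.id, d1.rnk, d1.val
    FROM drag d1
    JOIN drag d0 ON d0.rnk = d1.rnk - 1
    WHERE d1.val <> d0.val
    ORDER BY d1.id
    """
).fetchall()

print(ranked)
print(changes)
```

Each row is paired with its immediate predecessor by rank, so the changes at id 35 and id 50 are found even though neither has a neighbour at id - 1.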

+2

Using two row_number() calculations: it is also possible to do this with the usual "gaps and islands" SQL technique (this may be useful if you cannot use the lag() window function for some reason).

 with cte1 as (
     select *,
            row_number() over(order by id) as rn1,
            row_number() over(partition by val order by id) as rn2
     from p
 )
 select *, rn1 - rn2 as g
 from cte1
 order by id;

So this query will give you all the islands

 ID VAL RN1 RN2 G
 1  1   1   1   0
 2  0   2   1   1
 3  0   3   2   1
 4  1   4   2   2
 5  1   5   3   2
 6  1   6   4   2
 7  0   7   3   4

You see how the G field can be used to group these islands:

 with cte1 as (
     select *,
            row_number() over(order by id) as rn1,
            row_number() over(partition by val order by id) as rn2
     from p
 )
 select min(id) as id, val
 from cte1
 group by val, rn1 - rn2
 order by 1;

So you get

 ID VAL
 1  1
 2  0
 4  1
 7  0

The only thing left is to remove the first record, which can be done by getting the minimum id with the min(...) over() window function:

 with cte1 as ( ... ),
 cte2 as (
     select min(id) as id, val,
            min(min(id)) over() as mid
     from cte1
     group by val, rn1 - rn2
 )
 select id, val
 from cte2
 where id <> mid;

And the results:

 ID VAL
 2  0
 4  1
 7  0
+1

A simple inner join can do this, provided the id values are consecutive with no gaps.

 select p2.id, p2.val
 from p p1
 inner join p p2 on p2.id = p1.id + 1
 where p2.val != p1.val;
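The join condition p2.id = p1.id + 1 only pairs rows whose ids are adjacent integers. As a rough check, the sketch below runs this join and a lag() based query side by side on hypothetical data with gaps in id (the ids are made up for the example). Python's built-in sqlite3 module stands in for PostgreSQL, which is an assumption and requires SQLite 3.25 or later for lag().

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (id INTEGER PRIMARY KEY, val BOOLEAN NOT NULL)")

# Hypothetical rows with gaps in id; val changes at id 35 and id 50.
conn.executemany(
    "INSERT INTO p (id, val) VALUES (?, ?)",
    [(10, 1), (20, 1), (35, 0), (36, 0), (50, 1)],
)

# The id + 1 join: only the pair (35, 36) is adjacent, and its val is equal.
via_join = conn.execute(
    """
    SELECT p2.id, p2.val
    FROM p p1
    INNER JOIN p p2 ON p2.id = p1.id + 1
    WHERE p2.val != p1.val
    ORDER BY p2.id
    """
).fetchall()

# A lag() based query compares each row with its predecessor in id order.
via_lag = conn.execute(
    """
    SELECT id, val
    FROM (
        SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
        FROM p
    ) sub
    WHERE changed
    ORDER BY id
    """
).fetchall()

print(via_join)
print(via_lag)
```

On gapless ids, such as the sample table in the question, both queries return the same rows. The difference only shows up once ids are missing, for example after deletes or rolled-back inserts on a serial column.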
0

Source: https://habr.com/ru/post/970505/

