How to denote "transitive groups" of SQL?

Question

How to denote "transitive groups" of SQL?

I have a table with pairs of identifiers that are in transitive relation t, that is, if "A t B" AND "B t C", then "A t C". Example:

table T1 ID1 | ID2 1 | 2 1 | 5 4 | 7 7 | 8 9 | 1

So there are two groups,

g1 : {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
g2 : {4,7,8} because "4 t 7" and "7 t 8"

and I need to create a "clean and standard SQL" new table or view:

  table T2 ID1 | ID2 | LABEL 1 | 2 | 1 1 | 5 | 1 4 | 7 | 2 7 | 8 | 2 9 | 1 | 1

PS -1: we can list "transitive groups" on

  SELECT DISTINCT label, id FROM (SELECT id1 as id, * FROM T2) UNION (SELECT id2 as id, * FROM T2) ORDER BY 1,2;

PS -2: I use PostgreSQL 9.1, but if there is a solution with standard SQL, I prefer.

+6

algorithm sql postgresql

Peter Krauss Aug 3 '13 at 12:58

source share

2 answers

You can do it in Postgres; You cannot do this in all databases. Here is the request:

 with recursive cte(id1, id2) as ( select id1, id2, 1 as level from t union all select t.id1, cte.id2, cte.level + 1 from t join cte on t.id2 = cte.id1 ) select id1, id2, dense_rank() over (order by grp) as label from (select id1, id2, least(min(id2) over (partition by id1), min(id1) over (partition by id2)) as grp, level from cte ) t where level = 1;

Using SQL Fiddle here .

You go through the tree structure to assign a label (for example, loops can cause problems with this particular version). In Postgres, you can do this using the explicit recursive CTE. In SQL Server, you can do this with a CTE, which is implicitly "recursive" (keyword not used). In Oracle, you can do this with connect by .

A recursive CTE receives all pairs that are related to each other. The main request then assigns the pair the minimum value id1 and id2 to identify all pairs that are related to each other. The final label is simply created by assigning a consistent grp value.

EDIT:

Egor makes a very good moment. The above assumes that identifiers “descend” to lower values. The next version instead uses the highest level for each identifier for grouping (which is really intended):

 with recursive cte(id1, id2) as ( select id1, id2, 1 as level from t union all select t.id1, cte.id2, cte.level + 1 from t join cte on t.id2 = cte.id1 -- where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2) ) select id1, id2, dense_rank() over (order by topvalue) as label from (select id1, id2, first_value(id2) over (partition by id1 order by level desc) as topvalue, level from cte ) t where level = 1;

EDIT II:

In response to Yegor’s second comment. This data is a bit problematic regarding the original problem. The following breaks it into two parts:

 with recursive cte as ( select id1, id2, id2 as last, id1||','||id2 as grp, 1 as level from t where id2 not in (select id1 from t) union all select t.id1, t.id2, cte.last, cte.grp, cte.level + 1 from t join cte on t.id2 = cte.id1 -- where not exists (select 1 from cte cte2 where cte2.id1 = t.id1 and cte2.id2 = t.id2) ) select * from cte;

But it is unclear whether this is really what the original wanted. He would break the original into three groups that overlap, because in the second column there are three identifiers that never appear in the first column. This is about commutativity.

+4

Gordon linoff Aug 3 '13 at 14:03

source share

Peter Krauss · Accepted Answer · 2013-12-16T10:21:54+0000

Now, the new demand in 2013, I need to work with 10,000 itens : using the elegant solution @GordonLinoff (above), it takes 1 second with 1000 yen, it takes 1 day from 2000 ... It does not have good performance . The performance issue was also here ,

@NealB Algorithm

(This is the best solution , so fast!) See the original and didactic description . Here table T1 coincides with the question-text, and the second (temporary) table R used to process and display the results,

  CREATE TABLE R ( id integer NOT NULL, -- PRIMARY KEY, label integer NOT NULL DEFAULT 0 ); CREATE FUNCTION t1r_labeler() RETURNS void AS $funcBody$ DECLARE label1 integer; label2 integer; newlabel integer; t t1%rowtype; BEGIN DELETE FROM R; INSERT INTO R(id) SELECT DISTINCT unnest(array[id1,id2]) FROM T1 ORDER BY 1; newlabel:=0; FOR t IN SELECT * FROM t1 LOOP -- -- BASIC LABELING: -- -- SELECT label INTO label1 FROM R WHERE id=t.id1; SELECT label INTO label2 FROM R WHERE id=t.id2; IF label1=0 AND label2=0 THEN newlabel:=newlabel+1; UPDATE R set label=newlabel WHERE ID in (t.id1,t.id2); ELSIF label1=0 AND label2!=0 THEN UPDATE R set label=label2 WHERE ID=t.id1; ELSIF label1!=0 AND label2=0 THEN UPDATE R set label=label1 WHERE ID=t.id2; ELSIF label1!=label2 THEN -- time consuming UPDATE tmp.R set label=label1 WHERE label = label2; END IF; END LOOP; END; $funcBody$ LANGUAGE plpgsql VOLATILE;

Preparation and launch,

  -- same CREATE TABLE T1 (id1 integer, id2 integer); DELETE FROM T1; INSERT INTO T1(id1,id2) -- populate the standard input VALUES (1, 2), (1, 5), (4, 7), (7, 8), (9, 1); -- or SELECT id1, id2 FROM table_with_1000000_items; SELECT t1r_labeler(); -- run SELECT * FROM R ORDER BY 2; -- show

Worst case management

The last condition, when label1!=label2 , is the most time-consuming operation and should be avoided or it can be separated in cases of high connectivity, which are the worst.

To report some kind of warning, you can calculate the fraction of the time that the procedure fulfills the last condition, and / cor can split the last update.

If you separate, you can analyze and do a little better with them. So, removing the last ELSIF and adding your checks and this second cycle after the first cycle:

  -- ... first loop and checks here ... FOR t IN SELECT * FROM tmp.t1 LOOP -- -- MERGING LABELS: -- -- SELECT label INTO label1 FROM R WHERE id=t.id1; SELECT label INTO label2 FROM R WHERE id=t.id2; IF label1!=0 AND label2!=0 AND label1!=label2 THEN UPDATE R set label=label1 WHERE label=label2; END IF; END LOOP; -- ...

An example of a worst case scenario: a group with more than 1000 (connected) nodes per 10000 nodes with an average length of “10 per marked group” (kernels) and only a few paths connecting the kernels.

Array Oriented Algorithm

This other solution is slower (this is a brute force algorithm), but can be used when you need direct processing with arrays and don't have to solve it so quickly (and not have the “worst cases”).

As @ peter.petrov and @RBarryYoung suggest using a more adequate data structure ... I returned to my arrays as a "more adequate data structure." In the end, there is good acceleration (compared to the @GordonLinoff algorithm) with the solution below (!) .

The first step is to transfer table T1 of the question text to a temporary one, transgroup1 , where we can calculate a new process,

  -- DROP table transgroup1; CREATE TABLE transgroup1 ( id serial NOT NULL PRIMARY KEY, items integer[], -- two or more items in the transitive relationship dels integer[] DEFAULT array[]::integer[] ); INSERT INTO transgroup1(items) SELECT array[id1, id2] FROM t1; -- now suppose t1 a 10000 items table;

with these two functions we can solve the problem,

  CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$ -- ensures distinct items of a concatemation SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2)) $$ LANGUAGE sql immutable; CREATE FUNCTION transgroup1_loop() RETURNS void AS $BODY$ DECLARE cp_dels integer[]; i integer; max_i integer; BEGIN i:=1; max_i:=10; -- or 100 or more, but need some control to be secure LOOP UPDATE transgroup1 SET items = array_uunion(transgroup1.items,t2.items), dels = transgroup1.dels || t2.id FROM transgroup1 AS t1, transgroup1 AS t2 WHERE transgroup1.id=t1.id AND t1.id>t2.id AND t1.items && t2.items; cp_dels := array( SELECT DISTINCT unnest(dels) FROM transgroup1 ); -- ensures all itens to del EXIT WHEN i>max_i OR array_length(cp_dels,1)=0; DELETE FROM transgroup1 WHERE id IN (SELECT unnest(cp_dels)); UPDATE transgroup1 SET dels=array[]::integer[]; i:=i+1; END LOOP; UPDATE transgroup1 -- only to beautify SET items = ARRAY(SELECT unnest(items) ORDER BY 1 desc); END; $BODY$ LANGUAGE plpgsql VOLATILE;

of course, to run and view the results you can use

  SELECT transgroup1_loop(); -- not 1 day but some hours! SELECT *, dense_rank() over (ORDER BY id) AS group from transgroup1;

as a result

  id | items | ssg_label | dels | group ----+-----------+-----------+------+------- 4 | {8,7,4} | 1 | {} | 1 5 | {9,5,2,1} | 1 | {} | 2

How to denote "transitive groups" of SQL?

@NealB Algorithm

Array Oriented Algorithm

More articles: