Now, with the new demand of 2013, I need to work with 10,000 items: using @GordonLinoff's elegant solution (above), it takes 1 second for 1,000 items, and more than 1 day for 2,000... It does not have good performance. The same performance issue was reported here.
@NealB Algorithm
(This is the best solution, so fast!) See the original and didactic description. Here table T1 is the same as in the question text, and a second (temporary) table R is used to process and display the results:
CREATE TABLE R (
  id integer NOT NULL,
  label integer NOT NULL DEFAULT 0  -- 0 = not yet labeled
);
Preparation and launch:
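The full preparation code is not reproduced in this excerpt; below is a minimal sketch, assuming the usual structure of @NealB's labeling procedure (the function name nealb_label is illustrative): populate R with one row per distinct node of tmp.t1, then scan the edges, creating and merging labels.

-- Sketch only: populate R with every distinct node; label 0 = unlabeled.
INSERT INTO R(id)
  SELECT id1 FROM tmp.t1 UNION SELECT id2 FROM tmp.t1;

CREATE FUNCTION nealb_label() RETURNS void AS $BODY$
DECLARE
  label1 integer;
  label2 integer;
  newlabel integer := 0;
  t RECORD;
BEGIN
  FOR t IN SELECT * FROM tmp.t1 LOOP
    SELECT label INTO label1 FROM R WHERE id=t.id1;
    SELECT label INTO label2 FROM R WHERE id=t.id2;
    IF label1=0 AND label2=0 THEN          -- both new: open a new group
      newlabel := newlabel + 1;
      UPDATE R SET label=newlabel WHERE id IN (t.id1, t.id2);
    ELSIF label1!=0 AND label2=0 THEN      -- attach id2 to id1's group
      UPDATE R SET label=label1 WHERE id=t.id2;
    ELSIF label1=0 AND label2!=0 THEN      -- attach id1 to id2's group
      UPDATE R SET label=label2 WHERE id=t.id1;
    ELSIF label1!=label2 THEN              -- the expensive merge (see below)
      UPDATE R SET label=label1 WHERE label=label2;
    END IF;
  END LOOP;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;

SELECT nealb_label();  -- launch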
Worst-case management
The last condition, when label1!=label2, triggers the most time-consuming operation, and it should be avoided, or split into a separate pass, in cases of high connectivity, which are the worst ones.
To report some kind of warning, you can compute the fraction of iterations in which the procedure hits this last condition, and/or you can split off the last UPDATE.
If you split it off, you can analyze the merge steps and handle them a little better. So, remove the last ELSIF, add your checks, and run this second loop after the first one:
-- ... first loop and checks here ...
FOR t IN SELECT * FROM tmp.t1 LOOP
  --
  -- MERGING LABELS:
  --
  SELECT label INTO label1 FROM R WHERE id=t.id1;
  SELECT label INTO label2 FROM R WHERE id=t.id2;
  IF label1!=0 AND label2!=0 AND label1!=label2 THEN
    UPDATE R SET label=label1 WHERE label=label2;
  END IF;
END LOOP;
-- ...
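To make the warning idea concrete: a hypothetical pair of counters could be added to this second loop to measure how often the costly merge fires (n_merges, n_edges and the NOTICE text are illustrative, not part of the original procedure):

-- Sketch: count the fraction of edges that trigger the costly merge.
-- DECLARE ... n_merges integer := 0; n_edges integer := 0;
IF label1!=0 AND label2!=0 AND label1!=label2 THEN
  UPDATE R SET label=label1 WHERE label=label2;
  n_merges := n_merges + 1;
END IF;
n_edges := n_edges + 1;
-- after the loop:
RAISE NOTICE 'merge fraction: %', n_merges::float / GREATEST(n_edges, 1);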
An example of a worst-case scenario: a group with more than 1,000 (connected) nodes out of 10,000 nodes, where the remaining nodes form small labeled groups (kernels) of about 10 nodes each, and only a few paths connect the kernels.
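If you want to reproduce such a worst case for benchmarking, a hypothetical generator could look like the sketch below (all sizes and ids are illustrative):

-- Sketch: one ~1000-node connected chain, ~900 kernels of ~10 nodes each,
-- and only a few edges connecting kernels (these cause the costly merges).
INSERT INTO tmp.t1(id1, id2)
SELECT g, g + 1
FROM generate_series(1, 999) AS g            -- the big connected group
UNION ALL
SELECT k*10 + s, k*10 + s + 1                -- small 10-node kernels
FROM generate_series(101, 999) AS k,
     generate_series(0, 8) AS s
UNION ALL
VALUES (1015, 2015), (3015, 4015);           -- a few paths between kernels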
Array-Oriented Algorithm
This other solution is slower (it is a brute-force algorithm), but it can be used when you need direct processing with arrays, don't have to solve the problem so quickly, and don't have the “worst cases”.
As @peter.petrov and @RBarryYoung suggested using a more adequate data structure ... I returned to my arrays as that "more adequate data structure". In the end, there is a good speed-up (compared to the @GordonLinoff algorithm) with the solution below (!).
The first step is to transfer table T1 of the question text into a temporary table, transgroup1, where we can compute the new process:
-- DROP TABLE transgroup1;
CREATE TABLE transgroup1 (
  id serial NOT NULL PRIMARY KEY,
  items integer[], -- two or more items in the transitive relationship
  dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items)
  SELECT array[id1, id2] FROM t1;  -- now suppose t1 is a 10,000-item table
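If you want a self-contained test, the t1 used above can be a small edge table; the pairs below are reconstructed from the result table shown at the end of this answer (they produce the groups {8,7,4} and {9,5,2,1}):

-- Hypothetical sample input, matching the result displayed below.
CREATE TABLE t1 (id1 integer, id2 integer);
INSERT INTO t1(id1, id2) VALUES
  (1, 2), (1, 5), (4, 7), (7, 8), (9, 1);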
With these two functions we can solve the problem:
CREATE FUNCTION array_uunion(anyarray, anyarray) RETURNS anyarray AS $$
  -- ensures distinct items of a concatenation
  SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql IMMUTABLE;

CREATE FUNCTION transgroup1_loop() RETURNS void AS $BODY$
DECLARE
  cp_dels integer[];
  i integer;
  max_i integer;
BEGIN
  i := 1;
  max_i := 10;  -- or 100 or more, but some limit is needed to be safe
  LOOP
    UPDATE transgroup1
      SET items = array_uunion(transgroup1.items, t2.items),
          dels  = transgroup1.dels || t2.id
      FROM transgroup1 AS t1, transgroup1 AS t2
      WHERE transgroup1.id = t1.id
        AND t1.id > t2.id
        AND t1.items && t2.items;

    cp_dels := array(
      SELECT DISTINCT unnest(dels) FROM transgroup1
    );  -- ensures all items to delete

    EXIT WHEN i > max_i OR array_length(cp_dels, 1) = 0;

    DELETE FROM transgroup1
      WHERE id IN (SELECT unnest(cp_dels));
    UPDATE transgroup1 SET dels = array[]::integer[];
    i := i + 1;
  END LOOP;

  UPDATE transgroup1  -- only to beautify
    SET items = ARRAY(SELECT unnest(items) ORDER BY 1 DESC);
END;
$BODY$ LANGUAGE plpgsql VOLATILE;
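As a quick sanity check of array_uunion (my reading: it returns the distinct union of two arrays; the element order of the output is not guaranteed):

SELECT array_uunion(array[1,2,5], array[2,9,1]);
-- e.g. {1,2,5,9} (element order may vary)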
Of course, to run it and to see the results, you can use
SELECT transgroup1_loop();
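The original display query is not shown here; a hypothetical way to number the resulting groups (the group column below) is a dense_rank — note that "group" must be double-quoted, since it is a reserved word:

SELECT *, dense_rank() OVER (ORDER BY id) AS "group"
FROM transgroup1;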
as a result:
 id |   items   | ssg_label | dels | group
----+-----------+-----------+------+-------
  4 | {8,7,4}   |         1 | {}   |     1
  5 | {9,5,2,1} |         1 | {}   |     2