How to label a large set of "transitive groups" with a constraint?

EDIT after @NealB's solution: @NealB's solution is very fast compared with any other, and it makes this new question about "adding a constraint to improve performance" unnecessary. @NealB's solution does not need to be improved: it takes O(n) time and is very simple.


The problem of "labeling transitive groups with SQL" has an elegant solution using recursion and a CTE ... but that solution takes exponential time (!). I need to work with 10,000 items: 1,000 items take 1 second, while 2,000 items take 1 day ...

Constraint: in my case it is possible to break the problem into pieces of ~100 items or fewer, but only in order to select one group of ~10 items and discard all the other ~90 labeled items ...

Is there a general algorithm for adding and using this kind of "pre-selection" to reduce the quadratic, O(N^2), time? Perhaps, as shown in the comments and by @wildplasser, the time is O(N log(N)); but I expect that, with "pre-selection", it can be reduced to O(N).


(EDIT)

I am trying to use an alternative algorithm, but it needs some improvement to be usable as a solution; or, to really increase performance (up to O(N) time), it needs to use the "pre-selection".

The "pre-selection" (constraint) is based on a "super-set grouping". Starting from the t1 table of the original question, "How to label 'transitive groups' with SQL?",

  table T1
  (original T1 augmented by the "super-set grouping label" ssg, plus one more row)
  ID1 | ID2 | ssg
  1   | 2   | 1
  1   | 5   | 1
  4   | 7   | 1
  7   | 8   | 1
  9   | 1   | 1
  10  | 11  | 2

So there are three groups,

  • g1 : {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
  • g2 : {4,7,8} because "4 t 7" and "7 t 8"
  • g3 : {10,11} because "10 t 11"

A supergroup is just an auxiliary group,

  • ssg1 : {g1, g2}
  • ssg2 : {g3}

If we have M supergroups and N total t1 elements, the average group size will be less than N/M. We can also assume (for my typical problem) that the maximum ssg size is ~N/M.

So, the “label algorithm” should only run M times with ~ N / M elements if it uses the ssg constraint.
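The idea of running the labeler once per super-set group can be sketched in Python as follows. This is only an illustration under stated assumptions: `label_pairs` and `label_by_ssg` are hypothetical names, `label_pairs` is a simple stand-in for whatever labeling algorithm is actually used, and every ID is assumed to belong to exactly one ssg.

```python
from collections import defaultdict

# Rows of T1 as (ID1, ID2, ssg), copied from the example table.
T1 = [(1, 2, 1), (1, 5, 1), (4, 7, 1), (7, 8, 1), (9, 1, 1), (10, 11, 2)]


def label_pairs(pairs):
    """Hypothetical stand-in labeler: returns {id: local group number}."""
    label = {}
    next_label = 0
    for a, b in pairs:
        la, lb = label.get(a), label.get(b)
        if la is None and lb is None:
            # Neither ID seen yet: open a new group.
            next_label += 1
            label[a] = label[b] = next_label
        elif la is None:
            label[a] = lb
        elif lb is None:
            label[b] = la
        elif la != lb:
            # Two existing groups turn out to be connected: merge them.
            for k in list(label):
                if label[k] == lb:
                    label[k] = la
    return label


def label_by_ssg(rows):
    """Run the labeler once per ssg; returns {id: (ssg, local label)}."""
    buckets = defaultdict(list)
    for id1, id2, ssg in rows:
        buckets[ssg].append((id1, id2))
    result = {}
    for ssg, pairs in buckets.items():
        # Each call only sees the ~N/M rows of one super-set group.
        for item, local in label_pairs(pairs).items():
            result[item] = (ssg, local)
    return result


labels = label_by_ssg(T1)
```

Here a group is identified by the pair (ssg, local label), so labels from different super-set groups never collide, and any ssg that is not of interest can simply be skipped before labeling.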

+4
3 answers

Doing this in SQL alone seems to be a bit of a problem. With some procedural programming on top of SQL, the solution seems to be fairly simple and efficient. Here is a brief outline of a solution that could be implemented in any procedural language that calls SQL.

Declare table R with primary key ID, where ID has the same domain as ID1 and ID2 of table T1. Table R contains one other, non-key column: a Label number.

Populate table R with the range of values found in T1. Set Label to zero (no label). Using the example data, the initial setup for R will look like this:

  Table R
  ID  Label
  ==  =====
  1   0
  2   0
  4   0
  5   0
  7   0
  8   0
  9   0

Using a cursor in the host language plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of four cases:

  Case 1: ID1.Label == 0 and ID2.Label == 0 

In this case, neither of these IDs has been "seen" before: add 1 to the counter, then update both rows of R to the counter value: update R set R.Label = :counter where R.ID in (:ID1, :ID2)

  Case 2: ID1.Label == 0 and ID2.Label <> 0 

In this case, ID1 is new, but ID2 has already been assigned a label. ID1 must be assigned the same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1

  Case 3: ID1.Label <> 0 and ID2.Label == 0 

In this case, ID2 is new, but ID1 has already been assigned a label. ID2 must be assigned the same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2

  Case 4: ID1.Label <> 0 and ID2.Label <> 0 

In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not, there is some sort of data integrity problem. Ahhhh ... not quite, see the edit.

EDIT I just realized that there are situations where both Label values here can be non-zero and different. If both are non-zero and different from each other, then the two Label groups need to be merged at this point. All you have to do is choose one Label and update the others to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged under the same Label value.

When the cursor is exhausted, table R will contain the Label values needed to update T2.

  Table R
  ID  Label
  ==  =====
  1   1
  2   1
  4   2
  5   1
  7   2
  8   2
  9   1

Process table T2 row by row with something like: set T2.Label to R.Label where T2.ID1 = R.ID. The end result should be:

  table T2
  ID1 | ID2 | LABEL
  1   | 2   | 1
  1   | 5   | 1
  4   | 7   | 2
  7   | 8   | 2
  9   | 1   | 1

This process is purely iterative and should scale easily to fairly large tables.
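The four cases above, including the merge from the EDIT, can be sketched in memory as follows. This is a minimal Python illustration, not the SQL/cursor implementation itself: a plain dictionary stands in for table R, a list of tuples for T1 and T2, and the loop plays the role of the cursor.

```python
# Rows of T1 as (ID1, ID2), taken from the example data.
T1 = [(1, 2), (1, 5), (4, 7), (7, 8), (9, 1)]

# Table R: every ID found in T1, with Label initialised to 0 (no label).
R = {i: 0 for pair in T1 for i in pair}

counter = 0
for id1, id2 in T1:
    l1, l2 = R[id1], R[id2]
    if l1 == 0 and l2 == 0:
        # Case 1: neither ID seen before, so start a new label.
        counter += 1
        R[id1] = R[id2] = counter
    elif l1 == 0:
        # Case 2: ID1 is new and inherits ID2's label.
        R[id1] = l2
    elif l2 == 0:
        # Case 3: ID2 is new and inherits ID1's label.
        R[id2] = l1
    elif l1 != l2:
        # Case 4 (EDIT): two labeled groups are connected, so merge them.
        # Mirrors: update R set Label = l1 where Label = l2
        for k in list(R):
            if R[k] == l2:
                R[k] = l1
    # Otherwise both labels are already equal: the row is redundant.

# Label every T1 row from R, as in the T2 step.
T2 = [(id1, id2, R[id1]) for id1, id2 in T1]
```

With the example data this reproduces the R and T2 tables shown above. Note that the merge in case 4 rewrites every row carrying the losing label, which is the in-memory counterpart of the `update ... where R.Label = ...` statement.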

+2

I suggest you check this out and use some universal language to solve it.

http://en.wikipedia.org/wiki/Disjoint-set_data_structure

Build a graph, maybe run DFS or BFS from each node,
then use this disjoint-set hint. I think this should work.
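As a rough illustration of the disjoint-set idea from the linked article, here is a minimal union-find sketch in Python. The function names (`find`, `union`, `label_groups`) are hypothetical, and the sketch uses path halving and union by size rather than an explicit DFS or BFS over the graph.

```python
def find(parent, x):
    """Return the root of x's set, shortening the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent, size, a, b):
    """Merge the sets containing a and b (union by size)."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    if size[ra] < size[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    size[ra] += size[rb]


def label_groups(pairs):
    """Return {id: group label} for the transitive groups in pairs."""
    parent, size = {}, {}
    for a, b in pairs:
        for x in (a, b):
            if x not in parent:
                parent[x] = x
                size[x] = 1
        union(parent, size, a, b)
    # Number the roots 1, 2, 3, ... in order of first appearance.
    root_to_label = {}
    labels = {}
    for x in parent:
        root = find(parent, x)
        root_to_label.setdefault(root, len(root_to_label) + 1)
        labels[x] = root_to_label[root]
    return labels


labels = label_groups([(1, 2), (1, 5), (4, 7), (7, 8), (9, 1), (10, 11)])
```

Each pair is processed once, and merging two groups only re-points one root instead of relabeling every member, which is why this structure is usually suggested for connected-component labeling.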

+1

@NealB's solution is faster (!). See an example of a PostgreSQL implementation here.

Below is an example of another “brute force algorithm”, just for curiosity!


As suggested by @peter.petrov and @RBarryYoung, some performance problems can be avoided by abandoning the CTE recursion ... I addressed some issues in the basic labeler and, above all, added the constraint for grouping by a super-set label. This new transgroup1_loop() function works!

PS: this solution still has performance limitations; please post your answer if you have a better one or an adaptation of this one.


  -- DROP table transgroup1;
  CREATE TABLE transgroup1 (
    id serial NOT NULL PRIMARY KEY,
    items integer[],       -- two or more items in the transitive relationship
    ssg_label varchar(12), -- the super-set grouping label
    dels integer[] DEFAULT array[]::integer[]
  );
  INSERT INTO transgroup1(items,ssg_label) values
    (array[1, 2],'1'),
    (array[1, 5],'1'),
    (array[4, 7],'1'),
    (array[7, 8],'1'),
    (array[9, 1],'1'),
    (array[10, 11],'2');
  -- or SELECT array[id1, id2],ssg_label FROM t1, with 10000 items

with these two functions we can solve the problem,

  CREATE FUNCTION transgroup1_loop(p_ssg varchar, p_max_i integer DEFAULT 100)
  RETURNS integer AS $funcBody$
  DECLARE
    cp_dels integer[];
    i integer;
  BEGIN
    i:=1;
    LOOP
      -- merge each row with an earlier overlapping row and remember the earlier id
      UPDATE transgroup1
        SET items = array_uunion(transgroup1.items,t2.items),
            dels = transgroup1.dels || t2.id
        FROM transgroup1 AS t1, transgroup1 AS t2
        WHERE transgroup1.id=t1.id AND t1.ssg_label=$1 AND
              t1.id>t2.id AND t1.items && t2.items;

      cp_dels := array(
        SELECT DISTINCT unnest(dels) FROM transgroup1
      ); -- ensures all items to delete
      RAISE NOTICE '-- bug, repeting dels, item-%; % dels! %', i, array_length(cp_dels,1), array_to_string(cp_dels,';','*');
      -- array_length() returns NULL (not 0) for an empty array, hence the COALESCE
      EXIT WHEN i>p_max_i OR COALESCE(array_length(cp_dels,1),0)=0;

      DELETE FROM transgroup1
        WHERE ssg_label=$1 AND id IN (SELECT unnest(cp_dels));
      UPDATE transgroup1 SET dels=array[]::integer[];
      i:=i+1;
    END LOOP;
    UPDATE transgroup1 -- only to beautify
      SET items = ARRAY(SELECT unnest(items) ORDER BY 1 desc);
    RETURN i;
  END;
  $funcBody$ LANGUAGE plpgsql VOLATILE;

to run and view the results, you can use

  SELECT transgroup1_loop('1'); -- run with ssg-1 items only
  SELECT transgroup1_loop('2'); -- run with ssg-2 items only
  -- show all with a sequential group label:
  SELECT *, dense_rank() over (ORDER BY id) AS group_label from transgroup1;

results:

   id |   items   | ssg_label | dels | group_label
  ----+-----------+-----------+------+-------------
    4 | {8,7,4}   | 1         | {}   |           1
    5 | {9,5,2,1} | 1         | {}   |           2
    6 | {11,10}   | 2         | {}   |           3

PS: the array_uunion() function is the same as the original ,

  CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
    -- ensures distinct items of a concatenation
    SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
  $$ LANGUAGE sql immutable;
0

Source: https://habr.com/ru/post/950964/

