Transitive SQL query in one table

Question

Transitive SQL query in one table

Hey. consider d the following table and data ...

in_timestamp | out_timestamp | name | in_id | out_id | in_server | out_server | status timestamp1 | timestamp2 | data1 |id1 | id2 | others-server1 | my-server1 | success timestamp2 | timestamp3 | data1 | id2 | id3 | my-server1 | my-server2 | success timestamp3 | timestamp4 | data1 | id3 | id4 | my-server2 | my-server3 | success timestamp4 | timestamp5 | data1 | id4 | id5 | my-server3 | others-server2 | success

the above data is a flow log of some data on servers.
eg. some data flowed from some "external server1" to a bunch of "my servers" and, finally, to "other-server2".

Question:

1) I need to provide this journal in a representable form to the client, where he does not need to know anything about the bunch of "my servers". All I have to give is a timestamp when the data entered my infrastructure and when it left; drilling up to the following information.

 in_timestamp (of 'others_server1' to 'my-server1') out_timestamp (of 'my-server3' to 'others-server2') name status

I want to write sql for the same! Can anyone help? NOTE. There cannot be 3 'my-servers' all the time. It differs from situation to situation. for example, maybe 4 'my-server' is involved, say, for data2!

2) Are there other alternatives to SQL? I mean stored procs / etc?

3) Optimization? (The records are huge in quantity! At the moment it is about 5 million per day. And we must show records that are up to a week.)

Forward, THANKS FOR HELP! :)

+4

sql self-join postgresql

Miku Apr 05 '10 at 11:11

source share

4 answers

Quassnoi · Answer 1 · 2010-04-08T11:58:51+0000

 WITH RECURSIVE foo AS ( SELECT *, in_timestamp AS timestamp1, 1 AS hop, ARRAY[in_id] AS hops FROM log_parsing.log_of_sent_mails WHERE in_server = 'other-server1' UNION ALL SELECT t_alias2.*, foo.timestamp1, foo.hop + 1, hops || in_id FROM foo JOIN log_parsing.log_of_sent_mails t_alias2 ON t_alias2.in_id = (foo.t_alias1).out_id ) SELECT * FROM foo ORDER BY hop DESC LIMIT 1

Mark byers · Answer 2 · 2010-04-05T13:22:27+0000

Your table has a hierarchical structure (adjacency lists). This can be effectively requested in PostgreSQL v8.4 and later using recursive CTEs. Quassnoi wrote a blog post on how to implement it. This is a rather complicated query that you need to write, but it explains it well with examples very similar to what you need. Especially if you look at his last example, he demonstrates the request, than gets the full path from the first node to the last using an array.

DVK · Answer 3 · 2010-04-05T11:16:54+0000

One way to do this is if the data is STABLE (for example, inc never changes), you need to calculate the transitive ON FLY relationships (for example, using a trigger or an application that does the insertion) during insertion.

eg. you have a new column "start_ts" in your table; when you insert a record:

 in_timestamp | out_timestamp | name | in_id | out_id | in_server | out_server | status timestamp3 | timestamp4 | data1 | id3 | id4 | my-server2 | my-server3 | success

... then your logic will automatically find the record with name=data1 and out_id=id3 and clone its start_ts into the newly inserted record. You may need special logic to propagate the last status, depending on how you calculate these transitive values.

By the way, you do not have to search for the previous record ( name=data1 and out_id=id3 ) - you can save the value of start_ts in the metadata of the data record during processing.

Then the final report is simply select start_ts, out_ts from T where out_server=others_server2 (of course, more complicated in terms of out_server and status, but still one simple choice)

The second option is, of course, a simpler cycle for calculating the final report - google or "stack" (is it an accepted verb?) For SQL BFS implementations, if you don't know how.

Miku · Answer 4 · 2010-04-08T11:31:28+0000

@ Other readers:
See the first answer posted by Mark Byers. I used the "answer", and not "commented" on his post, since I needed to use tables / links, etc., which are not available when commenting on the answers. :)
@Mark Byers:

Thanks for the link ... It really helped me, and I was able to figure out how to create a path between the servers ... See what I could do.

 in_id | in_timestamp | out_timestmap | name | hops_count | path | id1 | timestamp1 | timestamp2 | data1 | 1 | {id1} | id2 | timestamp2 | timestamp3 | data1 | 2 | {id1,id2} | id3 | timestamp3 | timestamp4 | data1 | 3 | {id1,id2,id3} | id4 | timestamp4 | timestamp2 | data1 | 4 | {id1,id2,id3,id4} |

* path generated using 'in_id'

I used the following query ...

 WITH RECURSIVE foo AS ( SELECT t_alias1, 1 AS hops_count, ARRAY[in_id] AS hops FROM log_parsing.log_of_sent_mails t_alias1 WHERE in_server = 'other-server1' UNION ALL SELECT t_alias2, foo.hops_count + 1 AS hops_count, hops || in_id FROM foo JOIN log_parsing.log_of_sent_mails t_alias2 ON t_alias2.in_id = (foo.t_alias1).out_id ) SELECT (foo.t_alias1).in_id, (foo.t_alias1).name, (foo.t_alias1).in_timestamp, hops_count, hops::VARCHAR AS path FROM foo ORDER BY hops

But I still could not reach the final stage. This is what I want to end up with ...

 in_id | in_timestamp | out_timestmap | name | hops_count | path | id4 | timestamp1 | timestamp5 | data1 | 4 | {id1,id2,id3,id4}|

* observe the time stamp. This is necessary because I do not want the client to know about the internal infrastructure. Therefore, the time delay between timestamp1 and timestamp5 is important to him.

Any clues how can I achieve this !?

ps I would try to contact Quassnoi . :)

Transitive SQL query in one table

More articles: