The connection between the mapping table (transition) with a certain power

I have a simple question about the most efficient way to perform a specific connection.

Take these three tables, real names have been changed to protect the innocent:

Table: animal

  animal_id name ...
 ========================
 1 bunny
 2 bear
 3 cat
 4 mouse

Table: tags

  tag_id tag
 ===================
 1 fluffy
 2 brown
 3 cute
 4 small

Map Table: animal_tag

  animal_id tag_id
 ===================
 eleven
 12
 thirteen
 2 2
 3 4
 4 2

I want to find all animals that are marked as “fluffy,” “brown,” and “cute.” That is, the animal must be labeled with all three. In reality, the number of tags required may vary, but this does not matter for discussion. This is the query I came up with:

SELECT * FROM animal JOIN ( SELECT at.animal_id FROM animal_tag at WHERE at.tag_id IN ( SELECT tg.tag_id FROM tag tg WHERE tg.tag='fluffy' OR tg.tag='brown' OR tg.tag='cute' ) GROUP BY at.animal_id HAVING COUNT(at.tag_id)=3 ) AS jt ON animal.animal_id=jt.animal_id 

On a table with thousands of "animals" and "hundreds" of tags, this request is executed with respect ... 10 with milliseconds. However, when I look at the query plan (Apache Derby is the database), the estimated cost of the optimizer is quite high (9945.12), and the plan is quite extensive. For a query, this “simple” one, I usually try to get query plans with an approximate cost of one or two digits.

So my question is: is there a better way to fulfill this request? Sounds like a simple request, but I was at a dead end, coming up with something better.

+6
source share
5 answers

First of all, many thanks to everyone who jumped on this. Ultimately, the answer, according to several commentators, is a relational division.

While I was taking a course on the relational data model of Codd, many moons ago, the course, like many, did not really cover relational division. Directly my initial request is actually a Relational Division application.

Referring to slide 26-27 in this presentation on relational division, my query uses the method of comparing installed capacities. I tried some of the other methods mentioned for applying relational division, but at least in my case, the counting method provides the fastest duration. I urge anyone interested in this issue to read the aforementioned slide stack, as well as the article cited on this page by Michael Ericsson. Again, thanks to everyone.

+1
source

You can create a temporary table using DECLARE GLOBAL TEMPORARY TABLE and then do an INNER JOIN to exclude "WHERE IN". Working with Joins that are set based on is usually much more efficient than Where statements, which should be evaluated for each line.

+1
source

try the following:

 SELECT DISTINCT f.Animal_ID, g.Name FROM Animal f INNER JOIN (SELECT a.Animal_ID, a.Name, COUNT(*) as iCount FROM Animal a INNER JOIN Animal_Tag b ON a.Animal_ID = b.animal_ID INNER JOIN Tags c On b.tag_ID = c.tag_ID WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here GROUP BY a.Animal_ID) g WHERE g.iCount = 3 -- No. of tags 

UPDATE

  SELECT DISTINCT a.Animal_ID, a.Name, COUNT(*) as iCount FROM Animal a INNER JOIN Animal_Tag b ON a.Animal_ID = b.animal_ID INNER JOIN Tags c On b.tag_ID = c.tag_ID WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here GROUP BY Animal_ID HAVING iCount = 3 -- No. of tags 
+1
source

Give this back:

 SELECT a.* FROM animal a INNER JOIN ( SELECT at.animal_id FROM tag t INNER JOIN animal_tag at ON at.tag_id = t.tag_id WHERE tag IN ('fluffy', 'brown', 'cute') GROUP BY at.animal_id HAVING count(*) = 3 ) f ON a.animal_id = f.animal_id 

Here is another option, just for fun:

 SELECT a.* FROM animal a INNER JOIN animal_tag at1 on at1.animal_id = a.animal_id INNER JOIN tag t1 on t1.tag_id = at1.tag_id INNER JOIN animal_tag at2 on at2.animal_id = a.animal_id INNER JOIN tag t2 on t2.tag_id = at2.tag_id INNER JOIN animal_tag at3 on at3.animal_id = a.animal_id INNER JOIN tag t3 on t3.tag_id = at3.tag_id WHERE t1.tag = 'fluffy' AND t2.tag = 'brown' AND t3.tag = 'cute' 

I really don't expect this last option to succeed ... other options avoid having to go back to the tag table several times to resolve the tag name from the identifier ... but you never know what the optimizer will do until you try it .

+1
source

I was wondering how bad it would be to use a relational unit there. Could you give him a run? I know this will take longer, but I'm intrigued by how much :) If you can provide both an approximate cost and time, it would be great.

 select a2.animal_id, a2.animal_name from animal2 a2 where not exists ( select * from animal1 a1, tags t1 where not exists ( select * from animal_tag at1 where at1.animal_id = a1.animal_id and at1.animal_tag = t1.tag_id ) and a2.animal_id = a1.animal_id and t1.tag in ('fluffy', 'brown', 'cute') ) 

Now, looking for a quick request, I can’t think of anything faster than John or yours. In fact, john may be a little slower than yours because it performs non-standard operations (delete individual and remove count (*) from select):

 SELECT a.Animal_ID, a.Name FROM Animal a INNER JOIN Animal_Tag b ON a.Animal_ID = b.animal_ID INNER JOIN Tags c On b.tag_ID = c.tag_ID WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here GROUP BY Animal_ID, a.Name HAVING count(*) = 3 -- No. of tags 

It should be as fast as yours.

PS: Is there a way to remove these damned 3 without duplicating the where clause? My brain is boiling :)

0
source

Source: https://habr.com/ru/post/907810/


All Articles