Getting a hit counter in a query on a large table is very slow

Question

I have a MySQL table, "items", with two integer columns: seid and tiid.
The table has about 35,000,000 rows, so it is very large.

    seid  tiid
    ----------
    1     1
    2     2
    2     3
    2     4
    3     4
    4     1
    4     2

The table has a composite primary key on both columns, plus an index on seid and an index on tiid.

A user enters one or more tiid values, and I want to find the seid values with the most matches.

For example, when someone enters 1,2,3, I would like to get 2 and 4 as the result. Each of them matches two of the entered tiid values.

My query:

    SELECT COUNT(*) AS c, seid
    FROM items
    WHERE tiid IN (1,2,3)
    GROUP BY seid
    HAVING c = (
        SELECT COUNT(*) AS c
        FROM items
        WHERE tiid IN (1,2,3)
        GROUP BY seid
        ORDER BY c DESC
        LIMIT 1
    )

But this query is extremely slow due to the large table.

Does anyone know how to build a better query for this purpose?

+4

sql mysql query-optimization

Roy roes Jan 12 '11 at 21:35

5 answers

Your query has to go through the large table twice. Caching the result of the first pass should cut the time roughly in half, but I don't see much room for further optimization.

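    DROP TEMPORARY TABLE IF EXISTS TMP_COUNTED;

    -- one pass over the large table; keep the per-seid counts
    CREATE TEMPORARY TABLE TMP_COUNTED
        SELECT seid, COUNT(*) AS C
        FROM items
        WHERE tiid IN (1,2,3)
        GROUP BY seid;

    CREATE INDEX IX_TMP_COUNTED ON TMP_COUNTED(C);

    -- MySQL cannot refer to a temporary table twice in one query,
    -- so read the maximum into a variable first
    SELECT MAX(C) INTO @max_c FROM TMP_COUNTED;
    SELECT * FROM TMP_COUNTED WHERE C = @max_c;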
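The idea in this answer (count once into a small table, then filter on the maximum) can be tried out without a MySQL server. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, loads the seven sample rows from the question, and the helper name top_seids is made up for this example.

```python
import sqlite3

# Sample rows from the question: (seid, tiid).
ROWS = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2)]


def top_seids(conn, tiids):
    """Return (max_count, sorted seids), reading items only once."""
    marks = ",".join("?" for _ in tiids)
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS tmp_counted")
    cur.execute("CREATE TEMP TABLE tmp_counted (seid INTEGER, c INTEGER)")
    # The only statement that touches the large table.
    cur.execute(
        "INSERT INTO tmp_counted "
        "SELECT seid, COUNT(*) FROM items "
        f"WHERE tiid IN ({marks}) GROUP BY seid",
        list(tiids),
    )
    # Everything below reads just the small temporary table.
    max_c = cur.execute("SELECT MAX(c) FROM tmp_counted").fetchone()[0]
    seids = [row[0] for row in cur.execute(
        "SELECT seid FROM tmp_counted WHERE c = ? ORDER BY seid", (max_c,))]
    return max_c, seids


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (seid INTEGER, tiid INTEGER, PRIMARY KEY (seid, tiid))")
conn.executemany("INSERT INTO items VALUES (?, ?)", ROWS)

result = top_seids(conn, [1, 2, 3])
print(result)  # (2, [2, 4])
```

For the input 1,2,3 this gives a maximum count of 2 for seid 2 and 4, which matches the result the question asks for. Timings on SQLite say nothing about MySQL, so this only checks the logic.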

+2

RichardTheKiwi Jan 12 '11 at 21:48

Precompute the counts for every distinct tiid value and store them.

Refresh these counts hourly, daily, or weekly, or keep them accurate by updating them whenever the data changes. That removes the need to count at query time. Counting over a large table is always slow.
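One way to keep stored counts accurate as the data changes is a summary table maintained by triggers. The sketch below is a rough illustration, not a tested MySQL setup: it runs on Python's built-in sqlite3 module, and the table name tiid_counts and the trigger names are hypothetical. It shows only the mechanism for one counter per tiid; whether such counters are enough to answer the multi-value query from the question is a separate matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (seid INTEGER, tiid INTEGER, PRIMARY KEY (seid, tiid));

-- Hypothetical summary table: one stored counter per tiid value.
CREATE TABLE tiid_counts (tiid INTEGER PRIMARY KEY, n INTEGER NOT NULL);

-- Keep the counter in step with every insert ...
CREATE TRIGGER items_ai AFTER INSERT ON items BEGIN
    INSERT OR IGNORE INTO tiid_counts (tiid, n) VALUES (NEW.tiid, 0);
    UPDATE tiid_counts SET n = n + 1 WHERE tiid = NEW.tiid;
END;

-- ... and every delete.
CREATE TRIGGER items_ad AFTER DELETE ON items BEGIN
    UPDATE tiid_counts SET n = n - 1 WHERE tiid = OLD.tiid;
END;
""")

# Sample rows from the question, followed by one delete.
rows = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2)]
conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
conn.execute("DELETE FROM items WHERE seid = 3 AND tiid = 4")

# Reading the stored counters needs no COUNT(*) over items.
stored = dict(conn.execute("SELECT tiid, n FROM tiid_counts"))
# Recount from scratch only to check that the counters are right.
live = dict(conn.execute("SELECT tiid, COUNT(*) FROM items GROUP BY tiid"))
print(stored)  # {1: 2, 2: 2, 3: 1, 4: 1}
```

The trade-off is extra work on every write in exchange for cheap reads. A periodic refresh, as suggested above, avoids the write cost but lets the counts go stale between refreshes.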

+1

jDempster Jan 12 '11 at 21:44

I have a table called product_category, which has a composite primary key consisting of 2 unsigned integers and no additional secondary indices:

    create table product_category
    (
        prod_id int unsigned not null,
        cat_id mediumint unsigned not null,
        primary key (cat_id, prod_id) -- note the clustered composite index !!
    )
    engine = innodb;

The table currently has about 125 million rows:

    select count(*) as c from product_category;

    c = 125,524,947

with the following indexes and cardinalities:

    show indexes from product_category;

    Table             Non_unique  Key_name  Seq_in_index  Column_name  Collation  Cardinality
    =====             ==========  ========  ============  ===========  =========  ===========
    product_category  0           PRIMARY   1             cat_id       A          1162276
    product_category  0           PRIMARY   2             prod_id      A          125525826

If I run a query similar to yours (the first run is uncached, with cold/empty buffers):

    select
        prod_id, count(*) as c
    from
        product_category
    where
        cat_id between 1600 and 2000 -- using between to include a wider range of data
    group by
        prod_id
    having c = (
        select count(*) as c
        from product_category
        where cat_id between 1600 and 2000
        group by prod_id
        order by c desc
        limit 1
    )
    order by prod_id;

I get the following results:

    (cold run)
    +---------+---+
    | prod_id | c |
    +---------+---+
    |   34957 | 4 |
    |  717812 | 4 |
    |  816612 | 4 |
    |  931111 | 4 |
    +---------+---+
    4 rows in set (0.18 sec)

    (2nd run)
    +---------+---+
    | prod_id | c |
    +---------+---+
    |   34957 | 4 |
    |  717812 | 4 |
    |  816612 | 4 |
    |  931111 | 4 |
    +---------+---+
    4 rows in set (0.14 sec)

The EXPLAIN output is as follows:

    +----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
    | id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
    +----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
    |  1 | PRIMARY     | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
    |  2 | SUBQUERY    | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
    +----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

If I run regilero's query:

    SELECT c, prod_id
    FROM (
        SELECT c, prod_id,
               CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
        FROM (
            SELECT COUNT(*) as c, prod_id
            FROM product_category
            WHERE cat_id between 1600 and 2000
            GROUP BY prod_id
            ORDER BY c DESC
        ) res1
        ,(SELECT @mmax:=0) initmax
        ORDER BY c DESC
    ) res2
    WHERE mymax>0;

I get the following results:

    (cold)
    +---+---------+
    | c | prod_id |
    +---+---------+
    | 4 |  931111 |
    | 4 |   34957 |
    | 4 |  717812 |
    | 4 |  816612 |
    +---+---------+
    4 rows in set (0.17 sec)

    (2nd run)
    +---+---------+
    | c | prod_id |
    +---+---------+
    | 4 |   34957 |
    | 4 |  717812 |
    | 4 |  816612 |
    | 4 |  931111 |
    +---+---------+
    4 rows in set (0.13 sec)

The EXPLAIN output is as follows:

    +----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
    | id | select_type | table            | type   | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
    +----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
    |  1 | PRIMARY     | <derived2>       | ALL    | NULL          | NULL    | NULL    | NULL |  92760 | Using where                                               |
    |  2 | DERIVED     | <derived4>       | system | NULL          | NULL    | NULL    | NULL |      1 | Using filesort                                            |
    |  2 | DERIVED     | <derived3>       | ALL    | NULL          | NULL    | NULL    | NULL |  92760 |                                                           |
    |  4 | DERIVED     | NULL             | NULL   | NULL          | NULL    | NULL    | NULL |   NULL | No tables used                                            |
    |  3 | DERIVED     | product_category | range  | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
    +----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

Finally, trying cyberkiwi's approach:

    drop procedure if exists cyberkiwi_variant;

    delimiter #

    create procedure cyberkiwi_variant()
    begin
        create temporary table tmp engine=memory
            select prod_id, count(*) as c
            from product_category
            where cat_id between 1600 and 2000
            group by prod_id
            order by c desc;

        select max(c) into @max from tmp;

        select * from tmp where c = @max;

        drop temporary table if exists tmp;
    end#

    delimiter ;

    call cyberkiwi_variant();

I get the following results:

    (cold and 2nd run)
    +---------+---+
    | prod_id | c |
    +---------+---+
    |  816612 | 4 |
    |  931111 | 4 |
    |   34957 | 4 |
    |  717812 | 4 |
    +---------+---+
    4 rows in set (0.14 sec)

The EXPLAIN output is as follows:

    +----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
    | id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
    +----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
    |  1 | SIMPLE      | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
    +----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

So all the tested methods take roughly the same time, between 0.13 and 0.18 seconds, which seems pretty efficient to me given the size of the table and the number of rows queried.

Hope this helps - http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

+1

Jon black Jan 13 '11 at 6:27

If I understand your requirements, you can try something like this:

    select seid, tiid, count(*)
    from items
    where tiid in (1,2,3)
    group by seid, tiid
    order by seid

0

Marc b Jan 12 '11 at 21:40

regilero · Accepted Answer · 2011-01-12T23:07:13+0000

So, I found two solutions. The first one:

    SELECT c, GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list
    FROM (
        SELECT COUNT(*) as c, seid
        FROM items
        WHERE tiid IN (1,2,3)
        GROUP BY seid
        ORDER BY c DESC
    ) T1
    GROUP BY c
    ORDER BY c DESC
    LIMIT 1;

    +---+-----------+
    | c | seid_list |
    +---+-----------+
    | 2 | 2,4       |
    +---+-----------+

Edit:

    EXPLAIN
    SELECT c, GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list
    FROM (
        SELECT COUNT(*) as c, seid
        FROM items
        WHERE tiid IN (1,2,3)
        GROUP BY seid
        ORDER BY c DESC
    ) T1
    GROUP BY c
    ORDER BY c DESC
    LIMIT 1;

    +----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
    | id | select_type | table      | type  | possible_keys    | key     | key_len | ref  | rows | Extra                                                     |
    +----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
    |  1 | PRIMARY     | <derived2> | ALL   | NULL             | NULL    | NULL    | NULL |    3 | Using filesort                                            |
    |  2 | DERIVED     | items      | range | PRIMARY,tiid_idx | PRIMARY | 4       | NULL |    4 | Using where; Using index; Using temporary; Using filesort |
    +----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+

Second edit:

There is one problem with this first solution: with billions of rows, the result field could become too large (GROUP_CONCAT output is cut off at group_concat_max_len). So here is another solution that avoids the double pass over the table by remembering and checking the maximum with a MySQL user variable:

    SELECT c, seid
    FROM (
        SELECT c, seid,
               CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
        FROM (
            SELECT COUNT(*) as c, seid
            FROM items
            WHERE tiid IN (1,2,3)
            GROUP BY seid
            ORDER BY c DESC
        ) res1
        ,(SELECT @mmax:=0) initmax
        ORDER BY c DESC
    ) res2
    WHERE mymax>0;

    +---+------+
    | c | seid |
    +---+------+
    | 2 |    4 |
    | 2 |    2 |
    +---+------+

EXPLAIN:

    +----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
    | id | select_type | table      | type   | possible_keys    | key     | key_len | ref  | rows | Extra                                                     |
    +----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
    |  1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL    | NULL    | NULL |    3 | Using where                                               |
    |  2 | DERIVED     | <derived4> | system | NULL             | NULL    | NULL    | NULL |    1 | Using filesort                                            |
    |  2 | DERIVED     | <derived3> | ALL    | NULL             | NULL    | NULL    | NULL |    3 |                                                           |
    |  4 | DERIVED     | NULL       | NULL   | NULL             | NULL    | NULL    | NULL | NULL | No tables used                                            |
    |  3 | DERIVED     | items      | range  | PRIMARY,tiid_idx | PRIMARY | 4       | NULL |    4 | Using where; Using index; Using temporary; Using filesort |
    +----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
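The row-by-row logic of the user-variable query can be traced outside the database. The sketch below is plain Python, not MySQL: it takes the grouped counts already sorted by c descending, as the inner query returns them for the sample data, and applies the same compare-and-remember step as the CASE expression. The function name keep_max_rows is made up for illustration.

```python
def keep_max_rows(rows):
    """rows: (c, seid) pairs sorted by c descending, like the inner query."""
    mmax = 0  # plays the role of @mmax, initialised by (SELECT @mmax:=0)
    kept = []
    for c, seid in rows:
        # CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END
        if mmax <= c:
            mmax = c
            mymax = c
        else:
            mymax = 0
        # Outer filter: WHERE mymax>0
        if mymax > 0:
            kept.append((c, seid))
    return kept


# Grouped counts for tiid IN (1,2,3) on the sample table, highest first.
grouped = [(2, 4), (2, 2), (1, 1)]
print(keep_max_rows(grouped))  # [(2, 4), (2, 2)]
```

The filter is only correct when the rows arrive in descending order of c: the first row fixes the maximum, and every later row with a smaller count is dropped. MySQL does not guarantee the evaluation order of user-variable assignments within a single statement, so the SQL version depends on behavior that is worth re-checking on the server version in use.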
