Huge performance difference when using the vs group

I am running several tests on an HSQLDB server with a table containing 500,000 records. There are no indexes in the table. There are 5,000 different business keys. I need a list of them. Naturally, I started with a DISTINCT request:

 SELECT DISTINCT business_key FROM memory WHERE concept <> 'case' or attrib <> 'status' or value <> 'closed' 

It takes about 90 seconds !!!

Then I tried using GROUP BY :

 SELECT business_key FROM memory WHERE concept <> 'case' or attrib <> 'status' or value <> 'closed' GROUP BY business_key 

And it takes 1 second !!!

Trying to figure out the difference I ran EXLAIN PLAN FOR , but it seems to give the same information for both queries.

EXLAIN PLAN FOR DISTINCT ...

 isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks 

EXLAIN PLAN FOR SELECT ... GROUP BY ...

 isDistinctSelect=[false] isGrouped=[true] isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] groupColumns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks 

EDIT : I did additional tests. With 500,000 entries in HSQLDB with all clear business keys, DISTINCT performance is now better - 3 seconds, vs GROUP BY , which took about 9 seconds.

In MySQL both queries will prepare the same thing:

MySQL: 500,000 rows - 5,000 different business keys: Both queries: 0.5 seconds MySQL: 500,000 rows - all separate business keys: SELECT DISTINCT ... - 11 seconds SELECT ... GROUP BY business_key - 13 seconds

Therefore, the problem is only related to HSQLDB .

I would really appreciate it if someone could explain why there is such a radical difference.

+49
performance sql group-by distinct hsqldb
Oct 30 '11 at 8:29
source share
1 answer

Two queries express the same question. It seems that the query optimizer selects two different execution plans. I assume that the distinct approach runs like this:

  • Copy all business_key values ​​to a temporary table
  • Sort temporary table
  • Scan the temporary table, returning every item other than the one before it

group by can run as:

  • Scan a complete table, storing each business key value in a hash table
  • Returns hash table keys

The first method optimizes memory usage: it will still work quite well when part of the temporary table needs to be replaced. The second method is optimized for speed, but potentially requires a large amount of memory if there are many different keys.

Since you either have enough memory or several different keys, the second method is superior to the first. It is not surprising to see a difference in performance of 10x or even 100x between two execution plans.

+51
Oct 30 '11 at 10:44
source share



All Articles