I am running several tests on an HSQLDB server with a table containing 500,000 records. There are no indexes in the table. There are 5,000 different business keys. I need a list of them. Naturally, I started with a DISTINCT request:
SELECT DISTINCT business_key FROM memory WHERE concept <> 'case' or attrib <> 'status' or value <> 'closed'
It takes about 90 seconds !!!
Then I tried using GROUP BY :
SELECT business_key FROM memory WHERE concept <> 'case' or attrib <> 'status' or value <> 'closed' GROUP BY business_key
And it takes 1 second !!!
Trying to figure out the difference I ran EXLAIN PLAN FOR , but it seems to give the same information for both queries.
EXLAIN PLAN FOR DISTINCT ...
isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks
EXLAIN PLAN FOR SELECT ... GROUP BY ...
isDistinctSelect=[false] isGrouped=[true] isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] groupColumns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks
EDIT : I did additional tests. With 500,000 entries in HSQLDB with all clear business keys, DISTINCT performance is now better - 3 seconds, vs GROUP BY , which took about 9 seconds.
In MySQL both queries will prepare the same thing:
MySQL: 500,000 rows - 5,000 different business keys: Both queries: 0.5 seconds MySQL: 500,000 rows - all separate business keys: SELECT DISTINCT ... - 11 seconds SELECT ... GROUP BY business_key - 13 seconds
Therefore, the problem is only related to HSQLDB .
I would really appreciate it if someone could explain why there is such a radical difference.