Note from the team: Stay tuned! Very soon, BigQuery will make this tip irrelevant.
BigQuery Performance Recommendation: Avoid Using LOWER() and UPPER() Operations
LOWER() and UPPER() have a hard time with Unicode text: each character must be mapped individually, and characters can also be multi-byte.
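For reference, the slow pattern this tip argues against presumably looks like the query below. It is a sketch reconstructed from the Solution 3 query with the BYTES cast removed, not a query quoted from the original benchmark.

```sql
#standardSQL
-- Assumed baseline: lowercase every row as Unicode text, then substring-match.
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE LOWER(text) LIKE '%bigquery%'
```

Each additional search term adds another LOWER(text) LIKE condition joined with OR, so the Unicode lowercasing cost is paid again for every term.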
Solution 1: Case-Insensitive Regular Expression
Faster alternative: use REGEXP_CONTAINS() and add the case-insensitive modifier (?i) to your regular expression
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(text, '(?i)bigquery')
Performance is much better:
- 1 search: 18s to 8s
- 2 searches: 34s to 9s
- 3 searches: 52s to 11s
Solution 2: Combine Regular Expressions
Why run 3 searches when a regular expression can combine many into 1?
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)')
3 terms in 7s - nice.
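To check that the combined expression matches the same rows as three separate lowercase searches, a small side-by-side comparison can help. The sketch below uses hypothetical inline rows (the `sample` CTE and its strings are made up for illustration), so it scans no table data.

```sql
#standardSQL
-- Hypothetical inline rows; nothing here comes from the Hacker News table.
WITH sample AS (
  SELECT text
  FROM UNNEST([
    'Trying out BigQuery today',
    'BIG QUERY is fast',
    'Hosted on Google Cloud',
    'Nothing relevant here'
  ]) AS text
)
SELECT
  text,
  -- One case-insensitive pass over the text.
  REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)') AS combined_regex,
  -- Three separate lowercase-and-match checks.
  (LOWER(text) LIKE '%bigquery%'
    OR LOWER(text) LIKE '%big query%'
    OR LOWER(text) LIKE '%google cloud%') AS three_likes
FROM sample
```

The two boolean columns should agree on every row. Only the last sample row contains none of the three terms.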
Solution 3: Convert to Bytes
This is uglier, but it shows that UPPER() and LOWER() perform better on individual bytes, with equivalent results for these searches:
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE LOWER(CAST(text AS BYTES)) LIKE b'%bigquery%'
  OR LOWER(CAST(text AS BYTES)) LIKE b'%big query%'
  OR LOWER(CAST(text AS BYTES)) LIKE b'%google cloud%'
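The "equivalent results" caveat matters because, per the BigQuery documentation, LOWER() on BYTES treats its input as ASCII and leaves bytes above 127 untouched. The sketch below illustrates the difference with a hypothetical inline value that is not taken from the dataset.

```sql
#standardSQL
-- Hypothetical inline value chosen to contain one non-ASCII uppercase letter.
SELECT
  -- STRING input: case mapping follows Unicode rules.
  LOWER('BIGQUERY ÉCOLE') AS lowered_string,
  -- BYTES input: only ASCII letters are lowercased; the bytes of 'É' stay as they are.
  CAST(LOWER(CAST('BIGQUERY ÉCOLE' AS BYTES)) AS STRING) AS lowered_bytes
```

The ASCII part should be lowercased in both columns, while the accented capital should be lowercased only in `lowered_string`. The bytes approach is therefore a safe substitute only when the search terms are plain ASCII, as they are in the query above.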
LOWER() is slower. Use the regex (?i) modifier instead.
If this worked for you, feel free to comment on your performance improvements.