Case-insensitive BigQuery Performance Improvement

Question

The BigQuery team strikes again. This question is no longer relevant, because queries using LOWER() are now as fast as those using regular expressions.

Processing ~5 GB of data with BigQuery should be very fast. For example, the following query performs a case-insensitive search in 18 seconds:

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(text) LIKE '%bigquery%'
    # 18s

BigQuery is usually faster than this, but the real problem is that each additional search term makes the query much slower (almost a minute with three search terms):

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(text) LIKE '%bigquery%'
    OR LOWER(text) LIKE '%big query%'
    # 34s

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(text) LIKE '%bigquery%'
    OR LOWER(text) LIKE '%big query%'
    OR LOWER(text) LIKE '%google cloud%'
    # 52s

How can I improve the performance of my query?

performance google-bigquery

Felipe Hoffa, Jan 24 '18 at 7:13

1 answer

Felipe Hoffa · Answer 1 · 2018-01-24T07:13:30+0000

Note from the team: Stay tuned! Very soon, BigQuery will make this tip irrelevant.

BigQuery Performance Recommendation: Avoid Using `LOWER()` and `UPPER()`

LOWER() and UPPER() operations have a hard time with Unicode text: each character must be mapped individually, and a character can span multiple bytes.
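To make the point about multibyte characters concrete, here is a small local sketch in Python. It only illustrates how UTF-8 text mixes one-byte and multi-byte characters, so that case mapping has to decode each character rather than step through fixed-width bytes. It is an illustration under that assumption and says nothing about how BigQuery is implemented internally; the helper name `utf8_widths` is made up for this example.

```python
# Illustration only: Python string handling, not BigQuery internals.

def utf8_widths(s):
    """Return the number of UTF-8 bytes used by each character of s."""
    return [len(ch.encode("utf-8")) for ch in s]

samples = ["BigQuery", "\u00c9COLE", "\u03a3\u039f\u03a6\u0399\u0391"]

for s in samples:
    # ASCII letters take one byte each; accented Latin and Greek letters
    # take two, so character boundaries must be found before mapping case.
    print(s, "->", s.lower(), utf8_widths(s))
```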

Solution 1: Case-insensitive regular expression

Faster alternative: use REGEXP_CONTAINS() and add the case-insensitive modifier (?i) to your regular expression:

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE REGEXP_CONTAINS(text, '(?i)bigquery')
    # 7s

    # REGEXP_CONTAINS(text, '(?i)bigquery')
    # OR REGEXP_CONTAINS(text, '(?i)big query')
    # 9s

    # REGEXP_CONTAINS(text, '(?i)bigquery')
    # OR REGEXP_CONTAINS(text, '(?i)big query')
    # OR REGEXP_CONTAINS(text, '(?i)google cloud')
    # 11s

Performance is much better:

1 search term: 18s to 8s
2 search terms: 34s to 9s
3 search terms: 52s to 11s

Solution 2: Combine Regular Expressions

Why run 3 separate searches when a single regular expression can combine them all?

 #standardSQL SELECT COUNT(*) c FROM `bigquery-public-data.hacker_news.full` WHERE REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)') # 7s

3 search terms in 7s - nice.
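If the list of search terms changes often, the combined pattern can be generated rather than typed by hand. The following is a minimal local sketch in Python, not part of the original answer: `combined_pattern` and `matches_any_lower` are hypothetical helper names, and the sketch assumes the terms are plain ASCII words and spaces. It checks the pattern with Python's `re` module, which is a different engine from the one BigQuery uses, so treat it as a sanity check of the logic only.

```python
import re

def combined_pattern(terms):
    """Build one case-insensitive alternation from plain search terms."""
    for t in terms:
        # Assumption: terms are plain words. Reject regex metacharacters
        # instead of guessing how they should be escaped.
        if not re.fullmatch(r"[A-Za-z0-9 ]+", t):
            raise ValueError(f"unsupported search term: {t!r}")
    return "(?i)(" + "|".join(terms) + ")"

def matches_any_lower(text, terms):
    """Reference check mirroring LOWER(text) LIKE '%term%' OR ..."""
    lowered = text.lower()
    return any(t in lowered for t in terms)

terms = ["bigquery", "big query", "google cloud"]
pattern = combined_pattern(terms)

samples = ["I love BigQuery", "BIG QUERY is fast", "Google Cloud", "nothing here"]
for s in samples:
    # For ASCII terms, the combined regex and the LOWER-style check agree.
    print(s, bool(re.search(pattern, s)), matches_any_lower(s, terms))
```

The generated string can then be pasted into the REGEXP_CONTAINS() call in place of the hand-written pattern.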

Solution 3: Convert to Bytes

This is uglier, but it shows that UPPER() and LOWER() perform better when working on individual bytes, and the results are equivalent for these searches:

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(CAST(text AS BYTES)) LIKE b'%bigquery%'
    OR LOWER(CAST(text AS BYTES)) LIKE b'%big query%'
    OR LOWER(CAST(text AS BYTES)) LIKE b'%google cloud%'
    # 7s
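The results are equivalent here because the search terms are pure ASCII: lowercasing byte by byte changes only the ASCII letters A-Z and leaves the bytes of multibyte characters alone. The following local Python sketch illustrates that idea with Python's own `bytes.lower()`, which behaves this way. The helper names are hypothetical, and the sketch assumes a UTF-8 encoding; it is not a statement about BigQuery's implementation.

```python
def contains_ci_bytes(text, term):
    """Mirror LOWER(CAST(text AS BYTES)) LIKE b'%term%' for an ASCII term."""
    raw = text.encode("utf-8")
    # bytes.lower() maps only ASCII A-Z; bytes above 127 are left untouched.
    return term in raw.lower()

def contains_ci_str(text, term):
    """Mirror LOWER(text) LIKE '%term%' on the decoded string."""
    return term in text.lower()

samples = ["Caf\u00e9 BigQuery", "\u00c9COLE Google Cloud", "nothing here"]
for s in samples:
    # With ASCII search terms both checks give the same answer.
    print(s, contains_ci_bytes(s, b"bigquery"), contains_ci_str(s, "bigquery"))
```

The equivalence would not hold for a non-ASCII search term such as "école", because the byte-level version never lowercases "É".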

LOWER() is slower. Use the regex (?i) modifier instead.

If this worked for you, feel free to comment on your performance improvements.
