Case-insensitive BigQuery Performance Improvement

The BigQuery team strikes again. This question is no longer relevant, as queries with LOWER() are now as fast as the regular-expression versions.


Processing ~5 GB of data with BigQuery should be very fast. For example, the following query performs a case-insensitive search in 18 seconds:

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(text) LIKE '%bigquery%'
    # 18s

BigQuery is usually faster than this, but the real problem is that each additional search term makes the query much slower (almost a minute with three search terms):

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(text) LIKE '%bigquery%'
    OR LOWER(text) LIKE '%big query%'
    # 34s

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(text) LIKE '%bigquery%'
    OR LOWER(text) LIKE '%big query%'
    OR LOWER(text) LIKE '%google cloud%'
    # 52s

How can I improve the performance of my query?

1 answer

Note from the team: Stay tuned! Very soon, BigQuery will make this tip irrelevant.

BigQuery Performance Recommendation: Avoid Using LOWER() and UPPER()

LOWER() and UPPER() operations have a hard time when working with Unicode text: each character must be mapped individually, and characters can also be multibyte.
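To make that cost concrete, here is a small Python sketch (Python rather than BigQuery SQL, purely as an illustration; the sample string is made up). It shows that a UTF-8 string can hold more bytes than characters, and that Unicode lowercasing is a per-character mapping that can even change the length of the string. How BigQuery implements LOWER() internally is not described in this answer, so treat this as an analogy only.

```python
# Illustrative sample text (hypothetical, not taken from the Hacker News table).
sample = "Größe İstanbul BigQuery"

char_count = len(sample)                  # number of Unicode code points
byte_count = len(sample.encode("utf-8"))  # number of UTF-8 bytes

# Unicode lowercasing maps each character individually. U+0130 ("İ")
# lowercases to two code points: "i" plus a combining dot above.
lowered = sample.lower()

print(char_count, byte_count, len(lowered))
```

The three numbers differ, which is the point: a Unicode-aware LOWER() cannot simply flip bits in a byte buffer, it has to decode and map every character.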

Solution 1: Case-Insensitive Regular Expression

Faster alternative: use REGEXP_CONTAINS() and add the case-insensitive modifier (?i) to your regular expression:

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE REGEXP_CONTAINS(text, '(?i)bigquery')
    # 7s

    # REGEXP_CONTAINS(text, '(?i)bigquery')
    # OR REGEXP_CONTAINS(text, '(?i)big query')
    # 9s

    # REGEXP_CONTAINS(text, '(?i)bigquery')
    # OR REGEXP_CONTAINS(text, '(?i)big query')
    # OR REGEXP_CONTAINS(text, '(?i)google cloud')
    # 11s

Performance is much better:

  • 1 search term: 18s to 8s
  • 2 search terms: 34s to 9s
  • 3 search terms: 52s to 11s

Solution 2: Combine Regular Expressions

Why run 3 separate searches when a single regular expression can combine them into 1?

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)')
    # 7s
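If the list of search terms changes often, the combined pattern can be generated instead of typed by hand. The following Python sketch is one possible way to do that; the helper name `combined_pattern` and the sample comments are hypothetical, and it assumes the terms are plain text without regex metacharacters. Python's `re` module is used only as a local sanity check, since BigQuery evaluates the pattern with its own regex engine (RE2).

```python
import re

def combined_pattern(terms):
    """Hypothetical helper: join plain-text terms into one (?i) alternation.

    Assumes the terms contain no regex metacharacters such as "." or "+".
    """
    return "(?i)(" + "|".join(terms) + ")"

terms = ["bigquery", "big query", "google cloud"]
pattern = combined_pattern(terms)

# Made-up sample comments, used only to check the pattern locally.
comments = [
    "I love BigQuery",
    "Big Query is fast",
    "GOOGLE CLOUD rocks",
    "nothing relevant here",
]
matches = [bool(re.search(pattern, comment)) for comment in comments]

print(pattern)
print(matches)
```

The resulting string is the same pattern passed to REGEXP_CONTAINS() in the query above. Terms that do contain metacharacters would need escaping before being joined.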

3 terms in 7s: nice.

Solution 3: Convert to Bytes

This is uglier, but it shows that UPPER() and LOWER() perform much better when they work on individual bytes, and for these searches the results are equivalent:

    #standardSQL
    SELECT COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE LOWER(CAST(text AS BYTES)) LIKE b'%bigquery%'
    OR LOWER(CAST(text AS BYTES)) LIKE b'%big query%'
    OR LOWER(CAST(text AS BYTES)) LIKE b'%google cloud%'
    # 7s
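Why the byte version gives equivalent results here can be sketched in Python (again an analogy, with a made-up sample string). Python's `bytes.lower()` only maps ASCII letters and leaves bytes above 127 untouched, which, as far as I understand the BigQuery documentation, is also how LOWER() treats BYTES arguments. Since all three search terms are pure ASCII, skipping the non-ASCII characters does not change whether they are found.

```python
# Illustrative sample (hypothetical): non-ASCII letters plus an upper-case term.
sample = "Ünïcode and BIGQUERY"

as_bytes = sample.encode("utf-8")

# bytes.lower() maps only ASCII A-Z; the multi-byte characters are not
# decoded or remapped, so "Ü" stays upper-case.
lowered_bytes = as_bytes.lower()

# The search term is pure ASCII, so the byte-level search finds it
# just as the character-level search does.
found_in_bytes = b"bigquery" in lowered_bytes
found_in_text = "bigquery" in sample.lower()

print(lowered_bytes.decode("utf-8"), found_in_bytes, found_in_text)
```

The two approaches would diverge only for search terms that themselves contain non-ASCII letters, because the byte version never lowercases those.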

LOWER() is slower. Use the regex (?i) modifier instead.

If this worked for you, feel free to comment on your performance improvements.


Source: https://habr.com/ru/post/1274917/

