BigQuery - Increased processing time with a specific table

We performed 2 queries on two different pairs of tables (several connections for receiving updated data, etc.), the same format (both in the table and in the query) and purpose. The difference between the requests is the string variable contained in the REGEXP_MATCH string (Regex formulas have the same format, only a different kernel string). The rest is exactly the same, except, of course, the data contained.

Despite the fact that the query processed almost 2 GB of data in 20-50 seconds in one pair of tables, the same query with a different REGEX parameter (the same column) processes another pair of tables of 250 MB in 100 seconds (sometimes even From 500 to 1000+ seconds). Both queries are executed interactively without caching the results.

What could be causing this and is there a fix?

How is it possible that a smaller table needs a significant amount of processing time compared to a significantly larger table, given that the queries being performed are basically the same?

Sorry for the mess below, tried to make it as beautiful as possible. So, for a brief briefing: the request intends to create a user funnel based on their events. Real-time data is similar, so we have both updated users and events. The following steps are included:

  • Get "done" actions - get users from updated events selected by the first regular expression
  • Get actions "don't have to do" - get users from updated events selected by the second event
  • Make the difference between 2 with LEFT OUTER JOIN
  • USERS ( )

, . .

SELECT Count(*) as count
FROM 
    (
    SELECT final._nid as _nid
    FROM (
        -- Start of events funnel
        SELECT did.user as user
        FROM (
            -- Get updated events
            SELECT events.user as user, events.createdOn as createdOn
            FROM [shop1_events] as events
            JOIN EACH (
                SELECT session, createdOn, MAX(updatedOn) as updatedOn
                FROM [shop1_events]
                GROUP EACH BY session, createdOn) as latest_events
            ON events.session = latest_events.session AND events.createdOn = latest_events.createdOn AND events.updatedOn = latest_events.updatedOn
            -- Regex for categories (concatenated categories)
            WHERE ((REGEXP_MATCH(events.category_a , r"([\:\^]100000453[\:\^]|^100000453$|^100000453[\^\:]|[\^\:]100000453$)"))) AND events.type = 10006) as did
        -- Exclude the following events:    
        LEFT OUTER JOIN EACH (
            -- Get updated events
            SELECT events.user as user, events.createdOn as createdOn
            FROM [shop1_events] as events
            JOIN EACH (
                SELECT session, createdOn, MAX(updatedOn) as updatedOn
                FROM [shop1_events]
                GROUP EACH BY session, createdOn) as latest_events
            ON events.session = latest_events.session AND events.createdOn = latest_events.createdOn AND events.updatedOn = latest_events.updatedOn
            -- Regex for categories
            WHERE ((REGEXP_MATCH(events.category_a , r"([\:\^]100000485[\:\^]|^100000485$|^100000485[\^\:]|[\^\:]100000485$)"))) AND events.type = 10006) as step_not_0
        ON did.user = step_not_0.user
        WHERE step_not_0.user IS NULL) as funnel
    JOIN EACH (
        -- Join with users
        SELECT all._nid as _nid
        FROM [shop1_users] as all
        JOIN EACH (
            -- Get updated users
            SELECT _nid, MAX(updatedOn) as updatedOn
            FROM [shop1_users]
            GROUP EACH BY _nid) as latest
        ON all._nid = latest._nid AND all.updatedOn = latest.updatedOn
        ) as final
ON final._nid = funnel.user
GROUP EACH BY _nid) as counting;
+4
2

, :

, , , , , :

:

REGEXP_MATCH(events.category_a , r"([\:\^]100000485[\:\^]|^100000485$|^100000485[\^\:]|[\^\:]100000485$)"))) 

:

REGEXP_REPLACE(events.category_a, r"\D*(\d+)\D*", (\1)) = "100000485"
0

, . . . JOIN WHERE - , . , , ( JOIN), . , .

- Explanation , , . , .

P.S. :

    SELECT events.user as user, events.createdOn as createdOn
    FROM [shop1_events] as events
    JOIN EACH (...) as latest_events
    ON ...
    WHERE <condition-for-events-table>

    SELECT events.user as user, events.createdOn as createdOn
    FROM (SELECT <fields-used> FROM [shop1_events] 
          WHERE <condition-for-events-table>) as events
    JOIN EACH (...) as latest_events
    ON ...
0

Source: https://habr.com/ru/post/1545258/


All Articles