How to perform linear regression in BigQuery?

BigQuery has some statistical aggregate functions, such as STDDEV(X) and CORR(X, Y), but it does not offer a function that performs linear regression directly.

How can linear regression be calculated using the functions that do exist?

+4
4 answers

Editor: see the following answer; linear regression is now natively supported in BigQuery. --Fh


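The following query performs a linear regression using only built-in aggregate functions. It returns the slope and intercept of the best fit to the model Y = SLOPE * X + INTERCEPT, along with the correlation coefficient computed by CORR.

As an example, it uses the public natality dataset to model birth weight (weight_pounds) as a linear function of pregnancy duration (gestation_weeks), with one fit per state. The nested subqueries keep each step separate; to apply the query to another dataset, only the innermost SELECT needs to change.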

SELECT Bucket,
       SLOPE,
       (SUM_OF_Y - SLOPE * SUM_OF_X) / N AS INTERCEPT,
       CORRELATION
FROM (
    SELECT Bucket,
           N,
           SUM_OF_X,
           SUM_OF_Y,
           CORRELATION * STDDEV_OF_Y / STDDEV_OF_X AS SLOPE,
           CORRELATION
    FROM (
        SELECT Bucket,
               COUNT(*) AS N,
               SUM(X) AS SUM_OF_X,
               SUM(Y) AS SUM_OF_Y,
               STDDEV_POP(X) AS STDDEV_OF_X,
               STDDEV_POP(Y) AS STDDEV_OF_Y,
               CORR(X,Y) AS CORRELATION
        FROM (SELECT state AS Bucket,
                     gestation_weeks AS X,
                     weight_pounds AS Y
              FROM [publicdata.samples.natality])
        WHERE Bucket IS NOT NULL AND
              X IS NOT NULL AND
              Y IS NOT NULL
        GROUP BY Bucket));

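The slope is derived from STDDEV_POP and CORR instead of from explicit sums of products of X and Y, which avoids subtracting large, nearly equal sums and should be numerically more stable. On a well-behaved dataset, both approaches should give the same results.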
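For reference, the identities the query relies on can be written out as follows. Here r stands for CORR(X, Y), and σ_X, σ_Y for STDDEV_POP(X) and STDDEV_POP(Y); this is the standard least-squares result for a single predictor, not something specific to BigQuery.

```latex
\begin{align}
\text{SLOPE} &= r \,\frac{\sigma_Y}{\sigma_X}
  = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \\
\text{INTERCEPT} &= \bar{y} - \text{SLOPE}\cdot\bar{x}
  = \frac{\sum_i y_i - \text{SLOPE}\cdot\sum_i x_i}{N}.
\end{align}
```

The last expression is exactly (SUM_OF_Y - SLOPE * SUM_OF_X) / N from the outer SELECT.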

+10

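Good news! Linear regression is now natively supported through BigQuery ML.

Train a model with CREATE MODEL, then get predictions with SELECT ... FROM ML.PREDICT.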
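A minimal sketch of that two-step flow, in standard SQL, might look like this. The dataset and model name `my_dataset.weight_by_gestation` are hypothetical and the dataset is assumed to exist already; the single feature and the label are borrowed from the natality example used elsewhere on this page.

```sql
-- Step 1: train a linear regression model.
-- `my_dataset.weight_by_gestation` is a placeholder name, not from the original post.
CREATE OR REPLACE MODEL `my_dataset.weight_by_gestation`
OPTIONS (model_type = 'linear_reg',
         input_label_cols = ['weight_pounds']) AS
SELECT gestation_weeks,   -- feature (X)
       weight_pounds      -- label (Y)
FROM publicdata.samples.natality
WHERE gestation_weeks IS NOT NULL
  AND weight_pounds IS NOT NULL;

-- Step 2: predict the label for new feature values.
-- ML.PREDICT names its output column predicted_<label column>.
SELECT gestation_weeks,
       predicted_weight_pounds
FROM ML.PREDICT(
  MODEL `my_dataset.weight_by_gestation`,
  (SELECT 40 AS gestation_weeks));
```

The input to ML.PREDICT needs columns with the same names and types as the training features, which is why the subquery aliases the literal as gestation_weeks.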

Docs:


+2

Here is code that creates a linear regression model from a public dataset on natality (live births) and stores it in a dataset called demo_bq_ml. That dataset must be created before running the statement below.

%%bq query
CREATE OR REPLACE MODEL demo_bq_ml.babyweight_model_asis
OPTIONS
  (model_type='linear_reg', input_label_cols=['weight_pounds']) AS

WITH natality_data AS (
  SELECT
    weight_pounds, -- this is the label; because it is continuous, we need to use regression
    CAST(is_male AS STRING) AS is_male,
    mother_age,
    CAST(plurality AS STRING) AS plurality,
    gestation_weeks,
    CAST(alcohol_use AS STRING) AS alcohol_use,
    CAST(year AS STRING) AS year,
    ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
  FROM
    publicdata.samples.natality
  WHERE
    year > 2000
    AND gestation_weeks > 0
    AND mother_age > 0
    AND plurality > 0
    AND weight_pounds > 0
)

SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    alcohol_use,
    year
FROM
    natality_data
WHERE
  MOD(hashmonth, 4) < 3  -- select 75% of the data as training
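
The training query keeps the rows where MOD(hashmonth, 4) < 3, so the remaining 25% can serve as a held-out set. As a rough sketch, assuming the model above trained successfully under the name demo_bq_ml.babyweight_model_asis, it could be evaluated on those rows with ML.EVALUATE:

```sql
-- Evaluate the trained model on the rows NOT used for training
-- (hash bucket 3 of 4). The feature casts mirror the training query.
SELECT *
FROM ML.EVALUATE(
  MODEL demo_bq_ml.babyweight_model_asis,
  (
    SELECT
      weight_pounds,
      CAST(is_male AS STRING) AS is_male,
      mother_age,
      CAST(plurality AS STRING) AS plurality,
      gestation_weeks,
      CAST(alcohol_use AS STRING) AS alcohol_use,
      CAST(year AS STRING) AS year
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
      AND gestation_weeks > 0
      AND mother_age > 0
      AND plurality > 0
      AND weight_pounds > 0
      AND MOD(ABS(FARM_FINGERPRINT(
            CONCAT(CAST(year AS STRING), CAST(month AS STRING)))), 4) = 3
  ));
```

For a linear_reg model the result includes error metrics such as mean_absolute_error, mean_squared_error and r2_score.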
0

What is the best way to quantify X when X is a date? I have tried converting each date into a number of time units (seconds, days, etc.) elapsed between the earliest date in the dataset and date X, but so far this has not given any reasonable results.
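
One way the conversion described above could be sketched in standard SQL is with DATE_DIFF against the earliest date. The table `my_dataset.my_table` and its columns `event_date` (DATE) and `y` are hypothetical; the slope and intercept formulas are the same ones used in the CORR / STDDEV_POP answer on this page.

```sql
-- Assumed input: `my_dataset.my_table` with a DATE column event_date
-- and a numeric column y (both names are placeholders).
WITH points AS (
  SELECT
    -- X = whole days elapsed since the earliest date in the data
    DATE_DIFF(event_date, MIN(event_date) OVER (), DAY) AS x,
    y
  FROM `my_dataset.my_table`
  WHERE event_date IS NOT NULL
    AND y IS NOT NULL
),
stats AS (
  SELECT
    COUNT(*) AS n,
    SUM(x) AS sum_x,
    SUM(y) AS sum_y,
    CORR(x, y) * STDDEV_POP(y) / STDDEV_POP(x) AS slope
  FROM points
)
SELECT
  slope,                                  -- change in y per day
  (sum_y - slope * sum_x) / n AS intercept -- fitted y at the earliest date
FROM stats;
```

Counting days from the earliest date keeps X small; UNIX_DATE(event_date) would give the same slope with a different intercept, since it only shifts X by a constant.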

0

Source: https://habr.com/ru/post/1654413/