The easiest way to schedule Google Cloud Dataflow

I just need to run a Dataflow pipeline every day, but the suggested solutions, such as the App Engine Cron Service, require building a whole web app, which seems like overkill. I was thinking of just launching the pipeline from a cron job on a Linux Compute Engine VM, but maybe that's too simple :). What is wrong with doing it that way, and why isn't anybody (besides me, I guess) suggesting it?

+10
3 answers

There is absolutely nothing wrong with using a cron job to kick off your Dataflow pipelines. We do it all the time for our production systems, whether our pipelines are written in Java or Python.
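
For illustration, here is a minimal sketch of what such a cron-driven launcher could look like in Node.js. It assumes the pipeline has been staged as a Dataflow template and that the `gcloud` CLI is installed and authenticated on the VM. The script name, file paths, job name prefix and region are hypothetical placeholders, not something from the setup described above.

```javascript
// launch-dataflow.js: hypothetical launcher meant to be run by cron on a VM.
// Example crontab entry (every day at 03:00):
//   0 3 * * * /usr/bin/node /opt/jobs/launch-dataflow.js --launch >> /var/log/dataflow-cron.log 2>&1
const { execFile } = require('child_process');

// Build the argument list for `gcloud dataflow jobs run`.
// The job name gets a date suffix so daily runs are easy to tell apart.
function buildGcloudArgs(templatePath, region, now) {
  const day = now.toISOString().slice(0, 10).replace(/-/g, '');
  return [
    'dataflow', 'jobs', 'run', `daily-pipeline-${day}`,
    '--gcs-location', templatePath,
    '--region', region,
  ];
}

// Placeholder template path and region (assumptions, replace with your own).
const args = buildGcloudArgs('gs://PATH-TO-YOUR-TEMPLATE', 'us-central1', new Date());

if (process.argv.includes('--launch')) {
  // Only shell out to gcloud when explicitly asked to.
  execFile('gcloud', args, (err, stdout, stderr) => {
    if (err) {
      console.error('Failed to launch Dataflow job:', stderr || err.message);
      process.exit(1);
    }
    console.log(stdout);
  });
} else {
  // Default is a dry run that only prints the command.
  console.log('Dry run, would execute: gcloud ' + args.join(' '));
}
```

Running the script without `--launch` only prints the command, which makes it easy to check the crontab wiring before any job is actually started.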

That said, we are trying to wean ourselves off cron and move towards either AWS Lambdas (we run multi-cloud) or Cloud Functions. Unfortunately, Cloud Functions do not have scheduling yet. AWS Lambdas do.

+4

There is a FAQ answer to this question: https://cloud.google.com/dataflow/docs/resources/faq#is_there_a_built-in_scheduling_mechanism_to_execute_pipelines_at_given_time_or_interval

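  • You can automate pipeline execution with Google App Engine.
  • You can use the Apache Airflow Dataflow Operator in a Google Cloud Composer workflow.
  • You can use custom (cron) job processes on Compute Engine.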

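Cloud Functions still have no built-in scheduling (unlike AWS); they can only be triggered by Pub/Sub messages, Cloud Storage changes, or HTTP calls.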

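Cloud Composer looks like a good option. It is essentially a managed Apache Airflow, which is a solid orchestration tool. Definitely not "too simple", unlike cron :)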

+2

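This is how I did it using Cloud Functions, PubSub and Cloud Scheduler (this assumes you have already created a Dataflow template and that it exists somewhere in GCS):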

  1. Create a new topic in PubSub. It will be used to trigger the Cloud Function.

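  2. Create a Cloud Function that launches a Dataflow job from the template. I find it easiest to create it from the CF console. Make sure the service account it runs as has permission to create Dataflow jobs. The function's index.js looks something like this: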

const { google } = require('googleapis');

exports.triggerTemplate = (event, context) => {
  // In this case the PubSub message payload and attributes are not used,
  // but they can be used to pass parameters needed by the Dataflow template.
  const pubsubMessage = event.data;
  console.log(Buffer.from(pubsubMessage, 'base64').toString());
  console.log(event.attributes);

  google.auth.getApplicationDefault(function (err, authClient, projectId) {
    if (err) {
      console.error('Error occurred: ' + err.toString());
      throw new Error(err);
    }

    const dataflow = google.dataflow({ version: 'v1b3', auth: authClient });

    // Launch a Dataflow job from the template stored in GCS.
    dataflow.projects.templates.create({
      projectId: projectId,
      resource: {
        parameters: {},
        jobName: 'SOME-DATAFLOW-JOB-NAME',
        gcsPath: 'gs://PATH-TO-YOUR-TEMPLATE'
      }
    }, function (err, response) {
      if (err) {
        console.error('Problem running dataflow template, error was: ', err);
        return; // there is no useful response to log on failure
      }
      console.log('Dataflow template response: ', response);
    });
  });
};

package.json:

{
  "name": "pubsub-trigger-template",
  "version": "0.0.1",
  "dependencies": {
    "googleapis": "37.1.0",
    "@google-cloud/pubsub": "^0.18.0"
  }
}
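
The comment in index.js notes that the PubSub message attributes can carry parameters for the template. Below is a minimal sketch of that idea, assuming the attributes are forwarded unchanged as template parameters. The helper name `buildTemplateRequest`, the `scheduled-job-` prefix and the `inputFile` attribute used in the usage note are hypothetical. The timestamp suffix is there because, as far as I know, Dataflow will not start a job whose name matches a job that is still active.

```javascript
// Sketch: turn a PubSub-triggered event into the request object passed to
// dataflow.projects.templates.create. Names here are illustrative only.
function buildTemplateRequest(event, projectId, now) {
  // PubSub attributes are a string-to-string map, the same shape as the
  // template's `parameters` map. A message may arrive with no attributes.
  const attributes = event.attributes || {};

  // Lowercase letters, digits and hyphens only; the millisecond timestamp
  // keeps the name different from run to run.
  const jobName = 'scheduled-job-' + now.getTime();

  return {
    projectId: projectId,
    resource: {
      parameters: Object.assign({}, attributes),
      jobName: jobName,
      gcsPath: 'gs://PATH-TO-YOUR-TEMPLATE'
    }
  };
}
```

Inside the function above this would be used as `dataflow.projects.templates.create(buildTemplateRequest(event, projectId, new Date()), callback)`, so a message published with the attribute `inputFile=gs://bucket/in.csv` ends up as the template parameter `inputFile`.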
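
  3. Publish a message to the PubSub topic you created, and make sure the function is triggered and a Dataflow job is started.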

  4. Create a Cloud Scheduler job that publishes a message to the PubSub topic on a schedule: https://cloud.google.com/scheduler/docs/tut-pub-sub
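
To make the scheduler step more concrete, here is a sketch of the job definition as a plain object in the shape of the Cloud Scheduler REST API's Job resource. The helper name `buildSchedulerJob` and all IDs in the usage note are hypothetical placeholders. The payload `run` is arbitrary, but some payload is needed, because the function above calls `Buffer.from(event.data, 'base64')` and that fails when `data` is missing.

```javascript
// Sketch: build a Cloud Scheduler Job resource with a PubSub target.
// All concrete names passed in are placeholders chosen by the caller.
function buildSchedulerJob(projectId, locationId, jobId, topicId, cronSchedule) {
  return {
    name: 'projects/' + projectId + '/locations/' + locationId + '/jobs/' + jobId,
    // unix-cron format, e.g. '0 3 * * *' for every day at 03:00
    schedule: cronSchedule,
    timeZone: 'Etc/UTC',
    pubsubTarget: {
      topicName: 'projects/' + projectId + '/topics/' + topicId,
      // The REST API expects the message payload base64-encoded.
      data: Buffer.from('run').toString('base64')
    }
  };
}
```

For example, `buildSchedulerJob('my-project', 'us-central1', 'daily-dataflow', 'dataflow-trigger', '0 3 * * *')` describes a job that publishes to the topic once a day. The same fields (frequency, time zone, target topic, payload) are what the console form in the linked tutorial asks for.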

0

Source: https://habr.com/ru/post/1676528/

