AWS Data Pipeline - How to Set a Global Pipeline Variable from ShellCommandActivity

I am trying to extend my pipeline (which transfers data from RDS to Redshift) so that it selects only the rows whose id is greater than the maximum id that already exists in Redshift. I have a Python script that calculates this value and prints it to standard output. I want to capture that output in a variable, max_id, which I can later reference in my RDS select query. For example, my RDS select section currently looks like this:

{
  "database": {
    "ref": "rds_mysql"
  },
  "scheduleType": "TIMESERIES",
  "name": "SrcRDSTable",
  "id": "SrcRDSTable",
  "type": "SqlDataNode",
  "table": "#{myRDSTableName}",
  "selectQuery": "select * from #{table} where #{myRDSTableLastModifiedCol} > '#{max_id}'"
},

I then want to add a section before it that runs the bash script, gets the id, and saves it in a variable max_id so that it can be referenced in the code above. So far, I have:

{
  "myComment": "Retrieves the maximum ID for a given table in RedShift",
  "id": "ShellCommandActivity_Max_ID",
  "workerGroup": "wg-12345",
  "type": "ShellCommandActivity",
  "command": "starting_point=$(/usr/bin/python /home/user/aws-taskrunner-docker/get_id.py --schema=schema_name --table=users --database=master)"
},

Now, how do I set max_id to the value of starting_point?


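For an RDS MySQL to Redshift transfer, replace the SqlDataNode with a ShellCommandActivity that runs a Python script to export the rows from RDS to S3. Then load the data from S3 into Redshift with a RedshiftCopyActivity.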
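
As a rough illustration of the export step such a ShellCommandActivity could run, here is a minimal sketch. It is not taken from the original answer: the function name export_rows_after, its parameters, and the CSV output format are all hypothetical. It assumes a DB-API 2.0 connection (for MySQL, a driver such as PyMySQL, whose placeholder is "%s") and writes the rows to any text file-like object. Uploading the result to S3 is left out and would be a separate step.

```python
import csv


def export_rows_after(conn, table, id_column, max_id, out, placeholder="%s"):
    """Write rows of `table` whose `id_column` is greater than `max_id` to `out` as CSV.

    Hypothetical helper: `conn` is any DB-API 2.0 connection and `out` is a
    text file-like object. Returns the number of rows written.
    """
    cur = conn.cursor()
    # Identifiers cannot be bound as query parameters, so `table` and
    # `id_column` must come from trusted configuration, never from user input.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {id_column} > {placeholder} "
        f"ORDER BY {id_column}"
    )
    cur.execute(query, (max_id,))

    writer = csv.writer(out)
    count = 0
    while True:
        # Fetch in batches so a large table is not loaded into memory at once.
        rows = cur.fetchmany(1000)
        if not rows:
            break
        writer.writerows(rows)
        count += len(rows)
    return count
```

In this arrangement the script itself would obtain max_id (for example from the output of get_id.py) before querying RDS, so no pipeline-level variable is needed; the RedshiftCopyActivity then only has to load whatever file the script placed in S3.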


Source: https://habr.com/ru/post/1653379/

