I am trying to extend my pipeline (which transfers data from RDS to RedShift) so that it selects only the rows whose id is greater than the maximum id that already exists in RedShift. I have a Python script that calculates this value and prints it to standard output. I want to capture that output and save it in a variable max_id, which I can later reference in my RDS select query. For example, my RDS selection section currently looks like this:
{
  "database": {
    "ref": "rds_mysql"
  },
  "scheduleType": "TIMESERIES",
  "name": "SrcRDSTable",
  "id": "SrcRDSTable",
  "type": "SqlDataNode",
  "table": "#{myRDSTableName}",
  "selectQuery": "select * from #{table} where #{myRDSTableLastModifiedCol} > '#{max_id}'"
},
Then I want to add a section before it that runs a shell command, gets the maximum id, and saves it in a variable max_id so that it can be referenced in the query above. So far, I have:
{
  "myComment": "Retrieves the maximum ID for a given table in RedShift",
  "id": "ShellCommandActivity_Max_ID",
  "workerGroup": "wg-12345",
  "type": "ShellCommandActivity",
  "command": "starting_point=$(/usr/bin/python /home/user/aws-taskrunner-docker/get_id.py --schema=schema_name --table=users --database=master)"
},
However, how do I make the value of starting_point available as max_id?
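For context, the kind of script invoked by the shell command could look roughly like the sketch below. This is only an illustration of the idea (parse the three flags, query the maximum id, print it), not the actual get_id.py: the function names, the id column name, and the fallback value of 0 for an empty table are all assumptions, and opening the RedShift connection is left out because it depends on the environment.

```python
import argparse
import re

# Hypothetical sketch of a get_id.py-style script; names are illustrative only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def parse_args(argv=None):
    """Parse the same flags that the shell command passes to the script."""
    parser = argparse.ArgumentParser(description="Print the maximum id of a table.")
    parser.add_argument("--schema", required=True)
    parser.add_argument("--table", required=True)
    parser.add_argument("--database", required=True)
    return parser.parse_args(argv)


def build_query(schema, table, id_column="id"):
    """Build the MAX(id) query, rejecting anything that is not a plain identifier."""
    for name in (schema, table, id_column):
        if not _IDENTIFIER.match(name):
            raise ValueError("unsafe identifier: %r" % (name,))
    # COALESCE makes an empty table yield 0 instead of NULL (an assumed default).
    return "SELECT COALESCE(MAX({0}), 0) FROM {1}.{2}".format(id_column, schema, table)


def fetch_max_id(connection, schema, table):
    """Run the query on any DB-API 2.0 connection and return the single value."""
    cursor = connection.cursor()
    try:
        cursor.execute(build_query(schema, table))
        return cursor.fetchone()[0]
    finally:
        cursor.close()


def main(connection, argv=None):
    """Print the maximum id so a shell can capture it with $(...)."""
    args = parse_args(argv)
    print(fetch_max_id(connection, args.schema, args.table))
```

Because main only prints the value, the $(...) substitution in the command captures it into the shell variable starting_point, which exists only inside that shell process.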