I ended up coding our own migrator in Node.js. I was a little frustrated with answers that only explain what Redshift and MongoDB are, so I decided to take the time to share what I actually had to do in the end.
Timestamped data
Basically, we guarantee that every MongoDB collection we want to migrate into a Redshift table is timestamped and indexed on that timestamp.
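A minimal sketch of what that looks like, assuming the official `mongodb` driver and a hypothetical `updatedAt` field (our real field and collection names differ):

```js
// Ensure the collection we plan to migrate has an index on its update
// timestamp, so delta queries don't scan the whole collection.
const { MongoClient } = require('mongodb');

async function ensureTimestampIndex(mongoUrl, collectionName) {
  const client = await MongoClient.connect(mongoUrl);
  try {
    const collection = client.db().collection(collectionName);
    // 'updatedAt' is an example name; any reliably maintained timestamp works.
    await collection.createIndex({ updatedAt: 1 });
  } finally {
    await client.close();
  }
}
```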
Cursor plugins
We then write a plugin for each migration we want to perform from a Mongo collection into a Redshift table. Each plugin returns a cursor that takes the last migrated date into account (passed to it by the migration engine) and returns only the data that has changed since the last successful migration for that plugin. A sketch of such a plugin follows.
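Something along these lines, where the collection name, `updatedAt` field and `getCursor` method name are illustrative rather than our exact code:

```js
// A single migration plugin: knows its target table and how to build a
// cursor over only the documents changed since the last successful run.
function employeesPlugin(db) {
  return {
    tableName: 'employees',
    // lastMigratedAt is null on the very first run, so everything is returned.
    getCursor(lastMigratedAt) {
      const query = lastMigratedAt ? { updatedAt: { $gt: lastMigratedAt } } : {};
      return db.collection('employees').find(query);
    },
  };
}
```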
How cursors are used
The migration engine then consumes this cursor and iterates over each record. For every record it calls back into the plugin to convert the document into an array of values, which the migrator uses to build a delimited line that it writes out to a file on disk. We use tabs as the delimiter, because our data contains a lot of commas and pipes.
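A rough sketch of that export step; the `toRow` helper on the plugin is an assumed name for the document-to-array conversion function:

```js
// Walk the plugin's cursor, flatten each document into column values,
// and write one tab-delimited line per record to the export file.
const fs = require('fs');

async function exportToDelimitedFile(plugin, lastMigratedAt, filePath) {
  const out = fs.createWriteStream(filePath);
  const cursor = plugin.getCursor(lastMigratedAt);

  while (await cursor.hasNext()) {
    const doc = await cursor.next();
    const columns = plugin.toRow(doc); // e.g. [id, name, is_deleted]
    // Tabs, not commas or pipes, since those characters appear in our data.
    out.write(columns.join('\t') + '\n');
  }
  out.end();
}
```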
Loading the export from S3 into a Redshift table
The migrator then uploads the delimited file to S3 and runs the Redshift COPY command to load the file from S3 into a temp (staging) table, using the plugin's configuration to get the table name and to derive the temp table's name from it.
So, for example, if I had a plugin configured with the table name employees, it would create a temp table called temp_employees.
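A hedged sketch of that load step. The bucket name, IAM role and the `runQuery` helper (anything that executes SQL against Redshift, e.g. via the `pg` client) are placeholders, not part of our actual setup:

```js
// Upload the tab-delimited export to S3, then COPY it into the staging table.
const fs = require('fs');
const AWS = require('aws-sdk');

async function loadIntoStagingTable(plugin, filePath, runQuery) {
  const s3 = new AWS.S3();
  const key = `exports/${plugin.tableName}.tsv`;
  await s3.upload({
    Bucket: 'my-migration-bucket',           // placeholder bucket
    Key: key,
    Body: fs.createReadStream(filePath),
  }).promise();

  const stagingTable = `temp_${plugin.tableName}`; // e.g. employees -> temp_employees
  await runQuery(`
    COPY ${stagingTable}
    FROM 's3://my-migration-bucket/${key}'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    DELIMITER '\\t';
  `);
}
```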
Now we have the data in this temp table, and the rows in it carry the identifiers from the originating MongoDB collection. This lets us delete from the target table, in our example the employees table, every row whose identifier is present in the temp table. If either table does not exist, it is created on the fly based on the schema provided by the plugin. We then insert all the rows from the temp table into the target table, which covers both new records and updated records. We only ever soft-delete our data, so deleted records simply come through with the is_deleted flag set and get updated in Redshift like everything else. A sketch of this merge step follows.
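Roughly, the delete-then-insert merge looks like this (again with the hypothetical `runQuery` helper and the employees example; the id column name is an assumption):

```js
// Replace any target rows that have a fresher copy in the staging table,
// then insert everything from staging. Soft deletes ride along as rows
// with is_deleted set, so they need no special handling here.
async function mergeStagingIntoTarget(plugin, runQuery) {
  const target = plugin.tableName;             // e.g. 'employees'
  const staging = `temp_${plugin.tableName}`;  // e.g. 'temp_employees'

  await runQuery(`DELETE FROM ${target} WHERE id IN (SELECT id FROM ${staging});`);
  await runQuery(`INSERT INTO ${target} SELECT * FROM ${staging};`);
}
```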
Once this whole process has completed, the migration engine saves a timestamp for the plugin in a Redshift table so it can track when the migration last ran successfully for that plugin. That value is then passed to the plugin the next time the engine decides to migrate its data, allowing the plugin to use the timestamp in the cursor it hands back to the engine.
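One possible shape for that bookkeeping step; the `migration_runs` tracking table and its columns are invented for this sketch:

```js
// Record when this plugin last migrated successfully, overwriting any
// previous row so there is exactly one timestamp per plugin.
async function recordSuccessfulRun(plugin, runQuery) {
  const finishedAt = new Date().toISOString();
  await runQuery(`DELETE FROM migration_runs WHERE plugin_name = '${plugin.tableName}';`);
  await runQuery(
    `INSERT INTO migration_runs (plugin_name, last_run_at)
     VALUES ('${plugin.tableName}', '${finishedAt}');`
  );
}
```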
So, each plugin / migration provides the following to the engine (a sketch of a complete plugin follows the list):
- A cursor that optionally uses the last migrated date passed to it by the engine, to ensure that only deltas are moved across.
- A conversion function that the engine uses to turn each document in the cursor into a delimited line, which is appended to the export file.
- A schema file, i.e. an SQL file containing the schema for the table in Redshift.
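Putting those three pieces together, a plugin module might look something like this. Field names, the schema path and the method names are illustrative, not our exact interface:

```js
// One plugin = one Mongo collection -> one Redshift table.
module.exports = {
  tableName: 'employees',
  schemaFile: __dirname + '/schemas/employees.sql',

  // Cursor over documents changed since the last successful run.
  getCursor(db, lastMigratedAt) {
    const query = lastMigratedAt ? { updatedAt: { $gt: lastMigratedAt } } : {};
    return db.collection('employees').find(query);
  },

  // Flatten a MongoDB document into the column order of the Redshift table.
  toRow(doc) {
    return [doc._id.toString(), doc.name, doc.isDeleted ? 1 : 0];
  },
};
```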