MongoDB to AWS Redshift

We have a fairly large MongoDB instance with a number of established collections. It has reached the point where it is becoming too expensive to rely on MongoDB's query capabilities (including the aggregation framework) for insight into the data.

I looked around at options to make the data more accessible and convenient to work with, and settled on two promising candidates:

  • AWS Redshift
  • Hadoop + Hive

We want to be able to use SQL syntax to analyse our data, and we want close to real-time access (a delay of a few minutes is fine, we just don't want to wait for an overnight MongoDB sync).

As far as I can tell, for option 2 you can use https://github.com/mongodb/mongo-hadoop to move data from MongoDB into a Hadoop cluster.

I have looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From the Amazon articles it seems the right approach is to use AWS Kinesis to get the data into Redshift. However, I cannot find a single example of someone doing something similar, and I cannot find any libraries or connectors for moving data from MongoDB into a Kinesis stream. At least the approach itself looks promising.

Has anyone done something like this?

+6
3 answers

I ended up coding our own migrator using NodeJS. I got a little annoyed with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what we ended up having to do.

Timestamped data

Basically, we make sure that every MongoDB collection we want to migrate into a Redshift table carries a timestamp and is indexed on that timestamp.
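
For illustration, a minimal sketch of that with the Node.js MongoDB driver, assuming a hypothetical `updated_at` field and an `employees` collection (neither is spelled out in the answer):

```js
const { MongoClient } = require('mongodb');

// Sketch only: index the timestamp so "changed since X" queries stay cheap.
async function ensureTimestampIndex(uri) {
  const client = await MongoClient.connect(uri);
  try {
    await client.db('app').collection('employees').createIndex({ updated_at: 1 });
  } finally {
    await client.close();
  }
}
```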

Cursor plugins

We then write a plugin for each migration we want to run from a Mongo collection into a Redshift table. Each plugin returns a cursor that takes the last migrated date (passed to it by the migration engine) into account and returns only the data that has changed since the last successful migration for that plugin.
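
A minimal sketch of such a plugin; the collection and field names (`employees`, `updated_at`) are assumptions, not the author's actual code:

```js
// Hypothetical plugin for the employees collection.
module.exports = {
  table: 'employees',

  // Return a cursor over only the documents changed since the last
  // successful migration for this plugin.
  cursor(db, lastMigratedAt) {
    const query = lastMigratedAt ? { updated_at: { $gt: lastMigratedAt } } : {};
    return db.collection('employees').find(query).sort({ updated_at: 1 });
  },
};
```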

How cursors are used

The migration engine then consumes that cursor and walks through each record. For every record it calls back into the plugin to convert the document into an array, which the migrator uses to build a delimited line that it appends to a file on disk. We use tabs to delimit this file, as our data contains a lot of commas and pipes.
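
Roughly, that part of the engine might look like the sketch below; the `transform` callback stands in for whatever conversion function a plugin supplies:

```js
const fs = require('fs');

// Stream each document through the plugin's transform and append it as a
// tab-delimited line to the export file.
async function exportToFile(cursor, transform, path) {
  const out = fs.createWriteStream(path);
  for await (const doc of cursor) {
    const row = transform(doc);        // plugin turns the document into an array of values
    out.write(row.join('\t') + '\n');  // tab-delimited, since the data is full of commas/pipes
  }
  await new Promise((resolve) => out.end(resolve));
}
```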

Loading the export from S3 into a Redshift table

The migrator then uploads the delimited file to S3 and runs the Redshift COPY command to load the file from S3 into a temp table, using the plugin's configuration to get the name and prefix for that temp table.

So, for example, if I had a plugin configured with the table name employees, it would create a temp table called temp_employees.
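
As a sketch of that step (bucket names, the IAM role, and the use of the `pg` client are assumptions, not the author's actual setup; Redshift speaks the PostgreSQL protocol, so `pg` works against it):

```js
const fs = require('fs');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const { Client } = require('pg');

async function loadIntoTempTable(filePath, table, s3Bucket, redshiftConfig) {
  // 1. Upload the tab-delimited export to S3.
  const s3 = new S3Client({});
  const key = `exports/${table}.tsv`;
  await s3.send(new PutObjectCommand({
    Bucket: s3Bucket,
    Key: key,
    Body: fs.createReadStream(filePath),
  }));

  // 2. COPY from S3 into a temp_<table> staging table.
  //    (In the setup described above the staging table comes from the plugin's
  //    schema file; LIKE is used here only to keep the sketch short.)
  const rs = new Client(redshiftConfig);
  await rs.connect();
  await rs.query(`CREATE TABLE IF NOT EXISTS temp_${table} (LIKE ${table});`);
  await rs.query(`
    COPY temp_${table}
    FROM 's3://${s3Bucket}/${key}'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    DELIMITER '\\t';
  `);
  await rs.end();
}
```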

Now we have the data in this temp table, and the records in the temp table keep the ids from the source MongoDB collection. That allows us to run a delete against the target table, in our example the employees table, for every id that is present in the temp table. If either table does not exist, it is created on the fly from the schema provided by the plugin. We then insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so a deleted record simply arrives with its is_deleted flag set in Redshift.
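
The delete-then-insert merge could look something like this sketch (table and column names are illustrative, and `rs` is assumed to be a connected `pg` client):

```js
// Merge the staging table into the target table: delete the rows that are
// about to be replaced, then insert everything from the temp table.
async function mergeTempIntoTarget(rs, table) {
  await rs.query('BEGIN;');
  await rs.query(`DELETE FROM ${table} USING temp_${table} WHERE ${table}.id = temp_${table}.id;`);
  await rs.query(`INSERT INTO ${table} SELECT * FROM temp_${table};`);
  await rs.query('COMMIT;');
}
```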

Once this whole process has completed, the migration engine saves a timestamp for the plugin in a Redshift table, to keep track of when the migration last ran successfully for it. That value is passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it hands back to the engine.
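
A minimal sketch of that bookkeeping, assuming a hypothetical `migration_runs(plugin, last_run_at)` table (the answer does not name it):

```js
// Record the high-water mark for a plugin after a successful run,
// and read it back before the next run.
async function saveLastRun(rs, plugin, timestamp) {
  await rs.query('DELETE FROM migration_runs WHERE plugin = $1;', [plugin]);
  await rs.query('INSERT INTO migration_runs (plugin, last_run_at) VALUES ($1, $2);', [plugin, timestamp]);
}

async function getLastRun(rs, plugin) {
  const res = await rs.query('SELECT last_run_at FROM migration_runs WHERE plugin = $1;', [plugin]);
  return res.rows.length ? res.rows[0].last_run_at : null;
}
```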

So, each plugin/migration provides the engine with the following:

  • A cursor that optionally uses the last migrated date passed to it by the engine to ensure that only deltas are moved across.
  • A transform function that the engine uses to turn each document from the cursor into a delimited line that is appended to the export file.
  • A schema file, i.e. an SQL file containing the schema for the table in Redshift (see the sketch after this list).
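
The cursor was sketched above; the other two pieces might look roughly like this (column names and the schema are illustrative, not the author's actual contract):

```js
// Hypothetical transform for the employees plugin: one document in, one array of
// column values out, in the same order as the columns in the schema file.
// The schema file itself is just a plain CREATE TABLE statement, e.g.
//   CREATE TABLE employees (id VARCHAR(24), name VARCHAR(256), is_deleted BOOLEAN, updated_at TIMESTAMP);
function transform(doc) {
  return [
    doc._id.toString(),
    doc.name,
    doc.is_deleted ? 1 : 0,
    doc.updated_at.toISOString(),
  ];
}
```
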
+4

Redshift is a data warehouse product and MongoDB is a NoSQL database. Clearly they are not replacements for each other; they can coexist and serve different purposes. As for how to keep records saved and updated in both places: you can move all of the MongoDB data into Redshift as a one-time activity, but Redshift is not suited to real-time writes. To keep Redshift in near real-time sync, you would have to change the application that writes to MongoDB so that it also writes the records to S3; the S3-to-Redshift load can then run at a regular interval.
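
As a rough sketch of that dual-write idea; the bucket and key layout are assumptions, not anything the answer specifies:

```js
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

// Alongside the normal MongoDB write, drop a copy of the record into S3 so that
// a scheduled COPY job can load it into Redshift later.
async function writeRecord(collection, s3Bucket, record) {
  const { insertedId } = await collection.insertOne(record);

  const s3 = new S3Client({});
  await s3.send(new PutObjectCommand({
    Bucket: s3Bucket,
    Key: `incoming/${collection.collectionName}/${insertedId}.json`,
    Body: JSON.stringify({ ...record, _id: insertedId }),
  }));
}
```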

+2

Since MongoDB is a document storage engine, Apache Solr and Elasticsearch could be considered possible replacements, but they do not support SQL query capabilities; they mostly rely on a different filtering mechanism. For example, Solr might require the use of a DisMax query.

In the cloud, Amazon CloudSearch or Azure Search would be attractive options to try.

0

Source: https://habr.com/ru/post/977172/

