MongoDB to BigQuery

What is the best way to export data from MongoDB hosted on mLab to Google BigQuery?

At first I want to do a one-time export from MongoDB to BigQuery, and later I am thinking about using Pub/Sub to stream data to BigQuery in real time.

I need help with the initial one-time export from MongoDB to BigQuery.

2 answers

In my opinion, the best practice is to write your own extractor. That can be done in the language of your choice, and you can extract to either CSV or JSON.
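
For example, here is a minimal sketch of such an extractor in Python with pymongo (the connection URI, database, and collection names are placeholders you would replace with your own mLab values); it writes newline-delimited JSON that BigQuery can load directly:

    # Minimal sketch of a hand-rolled extractor (placeholder URI, db, and
    # collection names). Writes newline-delimited JSON for BigQuery.
    import json
    from datetime import datetime

    from pymongo import MongoClient

    client = MongoClient("mongodb://user:password@ds012345.mlab.com:12345/sample")  # placeholder URI
    db = client["sample"]

    def to_jsonable(value):
        # Convert BSON-specific types into plain JSON values BigQuery understands.
        if isinstance(value, datetime):
            return value.isoformat()
        if isinstance(value, dict):
            return {k: to_jsonable(v) for k, v in value.items()}
        if isinstance(value, list):
            return [to_jsonable(v) for v in value]
        return value

    with open("sample.json", "w") as out:
        for doc in db["status"].find():
            doc["_id"] = str(doc["_id"])  # ObjectId is not JSON-serializable
            out.write(json.dumps(to_jsonable(doc)) + "\n")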

But if you are looking for a quick way, and if your data is not huge and can fit on a single server, then I recommend using mongoexport. Suppose you have a simple document structure such as the one below:

 { "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ", "statuses" : [ { "status" : "dc9e5511-466c-4146-888a-574918cc2534", "score" : 53.24388894 } ], "stored_at" : ISODate("2017-04-12T07:04:23.545Z") } 

Then you need to define a BigQuery schema (mongodb_schema.json), for example:

    $ cat > mongodb_schema.json <<EOF
    [
        { "name": "_id", "type": "STRING" },
        { "name": "stored_at", "type": "record", "fields": [
            { "name": "date", "type": "STRING" }
        ]},
        { "name": "statuses", "type": "record", "mode": "repeated", "fields": [
            { "name": "status", "type": "STRING" },
            { "name": "score", "type": "FLOAT" }
        ]}
    ]
    EOF

Now the fun part begins :-) Extracting data as JSON from your MongoDB. Suppose you have a replica set named statuses, a database named sample, and a collection named status.

    mongoexport \
        --host statuses/db-01:27017,db-02:27017,db-03:27017 \
        -vv \
        --db "sample" \
        --collection "status" \
        --type "json" \
        --limit 100000 \
        --out ~/sample.json

As you can see above, I limit the output to 100,000 records, because I recommend running a sample extract and load into BigQuery before doing it for all your data. After running the command you should have your sample data in sample.json, but it contains $date fields that will cause a load error in BigQuery. To fix that, we can use sed to replace them with a plain field name:

    # Fix the $date field to make it compatible with BQ
    sed -i 's/"\$date"/"date"/g' sample.json

Now you can compress it, upload it to Google Cloud Storage (GCS), and then load it into BigQuery with the following commands:

    # Compress for faster load
    gzip sample.json

    # Move to GCS
    gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz

    # Load into BQ
    bq load \
        --source_format=NEWLINE_DELIMITED_JSON \
        --max_bad_records=999999 \
        --ignore_unknown_values=true \
        --encoding=UTF-8 \
        --replace \
        "YOUR_DATASET.mongodb_sample" \
        "gs://your-bucket/sample/*.json.gz" \
        "mongodb_schema.json"

If everything is OK, go back, remove the --limit 100000 flag from the mongoexport command, and run the commands above again to load everything instead of the 100k sample.

ALTERNATIVE SOLUTION:

If you want more flexibility and performance is not a concern, you can use the mongo CLI tool. That way you can write your extraction logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so that I can load it into BigQuery much more easily:

    # Export logic in JavaScript
    cat > export-csv.js <<EOF
    var size = 100000;
    var maxCount = 1;  // increase this to page through the whole collection
    for (x = 0; x < maxCount; x = x + 1) {
        var recToSkip = x * size;
        db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
            var row = record._id + "," + record.stored_at.toISOString();
            record.statuses.forEach(function (l) {
                print(row + "," + l.status + "," + l.score);
            });
        });
    }
    EOF

    # Execute with the mongo CLI
    _MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
    mongo --quiet \
        "${_MONGO_HOSTS}" \
        export-csv.js \
        | split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_

    # Move all split files to Google Cloud Storage
    gsutil -m mv ./sample_* gs://your-bucket/sample/

    # Load files into BigQuery
    bq load \
        --source_format=CSV \
        --max_bad_records=999999 \
        --ignore_unknown_values=true \
        --encoding=UTF-8 \
        --replace \
        "YOUR_DATASET.mongodb_sample" \
        "gs://your-bucket/sample/sample_*.csv.gz" \
        "ID,StoredDate:DATETIME,Status,Score:FLOAT"

TIP: In the script above, I did a small trick by piping the output to split so that it is written to multiple files with the sample_ prefix. The split also gzips each file, which makes the upload to GCS and the load into BigQuery easier.


From a basic reading of the MongoDB documentation, it seems like you can use mongoexport to dump your database as JSON. Once you have done that, refer to the BigQuery loading data section for a description of how to create a table from JSON files after copying them to GCS.
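
For example, once the exported JSON files are in GCS, a load job along these lines creates the table (a sketch using the google-cloud-bigquery Python client; the project, bucket, dataset, and table names are placeholders):

    # Sketch: load newline-delimited JSON from GCS into BigQuery with the
    # official Python client. All names below are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # or pass an explicit schema instead
    )
    job = client.load_table_from_uri(
        "gs://your-bucket/sample/*.json.gz",
        "your-project.your_dataset.mongodb_sample",
        job_config=job_config,
    )
    job.result()  # wait for the load to finish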

