In my opinion, the best practice is to write your own extractor in the language of your choice and export the data as CSV or JSON. But if you are looking for a quick solution and your data is not huge and fits on a single server, I recommend using mongoexport. Suppose you have a simple document structure like the one below:
{ "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ", "statuses" : [ { "status" : "dc9e5511-466c-4146-888a-574918cc2534", "score" : 53.24388894 } ], "stored_at" : ISODate("2017-04-12T07:04:23.545Z") }
Then you need to define a BigQuery schema (mongodb_schema.json), for example:
$ cat > mongodb_schema.json <<EOF
[
  { "name": "_id", "type": "STRING" },
  { "name": "stored_at", "type": "record", "fields": [
    { "name": "date", "type": "STRING" }
  ]},
  { "name": "statuses", "type": "record", "mode": "repeated", "fields": [
    { "name": "status", "type": "STRING" },
    { "name": "score", "type": "FLOAT" }
  ]}
]
EOF
Now the fun part begins :-) Extracting data as JSON from your MongoDB. Suppose you have a replica set named statuses, a database named sample, and a collection named status.
mongoexport \
  --host statuses/db-01:27017,db-02:27017,db-03:27017 \
  -vv \
  --db "sample" \
  --collection "status" \
  --type "json" \
  --limit 100000 \
  --out ~/sample.json
As you can see above, I limit the output to 100,000 records, because I recommend running the sample extract and BigQuery load before doing this for all your data. After running the command, you should have your sample data in sample.json. However, it contains a $date field that will cause a load error in BigQuery. To fix that, we can use sed to replace it with a plain field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
Now you can compress the file, upload it to Google Cloud Storage (GCS), and then load it into BigQuery using the following commands:
# Compress for faster load
gzip sample.json
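The upload and load steps could look like the following. This is a minimal sketch using gsutil and bq; the bucket name (your-bucket), dataset (sample), and table (status) are assumptions, so substitute your own:

# Upload the compressed export to GCS (bucket name is a placeholder)
gsutil cp sample.json.gz gs://your-bucket/sample.json.gz

# Load into BigQuery using the schema defined earlier
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  sample.status \
  gs://your-bucket/sample.json.gz \
  ./mongodb_schema.json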
If everything is OK, go back, remove the --limit 100000 flag from the mongoexport command, and run the commands above again to load everything instead of the 100k sample.
ALTERNATIVE APPROACH:
If you want more flexibility and performance is not a concern, you can use the mongo CLI tool. That way, you can write your extraction logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so that it is much easier to load into BigQuery:
# Export logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (x = 0; x < maxCount; x = x + 1) {
  var recToSkip = x * size;
  // Page through the collection and print one CSV row per status entry
  db.status.find().skip(recToSkip).limit(size).forEach(function(record) {
    var row = record._id + "," + record.stored_at.toISOString();
    record.statuses.forEach(function (l) {
      print(row + "," + l.status + "," + l.score);
    });
  });
}
EOF
TIP: When running the script above, I used a small trick: piping the output through split to break it into multiple files with the prefix sample_, and gzipping each chunk during the split so the files are easier to upload to GCS (see the sketch below).
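The exact run command is not shown above; here is a minimal sketch of how such a pipeline could look, assuming GNU split (for the --filter option) and the same hosts and replica set as in the mongoexport example. The chunk size and suffix settings are illustrative:

# Run the export script against the replica set and split the CSV output
# into gzipped chunks named sample_000.csv.gz, sample_001.csv.gz, ...
mongo --quiet \
  "mongodb://db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses" \
  export-csv.js \
  | split -d -a 3 -l 500000 --filter='gzip > $FILE.csv.gz' - sample_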