MongoDB collection used for log data: index or not?

I am using MongoDB as a temporary log store. The collection receives ~400,000 new rows per hour. Each row contains a UNIX timestamp and a JSON string.

Periodically, I would like to copy the contents of the collection to a file on S3, creating one file per hour containing ~400,000 rows (for example, today_10_11.log would contain all the rows received between 10:00 and 11:00). I need to make this copy while the collection is still receiving inserts.

My question is: what is the performance impact of having an index on the timestamp column for the 400,000 hourly inserts, versus the extra time it would take to query an hour's worth of rows without one?

The application in question is written in Ruby, runs on Heroku, and uses the MongoHQ add-on.

+4
4 answers

Mongo indexes the _id field by default, and an ObjectId already starts with a timestamp, so basically Mongo is already indexing your collection by insertion time for you. So if you are using the default Mongo settings, you do not need to index a second timestamp field (or even add one).

To get the creation time of an ObjectId in Ruby:

    ruby-1.9.2-p136 :001 > id = BSON::ObjectId.new
     => BSON::ObjectId('4d5205ed0de0696c7b000001')
    ruby-1.9.2-p136 :002 > id.generation_time
     => 2011-02-09 03:11:41 UTC

To create an ObjectId for a given time:

    ruby-1.9.2-p136 :003 > past_id = BSON::ObjectId.from_time(1.week.ago)
     => BSON::ObjectId('4d48cb970000000000000000')

So, for example, if you wanted to load all the documents inserted in the past week, you would simply search for _ids greater than past_id and less than id. Via the Ruby driver:

    collection.find({:_id => {:$gt => past_id, :$lt => id}}).to_a
     => # ... a big array of hashes

You could, of course, also add a separate field for the timestamp and index it, but there is no point in taking that performance hit when Mongo already does the necessary work for you with its default _id field.
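
Tying that back to the question's hourly S3 export, a minimal sketch might look like the following (this assumes a `collection` handle from the Ruby driver; the hour boundaries, file name, and JSON serialization are illustrative, and the S3 upload step is left out):

    require 'json'

    # Illustrative hour boundaries for the 10:00-11:00 file from the question.
    hour_start = Time.utc(2011, 2, 9, 10)
    hour_end   = Time.utc(2011, 2, 9, 11)

    from_id = BSON::ObjectId.from_time(hour_start)
    to_id   = BSON::ObjectId.from_time(hour_end)

    File.open('today_10_11.log', 'w') do |file|
      collection.find(:_id => {:$gte => from_id, :$lt => to_id}).each do |doc|
        file.puts(doc.to_json)   # one JSON line per document
      end
    end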

More information on ObjectIds is available in the MongoDB documentation.

+4

I have an application similar to yours, and it currently has 150 million log records. At 400k per hour, this database will grow fast. 400k inserts an hour with an indexed timestamp will serve you much better than running an unindexed query. I have no problem inserting tens of millions of records per hour with an indexed timestamp, yet an unindexed query on the timestamp takes a couple of minutes on a 4-server shard (CPU bound). An indexed query comes back instantly. So definitely index it: the write overhead of indexing is not that high, and 400k records an hour is not much for Mongo.
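
For reference, a sketch of that indexed setup with the Ruby driver of the time (the database name, collection name, and the ts field name are assumptions, not anything from the question):

    require 'mongo'

    conn       = Mongo::Connection.new('localhost', 27017)   # 1.x driver API
    collection = conn.db('logs_db').collection('log_lines')

    # One-time index on the UNIX-timestamp field; each insert then pays one
    # extra index update, which is the write overhead discussed above.
    collection.create_index([['ts', Mongo::ASCENDING]])

    collection.insert('ts' => Time.now.to_i, 'line' => '{"level":"info"}')

    # The hourly query uses the index instead of scanning the collection.
    hour_start = Time.utc(2011, 2, 9, 10).to_i
    rows = collection.find('ts' => {'$gte' => hour_start, '$lt' => hour_start + 3600}).to_a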

One thing to watch out for is memory size. At 400k records per hour you are writing almost 10 million per day, which will consume roughly 350 MB of memory per day just to keep the index in memory. So if this runs for a while, your index can outgrow memory fairly quickly.
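
As a rough back-of-the-envelope check on those figures (the per-entry size here is an assumption, not a measured number):

    rows_per_day    = 400_000 * 24          # ~9.6 million rows per day
    bytes_per_entry = 35                    # assumed average size of one index entry
    puts rows_per_day * bytes_per_entry / 1024 / 1024
    # => ~320 (MB of index growth per day, in the same ballpark as the ~350 MB above)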

Also, if you truncate records after a certain period of time using remove, I have found that removes generate a large amount of disk I/O and are disk bound.
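
For clarity, the kind of truncation being described is a ranged remove along these lines (the ts field name is assumed):

    # Delete everything older than a week; correct, but generates heavy disk I/O.
    cutoff = Time.now.to_i - 7 * 24 * 3600
    collection.remove('ts' => {'$lt' => cutoff})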

+4

Of course, every write will have to update the index data. If you are going to run large queries over the data, you will definitely want that index.

Consider storing the timestamp in the _id field instead of a MongoDB ObjectId. As long as the timestamps you store are unique, you will be fine here. _id does not have to be an ObjectId, and there is already an automatic index on _id. This may be your best bet, since you will not be adding any extra indexing burden.
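
A sketch of that approach (how uniqueness is guaranteed is left to the application; the float timestamp here is only illustrative):

    # Store the timestamp itself in _id; no ObjectId and no second index needed,
    # but _id must be unique, so collisions have to be impossible by construction.
    collection.insert('_id' => Time.now.to_f, 'line' => '{"level":"info"}')

    # The automatic _id index then serves the hourly range query directly.
    hour_start = Time.utc(2011, 2, 9, 10).to_f
    rows = collection.find('_id' => {'$gte' => hour_start, '$lt' => hour_start + 3600}).to_a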

+1

I would use a capped collection, unindexed, with space for, say, 600k rows to allow for slop. Once per hour, dump the collection to a text file, then use grep to filter out the rows that do not belong to your target hour. This does not let you take advantage of the nice features of the DB, but it means you never have to worry about collection indexes, flushes, or any of that nonsense. The performance-critical bit is keeping the collection free for inserts, so if you can do the "hard" bit (filtering by date) outside the context of the database, you should not see any appreciable performance impact. 400-600k lines of text is trivial for grep and probably should not take more than a second or two.
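
A sketch of that setup (collection name, size, and paths are illustrative; the byte size just has to be big enough to hold ~600k rows, and the grep pattern depends on how the timestamp appears in each line):

    require 'mongo'
    require 'json'

    db = Mongo::Connection.new.db('logs_db')

    # Capped collection: fixed size, insertion order preserved, no indexes needed,
    # oldest rows silently overwritten once the cap is reached.
    log = db.create_collection('log_buffer',
                               :capped => true,
                               :size   => 512 * 1024 * 1024,  # bytes, sized for ~600k rows
                               :max    => 600_000)

    # Hourly job: dump everything, filter by hour outside the database.
    File.open('/tmp/dump.log', 'w') do |f|
      log.find.each { |doc| f.puts(doc.to_json) }
    end
    # then, e.g.:  grep '"ts":129724' /tmp/dump.log > today_10_11.log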

If you don't mind a bit of slop in each log, you can also just dump and gzip the collection. You will get some older data in every dump, but as long as you don't insert more than 600k rows between dumps, you should have a continuous series of log snapshots of 600k rows each.

+1
