Well, this is the trade-off with document stores. You can store data in a normalized fashion like any standard RDBMS, and you should strive to normalize as much as possible. It's only where performance suffers that you should break normalization and flatten your data structures. The trade-off is read efficiency versus update cost.
Mongo has really efficient indexes that can make normalizing easier, like a traditional RDBMS (most document stores don't give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using these, you can create a collection of relationships between users and events. It's similar to a surrogate table in a SQL data warehouse. A compound index on the event and user fields will be pretty fast and helps you keep your data normalized.
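For instance, a minimal sketch of what that relationship collection and its indexes might look like with pymongo (the database, collection, and field names here are invented for illustration):

```python
# Hypothetical junction collection: one document per user<->event relationship.
from pymongo import MongoClient, ASCENDING

client = MongoClient()          # assumes a local mongod
db = client.myapp               # hypothetical database name

db.user_events.insert_one({"user_id": 42, "event_id": 1001, "rsvp": "yes"})

# Compound indexes make lookups from either side cheap.
db.user_events.create_index([("user_id", ASCENDING), ("event_id", ASCENDING)])
db.user_events.create_index([("event_id", ASCENDING), ("user_id", ASCENDING)])

# "Which events is user 42 attending?" stays a fast, indexed query.
event_ids = [doc["event_id"] for doc in db.user_events.find({"user_id": 42})]
events = list(db.events.find({"_id": {"$in": event_ids}}))
```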
I like to weigh the efficiency of flattening a structure versus keeping it normalized against how often I will need to update those records and how cheaply I need to read them in my queries. You can do this in terms of big O notation, but you don't have to be that fancy. Just put some numbers on paper based on a few use cases with different data models and get a rough feel for how much work is required.
Basically, I first try to predict how many updates a record will get and how often, versus how often it will be read. Then I estimate what an update costs versus a read when the data is normalized or flattened (or maybe some partial combination of the two I can conceive of... there are lots of optimization options). Then I weigh the cost of keeping a flattened copy up to date against the cost of assembling the data from normalized sources at read time. Once I've worked out all the variables, if storing it flat saves me enough, then I'll store it flat.
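Something as rough as the sketch below is usually enough. The numbers are invented and the "costs" are just relative query counts, but it shows the kind of back-of-envelope comparison I mean:

```python
# Hypothetical numbers for one record over a day -- plug in your own.
reads_per_day = 5000
updates_per_day = 20

# Rough relative costs per operation for each model.
normalized_read_cost = 3      # e.g. 3 queries to assemble the view at read time
normalized_update_cost = 1    # write one document
flattened_read_cost = 1       # one pre-joined document
flattened_update_cost = 4     # rewrite every flattened copy that embeds the record

normalized_total = reads_per_day * normalized_read_cost + updates_per_day * normalized_update_cost
flattened_total = reads_per_day * flattened_read_cost + updates_per_day * flattened_update_cost

print("normalized:", normalized_total)   # 15020
print("flattened: ", flattened_total)    # 5080 -> worth flattening this one
```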
A few tips:
- If you need reads to be fast and atomic (fully up-to-date), you will probably want to favor flattening over normalization and take the hit on updates.
- If you need updates to be fast and immediately reflected when read, favor normalization.
- If you need fast reads but they don't require completely fresh data, consider building flattened views of your normalized data in batch jobs (possibly using map/reduce); a batch-job sketch follows this list.
- If your reads need to be fast, updates are rare, and you don't need the update to be visible immediately or need transaction-level guarantees that the write persisted to disk 100% of the time, you can consider writing your updates to a queue and processing them in the background (see the worker sketch after this list). In this model you will probably have to deal with conflict resolution and reconciliation later.
- Profile the different models. Build a data-access abstraction layer (like an ORM) into your code so you can refactor your data-store structure later; a sketch of such a layer also follows below.
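For the batch-job idea, here is a minimal sketch that rebuilds a flattened read-model collection out-of-band, e.g. from cron or a scheduler. The collection names (events, users, user_events, event_views) are hypothetical:

```python
# Rebuild a flattened "read model" collection out-of-band from the normalized data.
from pymongo import MongoClient

db = MongoClient().myapp   # hypothetical database

def rebuild_event_views():
    for event in db.events.find():
        attendee_ids = [ue["user_id"] for ue in db.user_events.find({"event_id": event["_id"]})]
        attendees = list(db.users.find({"_id": {"$in": attendee_ids}}, {"name": 1}))
        # Upsert one pre-joined document per event; readers hit only this collection.
        db.event_views.replace_one(
            {"_id": event["_id"]},
            {"_id": event["_id"], "title": event.get("title"), "attendees": attendees},
            upsert=True,
        )

if __name__ == "__main__":
    rebuild_event_views()   # run periodically; could also be a map/reduce job
```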
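For the queued-updates idea, a sketch of a background worker, assuming a pending_updates collection acts as the queue (all names are hypothetical):

```python
# Writers append to a queue collection instead of touching the flattened data directly;
# this worker drains the queue in the background.
import time
from pymongo import MongoClient

db = MongoClient().myapp

def enqueue_update(event_id, fields):
    db.pending_updates.insert_one({"event_id": event_id, "fields": fields, "ts": time.time()})

def worker():
    while True:
        job = db.pending_updates.find_one_and_delete({}, sort=[("ts", 1)])  # oldest first
        if job is None:
            time.sleep(1)
            continue
        # Apply the update to the flattened view. This is last-write-wins, so
        # conflict resolution / reconciliation still has to live somewhere.
        db.event_views.update_one({"_id": job["event_id"]}, {"$set": job["fields"]})
```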
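And the abstraction layer can be as small as a class that hides the storage layout, so swapping normalized for flattened (or adding a cache) later doesn't ripple through the rest of the code. Again, the names are made up:

```python
# Callers only see get_event()/rsvp(); how the data is laid out in Mongo can change later.
from pymongo import MongoClient

class EventRepository:
    def __init__(self, db):
        self.db = db

    def get_event(self, event_id):
        # Today: read the pre-built flattened view.
        # Tomorrow this could assemble normalized collections or hit a cache instead.
        return self.db.event_views.find_one({"_id": event_id})

    def rsvp(self, user_id, event_id):
        self.db.user_events.insert_one({"user_id": user_id, "event_id": event_id})

repo = EventRepository(MongoClient().myapp)
```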
There are lots of other ideas you can employ. There are tons of great blogs online that go into this, like highscalabilty.org, and make sure you understand the CAP theorem.
Also consider a caching layer like Redis or memcached. I will put one of those products in front of my data layer. When I query mongo (which stores everything normalized), I use the data to build a flattened view and store it in the cache. When I update the data, I invalidate anything in the cache that references what I'm updating. (Although you have to weigh the time it takes to invalidate data, and to track which cached data an update affects, against your scaling factors.) Someone once said, "There are only two hard things in Computer Science: cache invalidation and naming things."
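A minimal cache-aside sketch of that pattern with redis-py; the key scheme and the flatten_event() helper are hypothetical:

```python
# Cache-aside: serve flattened views from Redis, fall back to Mongo, invalidate on write.
import json
import redis
from pymongo import MongoClient

db = MongoClient().myapp
cache = redis.Redis()

def flatten_event(event_id):
    # Hypothetical helper that assembles the flattened view from normalized collections.
    event = db.events.find_one({"_id": event_id})
    attendee_ids = [ue["user_id"] for ue in db.user_events.find({"event_id": event_id})]
    event["attendees"] = list(db.users.find({"_id": {"$in": attendee_ids}}, {"_id": 0, "name": 1}))
    return event

def get_event(event_id):
    key = "event:%s" % event_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    view = flatten_event(event_id)
    cache.setex(key, 300, json.dumps(view, default=str))  # 5 minute TTL as a safety net
    return view

def update_event(event_id, fields):
    db.events.update_one({"_id": event_id}, {"$set": fields})
    cache.delete("event:%s" % event_id)   # invalidate anything referencing this event
```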
Hope this helps!
Zac Bowling Oct 24 '10 at 20:09