Too much data duplication in mongodb?

I am new to all of this NoSQL stuff and have recently been intrigued by MongoDB. I am creating a new website from scratch and decided to go with MongoDB/NoRM (for C#) as my only database. I have read a lot about how to properly design a document-model database, and I think for the most part my design is fairly sound. I am about 6 months into the new site, and I am starting to run into data duplication/synchronization problems that I need to deal with over and over again. From what I have read, this is expected with the document model, and it makes sense for performance: you embed objects in your documents so reads are fast, with no joins. But of course you cannot always embed, so MongoDB has the concept of a DBReference, which is basically analogous to a foreign key in a relational database.
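
Roughly, the two shapes I am talking about look like this (collection and field names simplified):

```
// Embedded: attendee summaries are copied into the event, so one read renders the page.
db.events.insert({
    name: "MongoDB Meetup",
    attendees: [
        { _id: ObjectId(), name: "Alice" },
        { _id: ObjectId(), name: "Bob" }
    ]
});

// Referenced (DBRef-style): the event only stores ids, so rendering the
// attendee list needs a second query against the users collection.
db.events.insert({
    name: "NoSQL Drinks",
    attendeeIds: [ ObjectId(), ObjectId() ]
});
```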

So here is an example: I have Users and Events; both get their own documents, users attend events, and events have users. I decided to embed a list of Events with limited data into my User objects, and to embed a list of Users into the Event objects as their "attendees". The problem now is that I have to keep the Users in sync with the attendee lists that are also embedded in the Event objects. From what I read, this is apparently the preferred approach, the NoSQL way of doing things. Retrieval is fast, but the downside is that when I update the main User document, I also have to go into the Event objects, find every reference to that user, and update those too.
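
Concretely, when a user changes their name, the write ends up looking something like this (a sketch with my field names simplified):

```
// Hypothetical id of the user being renamed.
var userId = ObjectId();

// 1. Update the main User document.
db.users.update({ _id: userId }, { $set: { name: "New Name" } });

// 2. Then chase down every Event that embeds this user and fix the copy.
//    The positional $ updates the matching attendee entry; the last argument
//    (multi = true) applies it across all matching events.
db.events.update({ "attendees._id": userId },
                 { $set: { "attendees.$.name": "New Name" } },
                 false, true);
```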

So the question I have is: is this a fairly common problem that people have to deal with? How often does this have to happen before you start saying "maybe the NoSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage, because you're struggling to keep data in sync across embedded objects and making several writes to the database to do it?

+49
duplicates mongodb norm
Oct 24 '10 at 19:35
3 answers

Well, this is the trade-off with document stores. You can store data in a normalized manner, like any standard RDBMS, and you should strive for normalization as much as possible. It's only where normalization hurts performance that you should break it and flatten your data structures. The trade-off is read efficiency versus update cost.

Mongo has really efficient indexes, which can make normalizing easier, as in a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can create a collection of relationships between users and events. It is analogous to a surrogate (join) table in a tabular data store. Indexing the event and user fields should make it pretty fast and will help you keep your data normalized.
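
For example, something like this (the collection and field names are placeholders I picked for illustration):

```
// Hypothetical ids - in practice these come from the existing documents.
var userId  = ObjectId();
var eventId = ObjectId();

// A link collection between users and events, instead of embedding full copies.
db.userEvents.insert({ userId: userId, eventId: eventId });

// Index both sides so lookups in either direction stay fast.
db.userEvents.ensureIndex({ userId: 1 });
db.userEvents.ensureIndex({ eventId: 1 });

// "Who is attending this event?" - two quick indexed queries instead of a join.
var links = db.userEvents.find({ eventId: eventId }).toArray();
var attendeeIds = links.map(function (l) { return l.userId; });
var attendees = db.users.find({ _id: { $in: attendeeIds } }).toArray();
```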

The way I weigh flattening the structure versus keeping it normalized is to compare how often I will need to update those records against how often I need to read them in queries. You can do this in terms of big-O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different data models, and get a rough feel for how much work is involved.

Basically, I first try to predict the probability of how many updates a record will get versus how often it will be read. Then I try to estimate the cost of an update versus the cost of a read, both when the data is normalized and when it is flattened (or maybe some partial combination of the two that I can conceive of... there are lots of optimization options). I can then judge the savings of keeping it flat against the cost of building the data up from the normalized sources. Once I have plotted all the variables, if keeping it flat saves me a bunch, then I will keep it flat.
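
For example, a quick back-of-envelope comparison might look like this (the numbers are completely made up; plug in your own measurements):

```
// Illustrative numbers only.
var eventPageReadsPerDay   = 50000;  // how often attendee lists are rendered
var userProfileEditsPerDay = 200;    // how often a user changes name/avatar
var eventsPerUser          = 20;     // average fan-out of one profile edit

// Flattened: each read is one document fetch; each edit touches ~20 events.
var flattenedCost  = eventPageReadsPerDay * 1 + userProfileEditsPerDay * eventsPerUser;

// Normalized: each read needs an extra query to load attendees; edits touch one doc.
var normalizedCost = eventPageReadsPerDay * 2 + userProfileEditsPerDay * 1;

print("flattened: " + flattenedCost + ", normalized: " + normalizedCost);
```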

A few tips:

  • If you need lookups to be quick and atomic (100% up to date), you will probably favor flattening over normalization and take the hit on the update.
  • If you need updates to be fast and access to be immediate, then favor normalization.
  • If you need fast lookups but can tolerate slightly stale data, consider building out the normalized data in batch jobs (possibly using map/reduce).
  • If your queries need to be fast, updates are rare, and the update does not necessarily need to be visible immediately or require transaction-level locking that it went through 100% of the time (to guarantee your update was written to disk), you can consider writing your updates to a queue and processing them in the background; there is a sketch of this after the list. (In this model you will probably have to deal with conflict resolution and reconciliation later.)
  • Profile different models. Build a data-query abstraction layer (like an ORM, in a way) into your code so that you can refactor your data-store structure later.
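
Here is a rough sketch of that queued-update idea (the collection names, field names, and job format are just assumptions of mine):

```
// Hypothetical id of the user whose profile changed.
var userId = ObjectId();

// The web request just records the intent and returns quickly.
db.updateQueue.insert({
    userId: userId,
    changes: { name: "New Name" },
    createdAt: new Date(),
    state: "pending"
});

// A background worker claims one job at a time and fans the change out.
var job = db.updateQueue.findAndModify({
    query:  { state: "pending" },
    sort:   { createdAt: 1 },
    update: { $set: { state: "processing" } }
});
if (job) {
    db.users.update({ _id: job.userId }, { $set: job.changes });
    db.events.update({ "attendees._id": job.userId },
                     { $set: { "attendees.$.name": job.changes.name } },
                     false, true);   // upsert = false, multi = true
    db.updateQueue.update({ _id: job._id }, { $set: { state: "done" } });
}
```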

There are lots of other ideas you can apply. There are many great blogs online that go into this, like highscalabilty.org, and make sure you understand the CAP theorem.

Also consider a caching layer like Redis or memcached. I will put one of those products in front of my data layer. When I query Mongo (which is storing everything normalized), I use the data to build a flattened representation and store it in the cache. When I update the data, I invalidate anything in the cache that references what I am updating. (Although you do have to spend time invalidating data and tracking which data in the cache gets updated, depending on your scaling factors.) Someone once said: "The two hardest things in Computer Science are naming things and cache invalidation."
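
A rough sketch of that cache-aside pattern (a plain shell object stands in for Redis/memcached here so it can run in the mongo shell, and the collection and field names are assumptions):

```
var cache = {};                            // stand-in for Redis/memcached
var eventId = ObjectId();                  // hypothetical id

function getEventPage(eventId) {
    var key = "event:" + eventId;
    if (cache[key]) return cache[key];     // fast path: flattened view already built

    // Slow path: read the normalized data and flatten it for display.
    var event = db.events.findOne({ _id: eventId });
    if (!event) return null;
    var attendees = db.users.find({ _id: { $in: event.attendeeIds || [] } }).toArray();
    var view = { name: event.name, attendees: attendees };

    cache[key] = view;
    return view;
}

function updateEvent(eventId, changes) {
    db.events.update({ _id: eventId }, { $set: changes });
    delete cache["event:" + eventId];      // invalidate the flattened view
}
```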

Hope this helps!

+46
Oct 24 '10 at 20:09

Try adding an IList of type UserEvent to your User object. You haven't told us much about how your domain model is designed. Check out the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.

0
Oct 24 '10 at 23:18

At 1Schema.com we are trying to solve this very problem!

Our goal is to help you design your NoSQL database so that each page request can fetch its data in a single read operation.

To this end, we distinguish between "real data" (data that actually lives inside the document) and "cached data" (data that lives outside the document but is copied locally into it).

Updating the "real data" in a document automatically updates all of its cached copies.

Conversely, you should never update the "cached data" in a document directly, since that data is kept up to date automatically by our change-propagation code.

To this end, we use different types of edges to denote different kinds of behavior:

"Parent-child" edges define data whose existence is tied to a single root document, which yields a nested array of subdocuments (the "Aggregate" pattern in domain-driven design).

"Foreign key" edges define how other documents are cached locally within a given document based on identifier references (cross-aggregate references in DDD).

Instead of using the schema to enforce constraints, we use the schema to automate updating the cached data whenever the original document(s) change.
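
As a generic illustration (not actual 1Schema output; the field names are just examples), a document mixing the two kinds of data might look like this:

```
db.events.insert({
    name: "MongoDB Meetup",
    // Parent-child: these subdocuments only exist as part of this event.
    sessions: [
        { title: "Schema design", startsAt: new Date() }
    ],
    // Foreign key plus local cache: the id is the real reference, while
    // name/avatarUrl are cached copies maintained by change propagation.
    attendees: [
        { userId: ObjectId(), name: "Alice", avatarUrl: "http://example.com/a.png" }
    ]
});
```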

Please check out our export to MongoDB to see how we automate change propagation... all of the exported code runs in the Mongo shell; no special libraries are needed.

-2
Dec 07 '16 at 4:23


