NoSQL databases such as Couchbase keep a large number of documents in memory; this is the source of their speed, but it also raises the RAM requirements of the server(s) they run on.
I am trying to choose between several opposing strategies for storing documents in a NoSQL database, namely:
Putting all the information into one (large) document has the advantage that a single GET retrieves everything, either from memory or from disk (if it has been evicted from memory). With schema-less NoSQL databases this is almost the natural choice. But eventually the document grows too large, eats up a lot of memory, and fewer documents fit in memory overall.

Splitting the data into several documents (for example, using compound keys, as described in this question: Designing record keys for a document-oriented database best practices), especially when each document contains only the information needed for a particular read/update operation, allows more of the (transient) documents to be kept in memory.
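To make the trade-off concrete, here is a minimal sketch of the two layouts, using plain Python dicts to stand in for JSON documents (all field names and the example MSISDN are made up for illustration):

```python
# Strategy 1: one large document per subscriber -- a single GET fetches
# everything, but the whole document sits in cache even when only one
# small field is needed.
big_doc = {
    "profile": {"name": "Ani", "age": 27, "gender": "f"},
    "revenue": {"total": 0.0, "counters": {}},
    "optins": [],
}

# Strategy 2: split into per-concern documents under compound keys --
# each read/update touches a small document, so more of the hot,
# transient documents fit in RAM.
msisdn = "6281234567890"  # hypothetical example number
small_docs = {
    f"{msisdn}:profile": {"name": "Ani", "age": 27, "gender": "f"},
    f"{msisdn}:revenue": {"total": 0.0, "counters": {}},
    f"{msisdn}:optin": [],
}
```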
The use case I'm looking at is call detail records (CDRs) from telecommunications providers. These CDRs typically arrive at a rate of hundreds of millions per day. However, many of these customers do not generate a record every single day (I'm looking at the Southeast Asian market, where usage is sparser and the data is less rich). This means that a large number of documents get a read/update perhaps once a day, and only a small percentage will go through several read/update cycles per day.
One solution suggested to me was to build two buckets: more RAM allocated to the one holding the more transient documents, and less RAM allocated to the second bucket, where the large documents are stored. This would allow faster access to the more transient data and slower access to the larger documents, which contain, for example, profile/user information that hardly changes at all. I see two drawbacks to this proposal: one is that you cannot create a view (Map/Reduce) across two buckets (this is specific to Couchbase; other NoSQL solutions may not have this limitation), and the other is the extra overhead of continually re-balancing the memory allocation between the two buckets as the user base grows.
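The two-bucket proposal can be sketched as a simple routing rule: documents are assigned to a bucket by their key suffix. This is only an illustration of the idea, not a Couchbase API; the bucket names and the routing rule are my own invention:

```python
# Hypothetical routing: transient per-subscriber documents go to a
# high-RAM "hot" bucket, the large, rarely-changing documents to a
# low-RAM "cold" bucket.
HOT_SUFFIXES = (":revenue", ":optin")

def pick_bucket(key: str) -> str:
    """Return the bucket name for a document key (illustrative rule)."""
    if key.endswith(HOT_SUFFIXES):
        return "hot"   # larger RAM quota, frequently updated counters
    return "cold"      # smaller RAM quota, stable profile data
```

The second drawback mentioned above shows up here: the RAM quota behind "hot" and "cold" has to be re-tuned by hand as the mix of documents changes.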
Has anyone else wrestled with this, and what is your solution to the problem? What would be the best strategy from your POV, and why? Clearly the answer lies somewhere between the two extremes: having only one large document, or having it split into hundreds of documents, cannot be the ideal solution IMO.
EDIT 2014-09-14: Although this comes close to answering my own question, in the absence of any proposed solutions here is a bit more background. Following the comments, this is how I plan to organize my data, trying to hit the sweet spot between speed and memory consumption:
Mobile_No:Profile
- contains profile information taken from lookup tables, not directly from the CDRs. The less transient data goes here, such as age, gender, and name. The key is a composite key consisting of the mobile number (MSISDN) and the word profile, separated by a ":"
Mobile_No:Revenue
- contains transient information such as usage counters and variables accumulating the total revenue the customer has spent. The key is again a composite key consisting of the mobile number (MSISDN) and the word revenue, separated by a ":"
Mobile_No:Optin
- contains semi-transient information about when the customer opted in to a program and when they opted out again. This can happen several times and is handled via an array. The key is again a composite key consisting of the mobile number (MSISDN) and the word optin, separated by a ":"
connection_id
- contains information about a specific A/B connection (sender/receiver) made via a voice or video call or an SMS/MMS. The key consists of both mobile numbers (mobile_no) concatenated.
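To illustrate the scheme above, a short sketch in Python; the keys follow the compound-key layout just described, while the document bodies and the example numbers are made up:

```python
def doc_key(msisdn: str, kind: str) -> str:
    """Build a compound key: MSISDN and document kind, ':'-separated."""
    return f"{msisdn}:{kind}"

msisdn_a = "6281234567890"  # hypothetical A-number
msisdn_b = "6287654321098"  # hypothetical B-number

documents = {
    doc_key(msisdn_a, "profile"): {"name": "Ani", "age": 27, "gender": "f"},
    doc_key(msisdn_a, "revenue"): {"total_revenue": 12.5, "sms_count": 40},
    doc_key(msisdn_a, "optin"): {"events": [{"in": "2014-09-01", "out": None}]},
    # connection documents are keyed by the two MSISDNs concatenated
    msisdn_a + msisdn_b: {"type": "sms", "a": msisdn_a, "b": msisdn_b},
}
```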
Before this change in document structure, I put all the profile, revenue, and optin information into one large document, always keeping connection_id as a separate document. This new storage strategy gives me hope for a better compromise between speed and memory consumption, because I split the main document into several documents so that each holds only the information that is read/updated in one step of the application.
It also accounts for the different rates of change over time: some data is highly transient (for example, the counters and the cumulative revenue field, which are updated with every incoming CDR), while the profile information is essentially static. I hope this gives a better picture of what I'm trying to achieve; comments and feedback are more than welcome.
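Under this split, the per-CDR update path then only touches the small revenue document and leaves the profile document alone. A sketch with an in-memory dict standing in for the key-value store (field names and charge values are hypothetical):

```python
# In-memory stand-in for the document store.
store = {
    "6281234567890:profile": {"name": "Ani", "age": 27},  # rarely written
    "6281234567890:revenue": {"total": 0.0, "calls": 0},  # hot document
}

def apply_cdr(msisdn: str, charge: float) -> None:
    """Fold one incoming CDR into the transient revenue document only."""
    doc = store[f"{msisdn}:revenue"]
    doc["total"] += charge
    doc["calls"] += 1

# Two CDRs arrive; only the small revenue document is rewritten.
apply_cdr("6281234567890", 0.05)
apply_cdr("6281234567890", 0.10)
```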