NoSQL databases such as Couchbase keep a large number of documents in memory; this is the source of their speed, but it also raises the RAM requirements of the server(s) they run on.
I am trying to choose between several opposing strategies for storing documents in a NoSQL database, namely:
Putting all the information into one (large) document has the advantage that a single GET retrieves everything, either from memory or from disk (if it has been evicted from memory). With schema-less NoSQL databases this is almost the natural choice. But eventually the document grows too large, eats up a lot of memory, and fewer documents fit in memory overall.

Splitting the data into several documents (for example, using compound keys, as described in this question: Designing record keys for a document-oriented database best practices), especially when each document contains only the information needed for a particular read/update operation, allows more of the (transient) documents to be kept in memory.
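To make the trade-off concrete, here is a minimal sketch of the two layouts, using plain Python dicts to stand in for JSON documents (all field names and the example MSISDN are made up for illustration):

```python
# Strategy 1: one large document per subscriber -- a single GET fetches
# everything, but the whole document sits in cache even when only one
# small field is needed.
big_doc = {
    "profile": {"name": "Ani", "age": 27, "gender": "f"},
    "revenue": {"total": 0.0, "counters": {}},
    "optins": [],
}

# Strategy 2: split into per-concern documents under compound keys --
# each read/update touches a small document, so more of the hot,
# transient documents fit in RAM.
msisdn = "6281234567890"  # hypothetical example number
small_docs = {
    f"{msisdn}:profile": {"name": "Ani", "age": 27, "gender": "f"},
    f"{msisdn}:revenue": {"total": 0.0, "counters": {}},
    f"{msisdn}:optin": [],
}
```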
The use case I'm looking at is call detail records (CDRs) from telecommunications providers. These CDRs typically arrive at a rate of hundreds of millions per day. However, many of these customers do not generate a record every single day (I'm looking at the Southeast Asian market, where usage is sparser and the data is less rich). This means that a large number of documents get a read/update perhaps once a day, and only a small percentage will go through several read/update cycles per day.
One solution suggested to me was to build two buckets: more RAM allocated to the one holding the more transient documents, and less RAM allocated to the second bucket, where the large documents are stored. This would allow faster access to the more transient data and slower access to the larger documents, which contain, for example, profile/user information that hardly changes at all. I see two drawbacks to this proposal: one is that you cannot create a view (Map/Reduce) across two buckets (this is specific to Couchbase; other NoSQL solutions may not have this limitation), and the other is the extra overhead of continually re-balancing the memory allocation between the two buckets as the user base grows.
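The two-bucket proposal can be sketched as a simple routing rule: documents are assigned to a bucket by their key suffix. This is only an illustration of the idea, not a Couchbase API; the bucket names and the routing rule are my own invention:

```python
# Hypothetical routing: transient per-subscriber documents go to a
# high-RAM "hot" bucket, the large, rarely-changing documents to a
# low-RAM "cold" bucket.
HOT_SUFFIXES = (":revenue", ":optin")

def pick_bucket(key: str) -> str:
    """Return the bucket name for a document key (illustrative rule)."""
    if key.endswith(HOT_SUFFIXES):
        return "hot"   # larger RAM quota, frequently updated counters
    return "cold"      # smaller RAM quota, stable profile data
```

The second drawback mentioned above shows up here: the RAM quota behind "hot" and "cold" has to be re-tuned by hand as the mix of documents changes.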
Has anyone else wrestled with this, and what is your solution to the problem? What would be the best strategy from your POV, and why? Clearly the answer lies somewhere between the two extremes: having only one large document, or having it split into hundreds of documents, cannot be the ideal solution IMO.
EDIT 2014-09-14: Although this comes close to answering my own question, in the absence of any proposed solutions here is a bit more background. Following the comments, this is how I plan to organize my data, trying to hit the sweet spot between speed and memory consumption:
Mobile_No:Profile
- contains profile information taken from lookup tables, not directly from the CDRs. The less transient data goes here, such as age, gender, and name. The key is a composite key consisting of the mobile number (MSISDN) and the word profile, separated by a ":"
Mobile_No:Revenue
- contains transient information such as usage counters and variables accumulating the total revenue the customer has spent. The key is again a composite key consisting of the mobile number (MSISDN) and the word revenue, separated by a ":"
Mobile_No:Optin
- contains semi-transient information about when the customer opted in to a program and when they opted out again. This can happen several times and is handled via an array. The key is again a composite key consisting of the mobile number (MSISDN) and the word optin, separated by a ":"
connection_id
- contains information about a specific A/B connection (sender/receiver) made via a voice or video call or an SMS/MMS. The key consists of both mobile numbers (mobile_no) concatenated.
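To illustrate the scheme above, a short sketch in Python; the keys follow the compound-key layout just described, while the document bodies and the example numbers are made up:

```python
def doc_key(msisdn: str, kind: str) -> str:
    """Build a compound key: MSISDN and document kind, ':'-separated."""
    return f"{msisdn}:{kind}"

msisdn_a = "6281234567890"  # hypothetical A-number
msisdn_b = "6287654321098"  # hypothetical B-number

documents = {
    doc_key(msisdn_a, "profile"): {"name": "Ani", "age": 27, "gender": "f"},
    doc_key(msisdn_a, "revenue"): {"total_revenue": 12.5, "sms_count": 40},
    doc_key(msisdn_a, "optin"): {"events": [{"in": "2014-09-01", "out": None}]},
    # connection documents are keyed by the two MSISDNs concatenated
    msisdn_a + msisdn_b: {"type": "sms", "a": msisdn_a, "b": msisdn_b},
}
```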
Before this change in document structure, I put all the profile, revenue, and optin information into one large document, always keeping connection_id as a separate document. This new storage strategy gives me hope for a better compromise between speed and memory consumption, because I split the main document into several documents so that each holds only the information that is read/updated in one step of the application.
It also accounts for the different rates of change over time: some data is highly transient (for example, the counters and the cumulative revenue field, which are updated with every incoming CDR), while the profile information is essentially static. I hope this gives a better picture of what I'm trying to achieve; comments and feedback are more than welcome.
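Under this split, the per-CDR update path then only touches the small revenue document and leaves the profile document alone. A sketch with an in-memory dict standing in for the key-value store (field names and charge values are hypothetical):

```python
# In-memory stand-in for the document store.
store = {
    "6281234567890:profile": {"name": "Ani", "age": 27},  # rarely written
    "6281234567890:revenue": {"total": 0.0, "calls": 0},  # hot document
}

def apply_cdr(msisdn: str, charge: float) -> None:
    """Fold one incoming CDR into the transient revenue document only."""
    doc = store[f"{msisdn}:revenue"]
    doc["total"] += charge
    doc["calls"] += 1

# Two CDRs arrive; only the small revenue document is rewritten.
apply_cdr("6281234567890", 0.05)
apply_cdr("6281234567890", 0.10)
```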