Can CouchDB handle thousands of individual databases?

Can CouchDB process thousands of individual databases on a single computer?

Imagine you have a collection of BankTransaction documents. There are many thousands of entries. (EDIT: don't actually store transactions; just think of a very large number of very small, frequently updated records. This is basically a join table from SQL-land.)

Every day you need a consolidated overview of the transactions that occurred at just one branch of your local bank. If all records live in the same database, view regeneration has to process every transaction from every branch. That is far more work, and unnecessary for a user who only cares about his own subset of documents.

This makes it seem that each branch of the bank should get its own database, so that views are built in small pieces, independently of one another. But I have never heard of anyone doing this, and it looks like an anti-pattern (for example, duplicating the same design document across thousands of different databases).

Is there another way to model this problem? (Should the split happen between separate machines rather than between separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it would take to store such small partitions?

(Thanks!)

3 answers

[Warning: I assume you are running this in some kind of production environment. Just go with the short answer if this is for school or a pet project.]

The short answer is yes.

The longer answer is that there are some things you need to observe ...

  • You will be playing whack-a-mole with a number of system settings, such as the maximum file descriptor limit.

  • You will also be playing whack-a-mole with Erlang VM settings.

  • CouchDB has a "max open databases" option (`max_dbs_open` in the `[couchdb]` section of local.ini). Increase it, or requests will pile up pending.

  • It will be a PITA to aggregate multiple databases for reporting. You can do it by polling each database's _changes feed, massaging the data, and then writing it back into a central/aggregating database. The tooling to make this simpler just does not exist yet in the CouchDB API. Almost, but not quite.
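The polling-and-massaging approach from the last bullet can be sketched roughly as below. Everything here is an assumption for illustration: the server URL, the `central` database name, the `_id` namespacing scheme, and the fact that no auth or conflict handling is shown. A real aggregator would also have to persist `last_seq` and resolve update conflicts in the central db.

```python
import json
import urllib.request

COUCH = "http://localhost:5984"  # assumption: local CouchDB, no auth


def to_aggregate_doc(branch_db, change):
    """Rewrite one _changes row from a branch database into a document
    for the central/aggregating database. The _id is namespaced with the
    branch db name so documents from different branches cannot collide."""
    doc = dict(change["doc"])
    doc["_id"] = f"{branch_db}:{doc['_id']}"
    doc.pop("_rev", None)  # let the central db track its own revisions
    doc["source_db"] = branch_db
    return doc


def poll_branch(branch_db, since=0):
    """One polling pass over a branch's _changes feed (include_docs=true),
    bulk-writing the massaged documents into the central database."""
    url = f"{COUCH}/{branch_db}/_changes?include_docs=true&since={since}"
    with urllib.request.urlopen(url) as resp:
        feed = json.load(resp)
    docs = [to_aggregate_doc(branch_db, row)
            for row in feed["results"] if not row.get("deleted")]
    if docs:
        body = json.dumps({"docs": docs}).encode()
        req = urllib.request.Request(
            f"{COUCH}/central/_bulk_docs", data=body,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
    return feed["last_seq"]  # persist this to resume the next poll
```

You would run `poll_branch` per branch database, saving the returned sequence number between passes; that checkpointing is exactly the tooling the answer says you end up writing yourself.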

However, the biggest problem you will run into if you try this is that CouchDB does not scale horizontally [well]. If you add more CouchDB servers, they will all hold duplicate copies of the data. Sure, your max-open-dbs count will scale linearly with each node added, but other things, such as view build time, will not (for example, every node will have to do its own view builds).

That said, I have seen thousands of open databases on a BigCouch cluster. Anecdotally, that is thanks to its Dynamo-style clustering: more nodes doing different things in parallel, as opposed to standalone CouchDB servers duplicating each other.

Cheers.


Multiple databases are possible, but in most cases I think an aggregate database will actually give your branches better performance. Keep in mind that you only pay the view-update cost when documents change; each document is processed only once per view.
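One reason the aggregate database can still serve cheap per-branch overviews is that a single view can be keyed by branch, so each branch queries only its own key range. A minimal sketch of that idea follows; all the names in it (the `BankTransaction` type, the `branch_id`, `date`, and `amount` fields, the design doc name) are assumptions, not anything from the question.

```python
# A hypothetical design document for the single aggregate database.
# The map function keys each transaction by [branch, date], so one view
# serves every branch and a query selects just one branch's records.
design_doc = {
    "_id": "_design/transactions",
    "views": {
        "by_branch": {
            "map": """
                function (doc) {
                    if (doc.type === 'BankTransaction') {
                        emit([doc.branch_id, doc.date], doc.amount);
                    }
                }
            """,
            "reduce": "_sum",
        }
    },
}


def branch_query_params(branch_id):
    """Key range selecting a single branch from the by_branch view."""
    return {
        "startkey": [branch_id],
        "endkey": [branch_id, {}],  # {} sorts after any string/date
        "reduce": False,
    }
```

With `reduce=True` instead, the same view would give per-branch daily totals for free, which is the "consolidated overview" the question asks about.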

For end-of-day polling against the aggregate database, the first branch to poll will process 100% of the new documents and pay 100% of the delay; every other branch pays 0%. So most branches come out ahead. For end-of-day polling against separate databases, each branch pays a share of the penalty proportional to its own volume, so most of them come out slightly behind.

For frequent updates throughout the day, active branches prefer the aggregate database and low-activity branches prefer separate ones. If one branch out of 10 adds 99% of the documents, most of the update work will happen during the other branches' polls, so 9 out of 10 prefer separate dbs.
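The trade-off can be made concrete with toy arithmetic (the numbers and the one-unit-per-document indexing cost are hypothetical, chosen only to mirror the 99%/one-hot-branch scenario above):

```python
# Toy model: indexing costs 1 unit per not-yet-indexed document, and a
# branch pays for every such document its poll has to fold into the view.
def aggregate_first_poll_cost(daily_docs_per_branch):
    """In one aggregate db, the day's first poll indexes everything;
    later polls that day find the view fresh and pay nothing."""
    return sum(daily_docs_per_branch)


def separate_poll_cost(own_daily_docs):
    """In separate dbs, each branch indexes only its own documents."""
    return own_daily_docs


volumes = [990] + [1] * 9  # one hot branch, nine quiet ones
# Aggregate db: whoever polls first pays 999 units; the rest pay 0.
# Separate dbs: the hot branch pays 990; each quiet branch pays only 1.
assert aggregate_first_poll_cost(volumes) == 999
assert separate_poll_cost(volumes[0]) == 990
assert separate_poll_cost(volumes[1]) == 1
```

So under frequent polling the nine quiet branches each pay 1 unit with separate dbs but risk paying up to 999 with the aggregate, which is the "9 out of 10 prefer separate" intuition.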

If this latency matters, and the couch has some spare cycles, you could write a three-line loop/view/sleep shell script that folds new documents into the view before any user is waiting on it.
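That warming loop, sketched in Python rather than shell so the trigger is swappable; the view URL, database name, and 60-second interval are all assumptions. Querying a view (here with `limit=0`, so no rows come back) makes CouchDB update the index as a side effect.

```python
import time
import urllib.request

VIEW_URL = ("http://localhost:5984/transactions/"
            "_design/transactions/_view/by_branch?limit=0")  # assumption


def warm_view(url=VIEW_URL):
    """Hit the view so CouchDB folds any new documents into the index;
    limit=0 means we pay only the indexing cost, not row transfer."""
    urllib.request.urlopen(url).read()


def warming_loop(trigger=warm_view, interval=60.0, iterations=None):
    """Loop / view / sleep. iterations=None runs forever (the daemon
    case); a finite count is handy for testing or cron-driven runs."""
    n = 0
    while iterations is None or n < iterations:
        trigger()
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval)
    return n
```

The shell equivalent really is three lines: `while :; do curl -s "$VIEW_URL" > /dev/null; sleep 60; done`.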


I would add that having a large number of databases creates problems around compaction and replication. Not only do things like continuous replication need to be set up per database (meaning you have to write custom logic to iterate over all the databases), but they also spawn one replication daemon per database. That can quickly become unmanageable.
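The "custom logic to iterate over all the databases" ends up looking something like this sketch, which lists databases via `_all_dbs` and writes one document per database into the `_replicator` db. The server URLs, the `branch_` prefix, and the `repl-` id scheme are assumptions; note that each document written here is exactly one of those per-database replication daemons.

```python
import json
import urllib.request

COUCH = "http://localhost:5984"  # assumption: local node, no auth


def replication_doc(db, target_base="http://backup:5984"):
    """A _replicator document for one branch database (names assumed)."""
    return {
        "_id": f"repl-{db}",
        "source": f"{COUCH}/{db}",
        "target": f"{target_base}/{db}",
        "continuous": True,
        "create_target": True,
    }


def replicate_all_branches(prefix="branch_"):
    """Iterate every branch db and enqueue a continuous replication.
    The cost the answer warns about: one doc, and one replication
    daemon, per database."""
    with urllib.request.urlopen(f"{COUCH}/_all_dbs") as resp:
        dbs = [d for d in json.load(resp) if d.startswith(prefix)]
    for db in dbs:
        body = json.dumps(replication_doc(db)).encode()
        req = urllib.request.Request(
            f"{COUCH}/_replicator/repl-{db}", data=body, method="PUT",
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
    return dbs
```

With thousands of branch databases this loop enqueues thousands of continuous replications, which is exactly where it stops scaling.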


Source: https://habr.com/ru/post/911739/

