Upsert performance degrades as the collection grows (number of documents)

Use Case:

I consume a REST API that returns the results of battles in a video game. It is an online team-versus-team game, and each team consists of 3 players, each of whom can choose from 100 different characters. I want to count the number of wins, losses and draws for each combination of teams. I receive about 1000 battle results per second. I concatenate the character identifiers of each team (in ascending order) and then store the wins, losses and draws for each combination.
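For illustration, a minimal sketch of how such a combination key could be built (the function name and separator are hypothetical, not taken from the original code):

// Build a deterministic key for a team: sort its character IDs ascending and join them.
function buildCombinationKey(characterIds: number[]): string {
  return [...characterIds].sort((a, b) => a - b).join('-');
}

// e.g. buildCombinationKey([42, 7, 99]) === '7-42-99'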

My current implementation:

const combinationStatsSchema: Schema = new Schema({
  combination: { type: String, required: true, index: true },
  gameType: { type: String, required: true, index: true },
  wins: { type: Number, default: 0 },
  draws: { type: Number, default: 0 },
  losses: { type: Number, default: 0 },
  totalGames: { type: Number, default: 0, index: true },
  battleDate: { type: Date, index: true, required: true }
});

For each battle log returned, I perform an upsert, and I send these requests in bulk (5-30 operations per batch) to MongoDB:

const filter: any = { combination: log.teamDeck, gameType, battleDate };
if (battleType === BattleType.PvP) {
  filter.arenaId = log.arena.id;
}
const update: {} = { $inc: { draws, losses, wins, totalGames: 1 } };
combiStatsBulk.find(filter).upsert().updateOne(update);
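The snippet does not show how combiStatsBulk is created and flushed; below is a minimal sketch, assuming Mongoose plus the driver's unordered bulk API (the model name, helper name and log fields are illustrative, not the original code):

import { model } from 'mongoose';

// Assumed: combinationStatsSchema is the schema shown above.
const CombinationStats = model('CombinationStats', combinationStatsSchema);

// One unordered bulk operation per batch of 5-30 battle logs.
async function flushBatch(logs: any[], gameType: string, battleDate: Date): Promise<void> {
  const combiStatsBulk = CombinationStats.collection.initializeUnorderedBulkOp();
  for (const log of logs) {
    const filter = { combination: log.teamDeck, gameType, battleDate };
    // wins / draws / losses would be derived from the log entry (illustrative).
    const update = { $inc: { draws: log.draws, losses: log.losses, wins: log.wins, totalGames: 1 } };
    combiStatsBulk.find(filter).upsert().updateOne(update);
  }
  // Send the whole batch to MongoDB in a single round trip.
  await combiStatsBulk.execute();
}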

My problem:

As long as my combinationStats collection holds only a few thousand documents, MongoDB uses only 0-2% CPU. Once the collection reaches several million documents (which happens quite quickly given the number of possible combinations), MongoDB constantly sits at 50-100% CPU. Obviously, my approach does not scale at all.

My question is:

Either of these options could be the solution to my problem above:

  • Can I optimize the performance of my MongoDB solution described above so that it does not consume so much CPU? (I have already indexed the fields I filter on, and I do the upserts in bulk.) Would it help to create a hash (based on all the filter fields) and filter on that instead to improve performance?
  • Is there a database / technology better suited to aggregating data like this? I can think of a couple of other use cases where I need to increment a counter for a given identifier.

Edit: After khang commented that this could be upsert-related, I replaced my $inc with $set, and the performance was indeed equally "poor". I then tried the suggested find() followed by a manual update() approach, but the results did not get any better.

2 answers

Create a hash of your filter conditions:

I was able to reduce CPU usage from 80-90% to 1-5% and achieved higher throughput.

Apparently, the problem was the filter. Instead of filtering on these three conditions: { combination: log.teamDeck, gameType, battleDate }, I computed a 128-bit hash in my node application. I used this hash for the upsert and set combination, gameType and battleDate as additional fields in the update document.

To create the hash, I used the metrohash library, which can be found here: https://github.com/jandrewrogers/MetroHash. Unfortunately, I cannot explain why the performance is so much better, especially since I had indexed all of my previous conditions.
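A minimal sketch of this approach is shown below. It substitutes Node's built-in crypto (md5 also yields 128 bits) for the MetroHash binding purely as a readily available stand-in, and the combinationHash field name, the $set of the extra fields and the bulk/log parameters are assumptions:

import { createHash } from 'crypto';

// Collapse the three former filter fields into one 128-bit key.
function filterHash(combination: string, gameType: string, battleDate: Date): string {
  return createHash('md5')
    .update(`${combination}|${gameType}|${battleDate.toISOString()}`)
    .digest('hex');
}

function addUpsert(bulk: any, log: any, gameType: string, battleDate: Date): void {
  const hash = filterHash(log.teamDeck, gameType, battleDate);
  bulk
    .find({ combinationHash: hash }) // a single indexed equality match instead of three
    .upsert()
    .updateOne({
      $inc: { wins: log.wins, draws: log.draws, losses: log.losses, totalGames: 1 },
      // Keep the original fields as plain (non-filter) data on the document.
      $set: { combination: log.teamDeck, gameType, battleDate },
    });
}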


In (1.) you state that you already do bulk upserts, but judging from how it scales, you are probably sending too few rows per batch. Consider doubling the batch size each time the number of stored rows doubles. Please post mongo's explain() query plan for your setup.
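For reference, a query plan for that filter could be pulled, for example, through Mongoose's explain() (a sketch; CombinationStats is the model name assumed earlier and the filter values are placeholders):

// Inspect which index, if any, the filter actually uses.
const plan = await CombinationStats
  .find({ combination: someCombination, gameType, battleDate })
  .explain('executionStats');
console.log(JSON.stringify(plan, null, 2));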

In (2.) you consider switching to, say, mysql or postgres. Yes, that would be a perfectly valid experiment. Again, be sure to post the EXPLAIN output along with your timing data.

There are only about a million possible team compositions, and they follow a distribution in which some are much more popular than others. You only need to maintain a million counters, which is not that many. However, 1e6 I/Os can take a while, especially if they are random reads. Consider moving away from the disk-resident data structure you COMMIT to frequently, toward a memory-resident hash or b-tree. It does not sound as though ACID durability guarantees matter much to your application.

In addition, once you have accumulated "large" input batches, certainly more than a thousand and perhaps on the order of a million rows, do sort each batch before processing it. Then the counter-maintenance problem looks just like a merge, either in memory or against external storage.

One fundamental approach to scaling up your batches is to accumulate observations in a sorted in-memory buffer of a convenient size, and only release aggregated (summed) observations from that pipeline stage, to Mongo or whatever the next stage in your pipeline is, once the number of distinct team compositions in the buffer exceeds some threshold K. If K is much more than 1% of 1e6, then even a sequential scan of the counters stored on disk has a good chance of finding useful update work to do for each disk block read.
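A minimal sketch of such a buffering stage, assuming a string combination key and leaving the downstream flush (for example, one $inc upsert per key) to the caller; the threshold and all names are illustrative:

// Accumulate per-combination increments in memory and release aggregated counts
// downstream only once the number of distinct keys exceeds a threshold K.
interface Counts { wins: number; draws: number; losses: number; totalGames: number }

const K = 50_000; // illustrative threshold
const buffer = new Map<string, Counts>();

function observe(key: string, wins: number, draws: number, losses: number): void {
  const c = buffer.get(key) ?? { wins: 0, draws: 0, losses: 0, totalGames: 0 };
  c.wins += wins;
  c.draws += draws;
  c.losses += losses;
  c.totalGames += 1;
  buffer.set(key, c);
}

function maybeFlush(emit: (key: string, counts: Counts) => void): void {
  if (buffer.size < K) return;
  // Emitting in sorted key order makes the downstream counter update a merge pass.
  for (const key of [...buffer.keys()].sort()) {
    emit(key, buffer.get(key)!); // e.g. queue one $inc upsert per key
  }
  buffer.clear();
}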


Source: https://habr.com/ru/post/1274363/

