Updating a large number of records - optimizing performance

I have a baseball tool that lets users analyze a player's historical statistics, e.g. how many hits has A-Rod had in the last 7 days at night? I want to expand the timeframes so that users can analyze a player's batting statistics over windows of up to 365 days. However, this requires some serious performance optimization. Here is my current set of models:

class AtBat < ActiveRecord::Base
  belongs_to :batter
  belongs_to :pitcher
  belongs_to :weather_condition

  ### DATA MODEL ###
  # id
  # batter_id
  # pitcher_id
  # weather_condition_id
  # hit (boolean)
  ##################
end

class BattingStat < ActiveRecord::Base
  belongs_to :batter
  belongs_to :recordable, :polymorphic => true # e.g. Batter, Pitcher, WeatherCondition

  ### DATA MODEL ###
  # id
  # batter_id
  # recordable_id
  # recordable_type
  # hits7
  # outs7
  # at_bats7
  # batting_avg7
  # ...
  # hits365
  # outs365
  # at_bats365
  # batting_avg365
  ##################
end

class Batter < ActiveRecord::Base
  has_many :batting_stats, :as => :recordable, :dependent => :destroy
  has_many :at_bats, :dependent => :destroy
end

class Pitcher < ActiveRecord::Base
  has_many :batting_stats, :as => :recordable, :dependent => :destroy
  has_many :at_bats, :dependent => :destroy
end

class WeatherCondition < ActiveRecord::Base
  has_many :batting_stats, :as => :recordable, :dependent => :destroy
  has_many :at_bats, :dependent => :destroy
end

To keep my question a reasonable length, let me describe how I update the batting_stats table rather than pasting a bunch of code. Starting with the 7-day window:

  • Get all at_bat entries in the last 7 days.
  • Iterate over each at_bat entry ...
  • Given the at_bat entry, grab the related batter and the related weather_condition, find the matching batting_stat entry (BattingStat.find_or_create_by_batter_and_recordable(batter, weather_condition)), and then update that batting_stat entry.
  • Repeat step 3 for the batter and the pitcher (as the recordable).

Steps 1-4 are then repeated for the other time periods - 15 days, 30 days, etc. A rough sketch of this loop is below.
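
In Ruby the loop looks roughly like the following. This is only a sketch: the date filter on at_bats, the exact set of recordables, and the update arithmetic are simplifications, since I'm not pasting the real code.

# Sketch of the current per-at_bat update for the 7-day window.
# Assumes at_bats has a created_at timestamp (not shown in the data model above).
AtBat.where("created_at >= ?", 7.days.ago).find_each do |at_bat|
  [at_bat.weather_condition, at_bat.batter, at_bat.pitcher].each do |recordable|
    stat = BattingStat.where(
      :batter_id       => at_bat.batter_id,
      :recordable_id   => recordable.id,
      :recordable_type => recordable.class.name
    ).first_or_create

    # Illustrative update only; the real task recomputes hits7/outs7/at_bats7.
    stat.increment(:at_bats7)
    stat.increment(at_bat.hit? ? :hits7 : :outs7)
    stat.batting_avg7 = stat.hits7.to_f / stat.at_bats7
    stat.save!
  end
end

Every at_bat in the window touches three batting_stat rows, and the whole loop runs again for each additional time period, which is where the cost blows up.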

Now imagine how painful it would be to run this script every day to make these updates if I expand the time periods from a manageable 7/15/30 to 7/15/30/45/60/90/180/365.

So my question is: how would you approach this to get the best possible performance?

4 answers

AR is not really intended for bulk processing like this. You are probably better off doing batch updates by dropping down to SQL directly and doing an INSERT ... SELECT (or perhaps using a gem that does this for you).
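
As a sketch of what that might look like with the models above (assuming PostgreSQL, a created_at timestamp on at_bats, and writing into an empty or staging copy of batting_stats rather than upserting), the 7-day weather-condition splits can be built in one statement:

ActiveRecord::Base.connection.execute(<<-SQL)
  INSERT INTO batting_stats
    (batter_id, recordable_id, recordable_type, hits7, outs7, at_bats7, batting_avg7)
  SELECT batter_id,
         weather_condition_id,
         'WeatherCondition',
         SUM(CASE WHEN hit THEN 1 ELSE 0 END),
         SUM(CASE WHEN hit THEN 0 ELSE 1 END),
         COUNT(*),
         SUM(CASE WHEN hit THEN 1 ELSE 0 END)::float / COUNT(*)
  FROM at_bats
  WHERE created_at >= NOW() - INTERVAL '7 days'
  GROUP BY batter_id, weather_condition_id
SQL

The pitcher and batter splits would be two more statements of the same shape. If you would rather not write raw SQL, a gem such as activerecord-import can handle the bulk inserts from Ruby.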


You essentially need to store the data in such a way that you can drop the oldest day and add the newest day, so that you never have to recompute the whole total.

One way to do this is to keep the running total, subtract the value of the day that just fell out of the window, add the value of the new day, and then divide by the window length (15/30/90/365) to get the average.

This turns 366 operations into 3. Are a couple of database reads really slower than the 363 operations they replace?

It also saves you the iteration over every at_bat, so all you have to do each day is check which weather conditions need to be updated.
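
A sketch of that rolling update for a single batting_stat row and the 365-day window might look like this; the daily_totals helper and the created_at column on at_bats are assumptions, since the idea above is only described in words.

# Assumed helper: one batter's totals against one recordable for a single day.
def daily_totals(batter_id, recordable, date)
  scope = AtBat.where(:batter_id => batter_id)
               .where("created_at >= ? AND created_at < ?", date, date + 1) # one calendar day
  scope = scope.where(:weather_condition_id => recordable.id) if recordable.is_a?(WeatherCondition)
  scope = scope.where(:pitcher_id => recordable.id)           if recordable.is_a?(Pitcher)
  # (a Batter recordable means the overall split, so no extra filter)
  { :hits => scope.where(:hit => true).count, :at_bats => scope.count }
end

# Drop the day that fell out of the window, add the new day, re-derive the average.
# `today` is a Date.
def roll_365_day_window!(stat, today)
  oldest = daily_totals(stat.batter_id, stat.recordable, today - 365)
  newest = daily_totals(stat.batter_id, stat.recordable, today)

  stat.hits365    += newest[:hits]    - oldest[:hits]
  stat.at_bats365 += newest[:at_bats] - oldest[:at_bats]
  stat.outs365     = stat.at_bats365 - stat.hits365
  stat.batting_avg365 = stat.at_bats365.zero? ? 0.0 : stat.hits365.to_f / stat.at_bats365
  stat.save!
end

Run once per day per stat row, that is two small reads and one write instead of re-aggregating the whole window.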


We had a similar problem, periodically loading 600,000 US rental records every week. Processing each record sequentially would have taken more than 24 hours. But the database did not have to be the bottleneck: although each insert took a fixed amount of time, the database was never maxed out / pegged / flatlined in activity.

I knew that splitting the file into individual records was simple and fast. In our case the input file was XML, and I used a simple Java StringTokenizer to split it on the tags ...

This quickly gave me a large array of XML fragments containing rental property information that I needed to parse and import.

Then I used the standard Java ThreadPoolExecutor / FutureTask / Callable machinery to create a pool of 20 threads, each of which takes an XML fragment as input, extracts the relevant data, and performs the database inserts. I don't know what the equivalent would be in your stack, but I imagine something similar exists.

In the end I was able to tune the thread-pool size to maximize write throughput, watching the load on the database server under different test conditions. We settled on a pool size of 25.
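
A rough Ruby equivalent of that setup is a plain thread pool fed from a queue; in this sketch, `records` and `import_record` are placeholders for however you split and insert your data.

require "thread"

POOL_SIZE = 20
queue = Queue.new
records.each { |record| queue << record }  # `records` = the pre-split fragments
POOL_SIZE.times { queue << :done }         # one stop marker per worker

workers = POOL_SIZE.times.map do
  Thread.new do
    # Each thread needs its own connection when using ActiveRecord.
    ActiveRecord::Base.connection_pool.with_connection do
      while (record = queue.pop) != :done
        import_record(record)              # placeholder for the actual insert
      end
    end
  end
end
workers.each(&:join)

Because the inserts are I/O bound, threads help even under MRI's GVL; the pool size is then something to tune against the database load, as described above.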


When I've had to do this kind of work before, I pulled out my SQL reference and brushed up on how to write complex updates. You can usually do quite a large update in a short time with a good query. You may also be able to get direct help with the query itself (post your schema and sample queries in a gist if they are really big).

Recently I had to fix up a counter_cache column, and instead of doing it with a bunch of Ruby code that loaded each parent and counted its children, I ran this query:

 UPDATE rates r
 SET children_count = child_counts.my_count
 FROM (
   SELECT parent_id, count(*) AS my_count
   FROM rates
   GROUP BY parent_id
   HAVING parent_id IS NOT NULL
 ) AS child_counts
 WHERE child_counts.parent_id = r.id;

which updated 200k rows in just a few seconds.

If you cannot do this in one query, and if it is a one-off operation, you can split the process into two steps: first do the heavy lifting and save the results into a new table, then read from that table and do the final update. Recently I had to do a massive data-crunching job where the heavy lifting alone took 2 days of processing and calculation. The results went into a new table holding the relevant row id and the computed total. In production I then just had a quick script that read from this new table and updated the related rows. This also let me stop and restart from where I left off, and sanity-check the results before touching production. On top of that, it made the production update itself very fast.

Along the way I also learned that it is important to work in batches where possible, and to commit transactions as often as is safe, so that you never hold one huge transaction open for too long.
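
Putting those last two points together, the second step might look something like this sketch; the precomputed_totals table and its columns are hypothetical stand-ins for whatever your heavy-lifting step produces.

# Hypothetical staging table written by the heavy-lifting step.
class PrecomputedTotal < ActiveRecord::Base
  # assumed columns: batting_stat_id, hits365, at_bats365, batting_avg365
end

# Apply the precomputed values in batches, one transaction per batch.
PrecomputedTotal.find_in_batches(:batch_size => 1000) do |batch|
  BattingStat.transaction do
    batch.each do |row|
      BattingStat.where(:id => row.batting_stat_id).update_all(
        :hits365        => row.hits365,
        :at_bats365     => row.at_bats365,
        :batting_avg365 => row.batting_avg365
      )
    end
  end
end

Each batch commits on its own, so a failure part-way through loses at most one batch and you can restart from where you left off.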


Source: https://habr.com/ru/post/901580/

