How should you populate a new table in Rails?

I am creating a new table that needs to be populated with data based on user accounts (tens of thousands of them), using the following one-time rake task.

What I decided to do was build one large INSERT statement for every 2,000 users and execute it.

Here's what it looks like:

    task :backfill_my_new_table => :environment do
      inserts = []

      User.find_each do |user|
        # Placeholder: form the tuple based on user and user associations
        tuple = "(1, 'foo', 'bar', NULL)"
        inserts << tuple
      end

      # At this point, the inserts array is of size at least 20,000
      conn = ActiveRecord::Base.connection

      inserts.each_slice(2000) do |slice|
        # Join the current slice, not the whole inserts array
        sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
        conn.execute(sql)
      end
    end

So I'm wondering: is there a better way to do this? What are the disadvantages of the approach I took, and how can I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with more than 20,000 VALUES tuples? What would be the disadvantages of that?

Thanks!

1 answer

It depends somewhat on which version of PG you are using, but for most cases of bulk loading data into a table the following checklist is sufficient:

  • try to use COPY instead of INSERT whenever possible;
  • when using multiple INSERTs, disable autocommit and wrap all INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
  • disable (or drop and later recreate) indexes and checks/constraints on the target table;
  • disable table triggers;
  • alter the table so that it becomes unlogged (possible since PG 9.5; do not forget to turn logging back on after the import), or increase max_wal_size so that the WAL is not flooded.
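As a rough illustration of the first item, here is a minimal sketch of streaming rows with COPY from Ruby. It assumes the pg gem, whose PG::Connection provides copy_data and put_copy_data; in a Rails app such a connection is normally available as ActiveRecord::Base.connection.raw_connection. The helper names csv_line and copy_rows are hypothetical, and the table and column names are taken from the question.

```ruby
# Sketch only: streams rows into PostgreSQL with COPY ... FROM STDIN.
# `raw_conn` is assumed to be a PG::Connection (pg gem), e.g.
# ActiveRecord::Base.connection.raw_connection in a Rails app.

# Encode one row as a CSV line. nil becomes an unquoted empty field, which
# COPY's CSV format reads as NULL; every other value is quoted, with any
# embedded double quotes doubled.
def csv_line(row)
  fields = row.map do |value|
    value.nil? ? "" : '"' + value.to_s.gsub('"', '""') + '"'
  end
  fields.join(",") + "\n"
end

# Send every row through a single COPY statement.
# `rows` is any Enumerable of arrays whose order matches `columns`.
def copy_rows(raw_conn, table, columns, rows)
  sql = "COPY #{table} (#{columns.join(', ')}) FROM STDIN WITH (FORMAT csv)"
  raw_conn.copy_data(sql) do
    rows.each { |row| raw_conn.put_copy_data(csv_line(row)) }
  end
end

# Hypothetical usage inside the rake task:
#   rows = [[1, "foo", "bar", nil]]
#   copy_rows(ActiveRecord::Base.connection.raw_connection,
#             "my_new_table", %w[ref_id column_a column_b column_c], rows)
```

Because the rows are written as they are enumerated, this variant does not need to hold one large SQL string in memory.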

20 thousand rows is not a big deal for PG, so inserts sliced into 2,000-row batches within a single transaction will be just fine, unless some very complex triggers or checks are involved. It is also worth reading the section of the PG manual on bulk loading.
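A minimal sketch of the sliced, single-transaction variant might look like the following. It assumes that conn behaves like ActiveRecord::Base.connection, that is, it responds to quote, transaction and execute; the helper names insert_sql and backfill are hypothetical. Each statement is built from its own slice, and values are quoted by the adapter rather than interpolated by hand.

```ruby
# Sketch only: `conn` is assumed to behave like ActiveRecord::Base.connection
# (it must respond to #quote, #transaction and #execute).

# Build one multi-row INSERT for a slice of rows.
# Each row is an array of Ruby values in the same order as `columns`.
def insert_sql(conn, table, columns, rows)
  tuples = rows.map do |row|
    "(" + row.map { |value| conn.quote(value) }.join(", ") + ")"
  end
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
end

# Run all sliced INSERTs inside a single transaction, so a failure in any
# slice rolls back the whole backfill.
def backfill(conn, table, columns, rows, slice_size: 2000)
  conn.transaction do
    rows.each_slice(slice_size) do |slice|
      conn.execute(insert_sql(conn, table, columns, slice))
    end
  end
end

# Hypothetical usage inside the rake task:
#   backfill(ActiveRecord::Base.connection, "my_new_table",
#            %w[ref_id column_a column_b column_c], rows)
```

Since rows only needs to support each_slice, an enumerator that yields one row per user could be passed instead of a fully built array.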

UPD: a somewhat old, yet wonderful piece from depesz; an excerpt:

"So, if you want to insert data as fast as possible, use copy (or better yet, pgbulkload). If for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). Then, if you can, bundle them in transactions and use prepared transactions, but generally they don't give you much."


Source: https://habr.com/ru/post/1243972/

