How to make big non-blocking updates in PostgreSQL?

I want to do a big update on a table in PostgreSQL, but I don't need transaction integrity to be maintained across the whole operation, because I know that the column I'm changing will not be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.

For example, let's say I have a table called “orders” with 35 million rows, and I want to do this:

UPDATE orders SET status = null; 

To avoid being redirected to an off-topic discussion, assume that all status values for the 35 million rows are currently set to the same (non-null) value, which renders an index useless.

The problem with this statement is that it takes a long time to apply, and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like

 UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000); 

might take 1 minute. Over 35 million rows, doing the above and breaking the update into 35 chunks would take only 35 minutes, saving me 4 hours and 25 minutes.

I could break it down even further with a script (using pseudocode here):

 for (i = 0 to 3500) {
     db_operation("UPDATE orders SET status = null " +
                  "WHERE order_id > " + (i * 1000) +
                  " AND order_id < " + ((i + 1) * 1000));
 }

This operation can complete in just a few minutes, not 35.

So, to get to what I'm really asking: I don't want to write a freaking script to break up the operation every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely in SQL?

+47
sql-update plpgsql postgresql transactions dblink
Jul 11 '09 at 8:46
8 answers

Column / Row

... I don't need transaction integrity to be maintained across the whole operation, because I know that the column I'm changing will not be written to or read during the update.

Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row . If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing that the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
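You can watch this row versioning happen, if you like. A small illustration (order_id = 1 is just an arbitrary sample row): the system columns xmin and ctid change with every UPDATE because a whole new physical row version is written.

 SELECT xmin, ctid FROM orders WHERE order_id = 1;
 UPDATE orders SET status = NULL WHERE order_id = 1;
 SELECT xmin, ctid FROM orders WHERE order_id = 1;  -- both values change: a new row version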

Index

To avoid being redirected to an off-topic discussion, assume that all status values for the 35 million rows are currently set to the same (non-null) value, which renders an index useless.

When updating the whole table (or major parts of it), Postgres never uses an index . A sequential scan is faster when all or most rows have to be read. On the contrary: index maintenance means additional cost for the UPDATE .

Performance

For example, let's say I have a table called “orders” with 35 million rows, and I want to do this:

 UPDATE orders SET status = null; 

I understand you are aiming for a more general solution (see below). But to address the question as asked : this can be dealt with in a matter of milliseconds , regardless of table size:

 ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;

In the documentation:

When a column is added with ADD COLUMN , all existing rows in the table are initialized with the column's default value ( NULL if no DEFAULT clause is specified). If there is no DEFAULT clause, this is merely a metadata change ...

and

The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick, but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated. (These statements do not apply when dropping the system oid column; that is done with an immediate rewrite.)

Make sure you have no objects depending on the column (foreign key constraints, indexes, views, ...). You would need to drop / recreate those. Barring that, a couple of tiny operations on the system catalog pg_attribute do the job. An exclusive lock on the table is required, which may be a problem with heavy concurrent load. Since it only takes a few milliseconds, you should still be fine.

If you have a column default you want to keep, add it back in a separate command. Doing it in the same command would apply it to all rows immediately, voiding the effect. You could then update the existing rows in batches . Follow the documentation link above and read the Notes in the manual.
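A minimal sketch of that order of commands, assuming (hypothetically) that a default of 'new' is wanted; the second command is a pure metadata change and leaves existing rows NULL:

 ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;
 ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';  -- only affects rows inserted later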

General solution

dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": whatever the function writes in the "remote" db is committed and can't be rolled back.

This allows a single function to update a big table in smaller parts, with each part committed separately. It avoids escalating transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.

If you don't have concurrent access, this is hardly useful, except to avoid a ROLLBACK after an exception. Also consider SAVEPOINT for that case.
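A minimal SAVEPOINT sketch for that case (the batch boundaries here are arbitrary): a failed chunk can be rolled back to the last savepoint without losing the work before it.

 BEGIN;
 UPDATE orders SET status = NULL WHERE order_id < 1000000;
 SAVEPOINT chunk_1;                 -- work up to here survives a partial rollback
 UPDATE orders SET status = NULL WHERE order_id >= 1000000 AND order_id < 2000000;
 -- on an error in the second chunk: ROLLBACK TO SAVEPOINT chunk_1;
 COMMIT;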

Disclaimer

First of all, lots of small transactions are actually more expensive. This only makes sense for big tables . The sweet spot depends on many factors.

If you are not sure what you are doing: a single transaction is the safe method . For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row into a partition that is supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.

Step by step instructions

First you need to install the additional module dblink :

  • How to use (install) dblink in PostgreSQL?

Setting up a connection with dblink very much depends on the setup of your DB cluster and the security policies in place. It can be tricky. A related, later answer explains more on how to connect with dblink .

Create a FOREIGN SERVER and USER MAPPING as described there to simplify and streamline the connection (unless you have those already).
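For example (a sketch only; the server name, host, database and credentials are placeholders you must adapt):

 CREATE EXTENSION IF NOT EXISTS dblink;

 CREATE SERVER myserver FOREIGN DATA WRAPPER dblink_fdw
 OPTIONS (host 'localhost', dbname 'mydb', port '5432');

 CREATE USER MAPPING FOR CURRENT_USER SERVER myserver
 OPTIONS (user 'myuser', password 'mypassword');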
Assuming a serial PRIMARY KEY, with or without some gaps.

 CREATE OR REPLACE FUNCTION f_update_in_steps()
   RETURNS void AS
 $func$
 DECLARE
    _step int;  -- size of step
    _cur  int;  -- current ID (starting with minimum)
    _max  int;  -- maximum ID
 BEGIN
    SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;
                                         -- 100 slices (steps) hard coded
    _step := ((_max - _cur) / 100) + 1;  -- rounded, possibly a bit too small
                                         -- +1 to avoid endless loop for 0
    PERFORM dblink_connect('myserver');  -- your foreign server as instructed above

    FOR i IN 0..200 LOOP                 -- 200 >> 100 to make sure we exceed _max
       PERFORM dblink_exec(
          $$UPDATE public.orders
            SET    status = 'foo'
            WHERE  order_id >= $$ || _cur || $$
            AND    order_id <  $$ || _cur + _step || $$
            AND    status IS DISTINCT FROM 'foo'$$);  -- avoid empty update

       _cur := _cur + _step;
       EXIT WHEN _cur > _max;            -- stop when done (never loop till 200)
    END LOOP;

    PERFORM dblink_disconnect();
 END
 $func$ LANGUAGE plpgsql;

Call:

 SELECT f_update_in_steps(); 

You can parameterize any part according to your needs: table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:

  • Table name as parameter of PostgreSQL function
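For instance, the dblink_exec() call inside the function above could build its statement with format(), which quotes identifiers with %I and literals with %L (a sketch; 'orders', 'status' and 'foo' stand in for your parameters):

 PERFORM dblink_exec(
    format('UPDATE %I SET %I = %L WHERE order_id >= %s AND order_id < %s',
           'orders', 'status', 'foo', _cur, _cur + _step));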

To avoid an empty UPDATE:

  • How do I (or can I) SELECT DISTINCT on multiple columns?
+24
Mar 04 '14 at 5:27

First of all, are you sure you need to update all the rows?

Perhaps some of the rows already have a NULL status?

If yes, then:

 UPDATE orders SET status = null WHERE status is not null; 

As for splitting the change into batches: that is not possible in pure SQL, since all the updates would run in a single transaction.

One possible way to do it in "pure SQL" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but that seems like overkill for such a simple task.

Usually just adding a proper WHERE solves the problem. If it doesn't, just partition it manually. Writing a full script is too much; you can usually do it with a simple one-liner:

 perl -e '
     for (my $i = 0; $i <= 3500000; $i += 1000) {
         printf "UPDATE orders SET status = null WHERE status is not null and order_id between %u and %u;\n", $i, $i + 999;
     }
 '

I wrapped the lines here for readability; generally it's a single line. The output of the above command can be piped to psql directly:

 perl -e '...' | psql -U ... -d ... 

Or write it to a file first and then feed it to psql (in case you need the file later):

 perl -e '...' > updates.partitioned.sql
 psql -U ... -d ... -f updates.partitioned.sql
+3
Jul 11 '09 at 10:24

You could extract this column into a separate table, like this:

 create table order_status (
     order_id int not null references orders(order_id) primary key,
     status int not null
 );

Then your operation of setting status = NULL will be instantaneous:

 truncate order_status; 
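The trade-off, of course, is that reads now need a join (or a second lookup) to see the status; a sketch:

 SELECT o.order_id, s.status
 FROM orders o
 LEFT JOIN order_status s USING (order_id);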
+3
Jul 14 '09 at 11:50

I would use CTAS (CREATE TABLE AS SELECT):

 begin;
 create table T as
     select col1, col2, ..., <new value>, colN
     from orders;
 drop table orders;
 alter table T rename to orders;
 commit;
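Filled in for the question's example (a sketch; customer_id stands in for whatever other columns the table has, and text for the status type is an assumption):

 begin;
 create table orders_new as
     select order_id, customer_id, null::text as status  -- placeholder columns
     from orders;
 drop table orders;
 alter table orders_new rename to orders;
 commit;

Note that CREATE TABLE AS copies no indexes, constraints or privileges; you have to recreate those afterwards.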
+3
Aug 18 '11 at 11:12

Postgres uses MVCC (multi-version concurrency control), so it avoids any locking if you are the only writer; any number of concurrent readers can work on the table, and there will be no locking.

So if it really takes 5 hours, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).

+2
Jul 11 '09 at 9:17

I am by no means a database administrator, but a database design where you frequently have to update 35 million rows might have ... problems.

A simple WHERE status IS NOT NULL might speed things up quite a bit (provided you have an index on status). Not knowing the actual use case, I'm assuming that if this is run frequently, a great part of the 35 million rows might already have a null status.
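If most rows are indeed already NULL, a partial index could keep such repeated runs cheap (a sketch, assuming the status column from the question; the index name is made up):

 CREATE INDEX orders_status_notnull_idx ON orders (order_id)
 WHERE status IS NOT NULL;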

However, you can create loops within a function via the LOOP statement. Here is a small example:

 CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
 DECLARE
     i INTEGER := 0;
 BEGIN
     FOR i IN 0..(count/1000 + 1) LOOP
         UPDATE orders SET status = null
         WHERE (order_id > (i*1000) AND order_id < ((i+1)*1000));
         RAISE NOTICE 'Count: % and i: %', count, i;
     END LOOP;
     RETURN 1;
 END;
 $$ LANGUAGE plpgsql;

It can then be run with something like:

 SELECT nullstatus(35000000); 

You might want to select the row count yourself, but beware that an exact row count can take a long time to obtain. The PostgreSQL wiki has an article about slow counting and how to avoid it .
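One trick from that article: take the planner's estimate instead of an exact count (reltuples is maintained by VACUUM and ANALYZE, so it may be slightly stale):

 SELECT reltuples::bigint AS estimated_rows
 FROM pg_class
 WHERE relname = 'orders';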

Also, the RAISE NOTICE part is just there to track how far along the script is. If you are not monitoring the notices, or do not care, it would be better to leave it out.

+2
Jul 11 '09 at 9:25

Are you sure this is because of locking? I don't think so, and there are many other possible reasons. To find out, you can always try doing just the locking. Try this:

 BEGIN;
 SELECT NOW();
 SELECT * FROM orders FOR UPDATE;
 SELECT NOW();
 ROLLBACK;

To understand what is really happening, you should first run EXPLAIN (EXPLAIN UPDATE orders SET status ...) and/or EXPLAIN ANALYZE. You may find out that you do not have enough memory to run the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
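For example (the work_mem value is just a guess to adapt; note that EXPLAIN ANALYZE really executes the UPDATE, hence the surrounding transaction):

 BEGIN;
 EXPLAIN ANALYZE UPDATE orders SET status = null WHERE order_id < 1000;
 ROLLBACK;  -- discard the test update

 SET work_mem TO '256MB';  -- session-local setting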

Also, check the PostgreSQL log to see whether any performance-related problems are reported.

+2
Jul 14 '09 at 21:07

Some options that were not mentioned:

Use the "new table" trick. Probably what you would have to do in your case is write some triggers to handle it, so that changes to the original table also propagate to your table copy, something like that ... (percona is an example of something that does it the trigger way). Another option might be the "create a new column and then replace the old one with it" trick, to avoid locks (unclear if it helps with speed).

Possibly compute the max ID, then generate "all the queries you need" and pass them in as a single batch, like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID >= 10000; ... then it might not do as much locking, and still be all SQL, though you do have extra logic up front :(
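A sketch of that idea in SQL itself, using generate_series() and psql's \gexec (available since 9.6), which executes every row of the result as a statement; the chunk size of 10000 is arbitrary:

 SELECT format('UPDATE orders SET status = NULL WHERE order_id >= %s AND order_id < %s',
               g, g + 10000)
 FROM generate_series(0, (SELECT max(order_id) FROM orders), 10000) AS g;
 -- in psql, end the query with \gexec instead of ; to run each generated UPDATE

Outside an explicit transaction block, each generated UPDATE then commits on its own, which is exactly the batching behavior this thread is after.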

0
Nov 23 '17 at 20:07


