Column / Row
... I do not need transaction integrity for the whole operation, because I know that the column I am changing will not be written to or read during the update.
Any UPDATE in the PostgreSQL MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing that the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
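You can observe those new row versions directly: the system columns ctid (physical location) and xmin (inserting transaction) change with every UPDATE. A minimal sketch, using a hypothetical temp table:

CREATE TEMP TABLE t (id int, val text);
INSERT INTO t VALUES (1, 'a');
SELECT ctid, xmin, * FROM t;  -- note physical location and creating transaction
UPDATE t SET val = 'b' WHERE id = 1;
SELECT ctid, xmin, * FROM t;  -- both changed: a whole new row version was written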
Index
To avoid being sidetracked into an off-topic discussion, let's assume that all status values for the 35 million rows are currently set to the same (non-null) value, which makes an index useless.
When updating an entire table (or major parts of it), Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: index maintenance means additional cost for the UPDATE.
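You can verify the chosen plan with EXPLAIN. A hedged sketch; the exact plan shape and costs depend on your installation:

EXPLAIN UPDATE orders SET status = null;
-- Update on orders  (cost=...)
--   ->  Seq Scan on orders  (cost=...)
-- a Seq Scan is planned even if an index on status exists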
Performance
For example, let's say I have a table called “orders” with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the question asked: this can be dealt with in a matter of milliseconds, regardless of table size:
ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;
In the documentation:
When a column is added with ADD COLUMN, all existing rows in the table are initialized with the column's default value (NULL if no DEFAULT clause is specified). If there is no DEFAULT clause, this is merely a metadata change ...
and
The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick, but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated. (These statements do not apply when dropping the system oid column; that is done with an immediate rewrite.)
Make sure you have no objects depending on the column (foreign key constraints, indexes, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the pg_attribute system catalog do the job. An exclusive lock on the table is required, which may be a problem under heavy concurrent load. Since it only takes a few milliseconds, you should still be fine.
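A hedged sketch of the drop / recreate dance, assuming a hypothetical view and index on the column:

-- a view referencing the column blocks DROP COLUMN; indexes on the column
-- are dropped automatically with it, but have to be recreated afterwards
DROP VIEW IF EXISTS orders_open;  -- hypothetical view on status

ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;

CREATE INDEX orders_status_idx ON orders (status);  -- hypothetical index
CREATE VIEW orders_open AS SELECT * FROM orders WHERE status IS NULL;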
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command would apply it to all rows immediately, voiding the effect. You could then update the existing rows in batches. Follow the documentation link above and read the Notes in the manual.
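A sketch of that sequence, assuming a hypothetical default of 'pending':

ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;  -- all rows: NULL, metadata only
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';   -- only affects future rows
-- then backfill existing rows in batches, e.g.:
UPDATE orders SET status = 'pending' WHERE order_id >= 1 AND order_id < 1000000;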
General solution
dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and cannot be rolled back.
This allows running a single function that updates a big table in smaller parts, with each part committed separately. It avoids escalating transaction overhead for very large numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
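A minimal sketch of the SAVEPOINT alternative; the chunk boundaries are hypothetical:

BEGIN;
UPDATE orders SET status = 'foo' WHERE order_id < 1000000;
SAVEPOINT chunk1;
UPDATE orders SET status = 'foo' WHERE order_id >= 1000000 AND order_id < 2000000;
-- after an error: ROLLBACK TO SAVEPOINT chunk1;  -- keeps the first chunk's work
COMMIT;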
Disclaimer
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a slice that is supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
Step-by-step instructions
First you need to install the additional module dblink:
- How to use (install) dblink in PostgreSQL?
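In modern Postgres (9.1 or later) that typically boils down to one command per database:

CREATE EXTENSION dblink;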
Setting up the connection with dblink very much depends on the setup of your database cluster and the security policies in place. It can be tricky. A related, later answer with more on how to connect with dblink:
Create a FOREIGN SERVER and USER MAPPING as described there to simplify and streamline the connection (unless you have them already).
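A hedged sketch with hypothetical connection values (host, dbname, role, password); adjust to your cluster:

CREATE SERVER myserver FOREIGN DATA WRAPPER dblink_fdw
OPTIONS (host 'localhost', dbname 'mydb');

CREATE USER MAPPING FOR CURRENT_USER SERVER myserver
OPTIONS (user 'myuser', password 'mypass');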
Assuming a serial PRIMARY KEY with or without some gaps.
CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;  -- size of step
   _cur  int;  -- current ID (starting with minimum)
   _max  int;  -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;
                                        -- 100 slices (steps) hard-coded
   _step := ((_max - _cur) / 100) + 1;  -- rounded, possibly a bit too small
                                        -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('myserver');  -- your foreign server as instructed above

   FOR i IN 0..200 LOOP                 -- 200 >> 100 to make sure we exceed _max
      PERFORM dblink_exec(
         $$UPDATE public.orders
           SET    status = 'foo'
           WHERE  order_id >= $$ || _cur || $$
           AND    order_id <  $$ || _cur + _step || $$
           AND    status IS DISTINCT FROM 'foo'$$);  -- avoid empty updates
      _cur := _cur + _step;
      EXIT WHEN _cur > _max;            -- stop when done (never loop till 200)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: table name, column name, value, ... just be sure to escape identifiers to avoid SQL injection:
- Table name as parameter of PostgreSQL function
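A sketch of what the dynamic statement inside the function could look like, with hypothetical parameters _tbl, _col and _val, using format() to quote identifiers (%I) and literals (%L):

PERFORM dblink_exec(format(
   $$UPDATE public.%I
     SET    %I = %L
     WHERE  order_id >= %s
     AND    order_id <  %s
     AND    %I IS DISTINCT FROM %L$$
 , _tbl, _col, _val, _cur, _cur + _step, _col, _val));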
To avoid an empty UPDATE:
- How do I (or can I) SELECT DISTINCT on multiple columns?
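The predicate in the function above relies on IS DISTINCT FROM, which treats NULL as a comparable value, unlike plain <>:

SELECT NULL <> 'foo';                -- null: the WHERE clause would skip the row
SELECT NULL IS DISTINCT FROM 'foo';  -- true: the row gets updated as intended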