Best way to delete millions of rows by ID

I need to delete about 2 million rows from my PostgreSQL database. I have a list of the IDs that need to go, but every way I have tried so far takes days.

I tried putting the IDs in a table and deleting in batches of 100. After 4 days it is still running, having deleted only 297,268 rows. (I would select 100 IDs from the ID table, delete the rows WHERE id IN that list, then remove those 100 IDs from the ID table.)
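
Roughly, each batch looks like this (table and column names are simplified here; the real ones differ):

    -- grab the next 100 IDs to process
    CREATE TEMP TABLE batch AS
    SELECT id FROM ids LIMIT 100;

    -- delete the matching rows from the big table
    DELETE FROM tbl WHERE id IN (SELECT id FROM batch);

    -- drop the processed IDs from the ID table
    DELETE FROM ids WHERE id IN (SELECT id FROM batch);

    DROP TABLE batch;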

I tried:

DELETE FROM tbl WHERE id IN (select * from ids) 

That also runs forever. It is hard to tell how long it will take, since I cannot see any progress, but the query is still running after 2 days.

I am just looking for the most efficient way to delete rows from a table when I know exactly which IDs to delete, and there are millions of them.

+42
sql sql-delete postgresql postgresql-performance bigdata
Nov 28 '11 at 2:29
7 answers

It all depends ...

  • Delete all indexes (except the one on the ID that you need for the delete)

    Recreate them afterwards (= much faster than updating the indexes incrementally); a sketch of this housekeeping appears at the end of this answer

  • Check whether you have triggers that can safely be dropped or disabled temporarily

  • Do foreign keys reference your table? Can they be dropped? Dropped temporarily?

  • Depending on your autovacuum settings, it may help to run VACUUM ANALYZE before the operation.

  • If you are deleting large parts of the table and the rest fits into RAM, the fastest and easiest way is the following:

    SET temp_buffers = '1000MB';  -- or whatever you can spare temporarily

    CREATE TEMP TABLE tmp AS
    SELECT t.*
    FROM   tbl t
    LEFT   JOIN del_list d USING (id)
    WHERE  d.id IS NULL;    -- copy surviving rows into temporary table

    TRUNCATE tbl;           -- empty table - truncate is very fast for big tables

    INSERT INTO tbl
    SELECT * FROM tmp;      -- insert back surviving rows

This way you do not have to recreate views, foreign keys, or other dependent objects. Read about the temp_buffers setting in the manual. The method is only feasible as long as the table fits into memory, or at least most of it. Be aware that you can lose data if your server crashes in the middle of this operation. You can wrap all of it in a transaction to make it safer.

Also, as the manual points out:

TRUNCATE cannot be used on a table that has foreign-key references from other tables, unless all such tables are also truncated in the same command.

Run ANALYZE afterwards. Or VACUUM ANALYZE if you did not go the truncate route, or VACUUM FULL ANALYZE if you want to bring the table to its minimum size. For big tables, consider the alternatives CLUSTER / pg_repack:

  • Optimize Postgres query on timestamp range
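
For example, after a plain DELETE (tbl is the placeholder name used throughout this answer), the maintenance step is simply:

    VACUUM ANALYZE tbl;         -- mark dead rows reusable and refresh statistics
    -- or, to compact the table to its minimum size (takes an exclusive lock):
    -- VACUUM FULL ANALYZE tbl;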

For small tables, a simple DELETE instead of TRUNCATE is often faster:

 DELETE FROM tbl t USING del_list d WHERE t.id = d.id; 
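
The drop-and-recreate housekeeping from the bullet points above could look roughly like this; the index and trigger names are made up for illustration, and disabling triggers is only safe if your setup allows it:

    DROP INDEX tbl_foo_idx;                     -- any index not needed for the delete itself
    ALTER TABLE tbl DISABLE TRIGGER some_trg;   -- hypothetical trigger name

    DELETE FROM tbl t
    USING  del_list d
    WHERE  t.id = d.id;

    ALTER TABLE tbl ENABLE TRIGGER some_trg;
    CREATE INDEX tbl_foo_idx ON tbl (foo);      -- recreate in one pass afterwards
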
+50
Nov 28 '11

We know that the update/delete performance of PostgreSQL is not as strong as Oracle's. When we need to delete millions or tens of millions of rows, it is really difficult and time consuming.

However, we can still do this on production databases. Here is my idea:

First, we create a log table with two columns, id and flag (id is the identifier of the row you want to delete; flag can be Y or null, with Y meaning the record was deleted successfully).

Then we create a function that does the delete job in batches of 10,000 rows at a time. You can find more details on my blog. Although it is in Chinese, you can still get the information you need from the SQL code there.

Make sure the id column of both tables is indexed, as that will make it run faster.
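
A minimal sketch of the idea, assuming the big table is called tbl and the log table del_log (the actual names and details on the blog differ):

    CREATE TABLE del_log (
        id   bigint PRIMARY KEY,   -- ID of a row to delete from tbl
        flag char(1)               -- 'Y' once the row has been deleted
    );

    CREATE OR REPLACE FUNCTION delete_batch(batch_size int DEFAULT 10000)
    RETURNS integer AS $$
    DECLARE
        done integer;
    BEGIN
        -- take one batch of unprocessed IDs, delete them from tbl, and flag them
        WITH batch AS (
            SELECT id FROM del_log WHERE flag IS NULL LIMIT batch_size
        ),
        del AS (
            DELETE FROM tbl WHERE id IN (SELECT id FROM batch)
        )
        UPDATE del_log l SET flag = 'Y'
        FROM batch b
        WHERE l.id = b.id;

        GET DIAGNOSTICS done = ROW_COUNT;
        RETURN done;   -- 0 means there is nothing left to do
    END;
    $$ LANGUAGE plpgsql;

    -- call repeatedly (e.g. from a cron job or a loop) until it returns 0
    SELECT delete_batch(10000);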

+3
Nov 28 '11

You could try copying all of the data from the table except the IDs you want to delete into a new table, then renaming and swapping the tables (provided you have enough resources to do it).

This is not expert advice.
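
A rough sketch of that idea, assuming the table is tbl, the IDs to delete are in del_list, and nothing else writes to tbl during the swap (all names here are placeholders):

    BEGIN;

    CREATE TABLE tbl_new (LIKE tbl INCLUDING ALL);  -- columns, defaults, indexes, CHECK constraints

    INSERT INTO tbl_new
    SELECT t.*
    FROM   tbl t
    LEFT   JOIN del_list d USING (id)
    WHERE  d.id IS NULL;                            -- keep only the surviving rows

    ALTER TABLE tbl     RENAME TO tbl_old;
    ALTER TABLE tbl_new RENAME TO tbl;

    COMMIT;

    -- Note: foreign keys, views, and sequence ownership pointing at the old table
    -- are not carried over; recreate them, then DROP TABLE tbl_old once verified.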

+2
Nov 28 '11

The easiest way to do this is to drop all of your constraints first and then run the delete.
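
For example, for a single foreign key referencing the table (constraint, table, and column names here are hypothetical):

    ALTER TABLE child_tbl DROP CONSTRAINT child_tbl_tbl_id_fkey;

    DELETE FROM tbl WHERE id IN (SELECT id FROM ids);

    ALTER TABLE child_tbl
        ADD CONSTRAINT child_tbl_tbl_id_fkey
        FOREIGN KEY (tbl_id) REFERENCES tbl (id);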

+1
Nov 28 '11

Two possible answers:

  • There may be many constraints or triggers that fire when you delete a record from your table. They cost a lot of CPU cycles and checks against other tables.

  • You may need to include this statement in a transaction.
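
A minimal illustration of the second point (tbl and ids are the placeholder names from the question):

    BEGIN;
    DELETE FROM tbl WHERE id IN (SELECT id FROM ids);
    -- check the result (row counts, constraints) before making it permanent
    COMMIT;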

+1
Nov 28 '11 at 2:40

First, make sure you have an index on the ID column, both in the table you want to delete from and in the table holding the IDs to delete.
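
For instance, using the placeholder names from the question (tbl is the big table, ids holds the IDs to delete):

    CREATE INDEX tbl_id_idx ON tbl (id);   -- skip if id is already the primary key
    CREATE INDEX ids_id_idx ON ids (id);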

100 at a time seems too small. Try 1000 or 10000.

There is no need to delete anything from the ID table. Add a new column for the batch number and fill it with 1000 IDs for batch 1, 1000 for batch 2, and so on, then make sure the delete query includes the batch number.
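
One possible way to set that up (again using the hypothetical tbl / ids names, with batches of 10,000):

    ALTER TABLE ids ADD COLUMN batch integer;

    -- assign consecutive batch numbers: 0, 1, 2, ...
    UPDATE ids i
    SET    batch = (sub.rn - 1) / 10000
    FROM  (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM ids) sub
    WHERE  i.id = sub.id;

    CREATE INDEX ids_batch_idx ON ids (batch);

    -- then delete one batch at a time:
    DELETE FROM tbl t
    USING  ids i
    WHERE  t.id = i.id
    AND    i.batch = 0;   -- repeat with 1, 2, ... for the remaining batches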

+1
Nov 28 '11

If the table you are deleting from is referenced by some_other_table (and you do not want to drop the foreign keys, even temporarily), make sure you have an index on the referencing column in some_other_table!

I had a similar problem and used auto_explain with auto_explain.log_nested_statements = true, which revealed that the delete was actually doing seq_scans on some_other_table:

    Query Text: SELECT 1 FROM ONLY "public"."some_other_table" x
                WHERE $1 OPERATOR(pg_catalog.=) "id" FOR KEY SHARE OF x
    LockRows  (cost=[...])
      ->  Seq Scan on some_other_table x  (cost=[...])
            Filter: ($1 = id)

Apparently it is trying to lock the referencing rows in the other table (which should not exist, or the delete would fail). After I created indexes on the referencing tables, the delete was an order of magnitude faster.
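
Concretely, something like this (the column name tbl_id is just an example of a referencing column):

    -- in the current session (or via postgresql.conf):
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;
    SET auto_explain.log_nested_statements = true;

    -- index the referencing column so the FK check can use an index scan
    CREATE INDEX some_other_table_tbl_id_idx ON some_other_table (tbl_id);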

0
Nov 10 '17 at 5:53


