PostgreSQL removes everything except the oldest entries

Question

I have a PostgreSQL database that has several records per objectid, on several devicenames, but there is a unique timestamp for each record. The table looks something like this:

 address | devicename | objectid      | timestamp
---------+------------+---------------+-------------------------------
 1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
 1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
 1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-03 15:37:09.06065+00
 1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-03 15:48:33.93128+00
 1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-05 16:01:59.266779+00
 1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-05 16:13:46.843113+00
 1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-06 01:11:45.853361+00
 1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-06 01:23:21.204324+00

I want to delete everything except the oldest entry for each objectid and devicename. In this case, I want to delete everything except:

 1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
 1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00

Is there any way to do this? Or can I select the oldest entries for each "objectid and devicename" into a temp table?

+4

sql duplicate-removal postgresql

dars33 Oct 10 '12 at 15:00

5 answers

This should do it:

 delete from devices
 using (
    select ctid as cid,
           row_number() over (partition by devicename, objectid order by timestamp asc) as rn
    from devices
 ) newest
 where newest.cid = devices.ctid
   and newest.rn <> 1;

It creates a derived table that assigns a row number within each (devicename, objectid) combination, giving the earliest row (the one with the lowest timestamp value) the number 1. This result is then used to delete all rows that do not have the number 1. The virtual column ctid is used to uniquely identify those rows (it is an internal identifier supplied by Postgres).
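Before running the delete, it may help to look at the numbering on its own. The following is a read-only sketch that assumes the table is called devices, as in the statement above; every row it returns with rn greater than 1 is one the delete would remove.

```sql
-- Preview only: nothing is deleted here.
-- Assumes the table is named "devices", as in the DELETE above.
SELECT ctid,
       devicename,
       objectid,
       "timestamp",
       row_number() OVER (PARTITION BY devicename, objectid
                          ORDER BY "timestamp") AS rn   -- 1 = oldest row in its group
FROM   devices
ORDER  BY devicename, objectid, rn;
```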

Note that for deleting a really large number of rows, Erwin's approach will certainly be faster.

SQLFiddle demo: http://www.sqlfiddle.com/#!1/5d9fe/2

+4

a_horse_with_no_name Oct 10 '12 at 18:19

 DELETE FROM devices d
 WHERE d.timestamp > (SELECT min(timestamp)
                      FROM devices
                      WHERE objectid = d.objectid
                        AND devicename = d.devicename);

0

Hola Soy Edu Feliz Navidad Oct 10 '12 at 15:09

This should work, assuming address, devicename and objectid together form a unique identifier:

 DELETE FROM tablename
 WHERE address || devicename || objectid || timestamp NOT IN (
    SELECT address || devicename || objectid || min(timestamp)
    FROM tablename
    GROUP BY address, devicename, objectid
 )

Here the identifying columns are concatenated into a single string so that rows can be matched against the subquery. The subquery finds the minimum timestamp for each combination, and every row that does not match one of those strings is deleted. Probably not the most efficient approach, but it should work.

0

jcern Oct 10 '12 at 15:30

My suggestion is to use a subquery that checks whether an entry with an older timestamp exists:

 DELETE FROM tablename
 WHERE EXISTS (
    SELECT *
    FROM tablename a
    WHERE tablename.address = a.address
      AND tablename.devicename = a.devicename
      AND tablename.objectid = a.objectid
      AND a.timestamp < tablename.timestamp
 )

The query to select the oldest entries will look like this:

 SELECT address, devicename, objectid, MIN(timestamp)
 FROM tablename
 GROUP BY address, devicename, objectid

0

Aleksandr Dezhin Oct 10 '12 at 15:42

Erwin Brandstetter · Accepted Answer · 2012-10-10T16:36:21+0000

To select the described result, this would probably be the simplest and fastest:

 SELECT DISTINCT ON (devicename, objectid) *
 FROM tbl
 ORDER BY devicename, objectid, ts;  -- ascending ts: keeps the oldest row per group

Details and explanation in this related answer.

From your sample data, I conclude that you are going to delete large portions of the original table. It is probably faster to just TRUNCATE the table (or DROP and recreate it, since you should add a surrogate pk column anyway) and write the remaining rows back to it. This also gives you a pristine table, implicitly clustered (ordered) the way that best suits your needs, and saves the work that VACUUM would otherwise have to do.

I would also strongly advise adding a surrogate primary key to your table, preferably a serial column. Even with that extra step, this is probably still faster overall:

 BEGIN;

 CREATE TEMP TABLE tmp_tbl ON COMMIT DROP AS
 SELECT DISTINCT ON (devicename, objectid) *
 FROM tbl
 ORDER BY devicename, objectid, ts;  -- ascending ts: keeps the oldest row per group

 TRUNCATE tbl;
 ALTER TABLE tbl ADD column tbl_id serial PRIMARY KEY;

 -- or, if you can afford to drop & recreate:
 -- DROP TABLE tbl;
 -- CREATE TABLE tbl (
 --   tbl_id serial PRIMARY KEY
 -- , address text
 -- , devicename text
 -- , objectid text
 -- , ts timestamp);

 INSERT INTO tbl (address, devicename, objectid, ts)
 SELECT address, devicename, objectid, ts
 FROM tmp_tbl;

 COMMIT;

Do it all within a transaction to make sure you don't fail halfway through.

This is fast as long as your setting for temp_buffers is big enough to hold the temporary table. Otherwise the system will start swapping data to disk and performance takes a dive. You can set temp_buffers just for the current session like this:

 SET temp_buffers = 1000MB;

This way you don't waste RAM that is normally not needed for temp_buffers. The setting must be changed before the first use of any temporary objects in the session. More info in this related answer.
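To judge whether a given temp_buffers value is plausible, one rough check is to compare the current setting with the size of the source table, which is an upper bound for the temporary copy because the copy holds only a subset of its rows. This is a sketch, not a tuning rule; tbl is the table name used in this answer.

```sql
-- Current session value of temp_buffers.
SHOW temp_buffers;

-- On-disk size of the main table data. The temporary table holds
-- only a subset of these rows, so it should not be larger than this.
SELECT pg_size_pretty(pg_relation_size('tbl'));
```

Run both before creating any temporary object in the session, so that temp_buffers can still be changed afterwards.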

Also, since the INSERT follows a TRUNCATE inside the same transaction, it will be easy on the Write Ahead Log, which improves performance.

Consider CREATE TABLE AS for an alternative route:

What causes large INSERT to slow down and disk usage to explode?
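A minimal sketch of what the CREATE TABLE AS route could look like, under some assumptions: tbl_new is a hypothetical name, the timestamp column is already called ts, and nothing else (views, foreign keys) depends on the old table. Indexes, constraints and privileges of the old table are not carried over and would have to be recreated.

```sql
BEGIN;

-- Build the reduced table directly; ascending ts keeps the oldest row per group.
CREATE TABLE tbl_new AS          -- tbl_new is a hypothetical name
SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts;

-- Swap the new table in place of the old one.
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

-- Add the surrogate primary key recommended above.
ALTER TABLE tbl ADD COLUMN tbl_id serial PRIMARY KEY;

COMMIT;
```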

The only drawback: you need an exclusive lock on the table. This can be a problem in databases with heavy concurrent load.

Finally, never use timestamp as a column name. It is a reserved word in every SQL standard and a type name in PostgreSQL. I renamed the column to ts, as you may have noticed.
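If the column already exists under that name, the rename is a single statement. The sketch below assumes the table is called devices, as in the other answers; the double quotes make sure the old name is read as an identifier.

```sql
-- Rename the column away from the type name; "devices" is an assumed table name.
ALTER TABLE devices RENAME COLUMN "timestamp" TO ts;
```

Any queries, views or application code that still reference the old column name need to be updated as well.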
