I'm not familiar with your DBMS, but whether or not it gets you better performance, you should use ANSI join syntax. Here is how it would look in T-SQL; adjust it for your system:
    UPDATE N
    SET N.[2ND_ID] = O.[2ND_ID]
    FROM NEW_TABLE AS N
    INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
    WHERE
        N.[2ND_ID] <> O.[2ND_ID]
        AND N.[3RD_ID] IS NOT NULL AND O.[3RD_ID] IS NULL
        AND N.CODE IS NOT NULL AND O.CODE IS NULL
Note that the additional conditions I dropped are not needed; for example, N.2ND_ID <> O.2ND_ID already guarantees that neither of those two columns is NULL.
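If it helps to see why, recall that a comparison involving NULL is never true, so a row with NULL in either 2ND_ID column can never satisfy the <> predicate anyway. A trivial illustration:

    -- A comparison with NULL yields UNKNOWN, which a WHERE clause treats as "not true",
    -- so the <> test already excludes rows where either 2ND_ID is NULL.
    SELECT CASE WHEN NULL <> 1 THEN 'would qualify' ELSE 'filtered out' END;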
However, on two 5-million-row tables you will probably still get terrible performance. Here are some ideas to speed it up; I bet you can get it under an hour with the right combination of strategies.
Divide the update into batches (small chunks, looping over the whole set). Although this sounds like it contradicts the usual database advice of "don't loop, use sets", it really doesn't: you are just using smaller sets, not looping at the row level. The best way to batch an update like this is to "walk the clustered index". I'm not sure whether that term makes sense in your DBMS, but essentially it means choosing the chunk you update in each iteration based on the order in which the rows will be found in the table object being updated. PK_ID sounds like a candidate column to use, but if the table's data is not physically ordered by that column, it gets more complicated. In T-SQL, a batching loop might look like this:
    DECLARE @ID int, @Count int
    SET @ID = 1
    SET @Count = 1
    WHILE @Count > 0 BEGIN
       UPDATE N
       SET N.[2ND_ID] = O.[2ND_ID]
       FROM NEW_TABLE AS N
       INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
       WHERE
          N.[2ND_ID] <> O.[2ND_ID]
          AND N.[3RD_ID] IS NOT NULL AND O.[3RD_ID] IS NULL
          AND N.CODE IS NOT NULL AND O.CODE IS NULL
          AND N.PK_ID BETWEEN @ID AND @ID + 4999
       SET @Count = @@ROWCOUNT -- loop ends as soon as a batch updates zero rows
       SET @ID = @ID + 5000
    END
This example assumes your PK_ID values are densely packed, so that each update really does hit about 5,000 rows. If that's not the case, switch to a method using TOP 5000: either output the updated PK_IDs to a table, or find the @StartID and @EndID for the next batch in one step and then run the update.
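A minimal sketch of the range-finding variant (assuming NEW_TABLE is indexed on PK_ID; the variable names and the 5,000-row batch size are just placeholders):

    DECLARE @StartID int, @EndID int
    SELECT @StartID = MIN(PK_ID) - 1 FROM NEW_TABLE

    WHILE 1 = 1 BEGIN
       -- Find the upper bound of the next 5000 PK_IDs, regardless of gaps.
       SELECT @EndID = MAX(PK_ID)
       FROM (
          SELECT TOP (5000) PK_ID
          FROM NEW_TABLE
          WHERE PK_ID > @StartID
          ORDER BY PK_ID
       ) AS Batch

       IF @EndID IS NULL BREAK   -- no rows left to process

       UPDATE N
       SET N.[2ND_ID] = O.[2ND_ID]
       FROM NEW_TABLE AS N
       INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
       WHERE
          N.[2ND_ID] <> O.[2ND_ID]
          AND N.[3RD_ID] IS NOT NULL AND O.[3RD_ID] IS NULL
          AND N.CODE IS NOT NULL AND O.CODE IS NULL
          AND N.PK_ID > @StartID AND N.PK_ID <= @EndID

       SET @StartID = @EndID
    END

This drops the assumption that PK_ID has no gaps, at the cost of one extra indexed range read per batch.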
In my experience, good batch sizes typically range from 1,000 to 20,000 rows. On MS SQL Server, the sweet spot tends to be just below the row count at which the plan switches from seeks to a scan (because at that point the engine decides one scan is cheaper than many seeks, which is often wrong when you're working with 5-million-row tables).
First, select the IDs and the data you want to update into a work/temp table, then join to it. The idea is to take the huge scan up front with a simple INSERT statement, then add indexes to the temp table and run the update without needing a complicated WHERE clause. Once the table contains only the rows to be updated and only the required columns, not only can the WHERE clause drop most of its conditions, but the temp table has far fewer rows and many more rows per page (since it has no extraneous columns), which greatly improves performance. This can even be done in stages: create a "shadow" of the new table, then a "shadow" of the old table, then join the two, and finally join the result back to the new table to update it. Although this sounds like a lot of work, I think you'll be surprised at the completely insane speed it can deliver.
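A rough sketch of the staged version (assuming PK_ID is unique in both tables and tempdb has room for the intermediates; the temp table and index names are made up for illustration):

    -- Stage 1: one simple scan of each table, keeping only candidate rows and needed columns.
    SELECT PK_ID, [2ND_ID]
    INTO #OldShadow
    FROM OLD_TABLE
    WHERE [3RD_ID] IS NULL AND CODE IS NULL

    SELECT PK_ID, [2ND_ID]
    INTO #NewShadow
    FROM NEW_TABLE
    WHERE [3RD_ID] IS NOT NULL AND CODE IS NOT NULL

    -- Stage 2: index the shadows so the join between them is cheap.
    CREATE UNIQUE CLUSTERED INDEX CX_OldShadow ON #OldShadow (PK_ID)
    CREATE UNIQUE CLUSTERED INDEX CX_NewShadow ON #NewShadow (PK_ID)

    -- Stage 3: join the shadows down to exactly the rows that need to change.
    SELECT NS.PK_ID, OS.[2ND_ID]
    INTO #ToUpdate
    FROM #NewShadow AS NS
    INNER JOIN #OldShadow AS OS ON NS.PK_ID = OS.PK_ID
    WHERE NS.[2ND_ID] <> OS.[2ND_ID]

    CREATE UNIQUE CLUSTERED INDEX CX_ToUpdate ON #ToUpdate (PK_ID)

    -- Stage 4: the final update needs almost no WHERE clause at all.
    UPDATE N
    SET N.[2ND_ID] = U.[2ND_ID]
    FROM NEW_TABLE AS N
    INNER JOIN #ToUpdate AS U ON N.PK_ID = U.PK_ID

If #ToUpdate is still large, the final UPDATE can be batched with the same kind of loop shown above.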
Anything you can do to turn the read of the old table into seeks instead of a scan will help. Anything you can do to reduce the amount of disk used for intermediate data (such as giant hash tables over 5 million rows) will help.
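For example, if OLD_TABLE has no index supporting the join, a covering index is one way to get seeks instead of a scan (the index name here is hypothetical, and for a one-off load you would drop it afterwards):

    -- Lets the old table be read by seek on PK_ID, with the other needed columns
    -- served from the index rather than the base table.
    CREATE NONCLUSTERED INDEX IX_OLD_TABLE_PK_ID ON OLD_TABLE (PK_ID)
    INCLUDE ([2ND_ID], [3RD_ID], CODE)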