Optimizing the WHERE clauses of an SQL UPDATE statement used on two ~5 million record tables

I am looking for any suggestions for optimizing the following PROC SQL statement from a SAS program. The two tables each contain about 5 million records, and the runtime is about 46 hours.

The statement is meant to update the "new" version of the "old" table: it records the old 2ND_ID where, in the "old" table, the 2ND_ID for a given PK_ID was specified without a value for 3RD_ID and CODE, but in the "new" table that same PK_ID is now specified with values for 3RD_ID and CODE.

Thanks for any suggestions ... (The code really is formatted below! For some reason my indentation spaces are not being displayed ...)

PROC SQL _METHOD;
    UPDATE NEW_TABLE AS N
        SET NEW_2ND_ID =
            (SELECT 2ND_ID
             FROM OLD_TABLE AS O
             WHERE N.PK_ID = O.PK_ID
               AND N.2ND_ID <> O.2ND_ID
               AND O.3RD_ID IS NULL
               AND O.CODE IS NULL
               AND N.3RD_ID IS NOT NULL
               AND N.CODE IS NOT NULL
               AND N.2ND_ID IS NOT NULL)
        WHERE N.3RD_ID IS NOT NULL
          AND N.PK_ID IS NOT NULL
          AND N.CODE IS NOT NULL
          AND N.2ND_ID IS NOT NULL;
QUIT;
+4
6 answers

All the answers so far are firmly oriented toward the SQL part of your question and somewhat neglect the SAS part. I highly recommend trying a data step UPDATE / MODIFY / MERGE instead of PROC SQL for an update like this. It should be possible to sort both tables and apply the same logic as in your SQL so that the correct rows/columns are updated.

I have seen updates like this finish in minutes on 20 million or more rows.
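
A minimal sketch of what the data step version could look like, assuming PK_ID is unique in both tables. The dataset and variable names are placeholders taken from the question (real SAS names cannot start with a digit), and the condition is my translation of the PROC SQL logic rather than a tested implementation:

 /* Sort both tables once by the join key. */
 proc sort data=old_table; by pk_id; run;
 proc sort data=new_table; by pk_id; run;

 /* Rebuild NEW_TABLE with a match-merge instead of a per-row subquery. */
 data new_table;
     merge new_table (in=in_new)
           old_table (in=in_old keep=pk_id 2nd_id 3rd_id code
                      rename=(2nd_id=old_2nd_id 3rd_id=old_3rd_id code=old_code));
     by pk_id;
     if in_new;                                        /* keep every row of the new table    */
     if in_old and 2nd_id ne old_2nd_id                /* 2ND_ID differs from the old table  */
        and missing(old_3rd_id) and missing(old_code)  /* ...it had no 3RD_ID/CODE back then */
        and not missing(3rd_id) and not missing(code)  /* ...but has them now                */
        and not missing(2nd_id)
        then new_2nd_id = old_2nd_id;                  /* carry the old 2ND_ID forward       */
     drop old_2nd_id old_3rd_id old_code;
 run;

Because both inputs are read sequentially after a single sort, this avoids running a correlated subquery once per row.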

Also, check out http://runsubmit.com , a StackOverflow-like site for SAS, for more SAS-specific answers.

Disclosure: I am a SAS employee. I have nothing to do with RunSubmit, which is run independently.

+3

I am not familiar with the SQL dialect you are using. However, whether or not it gets you better performance, you should use ANSI join syntax. Here is how it would look in T-SQL; adjust it for your system:

 UPDATE N
 SET N.2ND_ID = O.2ND_ID
 FROM NEW_TABLE AS N
 INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
 WHERE N.2ND_ID <> O.2ND_ID
   AND N.3RD_ID IS NOT NULL AND O.3RD_ID IS NULL
   AND N.CODE IS NOT NULL AND O.CODE IS NULL

Note that the additional conditions I removed are not needed; for example, N.2ND_ID <> O.2ND_ID already guarantees that neither of those two columns is null.

However, on two 5 million row tables you can still get terrible performance. Here are some ideas to speed it up; I bet you can get it under an hour with the right combination of strategies.

  • Divide the update into batches (small chunks, looping over the whole set). Although this sounds like it contradicts the usual "don't loop, use sets" database advice, it really doesn't: you are just using smaller sets rather than looping at the row level. The best way to batch an update like this is to "walk the clustered index". I am not sure whether that term makes sense in the DBMS you are using, but essentially it means choosing the chunk you update on each iteration based on the order in which the rows are stored in the table object being updated. PK_ID sounds like a candidate to use, but if the underlying table data is not sorted by that column, it becomes more complicated. In T-SQL, a batching loop might look like this:

     DECLARE @ID int, @Count int
     SET @ID = 1
     SET @Count = 1
     WHILE @Count > 0
     BEGIN
         UPDATE N
         SET N.2ND_ID = O.2ND_ID
         FROM NEW_TABLE AS N
         INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
         WHERE N.2ND_ID <> O.2ND_ID
           AND N.3RD_ID IS NOT NULL AND O.3RD_ID IS NULL
           AND N.CODE IS NOT NULL AND O.CODE IS NULL
           AND N.PK_ID BETWEEN @ID AND @ID + 4999
         SET @Count = @@RowCount
         SET @ID = @ID + 5000
     END

    This example assumes that your PK_ID column is densely packed, so that each update really hits about 5000 rows. If that is not the case, switch to a method using TOP 5000 and either output the updated PK_IDs to a table, or find the @StartID and @EndID for the next update in one extra step, then perform it.

    In my experience, good batch sizes are typically between 1,000 and 20,000 rows. In MS SQL Server, the sweet spot seems to be just below the number of rows that forces the plan to switch from a seek to a scan (because the db engine ultimately assumes that a single scan is cheaper than many seeks, which is often wrong when you are working with 5 million row tables).

  • First SELECT the ids and the data you want to update into a work/temporary table, then join against it. The idea is to take the hit of the huge scan once with a plain INSERT statement, then add indexes to the temp table and run the update without needing a complicated WHERE clause. Because the temp table holds only the rows to be updated and only the required columns, the WHERE clause not only loses most of its conditions, the temp table also has far fewer rows and many more rows per page (since it carries no extraneous columns), which greatly improves performance. This can even be done in stages: build a "shadow" of the new table, then a "shadow" of the old table, then join those two, and finally join the result back to the new table to update it. Although this sounds like a lot of work, I think you would be surprised at the absolutely insane speed of completion it can offer (a sketch in SAS terms follows after this list).

  • Anything you can do to turn the reads of the old table into seeks instead of scans will help. Anything you can do to reduce the disk usage for intermediate data (such as giant hash tables for 5 million rows) will help.
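
Since the question is in SAS, here is a minimal PROC SQL sketch of the staging idea from the second bullet, assuming PK_ID is unique in OLD_TABLE. The table and column names come from the question; the work table, its index, and the two-step update are illustrative assumptions rather than the answerer's exact T-SQL approach:

 proc sql _method;
     /* 1. One pass over both tables to capture only the keys and values
           that actually need to change. */
     create table work.to_fix as
         select N.PK_ID, O.2ND_ID as OLD_2ND_ID
         from NEW_TABLE as N inner join OLD_TABLE as O
             on N.PK_ID = O.PK_ID
         where N.2ND_ID <> O.2ND_ID
           and O.3RD_ID is null and O.CODE is null
           and N.3RD_ID is not null and N.CODE is not null
           and N.2ND_ID is not null;

     /* 2. Index the small staging table on the join key. */
     create index PK_ID on work.to_fix (PK_ID);

     /* 3. Update the big table from the small, indexed one; the WHERE
           clause is now just a key lookup. */
     update NEW_TABLE as N
         set NEW_2ND_ID = (select OLD_2ND_ID from work.to_fix as T
                           where T.PK_ID = N.PK_ID)
         where N.PK_ID in (select PK_ID from work.to_fix);
 quit;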

+3

Do not use UPDATE. Instead, create a similar new table and use an INSERT INTO ... (fields) SELECT ... that draws on both tables.

  • Drop the indexes before running the query.
  • Drop the triggers before running the query.

Something like:

 insert into NEW_TABLE (field1, field2, NEW_2ND_ID)
 select field1, field2,
        (SELECT 2ND_ID FROM OLD_TABLE ....)
 from NEW_TABLE
  • Restore the indexes after the query completes.
  • Restore the triggers after the query completes.

(In the end, you will replace your existing table with this new table)
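
In SAS terms, a hedged sketch of this "rebuild instead of update" flow might look like the following, again assuming PK_ID is unique in OLD_TABLE and reusing the question's names; the output table name, the COALESCE, and the final swap are illustrative assumptions:

 proc sql _method;
     create table work.new_table_v2 as
         select N.*,
                /* take the old 2ND_ID where the question's conditions hold,
                   otherwise keep whatever NEW_2ND_ID already contains */
                coalesce(O.2ND_ID, N.NEW_2ND_ID) as NEW_2ND_ID_FIXED
         from NEW_TABLE as N
              left join OLD_TABLE as O
                  on  N.PK_ID = O.PK_ID
                  and N.2ND_ID <> O.2ND_ID
                  and O.3RD_ID is null and O.CODE is null
                  and N.3RD_ID is not null and N.CODE is not null
                  and N.2ND_ID is not null;
 quit;
 /* ...verify work.new_table_v2, then drop NEW_TABLE and put this table
    (with rebuilt indexes) in its place. */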

+1

As written, this is a nested (correlated) subquery, so the SELECT statement will be run once for every record that matches the WHERE clause.

Instead, I would recommend switching to an UPDATE using a JOIN. An explanation can be found here: http://bytes.com/topic/oracle/answers/65819-sql-update-join-syntax .

Once you have the UPDATE using a JOIN in place, apply the appropriate indexes.

In addition, any indexes that include 2ND_ID should not be left in place; they should be disabled and then rebuilt after the update, since this is likely to be a massive data change.
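
In SAS, dropping and rebuilding such an index could be sketched like this, assuming NEW_TABLE has a simple index on 2ND_ID (the index name is a placeholder taken from the question's column name):

 proc sql;
     drop index 2ND_ID from NEW_TABLE;            /* disable before the mass update */
 quit;

 /* ... run the join-based update here ... */

 proc sql;
     create index 2ND_ID on NEW_TABLE (2ND_ID);   /* rebuild afterwards */
 quit;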

0
 UPDATE (
     SELECT O.2ND_ID AS O_2ND_ID, N.2ND_ID AS N_2ND_ID
     FROM OLD_TABLE AS O
     INNER JOIN NEW_TABLE AS N ON O.PK_ID = N.PK_ID
     WHERE N.2ND_ID <> O.2ND_ID
       AND O.3RD_ID IS NULL AND O.CODE IS NULL
       AND N.3RD_ID IS NOT NULL AND N.CODE IS NOT NULL AND N.2ND_ID IS NOT NULL
 ) t
 SET t.N_2ND_ID = t.O_2ND_ID
0

You could also try putting the new table on a different device than the old one, to take advantage of parallel I/O. If you can convince the DBA, of course.
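
In SAS this just means pointing the two tables at libraries that live on different physical devices; the paths below are purely hypothetical placeholders:

 libname olddsk '/disk1/sasdata';   /* hypothetical path on one device     */
 libname newdsk '/disk2/sasdata';   /* hypothetical path on another device */

 /* e.g. read from olddsk.old_table while writing/updating newdsk.new_table */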

0

Source: https://habr.com/ru/post/1299656/

