INSERT INTO <table> SELECT <natural keys>, <other stuff...> FROM <table> WHERE NOT EXISTS
- in the first INSERT there is a race condition. The key cannot exist during the internal SELECT query, but exists during INSERT, which leads to a violation of the key.
- there is a race condition between INSERT and UPDATE. A key can exist when it is checked in an internal INSERT query, but has gone by the time UPDATE ends.
For the second condition of the race, it can be argued that the key would have been deleted in any case by a parallel stream, so this is not really a lost update.
The optimal solution usually is to try out the most likely case and handle the error if it fails (inside the transaction, of course):
- If the key is probably missing, always insert it first. Refer to a unique violation of the restriction, cancel the update.
- If the key is probably present, always update first. Insert if row is not found. Handle a possible violation of a unique restriction, refuse to update.
In addition to being correct, this pattern is also optimal for speed: it is more efficient to try to insert and handle an exception than to make false locks. Locks mean reading logical pages (which may mean reading physical pages), and IO (even logical) is more expensive than SEH.
Update @Peter
Why is not one atomic operator? Say we have a trivial table:
create table Test (id int primary key);
Now, if I were to run this single operator from two threads, in the loop it would be βatomicβ, as you say, there cannot exist a race condition:
insert into Test (id) select top (1) id from Numbers n where not exists (select id from Test where id = n.id);
However, a violation of the primary key occurs in a couple of seconds:
Msg 2627, Level 14, State 1, Line 4
Violation of the PRIMARY KEY constraint "PK__Test__24927208". Unable to insert duplicate key in dbo.Test object.
Why? You are right that in terms of SQL queries there will be a βright thingβ on DELETE ... FROM ... JOIN , on WITH cte AS (SELECT...FROM ) DELETE FROM cte and in many other cases. But in these cases there is a crucial difference: the "subquery" refers to the purpose of the update or delete operation. For such cases, the corresponding lock will actually be used in the query plan, in fact, this behavior is crucial in some cases, for example, when implementing queues. Using tables as queues .
But in the original question, as in my example, the subquery is considered by the query optimizer as a subquery in the query, and not as a special request such as "check for update", which requires special protection against blocking. As a result, the execution of the subquery can be observed as a separate operation of the contemplative observer, thereby violating the "atomic" behavior of the operator. Unless special precautions are taken, multiple threads may try to insert the same value, both are convinced that they checked, and the value does not yet exist. Only one can succeed, the other will be in violation of the PC. Q.E.D.