I opened a case with Microsoft technical support, and after a small amount back and forth, Pradeep MM (SQL Server Technical Support Technical Manager) completely coped.
The general process: read the list of files in the folder and, one after another, perform a series of voluminous inserts in these files (first, to read the first row, which we analyze for columns, and then to read data from the second + line). All massive inserts use the "ErrorFile" parameter to provide users with what information we can when their data is incorrectly formatted. The process has been working for 3 years, but in the context of recent stress tests (up to 8 simultaneous runs performed by one instance of SQL Server with all formatted files), we got the errors listed above.
We initially, although there were errors with GUID generation, due to this โalready openโ error, but this idea was ultimately discarded - if the newid () function did not function properly, much more people would have much more serious problems.
According to the Pradeep procedure, it is a step by step. Bulk Insert Operation Process:
- BULK INSERT A command is sent and parsed for syntax errors.
- Then the BULK INSERT command is compiled to generate the execution plan for the same
- During the compilation phase, if in the request, if we specified ERRORFILE, then we will create ErrorFile.log and ErrorFile.Error.Txt in the specified folder (the thing is important to understand here, the file will be 0kb in size)
- After the file is created, we will delete both files using the Windows API Calls
- Once the execution plan is ready, we will go to the execution stage and try to run the Bulk Insert command as part of this, we will recreate the folder (according to the documentation for the electronic book) in the folder ErrorFile.log and ErrorFile.Error.Txt must be in this place, otherwise we are executing http://msdn.microsoft.com/en-us/library/ms188365.aspx
- As soon as the execution is completed, if there are any errors in the Bulk insert corresponding errors are written to the error files if there are no errors, these 2 files will be deleted.
Running ProcMon (Process Monitor) during failed runs showed that the ErrorFile was successfully created and opened in step 3, but was NOT closed in step 4, as a result of which step 5 generated the error we saw. (For successful launches, the file was created and closed as expected.)
Further analysis of ProcMon showed that another process executing CMD.EXE returned operations "close the descriptor" in the file after an attempt to bulk insert. We use a procedure including xp_cmdshell to retrieve a list of files that need to be processed, and this will cause the CMD.EXE process. Heres the kicker:
... there is some business logic that runs CMD.EXE inside SQL Server, and since CMD.EXE is a child process, it inherits all the descriptors opened by the parent process (maybe this is some kind of synchronization problem when in CMD. EXE stores the descriptors of files opened at startup, and all those files whose descriptor is inherited by CMD.EXE cannot be deleted and can only be released after the destruction of CMD.EXE)
And thatโs all. A single run never touches this issue, as its xp_cmdshell call completes before the release of bulk inserts. But with parallel starts, especially with many parallel starts (I only got into a problem with 5 or more transitions), there was a synchronization problem, so:
- One of the SSIS packages Executes and calls a stored procedure that internally uses XP_CMDSHELL and runs CMD.EXE to list the files
- The same connection to the SQL server completes the file enumeration and then starts the bulk load operation and its phase compilation for the BULK INSERT command.
- In accordance with the design of Bulk Insert, we create an ErrorFile at the compilation stage, and then delete it after compilation. The phase is completed.
- At the same moment, another SSIS package starts and it calls a stored procedure that internally uses XP_CMDSHELL and runs CMD.EXE to list all the files
- CMD.EXE is a child process launched under the parent Process SQLServr.exe, therefore it by default inherits all the Handles created by SQLServr.exe (thus, this process receives all the handles for ERRORFILE that were created by BULK INSERT in the First Connection)
- Now, in the first connection, the compilation phase is completed and therefore we are trying to delete the file, during which we must close all the handles. We see that CMD.EXE holds the handle of the file and it is still open, and therefore we cannot delete the file. So without deleting the file, we proceed to the execution phase and in the execution phase we try to create a new ERRORFILE with the same name, but since the file already exists, we fail with the error "Operating system error code 80 (File exists.).".
My short-term workaround was: (1) implement a retry loop, create a new ErrorFile name and try to add a new bulk insert up to three times before giving up, and (2) create another procedure for our nightly processes to delete all the files found in our folder "ErrorFile".
The long-term fix is โโto revise our code so that we donโt list files through xp_cmdshell. This seems doable since the entire ETL process is terminated and managed by the SSIS package; Alternatively, CLR routines can be created and processed. So far, given our expected workload, enough work (especially considering that everything else works just now), so it may be a little before we begin the final fix.
Published to posterity if this ever happens to you!