Azure Data Factory copy activity from Storage to SQL: hangs at 70,000 rows

I have an Azure Data Factory pipeline with a Copy activity, configured as follows:

{ "type": "Copy", "name": "Copy from storage to SQL", "inputs": [ { "name": "storageDatasetName" } ], "outputs": [ { "name": "sqlOutputDatasetName" } ], "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "SqlSink" } }, "policy": { "concurrency": 1, "retry": 3 }, "scheduler": { "frequency": "Month", "interval": 1 } } 

The input data is about 90 MB, roughly 1.5 million rows, split across approximately 20 block blob files of about 4.5 MB each in Azure Storage. Here is a sample of the data (CSV):

    A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986
    A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536
    A81001,7,4,1,32,1342,278.0,7.5578647,316.95795,65.65895

The sink is an Azure SQL Database at the S2 tier, rated at 50 DTUs. I created a simple table with sensible data types and no keys, indexes or anything else of interest, just columns:

    CREATE TABLE [dbo].[Prescriptions](
        [Practice] [char](6) NOT NULL,
        [BnfChapter] [tinyint] NOT NULL,
        [BnfSection] [tinyint] NOT NULL,
        [BnfParagraph] [tinyint] NOT NULL,
        [TotalItems] [int] NOT NULL,
        [TotalQty] [int] NOT NULL,
        [TotalActCost] [float] NOT NULL,
        [TotalItemsPerThousand] [float] NOT NULL,
        [TotalQtyPerThousand] [float] NOT NULL,
        [TotalActCostPerThousand] [float] NOT NULL
    )

The source, the sink, and the Data Factory are all in the same region (North Europe).

According to Microsoft's “Copy Activity performance and tuning guide,” for an Azure Storage source and an Azure SQL S2 sink I should get about 0.4 Mbps. By my calculation that means 90 MB (roughly 720 megabits) should transfer in about 720 ÷ 0.4 ≈ 1,800 seconds, i.e. around half an hour (is that right?).

For some reason it copies 70,000 rows very quickly and then appears to hang. Using SQL Server Management Studio I can see the row count in the destination table is exactly 70,000, and it has not increased at all in 7 hours, yet the copy activity is still showing as running, with no errors.
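
For anyone wanting to reproduce the check, a simple count in SSMS against the table from the DDL above is enough:

    SELECT COUNT(*) AS LoadedRows FROM [dbo].[Prescriptions];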

Any ideas why it hangs at 70,000 rows? I can't see anything unusual about row 70,001 of the data that would cause a problem. I have tried tearing down the Data Factory and starting again, and I always get the same behaviour. I have another copy activity with a smaller table (8,000 rows) that completes within 1 minute.

1 answer

Just to answer my own question, in case it helps someone else:

The problem was null values. The reason my job hung at 70,000 rows is that row 76,560 of my source blob file had a null value in one of the columns. The HIVE script I had used to generate the blob file wrote nulls as '\N'. On top of that, my SQL sink table specified 'NOT NULL' for that column, and the column was a FLOAT.

So I made two changes. First, I added the following property to my blob dataset definition:

 "nullValue": "\\N" 

Second, I made the SQL table column nullable. The job now runs to completion and no longer hangs! :)
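
The post above doesn't identify which column held the null, so [TotalActCost] below is just a stand-in for whichever FLOAT column it actually was:

    -- Hypothetical column: substitute whichever FLOAT column held the null
    ALTER TABLE [dbo].[Prescriptions]
        ALTER COLUMN [TotalActCost] [float] NULL;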

The annoying thing is that Data Factory didn't fail, it just got stuck; it would have been nice if the job had failed with a useful error message telling me which row of the data was the problem. I think it stuck at 70,000 rather than 76,560 because the default write batch size is 10,000.
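
That batch size is configurable on the sink; the activity above was presumably running with the default, which would be equivalent to writing:

    "sink": {
        "type": "SqlSink",
        "writeBatchSize": 10000
    }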


Source: https://habr.com/ru/post/1245573/

