How to split flat file data and load into parent-child tables in a database?

I have denormalized data (coming from a file) that needs to be imported into parent-child tables. The source data looks something like this:

Account#  Name      Membership  Email
101       J Burns   Gold        alpha@foo.com
101       J Burns   Gold        bravo@foo.com
101       J Burns   Gold        charlie@yay.com
227       H Gordon  Silver      red@color.com
350       B Clyde   Silver      italian@food.com
350       B Clyde   Silver      mexican@food.com

What SSIS components or tactics should I use to load the first three columns into the parent table and the fourth column (Email) into the child table? I have several options for the parent key:

  • Use Account # directly as the primary key.
  • Use a surrogate key generated by SSIS during the import process.
  • Use an identity column configured as the primary key.

I believe I have listed the primary key options in increasing order of complexity. I would like to know how to implement the first and last options; I will work out the middle one myself. To emphasize again, I am interested in a definitive SSIS solution: I am looking for an answer that uses SSIS, not a procedural, technology-neutral answer.

My question is somewhat similar to another SO question whose answer is of uncertain viability. I hope a more detailed guide can be given. I already know how to solve this problem by creating an intermediate staging table, where the separation between parents and children is actually handled by direct SQL. However, I wonder how this can be done without such an intermediate step.

It seems to me that this kind of import is so common that there should be a well-documented, boilerplate way to handle it, one that SSIS supports directly. So far, I have not found a clear answer to this question.

Update #1. Based on the comments, I adjusted the sample data to be more clearly denormalized. I also removed "flat" from "flat file" so that the semantics do not interfere with the question.

Update #2. I have emphasized my interest in a solution that uses SSIS.

2 answers

Here is one possible option that you can consider for loading parent-child data. This approach consists of two steps. In step one, read the source file and write the data to the parent table. In step two, read the source file again and use a Lookup transformation to fetch the parent information needed to write the data to the child table. The following example uses the data provided in the question. This example was created using SSIS 2008 R2 and a SQL Server 2008 database.

Step by step:

  • Create a sample flat file named Source.txt, as shown in screenshot #1.

  • In the SQL database, create two tables named dbo.Parent and dbo.Child using the scripts given in the SQL scripts section. Both tables have an auto-generated identity column.

  • In the package, create an OLE DB connection to connect to SQL Server and a Flat File connection to read the source file, as shown in screenshot #2. Configure the Flat File connection as shown in screenshots #3 - #9.

  • On the Control Flow tab, place two Data Flow Tasks, as shown in screenshot #10.

  • Inside the Data Flow Task named Parent, place a Flat File Source, a Sort transformation, and an OLE DB Destination, as shown in screenshot #11.

  • Configure the Flat File Source as shown in screenshots #12 and #13. It reads the flat file.

  • Configure the Sort transformation as shown in screenshot #14. We need to eliminate duplicate values so that only unique records are inserted into the parent table dbo.Parent.

  • Configure the OLE DB Destination as shown in screenshots #15 and #16. We need to insert the data into the parent table dbo.Parent.

  • Inside the Data Flow Task named Child, place a Flat File Source, a Lookup transformation, and an OLE DB Destination, as shown in screenshot #17.

  • Configure the Flat File Source as shown in screenshots #12 and #13. This configuration is the same as that of the Flat File Source in the previous Data Flow Task.

  • Configure the Lookup transformation as shown in screenshots #18 - #20. We need to find the parent ID in the dbo.Parent table using the key columns present in the file. Since dbo.Parent holds Account, Name and Membership, those are the columns available for matching. If the file had a single unique column, you could use that column alone to fetch the parent ID.

  • Configure the OLE DB Destination as shown in screenshots #21 and #22. We need to insert the Email column along with the parent ID into the dbo.Child table.

  • Screenshot #23 shows the data in the tables before the package execution.

  • Screenshots #24 and #25 show a sample package execution.

  • Screenshot #26 shows the data in the tables after the package execution.

Hope this helps.

SQL scripts:

    CREATE TABLE [dbo].[Child](
        [ChildId] [int] IDENTITY(1,1) NOT NULL,
        [ParentId] [int] NULL,
        [Email] [varchar](21) NULL,
    CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED ([ChildId] ASC)) ON [PRIMARY]
    GO

    CREATE TABLE [dbo].[Parent](
        [ParentId] [int] IDENTITY(1,1) NOT NULL,
        [Account] [varchar](12) NULL,
        [Name] [varchar](12) NULL,
        [Membership] [varchar](14) NULL,
    CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED ([ParentId] ASC)) ON [PRIMARY]
    GO
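For readers who want to see the logic of the two data flows outside the designer, here is a minimal Python sketch. It is not SSIS and not part of the original answer: it uses an in-memory SQLite database as a stand-in for SQL Server, with AUTOINCREMENT playing the role of IDENTITY(1,1), and the function name load is made up for illustration. It assumes the parent lookup matches on Account, Name and Membership.

```python
import sqlite3

# Sample rows from the question: (Account, Name, Membership, Email).
ROWS = [
    ("101", "J Burns", "Gold", "alpha@foo.com"),
    ("101", "J Burns", "Gold", "bravo@foo.com"),
    ("101", "J Burns", "Gold", "charlie@yay.com"),
    ("227", "H Gordon", "Silver", "red@color.com"),
    ("350", "B Clyde", "Silver", "italian@food.com"),
    ("350", "B Clyde", "Silver", "mexican@food.com"),
]


def load(rows):
    """Hypothetical helper: load parents first, then children via a key lookup."""
    conn = sqlite3.connect(":memory:")
    # SQLite stand-ins for dbo.Parent and dbo.Child.
    conn.execute(
        "CREATE TABLE Parent (ParentId INTEGER PRIMARY KEY AUTOINCREMENT,"
        " Account TEXT, Name TEXT, Membership TEXT)"
    )
    conn.execute(
        "CREATE TABLE Child (ChildId INTEGER PRIMARY KEY AUTOINCREMENT,"
        " ParentId INTEGER, Email TEXT)"
    )

    # Data flow "Parent": sort on the parent columns and drop duplicates,
    # which is the job of the Sort transformation in the package.
    parents = sorted({(a, n, m) for a, n, m, _ in rows})
    conn.executemany(
        "INSERT INTO Parent (Account, Name, Membership) VALUES (?, ?, ?)",
        parents,
    )

    # Data flow "Child": read the rows again and look up the generated
    # ParentId by the key columns, like the Lookup transformation.
    lookup = {
        (a, n, m): parent_id
        for parent_id, a, n, m in conn.execute(
            "SELECT ParentId, Account, Name, Membership FROM Parent"
        )
    }
    conn.executemany(
        "INSERT INTO Child (ParentId, Email) VALUES (?, ?)",
        [(lookup[(a, n, m)], email) for a, n, m, email in rows],
    )
    conn.commit()
    return conn
```

Calling load(ROWS) leaves three rows in Parent and six in Child, each child row carrying the identity value of its parent. The source is read twice on purpose, because the parent keys do not exist until the first pass has finished.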

Screenshots #1 - #26: [images]


If the data is sorted and Account # is an integer, I would:

Insert the emails into the child table (add an auto-increment column; this is best practice).

    1  101  alpha@foo.com
    2  101  bravo@foo.com
    3  101  charlie@yay.com
    etc.

Then I would insert the remaining fields into the parent table:

  • using Account # as the primary key
  • omitting the email addresses
  • skipping duplicates (easy if the data is sorted).

If you have a foreign key relationship, you need to perform the second step first (so that there are no orphan records).
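As a rough illustration of this ordering, here is a Python sketch that uses an in-memory SQLite database in place of the real server. The table layout and the function name load_with_account_key are assumptions made for the example, not something given in the answer. It relies on the input being sorted by account number, as the answer requires.

```python
import sqlite3

# Rows sorted by account number: (Account, Name, Membership, Email).
SORTED_ROWS = [
    (101, "J Burns", "Gold", "alpha@foo.com"),
    (101, "J Burns", "Gold", "bravo@foo.com"),
    (101, "J Burns", "Gold", "charlie@yay.com"),
    (227, "H Gordon", "Silver", "red@color.com"),
    (350, "B Clyde", "Silver", "italian@food.com"),
    (350, "B Clyde", "Silver", "mexican@food.com"),
]


def load_with_account_key(rows):
    """Hypothetical helper: Account # is the parent key, parents load first."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE Parent (Account INTEGER PRIMARY KEY,"
        " Name TEXT, Membership TEXT)"
    )
    conn.execute(
        "CREATE TABLE Child (ChildId INTEGER PRIMARY KEY AUTOINCREMENT,"
        " Account INTEGER NOT NULL REFERENCES Parent(Account), Email TEXT)"
    )

    # Parents first, so that no child row is ever orphaned. Because the
    # input is sorted, a duplicate is a row whose account equals the
    # previous row's account.
    previous = None
    for account, name, membership, _ in rows:
        if account != previous:
            conn.execute(
                "INSERT INTO Parent VALUES (?, ?, ?)",
                (account, name, membership),
            )
            previous = account

    # Then the emails, keyed by the account number itself.
    conn.executemany(
        "INSERT INTO Child (Account, Email) VALUES (?, ?)",
        [(account, email) for account, _, _, email in rows],
    )
    conn.commit()
    return conn
```

With the foreign key in place, inserting a child row for an account that has no parent row raises sqlite3.IntegrityError, which is the orphan case the ordering is meant to avoid.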

My two cents: I don't know what your requirements are, but this seems a bit over-engineered. If there is a small limit on the number of email addresses per account, I would consider adding multiple email columns to the main table instead, for speed and simplicity.
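To make the multiple-column alternative concrete, here is a small Python sketch that collapses the sorted rows into one record per account with a fixed number of email slots. The cap of three emails and the name widen are assumptions for illustration only; the answer does not state a limit.

```python
from itertools import groupby

MAX_EMAILS = 3  # hypothetical cap; the answer does not specify one

# Rows sorted by account number: (Account, Name, Membership, Email).
ROWS = [
    (101, "J Burns", "Gold", "alpha@foo.com"),
    (101, "J Burns", "Gold", "bravo@foo.com"),
    (101, "J Burns", "Gold", "charlie@yay.com"),
    (227, "H Gordon", "Silver", "red@color.com"),
    (350, "B Clyde", "Silver", "italian@food.com"),
    (350, "B Clyde", "Silver", "mexican@food.com"),
]


def widen(rows, max_emails=MAX_EMAILS):
    """Return one tuple per account: (Account, Name, Membership, Email1..EmailN)."""
    wide = []
    # groupby only merges adjacent rows, so the input must be sorted.
    for key, group in groupby(rows, key=lambda r: r[:3]):
        emails = [r[3] for r in group]
        if len(emails) > max_emails:
            raise ValueError(f"account {key[0]} has more than {max_emails} emails")
        # Pad with None so every record has the same number of email slots.
        wide.append(key + tuple(emails) + (None,) * (max_emails - len(emails)))
    return wide
```

The trade-off is visible in the ValueError branch: a single wide table is simpler to load, but any account that exceeds the cap no longer fits, which the parent-child design avoids.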


Source: https://habr.com/ru/post/892545/

