Filter duplicates from a loaded dataset in SSIS

I am doing some ETL in SSIS to build several datasets, one of which is dates. When building the set of dates for a date dimension, I can use a Lookup against what is already in the dimension and redirect the rows with no match. Those are treated as new dates and added to the table.

The problem is that the dataset I have may itself contain duplicate dates. That will cause unique-key errors on the date key when the rows are inserted into the dimension table. So I am looking for a way to filter duplicates within a dataset that is already loaded into the SSIS pipeline.

I could use DISTINCT when loading the dates, but the column in this case is a DATETIME. I have to use a Data Conversion transformation later to turn it into a DATE, keeping only the date component. I am looking for unique days, and DISTINCT on the DATETIME values will not give me that.
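To make the difference concrete, here is a minimal sketch in Python rather than SSIS. The sample timestamps are invented for illustration. It shows that de-duplicating full DATETIME values keeps two rows for the same day, while truncating to the date first leaves one row per day.

```python
from datetime import datetime

# Hypothetical sample rows: the first two fall on the same calendar day.
rows = [
    datetime(2024, 1, 5, 8, 30),
    datetime(2024, 1, 5, 17, 45),
    datetime(2024, 1, 6, 9, 0),
]

# DISTINCT on the full DATETIME: all three values differ, so nothing is removed.
distinct_datetimes = set(rows)

# Truncate to the date component first, then de-duplicate: one entry per day.
distinct_days = {ts.date() for ts in rows}

print(len(distinct_datetimes))  # 3
print(len(distinct_days))       # 2
```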

I can’t use the SSIS Lookup as before, because it requires a connection manager pointing to a database.

I could set the OLE DB Destination not to use bulk insert and to ignore any errors. That assumes, however, that duplicate dates will be the only errors.

I am new to SSIS and could not find a transformation that lets me compare a row with the other rows in the set.

1 answer

You can use the Sort transformation and select the option to remove duplicates, or use the Aggregate transformation with only Group By operations (which is more or less equivalent to DISTINCT). Note that both are asynchronous, blocking transformations: all rows must reach the component before any are passed on, unlike synchronous transformations, which simply consume and emit row buffers as they arrive.
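The contrast between the two kinds of component can be sketched outside SSIS. The Python below is only an analogy, not SSIS code, and the function names and sample data are hypothetical. The generator stands in for a synchronous step such as a data conversion. The sort-based de-duplication stands in for a blocking step, because it cannot emit anything until it has seen every row.

```python
from datetime import date, datetime
from typing import Iterable, Iterator, List


def to_date(rows: Iterable[datetime]) -> Iterator[date]:
    """Synchronous-style step: each row is converted and passed on as it arrives."""
    for ts in rows:
        yield ts.date()


def sort_remove_duplicates(rows: Iterable[date]) -> List[date]:
    """Blocking-style step: sorted() consumes every row before any output exists."""
    result: List[date] = []
    for d in sorted(rows):
        # After sorting, duplicates are adjacent, so compare with the last kept row.
        if not result or result[-1] != d:
            result.append(d)
    return result


# Hypothetical input: unsorted, with two timestamps on the same day.
incoming = [
    datetime(2024, 1, 6, 9, 0),
    datetime(2024, 1, 5, 8, 30),
    datetime(2024, 1, 5, 17, 45),
]

unique_days = sort_remove_duplicates(to_date(incoming))
print(unique_days)
```

The order matters here as it does in the package: the conversion to a date has to happen upstream of the de-duplication, otherwise rows that differ only by time of day are still treated as distinct.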


Source: https://habr.com/ru/post/902210/
