Need an approach for working with small subsets of a large dataset

I've run into a conceptual problem that I'm having difficulty overcoming. I'm hoping the SO folks can help me with a push in the right direction.

I'm doing some ETL work where the source data is very similar and very large. I load it into a table intended for replication, and I only want the most basic information in this target table.

My source table looks something like this:

[image: sample of the source table]

I need my target table to reflect it as such:

[image: the desired target table]

As you can see, I did not duplicate the InTransit status where it was duplicated in the source table. The steps I'm trying to figure out how to achieve are:

  • Get the new rows entered since the last query. (Easy)
  • For each TrackingId, check whether each new status is already the latest status in the target; if so, ignore it, otherwise insert it. This means I must start from the earliest of the new statuses and work forward. (I have no clue how I will do this.)
  • Do this every 15 minutes so that the statuses stay current, which means step 2 must run quickly.

My source table may contain 100k+ rows, and since this has to run every 15 minutes it needs to be very efficient, so I'm really trying to avoid cursors.

Right now the only way I can think of doing this is with a CLR sproc, but I suspect there are better ways, so I'm hoping you guys can push me in the right direction.

EDIT: The 100k+ source rows cover multiple TrackingIds.

If you are on SQL 2008 R2, the date-truncation expressions below can be replaced with a CAST to DATE.

declare @tbl1 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)

declare @tbl2 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)

----Source data
insert into @tbl1 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(8,1,'InTransit','10/02/10  3:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(4,1,'Delayed','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(5,1,'InTransit','10/03/10 1:01')
insert into @tbl1 (id, trackingid, status, statusdate) values(6,1,'AtDest','10/03/10 2:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(7,1,'Deliv','10/03/10 3:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(3,2,'InTransit','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(9,2,'AtDest','10/04/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(10,2,'Deliv','10/04/10 1:05')
insert into @tbl1 (id, trackingid, status, statusdate) values(11,1,'Delayed','10/02/10 2:05')

----Target data
insert into @tbl2 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(3,1,'Deliv','10/03/10 3:00')


select d.* from
(
    select 
    * ,
    ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
    from @tbl1
) d

where 
not exists
(
    select RN from
    (
        select 
        * ,
        ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
        from @tbl1
    )f where f.RN = d.RN + 1 and d.status = f.status and f.trackingid = d.trackingid and 
    CAST((STR( YEAR( f.statusdate ) ) + '/' +STR( MONTH(f.statusdate ) ) + '/' +STR( DAY( f.statusdate ) )) AS DATETIME) =
            CAST((STR( YEAR( d.statusdate ) ) + '/' +STR( MONTH(d.statusdate ) ) + '/' +STR( DAY( d.statusdate ) )) AS DATETIME)
)

and
not exists 
(
    select 1 from @tbl2 t2
    where (t2.trackingid = d.trackingid
    and t2.statusdate = d.statusdate
    and t2.status = d.status)
)
and (
    not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
    or not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            and t2.statusdate <= d.statusdate
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
)
order by trackingid,statusdate
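
The note about casting to DATE can be illustrated briefly. This is only a sketch, assuming SQL Server 2008 or later (where the DATE type exists); the table variable @rows and its sample rows are made up for the example.

```sql
-- Hypothetical sample data, not taken from the tables above
DECLARE @rows TABLE (trackingid INT, status VARCHAR(50), statusdate DATETIME);

INSERT INTO @rows VALUES (1, 'InTransit', '2010-10-02T01:00:00');
INSERT INTO @rows VALUES (1, 'InTransit', '2010-10-02T03:00:00');
INSERT INTO @rows VALUES (1, 'Delayed',   '2010-10-03T01:00:00');

-- CAST(statusdate AS DATE) drops the time part, so it can stand in for the
-- STR(YEAR(...)) + '/' + STR(MONTH(...)) + '/' + STR(DAY(...)) expression
SELECT trackingid,
       status,
       statusdate,
       ROW_NUMBER() OVER (PARTITION BY trackingid, CAST(statusdate AS DATE)
                          ORDER BY statusdate) AS RN
FROM   @rows;
```

The numbering restarts for each tracking id and calendar day, which is what the longer expression in the query above is used for.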

For a single tracking id, this query keeps the first row of each run of identical statuses:

WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    tracking
        WHERE   trackingId = @id
        ),
        qs AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY status, rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate

A script to test it:

DECLARE @tracking TABLE
        (
        id INT NOT NULL PRIMARY KEY,
        trackingId INT NOT NULL,
        status INT,
        statusDate DATETIME
        )

INSERT
INTO    @tracking
SELECT  1, 1, 1, DATEADD(d, 1, '2010-01-01')
UNION ALL
SELECT  2, 1, 2, DATEADD(d, 2, '2010-01-01')
UNION ALL
SELECT  3, 1, 2, DATEADD(d, 3, '2010-01-01')
UNION ALL
SELECT  4, 1, 2, DATEADD(d, 4, '2010-01-01')
UNION ALL
SELECT  5, 1, 3, DATEADD(d, 5, '2010-01-01')
UNION ALL
SELECT  6, 1, 3, DATEADD(d, 6, '2010-01-01')
UNION ALL
SELECT  7, 1, 4, DATEADD(d, 7, '2010-01-01')
UNION ALL
SELECT  8, 1, 2, DATEADD(d, 8, '2010-01-01')
UNION ALL
SELECT  9, 1, 2, DATEADD(d, 9, '2010-01-01')
UNION ALL
SELECT  10, 1, 1, DATEADD(d, 10, '2010-01-01')
;
WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    @tracking
        ),
        qs AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY status, rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate
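
The same trick should extend to every tracking id in one pass. A minimal sketch, assuming SQL Server 2005 or later: trackingId is added to each PARTITION BY, and status is kept alongside rn - rns so that runs of different statuses cannot land in the same group. The table variable @t and its rows are made up for illustration.

```sql
-- Hypothetical sample data covering two tracking ids
DECLARE @t TABLE
        (
        trackingId INT NOT NULL,
        status INT NOT NULL,
        statusDate DATETIME NOT NULL
        );

INSERT
INTO    @t
SELECT  1, 1, '2010-01-02T00:00:00'
UNION ALL
SELECT  1, 2, '2010-01-03T00:00:00'
UNION ALL
SELECT  1, 2, '2010-01-04T00:00:00'   -- repeat, dropped
UNION ALL
SELECT  1, 3, '2010-01-05T00:00:00'
UNION ALL
SELECT  1, 2, '2010-01-06T00:00:00'   -- status 2 again after 3, kept
UNION ALL
SELECT  2, 2, '2010-01-02T00:00:00'
UNION ALL
SELECT  2, 2, '2010-01-03T00:00:00'   -- repeat, dropped
;
WITH    q AS
        (
        SELECT  trackingId, status, statusDate,
                -- position within the tracking id
                ROW_NUMBER() OVER (PARTITION BY trackingId ORDER BY statusDate) AS rn,
                -- position within the tracking id and status
                ROW_NUMBER() OVER (PARTITION BY trackingId, status ORDER BY statusDate) AS rns
        FROM    @t
        ),
        qs AS
        (
        SELECT  trackingId, status, statusDate,
                -- rn - rns is constant inside one unbroken run of a status
                ROW_NUMBER() OVER (PARTITION BY trackingId, status, rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  trackingId, status, statusDate
FROM    qs
WHERE   rnn = 1
ORDER BY
        trackingId, statusDate;
```

Each row with rnn = 1 is the earliest row of its run, so a status that comes back later (status 2 for tracking id 1 here) is reported again rather than being treated as a duplicate.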

For a single TrackingID, this can be done with a CTE:

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
        ab = ROW_NUMBER() OVER (ORDER BY [StatusDate])
    FROM #foo
    WHERE TrackingID = 1
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    [StatusDate];
GO
DROP TABLE #foo;

For multiple TrackingIDs:

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';
INSERT #foo SELECT 2, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 2, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 2, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 2, 'AtDest',    '2010-10-04 13:42';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
        ab = ROW_NUMBER() OVER (PARTITION BY TrackingID ORDER BY [StatusDate])
    FROM #foo
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
    AND a.TrackingID = b.TrackingID
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY TrackingID, [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    TrackingID, 
    [StatusDate];
GO
DROP TABLE #foo;
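
To also cover the target comparison from the question (skip a new status that is already the latest one in the target), something along these lines might work. It is only a sketch, assuming SQL Server 2005 or later, statuses that are never empty or NULL, and new source rows that are all later than anything already in the target for their TrackingID. The table variables @src and @tgt and their sample rows are made up for illustration.

```sql
DECLARE @src TABLE (Id INT, TrackingID INT, [Status] VARCHAR(32), StatusDate DATETIME);
DECLARE @tgt TABLE (TrackingID INT, [Status] VARCHAR(32), StatusDate DATETIME);

-- Hypothetical target contents left by the previous run
INSERT @tgt SELECT 1, 'PickedUp',  '2010-10-01T08:15:00';
INSERT @tgt SELECT 1, 'InTransit', '2010-10-02T03:07:00';

-- Hypothetical new source rows picked up since the previous run
INSERT @src SELECT 3, 1, 'InTransit', '2010-10-02T10:28:00'; -- same as latest target status, skipped
INSERT @src SELECT 4, 1, 'Delayed',   '2010-10-03T09:52:00';
INSERT @src SELECT 5, 1, 'Delayed',   '2010-10-03T11:00:00'; -- same as previous new row, skipped
INSERT @src SELECT 6, 1, 'InTransit', '2010-10-03T20:09:00';
INSERT @src SELECT 7, 2, 'PickedUp',  '2010-10-03T08:00:00'; -- TrackingID not in target yet

WITH numbered AS
(
    SELECT
        TrackingID,
        [Status],
        StatusDate,
        rn = ROW_NUMBER() OVER (PARTITION BY TrackingID ORDER BY StatusDate)
    FROM @src
)
INSERT @tgt (TrackingID, [Status], StatusDate)
SELECT
    cur.TrackingID,
    cur.[Status],
    cur.StatusDate
FROM numbered AS cur
LEFT OUTER JOIN numbered AS prev
    ON prev.TrackingID = cur.TrackingID
    AND prev.rn = cur.rn - 1
OUTER APPLY
(
    -- latest status already stored for this TrackingID, if any
    SELECT TOP (1) t.[Status]
    FROM @tgt AS t
    WHERE t.TrackingID = cur.TrackingID
    ORDER BY t.StatusDate DESC
) AS latest
-- compare with the previous new row; for the first new row fall back to the
-- latest target status, and to '' when the TrackingID is new to the target
WHERE cur.[Status] <> COALESCE(prev.[Status], latest.[Status], '');

SELECT TrackingID, [Status], StatusDate
FROM @tgt
ORDER BY TrackingID, StatusDate;
```

Unlike the queries above, this keeps the first row of each run rather than the last, since the earlier row is the one the target would already hold.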

If this is SQL 2005, you can use ROW_NUMBER with a subquery or CTE. If the data set is really huge and performance is a problem, then one of the answers above, which were posted while I was trying to get the code block to work, may well be more efficient.

/**
*  This is just to create a sample table to use in the test query
**/

DECLARE @test TABLE(ID INT, TrackingID INT, Status VARCHAR(20), StatusDate DATETIME)
INSERT    @test
SELECT    1,1,'PickedUp', '01 jan 2010 08:00' UNION
SELECT    2,1,'InTransit', '01 jan 2010 08:01' UNION
SELECT    3,1,'InTransit', '01 jan 2010 08:02' UNION
SELECT    4,1,'Delayed', '01 jan 2010 08:03' UNION
SELECT    5,1,'InTransit', '01 jan 2010 08:04' UNION
SELECT    6,1,'AtDest', '01 jan 2010 08:05' UNION
SELECT    7,1,'Deliv', '01 jan 2010 08:06'


/**
*  This would be the select code to exclude the duplicate entries. 
*  Sorting desc in row_number would get latest instead of first
**/
;WITH n AS
(
    SELECT    ID,
            TrackingID,
            Status,
            StatusDate,
            --For each Status for a tracking ID number by ID (could use date but 2 may be the same)
            ROW_NUMBER() OVER(PARTITION BY TrackingID, Status ORDER BY ID) AS [StatusNumber]
    FROM    @test
)
SELECT    ID,
        TrackingID,
        Status,
        StatusDate
FROM    n
WHERE    StatusNumber = 1
ORDER    BY ID

I think this example will do what you are looking for:

CREATE TABLE dbo.srcStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

CREATE TABLE dbo.tgtStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Delayed','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Delayed','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Deliv','10/4/2010 5:05 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Deliv','10/4/2010 5:05 PM');

WITH    cteSrcTrackingIds
          AS ( SELECT DISTINCT
                        TrackingId
               FROM     dbo.srcStatus
             ),
        cteAllTrackingIds
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate
               FROM     dbo.srcStatus
               UNION
               SELECT   tgtStatus.TrackingId ,
                        tgtStatuS.[Status] ,
                        tgtStatus.StatusDate
               FROM     cteSrcTrackingIds
                        INNER JOIN dbo.tgtStatus ON cteSrcTrackingIds.TrackingId = tgtStatus.TrackingId
             ),
        cteAllTrackingIdsWithRownums
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId ORDER BY StatusDate ) AS rownum
               FROM     cteAllTrackingIds
             ),
        cteTrackingIdsWorkingSet
          AS ( SELECT   src.rownum AS [id] ,
                        src2.rownum AS [id2] ,
                        src.TrackingId ,
                        src.[Status] ,
                        src.StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY src.TrackingId,
                                            src.rownum ORDER BY src.StatusDate ) AS rownum
               FROM     cteAllTrackingIdsWithRownums AS [src]
                        LEFT OUTER JOIN cteAllTrackingIdsWithRownums AS [src2] ON src.TrackingId = src2.TrackingId
                                                              AND src.rownum < src2.rownum
                                                              AND src.[Status] != src2.[Status]
             ),
        cteTrackingIdsSubset
          AS ( SELECT   id ,
                        TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId, id2 ORDER BY id ) AS rownum
               FROM     cteTrackingIdsWorkingSet
               WHERE    rownum = 1
             )
    INSERT  INTO dbo.tgtStatus
            ( TrackingId ,
              [status] ,
              StatusDate
            )
            SELECT  cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.[status] ,
                    cteTrackingIdsSubset.StatusDate
            FROM    cteTrackingIdsSubset
                    LEFT OUTER JOIN dbo.tgtStatus ON cteTrackingIdsSubset.TrackingId = tgtStatus.TrackingId
                                                     AND cteTrackingIdsSubset.[status] = tgtStatus.[status]
                                                     AND cteTrackingIdsSubset.StatusDate = tgtStatus.StatusDate
            WHERE   cteTrackingIdsSubset.rownum = 1
                    AND tgtStatus.id IS NULL
            ORDER BY cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.StatusDate;

Source: https://habr.com/ru/post/1768063/
