How to remove contiguous sequences of nearly identical records from a database

I have a SQL Server database containing real-time stock quotes.

There is a Quotes table that contains what you would expect: sequence number, ticker symbol, time, price, bid, bid size, ask, ask size, etc.

The sequence number corresponds to the message that was received containing data for the set of ticker symbols being tracked. A new message (with a new, increasing sequence number) is received whenever anything changes for any tracked symbol. The message contains data for all symbols (even those where nothing has changed).

When the data was loaded into the database, a record was inserted for each symbol in each message, even for symbols where nothing had changed since the previous message. As a result, many records contain redundant information (only the sequence number differs), and I want to delete these redundant records.

This is not the same as deleting all but one record from the entire database for each combination of identical columns (already answered). Rather, I want to compress each contiguous block of identical records (identical except for the sequence number) into a single record. When finished, there may still be duplicate records, but with different records between them.

My approach was to find contiguous ranges of records (for a ticker symbol) where everything is the same except the sequence number.

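In the sample data below, the key is Sequence + Symbol. For simplicity, only Price is shown and stands in for all the other columns. For X, the range to delete is [1, 6]; for Y, the ranges are [1, 2], [4, 5], and [7, 7]:

Before: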

Sequence  Symbol  Price
   0        X      $10
   0        Y      $ 5
   1        X      $10
   1        Y      $ 5
   2        X      $10
   2        Y      $ 5
   3        X      $10
   3        Y      $ 6
   4        X      $10
   4        Y      $ 6
   5        X      $10
   5        Y      $ 6
   6        X      $10
   6        Y      $ 5
   7        X      $11
   7        Y      $ 5

After:

Sequence  Symbol  Price
   0        X      $10
   0        Y      $ 5
   3        Y      $ 6
   6        Y      $ 5
   7        X      $11

Note that (Y, $5) appears twice, with (Y, $6) in between.
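The desired behaviour can be pinned down with a small reference sketch outside the database. This is only an illustration in Python, not part of the original question: the function name `collapse_runs` and the tuple layout `(sequence, symbol, price)` are assumptions made for the example. It keeps the first record of each contiguous run per symbol.

```python
def collapse_runs(rows):
    """Keep only the first row of each contiguous run of identical prices per symbol.

    rows: iterable of (sequence, symbol, price) tuples, in any order.
    """
    last_price = {}  # symbol -> price of the most recent row seen for it
    kept = []
    for seq, symbol, price in sorted(rows):
        # A row is redundant only if it repeats the immediately preceding
        # price for the same symbol.
        if symbol not in last_price or last_price[symbol] != price:
            kept.append((seq, symbol, price))
        last_price[symbol] = price
    return kept


# The sample data from the "before" table above.
SAMPLE = [(s, "X", 10 if s < 7 else 11) for s in range(8)] + [
    (s, "Y", p) for s, p in enumerate([5, 5, 5, 6, 6, 6, 5, 5])
]

for row in collapse_runs(SAMPLE):
    print(row)
```

On the sample data this prints the five rows of the second table, including both (Y, 5) records.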

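The query below finds those ranges. For each record, it looks back over a limited window (the BETWEEN condition) for the closest earlier record with the same symbol but a different price, then groups on that record. The resulting ranges are meant for a delete of the form "... WHERE Sequence BETWEEN StartOfRange AND EndOfRange".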

SELECT
   GroupsOfIdenticalRecords.Symbol,
   MIN(GroupsOfIdenticalRecords.Sequence)+1 AS StartOfRange,
   MAX(GroupsOfIdenticalRecords.Sequence) AS EndOfRange
FROM
   (
   SELECT
      Q1.Symbol,
      Q1.Sequence,
      MAX(Q2.Sequence) AS ClosestEarlierDifferentRecord
   FROM
      Quotes AS Q1
   LEFT OUTER JOIN
      Quotes AS Q2
   ON
          Q2.Sequence BETWEEN Q1.Sequence-100 AND Q1.Sequence-1
      AND Q2.Symbol=Q1.Symbol
      AND Q2.Price<>Q1.Price
   GROUP BY
      Q1.Sequence,
      Q1.Symbol
   ) AS GroupsOfIdenticalRecords
GROUP BY
   GroupsOfIdenticalRecords.Symbol,
   GroupsOfIdenticalRecords.ClosestEarlierDifferentRecord

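This works, but it performs very poorly (running it from SSMS) on the 2+ million records in the table. Changing "-100" to "-2" does not help. My guess is that, despite the "ON" clause of the LEFT OUTER JOIN (which should limit the 2 million rows to at most 100 matches each), SQL Server combines all 2 million rows of Q1 with all 2 million rows of Q2 (4e12 combinations) and only then filters by ON.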

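If I restrict the query to a subset (for example, "(SELECT TOP 100000 * FROM Quotes) AS Q1", and the same for Q2), it works. So I could run it 20 times, with "WHERE Sequence BETWEEN 0 AND 99999", then "... BETWEEN 100000 AND 199999", etc. (In practice the blocks would have to overlap, [0, 99999], [99900, 199999], etc., to catch ranges that cross a block boundary.)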

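The query below returns the boundaries of blocks of 100,000 ([0,99999], [100000, 199999], etc.). But how do I run the first query once per block? I understand that SQL is set-based and that thinking in terms of "for each" is discouraged. It is easy to apply MIN(), MAX(), etc. to each group (as below), but how do I apply a whole query (the join of Q1 and Q2) to each group? Is that possible? Or is there a better (more efficient) way altogether?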

SELECT
   CONVERT(INTEGER, Sequence / 100000)*100000 AS BlockStart,
   MIN(((1+CONVERT(INTEGER, Sequence / 100000))*100000)-1) AS BlockEnd
FROM
   Quotes
GROUP BY
   CONVERT(INTEGER, Sequence / 100000)*100000
2 answers

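One approach uses the difference of two row numbers. seq1 numbers the rows within each symbol, seq2 numbers the rows within each symbol and price, and diff is seq1 - seq2. For your sample data: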

Sequence  Symbol  Price    seq1    seq2   diff
   0        X      $10      1       1       0
   0        Y      $ 5      1       1       0
   1        X      $10      2       2       0
   1        Y      $ 5      2       2       0
   2        X      $10      3       3       0
   2        Y      $ 5      3       3       0
   3        X      $10      4       4       0
   3        Y      $ 6      4       1       3
   4        X      $10      5       5       0
   4        Y      $ 6      5       2       3
   5        X      $10      6       6       0
   5        Y      $ 6      6       3       3
   6        X      $10      7       7       0
   6        Y      $ 5      7       4       3
   7        X      $11      8       1       7
   7        Y      $ 5      8       5       3

Note that, for a given symbol and price, diff is constant within each contiguous run.

In SQL, this looks like:

select min(q.sequence) as sequence, symbol, price
from (select q.*,
             (row_number() over (partition by symbol order by sequence) -
              row_number() over (partition by symbol, price order by sequence)
             ) as grp
      from quotes q
     ) q
group by symbol, grp, price;
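To check the query against the sample data from the question, it can be run in an in-memory database. The sketch below uses Python's `sqlite3` module as a stand-in for SQL Server, which is an assumption made only so the example is self-contained; it needs SQLite 3.25 or later for window functions. The table definition and the final `order by` are added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns used in the sample data.
conn.execute(
    "CREATE TABLE quotes (sequence INTEGER, symbol TEXT, price INTEGER, "
    "PRIMARY KEY (sequence, symbol))"
)
prices = {
    "X": [10, 10, 10, 10, 10, 10, 10, 11],
    "Y": [5, 5, 5, 6, 6, 6, 5, 5],
}
conn.executemany(
    "INSERT INTO quotes VALUES (?, ?, ?)",
    [(seq, sym, p) for sym, ps in prices.items() for seq, p in enumerate(ps)],
)

# The query from the answer, with an ORDER BY added for stable output.
QUERY = """
select min(q.sequence) as sequence, symbol, price
from (select q.*,
             (row_number() over (partition by symbol order by sequence) -
              row_number() over (partition by symbol, price order by sequence)
             ) as grp
      from quotes q
     ) q
group by symbol, grp, price
order by sequence, symbol
"""
first_rows = conn.execute(QUERY).fetchall()
for row in first_rows:
    print(row)
```

Each returned row is the first record of a run, so the result is the set of records to keep.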



Here is how the query in the other answer works.

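One counter (seq1) numbers the rows for each symbol, and a second counter (seq2) numbers the rows for each symbol and value (here, Price). Within a run of identical records, seq1 and seq2 advance in step (that is, diff stays the same). When records with a different value intervene, seq1 keeps counting while seq2 for the original value does not, so seq1 "skips" ahead of seq2. When the original value returns, seq2 resumes where it left off but seq1 has moved on, so diff is larger than before and marks a new run. Together with Symbol/Price, diff therefore identifies each run uniquely.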

In the SQL, the two counters are the two row_number() expressions with OVER clauses. The first is seq1, the second is seq2.

To cover the other fields (Bid, Ask, etc.), add them to the second OVER clause and to the GROUP BY:

row_number() over (partition by Symbol, Price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time order by Sequence)

group by Symbol, grp, price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time

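To get the ranges to delete rather than the records to keep, use > MIN(...) and <= MAX(...) of Sequence as the bounds for each group.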
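A minimal sketch of the delete step, using the > MIN(...) and <= MAX(...) bounds per group. It again runs in SQLite through Python's `sqlite3` module rather than SQL Server (an assumption made for a self-contained example; SQLite 3.25 or later is needed), and the column aliases `first_seq` and `last_seq` are hypothetical names. The ranges are computed first and the deletes are issued afterwards, so the row numbering is never evaluated against a partly deleted table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns used in the sample data.
conn.execute("CREATE TABLE quotes (sequence INTEGER, symbol TEXT, price INTEGER)")
prices = {
    "X": [10, 10, 10, 10, 10, 10, 10, 11],
    "Y": [5, 5, 5, 6, 6, 6, 5, 5],
}
conn.executemany(
    "INSERT INTO quotes VALUES (?, ?, ?)",
    [(seq, sym, p) for sym, ps in prices.items() for seq, p in enumerate(ps)],
)

# One row per contiguous run: its symbol plus first and last sequence number.
RANGES = """
select symbol, min(sequence) as first_seq, max(sequence) as last_seq
from (select q.*,
             (row_number() over (partition by symbol order by sequence) -
              row_number() over (partition by symbol, price order by sequence)
             ) as grp
      from quotes q
     ) q
group by symbol, grp, price
"""
ranges = conn.execute(RANGES).fetchall()

# Delete everything after the first record of each run: (first_seq, last_seq].
conn.executemany(
    "delete from quotes where symbol = ? and sequence > ? and sequence <= ?",
    ranges,
)

remaining = conn.execute(
    "select sequence, symbol, price from quotes order by sequence, symbol"
).fetchall()
for row in remaining:
    print(row)
```

Runs of length one produce an empty range (first_seq equals last_seq), so nothing is deleted for them.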


Source: https://habr.com/ru/post/1529081/

