Counting sequential values in an ungrouped table

I have an SQL table with samples defined as follows:

  • sampledate (datetime, 24 records per day per parameter)
  • parameterID (int)
  • value (decimal)
  • valid (bit, 1 = valid data, 0 = invalid data)

The (sampledate, parameterID) pair is unique.

Each sample is on the hour, in the format 02/02/2011 12:00, but there can be fewer than 24 rows per day for a given parameter (for example, the probe may fail and output fewer than 24 samples).

I need to calculate daily average values for each parameter. The average for a given day is valid only if:

  • there are at least 18 valid values;
  • no more than 5 consecutive invalid values are present.
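For intuition, the two conditions can be sketched in plain Python (hypothetical helper names; `flags` is one day's list of valid bits in time order):

```python
def max_invalid_run(flags):
    """Length of the longest run of consecutive 0s (invalid samples)."""
    longest = run = 0
    for v in flags:
        run = 0 if v == 1 else run + 1
        longest = max(longest, run)
    return longest

def day_is_valid(flags, min_valid=18, max_run=5):
    """A day's average counts only if both conditions hold."""
    return sum(flags) >= min_valid and max_invalid_run(flags) <= max_run
```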

Condition 1 is quite easy to achieve for a given @parameter:

  SELECT CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, sampledate))) AS avgdate,
         AVG(value) AS avg,
         parameterID,
         isValid = CASE WHEN COUNT(value) > 17 THEN 1 ELSE 0 END
  FROM samples
  WHERE parameterID = @parameter
  GROUP BY parameterID, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, sampledate))), valid
  HAVING valid = 1
  ORDER BY avgdate

How can I add condition 2, which boils down to counting consecutive 0s within one day, ideally with good performance?

We have millions of samples, and cursors are slow.

6 answers

And here comes the recursive CTE solution, and it is parameterizable:

 WITH seq_samples AS (
     SELECT sampledate, parameterID, value, valid,
            avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime),
            rownum  = ROW_NUMBER() OVER (
                          PARTITION BY parameterID,
                                       CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
                          ORDER BY sampledate)
     FROM samples
 ),
 rec_samples AS (
     SELECT sampledate, parameterID, value, valid, avgdate, rownum,
            inv_seq_num = 1 - valid
     FROM seq_samples
     WHERE rownum = 1

     UNION ALL

     SELECT ss.sampledate, ss.parameterID, ss.value, ss.valid, ss.avgdate, ss.rownum,
            inv_seq_num = CASE ss.valid WHEN 1 THEN 0 ELSE rs.inv_seq_num + 1 END
     FROM seq_samples ss
     INNER JOIN rec_samples rs
             ON ss.avgdate     = rs.avgdate
            AND ss.parameterID = rs.parameterID
            AND ss.rownum      = rs.rownum + 1
 )
 SELECT avgdate, parameterID, avgvalue = AVG(value)
 FROM rec_samples
 GROUP BY avgdate, parameterID
 HAVING SUM(CAST(valid AS int)) >= @minimal_valid_count
    AND MAX(inv_seq_num) <= @critical_invalid_count

Your idea is basically implemented here. An extra numbering is used that applies only to invalid rows and resets on day transitions and on valid values. At the end, MAX is applied to that numbering column to see whether the longest run exceeded @critical_invalid_count . The other condition is checked simply by summing the valid flags.

So there you are.
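To sanity-check the logic at small scale, here is a port of this query to SQLite (3.25+, for window functions) through Python's sqlite3 module; `date()` stands in for the FLOOR/CAST day truncation, and the data is made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE samples (
    sampledate TEXT, parameterID INTEGER, value REAL, valid INTEGER)""")

# Day 1: 6 consecutive invalid samples (hours 15-20) -> day rejected.
# Day 2: only 2 consecutive invalid samples -> day accepted.
for hour in range(24):
    conn.execute("INSERT INTO samples VALUES (?, 25, 10.0, ?)",
                 ("2010-01-26 %02d:00" % hour, 0 if 15 <= hour <= 20 else 1))
    conn.execute("INSERT INTO samples VALUES (?, 25, 10.0, ?)",
                 ("2010-01-27 %02d:00" % hour, 0 if hour in (3, 4) else 1))

rows = conn.execute("""
    WITH RECURSIVE seq_samples AS (
        SELECT sampledate, parameterID, value, valid,
               date(sampledate) AS avgdate,
               ROW_NUMBER() OVER (PARTITION BY parameterID, date(sampledate)
                                  ORDER BY sampledate) AS rownum
        FROM samples
    ),
    rec_samples AS (
        SELECT sampledate, parameterID, value, valid, avgdate, rownum,
               1 - valid AS inv_seq_num
        FROM seq_samples
        WHERE rownum = 1
        UNION ALL
        SELECT ss.sampledate, ss.parameterID, ss.value, ss.valid,
               ss.avgdate, ss.rownum,
               CASE ss.valid WHEN 1 THEN 0 ELSE rs.inv_seq_num + 1 END
        FROM seq_samples ss
        JOIN rec_samples rs
          ON ss.avgdate = rs.avgdate
         AND ss.parameterID = rs.parameterID
         AND ss.rownum = rs.rownum + 1
    )
    SELECT avgdate, parameterID, AVG(value) AS avgvalue
    FROM rec_samples
    GROUP BY avgdate, parameterID
    HAVING SUM(valid) >= ? AND MAX(inv_seq_num) <= ?
""", (18, 5)).fetchall()

print(rows)
```

Only the day without a long invalid run survives the HAVING clause; note that day 1 still has 18 valid samples, so it is condition 2 alone that rejects it.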


EDIT: a reworked seq_samples CTE (it applies to your customized version of the original query).

 seq_samples AS (
     SELECT *,
            rownum = ROW_NUMBER() OVER (
                         PARTITION BY parameterID, avgdate
                         ORDER BY sampledate)
     FROM (
         SELECT sampledate, parameterID, value, valid,
                avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
         FROM samples
     ) s
 ),

SSMS showed me a significant, almost unbelievable performance difference between my original query and the modified one (judging only by the estimated execution plan numbers). I do not know what adaptations you had to make to my initial solution, but I hope the improvement I saw is not entirely lost because of them.


A few thoughts and comments ...

There are many ways to turn a date-time into a date-only value. I use DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0). But for the following code, I suggest just pretending there is a justDate field, to keep things shorter :)
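For intuition, that truncation counts whole days since SQL Server's date 0 (1900-01-01) and adds them back to date 0; in Python terms (a hypothetical helper, not part of the schema):

```python
from datetime import datetime, timedelta

SQL_DAY_ZERO = datetime(1900, 1, 1)  # SQL Server's date 0

def just_date(dt):
    """Python equivalent of DATEADD(DAY, DATEDIFF(DAY, 0, dt), 0)."""
    return SQL_DAY_ZERO + timedelta(days=(dt - SQL_DAY_ZERO).days)
```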


Sequence is important, and the table does not have a sequence identifier. ROW_NUMBER() can give you this ...

 ROW_NUMBER() OVER (PARTITION BY parameter_id, justDate ORDER BY sampledate) AS "sequence_id" 
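To see what that numbering produces, here is a tiny demo (SQLite via Python's sqlite3 standing in for SQL Server; justDate is the pretend column from above, and the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sampledate TEXT, parameter_id INTEGER)")
conn.executemany("INSERT INTO samples VALUES (?, 1)",
                 [("2011-02-02 10:00",), ("2011-02-02 11:00",),
                  ("2011-02-03 10:00",)])  # day changes -> numbering restarts

rows = conn.execute("""
    SELECT date(sampledate) AS justDate,
           ROW_NUMBER() OVER (PARTITION BY parameter_id, date(sampledate)
                              ORDER BY sampledate) AS sequence_id
    FROM samples
    ORDER BY sampledate
""").fetchall()
print(rows)  # sequence_id restarts at 1 on the new day
```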


There seem to be several ways to do this.

For each sample, join on the four samples that follow it. I don't like it, but it is probably the easiest option ...

 WITH sequenced_samples AS (
     SELECT parameterID AS "parameter_id",
            sampledate AS "sample_date_time",
            DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
            ROW_NUMBER() OVER (
                PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
                ORDER BY sampledate) AS "sequence_id",
            CASE WHEN valid = 1 THEN value ELSE NULL END AS "value",
            -(valid - 1) AS "invalid" -- turns 1 into 0, and 0 into 1
     FROM samples
 )
 SELECT "sample_1".parameter_id,
        "sample_1".sample_date,
        AVG("sample_1".value) AS average_value
 FROM sequenced_samples "sample_1"
 LEFT JOIN sequenced_samples "sample_2"
        ON "sample_2".parameter_id = "sample_1".parameter_id
       AND "sample_2".sample_date  = "sample_1".sample_date
       AND "sample_2".sequence_id  = "sample_1".sequence_id + 1
 LEFT JOIN sequenced_samples "sample_3"
        ON "sample_3".parameter_id = "sample_1".parameter_id
       AND "sample_3".sample_date  = "sample_1".sample_date
       AND "sample_3".sequence_id  = "sample_1".sequence_id + 2
 LEFT JOIN sequenced_samples "sample_4"
        ON "sample_4".parameter_id = "sample_1".parameter_id
       AND "sample_4".sample_date  = "sample_1".sample_date
       AND "sample_4".sequence_id  = "sample_1".sequence_id + 3
 LEFT JOIN sequenced_samples "sample_5"
        ON "sample_5".parameter_id = "sample_1".parameter_id
       AND "sample_5".sample_date  = "sample_1".sample_date
       AND "sample_5".sequence_id  = "sample_1".sequence_id + 4
 GROUP BY "sample_1".parameter_id, "sample_1".sample_date
 HAVING 5 > MAX("sample_1".invalid + "sample_2".invalid + "sample_3".invalid
              + "sample_4".invalid + "sample_5".invalid)
    AND 17 < (COUNT(*) - SUM("sample_1".invalid))


The following is a bit more intelligent (but only a little); however, I am nowhere near a machine with access to MS SQL Server, so I cannot say whether it is more efficient.

Instead of 4 joins, join once but compare 5 consecutive samples, using two levels of grouping.

 WITH sequenced_samples AS (
     SELECT parameterID AS "parameter_id",
            sampledate AS "sample_date_time",
            DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
            ROW_NUMBER() OVER (
                PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
                ORDER BY sampledate) AS "sequence_id",
            CASE WHEN valid = 1 THEN value ELSE NULL END AS "value",
            -(valid - 1) AS "invalid" -- turns 1 into 0, and 0 into 1
     FROM samples
 ),
 checked_samples AS (
     SELECT "sample".parameter_id,
            "sample".sample_date,
            "sample".value,
            "sample".invalid,
            SUM("next_5_samples".invalid) AS "sequence_invalidity"
     FROM sequenced_samples "sample"
     INNER JOIN sequenced_samples "next_5_samples"
             ON "next_5_samples".parameter_id = "sample".parameter_id
            AND "next_5_samples".sample_date  = "sample".sample_date
            AND "next_5_samples".sequence_id >= "sample".sequence_id + 1
            AND "next_5_samples".sequence_id <= "sample".sequence_id + 4
     GROUP BY "sample".parameter_id, "sample".sample_date, "sample".invalid, "sample".value
 )
 SELECT parameter_id, sample_date, AVG(value) AS "average_value"
 FROM checked_samples
 GROUP BY parameter_id, sample_date
 HAVING 5 > MAX(sequence_invalidity)
    AND 17 < (COUNT(*) - SUM(invalid))
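The intent of the window (a day fails when five consecutive samples are all invalid) can be sketched without SQL; note this sketch includes the sample itself in its window, whereas the query above sums only the next four (pure Python, hypothetical names; 1 = invalid):

```python
def sequence_invalidity(invalid_flags, width=5):
    """For each sample i, the number of invalid samples among i..i+width-1."""
    return [sum(invalid_flags[i:i + width]) for i in range(len(invalid_flags))]

def day_rejected(invalid_flags, width=5):
    # A day fails when some run of `width` consecutive samples is all invalid.
    return max(sequence_invalidity(invalid_flags, width)) >= width
```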


The final option is to use a recursive common table expression to loop through the records one at a time. It is harder to code than a cursor, but (in my experience) much faster.


EDIT: an earlier version of the following query had a LEFT JOIN in the recursive CTE, which is not allowed; it no longer does.

 WITH sequenced_samples AS (
     SELECT parameterID AS "parameter_id",
            sampledate AS "sample_date_time",
            DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
            ROW_NUMBER() OVER (
                PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
                ORDER BY sampledate) AS "sequence_id",
            value,
            valid
     FROM samples
 ),
 recursed_totals AS (
     SELECT parameter_id,
            sample_date,
            sequence_id - 1 AS "next_sequence_id",
            CASE WHEN valid = 1 THEN value ELSE 0 END AS "cumulative_value",
            valid AS "cumulative_count",
            CASE WHEN valid = 1 THEN 0 ELSE 1 END AS "cumulative_invalid",
            CASE WHEN valid = 1 THEN 0 ELSE 1 END AS "max_cumulative_invalid"
     FROM sequenced_samples
     WHERE sequence_id = (SELECT COUNT(*)
                          FROM sequenced_samples "look_up"
                          WHERE "look_up".parameter_id = sequenced_samples.parameter_id
                            AND "look_up".sample_date  = sequenced_samples.sample_date)

     UNION ALL

     SELECT "cumulative_samples".parameter_id,
            "cumulative_samples".sample_date,
            "next_sample".sequence_id - 1,
            "cumulative_samples".cumulative_value
                + CASE WHEN "next_sample".valid = 1 THEN "next_sample".value ELSE 0 END,
            "cumulative_samples".cumulative_count + ISNULL("next_sample".valid, 0),
            CASE WHEN "next_sample".valid = 0 THEN "cumulative_samples".cumulative_invalid + 1
                 WHEN "cumulative_samples".cumulative_invalid = 5 THEN 5
                 ELSE 0
            END,
            CASE WHEN "next_sample".valid = 1 THEN "cumulative_samples".max_cumulative_invalid
                 WHEN "cumulative_samples".cumulative_invalid = "cumulative_samples".max_cumulative_invalid
                      THEN "cumulative_samples".max_cumulative_invalid + 1
                 ELSE "cumulative_samples".max_cumulative_invalid
            END
     FROM recursed_totals AS "cumulative_samples"
     INNER JOIN sequenced_samples AS "next_sample"
             ON "next_sample".parameter_id = "cumulative_samples".parameter_id
            AND "next_sample".sample_date  = "cumulative_samples".sample_date
            AND "next_sample".sequence_id  = "cumulative_samples".next_sequence_id
 )
 SELECT parameter_id,
        sample_date,
        CAST(cumulative_value AS FLOAT) / CAST(cumulative_count AS FLOAT) AS "average",
        cumulative_count AS "valid_samples",
        max_cumulative_invalid AS "max_consecutive_invalid_samples"
 FROM recursed_totals
 WHERE parameter_id = @parameter_id

This solution uses basically the same approach as Dems' answer; I think the logic in mine is a little different (or maybe it is just structured differently ...).

 WITH sortedsamples AS (
     SELECT sampledate, parameterID, value, valid,
            avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime),
            rownum  = ROW_NUMBER() OVER (
                          PARTITION BY parameterID,
                                       CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
                          ORDER BY sampledate)
     FROM samples
 )
 SELECT ss1.parameterID,
        ss1.avgdate,
        avg = AVG(ss1.value),
        isValid = CAST(CASE WHEN SUM(CAST(ss1.valid AS int)) < 18 THEN 0
                            ELSE MIN(CAST(ss1.valid | ISNULL(ss2.valid, 1) | ISNULL(ss3.valid, 1)
                                          | ISNULL(ss4.valid, 1) | ISNULL(ss5.valid, 1) AS int))
                       END AS bit)
 FROM sortedsamples ss1
 LEFT JOIN sortedsamples ss2
        ON ss1.avgdate = ss2.avgdate AND ss1.parameterID = ss2.parameterID
       AND ss1.rownum = ss2.rownum + 1
 LEFT JOIN sortedsamples ss3
        ON ss1.avgdate = ss3.avgdate AND ss1.parameterID = ss3.parameterID
       AND ss1.rownum = ss3.rownum + 2
 LEFT JOIN sortedsamples ss4
        ON ss1.avgdate = ss4.avgdate AND ss1.parameterID = ss4.parameterID
       AND ss1.rownum = ss4.rownum + 3
 LEFT JOIN sortedsamples ss5
        ON ss1.avgdate = ss5.avgdate AND ss1.parameterID = ss5.parameterID
       AND ss1.rownum = ss5.rownum + 4
 GROUP BY ss1.parameterID, ss1.avgdate

Of your millions of samples, what percentage has fewer than 5 invalid values per day? If that percentage is high enough, you are fine, because you can easily exclude those from cursor processing.

If the number of samples with 5 or more invalid values per day is still in the millions, you can expect a long wait.


Your solutions are quite interesting (and I learned a lot from them), but I wonder if they can be improved.

For example, all of the solutions (except perhaps those using a recursive CTE) are not parametric in the number (N) of consecutive invalid samples. I can imagine that in the future I might use a different N for different years or parameters.

I was wondering whether a solution could be developed starting from the ROW_NUMBER() ones: if I could find a way to reset the count at each gap in the valid column, I could simply discard a day as soon as I find a row with rownum > N and valid = 0. That would be super simple, fast and versatile.

I am trying to explain this idea better:

Let's say I could use ROW_NUMBER() or a similar function to get this:

  date                     par  value          valid  rownum
  2010-01-26 00:00:00.000  25   14.0000000000    1    1
  2010-01-26 01:00:00.000  25   15.3000001907    1    2
  2010-01-26 02:00:00.000  25   16.8999996185    1    3
  2010-01-26 03:00:00.000  25   13.6000003815    1    4
  2010-01-26 04:00:00.000  25   16.2000007629    1    5
  2010-01-26 05:00:00.000  25   12.1999998093   -1    1
  2010-01-26 06:00:00.000  25   17.2000007629   -1    2
  2010-01-26 07:00:00.000  25   16.2999992371    1    1
  2010-01-26 08:00:00.000  25   18.2999992371    1    2
  2010-01-26 09:00:00.000  25   15.0000000000    1    3
  2010-01-26 10:00:00.000  25   17.7000007629    1    4
  2010-01-26 11:00:00.000  25   16.5000000000    1    5
  2010-01-26 12:00:00.000  25   17.3999996185    1    6
  2010-01-26 13:00:00.000  25   17.7000007629    1    7
  2010-01-26 14:00:00.000  25   18.2999992371    1    8
  2010-01-26 15:00:00.000  25   15.1000003815   -1    1
  2010-01-26 16:00:00.000  25   16.5000000000   -1    2
  2010-01-26 17:00:00.000  25   10.3999996185   -1    3
  2010-01-26 18:00:00.000  25   10.8999996185   -1    4
  2010-01-26 19:00:00.000  25   10.1000003815   -1    5  <-----!!!!
  2010-01-26 20:00:00.000  25   13.6999998093    1    1
  2010-01-26 21:00:00.000  25   12.6999998093    1    2
  2010-01-26 22:00:00.000  25   15.3999996185   -1    1
  2010-01-26 23:00:00.000  25    8.6000003815   -1    2

If N = 5, the existence of the row

 2010-01-26 19:00:00.000 25 10.1000003815 -1 5 

indicates that the whole day is invalid (before even counting the total number of invalid values).
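That reset-on-gap numbering is the classic gaps-and-islands pattern. As a quick sketch of the idea (plain Python with itertools.groupby; N stays a parameter, and since the marked row has rownum 5 with N = 5, >= is used as the threshold here):

```python
from itertools import groupby

def run_rownums(valid_flags):
    """Within-run numbering that restarts at every change of `valid`."""
    out, prev, run = [], None, 0
    for v in valid_flags:
        run = run + 1 if v == prev else 1
        prev = v
        out.append(run)
    return out

def day_invalidated(valid_flags, n=5):
    """True if some invalid row reaches rownum >= n, i.e. the day
    contains a run of at least n consecutive invalid samples."""
    return any(not v and sum(1 for _ in g) >= n
               for v, g in groupby(valid_flags))
```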

What do you think of this idea?

(I don't know if this should be an edit or a separate answer)


Here is another answer, using CROSS APPLY and avoiding ROW_NUMBER().

If, however, more than 5 samples occur at exactly the same time for the same parameterID, this will not work correctly. In that case you will need ROW_NUMBER().

 SELECT parameterID AS "parameter_id",
        DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
        SUM(value) AS "total",
        SUM(CASE WHEN valid = 1 THEN value ELSE 0 END) AS "total_valid",
        COUNT(*) AS "count",
        SUM(CAST(valid AS int)) AS "count_valid",
        MAX(invalid) AS "date_invalidated"
 FROM samples
 CROSS APPLY (
     SELECT CASE WHEN SUM(CAST(valid AS int)) = 0 THEN 1 ELSE 0 END AS "invalid"
     FROM (
         SELECT TOP 5 valid
         FROM samples AS "5_samples"
         WHERE "5_samples".parameterID = "samples".parameterID
           AND "5_samples".sampledate >= "samples".sampledate
           AND "5_samples".sampledate <  DATEADD(DAY, DATEDIFF(DAY, 0, "samples".sampledate), 1)
         ORDER BY sampledate
     ) AS "data"
 ) AS "check"
 WHERE parameterID = @parameterID
 GROUP BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
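The per-sample check inside the CROSS APPLY reduces to: take up to the next five same-day samples (the sample itself included) and flag the sample when none is valid. A pure-Python sketch of that check; note it insists on a full window of five, whereas TOP 5 near the end of a day can return fewer rows and would then also flag a shorter all-invalid tail:

```python
def date_invalidated(valid_flags):
    """1 if some sample starts a window of five consecutive same-day
    samples that are all invalid (the MAX(invalid) aggregate), else 0."""
    for i in range(len(valid_flags)):
        window = valid_flags[i:i + 5]
        if len(window) == 5 and sum(window) == 0:
            return 1
    return 0
```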

Source: https://habr.com/ru/post/1338113/

