How to get median / mode / range of a column in SQL using Java?

I need to get the median, mode and range of test scores from one column of the table, but I'm not sure how to do this. When you connect to the database using java, you usually return a ResultSet from which you can make a table or something in it, but how do you get specific numbers or digits? Is there an SQL command to get the median / mode / range or should I figure it out myself and how do you get the numbers from the table to be able to calculate the mode / median / range?

Thanks.

+4
source share
4 answers

This should be done in pure SQL. The range is easy with MIN() and MAX() functions. For the remainder, I am not a specialist in statistics, so I can’t tell from my head, but I found a link that may be useful: http://www.freeopenbook.com/mysqlcookbook/mysqlckbk-chp-13-sect-2.html

13.2 Calculation of descriptive statistics

13.2.1 Problem

You want to characterize a data set by calculating general descriptive or summary statistics.

13.2.2 Solution

Many common descriptive statistics, such as mean and standard deviation, can be obtained by applying aggregate functions to your data. Others, such as median or mode, can be calculated based on query counting.

13.2.3 Discussion

Suppose you have a testscore table containing observations representing the subject’s identifier, age, gender, and test score:

 mysql> SELECT subject, age, sex, score FROM testscore ORDER BY subject; +---------+-----+-----+-------+ | subject | age | sex | score | +---------+-----+-----+-------+ | 1 | 5 | M | 5 | | 2 | 5 | M | 4 | | 3 | 5 | F | 6 | | 4 | 5 | F | 7 | | 5 | 6 | M | 8 | | 6 | 6 | M | 9 | | 7 | 6 | F | 4 | | 8 | 6 | F | 6 | | 9 | 7 | M | 8 | | 10 | 7 | M | 6 | | 11 | 7 | F | 9 | | 12 | 7 | F | 7 | | 13 | 8 | M | 9 | | 14 | 8 | M | 6 | | 15 | 8 | F | 7 | | 16 | 8 | F | 10 | | 17 | 9 | M | 9 | | 18 | 9 | M | 7 | | 19 | 9 | F | 10 | | 20 | 9 | F | 9 | +---------+-----+-----+-------+ 

A good first step in analyzing a set of observations is to create some descriptive statistics that summarizes their general characteristics as a whole. General statistics of this type include:

  • The number of observations, their sum and their range (minimum and maximum)
  • Measures of central tendency, such as mean, median and mode
  • Measures of variation, such as standard deviation or deviation

In addition to the median and mode, all this can be easily calculated by calling aggregate functions:

 mysql> SELECT COUNT(score) AS n, -> SUM(score) AS sum, -> MIN(score) AS minimum, -> MAX(score) AS maximum, -> AVG(score) AS mean, -> STD(score) AS 'std. dev.' -> FROM testscore; +----+------+---------+---------+--------+-----------+ | n | sum | minimum | maximum | mean | std. dev. | +----+------+---------+---------+--------+-----------+ | 20 | 146 | 4 | 10 | 7.3000 | 1.7916 | +----+------+---------+---------+--------+-----------+ 

The aggregate functions used in this query only count non- NULL cases. If you use NULL to represent missing values, you can perform an additional characteristic to evaluate the extent to which missing values. (See Recipe 13.5 .)

The deviation is not displayed in the query, and MySQL does not have a function to compute it. However, the variance is just the square of the standard deviation, so it is easily calculated as follows:

 STD(score) * STD(score) 

STDDEV( ) is synonymous with STD( ) .

The standard deviation can be used to determine emission values ​​that are uncharacteristic far from the average. For example, to select values ​​that contain more than three standard deviations from the mean, you can do something like this:

 SELECT @mean := AVG(score), @std := STD(score) FROM testscore; SELECT score FROM testscore WHERE ABS( score-@mean ) > @std * 3; 

For a set of n values, the standard deviation created by STD( ) is based on n degrees of freedom. This is equivalent to calculating the standard deviation "manually" as follows ( @ss is the sum of the squares):

 mysql> SELECT -> @n := COUNT(score), -> @sum := SUM(score), -> @ss := SUM(score*score) -> FROM testscore; mysql> SELECT @var := ((@n * @ss) - (@sum * @sum)) / (@n * @n); mysql> SELECT SQRT(@var); +------------+ | SQRT(@var) | +------------+ | 1.791647 | +------------+ 

To calculate the standard deviation based on n-1 degrees of freedom, do this as follows:

 mysql> SELECT -> @n := COUNT(score), -> @sum := SUM(score), -> @ss := SUM(score*score) -> FROM testscore; mysql> SELECT @var := ((@n * @ss) - (@sum * @sum)) / (@n * (@n - 1)); mysql> SELECT SQRT(@var); +------------+ | SQRT(@var) | +------------+ | 1.838191 | +------------+ 

Or, more simply, like this:

 mysql> SELECT @n := COUNT(score) FROM testscore; mysql> SELECT STD(score)*SQRT(@n/(@n-1)) FROM testscore; +----------------------------+ | STD(score)*SQRT(@n/(@n-1)) | +----------------------------+ | 1.838191 | +----------------------------+ 

MySQL does not have a built-in function to calculate the mode or median of a set of values, but you can calculate them yourself. Mode is the value that is most often found. To determine what it is, count each value and see which one is most common:

 mysql> SELECT score, COUNT(score) AS count -> FROM testscore GROUP BY score ORDER BY count DESC; +-------+-------+ | score | count | +-------+-------+ | 9 | 5 | | 6 | 4 | | 7 | 4 | | 4 | 2 | | 8 | 2 | | 10 | 2 | | 5 | 1 | +-------+-------+ 

In this case, 9 is the value of the modal score.

The median of the set of ordered values ​​can be calculated as follows:

  • If the number of values ​​is odd, the median is the median.
  • If the number of values ​​is even, the median is the average of two average values.

Note that the definition of median given here is not completely general; he does not decide what to do if duplication of averages in the data set.

Based on this definition, use the following procedure to determine the median of the set of observations stored in the database:

  • Run the query to count the number of observations. From the count, you can determine whether the median calculation requires one or two values, and also that their indices are in an ordered set of observations.
  • Run a query containing an ORDER BY to sort cases and a LIMIT to retrieve the average value or values.
  • Take the average of the selected value or values.

For example, if table t contains a column with 37 values ​​(an odd number), you need to select one value using this query:

 SELECT score FROM t ORDER BY 1 LIMIT 18,1 

If the column contains 38 values ​​(an even number), the query will look like this:

 SELECT score FROM t ORDER BY 1 LIMIT 18,2 

You can then select the value or values ​​returned by the query and calculate the median from their average value.

The following Perl function implements median calculation. It takes a database descriptor and table and column names that contain a set of cases, then generates a query that retrieves the corresponding values ​​and returns their average value:

 sub median { my ($dbh, $tbl_name, $col_name) = @_; my ($count, $limit); $count = $dbh->selectrow_array ("SELECT COUNT($col_name) FROM $tbl_name"); return undef unless $count > 0; if ($count % 2 == 1) # odd number of values; select middle value { $limit = sprintf ("LIMIT %d,1", ($count-1)/2); } else # even number of values; select middle two values { $limit = sprintf ("LIMIT %d,2", $count/2 - 1); } my $sth = $dbh->prepare ( "SELECT $col_name FROM $tbl_name ORDER BY 1 $limit"); $sth->execute ( ); my ($n, $sum) = (0, 0); while (my $ref = $sth->fetchrow_arrayref ( )) { ++$n; $sum += $ref->[0]; } return ($sum / $n); } 

The previous method works for a set of values ​​stored in a database. If you have already selected an ordered set of values ​​in the @val array, you can instead calculate the median as follows:

 if (@val == 0) # if array is empty, median is undefined { $median = undef; } elsif (@val % 2 == 1) # if array size is odd, median is middle number { $median = $val[(@val-1)/2]; } else # array size is even; median is average { # of two middle numbers $median = ($val[@val/2 - 1] + $val[@val/2]) / 2; } 

The code works for arrays that have a starting index of 0; for languages ​​that use level 1 array indexes, adjust the algorithm accordingly.

+2
source

I calculated the Trimmed / Truncated average in SQL Server as follows:

 with tempResultSet as ( select Score, ntile(20) over(order by Score) as n, count(*) over() as x from @TestScores ) select avg(cast(Score as float)) from tempResultSet where (x >= 20 and n between 2 and 19) or (x < 20 and n between 2 and x - 1) 

this removes the top and bottom 10% of your set and gets the average “Score”

0
source

You do not tell us which database you are using, so I assume that you want the solution to work with standard SQL DBMSs.

You can use SQL to calculate the range (using the MIN / MAX aggregate functions) and the average (using the AVG) in a simple aggregate.

Standard SQL does not support the median, so you need to get SQL to sort the output and go through the elements until you find the middle one.

The mode is not supported, but note that you can use COUNT () and GROUP BY to get SQL to create a list of frequencies (how many times each data point appears), the record with the highest counter is the mode.

0
source

I am not very familiar with SQL essentials from basic necessities ... however I searched and found http://www.sqlteam.com/article/computing-the-trimmed-mean-in-sql for the average (actually uses your exact question in the example) http://scorreiait.wordpress.com/2008/10/28/how-to-compute-a-median-in-sql/ for meidan, http://www.mail-archive.com/ cf-talk@houseoffusion.com /msg44660.html for mode.

I don’t know if they are database specific, but it looks like they will make the work as efficient as possible. Good luck!

-1
source

Source: https://habr.com/ru/post/1306595/


All Articles