13.2 Calculation of descriptive statistics
13.2.1 Problem
You want to characterize a data set by calculating general descriptive or summary statistics.
13.2.2 Solution
Many common descriptive statistics, such as mean and standard deviation, can be obtained by applying aggregate functions to your data. Others, such as median or mode, can be calculated based on query counting.
13.2.3 Discussion
Suppose you have a testscore table containing observations representing the subject’s identifier, age, gender, and test score:
mysql> SELECT subject, age, sex, score FROM testscore ORDER BY subject; +---------+-----+-----+-------+ | subject | age | sex | score | +---------+-----+-----+-------+ | 1 | 5 | M | 5 | | 2 | 5 | M | 4 | | 3 | 5 | F | 6 | | 4 | 5 | F | 7 | | 5 | 6 | M | 8 | | 6 | 6 | M | 9 | | 7 | 6 | F | 4 | | 8 | 6 | F | 6 | | 9 | 7 | M | 8 | | 10 | 7 | M | 6 | | 11 | 7 | F | 9 | | 12 | 7 | F | 7 | | 13 | 8 | M | 9 | | 14 | 8 | M | 6 | | 15 | 8 | F | 7 | | 16 | 8 | F | 10 | | 17 | 9 | M | 9 | | 18 | 9 | M | 7 | | 19 | 9 | F | 10 | | 20 | 9 | F | 9 | +---------+-----+-----+-------+
A good first step in analyzing a set of observations is to create some descriptive statistics that summarizes their general characteristics as a whole. General statistics of this type include:
- The number of observations, their sum and their range (minimum and maximum)
- Measures of central tendency, such as mean, median and mode
- Measures of variation, such as standard deviation or deviation
In addition to the median and mode, all this can be easily calculated by calling aggregate functions:
mysql> SELECT COUNT(score) AS n, -> SUM(score) AS sum, -> MIN(score) AS minimum, -> MAX(score) AS maximum, -> AVG(score) AS mean, -> STD(score) AS 'std. dev.' -> FROM testscore; +----+------+---------+---------+--------+-----------+ | n | sum | minimum | maximum | mean | std. dev. | +----+------+---------+---------+--------+-----------+ | 20 | 146 | 4 | 10 | 7.3000 | 1.7916 | +----+------+---------+---------+--------+-----------+
The aggregate functions used in this query only count non- NULL cases. If you use NULL to represent missing values, you can perform an additional characteristic to evaluate the extent to which missing values. (See Recipe 13.5 .)
The deviation is not displayed in the query, and MySQL does not have a function to compute it. However, the variance is just the square of the standard deviation, so it is easily calculated as follows:
STD(score) * STD(score)
STDDEV( ) is synonymous with STD( ) .
The standard deviation can be used to determine emission values that are uncharacteristic far from the average. For example, to select values that contain more than three standard deviations from the mean, you can do something like this:
SELECT @mean := AVG(score), @std := STD(score) FROM testscore; SELECT score FROM testscore WHERE ABS( score-@mean ) > @std * 3;
For a set of n values, the standard deviation created by STD( ) is based on n degrees of freedom. This is equivalent to calculating the standard deviation "manually" as follows ( @ss is the sum of the squares):
mysql> SELECT -> @n := COUNT(score), -> @sum := SUM(score), -> @ss := SUM(score*score) -> FROM testscore; mysql> SELECT @var := ((@n * @ss) - (@sum * @sum)) / (@n * @n); mysql> SELECT SQRT(@var); +------------+ | SQRT(@var) | +------------+ | 1.791647 | +------------+
To calculate the standard deviation based on n-1 degrees of freedom, do this as follows:
mysql> SELECT -> @n := COUNT(score), -> @sum := SUM(score), -> @ss := SUM(score*score) -> FROM testscore; mysql> SELECT @var := ((@n * @ss) - (@sum * @sum)) / (@n * (@n - 1)); mysql> SELECT SQRT(@var); +------------+ | SQRT(@var) | +------------+ | 1.838191 | +------------+
Or, more simply, like this:
mysql> SELECT @n := COUNT(score) FROM testscore; mysql> SELECT STD(score)*SQRT(@n/(@n-1)) FROM testscore; +----------------------------+ | STD(score)*SQRT(@n/(@n-1)) | +----------------------------+ | 1.838191 | +----------------------------+
MySQL does not have a built-in function to calculate the mode or median of a set of values, but you can calculate them yourself. Mode is the value that is most often found. To determine what it is, count each value and see which one is most common:
mysql> SELECT score, COUNT(score) AS count -> FROM testscore GROUP BY score ORDER BY count DESC; +-------+-------+ | score | count | +-------+-------+ | 9 | 5 | | 6 | 4 | | 7 | 4 | | 4 | 2 | | 8 | 2 | | 10 | 2 | | 5 | 1 | +-------+-------+
In this case, 9 is the value of the modal score.
The median of the set of ordered values can be calculated as follows:
- If the number of values is odd, the median is the median.
- If the number of values is even, the median is the average of two average values.
Note that the definition of median given here is not completely general; he does not decide what to do if duplication of averages in the data set.
Based on this definition, use the following procedure to determine the median of the set of observations stored in the database:
- Run the query to count the number of observations. From the count, you can determine whether the median calculation requires one or two values, and also that their indices are in an ordered set of observations.
- Run a query containing an
ORDER BY to sort cases and a LIMIT to retrieve the average value or values. - Take the average of the selected value or values.
For example, if table t contains a column with 37 values (an odd number), you need to select one value using this query:
SELECT score FROM t ORDER BY 1 LIMIT 18,1
If the column contains 38 values (an even number), the query will look like this:
SELECT score FROM t ORDER BY 1 LIMIT 18,2
You can then select the value or values returned by the query and calculate the median from their average value.
The following Perl function implements median calculation. It takes a database descriptor and table and column names that contain a set of cases, then generates a query that retrieves the corresponding values and returns their average value:
sub median { my ($dbh, $tbl_name, $col_name) = @_; my ($count, $limit); $count = $dbh->selectrow_array ("SELECT COUNT($col_name) FROM $tbl_name"); return undef unless $count > 0; if ($count % 2 == 1)
The previous method works for a set of values stored in a database. If you have already selected an ordered set of values in the @val array, you can instead calculate the median as follows:
if (@val == 0)
The code works for arrays that have a starting index of 0; for languages that use level 1 array indexes, adjust the algorithm accordingly.