Linux command line utility for printing number statistics

I often find myself with a file that has one number per line. I end up importing it into Excel to view things like the median, standard deviation, and so on.

Is there a command line utility in Linux to do the same? I usually need to find the mean, median, minimum, maximum and standard deviation.

+63
command-line linux statistics
Mar 20 '12 at 15:31
15 answers

This is a breeze with R. For a file that looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
 10

Use this:

 R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])" 

To get this:

        V1
  Min.   : 1.00
  1st Qu.: 3.25
  Median : 5.50
  Mean   : 5.50
  3rd Qu.: 7.75
  Max.   :10.00
 [1] 3.02765
  • The -q flag suppresses R's startup licensing and help output
  • The -e flag tells R that you will pass the expression from the terminal
  • x is a data.frame, basically a table. It's a structure that holds several vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This affects which functions you can use.
  • Some functions, such as summary(), work naturally on data.frames. If x had several fields, summary() would provide the descriptive statistics above for each.
  • But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SD of every column.
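If reading a single column into a data.frame feels awkward, R's scan() reads the numbers straight into a vector, so summary() and sd() apply directly. A minimal sketch, assuming the same nums.txt:

 R -q -e "x <- scan('nums.txt'); summary(x); sd(x)"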
+52
Mar 22 '12 at 16:25

Using "st" ( https://github.com/nferraz/st )

 $ st numbers.txt
 N    min    max    sum    mean    stddev
 10   1      10     55     5.5     3.02765

Or:

 $ st numbers.txt --transpose
 N       10
 min     1
 max     10
 sum     55
 mean    5.5
 stddev  3.02765

(DISCLAIMER: I wrote this tool :))

+38
04 Sep '13 at 15:23

For the mean and standard deviation you can use awk. This will usually be faster than the R solutions. For example, the following will print the mean:

 awk '{a+=$1} END{print a/NR}' myfile 

(NR is the awk variable for the number of records, and $1 means the first (whitespace-separated) field of the line. $0 would be the whole line, which also works here but is in principle less safe, although for this calculation awk would probably just take the first field anyway. END means the following commands are executed after the whole file has been processed; you could also initialize a to 0 in a BEGIN{a=0} block, as shown below.)
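For instance, the explicit-initialization variant just mentioned looks like this (functionally identical, since awk initializes variables to zero anyway):

 awk 'BEGIN{a=0} {a+=$1} END{print a/NR}' myfile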

Here is a simple awk script that provides more detailed statistics (it takes a CSV file as input; otherwise change FS):

 #!/usr/bin/awk -f
 BEGIN {
     FS=",";
 }
 {
     a += $1;
     b[++i] = $1;
 }
 END {
     m = a/NR; # mean
     for (i in b) {
         d += (b[i]-m)^2;
         e += (b[i]-m)^3;
         f += (b[i]-m)^4;
     }
     va = d/NR;            # variance
     sd = sqrt(va);        # standard deviation
     sk = (e/NR)/sd^3;     # skewness
     ku = (f/NR)/sd^4-3;   # standardized kurtosis
     print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
     print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
 }

Adding min/max to this script is easy, but piping through sort and head/tail is just as simple:

 sort -n myfile | head -n1
 sort -n myfile | tail -n1
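If you would rather stay within a single awk pass, here is a minimal sketch of tracking min and max alongside the mean (the variable names are my own):

 awk 'NR==1{min=max=$1} $1<min{min=$1} $1>max{max=$1} {a+=$1} END{print "min=" min, "max=" max, "mean=" a/NR}' myfile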
+33
Mar 20 '12

Yet another tool that can be used to calculate statistics and view them in ASCII mode is ministat. It is a tool from FreeBSD, but it is also packaged for popular Linux distributions such as Debian/Ubuntu.

Usage example:

 $ cat test.log
 Handled 1000000 packets.Time elapsed: 7.575278
 Handled 1000000 packets.Time elapsed: 7.569267
 Handled 1000000 packets.Time elapsed: 7.540344
 Handled 1000000 packets.Time elapsed: 7.547680
 Handled 1000000 packets.Time elapsed: 7.692373
 Handled 1000000 packets.Time elapsed: 7.390200
 Handled 1000000 packets.Time elapsed: 7.391308
 Handled 1000000 packets.Time elapsed: 7.388075
 $ cat test.log | awk '{print $5}' | ministat -w 74
 x <stdin>
 +--------------------------------------------------------------------------+
 |x                                                                         |
 |xx                                       xx x x                          x|
 |   |__________________________A_______M_________________|                 |
 +--------------------------------------------------------------------------+
     N           Min           Max        Median           Avg        Stddev
 x   8      7.388075      7.692373       7.54768     7.5118156    0.11126122
+19
Aug 18 '15 at 15:09

Yes, it's called perl, and here is a short one-liner:

 perl -e 'use List::Util qw(max min sum); @a=(); while(<>){ $sqsum+=$_*$_; push(@a,$_) } $n=@a; $s=sum(@a); $avg=$s/$n; $m=max(@a); $mm=min(@a); $std=sqrt($sqsum/$n-($s/$n)*($s/$n)); @srtd=sort {$a<=>$b} @a; $mid=int(@a/2); if(@a%2){ $med=$srtd[$mid] }else{ $med=($srtd[$mid-1]+$srtd[$mid])/2 } print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'

Example

 $ cat tt
 1
 3
 4
 5
 6.5
 7.
 2
 3
 4

And the command:

 cat tt | perl -e 'use List::Util qw(max min sum); @a=(); while(<>){ $sqsum+=$_*$_; push(@a,$_) } $n=@a; $s=sum(@a); $avg=$s/$n; $m=max(@a); $mm=min(@a); $std=sqrt($sqsum/$n-($s/$n)*($s/$n)); @srtd=sort {$a<=>$b} @a; $mid=int(@a/2); if(@a%2){ $med=$srtd[$mid] }else{ $med=($srtd[$mid-1]+$srtd[$mid])/2 } print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
 records:9
 sum:35.5
 avg:3.94444444444444
 std:1.86256162380447
 med:4
 max:7.
 min:1
+15
Mar 20 '12 at 15:43

Mean:

 awk '{sum += $1} END {print "mean = " sum/NR}' filename 

Median:

 gawk -v max=128 '
     function median(c,v,    j) {
         asort(v,j)
         if (c % 2) return j[(c+1)/2]
         else return (j[c/2+1]+j[c/2])/2.0
     }
     {
         count++
         values[count]=$1
         if (count >= max) {
             print median(count,values)
             count=0
             delete values    # clear the chunk so stale entries do not skew the next median
         }
     }
     END {
         print "median = " median(count,values)
     }
 ' filename
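If you don't need the fixed-size chunking (-v max=128 emits a median for every block of 128 samples), a whole-file version of the same idea is shorter. A sketch using gawk's asort():

 gawk '{v[NR]=$1} END {n=asort(v); if (n % 2) print "median = " v[(n+1)/2]; else print "median = " (v[n/2]+v[n/2+1])/2}' filename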

Mode:

 awk '{c[$1]++} END {for (i in c) {if (c[i] > maxcount) {maxcount = c[i]; mode = i}} print "mode = " mode}' filename

This mode calculation is crude, but you can see how it works ...

Standard deviation:

 awk '{sum+=$1; sumsq+=$1*$1} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2)}' filename 
+11
Mar 20 '12

data_hacks is a Python command-line utility for basic statistics.

The first example from this page gives the desired results:

 $ cat /tmp/data | histogram.py
 # NumSamples = 29; Max = 10.00; Min = 1.00
 # Mean = 4.379310; Variance = 5.131986; SD = 2.265389
 # each * represents a count of 1
  1.0000 -  1.9000 [ 1]: *
  1.9000 -  2.8000 [ 5]: *****
  2.8000 -  3.7000 [ 8]: ********
  3.7000 -  4.6000 [ 3]: ***
  4.6000 -  5.5000 [ 4]: ****
  5.5000 -  6.4000 [ 2]: **
  6.4000 -  7.3000 [ 3]: ***
  7.3000 -  8.2000 [ 1]: *
  8.2000 -  9.1000 [ 1]: *
  9.1000 - 10.0000 [ 1]: *
+8
May 6 '14 at 15:52

Just in case, there is also datastat, a simple program for Linux that computes simple statistics from the command line. For example,

 cat file.dat | datastat 

displays the mean of each column of file.dat across all its rows. If you need the standard deviation, min or max, you can add the --dev, --min and --max options respectively.
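For example, a sketch combining the options just named (assuming they can be given in a single invocation):

 cat file.dat | datastat --dev --min --max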

datastat has the ability to aggregate rows based on the value of one or more key columns. For example,

 cat file.dat | datastat -k 1 

will produce, for each distinct value found in the first column (the "key"), the mean of the other columns' values aggregated across all rows with the same key value. You can use more columns as key fields (e.g., -k 1-3, -k 2,4, etc.).

It is written in C++, runs fast with a small memory footprint, and combines nicely with other tools such as cut, grep, sed, sort, awk, etc.

+7
Mar 26 '13 at 23:32

You could also use clistats. It is a highly configurable command-line tool for computing statistics over a stream of delimited input numbers.

Input/output options

  • Input data can come from a file, standard input, or a pipe
  • Output can be written to a file, standard output, or a pipe
  • Output uses headers that start with "#" so it can be piped straight into gnuplot

Analysis options

  • A signal, end-of-file, or a blank line can be detected to stop processing
  • Comment and delimiter characters can be specified
  • Columns can be filtered out of processing
  • Rows can be filtered out of processing based on a numeric constraint
  • Rows can be filtered out of processing based on a string constraint
  • Initial header rows can be skipped
  • A fixed number of rows can be processed
  • Duplicate delimiters can be ignored
  • Rows can be reshaped into columns
  • It can strictly enforce that only rows of the same size are processed
  • A row containing column titles can be used to title the output statistics

Statistics options

  • Summary statistics (count, minimum, mean, maximum, standard deviation)
  • Covariance
  • Correlation
  • Least-squares offset
  • Least-squares slope
  • Histogram
  • Raw data after filtering
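The exact flags are in the project's documentation; since input can come from standard input or a pipe (per the input/output options above), a hypothetical minimal invocation over a one-number-per-line file would be:

 cat numbers.txt | clistats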

NOTE: I am the author.

+7
Jul 13 '14 at 5:14

Another tool: https://www.gnu.org/software/datamash/

 # Example: calculate the sum and mean of values 1 to 10:
 $ seq 10 | datamash sum 1 mean 1
 55    5.5
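datamash also has operations covering the rest of what the question asks: min, max, median, and sample standard deviation (sstdev). A sketch, with output formatting approximate:

 $ seq 10 | datamash min 1 max 1 median 1 sstdev 1
 1    10    5.5    3.0276503541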

It may be more commonly available (at least it was the first such tool I found pre-packaged for nix).

+7
Aug 17 '16 at 13:34

I found myself wanting to do this in a shell pipeline, and getting all the right arguments for R took some time. Here is what I came up with:

 seq 10 | R --slave -e 'x <- scan(file="stdin",quiet=TRUE); summary(x)'
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.00    3.25    5.50    5.50    7.75   10.00

The --slave option "Make[s] R run as quietly as possible ... It implies --quiet and --no-save". The -e option tells R to treat the following string as R code. The first statement reads from standard input and stores what is read in a variable named "x". The quiet=TRUE parameter of the scan function suppresses the message saying how many items were read. The second statement applies the summary function to x, which produces the output.
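Since the question also asks for the standard deviation, the same pipeline extends naturally; a sketch:

 seq 10 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x); cat("sd:", sd(x), "\n")'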

+6
Feb 07 '17 at 1:49
 #!/usr/bin/perl
 #
 # stdev - figure N, min, max, median, mode, mean, & std deviation
 #
 # pull out all the real numbers in the input
 # stream and run standard calculations on them.
 # they may be intermixed with other text, need
 # not be on the same or different lines, and
 # can be in scientific notation (avogadro=6.02e23).
 # they also admit a leading + or -.
 #
 # Tom Christiansen
 # tchrist@perl.com

 use strict;
 use warnings;

 use List::Util qw< min max >;

 my $number_rx = qr{
   # leading sign, positive or negative
     (?: [+-] ? )
   # mantissa
     (?= [0123456789.] )
     (?:
         # "N" or "N." or "N.N"
         (?:
             (?: [0123456789] +     )
             (?:
                 (?: [.] )
                 (?: [0123456789] * )
             ) ?
       |
         # ".N", no leading digits
             (?:
                 (?: [.] )
                 (?: [0123456789] + )
             )
         )
     )
   # abscissa
     (?:
         (?: [Ee] )
         (?:
             (?: [+-] ? )
             (?: [0123456789] + )
         )
         |
     )
 }x;

 my $n      = 0;
 my $sum    = 0;
 my @values = ();
 my %seen   = ();

 while (<>) {
     while (/($number_rx)/g) {
         $n++;
         my $num = 0 + $1;    # 0+ is so numbers in alternate form count as same
         $sum += $num;
         push @values, $num;
         $seen{$num}++;
     }
 }

 die "no values" if $n == 0;

 my $mean = $sum / $n;

 my $sqsum = 0;
 for (@values) {
     $sqsum += ( $_ ** 2 );
 }
 $sqsum /= $n;
 $sqsum -= ( $mean ** 2 );
 my $stdev = sqrt($sqsum);

 my $max_seen_count = max values %seen;
 my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;

 my $mode = @modes == 1 ? $modes[0] : "(" . join(", ", @modes) . ")";
 $mode .= ' @ ' . $max_seen_count;

 my $median;
 my @sorted = sort { $a <=> $b } @values;   # numeric sort; the median of unsorted data would be wrong
 my $mid = int @sorted / 2;
 if (@sorted % 2) {
     $median = $sorted[$mid];
 } else {
     $median = ($sorted[$mid-1] + $sorted[$mid]) / 2;
 }

 my $min = min @values;
 my $max = max @values;

 printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
 printf "mode is %s, median is %g, mean is %g, stdev is %g\n",
     $mode, $median, $mean, $stdev;
+3
Mar 20 '12 at 15:45

There is also simple-r, which can do almost anything R can, but with fewer keystrokes:

https://code.google.com/p/simple-r/

To calculate basic descriptive statistics, you would need to type one of:

 r summary file.txt
 r summary - < file.txt
 cat file.txt | r summary -

For the mean, median, minimum, maximum and standard deviation respectively, the code would look like this:

 seq 1 100 | r mean -
 seq 1 100 | r median -
 seq 1 100 | r min -
 seq 1 100 | r max -
 seq 1 100 | r sd -

It doesn't get any simpler than simple-R!

+3
Sep 30 '13 at 20:37

Another tool: tsv-summarize, from eBay's tsv-utils. Min, max, mean, median and standard deviation are all supported. Designed for large data sets. Example:

 $ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
 1    10    5.5    3.0276503541
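The mean is available the same way; a sketch, assuming the --mean operator follows the same pattern as the flags above:

 $ seq 10 | tsv-summarize --mean 1
 5.5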

Disclaimer: I am the author.

+2
Jan 20 '18 at 18:36

Using xsv :

 $ echo '3 1 4 1 5 9 2 6 5 3 5 9' | tr ' ' '\n' > numbers-one-per-line.csv

 $ xsv stats -n < numbers-one-per-line.csv
 field,type,sum,min,max,min_length,max_length,mean,stddev
 0,Integer,53,1,9,1,1,4.416666666666667,2.5644470922381863

 # mode/median/cardinality not shown by default since they require storing the full file in memory:
 $ xsv stats -n --everything < numbers-one-per-line.csv | xsv table
 field  type     sum  min  max  min_length  max_length  mean               stddev              median  mode  cardinality
 0      Integer  53   1    9    1           1           4.416666666666667  2.5644470922381863  4.5     5     7
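Since the stats output is itself CSV, you can pare it down to just the columns the question asks about with xsv select (using the header names printed above); a sketch:

 $ xsv stats -n --everything < numbers-one-per-line.csv | xsv select min,max,mean,median,stddev | xsv table
 min  max  mean               median  stddev
 1    9    4.416666666666667  4.5     2.5644470922381863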
+1
Apr 24 '18 at 12:28


