Linux command line utility for printing number statistics

I often find myself with a file that has one number per line. I end up importing it into Excel to view things like the median, standard deviation, and so on.

Is there a command line utility in Linux to do the same? I usually need to find the mean, median, minimum, maximum and standard deviation.

+63
command-line linux statistics
Mar 20 '12 at 15:31
15 answers

This is a breeze with R. For a file that looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
 10

Use this:

 R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])" 

To get this:

        V1
  Min.   : 1.00
  1st Qu.: 3.25
  Median : 5.50
  Mean   : 5.50
  3rd Qu.: 7.75
  Max.   :10.00
 [1] 3.02765
  • The -q flag suppresses R's startup licensing and help output
  • The -e flag tells R that you will pass the expression from the terminal
  • x is a data.frame, basically a table. It's a structure that holds several vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This affects which functions you can use.
  • Some functions, such as summary(), work naturally on data.frames. If x had several fields, summary() would provide the descriptive statistics above for each.
  • But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SD of every column.
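If reading a single column into a data.frame feels awkward, R's scan() reads the numbers straight into a vector, so summary() and sd() apply directly. A minimal sketch, assuming the same nums.txt:

 R -q -e "x <- scan('nums.txt'); summary(x); sd(x)"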
+52
Mar 22 '12 at 16:25

Using "st" ( https://github.com/nferraz/st )

 $ st numbers.txt
 N    min    max    sum    mean    stddev
 10   1      10     55     5.5     3.02765

Or:

 $ st numbers.txt --transpose
 N       10
 min     1
 max     10
 sum     55
 mean    5.5
 stddev  3.02765

(DISCLAIMER: I wrote this tool :))

+38
04 Sep '13 at 15:23

For the mean and standard deviation you can use awk. This will usually be faster than the R solutions. For example, the following will print the mean:

 awk '{a+=$1} END{print a/NR}' myfile 

(NR is the awk variable for the number of records, and $1 means the first (whitespace-separated) field of the line. $0 would be the whole line, which also works here but is in principle less safe, although for this calculation awk would probably just take the first field anyway. END means the following commands are executed after the whole file has been processed; you could also initialize a to 0 in a BEGIN{a=0} block, as shown below.)
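For instance, the explicit-initialization variant just mentioned looks like this (functionally identical, since awk initializes variables to zero anyway):

 awk 'BEGIN{a=0} {a+=$1} END{print a/NR}' myfile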

Here is a simple awk script that provides more detailed statistics (it takes a CSV file as input; otherwise change FS):

 #!/usr/bin/awk -f
 BEGIN {
     FS=",";
 }
 {
     a += $1;
     b[++i] = $1;
 }
 END {
     m = a/NR; # mean
     for (i in b) {
         d += (b[i]-m)^2;
         e += (b[i]-m)^3;
         f += (b[i]-m)^4;
     }
     va = d/NR;            # variance
     sd = sqrt(va);        # standard deviation
     sk = (e/NR)/sd^3;     # skewness
     ku = (f/NR)/sd^4-3;   # standardized kurtosis
     print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
     print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
 }

Adding min/max to this script is easy, but piping through sort and head/tail is just as simple:

 sort -n myfile | head -n1
 sort -n myfile | tail -n1
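If you would rather stay within a single awk pass, here is a minimal sketch of tracking min and max alongside the mean (the variable names are my own):

 awk 'NR==1{min=max=$1} $1<min{min=$1} $1>max{max=$1} {a+=$1} END{print "min=" min, "max=" max, "mean=" a/NR}' myfile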
+33
Mar 20 '12

Yet another tool that can be used to calculate statistics and view them in ASCII mode is ministat. It is a tool from FreeBSD, but it is also packaged for popular Linux distributions such as Debian/Ubuntu.

Usage example:

 $ cat test.log
 Handled 1000000 packets.Time elapsed: 7.575278
 Handled 1000000 packets.Time elapsed: 7.569267
 Handled 1000000 packets.Time elapsed: 7.540344
 Handled 1000000 packets.Time elapsed: 7.547680
 Handled 1000000 packets.Time elapsed: 7.692373
 Handled 1000000 packets.Time elapsed: 7.390200
 Handled 1000000 packets.Time elapsed: 7.391308
 Handled 1000000 packets.Time elapsed: 7.388075
 $ cat test.log | awk '{print $5}' | ministat -w 74
 x <stdin>
 +--------------------------------------------------------------------------+
 |x                                                                         |
 |xx                                       xx x x                          x|
 |   |__________________________A_______M_________________|                 |
 +--------------------------------------------------------------------------+
     N           Min           Max        Median           Avg        Stddev
 x   8      7.388075      7.692373       7.54768     7.5118156    0.11126122
+19
Aug 18 '15 at 15:09

Yes, it's called perl, and here is a short one-liner:

 perl -e 'use List::Util qw(max min sum); @a=(); while(<>){ $sqsum+=$_*$_; push(@a,$_) } $n=@a; $s=sum(@a); $avg=$s/$n; $m=max(@a); $mm=min(@a); $std=sqrt($sqsum/$n-($s/$n)*($s/$n)); @srtd=sort {$a<=>$b} @a; $mid=int(@a/2); if(@a%2){ $med=$srtd[$mid] }else{ $med=($srtd[$mid-1]+$srtd[$mid])/2 } print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'

Example

 $ cat tt
 1
 3
 4
 5
 6.5
 7.
 2
 3
 4

And the command:

 cat tt | perl -e 'use List::Util qw(max min sum); @a=(); while(<>){ $sqsum+=$_*$_; push(@a,$_) } $n=@a; $s=sum(@a); $avg=$s/$n; $m=max(@a); $mm=min(@a); $std=sqrt($sqsum/$n-($s/$n)*($s/$n)); @srtd=sort {$a<=>$b} @a; $mid=int(@a/2); if(@a%2){ $med=$srtd[$mid] }else{ $med=($srtd[$mid-1]+$srtd[$mid])/2 } print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
 records:9
 sum:35.5
 avg:3.94444444444444
 std:1.86256162380447
 med:4
 max:7.
 min:1
+15
Mar 20 '12 at 15:43

Mean:

 awk '{sum += $1} END {print "mean = " sum/NR}' filename 

Median:

 gawk -v max=128 '
     function median(c,v,    j) {
         asort(v,j)
         if (c % 2) return j[(c+1)/2]
         else return (j[c/2+1]+j[c/2])/2.0
     }
     {
         count++
         values[count]=$1
         if (count >= max) {
             print median(count,values)
             count=0
             delete values    # clear the chunk so stale entries do not skew the next median
         }
     }
     END {
         print "median = " median(count,values)
     }
 ' filename
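If you don't need the fixed-size chunking (-v max=128 emits a median for every block of 128 samples), a whole-file version of the same idea is shorter. A sketch using gawk's asort():

 gawk '{v[NR]=$1} END {n=asort(v); if (n % 2) print "median = " v[(n+1)/2]; else print "median = " (v[n/2]+v[n/2+1])/2}' filename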

Mode:

 awk '{c[$1]++} END {for (i in c) {if (c[i] > maxcount) {maxcount = c[i]; mode = i}} print "mode = " mode}' filename

This mode calculation is crude, but you can see how it works ...

Standard deviation:

 awk '{sum+=$1; sumsq+=$1*$1} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2)}' filename 
+11
Mar 20 '12

data_hacks is a Python command-line utility for basic statistics.

The first example from this page gives the desired results:

 $ cat /tmp/data | histogram.py
 # NumSamples = 29; Max = 10.00; Min = 1.00
 # Mean = 4.379310; Variance = 5.131986; SD = 2.265389
 # each * represents a count of 1
  1.0000 -  1.9000 [ 1]: *
  1.9000 -  2.8000 [ 5]: *****
  2.8000 -  3.7000 [ 8]: ********
  3.7000 -  4.6000 [ 3]: ***
  4.6000 -  5.5000 [ 4]: ****
  5.5000 -  6.4000 [ 2]: **
  6.4000 -  7.3000 [ 3]: ***
  7.3000 -  8.2000 [ 1]: *
  8.2000 -  9.1000 [ 1]: *
  9.1000 - 10.0000 [ 1]: *
+8
May 6 '14 at 15:52

Just in case, there is also datastat, a simple program for Linux that computes simple statistics from the command line. For example,

 cat file.dat | datastat 

displays the mean of each column of file.dat across all its rows. If you need the standard deviation, min or max, you can add the --dev, --min and --max options respectively.
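For example, a sketch combining the options just named (assuming they can be given in a single invocation):

 cat file.dat | datastat --dev --min --max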

datastat has the ability to aggregate rows based on the value of one or more key columns. For example,

 cat file.dat | datastat -k 1 

will produce, for each distinct value found in the first column (the "key"), the mean of the other columns' values aggregated across all rows with the same key value. You can use more columns as key fields (e.g., -k 1-3, -k 2,4, etc.).

It is written in C++, runs fast with a small memory footprint, and combines nicely with other tools such as cut, grep, sed, sort, awk, etc.

+7
Mar 26 '13 at 23:32

You could also use clistats. It is a highly configurable command-line tool for computing statistics over a stream of delimited input numbers.

Input/output options

  • Input data can come from a file, standard input, or a pipe
  • Output can be written to a file, standard output, or a pipe
  • Output uses headers that start with "#" so it can be piped straight into gnuplot

Analysis options

  • A signal, end-of-file, or a blank line can be detected to stop processing
  • Comment and delimiter characters can be specified
  • Columns can be filtered out of processing
  • Rows can be filtered out of processing based on a numeric constraint
  • Rows can be filtered out of processing based on a string constraint
  • Initial header rows can be skipped
  • A fixed number of rows can be processed
  • Duplicate delimiters can be ignored
  • Rows can be reshaped into columns
  • It can strictly enforce that only rows of the same size are processed
  • A row containing column titles can be used to title the output statistics

Statistics options

  • Summary statistics (count, minimum, mean, maximum, standard deviation)
  • Covariance
  • Correlation
  • Least-squares offset
  • Least-squares slope
  • Histogram
  • Raw data after filtering
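The exact flags are in the project's documentation; since input can come from standard input or a pipe (per the input/output options above), a hypothetical minimal invocation over a one-number-per-line file would be:

 cat numbers.txt | clistats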

NOTE: I am the author.

+7
Jul 13 '14 at 5:14

Another tool: https://www.gnu.org/software/datamash/

 # Example: calculate the sum and mean of values 1 to 10:
 $ seq 10 | datamash sum 1 mean 1
 55    5.5
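datamash also has operations covering the rest of what the question asks: min, max, median, and sample standard deviation (sstdev). A sketch, with output formatting approximate:

 $ seq 10 | datamash min 1 max 1 median 1 sstdev 1
 1    10    5.5    3.0276503541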

It may be more commonly available (at least it was the first such tool I found pre-packaged for nix).

+7
Aug 17 '16 at 13:34

I found myself wanting to do this in a shell pipeline, and getting all the right arguments for R took some time. Here is what I came up with:

 seq 10 | R --slave -e 'x <- scan(file="stdin",quiet=TRUE); summary(x)'
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.00    3.25    5.50    5.50    7.75   10.00

The --slave option "Make[s] R run as quietly as possible ... It implies --quiet and --no-save". The -e option tells R to treat the following string as R code. The first statement reads from standard input and stores what is read in a variable named "x". The quiet=TRUE parameter of the scan function suppresses the message saying how many items were read. The second statement applies the summary function to x, which produces the output.
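Since the question also asks for the standard deviation, the same pipeline extends naturally; a sketch:

 seq 10 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x); cat("sd:", sd(x), "\n")'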

+6
Feb 07 '17 at 1:49
 #!/usr/bin/perl
 #
 # stdev - figure N, min, max, median, mode, mean, & std deviation
 #
 # pull out all the real numbers in the input
 # stream and run standard calculations on them.
 # they may be intermixed with other text, need
 # not be on the same or different lines, and
 # can be in scientific notation (avogadro=6.02e23).
 # they also admit a leading + or -.
 #
 # Tom Christiansen
 # tchrist@perl.com

 use strict;
 use warnings;

 use List::Util qw< min max >;

 my $number_rx = qr{
   # leading sign, positive or negative
     (?: [+-] ? )
   # mantissa
     (?= [0123456789.] )
     (?:
         # "N" or "N." or "N.N"
         (?:
             (?: [0123456789] +     )
             (?:
                 (?: [.] )
                 (?: [0123456789] * )
             ) ?
       |
         # ".N", no leading digits
             (?:
                 (?: [.] )
                 (?: [0123456789] + )
             )
         )
     )
   # abscissa
     (?:
         (?: [Ee] )
         (?:
             (?: [+-] ? )
             (?: [0123456789] + )
         )
         |
     )
 }x;

 my $n      = 0;
 my $sum    = 0;
 my @values = ();
 my %seen   = ();

 while (<>) {
     while (/($number_rx)/g) {
         $n++;
         my $num = 0 + $1;    # 0+ is so numbers in alternate form count as same
         $sum += $num;
         push @values, $num;
         $seen{$num}++;
     }
 }

 die "no values" if $n == 0;

 my $mean = $sum / $n;

 my $sqsum = 0;
 for (@values) {
     $sqsum += ( $_ ** 2 );
 }
 $sqsum /= $n;
 $sqsum -= ( $mean ** 2 );
 my $stdev = sqrt($sqsum);

 my $max_seen_count = max values %seen;
 my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;

 my $mode = @modes == 1 ? $modes[0] : "(" . join(", ", @modes) . ")";
 $mode .= ' @ ' . $max_seen_count;

 my $median;
 my @sorted = sort { $a <=> $b } @values;   # numeric sort; the median of unsorted data would be wrong
 my $mid = int @sorted / 2;
 if (@sorted % 2) {
     $median = $sorted[$mid];
 } else {
     $median = ($sorted[$mid-1] + $sorted[$mid]) / 2;
 }

 my $min = min @values;
 my $max = max @values;

 printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
 printf "mode is %s, median is %g, mean is %g, stdev is %g\n",
     $mode, $median, $mean, $stdev;
+3
Mar 20 '12 at 15:45

There is also simple-r, which can do almost anything R can, but with fewer keystrokes:

https://code.google.com/p/simple-r/

To calculate basic descriptive statistics, you would need to type one of:

 r summary file.txt
 r summary - < file.txt
 cat file.txt | r summary -

For the mean, median, minimum, maximum and standard deviation respectively, the code would look like this:

 seq 1 100 | r mean -
 seq 1 100 | r median -
 seq 1 100 | r min -
 seq 1 100 | r max -
 seq 1 100 | r sd -

It doesn't get any simpler than simple-R!

+3
Sep 30 '13 at 20:37

Another tool: tsv-summarize, from eBay's tsv-utils. Min, max, mean, median and standard deviation are all supported. Designed for large data sets. Example:

 $ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
 1    10    5.5    3.0276503541
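The mean is available the same way; a sketch, assuming the --mean operator follows the same pattern as the flags above:

 $ seq 10 | tsv-summarize --mean 1
 5.5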

Disclaimer: I am the author.

+2
Jan 20 '18 at 18:36

Using xsv :

 $ echo '3 1 4 1 5 9 2 6 5 3 5 9' | tr ' ' '\n' > numbers-one-per-line.csv

 $ xsv stats -n < numbers-one-per-line.csv
 field,type,sum,min,max,min_length,max_length,mean,stddev
 0,Integer,53,1,9,1,1,4.416666666666667,2.5644470922381863

 # mode/median/cardinality not shown by default since they require storing the full file in memory:
 $ xsv stats -n --everything < numbers-one-per-line.csv | xsv table
 field  type     sum  min  max  min_length  max_length  mean               stddev              median  mode  cardinality
 0      Integer  53   1    9    1           1           4.416666666666667  2.5644470922381863  4.5     5     7
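Since the stats output is itself CSV, you can pare it down to just the columns the question asks about with xsv select (using the header names printed above); a sketch:

 $ xsv stats -n --everything < numbers-one-per-line.csv | xsv select min,max,mean,median,stddev | xsv table
 min  max  mean               median  stddev
 1    9    4.416666666666667  4.5     2.5644470922381863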
+1
Apr 24 '18 at 12:28


