Histogram normalization in gnuplot

I am trying to build a histogram whose bins are normalized by the number of elements in each bin.

I use the following

    binwidth=5
    bin(x,width)=width*floor(x/width) + binwidth/2.0
    plot 'file' using (bin($2, binwidth)):($4) smooth freq with boxes

to get a basic histogram, but I want the value of each bin divided by the size of the bin. How can this be done in gnuplot, or with external tools if necessary?

+6
5 answers

In gnuplot 4.4, functions gain a new property: they can execute several successive commands and then return a value (see gnuplot tricks). This means you can compute the number of points, n, inside the gnuplot script without knowing it in advance. The following code works on a file "out.dat" containing a single column: a list of n samples from a normal distribution:

    binwidth = 0.1
    set boxwidth binwidth
    sum = 0
    s(x) = ((sum=sum+1), 0)
    bin(x, width) = width*floor(x/width) + binwidth/2.0
    plot "out.dat" u ($1):(s($1))
    plot "out.dat" u (bin($1, binwidth)):(1.0/(binwidth*sum)) smooth freq w boxes

The first plot statement reads through the data file and increments sum by one for each point, plotting a zero each time.

The second plot statement then uses the value of sum to normalize the histogram.
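A short derivation may help show why weighting every point by 1.0/(binwidth*sum) yields a density. The symbols n (the final value of sum), w (binwidth), c_k (number of points in bin k) and h_k (height of box k) are notation introduced here for illustration, not taken from the answer:

```latex
% Each point adds 1/(n w) to its bin; "smooth freq" sums these contributions per bin.
h_k = \sum_{i \in \text{bin } k} \frac{1}{n\,w} = \frac{c_k}{n\,w},
\qquad
\sum_k h_k\, w = \frac{1}{n} \sum_k c_k = 1 .
```

So the total area of the boxes is 1, which is what lets the histogram be compared with a probability density.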

+8

In gnuplot 4.6, you can count the number of points with the stats command, which is faster than plot. You do not actually need the trick s(x)=((sum=sum+1),0); instead, read the count directly from the variable STATS_records after running stats 'out.dat' u 1.
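Put together, the stats-based variant might look like the sketch below. It assumes gnuplot 4.6 or later and the same one-column file "out.dat" used in the previous answer; the variable name n is chosen here for illustration.

```gnuplot
# Sketch only: assumes gnuplot >= 4.6 and a one-column data file "out.dat".
binwidth = 0.1
bin(x, width) = width*floor(x/width) + width/2.0
set boxwidth binwidth

# stats scans the file once; "nooutput" suppresses the printed summary.
stats "out.dat" using 1 nooutput
n = STATS_records          # number of records stats actually read

# Each point contributes 1/(binwidth*n), so the box areas sum to 1.
plot "out.dat" using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes
```

This replaces the first, throwaway plot command with a single stats call and avoids the counter function altogether.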

+8

Here is how I would do it, with n = 500 random Gaussian variates generated from R using the following command:

 Rscript -e 'cat(rnorm(500), sep="\\n")' > rnd.dat 

I use the same idea as yours to define a normalized histogram, where y is defined as 1/(bin width * n), except that I use int instead of floor and I do not recenter on the bin value. In short, this is a quick adaptation of the smooth.dem demo script; a similar approach is described in Janert's book, Gnuplot in Action (Chapter 13, p. 257, freely available). You can replace my sample data file with random-points, which is available in the demo folder that ships with Gnuplot. Note that we need to specify the number of points ourselves, since Gnuplot has no facility for counting the records in a file.

    bw1=0.1
    bw2=0.3
    n=500
    bin(x,width)=width*int(x/width)
    set xrange [-3:3]
    set yrange [0:1]
    tstr(n)=sprintf("Binwidth = %1.1f\n", n)
    set multiplot layout 1,2
    set boxwidth bw1
    plot 'rnd.dat' using (bin($1,bw1)):(1./(bw1*n)) smooth frequency with boxes t tstr(bw1)
    set boxwidth bw2
    plot 'rnd.dat' using (bin($1,bw2)):(1./(bw2*n)) smooth frequency with boxes t tstr(bw2)

Here is the result, with two bin widths:

[Figure: gnuplot histograms of rnd.dat with bin widths 0.1 and 0.3]

Also, this is a really crude approach to building a histogram, and more elaborate solutions are readily available in R. Indeed, the real problem is how to choose a good bin width, and this issue has already been discussed on stats.stackexchange.com: the Freedman-Diaconis rule should not be too difficult to implement, although you will need to compute the interquartile range.
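As a rough illustration, the Freedman-Diaconis width (2 * IQR / n^(1/3)) could be computed inside gnuplot itself. This sketch assumes gnuplot 4.6 or later, where stats sets STATS_lo_quartile and STATS_up_quartile; the names iqr and bw are introduced here, floor with recentering is used rather than int, and gnuplot's quartile estimates may differ slightly from those R computes.

```gnuplot
# Sketch only: assumes gnuplot >= 4.6 and the one-column file rnd.dat.
stats 'rnd.dat' using 1 nooutput

n   = STATS_records
iqr = STATS_up_quartile - STATS_lo_quartile    # interquartile range
bw  = 2.0 * iqr / (1.0*n)**(1.0/3.0)           # Freedman-Diaconis bin width

bin(x, width) = width*floor(x/width) + width/2.0
set boxwidth bw
plot 'rnd.dat' using (bin($1, bw)):(1.0/(bw*n)) smooth frequency with boxes
```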

Here is how R would handle the same dataset, with the default option (Sturges' rule, which makes no difference in this particular case) and with equally spaced bins like the ones used above.

[Figure: R histograms of the same data, Sturges' rule and bin width 0.1]

The R code used is shown below:

    par(mfrow=c(1,2), las=1)
    hist(rnd, main="Sturges", xlab="", ylab="", prob=TRUE)
    hist(rnd, breaks=seq(-3.5,3.5,by=.1), main="Binwidth = 0.1", xlab="", ylab="", prob=TRUE)

You can even see how R does its work by inspecting the values returned by hist():

    > str(hist(rnd, plot=FALSE))
    List of 7
     $ breaks     : num [1:14] -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 ...
     $ counts     : int [1:13] 1 1 12 20 49 79 108 87 71 43 ...
     $ intensities: num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
     $ density    : num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
     $ mids       : num [1:13] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 ...
     $ xname      : chr "rnd"
     $ equidist   : logi TRUE
     - attr(*, "class")= chr "histogram"

All this is to say that you can use R's results to process your data with Gnuplot if you like (although I would recommend using R directly :-).

+3

Another way to count the number of data points in a file is with a system command. This is useful if you are creating several files, and you do not know the number of points in advance. I used:

    countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )
    file1count = countpoints (file1)
    file2count = countpoints (file2)
    file3count = countpoints (file3)
    ...

The countpoints function skips lines starting with "#", so comment lines are not counted. You would then use the functions already mentioned to build a normalized histogram.

Here is a complete example:

    n=100
    xmin=-50.
    xmax=50.
    binwidth=(xmax-xmin)/n
    bin(x,width)=width*floor(x/width)+width/2.0
    countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )
    file1count = countpoints (file1)
    file2count = countpoints (file2)
    file3count = countpoints (file3)
    plot file1 using (bin(($1),binwidth)):(1.0/(binwidth*file1count)) smooth freq with boxes,\
         file2 using (bin(($1),binwidth)):(1.0/(binwidth*file2count)) smooth freq with boxes,\
         file3 using (bin(($1),binwidth)):(1.0/(binwidth*file3count)) smooth freq with boxes
    ...
+2

Simply

 plot 'file' using (bin($2, binwidth)):($4/$4) smooth freq with boxes 
-2

Source: https://habr.com/ru/post/886681/

