How to build a quantile strip (in R)

I have a CSV file that contains lines for each (Java GC) event that interests me. An object consists of a subgrid time stamp (not equidistant) and some variables. The object looks like this:

gcdata <- read.table("http://bernd.eckenfels.net/view/gc1001.ygc.csv",header=TRUE,sep=",", dec=".") start = as.POSIXct(strptime("2012-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S")) gcdata.date = gcdata$Timestamp + start gcdata = gcdata[,2:7] # remove old date col gcdata=data.frame(date=gcdata.date,gcdata) str(gcdata) 

Results in

 'data.frame': 2997 obs. of 7 variables: $ date : POSIXct, format: "2012-01-01 00:00:06" "2012-01-01 00:00:06" "2012-01-01 00:00:18" ... $ Distance.s. : num 0 0.165 11.289 9.029 11.161 ... $ YGUsedBefore.K.: int 1610619 20140726 20148325 20213304 20310849 20404772 20561918 21115577 21479211 21544930 ... $ YGUsedAfter.K. : int 7990 15589 80568 178113 272036 429182 982841 1346475 1412181 1355412 ... $ Promoted.K. : int 0 0 0 0 8226 937 65429 71166 62548 143638 ... $ YGCapacity.K. : int 22649280 22649280 22649280 22649280 22649280 22649280 22649280 22649280 22649280 22649280 ... $ Pause.s. : num 0.0379 0.022 0.0287 0.0509 0.109 ... 

In this case, I care about the pause time (in seconds). I want to build a chart that will show me for each (wall clock) basically the average value in the form of a line, 2% and 98% in the form of a gray corridor and the maximum value (within each hour) in the form of a red line.

I did some work, but using q98 functions is ugly, using multiple line operators seems useless, and I don't know how to reach the gray area between q02 and q98:

 q02 <- function(x, ...) { x <- quantile(x,probs=c(0.2)) } q98 <- function(x, ...) { x <- quantile(x,probs=c(0.98)) } hours = droplevels(cut(gcdata$date, breaks="hours")) # can I have 2 hours? plot(aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=max),ylim=c(0,2), col="red", ylab="Pause(s)", xlab="Days") # Is always black? lines(aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=q98),ylim=c(0,2), col="green") lines(aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=q02),ylim=c(0,2), col="green") lines(aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=mean),ylim=c(0,2), col="blue") 

Now this leads to a chart with black dots as a maximum, a blue line as an hourly average and a lower and upper 0.2 + 0.98 green line. I think it would be better to read the gray corridor, perhaps the dashed maximum (red) line and somehow fix the axis labels. Exported Chart (png) Any suggestions? (file available above)

+4
source share
4 answers

nice to see old Debian oldies here :) Your answer is already pretty nice. As I happen to work a lot with time series, I thought of throwing in a variant using excellent zoo and xts . The latter is built on top of the first and has, among other things, the period.apply() function, which we can use here together with the endpoints() function to get you two-hour aggregates.

So upstairs I use

 library(zoo) # for zoo objects library(xts) # for period.apply gcdata <- read.table("http://bernd.eckenfels.net/view/gc1001.ygc.csv", header=TRUE, sep=",", dec=".") timestamps <- gcdata$Timestamp + as.POSIXct(strptime("2012-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S")) gcdatazoo <- zoo(gcdata[-1], order.by=timestamps) # as zoo object 

to create a zoo object. Your function remains:

 plotAreaCorridor <- function(x, y, col.poly1="lightgray", col.poly2="gray",...) { x.pol <- c(x, rev(x), x[1]) y.pol <- c(y[,1], rev(y[,5]),y[,1][1]) plot(x, y[,6]+1, type="n", ...) polygon(x.pol, y.pol, col=col.poly1, lty=0) x.pol <- c(x, rev(x), x[1]) y.pol <- c(y[,2], rev(y[,4]), y[,1][1]) polygon(x.pol, y.pol, col=col.poly2, lty=0) lines(x, y[,3], col="blue") # median lines(x, y[,6], col="red") # max invisible(NULL) } 

And we can simplify it a bit:

 agg <- period.apply(gcdatazoo[,"Pause.s."], # to which data INDEX=endpoints(gcdatazoo, "hours", k=2), # every 2 hours FUN=function(x) quantile(x, # what fun. probs=c(5,20,50,80,95,100)/100)) #v99 = q99(gcdata$Pause.s.) # what is q99 ? v99 <- mean(agg[,5]) # mean of 95-th percentile? plotAreaCorridor(index(agg), # use time index as x axis coredata(agg), # and matrix part of zoo object as data ylim=c(0,max(agg[,5])*1.5), ylab="Quantiles of GC events", main="NewPar Collection Activity") abline(h=median(gcdatazoo[,"Pause.s."]), col="lightblue") abline(h=v99, col="grey") labeltxt <- paste("99%=",round(v99,digits=3),"sn=", nrow(gcdatazoo),sep="") text(x=index(agg)[20], y=1.5*v99, labeltxt, col="grey", pos=3) # or legend() 

which gives

enter image description here

Now the axis is now automatic and shows working days only as the interval is less than a week; this can be canceled as necessary.

+1
source

You need to try polygon . This code may be useful:

 y98 = aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=q98) y02 = aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=q02) ymax = aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=max) ymin = aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=min) ymean = aggregate(gcdata$Pause.s. ~ hours, data=gcdata, FUN=mean) x = ymean[,1] y1 = cbind(y02[,2], ymean[,2], y98[,2]) y2 = cbind(ymin[,2], ymean[,2], ymax[,2]) plotAreaCI(x,y2, ylim=c(0,2), xlab="time", ylab="variable") plotAreaCI(x,y1, ylim=c(0,2), poly.col="blue", add=TRUE) 

pic1

or

 plotAreaCI(x,y2, ylim=c(0,2), xlab="time", ylab="variable", nice.x = TRUE) plotAreaCI(x,y1, ylim=c(0,2), mean.lwd=2, poly.col="blue", add=TRUE) 

pic2

where the plotAreaCI function is defined as follows:

 plotAreaCI = function(x, y, add=FALSE, nice.x = FALSE, xlim=NULL, ylim=NULL, mean.col="black", mean.lwd=1.5, poly.col="gray", poly.lty=3, xlab=NULL, ylab=NULL, main="", ...) { isFactorX = isClass("factor", x) if(isFactorX) { x.label = x x = as.numeric(x) } if(is.null(xlim)) xlim=range(x, na.rm=TRUE) if(is.null(ylim)) ylim=range(y, na.rm=TRUE) x.pol = c(x, rev(x), x[1]) y.pol = c(y[,1], rev(y[,3]), y[,1][3]) if(!add) { plot.new() plot.window(xlim=xlim, ylim=ylim, ...) if(!nice.x & isFactorX) { axis(1, at=x, labels=x.label) } else { xticks = axTicks(1) if(isFactorX) { xticks = xticks[xticks>=1] axis(1, at=xticks, labels=x.label[xticks]) } else { axis(1) } } axis(2, las=1) box() title(xlab=xlab, ylab=ylab, main=main) } polygon(x.pol, y.pol, col=poly.col, lty=poly.lty) lines(x, y[,2], col=mean.col, lwd=mean.lwd) return(invisible()) } 
+4
source

This is the code that I use to build the time variation in laboratory analytes (systolic blood pressure in this case):

  SBP.qtr.mat <- aggregate(set1HLI$SBP, list( year(set1HLI$Drawdt)+0.25* quarter(set1HLI$Drawdt)), quantile, prob=c(0.1,0.25,0.5,0.75, 0.9,0.95, 0.975), na.rm=TRUE) matplot(SBP.qtr.mat[,1], SBP.qtr.mat$x, type="pl") 

It should not be too difficult to adapt this to your problem .... or you can post a reproducible example to work. This gives 10, 25, 50, 75, 90, 95 and 97.5 percentiles in one data file. The matrix and matrix process the construction of such an object.

The gray area?, ... The usual approach is to build a polygon that extends to the lower borders, β€œrotate” on the right side and return to the upper side, as well as connect back on the left side. polygon arguments are configured as x, y . There is a col argument that you would set to gray.

To make "2-hour" sequences with which you could combine your frame or use with cut.POSIXt" as a breaks argument , there is the option of using multiples of time units with seq.POSIXt`:

 > seq(ISOdate(1910,1,1), ISOdate(1999,1,1), "10 years") [1] "1910-01-01 12:00:00 GMT" "1920-01-01 12:00:00 GMT" "1930-01-01 12:00:00 GMT" "1940-01-01 12:00:00 GMT" [5] "1950-01-01 12:00:00 GMT" "1960-01-01 12:00:00 GMT" "1970-01-01 12:00:00 GMT" "1980-01-01 12:00:00 GMT" [9] "1990-01-01 12:00:00 GMT" 

I have not seen the document, but you can use the multiplicity of the interval with cut.POSIXt :

 > str( cut( seq(ISOdate(1910,1,1), ISOdate(1999,1,1), "years"), "10 years") ) Factor w/ 9 levels "1910-01-01","1920-01-01",..: 1 1 1 1 1 1 1 1 1 1 ... > str( cut( seq(ISOdate(1910,1,1), ISOdate(1999,1,1), "years"), "5 years") ) Factor w/ 18 levels "1910-01-01","1915-01-01",..: 1 1 1 1 1 2 2 2 2 2 ... 
+1
source

Currently, I have not come to the following script (anyway, I need to look at a more advanced answer from DWin). Now I look like I was looking, but the code is still pretty ugly (for example, I don't know how to align the label and how to get the corresponding xlab labels):

 plotAreaCorridor = function(x, y, col.poly1="lightgray", col.poly2="gray",...) { x.pol = c(x, rev(x), x[1]) y.pol = c(y[,1], rev(y[,5]),y[,1][1]) plot(x, y[,6]+1, type="n", ...) # ugly since type="n" does not work for factor polygon(x.pol, y.pol, col=col.poly1, lty=0) x.pol = c(x, rev(x), x[1]) y.pol = c(y[,2], rev(y[,4]), y[,1][1]) polygon(x.pol, y.pol, col=col.poly2, lty=0) lines(x, y[,3], col="blue") # median lines(x, y[,6], col="red") # max return(invisible()) } pause = gcdata$Pause.s. hours = droplevels(cut(gcdata$date, breaks="hours")) # can I have 2 hours? agg = aggregate(pause ~ hours, FUN=quantile, probs=c(5,20,50,80,95,100)/100) x = agg$hours ys = agg$pause q99 <- function(x, ...) { x <- quantile(x,probs=c(0.99)) } v99 = q99(gcdata$Pause.s.) vmed = median(gcdata$Pause.s.) plotAreaCorridor(x, ys,ylim=c(0,v99*1.5)) abline(h=vmed, col="lightblue") abline(h=v99, col="grey") label=paste("99%=",round(v99,digits=3),"sn=", length(gcdata$date),sep="") text(x=30, y=v99, label, col="grey", pos=3) title("NewPar Collection Activity") 

New chart version

0
source

Source: https://habr.com/ru/post/1437007/


All Articles