Advice on calculating a function to describe the upper bound of data

I have a scatter plot of a dataset, and I am interested in calculating the upper bound of the data. I don't know whether there is a standard statistical approach for this, so what I was considering was splitting the X-axis data into small ranges, calculating the maximum value within each range, and then trying to find a function that describes these points. Is there a function in R that already does this?

If relevant, there are 92611 points.
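The binning idea described above can be sketched in base R roughly as follows. The data here are made up for illustration, and the bin width of 5 and the names `upper` and `upper_fun` are assumptions, not anything taken from the real dataset.

```r
# Hypothetical data: the spread of y shrinks as x grows
set.seed(42)
x <- runif(2000, 0, 100)
y <- rexp(2000, rate = 1 + x / 20)

# Split the X axis into 20 equal-width bins (assumed width: 5)
breaks <- seq(0, 100, by = 5)
bin    <- cut(x, breaks = breaks, include.lowest = TRUE)

# Maximum of y within each bin, paired with the bin midpoint
bin_max <- tapply(y, bin, max)
bin_mid <- head(breaks, -1) + diff(breaks) / 2
upper   <- data.frame(x = bin_mid, y = as.numeric(bin_max))

# One possible "function describing these points":
# linear interpolation between the per-bin maxima
upper_fun <- approxfun(upper$x, upper$y, rule = 2)
```

Calling `upper_fun(50)` then returns the interpolated upper bound at x = 50. Per-bin maxima depend heavily on the chosen bin width and on single outliers, so this is a starting point rather than a robust estimate.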


2 answers

You might like to look at quantile regression, which is available in the quantreg package. Whether this is useful will depend on whether you need the absolute maximum within your "windows" or whether some extreme quantile, say the 95th or 99th, is acceptable. If you are not familiar with quantile regression, consider linear regression, which fits a model for the expected (mean) response conditional on the model covariates. Quantile regression for the middle quantile (0.5) would instead fit a model for the median response, conditional on the model covariates.
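To make the mean-versus-quantile distinction concrete, here is a minimal base R sketch of the check (or "pinball") loss that quantile regression minimises, applied to the simplest possible model: a single constant. The helper names `check_loss` and `fit_constant_quantile` are made up for this illustration and are not part of quantreg.

```r
# Check loss: rho_tau(u) = u * (tau - I(u < 0))
check_loss <- function(u, tau) u * (tau - (u < 0))

# Hypothetical helper: the constant q minimising the summed check loss
fit_constant_quantile <- function(y, tau) {
  optimize(function(q) sum(check_loss(y - q, tau)),
           interval = range(y), tol = 1e-8)$minimum
}

set.seed(1)
y <- rlnorm(1001, -0.9)

q50 <- fit_constant_quantile(y, 0.5)   # should sit at the sample median
q99 <- fit_constant_quantile(y, 0.99)  # should sit at the 0.99 sample quantile

# Compare with the built-in estimates
c(check_loss_fit = q50, median = median(y))
c(check_loss_fit = q99, quantile_0.99 = unname(quantile(y, 0.99)))
```

Minimising squared error in the same way would return the mean. Quantile regression replaces the constant with a linear predictor in the covariates.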

Here is an example of using the quantreg package to show you what I mean. First, create some dummy data, similar to the data you display:

 set.seed(1)
 N <- 5000
 DF <- data.frame(Y = rev(sort(rlnorm(N, -0.9))) + rnorm(N), X = seq_len(N))
 plot(Y ~ X, data = DF)

Then fit the model to the 99th percentile (or 0.99 quantile):

 library(quantreg)  # provides rq()
 mod <- rq(Y ~ log(X), data = DF, tau = .99)

To generate a "fitted line", we predict from the model at 100 equally spaced values of X:

 pDF <- data.frame(X = seq(1, 5000, length.out = 100))
 pDF <- within(pDF, Y <- predict(mod, newdata = pDF))

and add the fitted line to the plot:

 lines(Y ~ X, data = pDF, col = "red", lwd = 2) 

This should give you the following:

[Figure: quantile regression output]
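As a rough sanity check on the choice between the 95th and 99th percentile, one can count how many observations fall on or below each fitted curve. This sketch assumes quantreg is installed and regenerates the same dummy data. The names `taus`, `fits` and `coverage` are introduced here for illustration only.

```r
library(quantreg)  # assumed to be installed

# Same dummy data as in the example above
set.seed(1)
N  <- 5000
DF <- data.frame(Y = rev(sort(rlnorm(N, -0.9))) + rnorm(N), X = seq_len(N))

# Fit one model per candidate quantile
taus <- c(0.95, 0.99)
fits <- lapply(taus, function(tau) rq(Y ~ log(X), data = DF, tau = tau))

# Share of observations lying on or below each fitted curve
coverage <- sapply(fits, function(fit) mean(DF$Y <= predict(fit, newdata = DF)))
names(coverage) <- paste0("tau=", taus)
coverage
```

Each share should be close to its tau. Neither curve is a strict upper bound: by construction, roughly a fraction 1 - tau of the points lies above the line.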


I would second Gavin's nomination of quantile regression. Your data might be modeled with X and Y each log-normally distributed. You can see what a plot of the joint distribution of two independent log-normal variates looks like (independent meaning no correlation is imposed, though the sample cor(x, y) will not necessarily be exactly 0) if you run:

 x <- rlnorm(1000, log(300), sdlog = 1)
 y <- rlnorm(1000, log(7), sdlog = 1)
 plot(x, y, cex = 0.3)


You can look at their individual distributions with qqplot (in the stats package that ships with R), remembering that the tails of such distributions can behave in surprising ways. You should be more interested in how well the bulk of the values fits a particular distribution than in the extremes ... unless, of course, your applications are in finance or insurance. We don't want another global financial crisis because of poor modeling assumptions about tail behavior, now do we?

 qqplot(x, rlnorm(10000, log(300), sdlog=1) ) 
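A related check, sketched here as one possible variation rather than as part of the original answer: if x is log-normal, then log(x) is normal, so its normal Q-Q plot should be close to a straight line. The parameter values below simply reuse the simulated example, and `qq` and `qq_cor` are names made up for this sketch.

```r
set.seed(1)
x <- rlnorm(1000, log(300), sdlog = 1)

# Theoretical normal quantiles (qq$x) vs. sample quantiles of log(x) (qq$y)
qq <- qqnorm(log(x), plot.it = FALSE)

# A correlation near 1 indicates the points lie close to a straight line
qq_cor <- cor(qq$x, qq$y)

# Rough parameter estimates on the log scale
c(meanlog = mean(log(x)), sdlog = sd(log(x)), qq_cor = qq_cor)
```

Dropping `plot.it = FALSE` draws the plot, and `qqline(log(x))` adds a reference line. As noted above, deviations in the tails deserve less weight than the fit of the bulk of the points, unless the tails are what you care about.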

Source: https://habr.com/ru/post/1334210/

