How can I relate pageviews to spikes in memory usage?

I am having memory problems with my application, but it is difficult to pinpoint exactly where they come from. I have two data sets:

Pageviews

  • The page that was requested
  • The time the page was requested

Memory usage

  • The amount of memory used
  • The time the memory usage was recorded

I would like to see which pageviews are associated with high memory usage. My guess is that I need some sort of t-test to determine which pageviews correlate with increased memory usage, but I am not sure which test to use. Can anyone at least point me in the right direction?

+4
4 answers

I would suggest building a dataset with two columns: the first is the proportion of each page among the observations in the high end of the memory usage distribution, and the second is the proportion of the same pages among the remaining memory usage observations.

Then you would perform a paired test of whether the median of the differences (high minus rest) is less than or equal to zero (H0), against the alternative hypothesis that the median of the differences is greater than zero (H1). I would suggest the non-parametric Wilcoxon signed-rank test, which is the paired-sample counterpart of the Mann-Whitney test. It also takes into account the magnitude of the difference within each pair, which other tests (for example, the sign test) ignore.

Keep in mind that ties (zero differences) cause numerous problems in the derivation of nonparametric methods and should be avoided. A common way to deal with ties is to add a little "noise" to the data: that is, run the test after perturbing the tied values by a random amount small enough that it does not affect the ranking of the differences.

I hope that the test results, together with a plot of the distribution of the differences, will help you understand where the problem is.

There is an implementation of the Wilcoxon signed-rank test in R (the built-in wilcox.test function).
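
A minimal sketch of this approach, assuming the two data sets have already been merged into a data frame `df` with one row per pageview and hypothetical columns `page` (page identifier) and `mem_mb` (memory usage recorded around the time of that pageview):

    # Rough sketch, not a drop-in solution: `df`, `page` and `mem_mb`
    # are assumed names for the merged pageview/memory data.
    set.seed(42)  # makes the tie-breaking jitter reproducible

    # Flag the high end of the memory distribution (here: top 10%).
    high <- df$mem_mb >= quantile(df$mem_mb, 0.90)

    # Proportion of each page among the high-memory observations,
    # and among the remaining observations (the two columns above).
    pages  <- factor(df$page)
    p_high <- as.numeric(prop.table(table(pages[high])))
    p_rest <- as.numeric(prop.table(table(pages[!high])))

    # One-sided paired Wilcoxon signed-rank test:
    # H0: median difference <= 0 vs H1: median difference > 0.
    # jitter() adds a tiny bit of noise to break ties, as suggested above.
    wilcox.test(jitter(p_high, amount = 1e-9), p_rest,
                paired = TRUE, alternative = "greater")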

+3

Jason

You are asking a good statistical question. Think of the amount of memory used as a random variable. The first step is to look at the distribution of this variable; it may not match any well-known distribution, but don't let that stop you. One simple approach is to take the observations with the highest memory usage (say, the top 5-10%) and see whether the pageviews (or the times at which they were requested) in that sample differ from the pageviews in the rest of the data. I think you will need a nonparametric test that compares the proportion of each pageview in the low-memory sample with its proportion in the high-memory sample. Hope this helps.
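
One way to make that concrete, as a sketch only: `df`, `page` and `mem_mb` are assumed names for the merged data, and a chi-squared test on the page-by-group table is just one possible choice of test.

    # Split observations into the high-memory tail (top 10%) and the rest.
    memory_group <- ifelse(df$mem_mb >= quantile(df$mem_mb, 0.90),
                           "high", "rest")

    # Contingency table of page vs. memory group, and per-group proportions.
    tab <- table(df$page, memory_group)
    round(prop.table(tab, margin = 2), 3)

    # Does the distribution of pages differ between the two groups?
    # The standardized residuals point to pages that are over-represented
    # in the high-memory group.
    res <- chisq.test(tab)
    res
    res$stdres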

+3

What you describe is certainly an interesting statistical problem, but may I suggest a graphical approach with a good old-fashioned plot instead?

Assign each of your pages a unique number and make a scatter plot of page number vs. memory usage. You should get a series of vertical stripes of markers, and hopefully the culprit will be obvious.

If there are so many data points that the stripes become solid, you can add a small amount of noise (jitter) to the page numbers to spread them out. If requests overlap in time, you may have to try tricks such as dividing the memory by the number of simultaneous requests, but your eyes should be able to pick out the offender even through a lot of noise.
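
For example, in R (a sketch, again assuming a merged data frame `df` with hypothetical columns `page` and `mem_mb`):

    # Assign each page a unique number.
    df$page_id <- as.integer(factor(df$page))

    # Scatter plot of page number vs. memory usage. Jittering the page
    # numbers and using translucent markers keeps dense vertical stripes
    # from collapsing into solid lines.
    plot(jitter(df$page_id, amount = 0.2), df$mem_mb,
         pch = 16, cex = 0.5, col = rgb(0, 0, 0, 0.3),
         xlab = "Page #", ylab = "Memory usage (MB)")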

+1

Here's another idea: if you can join the pageview and memory data by timestamp, you can create a table like this

Page A | Page B | Page C | Page D | Page E | .... | Memory_use

The value in each page column can be a binary indicator (0 or 1) showing whether that page was requested, or a count of requests, depending on your data. In the Memory_use column you can put the corresponding proportion of memory load, or the amount in MB. Memory_use can then be treated as the dependent variable and the pages as explanatory variables, so you can fit a suitable generalized linear model (chosen according to the form of the dependent variable) to this data set; see the sketch after the list below. The results of this analysis will give you an idea of:

- which pages significantly affect memory usage;

- the degree to which each page contributes to the load (via its coefficient in the model);

- whether other, unmeasured factors play a significant role in the memory load (overdispersion); in the worst case, all predictor variables may turn out to be insignificant.
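
A rough illustration in R: the `wide` data frame, its `page_A` ... `page_E` indicator columns and `memory_use` are all hypothetical names from the table above, and the Gamma family is just one plausible choice for a positive, right-skewed response.

    # Fit a generalized linear model of memory use on page indicators.
    fit <- glm(memory_use ~ page_A + page_B + page_C + page_D + page_E,
               family = Gamma(link = "log"), data = wide)

    summary(fit)   # which pages matter, and the size of each coefficient

    # A crude check of fit: residual deviance far above the residual
    # degrees of freedom hints at unmeasured factors driving memory load.
    deviance(fit) / df.residual(fit)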

+1

Source: https://habr.com/ru/post/1300196/

