How to calculate ratios between data points, i.e. combine them based on some criterion?

(Unfortunately, I lack the basic vocabulary to phrase my question properly, so please correct me where more precise terms would help.)

I use R to do a very simple statistical analysis for virtual machine test results, and I often want to normalize my data based on some criterion.

My current problem is that I would like something like the following:

normalized_data <- ddply(bench, ~ Benchmark + Configuration + Approach, transform, Ratio = Time / Time[Approach == "appr2"]) 

So, what I really want is to calculate the speedup between corresponding pairs of measurements.

bench is a data frame with the columns Time, Benchmark, Configuration and Approach, and it contains 100 measurements for every possible combination of Benchmark, Configuration and Approach. I have exactly two approaches and want the speedup "appr2" / "appr1". Thus, looking at just one specific benchmark and one specific configuration, I have 100 measurements for "appr1" and 100 for "appr2" in my data frame. However, R gives me the following error for the query above:

 Error in data.frame(list(Time = c(405.73, 342.616, 404.484, 328.742, 403.384, : arguments imply differing number of rows: 100, 0 

Ideally, the result of my query would be a new data frame with the three columns SpeedUp, Benchmark and Configuration. Based on this, I could then calculate means, confidence intervals, etc.

But at the moment, the main problem is how to express such a normalization. For another dataset, I was able to calculate a normalized value with something like Time.norm = Time / Time[NumCores == min(NumCores)], but it looks like that worked by chance; at least I don't understand the difference.

Any clues? (Especially the correct terminology for finding solutions to such problems.)

Edit: thanks to Chase's hint, here is a minimal data set that should be structurally identical to what I have, and it exhibits the same behavior with respect to the query above.

 bench <- structure(list(Time = c(399.04, 388.069, 401.072, 361.646), Benchmark = structure(c(1L, 1L, 1L, 1L), .Label = c("Fibonacci"), class = "factor"), Configuration = structure(c(1L, 1L, 1L, 1L), .Label = c("native"), class = "factor"), Approach = structure(c(1L, 1L, 2L, 2L), .Label = c("appr1", "appr2"), class = "factor")), .Names = c("Time", "Benchmark", "Configuration", "Approach"), row.names = c(NA, 4L), class = "data.frame") 
2 answers

Looks like I'm still missing quite a few basic concepts in R.

The solution lies in the formula used: ~ Benchmark + Configuration + Approach groups the data by all three dimensions, which is not what I actually want. Each resulting subset contains the data of only one approach, for example only "appr1", so there is nothing to relate it to: Time[Approach == "appr2"] is empty there, which is where the "100, 0" in the error message comes from.
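To illustrate where the "differing number of rows" error comes from, here is a minimal base R sketch. It assumes that transform ends up building a data frame from the original columns plus the new one, and it uses a hypothetical two-row group (values taken from the minimal data set) standing in for one of the 100-row subsets:

```r
# One subset as produced when Approach is part of the grouping:
# it holds "appr1" rows only (two rows here instead of 100).
group <- data.frame(
  Time     = c(399.04, 388.069),
  Approach = factor(c("appr1", "appr1"), levels = c("appr1", "appr2"))
)

# No row in this subset satisfies the condition, so the subset is empty.
denominator <- with(group, Time[Approach == "appr2"])
length(denominator)   # 0

# Arithmetic with a zero-length vector yields a zero-length result.
ratio <- group$Time / denominator
length(ratio)         # 0

# A zero-length column cannot be recycled to the rows of the subset,
# so building the data frame fails; the message is captured here.
result <- tryCatch(
  data.frame(group, Ratio = ratio),
  error = function(e) conditionMessage(e)
)
print(result)
```

A length-one denominator, by contrast, is recycled without complaint, which may be why a similar expression worked on a data set whose grouping did not split on the column used in the condition.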

Thus, changing the formula to ~ Benchmark + Configuration yields subsets that contain both the "appr1" and the "appr2" data for all time measurements. And then it works as intended :)
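As a sketch of that change, assuming the plyr package is installed and using the four-row data set from the question (rebuilt here with data.frame() for readability), the corrected call could look like this:

```r
library(plyr)

# Minimal data set from the question, rebuilt for readability.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor(rep("Fibonacci", 4)),
  Configuration = factor(rep("native", 4)),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# Group by Benchmark and Configuration only, so that every subset
# contains the rows of both approaches.
normalized_data <- ddply(bench, ~ Benchmark + Configuration, transform,
                         Ratio = Time / Time[Approach == "appr2"])
print(normalized_data)
```

Note that Time / Time[Approach == "appr2"] pairs rows by position within each subset (the shorter vector is recycled), so this relies on both approaches having the same number of measurements in matching order. The "appr2" rows end up with a Ratio of 1.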


If you try to do this in ddply the way I naively tried at first, you will find that you are only working within individual categories:

  ddply(bench, ~ Benchmark + Configuration + Approach, transform,
        Ratio = Time / mean(Time[Approach == "appr2"]))
  #------------
       Time Benchmark Configuration Approach     Ratio
  1 399.040 Fibonacci        native    appr1       NaN
  2 388.069 Fibonacci        native    appr1       NaN
  3 401.072 Fibonacci        native    appr2 1.0516915
  4 361.646 Fibonacci        native    appr2 0.9483085

Obviously not what was hoped for. You could instead calculate the mean beforehand, outside the ddply call, and use it as the normalization factor:

  # Extract Time as a vector; mean() on a data frame returns NA in current R
  meanappr2 <- mean(subset(bench, Approach == "appr2")$Time)
  ddply(bench, ~ Benchmark + Configuration + Approach, transform,
        Ratio = Time / meanappr2)
  #--------------
       Time Benchmark Configuration Approach     Ratio
  1 399.040 Fibonacci        native    appr1 1.0463631
  2 388.069 Fibonacci        native    appr1 1.0175950
  3 401.072 Fibonacci        native    appr2 1.0516915
  4 361.646 Fibonacci        native    appr2 0.9483085

If, on the other hand, you do not need row-by-row normalization but rather a comparison across groups, pass summarise to the *ply call:

  ddply(bench, ~ Benchmark + Configuration + Approach, summarise,
        Ratio = mean(Time) / meanappr2)
  #-----------
    Benchmark Configuration Approach    Ratio
  1 Fibonacci        native    appr1 1.031979
  2 Fibonacci        native    appr2 1.000000
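The question asks for a data frame with the columns SpeedUp, Benchmark and Configuration. One possible sketch, assuming plyr is installed and that a ratio of group means is an acceptable definition of speedup (a mean of pairwise ratios would be a different choice), computes the factor inside each Benchmark/Configuration subset instead of once globally. The direction "appr2" / "appr1" follows the wording of the question:

```r
library(plyr)

# Minimal data set from the question, rebuilt for readability.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor(rep("Fibonacci", 4)),
  Configuration = factor(rep("native", 4)),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# One row per Benchmark/Configuration pair; both means are taken
# within the subset, so each pair gets its own normalization.
speedup <- ddply(bench, ~ Benchmark + Configuration, summarise,
                 SpeedUp = mean(Time[Approach == "appr2"]) /
                           mean(Time[Approach == "appr1"]))
print(speedup)
```

Because the means are computed per subset, this also behaves sensibly once bench contains more than one benchmark or configuration, whereas a single global meanappr2 would mix them.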

Source: https://habr.com/ru/post/896081/

