How to calculate the ratio of data points, i.e. combine them based on some criterion?

Question

(Unfortunately, I lack the basic vocabulary to formulate my question. Therefore, please correct me where more precise terms are useful.)

I use R to do a very simple statistical analysis for virtual machine test results, and I often want to normalize my data based on some criterion.

My current problem is that I would like something like the following:

normalized_data <- ddply(bench, ~ Benchmark + Configuration + Approach, transform, Ratio = Time / Time[Approach == "appr2"])

So what I really want is to calculate the speedup between the corresponding pairs of measurements.

bench is a data frame with the columns Time, Benchmark, Configuration and Approach, and it contains 100 measurements for each possible combination of Benchmark, Configuration and Approach. I have exactly two approaches and want the speedup "appr2" / "appr1". Thus, looking at just one specific benchmark and one specific configuration, I have 100 measurements for "appr1" and 100 for "appr2" in my data frame. However, R gives me the following error for this query:

 Error in data.frame(list(Time = c(405.73, 342.616, 404.484, 328.742, 403.384, : arguments imply differing number of rows: 100, 0

Ideally, the result of my query will lead to the creation of a new data frame with three columns SpeedUp, Benchmark, Configuration. Based on this, I could calculate the means, confidence intervals, etc.

But at the moment, the main problem is how to express such a normalization. For another data set, I was able to calculate a normalized value in a similar way with Time.norm = Time / Time[NumCores == min(NumCores)], but it looks like that worked by chance; at least I don't understand the difference.

Any hints? (Especially the correct terminology for finding solutions to such problems.)

Edit: thanks to Chase's hint, here is a minimal data set that should be structurally identical to what I have, and it exhibits the same behavior with respect to the above query.

 bench <- structure(list(Time = c(399.04, 388.069, 401.072, 361.646), Benchmark = structure(c(1L, 1L, 1L, 1L), .Label = c("Fibonacci"), class = "factor"), Configuration = structure(c(1L, 1L, 1L, 1L), .Label = c("native"), class = "factor"), Approach = structure(c(1L, 1L, 2L, 2L), .Label = c("appr1", "appr2"), class = "factor")), .Names = c("Time", "Benchmark", "Configuration", "Approach"), row.names = c(NA, 4L), class = "data.frame")

+6

r

smarr Aug 28 '11 at 16:19

2 answers

If you try to do this in ddply the way I naively tried at first, you will find that you are only working within the individual categories:

  ddply(bench, ~ Benchmark + Configuration + Approach, transform,
        Ratio = Time / mean(Time[Approach == "appr2"]))
  #------------
       Time Benchmark Configuration Approach     Ratio
  1 399.040 Fibonacci        native    appr1       NaN
  2 388.069 Fibonacci        native    appr1       NaN
  3 401.072 Fibonacci        native    appr2 1.0516915
  4 361.646 Fibonacci        native    appr2 0.9483085

Obviously not what was hoped for. Instead, you can calculate the mean outside the ddply call and use it as a normalization factor:

  # Take the Time column as a vector; mean() of a whole data frame returns NA in current R
  meanappr2 <- mean(subset(bench, Approach == "appr2")$Time)
  ddply(bench, ~ Benchmark + Configuration + Approach, transform,
        Ratio = Time / meanappr2)
  #--------------
       Time Benchmark Configuration Approach     Ratio
  1 399.040 Fibonacci        native    appr1 1.0463631
  2 388.069 Fibonacci        native    appr1 1.0175950
  3 401.072 Fibonacci        native    appr2 1.0516915
  4 361.646 Fibonacci        native    appr2 0.9483085

If, on the other hand, you do not need to normalize row by row but rather want to compare across groups, use the summarise function inside the *ply call:

  ddply(bench, ~ Benchmark + Configuration + Approach, summarise,
        Ratio = mean(Time) / meanappr2)
  #-----------
    Benchmark Configuration Approach    Ratio
  1 Fibonacci        native    appr1 1.031979
  2 Fibonacci        native    appr2 1.000000

0

42- Aug 28 '11 at 18:31

smarr · Accepted Answer · 2011-08-29T07:13:10+0000

Looks like I'm still missing quite a few basic concepts in R.

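The solution lies in the formula used: ~ Benchmark + Configuration + Approach groups the data along all three dimensions, and this is not what I really need. Each resulting group contained the data of only one approach, for example only "appr1", so there was nothing it could be related to. Inside such a group, Time[Approach == "appr2"] is empty, which is where the "differing number of rows: 100, 0" error comes from.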
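The empty subset can be reproduced with base R alone. This is a minimal sketch, assuming the four-row bench data from the question (rebuilt here with data.frame() for brevity); the names group and denominator are only illustrative.

```r
# Minimal data set from the question, rebuilt with data.frame()
bench <- data.frame(
  Time = c(399.04, 388.069, 401.072, 361.646),
  Benchmark = factor("Fibonacci"),
  Configuration = factor("native"),
  Approach = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# One of the groups produced by ~ Benchmark + Configuration + Approach
group <- subset(bench, Approach == "appr1")

# This group has no "appr2" rows, so the denominator is an empty vector
denominator <- group$Time[group$Approach == "appr2"]
length(denominator)

# Dividing by an empty vector yields an empty vector,
# which cannot fill a column with nrow(group) rows
length(group$Time / denominator)
```

Both length() calls return 0, while the group itself has 2 rows (100 in the full data set).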

Thus, changing the formula to ~ Benchmark + Configuration results in groups that contain both the "appr1" and the "appr2" data for all time measurements. And then it works as intended :)
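A sketch of the corrected call on the minimal data set, assuming the plyr package is installed. The variable name normalized is only illustrative. Note that the division relies on R's vector recycling: it pairs the i-th "appr1" row with the i-th "appr2" row only if, within each group, all "appr1" rows come before the "appr2" rows and both approaches have the same number of measurements.

```r
library(plyr)

# Minimal data set from the question, rebuilt with data.frame()
bench <- data.frame(
  Time = c(399.04, 388.069, 401.072, 361.646),
  Benchmark = factor("Fibonacci"),
  Configuration = factor("native"),
  Approach = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# Group by Benchmark and Configuration only, so every group holds both approaches
normalized <- ddply(bench, ~ Benchmark + Configuration, transform,
                    Ratio = Time / Time[Approach == "appr2"])

# The "appr2" rows are divided by themselves and give 1;
# the "appr1" rows carry the pairwise appr1 / appr2 ratios
subset(normalized, Approach == "appr1")
```

If the rows are not ordered this way, sorting by Approach within each group first, or dividing the two subsets explicitly, would be the safer choice.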
