Cross Data Comparison in R

I have a dataset of origin-destination pairs with some related variables. It looks something like this:

    "Origin","Destination","distance","volume"
    "A01"     "A01"          0.0        10
    "A02"     "A01"          1.2         9
    "A03"     "A01"          1.4        15 
    "A01"     "A02"          1.2        16

Then, for each origin-destination pair, I want to calculate additional variables based on the data in that row and in selected other rows. For example: how many other origins sending traffic to the same destination have a volume greater than that of the focal pair? In this example, for destination A01, I would get the following:

    "Origin","Destination","distance","volume","greater_flow"
    "A01"    "A01"            0.0        10         1
    "A02"    "A01"            1.2         9         2
    "A03"    "A01"            1.4        15         0

I am trying to work something out with group_by and apply, but I can't figure out how to: a) "fix" the value I want to use as a reference (e.g. the volume from A01 to A01), b) limit the comparison to rows with the same destination (A01), and c) repeat this for all origin-destination pairs.

2 answers

Here is an answer using base R:

d <- data.frame(Origin = c("A01", "A02", "A03", "A01"), Destination = c("A01", "A01", "A01", "A02"), distance = c(0.0, 1.2, 1.4, 1.2), volume = c(10, 9, 15, 16))

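# Extract the entries with destination A01
d2 <- d[d[, "Destination"] == "A01", ]

# For each row, count the rows in d2 whose volume is strictly greater.
# Iterate over the numeric column directly: apply() on a data frame with
# character columns converts every value to a string before comparing.
greater_flow <- sapply(d2[, "volume"], function(v) sum(d2[, "volume"] > v))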

# sticking things back together
data.frame(d2, greater_flow)

#  Origin Destination distance volume greater_flow
# 1    A01         A01      0.0     10            1
# 2    A02         A01      1.2      9            2
# 3    A03         A01      1.4     15            0

If you need to perform the calculation for every destination, you can loop over unique(d[, "Destination"]):

output <- lapply(unique(d[, "Destination"]), FUN = function(dest) {
    d2 <- d[d[, "Destination"] == dest, ]
    # count the rows with the same destination and a strictly greater volume
    greater_flow <- sapply(d2[, "volume"], function(v) sum(d2[, "volume"] > v))
    data.frame(d2, greater_flow)
})

You can combine the results into a single data frame, if necessary, with do.call(rbind, output), where output is the list returned by lapply.
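As a minimal end-to-end sketch of the steps above, assuming the four example rows from the question: the names output and result are chosen here for illustration only.

```r
# Example data from the question
d <- data.frame(
  Origin      = c("A01", "A02", "A03", "A01"),
  Destination = c("A01", "A01", "A01", "A02"),
  distance    = c(0.0, 1.2, 1.4, 1.2),
  volume      = c(10, 9, 15, 16)
)

# One data frame per destination, each with its own greater_flow column
output <- lapply(unique(d$Destination), function(dest) {
  d2 <- d[d$Destination == dest, ]
  # number of rows in this destination group with a strictly greater volume
  d2$greater_flow <- sapply(d2$volume, function(v) sum(d2$volume > v))
  d2
})

# Stack the per-destination pieces back into a single data frame
result <- do.call(rbind, output)
rownames(result) <- NULL
print(result)
```

The single A01-to-A02 row gets greater_flow = 0, since it has no other origins to compare against.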

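library(plyr)
# Per destination: rank volumes in descending order; rank - 1 is the number of
# strictly greater volumes (tied volumes share the lowest rank)
Fun <- function(x) { x$greater_flow <- rank(-x$volume, ties.method = "min") - 1; x }
ddply(d, ~ Destination, .fun = Fun)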
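A base R sketch of the same per-destination ranking idea without plyr, assuming the example data from the question and that tied volumes should not count as greater. ave() applies a function within each group and returns the values in the original row order, so nothing needs to be reassembled afterwards.

```r
# Example data from the question
d <- data.frame(
  Origin      = c("A01", "A02", "A03", "A01"),
  Destination = c("A01", "A01", "A01", "A02"),
  distance    = c(0.0, 1.2, 1.4, 1.2),
  volume      = c(10, 9, 15, 16)
)

# Within each destination, rank(-v, ties.method = "min") - 1 equals the
# number of volumes strictly greater than v
d$greater_flow <- ave(
  d$volume, d$Destination,
  FUN = function(v) rank(-v, ties.method = "min") - 1
)
print(d)
```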

Source: https://habr.com/ru/post/1612437/

