I have a large data frame in R, and I want to create several new columns based on existing columns. However, for each row, the new value also depends on some other rows.
Here is some dummy data:
colnames <- c('date', 'docnr', 'clientid', 'values')
docnr <- c(1,2,3,4,5,6)
dates <- c('2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-01-05', '2017-02-05')
clients <- c(1,1,1,1,2,2)
values <- c(10,14,4,7,9,19)
# build the data frame directly; cbind() would coerce every column to character
df <- data.frame(dates, docnr, clients, values, stringsAsFactors = FALSE)
names(df) <- colnames
df$date <- as.Date(df$date, format = "%Y-%m-%d")
df
        date docnr clientid values
1 2017-01-01     1        1     10
2 2017-02-01     2        1     14
3 2017-03-01     3        1      4
4 2017-04-01     4        1      7
5 2017-01-05     5        2      9
6 2017-02-05     6        2     19
For each row (uniquely identified by docnr), I want to take its date and client ID and find all other rows with the same client and an earlier date.
Then I want to calculate some things from this subset, for example the number of rows in the subset and the sum of the values in it.
So, for this example data, I would expect:
        date docnr clientid values counts totals
1 2017-01-01     1        1     10      0      0
2 2017-02-01     2        1     14      1     10
3 2017-03-01     3        1      4      2     24
4 2017-04-01     4        1      7      3     28
5 2017-01-05     5        2      9      0      0
6 2017-02-05     6        2     19      1      9
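One way the calculation described above might be vectorised is with a running count and a running sum per client. This is only a sketch, and it assumes that no client has two rows with the same date (otherwise same-date rows that sort first would be counted as "earlier"). It uses only base R (`order()`, `ave()`, `cumsum()`), and the example data is rebuilt so the snippet runs on its own:

```r
# Rebuild the example data with proper column types.
df <- data.frame(
  date     = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01",
                       "2017-04-01", "2017-01-05", "2017-02-05")),
  docnr    = 1:6,
  clientid = c(1, 1, 1, 1, 2, 2),
  values   = c(10, 14, 4, 7, 9, 19)
)

# Sort so that each client's rows are in date order.
# ASSUMPTION: dates are unique within a client.
df <- df[order(df$clientid, df$date), ]

# Within each client: position minus one = number of earlier rows.
df$counts <- ave(df$values, df$clientid,
                 FUN = function(v) seq_along(v) - 1)

# Within each client: running sum minus the current value = sum of earlier values.
df$totals <- ave(df$values, df$clientid,
                 FUN = function(v) cumsum(v) - v)

df
```

Because each group is processed once after a single sort, this avoids re-filtering the whole data frame for every row.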
I am currently using a for loop:
counts <- numeric(nrow(df))  # preallocate instead of growing inside the loop
totals <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  # rows of the same client with a strictly earlier date
  tmp <- df[df$date < df$date[i] & df$clientid == df$clientid[i],
            c("date", "docnr", "values")]
  counts[i] <- nrow(tmp)
  totals[i] <- sum(tmp$values)
}
df$counts <- counts
df$totals <- totals
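The same logic as the loop can also be written as a self-join, which is roughly what an SQL query would express. The following is a minimal sketch in base R using `merge()` and `aggregate()`. The suffixes `.cur` and `.prev` and the helper objects `pairs`, `cnt` and `tot` are names chosen here for illustration. Unlike a running sum, it handles clients with several rows on the same date, since only strictly earlier dates are kept:

```r
# Rebuild the example data with proper column types.
df <- data.frame(
  date     = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01",
                       "2017-04-01", "2017-01-05", "2017-02-05")),
  docnr    = 1:6,
  clientid = c(1, 1, 1, 1, 2, 2),
  values   = c(10, 14, 4, 7, 9, 19)
)

# Pair every row with every row of the same client (self-join on clientid).
pairs <- merge(df, df, by = "clientid", suffixes = c(".cur", ".prev"))

# Keep only pairs where the second row is strictly earlier than the first.
pairs <- pairs[pairs$date.prev < pairs$date.cur, ]

# Count and sum the earlier rows for each current document.
cnt <- aggregate(values.prev ~ docnr.cur, data = pairs, FUN = length)
tot <- aggregate(values.prev ~ docnr.cur, data = pairs, FUN = sum)

# Attach the results; documents with no earlier rows get NA, which becomes 0.
df$counts <- cnt$values.prev[match(df$docnr, cnt$docnr.cur)]
df$totals <- tot$values.prev[match(df$docnr, tot$docnr.cur)]
df$counts[is.na(df$counts)] <- 0
df$totals[is.na(df$totals)] <- 0

df
```

Note that the intermediate `pairs` table grows with the square of the number of rows per client, so this form trades memory for simplicity and may not suit clients with very many documents.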
, , 700 . ( ). doSNOW
, , .
sql- sqldf
, , , ( ).
SQL ( ?), R sqldf. , .
R ( sql), , - .