Financial data - R data.table - conditions group

Given the following data.table with financial data:

 userId systemBankId accountId valueDate quantity description 871 0065 6422 2013-02-28 -52400 AMORTIZACION PRESTAMO 871 0065 6422 2013-03-28 -52400 AMORTIZACION PRESTAMO 871 0065 6422 2013-04-01 -3000000 AMORTIZACION PRESTAMO 871 0065 6422 2013-04-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-05-31 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-06-28 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-07-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-08-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-09-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-10-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-11-29 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2013-12-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-01-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-02-28 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-03-31 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-04-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-05-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-06-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-07-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-08-29 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-09-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-10-30 -52349 AMORTIZACION PRESTAMO 871 0065 6422 2014-11-28 -52349 AMORTIZACION PRESTAMO 

I want to group by userId , systemBankId , accountId and quantity :

 dt[userId==871L,.N,by=.(userId,systemBankId,accountId,quantity)] 

The result is the following:

  userId systemBankId accountId quantity N 871 0065 6422 -52400 3 871 0065 6422 -3000000 1 871 0065 6422 -52349 20 

But, the first and third are the same transaction: a mortgage, and the second - a loan.

I want to group as follows:

 userId systemBankId accountId quantity N 871 0065 6422 -XXXXX 23 871 0065 6422 -3000000 1 

So, you can see that after 24 months this user has 23 mortgage transactions and 1 payment for a loan transaction.

Question: is there an easy way to do this? (I.e):

 dt[userId==871L,.N,by=.(userId,systemBankId,accountId,(quantity %between% c(-quantity*0.20,quantity*0,20 ))] 

For payments between ranges [-20%, 20%] are considered equal.

Thank you for the advanced.

Sincerely.


To get a data frame from the above data:

 structure(list(userId = c(871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L), systemBankId = c(65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L), accountId = c(6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L), valueDate = structure(c(2L, 4L, 1L, 10L, 23L, 5L, 14L, 16L, 17L, 19L, 8L, 21L, 9L, 3L, 22L, 11L, 12L, 13L, 15L, 7L, 18L, 20L, 6L), .Label = c("01/04/2013", "28/02/2013", "28/02/2014", "28/03/2013", "28/06/2013", "28/11/2014", "29/08/2014", "29/11/2013", "30/01/2014", "30/04/2013", "30/04/2014", "30/05/2014", "30/06/2014", "30/07/2013", "30/07/2014", "30/08/2013", "30/09/2013", "30/09/2014", "30/10/2013", "30/10/2014", "30/12/2013", "31/03/2014", "31/05/2013"), class = "factor"), quantity = c(-52400L, -52400L, -3000000L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L ), description = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "AMORTIZACION PRESTAMO", class = "factor")), .Names = c("userId", "systemBankId", "accountId", "valueDate", "quantity", "description" ), class = "data.frame", row.names = c(NA, -23L)) 

UPDATE

The final step is to note in the original data set transactions that represent mortgage payments, as well as loan payments.

Based on the answers, I do:

a) criterion: during a 24-month period, if there are twenty or more recurring transactions by userId, systemBankId, accountId, quantity (-20%, 20%), they are mortgage payments:

 tmp <- dt[userId==871L,.N,by=.(userId,systemBankId,accountId,round(quantity * 5, -floor(log10(abs(quantity))))/5)][N>20,list(userId,systemBankId,accountId,round,N)] userId systemBankId accountId round N 871 0065 6422 -52000 23 

I know that there are 23 mortgage transactions.

b) I need to identify these 23 transactions:

 tmp2 <- dt[userId==871L,list(userId,systemBankId,accountId,round=round(quantity * 5, -floor(log10(abs(quantity))))/5)] merge(tmp,tmp2,by=c('userId','systemBankId','accountId','round')) userId systemBankId accountId round N 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 871 0065 6422 -52000 23 userId systemBankId accountId round N 

Good. I defined 23 transactions, but if I have a transaction with an amount equal to -52000, this will be noted, as well as a mortgage payment.

My question is: based on the periodic payment criterion How do I determine the correct transactions.

thanks in advanced mode.

+6
source share
3 answers

Perhaps smth in these lines:

 dt[, round.qty := quantity[1] * round(quantity/quantity[1]), by = .(userId, systemBankId, accountId)] dt[, .N, by = .(userId, systemBankId, accountId, round.qty)] # userId systemBankId accountId round.qty N #1: 871 65 6422 -52400 22 #2: 871 65 6422 -2986800 1 
+4
source

Here's a quick hack with dplyr :

 library(dplyr) setDF(dt) %>% mutate(quantity = round(quantity/10000, 0)) %>% group_by(userId, systemBankId, accountId, quantity) %>% tally() 

What gives:

 #Source: local data frame [2 x 5] #Groups: userId, systemBankId, accountId # # userId systemBankId accountId quantity n #1 871 65 6422 -300 1 #2 871 65 6422 -5 22 

Edit

As mentioned by David in the comments, this answer is a simplification. A more consistent approach would be something like Roland's suggestion:

 library(dplyr) setDF(dt) %>% mutate(quantity = round(quantity * 5, -floor(log10(abs(quantity))))/5) %>% group_by(userId, systemBankId, accountId, quantity) %>% tally() 

Or using data.table :

 dt[userId == 871L, .N, by = .(userId, systemBankId, accountId, quantity = round(quantity * 5, -floor(log10(abs(quantity))))/5)] 
+4
source

Here is the smart function that matters created by @dnlbrky here: Use the value from the previous line in calculating the Rd table

 #Create a function to return previous rows rowShift <- function(x, shiftLen = 1L) { r <- (1L + shiftLen):(length(x) + shiftLen) r[r<1] <- NA return(x[r]) } 

This will allow you to define your 20% ranges:

 dt$prev_qty_low <-rowShift(dt$quantity,-1) * .8 dt$prev_qty_high <-rowShift(dt$quantity,-1) * 1.2 
0
source

Source: https://habr.com/ru/post/983602/


All Articles