Grouping string fields in R

I have a data frame like this:

        date       time     userid status
     1  02/25/2012 09:22:10 aabc   logged_in
     2  02/25/2012 09:30:10 aabc   logged_out
     3  02/25/2012 09:29:20 abbc   logged_out
     4  02/25/2012 09:27:30 abc    logged_in
     5  02/25/2012 09:26:29 abc    login_failed
     6  02/25/2012 09:26:39 abc    login_failed
     7  02/25/2012 09:26:52 abc    login_failed
     8  02/25/2012 09:27:09 abc    login_failed
     9  02/25/2012 09:27:20 abc    login_failed
    10  02/25/2012 09:24:10 abdc   logged_in
    11  02/25/2012 09:24:12 abdc   logged_out
    12  02/25/2012 09:22:10 abhc   logged_in
    13  02/25/2012 09:30:10 abuc   logged_in
    14  02/25/2012 09:30:14 abuc   logged_out
    15  02/25/2012 09:29:40 baa    logged_in

I want the user ID, status and count of failed logins for each user. I have tried this:

    ddply(mytbl, c('userid', 'status'), function(x) c(count=nrow(x)))

but it gives a count for every status of every user. I want to limit the output to only those rows whose status is "login_failed". Any ideas? I have seen questions about grouping by numeric fields, but none about strings.

I am not very familiar with all the functions of plyr. It would be great to see how this can be done using summarise, aggregate, sqldf, data.table, etc. I am slowly getting to understand each of them.

Thanks, Sri

4 answers
    require(data.table)
    DT = as.data.table(mytbl)
    DT[status=="login_failed", .N, by=userid]

To name a column:

 DT[status=="login_failed", list(failed_logins=.N), by=userid] 
 ddply(mytbl, .(userid), transform, failed_logins = length(which(status=="login_failed"))) 

Following Brian Diggs's example: I wrote the above because I assumed you want this information added to the original dataset. If not, and you just need a summary, replace transform with summarise.


A slightly different approach from @Maiasaura's: I filter down to the failed logins first and then summarise. The difference is whether userids that have logins but no failed logins appear in the final result (with a count of 0) or not.

 ddply(mytbl[mytbl$status=="login_failed",], .(userid), summarise, failed_logins=length(status)) 

This gives

    > ddply(mytbl[mytbl$status=="login_failed",], .(userid),
    +       summarise, failed_logins=length(status))
      userid failed_logins
    1    abc             5

For completeness, if you want every userid to appear:

 ddply(mytbl, .(userid), summarise, failed_logins = sum(status=="login_failed")) 

which gives

    > ddply(mytbl, .(userid),
    +       summarise, failed_logins = sum(status=="login_failed"))
      userid failed_logins
    1   aabc             0
    2   abbc             0
    3    abc             5
    4   abdc             0
    5   abhc             0
    6   abuc             0
    7    baa             0
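The same distinction can be sketched in base R, without plyr. This is only an illustration: the small data frame below is a hypothetical miniature of the question's table, and `failed_only` and `all_users` are names made up for this example.

```r
# Hypothetical miniature of the question's data (not the full table).
mytbl <- data.frame(
  userid = c("aabc", "aabc", "abc", "abc", "abc"),
  status = c("logged_in", "logged_out", "logged_in",
             "login_failed", "login_failed"),
  stringsAsFactors = FALSE
)

# Filter first, then count: users without a failed login drop out.
failed_only <- table(mytbl$userid[mytbl$status == "login_failed"])

# Count over all rows: every user is kept, with 0 where nothing failed.
all_users <- tapply(mytbl$status == "login_failed", mytbl$userid, sum)

print(failed_only)
print(all_users)
```

Which form is preferable depends on whether a zero count is meaningful for the report you are building.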

Here is a base R solution using aggregate():

    setNames(aggregate(status ~ userid,
                       mytbl[mytbl$status == "login_failed", ],
                       function(x) length(x)),
             c("userid", "failed_logins"))
    #   userid failed_logins
    # 1    abc             5

Update

Another useful function that comes to mind is ave(), which you can use as follows:

  • First, use ave() to add a new column to your dataset that holds a running counter for each status of each user. (Note: I had to make sure that the "userid" and "status" columns were of class character, not factor, to make this work for me.)

     mytbl$status_seq <- ave(mytbl$status, mytbl$userid, mytbl$status, FUN = seq_along)
     head(mytbl)
     #         date     time userid       status status_seq
     # 1 02/25/2012 09:22:10   aabc    logged_in          1
     # 2 02/25/2012 09:30:10   aabc   logged_out          1
     # 3 02/25/2012 09:29:20   abbc   logged_out          1
     # 4 02/25/2012 09:27:30    abc    logged_in          1
     # 5 02/25/2012 09:26:29    abc login_failed          1
     # 6 02/25/2012 09:26:39    abc login_failed          2
  • Second, use aggregate() as shown above, subsetting on the status you are interested in, and retrieve the max value.

     aggregate(status_seq ~ userid, mytbl[mytbl$status == "login_failed", ], function(x) max(x))
     #   userid status_seq
     # 1    abc          5
     aggregate(status_seq ~ userid, mytbl[mytbl$status == "logged_out", ], function(x) max(x))
     #   userid status_seq
     # 1   aabc          1
     # 2   abbc          1
     # 3   abdc          1
     # 4   abuc          1

Note that ave() becomes even more useful if you use

 mytbl$status_seq <- ave(mytbl$status, mytbl$date, mytbl$userid, mytbl$status, FUN = seq_along) 

as this will reset the counter for each new day in your dataset.
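To see the daily reset in isolation, here is a minimal sketch on hypothetical data (one user with failed logins on two different days; the second date is invented for the example). It passes an integer vector as the first argument to ave() rather than the status column, which is an assumption of this sketch and not what the answer above does; the point is only to keep the counter numeric.

```r
# Hypothetical data: one user failing to log in on two different days.
mytbl <- data.frame(
  date   = c("02/25/2012", "02/25/2012", "02/26/2012"),
  userid = c("abc", "abc", "abc"),
  status = c("login_failed", "login_failed", "login_failed"),
  stringsAsFactors = FALSE
)

rows <- seq_len(nrow(mytbl))

# Grouped by user and status only: the counter runs on across days.
overall <- ave(rows, mytbl$userid, mytbl$status, FUN = seq_along)

# Adding date as a grouping variable restarts the counter each day.
per_day <- ave(rows, mytbl$date, mytbl$userid, mytbl$status, FUN = seq_along)

print(overall)
print(per_day)
```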

Finally (at the risk of sharing a solution that might be too obvious), since you are only interested in counts, you can look at table(), which gives you all the information at once:

    table(mytbl$userid, mytbl$status)
    #
    #        logged_in logged_out login_failed
    #   aabc         1          1            0
    #   abbc         0          1            0
    #   abc          1          0            5
    #   abdc         1          1            0
    #   abhc         1          0            0
    #   abuc         1          1            0
    #   baa          1          0            0
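If only the failed logins are needed, one column of that table can be pulled out by name. A minimal sketch, again on a hypothetical miniature of the data; `counts` and `failed` are names chosen for the example.

```r
# Hypothetical miniature of the question's data (not the full table).
mytbl <- data.frame(
  userid = c("aabc", "aabc", "abc", "abc", "abc"),
  status = c("logged_in", "logged_out", "logged_in",
             "login_failed", "login_failed"),
  stringsAsFactors = FALSE
)

# Cross-tabulate users against statuses.
counts <- table(mytbl$userid, mytbl$status)

# Select the login_failed column: a named vector, one entry per user.
failed <- counts[, "login_failed"]

# Keep only users with at least one failed login.
print(failed[failed > 0])
```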

Source: https://habr.com/ru/post/1437650/

