Delete rows from data: overlapping time intervals?

Edit: I am now also looking for solutions to this issue in other programming languages.

Based on another question that I asked, I have the following data set (for R users, the dput output is below) that represents user computer sessions:

    username          machine               start                 end
 1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
 2     user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
 3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
 4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
 5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
 6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
 7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
 8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
 9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
 10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
 11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
 12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
 13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
 14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
 15    USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47

For the same user on the same computer there can be several parallel (overlapping in time) sessions. How can I delete rows so that only one session remains for each overlapping group? The source data set is approx. 500,000 rows.

Expected Result (Lines 2, 15 Deleted)

    username          machine               start                 end
 1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
 3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
 4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
 5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
 6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
 7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
 8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
 9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
 10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
 11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
 12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
 13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
 14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43

Here is the dataset:

 structure(list(
   username = c("user1", "user1", "user1", "user1", "user1", "user1",
                "user1", "user1", "user1", "USER2", "USER2", "USER2",
                "USER2", "USER2", "USER2"),
   machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L,
                         3L, 3L, 3L, 3L),
     .Label = c("D5599.domain.com", "D5645.domain.com", "D5682.domain.com",
                "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
                "D5772.domain.com", "D5772.domain.com", "D5847.domain.com",
                "D5855.domain.com", "D5871.domain.com", "D5927.domain.com",
                "D5927.domain.com", "D5952.domain.com", "D5993.domain.com",
                "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
                "D5688.domain.com", "D5815.domain.com", "D6106.domain.com",
                "D6128.domain.com"),
     class = "factor"),
   start = structure(c(1294040658, 1294040789, 1294056456, 1294232597,
                       1297686819, 1298462070, 1300695018, 1307603561,
                       1294049025, 1294835165, 1295269579, 1295356050,
                       1295961655, 1297688580, 1297688363),
     class = c("POSIXct", "POSIXt"), tzone = ""),
   end = structure(c(1294040847, 1294042156, 1294059377, 1294233795,
                     1297687216, 1298462303, 1300696342, 1307606339,
                     1294050583, 1294835573, 1295271862, 1295358163,
                     1295961878, 1297688863, 1297689287),
     class = c("POSIXct", "POSIXt"), tzone = "")),
   .Names = c("username", "machine", "start", "end"),
   row.names = c(NA, 15L), class = "data.frame")

Try the intervals package:

 library(intervals)
 f <- function(dd) with(dd, {
   r <- reduce(Intervals(cbind(start, end)))
   data.frame(username = username[1],
              machine = machine[1],
              start = structure(r[, 1], class = class(start)),
              end = structure(r[, 2], class = class(end)))
 })
 do.call("rbind", by(d, d[1:2], f))

With the sample data, this reduces the 15 rows to the following 13 rows (by merging rows 1 and 2, and rows 14 and 15, of the original data frame):

    username          machine               start                 end
 1     user1 D5599.domain.com 2011-01-03 02:44:18 2011-01-03 03:09:16
 2     user1 D5599.domain.com 2011-01-03 07:07:36 2011-01-03 07:56:17
 3     user1 D5599.domain.com 2011-01-05 08:03:17 2011-01-05 08:23:15
 4     user1 D5599.domain.com 2011-02-14 07:33:39 2011-02-14 07:40:16
 5     user1 D5599.domain.com 2011-02-23 06:54:30 2011-02-23 06:58:23
 6     user1 D5599.domain.com 2011-03-21 04:10:18 2011-03-21 04:32:22
 7     user1 D5645.domain.com 2011-06-09 03:12:41 2011-06-09 03:58:59
 8     user1 D5682.domain.com 2011-01-03 05:03:45 2011-01-03 05:29:43
 9     USER2 D5682.domain.com 2011-01-12 07:26:05 2011-01-12 07:32:53
 10    USER2 D5682.domain.com 2011-01-17 08:06:19 2011-01-17 08:44:22
 11    USER2 D5682.domain.com 2011-01-18 08:07:30 2011-01-18 08:42:43
 12    USER2 D5682.domain.com 2011-01-25 08:20:55 2011-01-25 08:24:38
 13    USER2 D5682.domain.com 2011-02-14 07:59:23 2011-02-14 08:14:47

One solution is to first split the intervals so that any two are either identical or non-overlapping (never partially overlapping), and then remove duplicates. The problem is that we are left with many small adjacent intervals, and merging them back together does not look easy.

 library(plyr)   # for ddply
 library(sqldf)
 d$machine <- as.character(d$machine)  # Duplicated levels...
 ddply(d, c("username", "machine"), function (u) {
   # For each username and machine,
   # compute all the possible non-overlapping intervals
   intervals <- sort(unique(c(u$start, u$end)))
   intervals <- data.frame(start = intervals[-length(intervals)],
                           end = intervals[-1])
   # Only retain those actually in the data
   u <- sqldf("
     SELECT DISTINCT u.username, u.machine, intervals.start, intervals.end
     FROM u, intervals
     WHERE u.start <= intervals.start
       AND intervals.end <= u.end
   ")
   # We have non-overlapping, but potentially abutting intervals:
   # ideally, we should merge them, but I do not see an easy
   # way to do so.
   u
 })

EDIT: Another, conceptually cleaner solution, which avoids the problem of unmerged adjacent intervals, is to count the number of open sessions for each user and machine: when the count rises above zero, the user opens one or more sessions; when it drops back to zero, the user has closed all of their sessions.

 library(plyr)   # for ddply
 ddply(d, c("username", "machine"), function (u) {
   a <- rbind(
     data.frame(time = min(u$start) - 1, sessions = 0),
     data.frame(time = u$start, sessions = 1),
     data.frame(time = u$end, sessions = -1)
   )
   a <- a[order(a$time), ]
   a$sessions <- cumsum(a$sessions)
   a$previous <- c(0, a$sessions[-nrow(a)])
   a <- a[a$previous == 0 & a$sessions > 0 |
          a$previous > 0 & a$sessions == 0, ]
   a$previous_time <- a$time
   a$previous_time[-1] <- a$time[-nrow(a)]
   a <- a[a$previous > 0 & a$sessions == 0, ]
   data.frame(username = u$username[1],
              machine = u$machine[1],
              start = a$previous_time,
              end = a$time)
 })
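The same session-counting idea can be sketched in Python: emit +1 at each start and -1 at each end, sort by time, and take a running total; each stretch where the total leaves zero and returns to zero is one merged session. (Hypothetical epoch-second data for a single user/machine, not the poster's data; at equal timestamps this sketch processes ends before starts, so abutting sessions stay separate.)

```python
# Hypothetical sessions for one user/machine, as (start, end) in seconds.
events = []
for start, end in [(0, 10), (5, 8), (20, 30)]:
    events.append((start, 1))    # a session opens
    events.append((end, -1))     # a session closes
events.sort()

merged, open_sessions, current_start = [], 0, None
for time, delta in events:
    if open_sessions == 0 and delta == 1:
        current_start = time                  # first session opens
    open_sessions += delta
    if open_sessions == 0:
        merged.append((current_start, time))  # all sessions closed

print(merged)  # [(0, 10), (20, 30)]
```

The contained session (5, 8) never brings the count back to zero, so it disappears into the enclosing interval.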

An alternative solution uses the interval class from lubridate.

 library(lubridate)
 int <- with(d, new_interval(start, end))

Now we need a function to check for overlap; see "Determine if two date ranges overlap".

 int_overlaps <- function(int1, int2) {
   (int_start(int1) <= int_end(int2)) & (int_start(int2) <= int_end(int1))
 }

Now call this for all pairs of intervals.

 index <- combn(seq_along(int), 2)
 overlaps <- int_overlaps(int[index[1, ]], int[index[2, ]])

The overlapping rows:

 int[index[1, overlaps]]
 int[index[2, overlaps]]

And the rows to delete are just index[2, overlaps].
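The pairwise check above can be sketched in Python as well (a hypothetical port of the combn approach, with sessions as plain (start, end) tuples in seconds rather than lubridate intervals):

```python
from itertools import combinations

# Hypothetical sessions as (start, end) pairs in seconds.
sessions = [(0, 100), (50, 150), (200, 300)]

def overlaps(a, b):
    # Two closed intervals overlap iff each starts before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]

# Index pairs of overlapping rows, mirroring combn(seq_along(int), 2).
pairs = [(i, j) for i, j in combinations(range(len(sessions)), 2)
         if overlaps(sessions[i], sessions[j])]

# The second member of each pair is a candidate row to delete.
to_delete = {j for _, j in pairs}
print(pairs)      # [(0, 1)]
print(to_delete)  # {1}
```

Note that, like the R version, this is O(n²) in the number of rows per group, so on 500,000 rows it should only be applied within each user/machine group.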


Pseudo-code solution: O(n log n), or O(n) if the data is already known to be sorted properly.

Start by sorting the data by user, machine, and start time (so that all rows for a given user on a given machine are grouped together, and the rows within each group are in increasing order of start time).

  • Initialize the "working interval" to null / nil / undef / etc.

  • For each row in order:

    • If the working interval exists and belongs to a different user or machine than the current row, output the working interval and clear it.
    • If the working interval exists and its end time is strictly before the start time of the current row, output the working interval and clear it.
    • If the working interval still exists, it must belong to the same user and machine and overlap or abut the current row's interval, so set its end time to the later of its current end time and the current row's end time.
    • Otherwise the working interval does not exist, so set the working interval to the current row's interval.
  • Finally, if a working interval exists, output it.
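The steps above can be sketched in Python (a hypothetical implementation, with rows as (user, machine, start, end) tuples and epoch-second times rather than the poster's data frame):

```python
from itertools import groupby

def merge_sessions(rows):
    """Merge overlapping sessions per (user, machine) with a single sweep."""
    # Sort by user, machine, start time, as the pseudo-code requires.
    rows = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    out = []
    for (user, machine), group in groupby(rows, key=lambda r: (r[0], r[1])):
        work = None  # the "working interval" [start, end]
        for _, _, start, end in group:
            if work is None:
                work = [start, end]
            elif start <= work[1]:
                # Overlapping or abutting: keep the later end time, so a
                # session contained in the working interval cannot shrink it.
                work[1] = max(work[1], end)
            else:
                out.append((user, machine, work[0], work[1]))
                work = [start, end]
        if work is not None:
            out.append((user, machine, work[0], work[1]))
    return out

rows = [
    ("user1", "m1", 0, 10),
    ("user1", "m1", 5, 8),    # contained in the previous session
    ("user1", "m1", 20, 30),
    ("USER2", "m1", 0, 10),
]
print(merge_sessions(rows))
```

Grouping with itertools.groupby plays the role of "output and clear the working interval when the user or machine changes", since each group is processed with a fresh working interval.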


I don't know whether this is what you need, or whether it will work better than what you already have. This is a PowerShell solution that uses a hash table whose keys are a combination of username and computer name. The values are hashes of the session start and end times.

If the key (session) already exists, the script updates the end time. If it does not exist, it creates it, setting both the start and end times. As it encounters new session entries for that user/computer in the log, it keeps updating the session end time.

  $ht = @{}
  Import-Csv <logfile> | ForEach-Object {
    $key = $_.username + $_.computername
    if ($ht.ContainsKey($key)) {
      $ht.$key.end = $_.end
    }
    else {
      $ht.Add($key, @{start = $_.start; end = $_.end})
    }
  }

You will need to split the keys back into username and computer name when this is done.
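For comparison, a rough Python equivalent of this hash-table approach (hypothetical CSV column names; like the PowerShell version, it assumes log rows arrive in time order and simply pushes the end time out for repeated user/machine keys, so it collapses all of a user's sessions on a machine into one):

```python
import csv
import io

# Stand-in for the log file; hypothetical columns and epoch-second times.
log = io.StringIO(
    "username,computername,start,end\n"
    "user1,m1,100,200\n"
    "user1,m1,150,250\n"
)

sessions = {}
for row in csv.DictReader(log):
    key = (row["username"], row["computername"])
    if key in sessions:
        sessions[key]["end"] = row["end"]      # extend the known session
    else:
        sessions[key] = {"start": row["start"], "end": row["end"]}

print(sessions)  # {('user1', 'm1'): {'start': '100', 'end': '250'}}
```

Using a (username, computername) tuple as the key avoids having to split a concatenated string apart afterwards.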


Source: https://habr.com/ru/post/906927/

