Edit: I am now looking for a solution to this issue with other programming languages.
Based on another question that I asked , I have such a data set (for R-users, dput for this below) that represents user computer sessions:
username machine start end 1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27 2 user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16 3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17 4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15 5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16 6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23 7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22 8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59 9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43 10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53 11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22 12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43 13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38 14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43 15 USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47 >
For the same user from the same computer there can be several parallel (overlapping in time) sessions. How to delete these rows so that only one session is left behind this data? The source dataset is approx. 500,000 rows.
Expected Result (Lines 2, 15 Deleted)
username machine start end 1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27 3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17 4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15 5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16 6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23 7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22 8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59 9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43 10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53 11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22 12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43 13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38 14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43 >
Here is the dataset:
structure(list(username = c("user1", "user1", "user1", "user1", "user1", "user1", "user1", "user1", "user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2" ), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com", "D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com", "D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com", "D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com", "D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com", "D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com" ), class = "factor"), start = structure(c(1294040658, 1294040789, 1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561, 1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580, 1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1294040847, 1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342, 1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878, 1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("username", "machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")