Identification of duplicate / unique teams (and restructuring data) in R

Question

I have a dataset that looks like this:

 Person Team
   1     30
   2     30
   3     30
   4     30
   11    40
   22    40
   1     50
   2     50
   3     50
   4     50
   15    60
   16    60
   17    60
   1     70
   2     70
   3     70
   4     70
   11    80
   22    80

My overall goal is to organize team identification codes so that it is easy to see which teams are duplicates of each other and which teams are unique. I want to summarize the data so that it looks like this:

 Team   Duplicate1  Duplicate2
  30        50          70
  40        80  
  60

As you can see, teams 30, 50, and 70 have the same members, so they share a row. Similarly, teams 40 and 80 have the same members, so they share a row. Only team 60 (in this example) is unique.

In situations where teams are duplicated, I don’t care which team identifier goes in which column. In addition, a team may have more than two duplicates. Teams range from 2 to 8 members.

+4

r

waxattax Jan 05 '15 at 21:52

5

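One approach with dplyr: collapse each team's members into a string, then group the teams that share the same string.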

require(dplyr)

df %>%
  arrange(Team, Person) %>%   # this line is necessary in case the rest of your data isn't sorted
  group_by(Team) %>%
  summarize(players = paste0(Person, collapse = ",")) %>%
  group_by(players) %>%
  summarize(teams = paste0(Team, collapse = ",")) %>%
  mutate(
    # sub() is vectorized, so each row is split at its own first comma
    original_team = sub(",.*$", "", teams),
    dup_teams = ifelse(grepl(",", teams), sub("^[^,]*,", "", teams), NA)
  )

Result:

Source: local data frame [3 x 4]

   players    teams original_team dup_teams
1  1,2,3,4 30,50,70            30     50,70
2    11,22    40,80            40        80
3 15,16,17       60            60        NA

+3

rsoren Jan 05 '15 at 22:37

dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L, 
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L, 
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L, 
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"), 
class = "data.frame", row.names = c(NA, -19L))

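With this sample data, one option is to use table()/interaction() to group teams that have identical membership.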

tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)

$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"

$`0.0.0.0.0.1.1.1.0`
[1] "60"

$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"

The list names are not pretty. If needed, they can be replaced with setNames().
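As a sketch of the setNames() renaming mentioned above: the block below rebuilds the question's data, repeats the table()/interaction() grouping, and then names each group after its first team. Naming groups by their first team is an assumption made for illustration, not something the answer specifies.

```r
# Sample data from the question
dd <- data.frame(
  Person = c(1, 2, 3, 4, 11, 22, 1, 2, 3, 4, 15, 16, 17, 1, 2, 3, 4, 11, 22),
  Team   = c(30, 30, 30, 30, 40, 40, 50, 50, 50, 50, 60, 60, 60, 70, 70, 70, 70, 80, 80)
)

# Membership table: one row per team, one column per person
tt  <- with(dd, table(Team, Person))
# Teams with identical rows fall into the same interaction level
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop = TRUE))
groups <- split(rownames(tt), grp)

# Rename each group after its first team (illustrative naming choice)
groups <- setNames(groups, vapply(groups, `[`, character(1), 1))
groups
```

The order of the groups follows the interaction levels, so it may differ from the order of the teams in the data.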

+2

MrFlick Jan 05 '15 at 22:04

A base R approach (using aggregate()):

# DF is the question's data (columns Person and Team)
DF2 <- aggregate(Person ~ Team, DF, toString)
split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70

$`11, 22`
[1] 40 80

$`15, 16, 17`
[1] 60

( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
  Team     Person DupeGroup
1   30 1, 2, 3, 4         1
2   40     11, 22         2
3   50 1, 2, 3, 4         1
4   60 15, 16, 17         3
5   70 1, 2, 3, 4         1
6   80     11, 22         2

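Teams with the same DupeGroup value have identical members. Putting the groups into a wide layout would require padding with NA, because the columns of a data.frame must all have the same length.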

Alternatively, with data.table instead of aggregate():

library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
   Team         V1 DupeGroup
1:   30 1, 2, 3, 4         1
2:   40     11, 22         2
3:   50 1, 2, 3, 4         1
4:   60 15, 16, 17         3
5:   70 1, 2, 3, 4         1
6:   80     11, 22         2

+2

docendo discimus Jan 05 '15 at 22:10

Using uniquecombs from the mgcv package:

library(mgcv)
library(magrittr) # for the pipe %>%

# Using MrFlick's data (dd)
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))

> printout
$`1`
[1] 60

$`2`
[1] 40 80

$`3`
[1] 30 50 70

Now you can use something like this answer to print the result in tabular form (note that each group runs down a column rather than across a row, as in your question):

attributes(printout) <- list(names = names(printout)
                             , row.names = 1:max(sapply(printout, length))
                             , class = "data.frame")
> printout
     1    2  3
1   60   40 30
2 <NA>   80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs
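The warning above appears because columns of unequal length are forced into a data.frame. A minimal base R sketch that pads each group with NA first is shown below. It assumes the groups are held in a named list like the one unstack() returned; here the list is typed in by hand so the block runs on its own.

```r
# Groups as printed above (entered by hand for this sketch)
printout <- list(`1` = 60, `2` = c(40, 80), `3` = c(30, 50, 70))

# Indexing past the end of a vector yields NA, so every group is
# extended to the length of the longest one
n <- max(lengths(printout))
padded <- sapply(printout, `[`, seq_len(n))

padded     # one column per group
t(padded)  # one row per group, closer to the layout in the question
```

Because every column now has the same length, the result is a regular matrix and can be passed to as.data.frame() without the warning.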

+2

nacnudus Jan 6 '15 at 2:57

Gregor · Accepted Answer · 2015-01-05T22:07:00+0000

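This is not exactly the layout you asked for, but it numbers the duplicates within each group of identical teams: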

# using MrFlick's data (dd)
library(dplyr)
dd %>% group_by(Team) %>%
    arrange(Person) %>%
    summarize(team.char = paste(Person, collapse = "_")) %>%
    group_by(team.char) %>%
    arrange(team.char, Team) %>%
    mutate(duplicate = 1:n())

Source: local data frame [6 x 3]
Groups: team.char

  Team team.char duplicate
1   40     11_22         1
2   80     11_22         2
3   60  15_16_17         1
4   30   1_2_3_4         1
5   50   1_2_3_4         2
6   70   1_2_3_4         3

(Edited to include arrange(Person), as suggested by @Reed.)
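If the wide layout from the question is needed, the long result above can be spread out with base R. The sketch below assumes the long table has the columns Team, team.char, and duplicate shown in the output. It is entered by hand here so the block does not depend on dplyr.

```r
# Long-format result, copied from the output above
res <- data.frame(
  Team      = c(40, 80, 60, 30, 50, 70),
  team.char = c("11_22", "11_22", "15_16_17", "1_2_3_4", "1_2_3_4", "1_2_3_4"),
  duplicate = c(1, 2, 1, 1, 2, 3)
)

# One row per set of members, one column per duplicate number;
# combinations that do not occur are filled with NA
wide <- with(res, tapply(Team, list(team.char, duplicate), identity))
wide
```

Each cell holds exactly one team, so identity is enough as the summary function. The row order follows the sorted team.char values and can vary with the locale.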
