Restructuring team–individual level data in R (while preserving team-level information)

My current data is as follows:

```
Person Team
    10  100
    11  100
    12  100
    10  200
    11  200
    14  200
    15  200
```

I want to work out who knew each other based on which teams they were in together. I also want to count how many times each dyad has been on a team together, and I want to keep track of the team identification codes that connect each pair of people. In other words, I want to create a dataset that looks like this:

```
Person1 Person2 Count Team1 Team2 Team3
     10      11     2   100   200    NA
     10      12     1   100    NA    NA
     11      12     1   100    NA    NA
     10      14     1   200    NA    NA
     10      15     1   200    NA    NA
     11      14     1   200    NA    NA
     11      15     1   200    NA    NA
```

The resulting dataset captures the relationships that can be derived from the teams in the original dataset. The variable "Count" reflects the number of times a pair of people were on a team together. The variables "Team1", "Team2" and "Team3" list the identifiers of the teams that connect each pair of people to each other. It does not matter which person / team identifier is listed first versus second. Teams range from 2 to 8 members.
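For reference, the example input can be built in R like this (a minimal reproduction of the table above; the object name `d` matches what the answers below use):

```r
# Minimal reproduction of the example data: one row per person-team membership
d <- data.frame(
  Person = c(10, 11, 12, 10, 11, 14, 15),
  Team   = c(100, 100, 100, 200, 200, 200, 200)
)
d
```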

4 answers

Here is a "data.table" solution that seems to get you where you want to go (albeit with quite a lot of code):

```r
library(data.table)

dcast.data.table(
  dcast.data.table(
    as.data.table(d)[, combn(Person, 2), by = Team][
      , ind := paste0("Person", c(1, 2))][
      , time := sequence(.N), by = list(Team, ind)],
    time + Team ~ ind, value.var = "V1")[
      , c("count", "time") := list(.N, sequence(.N)),
      by = list(Person1, Person2)],
  Person1 + Person2 + count ~ time, value.var = "Team")
#    Person1 Person2 count   1   2
# 1:      10      11     2 100 200
# 2:      10      12     1 100  NA
# 3:      10      14     1 200  NA
# 4:      10      15     1 200  NA
# 5:      11      12     1 100  NA
# 6:      11      14     1 200  NA
# 7:      11      15     1 200  NA
# 8:      14      15     1 200  NA
```

Update: step-by-step version of the above

To understand what happens above, here it is step by step:

```r
## The following is a long data.table with 4 columns:
## Team, V1, ind, and time
step1 <- as.data.table(d)[, combn(Person, 2), by = Team][
  , ind := paste0("Person", c(1, 2))][
  , time := sequence(.N), by = list(Team, ind)]
head(step1)
#    Team V1     ind time
# 1:  100 10 Person1    1
# 2:  100 11 Person2    1
# 3:  100 10 Person1    2
# 4:  100 12 Person2    2
# 5:  100 11 Person1    3
# 6:  100 12 Person2    3

## Here, we make the data "wide"
step2 <- dcast.data.table(step1, time + Team ~ ind, value.var = "V1")
step2
#    time Team Person1 Person2
# 1:    1  100      10      11
# 2:    1  200      10      11
# 3:    2  100      10      12
# 4:    2  200      10      14
# 5:    3  100      11      12
# 6:    3  200      10      15
# 7:    4  200      11      14
# 8:    5  200      11      15
# 9:    6  200      14      15

## Create a "count" column and a "time" column,
## grouped by "Person1" and "Person2".
## "count" is for the count column.
## "time" is for going to a wide format.
step3 <- step2[, c("count", "time") := list(.N, sequence(.N)),
               by = list(Person1, Person2)]
step3
#    time Team Person1 Person2 count
# 1:    1  100      10      11     2
# 2:    2  200      10      11     2
# 3:    1  100      10      12     1
# 4:    1  200      10      14     1
# 5:    1  100      11      12     1
# 6:    1  200      10      15     1
# 7:    1  200      11      14     1
# 8:    1  200      11      15     1
# 9:    1  200      14      15     1

## The final step of going wide
out <- dcast.data.table(step3, Person1 + Person2 + count ~ time,
                        value.var = "Team")
out
#    Person1 Person2 count   1   2
# 1:      10      11     2 100 200
# 2:      10      12     1 100  NA
# 3:      10      14     1 200  NA
# 4:      10      15     1 200  NA
# 5:      11      12     1 100  NA
# 6:      11      14     1 200  NA
# 7:      11      15     1 200  NA
# 8:      14      15     1 200  NA
```

The counts are easy to get with a self-join, which I think is easiest to do with sqldf. (Granted, I probably find sqldf easiest because I'm not very good with data.table.) Edited to include @G. Grothendieck's suggestion:

```r
# your data
dd <- structure(list(Person = c(10L, 11L, 12L, 10L, 11L, 14L, 15L),
                     Team = c(100L, 100L, 100L, 200L, 200L, 200L, 200L)),
                .Names = c("Person", "Team"), class = "data.frame",
                row.names = c(NA, -7L))

library(sqldf)
dyads <- sqldf("select dd1.Person Person1, dd2.Person Person2
                     , count(*) Count
                     , group_concat(dd1.Team) Teams
                  from dd dd1
                 inner join dd dd2
                    on dd1.Team = dd2.Team
                   and dd1.Person < dd2.Person
                 group by dd1.Person, dd2.Person")
dyads
#   Person1 Person2 Count   Teams
# 1      10      11     2 100,200
# 2      10      12     1     100
# 3      10      14     1     200
# 4      10      15     1     200
# 5      11      12     1     100
# 6      11      14     1     200
# 7      11      15     1     200
# 8      14      15     1     200
```

Then we can split the Teams string to get the columns we need.

```r
library(stringr)
cbind(dyads,
      apply(str_split_fixed(dyads$Teams, ",",
                            n = max(str_count(dyads$Teams, pattern = ",")) + 1),
            MARGIN = 2, FUN = as.numeric))
#   Person1 Person2 Count   Teams   1   2
# 1      10      11     2 100,200 100 200
# 2      10      12     1     100 100  NA
# 3      10      14     1     200 200  NA
# 4      10      15     1     200 200  NA
# 5      11      12     1     100 100  NA
# 6      11      14     1     200 200  NA
# 7      11      15     1     200 200  NA
# 8      14      15     1     200 200  NA
```

I will leave the column renaming to you.
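If you do want the exact Team1/Team2 names from the question, one possible sketch (using a small stand-in data frame here, since the split columns produced by the apply()/cbind() step come out named "1", "2", ...):

```r
# Stand-in for the combined result above; in practice `res` would be the
# cbind() output, whose split team columns are named "1", "2", ...
res <- data.frame(Person1 = c(10, 10), Person2 = c(11, 12),
                  Count = c(2, 1), Teams = c("100,200", "100"),
                  check.names = FALSE)
res[["1"]] <- c(100, 100)
res[["2"]] <- c(200, NA)

# Rename every column that is not one of the known ones to Team1, Team2, ...
team.cols <- setdiff(names(res), c("Person1", "Person2", "Count", "Teams"))
names(res)[match(team.cols, names(res))] <- paste0("Team", seq_along(team.cols))
names(res)
# [1] "Person1" "Person2" "Count"   "Teams"   "Team1"   "Team2"
```

This scales to any number of split columns without hard-coding how many teams a pair shares.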


Following @Gregor and using Gregor's data, I tried to add the team columns. I could not produce exactly what you requested, but this may still be useful. Using full_join in the development version of dplyr (dplyr 0.4), I did the following. I created a data frame of all Person combinations within each team using combn and saved it as an object, a. Then I subset a by team and used full_join. In this way, I tried to create team columns, at least for teams 100 and 200. I used rename to change the column names and select to arrange the columns in your order.

```r
library(dplyr)

group_by(dd, Team) %>%
  do(data.frame(t(combn(.$Person, 2)))) %>%
  data.frame() -> a

full_join(filter(a, Team == "100"), filter(a, Team == "200"),
          by = c("X1", "X2")) %>%
  rename(Person1 = X1, Person2 = X2,
         Team1 = Team.x, Team2 = Team.y) %>%
  select(Person1, Person2, Team1, Team2)

#   Person1 Person2 Team1 Team2
# 1      10      11   100   200
# 2      10      12   100    NA
# 3      11      12   100    NA
# 4      10      14    NA   200
# 5      10      15    NA   200
# 6      11      14    NA   200
# 7      11      15    NA   200
# 8      14      15    NA   200
```

EDIT

I am sure there are better ways to do this, but this is the closest I can get. In this version I tried to add the count using another join.

```r
group_by(dd, Team) %>%
  do(data.frame(t(combn(.$Person, 2)))) %>%
  data.frame() -> a

full_join(filter(a, Team == "100"), filter(a, Team == "200"),
          by = c("X1", "X2")) %>%
  full_join(count(a, X1, X2), by = c("X1", "X2")) %>%
  rename(Person1 = X1, Person2 = X2,
         Team1 = Team.x, Team2 = Team.y, Count = n) %>%
  select(Person1, Person2, Count, Team1, Team2)

#   Person1 Person2 Count Team1 Team2
# 1      10      11     2   100   200
# 2      10      12     1   100    NA
# 3      11      12     1   100    NA
# 4      10      14     1    NA   200
# 5      10      15     1    NA   200
# 6      11      14     1    NA   200
# 7      11      15     1    NA   200
# 8      14      15     1    NA   200
```

Here is a general solution:

```r
library(dplyr)
library(reshape2)

find.friends <- function(d, n = 2) {
  d$exist <- TRUE
  z <- dcast(d, Person ~ Team, value.var = 'exist')
  #   Person  100  200
  # 1     10 TRUE TRUE
  # 2     11 TRUE TRUE
  # 3     12 TRUE   NA
  # 4     14   NA TRUE
  # 5     15   NA TRUE

  pairs.per.team <- sapply(
    sort(unique(d$Team)),
    function(team) {
      non.na <- !is.na(z[, team])
      if (sum(non.na) < n) return()
      combns <- t(combn(z$Person[non.na], n))
      cbind(combns, team)
    })
  df <- as.data.frame(do.call(rbind, pairs.per.team))
  if (nrow(df) == 0) return()
  persons <- sprintf('Person%i', 1:n)
  colnames(df)[1:n] <- persons
  #   Person1 Person2 team
  # 1      10      11  100
  # 2      10      12  100
  # 3      11      12  100
  # 4      10      11  200
  # 5      10      14  200
  # 6      10      15  200
  # 7      11      14  200
  # 8      11      15  200
  # 9      14      15  200

  # Personally, I find the data frame above most suitable for further
  # analysis. The following code is needed only to make the output
  # compatible with the author's.
  df2 <- df %>%
    grouped_df(as.list(persons)) %>%
    mutate(i.team = paste0('team', seq_along(team)))
  #   Person1 Person2 team i.team
  # 1      10      11  100  team1
  # 2      10      12  100  team1
  # 3      11      12  100  team1
  # 4      10      11  200  team2
  # 5      10      14  200  team1
  # 6      10      15  200  team1
  # 7      11      14  200  team1
  # 8      11      15  200  team1
  # 9      14      15  200  team1

  # count number of teams per pair
  df2.count <- df %>%
    grouped_df(as.list(persons)) %>%
    summarize(cnt = length(team))

  # reshape the data
  df3 <- dcast(df2,
               as.formula(sprintf('%s~i.team', paste(persons, collapse = '+'))),
               value.var = 'team')
  df3$count <- df2.count$cnt
  df3
}
```

Your data:

```r
d <- structure(list(Person = c("10", "11", "12", "10", "11", "14", "15"),
                    Team = c("100", "100", "100", "200", "200", "200", "200")),
               .Names = c("Person", "Team"), row.names = c(NA, -7L),
               class = "data.frame")
```

Calling

```r
find.friends(d, n = 2)
```

should give the desired result.

By changing n , you can also find triads, tetrads, etc.
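For example, the triads within a single team fall out of the same combn() building block the function uses internally (shown here on team 200's members from the example data, as an illustration only):

```r
# Members of team 200 in the example data
members <- c(10, 11, 14, 15)

# One row per unordered triple (n = 3); find.friends(d, n = 3)
# enumerates exactly these within each team
triads <- t(combn(members, 3))
triads
#      [,1] [,2] [,3]
# [1,]   10   11   14
# [2,]   10   11   15
# [3,]   10   14   15
# [4,]   11   14   15
```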


Source: https://habr.com/ru/post/980623/

