Create origin-destination matrices with R

My data frame consists of people and the city in which they live at a particular point in time. I would like to generate one origin-destination matrix for each year, recording the number of moves from one city to another. I'd like to know:

  • How can I automatically generate origin-destination tables for each year in my dataset?
  • How can I generate all tables in the same 5x5 format, where 5 is the number of cities in my example?
  • Is there more efficient code than what I suggest below? I intend to run it on a very large dataset.

Consider the following example:

    # An example data frame
    id=sample(1:5,50,T)
    year=sample(2005:2010,50,T)
    city=sample(paste(rep("City",5),1:5,sep=""),50,T)
    df=as.data.frame(cbind(id,year,city),stringsAsFactors=F)
    df$year=as.numeric(df$year)
    df=df[order(df$id,df$year),]
    rm(id,year,city)

My best attempt

    # Creating variables
    for(i in 1:length(df$id)){
      df$origin[i]=df$city[i]
      df$destination[i]=df$city[i+1]
      # Checking whether a move has taken place and whether it's the same person
      df$move[i]=ifelse(df$origin[i]!=df$destination[i] & df$id[i]==df$id[i+1],1,0)
      # I assume the person moved exactly halfway between the two dates at which their location was recorded
      df$year_move[i]=ceiling((df$year[i]+df$year[i+1])/2)
    }
    df=df[df$move!=0,c("origin","destination","year_move")]

Creating an origin-destination table for 2007

    yr07=df[df$year_move==2007,]
    table(yr07$origin,yr07$destination)

Result

            City1 City2 City3 City5
      City1     0     0     1     2
      City2     2     0     0     0
      City5     1     1     0     0
2 answers

You can split your data by id, compute all of a person's moves within that person's own subset of the data frame, and then recombine the results:

    spl <- split(df, df$id)
    move.spl <- lapply(spl, function(x) {
      ret <- data.frame(from=head(x$city, -1), to=tail(x$city, -1),
                        year=ceiling((head(x$year, -1)+tail(x$year, -1))/2),
                        stringsAsFactors=FALSE)
      ret[ret$from != ret$to,]
    })
    (moves <- do.call(rbind, move.spl))
    #      from    to year
    # 1.1 City4 City2 2007
    # 1.2 City2 City1 2008
    # 1.3 City1 City5 2009
    # 1.4 City5 City4 2009
    # 1.5 City4 City2 2009
    # ...

Because this code uses vectorized operations within each id, it should be much faster than looping over every row of the data frame, as the code in the question does.
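For a very large dataset, the same idea can be pushed one step further by comparing each row with the next one across the whole data frame at once, with no split at all. The following is only a sketch on a small hypothetical data frame, and it assumes the rows are already sorted by id and then year, as in the question's example:

```r
# Hypothetical toy data, already sorted by id and then year.
df <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  year = c(2005, 2007, 2008, 2006, 2010),
  city = c("City1", "City2", "City2", "City3", "City1"),
  stringsAsFactors = FALSE
)

n   <- nrow(df)
cur <- seq_len(n - 1)   # rows 1 .. n-1
nxt <- cur + 1          # rows 2 .. n

# Keep consecutive row pairs that belong to the same person and change city.
is_move <- df$id[cur] == df$id[nxt] & df$city[cur] != df$city[nxt]

moves <- data.frame(
  from = df$city[cur][is_move],
  to   = df$city[nxt][is_move],
  # Same convention as above: the move is dated halfway between the two records.
  year = ceiling((df$year[cur][is_move] + df$year[nxt][is_move]) / 2),
  stringsAsFactors = FALSE
)
moves
```

Pairs of rows that straddle two different ids are discarded by the `df$id[cur] == df$id[nxt]` condition, which is what makes the split unnecessary.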

Now you can build the 5x5 year-specific move matrices using split and table:

    moves$from <- factor(moves$from)
    moves$to <- factor(moves$to)
    lapply(split(moves, moves$year), function(x) table(x$from, x$to))
    # $`2005`
    #
    #         City1 City2 City3 City4 City5
    #   City1     0     0     0     0     1
    #   City2     0     0     0     0     0
    #   City3     0     0     0     0     0
    #   City4     0     0     0     0     0
    #   City5     0     0     1     0     0
    #
    # $`2006`
    #
    #         City1 City2 City3 City4 City5
    #   City1     0     0     0     1     0
    #   City2     0     0     0     0     0
    #   City3     1     0     0     1     0
    #   City4     0     0     0     0     0
    #   City5     2     0     0     0     0
    # ...
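One caveat: factor() without explicit levels only knows the cities that actually occur in each column, so a city that never appears as an origin (or never as a destination) anywhere in the data would be missing from every table. A minimal sketch of a safeguard, assuming the full list of cities is known in advance (the `cities` vector and the toy `moves` data below are hypothetical):

```r
# Hypothetical moves table with from/to/year columns.
moves <- data.frame(
  from = c("City1", "City5", "City3"),
  to   = c("City5", "City3", "City1"),
  year = c(2005, 2005, 2006),
  stringsAsFactors = FALSE
)

# Assumption: the complete set of cities is known up front.
cities <- paste0("City", 1:5)

# Explicit levels force every table to have all five rows and columns.
moves$from <- factor(moves$from, levels = cities)
moves$to   <- factor(moves$to,   levels = cities)

od <- lapply(split(moves, moves$year), function(x) table(x$from, x$to))
od[["2005"]]
```

With real data, the level set could instead be derived from the original city column, for example with sort(unique(df$city)).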

You can use dcast from the reshape2 package together with a loop to do this.

    library(reshape2)

    # write function
    write_matrices <- function(year){
      # count the moves for each origin/destination pair
      mat <- dcast(subset(df, df$year_move == year), origin ~ destination,
                   fun.aggregate = length, value.var = "year_move")
      print(year)
      print(mat)
    }

    # get unique list of years (there was an NA in there, so that's why this is longer than it needs to be)
    years <- unique(subset(df, is.na(df$year_move) == FALSE)$year_move)

    # loop through and get results
    for (year in years){
      write_matrices(year)
    }

The one requirement this does not meet is that every matrix be 5x5: if not all five cities appear in a given year, only the cities present in that year are shown.

You can fix this by adding a step that first converts your observations into a frequency table, so that the missing cities are included with zero counts.
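One way that step might look in base R is sketched below. It uses table and xtabs rather than dcast, and it assumes the complete list of cities is known beforehand; the `cities` vector and the toy `yr07` data are hypothetical:

```r
# Hypothetical moves for a single year, in the question's column layout.
yr07 <- data.frame(
  origin      = c("City1", "City1", "City2"),
  destination = c("City3", "City5", "City1"),
  stringsAsFactors = FALSE
)

# Assumption: the complete list of cities is known up front.
cities <- paste0("City", 1:5)

# Long-format frequency table: one row per origin/destination pair,
# including pairs that never occur (Freq = 0).
freq <- as.data.frame(table(
  origin      = factor(yr07$origin,      levels = cities),
  destination = factor(yr07$destination, levels = cities)
))

# Reshape the 25 rows back into a 5x5 origin-destination matrix.
mat <- xtabs(Freq ~ origin + destination, data = freq)
mat
```

Because the zero-count pairs are present in `freq`, any later reshaping step sees all five cities in every year.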


Source: https://habr.com/ru/post/987285/

