Count Distinct Restaurants by Zip Code in R

Question

Count Distinct Restaurants by Zip Code in R

I have a data frame that has the average salary for each US zip code, as well as all the fast food restaurants located in this zip code. Here is an example of this data frame:

Row_NUM ZIP MEDIAN RESTAURANT 26800 1001 56663 McDonald's 33161 1007 79076 McDonald's 23706 1008 63980 McDonald's 23709 1008 63980 McDonald's 30007 1008 63980 Taco Bell 30008 1008 63980 McDonald's 30009 1011 63476 McDonald's 24429 1013 36578 McDonald's 15323 1020 50058 KFC 29196 1020 50058 McDonald's 33127 1020 50058 McDonald's 39362 1020 50058 Wendy's 44914 1020 50058 Taco Bell 2542 1027 58573 Burger King 35242 1027 58573 McDonald's

I want to do two things. First, I want to create a new data frame that has only unique zip codes, the median salary of this zip code, and the total number of restaurants in this zip code. So, for this sample data frame:

 ZIP MEDIAN TOTAL_RESTAURANTS 1001 56663 1 1007 79076 1 1008 63980 4

Secondly, I want to create a data frame that has only a unique zip code and a column with the total number of each fast food restaurant. So, for this sample data frame:

 ZIP MEDIAN TOTAL_MCDONALDS TOTAL_TACOBELL TOTAL_KFC 1001 56663 1 0 0 1007 79076 1 0 0 1008 63980 3 1 0

I thought the code below would work, but it only gives me the zip code and general restaurant, and I'm not sure how to change it to include the other three columns.

 df <- ddply(df,~ZIP, summarise,TOTAL_RESTAURANTS=length(RESTAURANT))

Any help would be greatly appreciated.

Edit: These are the types of data that I work with in my data frame.

  str(df) data.frame': 50002 obs. of 3 variables: $ ZIP : int 44126 24014 77011 2190 48509 21061 43213 70130 31907 19422 ... $ MEDIAN : int 54496 50175 27113 74205 50895 62408 36734 47591 38710 103683 ... $ RESTAURANT: Factor w/ 10 levels "McDonald's","Burger King",..: 2 2 2 2 2 2 2 2 2 2 ...

Here is my session information:

  R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.11 (El Capitan) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] graphics grDevices utils datasets grid stats methods base other attached packages: [1] lubridate_1.3.3 extrafont_0.17 jsonlite_0.9.16 dplyr_0.4.2 tidyr_0.2.0 tableplot_0.3-5 reshape2_1.4.1 RCurl_1.95-4.7 [9] bitops_1.0-6 gplots_2.17.0 ggthemes_2.2.1 ggplot2_1.0.1 loaded via a namespace (and not attached): [1] Rcpp_0.12.0 Rttf2pt1_1.3.3 magrittr_1.5 MASS_7.3-43 munsell_0.4.2 colorspace_1.2-6 R6_2.1.1 [8] stringr_1.0.0 plyr_1.8.3 caTools_1.17.1 tools_3.2.2 parallel_3.2.2 gtable_0.1.2 KernSmooth_2.23-15 [15] DBI_0.3.1 extrafontdb_1.0 gtools_3.5.0 lazyeval_0.1.10 assertthat_0.1 digest_0.6.8 memoise_0.2.1 [22] labeling_0.3 gdata_2.17.0 stringi_0.5-5 scales_0.3.0 proto_0.3-10

+5

r

Spencer Oct 14 '15 at 23:56

source share

1 answer

jazzurro · Accepted Answer · 2015-10-15T00:09:38+0000

Uisng dplyr and reshape2 , this will be one way. Since you want to summarize data using ZIP , you can use the variable to group the data. One thing is unclear whether the MEDIAN values are the same for each ZIP or not. Here I suggested that you may have different meanings. Therefore, I used median() . Using n() , you can find out how many fast food stores exist.

 summarize(group_by(mydf, ZIP), mid = median(MEDIAN), total = n()) #If you have an identical MEDIAN for each ZIP, you could do; summarize(group_by(mydf, ZIP), mid = first(MEDIAN), total = n()) # ZIP mid total # (int) (dbl) (int) #1 1001 56663 1 #2 1007 79076 1 #3 1008 63980 4 #4 1011 63476 1 #5 1013 36578 1 #6 1020 50058 5 #7 1027 58573 2

In the second part, you can use dcast() . You want to see how many fast food stores there are by type of fast food store. Combining ZIP and MEDIAN , you ask R to check how many stores (RESTAURANT) exist.

 dcast(mydf, ZIP + MEDIAN ~ RESTAURANT, length, value.var = "RESTAURANT") # ZIP MEDIAN BurgerKing KFC McDonald TacoBell Wendy's #1 1001 56663 0 0 1 0 0 #2 1007 79076 0 0 1 0 0 #3 1008 63980 0 0 3 1 0 #4 1011 63476 0 0 1 0 0 #5 1013 36578 0 0 1 0 0 #6 1020 50058 0 1 2 1 1 #7 1027 58573 1 0 1 0 0

If you are using data.table , you can do the following.

 library(data.table) setDT(mydf)[, list(mid = first(MEDIAN), total = .N), by = ZIP][] # If you calculate median setDT(mydf)[, list(mid = as.double(median(MEDIAN)), total = .N), by = ZIP][] dcast(setDT(mydf), ZIP + MEDIAN ~ RESTAURANT, fun = length, value.var = "RESTAURANT")

DATA

 mydf <-structure(list(Row_NUM = c(26800L, 33161L, 23706L, 23709L, 30007L, 30008L, 30009L, 24429L, 15323L, 29196L, 33127L, 39362L, 44914L, 2542L, 35242L), ZIP = c(1001L, 1007L, 1008L, 1008L, 1008L, 1008L, 1011L, 1013L, 1020L, 1020L, 1020L, 1020L, 1020L, 1027L, 1027L ), MEDIAN = c(56663L, 79076L, 63980L, 63980L, 63980L, 63980L, 63476L, 36578L, 50058L, 50058L, 50058L, 50058L, 50058L, 58573L, 58573L), RESTAURANT = structure(c(3L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 2L, 3L, 3L, 5L, 4L, 1L, 3L), .Label = c("BurgerKing", "KFC", "McDonald's", "TacoBell", "Wendy's"), class = "factor")), .Names = c("Row_NUM", "ZIP", "MEDIAN", "RESTAURANT"), class = "data.frame", row.names = c(NA, -15L))

Count Distinct Restaurants by Zip Code in R

More articles: