Looping in R to create multiple graphs when you have one extra variable

I often come across data with too many categorical variables to display satisfactorily in a single plot. When that happens, I write something to iterate over one of the variables and save a separate graph for each of its values.

The following example illustrates the process:

    library(tidyr)
    library(dplyr)
    library(ggplot2)

    mtcars <- add_rownames(mtcars, "car")
    param <- unique(mtcars$cyl)

    for (i in param) {
      mcplt <- mtcars %>%
        filter(cyl == i) %>%
        ggplot(aes(x = mpg, y = hp)) +
        geom_point() +
        facet_wrap(~car) +
        ggtitle(paste("Cylinder Type: ", i, sep = ""))
      ggsave(mcplt, file = paste("Type", i, ".jpeg", sep = ""))
    }

Whenever I look for advice on looping in R, everything seems to indicate that explicit loops are usually not a good strategy. If so, can anyone recommend a better way to achieve the same result as above? I would be particularly interested in something faster, since the loop is quite slow. But perhaps this is already the best approach; I'm just curious whether anyone can improve on it.

Thanks in advance.

1 answer

This is a well-covered topic for R; see the posts here and here. The answers there show that the *apply() alternatives to for() improve clarity, make parallelization easier, and in some circumstances speed things up. However, it sounds like your real question is "how do I make this faster?", since it currently takes long enough to make you unhappy. Inside the loop, you perform three distinct tasks:

  • Extract a subset of the data frame with filter()
  • Build the plot.
  • Save the plot as a JPEG.
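For reference, the *apply() flavor of the original loop is a direct translation. Here is a minimal sketch using the question's mtcars example (I've swapped the now-deprecated add_rownames() for tibble::rownames_to_column(), which is my own substitution, not something from the question):

```r
library(dplyr)
library(ggplot2)

# Row names become a proper "car" column (replacement for add_rownames())
mtcars2 <- tibble::rownames_to_column(mtcars, "car")

# One saved plot per cylinder count; invisible() hides the returned list
invisible(lapply(unique(mtcars2$cyl), function(i) {
  p <- mtcars2 %>%
    filter(cyl == i) %>%
    ggplot(aes(x = mpg, y = hp)) +
    geom_point() +
    facet_wrap(~car) +
    ggtitle(paste0("Cylinder Type: ", i))
  ggsave(paste0("Type", i, ".jpeg"), plot = p)
}))
```

The logic is identical to the for loop; the gain is mostly clarity, plus how easily lapply() can later be swapped for a parallel equivalent.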

There are several ways to carry out these three steps, so let's try and benchmark them all. I will use the diamonds data from ggplot2, because it is much larger than the car data; I hope the performance differences between the methods will be noticeable that way. I learned a lot from the chapter on measuring performance in Hadley Wickham's book.

So that I can use profiling, I wrap the following code in a function and save it in a separate R file named for_solution.r.

    f <- function() {
      param <- unique(diamonds$cut)
      for (i in param) {
        mcplt <- diamonds %>%
          filter(cut == i) %>%
          ggplot(aes(x = carat, y = price)) +
          geom_point() +
          facet_wrap(~color) +
          ggtitle(paste("Cut: ", i, sep = ""))
        ggsave(mcplt, file = paste("Cut", i, ".jpeg", sep = ""))
      }
    }

and then I run:

    library(dplyr)
    library(ggplot2)
    source("for_solution.r", keep.source = TRUE)
    Rprof(line = TRUE)
    f()
    Rprof(NULL)
    summaryRprof(lines = "show")

Studying this output, I see that the code spends 97.25% of its time just saving the files. Looking at the source for ggsave(), I see that the function does a lot of defensive programming to identify the output type, then opens the graphics device, prints the plot, and closes the device. So I wonder whether doing that step manually will help. I will also take advantage of the fact that the jpeg device automatically creates a new file for each page, so the device only needs to be opened and closed once.

    f1 <- function() {
      param <- unique(diamonds$cut)
      # open the jpeg device once, changing defaults to match ggsave()
      jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
           units = "in", res = 300)
      for (i in param) {
        mcplt <- diamonds %>%
          filter(cut == i) %>%
          ggplot(aes(x = carat, y = price)) +
          geom_point() +
          facet_wrap(~color) +
          ggtitle(paste("Cut: ", i, sep = ""))
        print(mcplt)
      }
      dev.off()
    }

and now profile again:

    Rprof(line = TRUE)
    f1()
    Rprof(NULL)
    summaryRprof(lines = "show")

f1() still spends most of its time in print(mcplt), and it is a little faster than before (1.96 seconds versus 2.18 seconds). One possible way to speed things up is to use a smaller device (lower resolution or a smaller image); when I used the defaults for jpeg(), the difference was bigger, more like 25% faster. I also tried switching the device to png(), but that made no difference.
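As a sketch of the smaller-device idea (the 150 dpi value is my own arbitrary choice, not something benchmarked in this answer):

```r
# Same pattern as f1(), but at half the resolution: less raster data to
# encode per page, so each saved frame is cheaper to write
jpeg("cut%03d.jpg", width = 7, height = 7, units = "in", res = 150)
plot(1:10)  # stand-in for print(mcplt); each new page goes to a new file
dev.off()
```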

Based on the profiling, I don't expect this to help, but for completeness I will get rid of the for loop and run everything inside dplyr using do(). I found this question and this one useful here.

    # open the jpeg device, changing defaults to match ggsave()
    jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
         units = "in", res = 300)
    plots <- diamonds %>%
      group_by(cut) %>%
      do({
        plot <- ggplot(aes(x = carat, y = price), data = .) +
          geom_point() +
          facet_wrap(~color) +
          ggtitle(paste("Cut: ", .$cut, sep = ""))
        print(plot)
      })
    dev.off()

Running this code gives

Error: results are not data frames at positions: 1, 2, 3

but it appears to work anyway. I believe the error occurs when do() returns, because the print() method does not return a data frame. The profiling seems to indicate that this version runs a little faster, 1.78 seconds. But I don't like solutions that throw errors, even when they don't cause problems.
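If the error bothers you, one workaround (my own suggestion, not part of the timings above) is to have the do() block return an empty data frame after printing, so that do() gets the type it expects:

```r
library(dplyr)
library(ggplot2)

jpeg("cut%03d.jpg", width = 7, height = 7, units = "in", res = 300)
diamonds %>%
  group_by(cut) %>%
  do({
    p <- ggplot(., aes(x = carat, y = price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste0("Cut: ", .$cut[1]))
    print(p)
    data.frame()  # satisfies do(): every group must return a data frame
  })
dev.off()
```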

I have to stop here, but I have already learned a lot about where to focus. Other things to try would include:

  • Using the parallel package or something similar to process each piece of the data in a separate process. I'm not sure this will help if the bottleneck is saving the file, but it should if the bottleneck is rendering the image on the CPU.
  • Trying data.table instead of dplyr, although again, the printing is the slow part.
  • Trying base graphics or lattice graphics instead of ggplot2. I don't know their relative speed, but it could make a difference.
  • Buying a faster hard drive! I just compared the speed of f() on my home computer, which has a regular hard drive, with my work machine, which has an SSD; the home machine was about 3 times slower than the timings above.
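The first bullet could be sketched with parallel::mclapply() (untested timing-wise on my side; mclapply() forks, so on Windows mc.cores must stay at 1):

```r
library(parallel)
library(dplyr)
library(ggplot2)

# Each worker builds and saves its own file, so no graphics device is shared
save_cut <- function(i) {
  p <- diamonds %>%
    filter(cut == i) %>%
    ggplot(aes(x = carat, y = price)) +
    geom_point() +
    facet_wrap(~color) +
    ggtitle(paste0("Cut: ", i))
  ggsave(paste0("Cut", i, ".jpeg"), plot = p)
}

invisible(mclapply(as.character(unique(diamonds$cut)), save_cut, mc.cores = 2))
```

Whether this wins depends on whether the time goes to rendering (CPU-bound, parallelizes well) or to disk writes (largely serial).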

Source: https://habr.com/ru/post/1237805/

