Server memory issue when using RJDBC in a parallel computing environment

I have an R server with 16 cores and 8 GB of RAM that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, and then runs analyses on the merged data before writing the results back to the SQL server. The workers talk to the SQL server through an RJDBC connection. When multiple workers are pulling data from the SQL server at the same time, RAM usage explodes and the R server crashes.
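
The cluster setup itself is not part of the question; as a rough sketch, assuming the snow package and a hypothetical per-worker function runWorker() applied to a hypothetical list worker.tasks, the parallel part might look roughly like this:

    library(snow)
    cl <- makeCluster(10, type = "SOCK")               # 10 local workers on the 16-core box
    clusterEvalQ(cl, library(RJDBC))                   # each worker opens its own JDBC connection
    results <- parLapply(cl, worker.tasks, runWorker)  # runWorker: load, merge, analyze, write back
    stopCluster(cl)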

The strange thing is that the RAM a worker uses to load the data seems disproportionately large compared to the size of the loaded dataset. Each dataset has about 8,000 rows and 6,500 columns. This translates to about 20 MB when saved as an R object on disk and about 160 MB when saved as a comma-delimited file. Yet the RAM usage of the R session is about 2.3 GB.
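
For reference, a minimal sketch of how such a size comparison can be made (one.dataset is a placeholder for a single table pulled with dbGetQuery(), not a name from the original post):

    save(one.dataset, file = "one_dataset.RData")                 # roughly 20 MB on disk
    write.csv(one.dataset, "one_dataset.csv", row.names = FALSE)  # roughly 160 MB on disk
    print(object.size(one.dataset), units = "MB")                 # in-memory footprint of the object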

Here is a quick overview of the code (some typographical changes to improve readability):

Establish a connection using RJDBC:

require("RJDBC") drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar") con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>") 

After that, there is some code that sorts the input vector requestedDataSets (the names of all the tables to query) by number of records, so that we load the datasets from largest to smallest:

    nrow.to.merge <- rep(0, length(requestedDataSets))
    for (d in 1:length(requestedDataSets)) {
      nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from ", requestedDataSets[d]))[1, 1]
    }
    merge.order <- order(nrow.to.merge, decreasing = TRUE)

Then we loop over the requested datasets in that order and load and/or merge the data:

    for (d in merge.order) {
      # force a reconnect to the SQL server
      drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", "sqljdbc4.jar")
      try(dbDisconnect(con), silent = TRUE)
      con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>", "<user>", "<pass>")

      # remove the to.merge object (warns harmlessly on the first pass)
      rm(complete.data.to.merge)

      # force garbage collection in R and in the JVM
      gc()
      jgc()

      # ask the database for dataset d
      complete.data.to.merge <- dbGetQuery(con, paste0("select * from ", requestedDataSets[d]))

      if (d == merge.order[1]) {
        # first dataset
        complete.data <- complete.data.to.merge
        colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
      } else {
        # later datasets: merge onto the accumulated data
        complete.data <- merge(x = complete.data, y = complete.data.to.merge,
                               by.x = "key_1", by.y = "key", all.x = TRUE)
      }
    }
    return(complete.data)

When I run this code on a series of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely that the merge call is somehow exploding memory usage. Over the twelve iterations, memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217 and 2361 (MB). Which is, again, strange, since the dataset at the end is no more than 162 MB...
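
A minimal sketch of the per-iteration check behind those numbers (memory.size() is Windows-only, which matches the figures quoted above):

    memory.size()                                    # MB currently used by this R session
    print(object.size(complete.data), units = "MB")  # actual size of the merged data frame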

As you can see in the code above, I have already tried a couple of fixes, such as calling gc() and jgc() (which forces Java garbage collection: jgc <- function() { .jcall("java/lang/System", method = "gc") }). I also tried merging the data on the SQL server side, but then I ran into column-count limits.
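
Written out as a function, the JVM garbage collection helper described above looks like this (it relies on rJava, which RJDBC loads):

    jgc <- function() {
      rJava::.jcall("java/lang/System", method = "gc")  # ask the JVM to run its garbage collector
    }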

It seems to me that the RAM usage is much larger than the dataset that is ultimately created, and I suspect there is some kind of buffer/heap that is filling up... but I can't seem to find it.
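
One way to probe the JVM-heap suspicion (a sketch, not something from the original post; the 4 GB value is an arbitrary illustration) is to enlarge the heap before rJava is loaded, since java.parameters only takes effect at that point:

    options(java.parameters = "-Xmx4g")  # must be set before rJava/RJDBC are attached
    library(RJDBC)                       # if memory behavior changes, the JVM heap is involved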

Any recommendations on how to solve this problem are welcome. Let me know if (part of) my description of the problem is unclear or if you need more information.

Thanks.

1 answer

This answer is more of a glorified comment. Just because the data processed on one node takes up only 160 MB does not mean that the amount of memory required to process it is 160 MB. Many algorithms require O(n^2) storage space, which for your chunk of data would be in the gigabytes. So I really don't see anything surprising here.
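
A back-of-envelope calculation illustrates the gap between on-disk and in-memory size (assuming, purely for the estimate, that the columns are numeric doubles; the real column types are not given in the question):

    rows <- 8000; cols <- 6500
    rows * cols * 8 / 1024^2   # ~397 MB for one dataset in memory, versus ~20 MB as a compressed .RData

And merge() needs both inputs plus the result in memory at the same time, so the peak during each merge is higher still.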

I have already tried a couple of fixes such as calling gc(), jgc() (which is a forced Java garbage collection function ...

You cannot force garbage collection in Java; calling System.gc() only politely asks the JVM to run garbage collection, and it is free to ignore the request if it wants to. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt that is your bottleneck. Most likely, you are simply hitting the overhead that R needs to crunch your data.
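
If you want to see whether the polite System.gc() request actually freed anything, you can peek at the JVM heap from R via rJava (a sketch):

    rt <- rJava::.jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
    used <- rJava::.jcall(rt, "J", "totalMemory") - rJava::.jcall(rt, "J", "freeMemory")
    used / 1024^2   # MB currently allocated inside the JVM heap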
