plyr in R is very slow when merging

I use the plyr package in R to do the following:

  • take a row from table A according to column A and column B
  • find the row from table B having the same value in column A and column B
  • copy column C from table B to table A

I added a progress bar to track the run, but after it reaches 100% R keeps working: I can see that my processor is still busy with RGui, and the call never finishes.

My table A contains about 40,000 rows of data, each with a unique combination of column A and column B.

I suspect that the "combine" step of plyr's split-apply-combine workflow cannot handle these 40,000 rows, because the same code works for another table with 4,000 rows.

Any suggestions for improving efficiency? Thanks.

UPDATE

Here is my code:

for (loop.filename in 1:nrow(filename)) {
  print("infection source merge")
  print(filename[loop.filename, "table_name"])
  temp <- get(filename[loop.filename, "table_name"])
  temp1 <- ddply(temp,
                 c("HOSP_NO", "REF_DATE"),
                 function(df) {
                   temp.infection.source <- abcde[abcde[, "Case_Number"] == unique(df[, "HOSP_NO"]) &
                                                  abcde[, "Reference_Date"] == unique(df[, "REF_DATE"]),
                                                  "Case_Definition"]
                   if (length(temp.infection.source) == 0) {
                     temp.infection.source <- "NIL"
                   } else if (length(unique(temp.infection.source)) > 1) {
                     temp.infection.source <- "MULTIPLE"
                   } else {
                     temp.infection.source <- unique(temp.infection.source)
                   }
                   data.frame(df, INFECTION_SOURCE = temp.infection.source)
                 },
                 .progress = "text")
  assign(filename[loop.filename, "table_name"], temp1)
}
1 answer

If I understand correctly what you are trying to achieve, this should do what you want fairly quickly and without much memory overhead.

#toy data
A <- data.frame(
    A=letters[1:10],
    B=letters[11:20],
    CC=1:10
)

ord <- sample(1:10)
B <- data.frame(
    A=letters[1:10][ord],
    B=letters[11:20][ord],
    CC=(1:10)[ord]
)
#combining values
A.comb <- paste(A$A,A$B,sep="-")
B.comb <- paste(B$A,B$B,sep="-")
#matching
A$DD <- B$CC[match(A.comb,B.comb)]
A

This only works when the combinations are unique. If they are not, you need to deal with that first. Without data, it is impossible to tell exactly what your full function is meant to do, but you should be able to transfer the logic shown here to your own case.
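
As a rough sketch of how that transfer might look, the snippet below applies the same paste-and-match idea to the column names from the question. It first collapses the lookup table to one value per key, so that duplicated keys become "MULTIPLE" and missing keys become "NIL". The toy values in abcde and temp are invented for illustration, and the real data may need different handling (for example, if dates are stored in different formats in the two tables).

```r
# Toy stand-ins for the question's tables (values are made up for illustration)
abcde <- data.frame(
    Case_Number     = c(1, 1, 2, 3, 3),
    Reference_Date  = c("2010-01-01", "2010-01-01", "2010-01-02",
                        "2010-01-03", "2010-01-03"),
    Case_Definition = c("X", "Y", "X", "Z", "Z"),
    stringsAsFactors = FALSE
)
temp <- data.frame(
    HOSP_NO  = c(1, 2, 3, 4),
    REF_DATE = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-04"),
    stringsAsFactors = FALSE
)

# combined keys, built the same way as A.comb and B.comb
key.lookup <- paste(abcde$Case_Number, abcde$Reference_Date, sep = "-")
key.temp   <- paste(temp$HOSP_NO, temp$REF_DATE, sep = "-")

# collapse the lookup table to one value per key:
# a single distinct definition is kept, several distinct ones become "MULTIPLE"
collapsed <- vapply(
    split(abcde$Case_Definition, key.lookup),
    function(x) {
        u <- unique(x)
        if (length(u) > 1) "MULTIPLE" else u
    },
    character(1)
)

# vectorised lookup; keys absent from the lookup table give NA, recoded to "NIL"
src <- unname(collapsed[match(key.temp, names(collapsed))])
src[is.na(src)] <- "NIL"
temp$INFECTION_SOURCE <- src
temp
```

Because the lookup is a single match() call over all rows rather than one data frame subset per group, it avoids building and recombining thousands of small data frames.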


Source: https://habr.com/ru/post/1770677/

