Merging two data frames in R that have common and uncommon rows

I have two data frames, Data1 and Data2, that I want to merge on the variable "ID".

This sample data can be downloaded here: http://dl.dropbox.com/u/52600559/example.RData

Here is the first data frame:

    > Data1
       ID     Fruit  Color Weight
    1   1     Apple    Red      5
    2   2    Orange Orange      7
    3   3    Banana Yellow      3
    4   4      Pear  Green      5
    5   5    Tomato    Red      4
    6   6     Berry   Blue      4
    7   7  Mandarin Orange      4
    8   8 Pineapple Yellow      9
    9   9 Nectarine Orange      5
    10 10      Beet    Red      5

And here is the second data frame:

    > Data2
       ID       Fruit  Color Weight
    1   1       Apple    Red      5
    2   2      Orange Orange      7
    3   3      Banana Yellow      3
    4   4        Pear  Green      5
    5   5      Tomato    Red      4
    6  11 Pomegranate    Red      6
    7  12       Grape  Green      4
    8  13   Cranberry    Red      4
    9  14       Melon   Pink      5
    10 15     Pumpkin Orange     10

I tried to combine them as follows:

    > merge(Data1, Data2, by = "ID", sort = FALSE, all.x = TRUE, all.y = TRUE)
       ID   Fruit.x Color.x Weight.x     Fruit.y Color.y Weight.y
    1   1     Apple     Red        5       Apple     Red        5
    2   2    Orange  Orange        7      Orange  Orange        7
    3   3    Banana  Yellow        3      Banana  Yellow        3
    4   4      Pear   Green        5        Pear   Green        5
    5   5    Tomato     Red        4      Tomato     Red        4
    6   9 Nectarine  Orange        5        <NA>    <NA>       NA
    7   6     Berry    Blue        4        <NA>    <NA>       NA
    8   7  Mandarin  Orange        4        <NA>    <NA>       NA
    9   8 Pineapple  Yellow        9        <NA>    <NA>       NA
    10 10      Beet     Red        5        <NA>    <NA>       NA
    11 14      <NA>    <NA>       NA       Melon    Pink        5
    12 11      <NA>    <NA>       NA Pomegranate     Red        6
    13 12      <NA>    <NA>       NA       Grape   Green        4
    14 13      <NA>    <NA>       NA   Cranberry     Red        4
    15 15      <NA>    <NA>       NA     Pumpkin  Orange       10

As you can see, both data frames have the same variables. However, some IDs in Data1 are not in Data2, and vice versa, while other IDs appear in both data frames.

Question 1: I want to combine the duplicated columns shown above. For example, I want Fruit.x and Fruit.y merged into a single column called "Fruit". How can I do this?

Question 2: What if, for one of the samples present in both Data1 and Data2, one of the values is not consistent? For example, for sample ID 1, Fruit.x is Apple but Fruit.y is incorrectly entered as Aple (a spelling error). Is there a way to find all such instances quickly so that I can choose which one is correct? Or can I tell R to always treat Data1 as correct and Data2 as wrong when this happens?

Thanks to everyone who can help!

+6
3 answers

Try the following:

 merge(Data1, Data2, all = TRUE) 

And for question 2, try the following, where amatch holds the approximate matches for fruit in Data2$Fruit and near holds those approximate matches that are not exact matches:

    for (fruit in Data1$Fruit) {
      amatch <- agrep(fruit, Data2$Fruit, value = TRUE)  # approximate matches
      near <- amatch[amatch != fruit]                    # drop exact matches
      if (length(near) > 0) cat(fruit, ":", near, "\n")
    }

Using the provided data, it gives:

 Berry : Cranberry 
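
The fuzzy search above compares fruit names regardless of ID. As a complement, here is a minimal sketch of checking for disagreements ID by ID. The toy frames d1 and d2 and the misspelling "Aple" are made up for illustration and are not the data from the question.

```r
# Toy data (hypothetical): ID 1 is misspelled in d2.
d1 <- data.frame(ID = c(1, 2, 3),
                 Fruit = c("Apple", "Orange", "Banana"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c(1, 2, 11),
                 Fruit = c("Aple", "Orange", "Grape"),
                 stringsAsFactors = FALSE)

# An inner join on ID keeps only the IDs present in both frames;
# the shared column comes back as Fruit.x (from d1) and Fruit.y (from d2).
both <- merge(d1, d2, by = "ID")

# Rows where the two sources disagree.
conflicts <- both[both$Fruit.x != both$Fruit.y, ]
print(conflicts)
```

With more columns, the same comparison would have to be repeated for each .x/.y pair, and any NA values would need separate handling because comparing with NA yields NA.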

EDIT: improved code clarity

+10

To answer question 1:

    merge(Data1, Data2, all = TRUE)

should give you what you are looking for. However, this does not handle spelling errors; you will have to deal with them separately. unique() is a good tool for spotting them, as is tolower() for normalizing capitalization differences.
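
A small sketch of both points, on made-up toy frames (d1, d2 and their values are assumptions for illustration): without a by argument, merge() joins on every column name the two frames share, so a misspelled row simply fails to match and shows up as a second row with the same ID.

```r
# Toy data (hypothetical): ID 1 is misspelled in d2.
d1 <- data.frame(ID = c(1, 2, 3),
                 Fruit = c("Apple", "Orange", "Banana"),
                 Weight = c(5, 7, 3),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c(1, 2, 11),
                 Fruit = c("Aple", "Orange", "Grape"),
                 Weight = c(5, 7, 4),
                 stringsAsFactors = FALSE)

# With no `by`, merge() joins on all common column names
# (here ID, Fruit and Weight), so no .x/.y suffixes are produced.
merged <- merge(d1, d2, all = TRUE)

# Identical rows collapse into one; rows that differ in any column
# stay separate, so an ID that occurs more than once signals a conflict.
dup_ids <- unique(merged$ID[duplicated(merged$ID)])
print(merged[merged$ID %in% dup_ids, ])
```

This only flags conflicts between rows that share an ID; it says nothing about which version is the correct one.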

+3

This should get you most of the way: it stacks the two data frames and discards duplicate rows.

 unique(rbind(Data1, Data2)) 

Sorry, I do not have any good tips for dealing with the spelling errors.
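
One possible way to build on this approach so that Data1 always wins when the two frames disagree, sketched on made-up toy data (d1, d2 and their values are hypothetical): stack the first frame on top, then keep only the first row seen for each ID.

```r
# Toy data (hypothetical): ID 1 is misspelled in d2.
d1 <- data.frame(ID = c(1, 2, 3),
                 Fruit = c("Apple", "Orange", "Banana"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c(1, 2, 11),
                 Fruit = c("Aple", "Orange", "Grape"),
                 stringsAsFactors = FALSE)

# Stack d1 first so that its rows come before d2's, then drop exact duplicates.
combined <- unique(rbind(d1, d2))

# duplicated() flags the second and later occurrences of an ID,
# so dropping them keeps d1's version whenever the frames disagree.
resolved <- combined[!duplicated(combined$ID), ]
print(resolved)
```

This assumes ID uniquely identifies a row within each frame; it silently discards the Data2 version, so it is worth inspecting the conflicting rows first.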

+2

Source: https://habr.com/ru/post/908533/

