Unique lines with several commas in R

Reference Information. I do SNP annotation from GWAS in the body without much annotation. I use the chained tBLASTn table from UCSC along with biomaRt to map each SNP to the probable gene (s).

I have a dataframe that looks like this:

SNP hu_mRNA gene chr1.111642529 NM_002107 H3F3A chr1.111642529 NM_005324 H3F3B chr1.111801684 BC098118 <NA> chr1.111925084 NM_020435 GJC2 chr1.11801605 AK027740 <NA> chr1.11801605 NM_032849 C13orf33 chr1.151220354 NM_018913 PCDHGA10 chr1.151220354 NM_018918 PCDHGA5 

What I would like to get is one line for each SNP, and a comma for genes and hu_mRNAs. Here is what I do after:

  SNP hu_mRNA gene chr1.111642529 NM_002107,NM_005324 H3F3A chr1.111801684 BC098118,NM_020435 GJC2 chr1.11801605 AK027740,NM_032849 C13orf33 chr1.151220354 NM_018913,NM_018918 PCDHGA10,PCDHGA5 

Now I know that I can do this with a wrist click in perl, but I really want to do it all in R. Any suggestions?

+4
source share
5 answers

You can use aggregate with paste for each of them and merge at the end:

 x <- structure(list(SNP = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 5L), .Label = c("chr1.111642529", "chr1.111801684", "chr1.111925084", "chr1.11801605", "chr1.151220354"), class = "factor"), hu_mRNA = structure(c(3L, 4L, 2L, 7L, 1L, 8L, 5L, 6L), .Label = c("AK027740", "BC098118", "NM_002107", "NM_005324", "NM_018913", "NM_018918", "NM_020435", "NM_032849"), class = "factor"), gene = structure(c(4L, 5L, 1L, 3L, 1L, 2L, 6L, 7L), .Label = c("<NA>", "C13orf33", "GJC2", "H3F3A", "H3F3B", "PCDHGA10", "PCDHGA5"), class = "factor")), .Names = c("SNP", "hu_mRNA", "gene"), class = "data.frame", row.names = c(NA, -8L )) a1 <- aggregate(hu_mRNA~SNP,data=x,paste,sep=",") a2 <- aggregate(gene~SNP,data=x,paste,sep=",") merge(a1,a2) SNP hu_mRNA gene 1 chr1.111642529 NM_002107, NM_005324 H3F3A, H3F3B 2 chr1.111801684 BC098118 <NA> 3 chr1.111925084 NM_020435 GJC2 4 chr1.11801605 AK027740, NM_032849 <NA>, C13orf33 5 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5 
+8
source

You can do this on one line using plyr , as this is a classic split-apply-combine problem. You split using SNP , apply paste to collapse and assemble the parts back into the data frame.

 plyr::ddply(x, .(SNP), colwise(paste), collapse = ",") 

If you want to do data conversion to R in flick of a wrist , find out plyr and reshape2 :). Another click of the wrist solution using data.table is really useful if you are dealing with a huge amount of data.

 data.table::data.table(x)[,lapply(.SD, paste, collapse = ","),'SNP'] 
+8
source

Set up test data first. Note that we made the columns more "character" rather than "factor" using as.is=TRUE :

 Lines <- "SNP hu_mRNA gene chr1.111642529 NM_002107 H3F3A chr1.111642529 NM_005324 H3F3B chr1.111801684 BC098118 <NA> chr1.111925084 NM_020435 GJC2 chr1.11801605 AK027740 <NA> chr1.11801605 NM_032849 C13orf33 chr1.151220354 NM_018913 PCDHGA10 chr1.151220354 NM_018918 PCDHGA5" cat(Lines, "\n", file = "data.txt") DF <- read.table("data.txt", header = TRUE, na.strings = "<NA>", as.is = TRUE) 

Now try this aggregate statement:

 > aggregate(. ~ SNP, DF, toString) SNP hu_mRNA gene 1 chr1.111642529 NM_002107, NM_005324 H3F3A, H3F3B 2 chr1.111925084 NM_020435 GJC2 3 chr1.11801605 NM_032849 C13orf33 4 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5 
+4
source

This can also be solved using reshape2 melt and dcast . With this approach, melt first converts the data to a "long" format, and then the dcast -ed values ​​with the same operation, paste(..., collapse = ",") :

 library(reshape2) x <- read.table( stringsAsFactors = FALSE, header = TRUE, na.strings = "<NA>", text = " SNP hu_mRNA gene chr1.111642529 NM_002107 H3F3A chr1.111642529 NM_005324 H3F3B chr1.111801684 BC098118 <NA> chr1.111925084 NM_020435 GJC2 chr1.11801605 AK027740 <NA> chr1.11801605 NM_032849 C13orf33 chr1.151220354 NM_018913 PCDHGA10 chr1.151220354 NM_018918 PCDHGA5") (xm <-melt(x, id.vars = "SNP", na.rm = TRUE)) ## SNP variable value ## 1 chr1.111642529 hu_mRNA NM_002107 ## 2 chr1.111642529 hu_mRNA NM_005324 ## 3 chr1.111801684 hu_mRNA BC098118 ## 4 chr1.111925084 hu_mRNA NM_020435 ## 5 chr1.11801605 hu_mRNA AK027740 ## 6 chr1.11801605 hu_mRNA NM_032849 ## 7 chr1.151220354 hu_mRNA NM_018913 ## 8 chr1.151220354 hu_mRNA NM_018918 ## 9 chr1.111642529 gene H3F3A ## 10 chr1.111642529 gene H3F3B ## 12 chr1.111925084 gene GJC2 ## 14 chr1.11801605 gene C13orf33 ## 15 chr1.151220354 gene PCDHGA10 ## 16 chr1.151220354 gene PCDHGA5 (xc <- dcast(xm, SNP~variable, fun.aggregate = paste, collapse = ",")) ## SNP hu_mRNA gene ## 1 chr1.111642529 NM_002107,NM_005324 H3F3A,H3F3B ## 2 chr1.111801684 BC098118 ## 3 chr1.111925084 NM_020435 GJC2 ## 4 chr1.11801605 AK027740,NM_032849 C13orf33 ## 5 chr1.151220354 NM_018913,NM_018918 PCDHGA10,PCDHGA5 
+1
source

Here is a dplyr solution that IHMO is the most readable:

 library(dplyr) x %>% group_by(SNP) %>% summarize( genes = paste(gene, collapse = ','), hu_mRNA = paste(hu_mRNA, collapse = ',') ) 

Result:

 Source: local data frame [5 x 3] SNP genes hu_mRNA (fctr) (chr) (chr) 1 chr1.111642529 H3F3A,H3F3B NM_002107,NM_005324 2 chr1.111801684 <NA> BC098118 3 chr1.111925084 GJC2 NM_020435 4 chr1.11801605 <NA>,C13orf33 AK027740,NM_032849 5 chr1.151220354 PCDHGA10,PCDHGA5 NM_018913,NM_018918 
0
source

Source: https://habr.com/ru/post/1380627/


All Articles