Venn diagram from the list of clusters and related factors

I have an input file listing about 50,000 clusters and the factors present in each of them (about 10 million records in total); see the example below:

    set.seed(1)
    x = paste("cluster-", sample(c(1:100), 500, replace=TRUE), sep="")
    y = c(
      paste("factor-", sample(c(letters[1:3]), 300, replace=TRUE), sep=""),
      paste("factor-", sample(c(letters[1]), 100, replace=TRUE), sep=""),
      paste("factor-", sample(c(letters[2]), 50, replace=TRUE), sep=""),
      paste("factor-", sample(c(letters[3]), 50, replace=TRUE), sep="")
    )
    data = data.frame(cluster=x, factor=y)

With a bit of help from another question, I managed to create a pie chart of the joint occurrence of these factors:

    counts = with(data, table(tapply(factor, cluster, function(x)
      paste(as.character(sort(unique(x))), collapse='+'))))
    pie(counts[counts>1])

But now I would like a Venn diagram of the co-occurrence of the factors. Ideally, it would also accept a threshold for the minimum count of each factor. For example, a Venn diagram in which a factor is only taken into account for a cluster if it occurs there more than 10 times (n > 10).

I tried to find a way to build the count table with aggregate(), but could not get it to work.

r data-visualization combinations
Nov 14 2018-11-11T00:
1 answer

Here are two solutions, using two different packages with Venn diagram capabilities. As you expected, both include an initial step that uses the aggregate() function.

I prefer the results from the venneuler package. Its default label positions are not ideal, but you can adjust them by looking at the code of the associated plot method (possibly using locator() to select the coordinates).

Solution 1:

One possibility is to use venneuler() in the venneuler package to draw a Venn diagram.

    library(venneuler)

    ## Strip the "factor-" prefix so the labels become "a", "b", "c", and
    ## make sure the column is a character vector. (This works whether
    ## data.frame() stored the column as a factor or as character, the
    ## default since R 4.0.)
    data$factor <- sub("^factor-", "", as.character(data$factor))

    ## FUN is an anonymous function that determines which letters are present
    ## 2 or more times in the cluster and then pastes them together into
    ## strings of a form that venneuler() expects.
    inter <- aggregate(factor ~ cluster, data=data, FUN = function(X) {
        tab <- table(X)
        names <- names(tab[tab>=2])
        paste(sort(names), collapse="&")
    })

    ## Count how many clusters contain each combination of letters
    counts <- table(inter$factor)
    counts <- counts[names(counts)!=""]  # To remove groups with <2 of any letter
    ## Output from the original run (sample() draws differently from
    ## R 3.6.0 on, so your counts may differ):
    #     a   a&b a&b&c   a&c     b   b&c     c
    #    19    13    12    14    13     9    12

    ## Convert to proportions for venneuler()
    ps <- counts/sum(counts)

    ## Calculate the Venn diagram
    vd <- venneuler(c(a = ps[["a"]], b = ps[["b"]], c = ps[["c"]],
                      "a&b" = ps[["a&b"]],
                      "a&c" = ps[["a&c"]],
                      "b&c" = ps[["b&c"]],
                      "a&b&c" = ps[["a&b&c"]]))

    ## Plot it!
    plot(vd)

A few notes about the choices I made while writing this code:

  • I changed the factor names from "factor-a" to "a". You can obviously change that back.

  • I only required each factor to be present >= 2 times (instead of > 10) in order to be counted in a cluster. (That was so I could demonstrate the code with this small subset of your data.)

  • If you look at the intermediate counts object, you will see that it contains an initial unnamed element. This element is the number of clusters that contain fewer than 2 of any letter. You can decide better than I can whether you want to include it when calculating the subsequent ps object ("proportions").

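The threshold discussed in the notes can also be made a parameter. The following is a minimal base-R sketch, not part of venneuler: `combo_counts` and `min_n` are hypothetical names introduced here, and the toy data frame is made up for illustration. It assumes a data frame with `cluster` and `factor` columns, as in the question.

```r
## Hypothetical helper: count clusters by the combination of factors
## that occur at least `min_n` times in them.
combo_counts <- function(data, min_n = 2) {
  ## Rows are clusters, columns are factors, cells are occurrence counts
  tab <- table(data$cluster, data$factor)
  ## For each cluster, paste together the factors that meet the threshold
  combos <- apply(tab >= min_n, 1, function(keep) {
    paste(sort(colnames(tab)[keep]), collapse = "&")
  })
  counts <- table(combos)
  ## Drop clusters in which no factor reached the threshold
  counts[names(counts) != ""]
}

## Toy data, made up for illustration
toy <- data.frame(
  cluster = c("c1", "c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3"),
  factor  = c("a",  "a",  "b",  "b",  "a",  "a",  "a",  "b",  "c")
)
combo_counts(toy, min_n = 2)
```

With the real data, something like `combo_counts(data, min_n = 11)` would correspond to the n > 10 rule from the question; the resulting names use the same "a&b" form that the venneuler() call expects, provided the factor names have been shortened first.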

Solution 2:

Another possibility is to use vennCounts() and vennDiagram() in the Bioconductor package limma. To download the package, follow the installation instructions on the Bioconductor site. Unlike the venneuler solution, the overlap in the resulting diagram is not proportional to the actual degree of intersection. Instead, it annotates the diagram with the actual frequencies. (Note that this solution does not involve any changes to the data$factor column.)

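    library(limma)

    ## Tabulate against a fixed set of levels so that every cluster gets a
    ## count for each factor, even when data$factor is a character vector.
    lvls <- sort(unique(as.character(data$factor)))
    out <- aggregate(factor ~ cluster, data=data,
                     FUN = function(x) table(factor(x, levels = lvls)))
    out <- cbind(out[1], data.frame(out[2][[1]]))

    ## A factor counts as present in a cluster if it occurs >= 2 times
    counts <- vennCounts(out[, -1] >= 2)
    vennDiagram(counts, names = c("Factor A", "Factor B", "Factor C"),
                cex = 1, counts.col = "red")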

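As a rough cross-check that does not need limma, the membership patterns behind such a diagram can be tallied in base R. This is only a sketch under the assumption that a factor counts as present in a cluster when it occurs at least `min_n` times; `membership_counts`, `min_n`, and the toy data are hypothetical and made up for illustration.

```r
## Hypothetical helper: tally clusters by their 0/1 membership pattern.
## Each pattern has one digit per factor, in alphabetical factor order,
## so "110" means "first and second factor present, third absent".
membership_counts <- function(data, min_n = 2) {
  ## Logical cluster-by-factor matrix: TRUE where the threshold is met
  present <- table(data$cluster, data$factor) >= min_n
  ## Collapse each row into a string of 0s and 1s
  pattern <- apply(present, 1, function(p) paste(as.integer(p), collapse = ""))
  table(pattern)
}

## Toy data, made up for illustration
toy <- data.frame(
  cluster = c("c1", "c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3"),
  factor  = c("a",  "a",  "b",  "b",  "a",  "a",  "a",  "b",  "c")
)
membership_counts(toy, min_n = 2)
```

Unlike the pasted "a&b" labels used for venneuler, this tally keeps the all-zero pattern, that is, the clusters in which no factor reached the threshold.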

Nov 17 '11 at 19:01