How to draw hierarchical clustering?

I have the following dataset:

    data <- data.frame(X=c(1,2,3,4), Y=c(1,3,2,1))
    for(i in 1:nrow(data)){ data[i,i] <- NA }
    colnames(data) <- c("A","B","C","D")
    rownames(data) <- c("A","B","C","D")
    plot(hclust(dist(data)))

and then the result is as follows:

enter image description here

But, I wonder how this plot is drawn. Here I am trying to get the dendrogram step by step. We know that the distance matrix at the beginning has the following form:

enter image description here

Each time, we find the two points with the minimum distance and combine them into one cluster.

enter image description here

So the first merge is B and C, and we update the distance matrix:

enter image description here

Again we look for the minimum distance, which is now between D and the cluster {B, C}:

enter image description here

We update the distance matrix again:

enter image description here

As a result, I should get the following merges:

  • B and C
  • B, C and D
  • B, C, D and A

But this contradicts the plot that R produced. How can the difference be explained?

1 answer

Updated answer: use single linkage rather than the default complete linkage.

I will do my best to explain how I see it. I believe this comes down to the method argument of hclust. The default method does not match the algorithm you described, but we can set the method so that it does.

But first, I get an error when I run your code:

    > data <- data.frame(X=c(1,2,3,4), Y=c(1,3,2,1))
    > for(i in 1:nrow(data)){ data[i,i] <- NA }
    > colnames(data) <- c("A","B","C","D")
    > rownames(data) <- c("A","B","C","D")
    > plot(hclust(dist(data)))
    Error in hclust(dist(data)) : NA/NaN/Inf in foreign function call (arg 11)

What is the intention of the line for(i in 1:nrow(data)){ data[i,i]<-NA} ? After this line, your data object looks like this:

       X  Y V3 V4
    1 NA  1 NA NA
    2  2 NA NA NA
    3  3  2 NA NA
    4  4  1 NA NA
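As a rough sketch of why that line ends in the error above: for i = 3 and i = 4 the assignment indexes columns that do not exist yet, so R adds them (V3, V4), and the resulting NA cells leave at least one pair of rows with no shared non-missing column. The snippet below only rebuilds the object from the question and checks the distance object for NA values; it is an illustration, not a diagnosis of what the loop was meant to do.

```r
# Rebuild the data frame from the question and apply the same loop
data <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
for (i in 1:nrow(data)) {
  data[i, i] <- NA  # i = 3 and i = 4 index columns that do not exist yet
}

d <- dist(data)     # rows with no common non-NA column get an NA distance

print(data)
print(anyNA(d))     # NA distances are what hclust() refuses to accept
```

Rows 1 and 2, for example, have no column in which both are non-missing, so their distance cannot be computed.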

However, if we instead start from the following code, we get the tree you expected:

    dt <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
    rownames(dt) <- c("A", "B", "C", "D")
    dt <- dist(dt)
    plot(hclust(dt, method = "single"))

enter image description here

Notice the change in the call to hclust: method = "single". The default is method = "complete". Complete linkage does not measure the distance between two clusters by their closest pair of points, but by their farthest pair. Once B and C are merged, the distance from that cluster to D is therefore the larger of the B-D and C-D distances, and D is no longer its nearest neighbor. Here is an excerpt from the excellent An Introduction to Statistical Learning with Applications in R, which describes the available linkage methods:

enter image description here

This text, by James, Witten, Hastie and Tibshirani, is available as a free download from the link above. The hierarchical clustering section begins on page 390. Please let me know if this helps clear things up.
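To see the difference between the two linkage methods numerically rather than only in the plot, one option is to compare the merge heights stored in the fitted objects. This is a minimal sketch using the same four points; the variable names hc_single and hc_complete are just illustrative.

```r
dt <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")
d <- dist(dt)

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Heights at which the three successive merges happen under each method
print(round(hc_single$height, 3))
print(round(hc_complete$height, 3))

# The merge component records which observations or clusters were joined
print(hc_single$merge)
print(hc_complete$merge)
```

Under single linkage the last merge happens at about 2.236 (the A-B and A-C distance), while under complete linkage it happens at 3 (the A-D distance). Note that B-C and C-D are tied at about 1.414, so which of the two pairs is joined first depends on how ties are broken.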

Original answer

I think you are calling the dist function in the wrong way and maybe too many times. Try the following:

    dt <- data.frame(X=c(1,2,3,4), Y=c(1,3,2,1))
    rownames(dt) <- c("A","B","C","D")
    dt <- dist(dt)
    plot(hclust(dt))

enter image description here

In fact, you called dist on an object that was already of class dist, which was then turned into a matrix, and then you called dist again in your plot call.

We can inspect the distance object as follows:

    > dt
             A        B        C
    B 2.236068
    C 2.236068 1.414214
    D 3.000000 2.828427 1.414214

There is no need to call dist on this object again before passing it to the hclust function.
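If it helps to reproduce the first step of the manual procedure from the question, the dist object can be expanded into a full matrix and searched for its smallest entry. The following is only a sketch; the names m, min_d and closest are illustrative, and the small tolerance is an assumption to guard against floating-point noise.

```r
dt <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")
d <- dist(dt)

m <- as.matrix(d)   # full symmetric 4 x 4 matrix with a zero diagonal
diag(m) <- NA       # ignore self-distances when searching for the minimum
min_d <- min(m, na.rm = TRUE)

# Every pair at the minimum distance, looking at the upper triangle only
pairs <- which(abs(m - min_d) < 1e-12 & upper.tri(m), arr.ind = TRUE)
closest <- data.frame(first  = rownames(m)[pairs[, "row"]],
                      second = colnames(m)[pairs[, "col"]])

print(min_d)
print(closest)
```

For this data the minimum is shared by two pairs, B-C and C-D, so the very first merge already involves a tie.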


Source: https://habr.com/ru/post/1267290/

