How to draw hierarchical clustering?

Question


I have the following dataset:

    data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
    for(i in 1:nrow(data)){ data[i,i]<-NA}
    colnames(data) <- c("A","B","C","D")
    rownames(data) <- c("A","B","C","D")
    plot(hclust(dist(data)))

and then the result is as follows:

But, I wonder how this plot is drawn. Here I am trying to get the dendrogram step by step. We know that the distance matrix at the beginning has the following form:

Each time, we find the two points with the minimum distance and combine them into one cluster.

So the first merge is B and C, and we update the distance matrix.

Again we find the pair with the minimum distance, which is D and the cluster {B, C}.

We update the distance matrix again.

As a result, I should have the following merges

B and C
B, C and D
B, C, D and A

But this contradicts the plot that R produced. How can this be explained?


r

Salman Apr 28 '17 at 2:31


1 answer

Nick criswell · Accepted Answer · 2017-04-28T14:43:40+0000

Updated answer: use `single` linkage, not the default `complete` linkage.

I will do my best to explain how I see it. I believe this comes down to the method argument passed to hclust. The default method for hclust does not match the algorithm you posted, but we can set the method so that it does.

But first, I get an error when I run your code:

    > data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
    > for(i in 1:nrow(data)){ data[i,i]<-NA}
    > colnames(data) <- c("A","B","C","D")
    > rownames(data) <- c("A","B","C","D")
    > plot(hclust(dist(data)))
    Error in hclust(dist(data)) : NA/NaN/Inf in foreign function call (arg 11)

What did you intend with the line `for(i in 1:nrow(data)){ data[i,i]<-NA}`? After this line, your data object looks like this:

       X  Y V3 V4
    1 NA  1 NA NA
    2  2 NA NA NA
    3  3  2 NA NA
    4  4  1 NA NA
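To see what that loop does to the data frame, here is a small sketch that only inspects the object before and after the loop. It uses nothing beyond base R, and the variable name `data` is taken from your code.

```r
# Start from the original two-column coordinates
data <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
dim_before <- dim(data)   # 4 rows, 2 columns

# Assigning to data[i, i] for i = 3 and i = 4 addresses columns that do
# not exist yet, so R appends new columns (named V3 and V4) filled with NA
for (i in 1:nrow(data)) {
  data[i, i] <- NA
}
dim_after <- dim(data)    # 4 rows, 4 columns

print(dim_before)
print(dim_after)
print(names(data))        # "X" "Y" "V3" "V4"
print(sum(is.na(data)))   # number of NA cells in the data frame
```

So instead of blanking the diagonal of a distance matrix, the loop overwrites two of the coordinates and adds two columns that contain only NA.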

However, if we start from the following code instead, we can generate the tree you expect:

    dt<-data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
    rownames(dt) <- c("A", "B", "C", "D")
    dt<-dist(dt)
    plot(hclust(dt, method = "single"))

Notice that the call to hclust now passes `method = "single"`. The default is `method = "complete"`. Complete linkage does not measure the distance between two clusters by their closest pair of points, but by their farthest pair (the longest intercluster distance). Here is an excerpt from the fantastic An Introduction to Statistical Learning with Applications in R, which describes the various linkage methods available:

This text, by James, Witten, Hastie and Tibshirani, is available as a free download from the link above. The hierarchical clustering section begins on page 390. Please let me know if this helps clear things up.
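As a rough check of the difference between the two linkage methods, the sketch below runs hclust on the same four points with both methods and compares the merge heights. The object names `d`, `hc_single` and `hc_complete` are just illustrative.

```r
# Same four points as in the question
pts <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(pts) <- c("A", "B", "C", "D")
d <- dist(pts)

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Height at which each successive merge happens
print(round(hc_single$height, 3))    # 1.414 1.414 2.236
print(round(hc_complete$height, 3))  # 1.414 2.236 3.000

# The merge matrices show which observations or clusters are joined at
# each step (negative numbers are single observations, positive numbers
# refer to clusters formed in earlier rows)
print(hc_single$merge)
print(hc_complete$merge)
```

With single linkage the second merge still happens at 1.414, because D is that close to C. With complete linkage no cluster can absorb a third point below 2.236, which is why the default dendrogram does not follow the steps you worked out by hand. Note that B-C and C-D are tied at 1.414, so which of those pairs is merged first is down to how hclust breaks ties.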

Original answer

I think you are calling the dist function in the wrong way and maybe too many times. Try the following:

    dt<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
    rownames(dt) <- c("A","B","C","D")
    dt<-dist(dt)
    plot(hclust(dt))

In effect, you called dist on an object that was already of class dist, converted the result to a matrix, and then called dist again in your plot call.

We can look at the distance object on its own:

    > dt
             A        B        C
    B 2.236068
    C 2.236068 1.414214
    D 3.000000 2.828427 1.414214

There is no need to call dist on this object again before passing it to the hclust function.
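A quick way to convince yourself of this is to check the class of the object and to compare it with what a second dist call would produce. This is only a sketch; `d` and `d_twice` are names I made up for the illustration.

```r
pts <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(pts) <- c("A", "B", "C", "D")

d <- dist(pts)
print(class(d))             # "dist", which is what hclust expects

# Calling dist again treats each row of the distance matrix as a point
# and measures distances between those rows, which is something else
d_twice <- dist(as.matrix(d))

print(round(as.matrix(d)["A", "B"], 3))   # 2.236, the real A-B distance
print(all.equal(as.vector(d), as.vector(d_twice)))  # reports a difference
```

The second object is still a valid dist, so hclust accepts it without complaint, but the dendrogram it produces no longer describes the original points.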
