In R, is there an algorithm for creating clusters of approximately equal size?

There seems to be a lot of information about creating hierarchical or k-means clusters. But I would like to know if there is a solution in R that would create K clusters of approximately equal size. There is some material on how to do this in other languages, but my Internet searches turned up nothing on how to do it in R.

An example would be

 set.seed(123)
 df <- matrix(rnorm(100*5), nrow=100)
 km <- kmeans(df, 10)
 print(sapply(1:10, function(n) sum(km$cluster==n)))

that leads to

 [1] 14 12 4 13 16 6 8 7 13 7 

Ideally I would like to see

 [1] 10 10 10 10 10 10 10 10 10 10 
+5
2 answers

First of all, I would say that you should not do this. Why would you? When your data has naturally formed clusters, for example,

 plot(matrix(c(sample(1:10, 10), sample(30:40, 7), sample(80:90, 9)), ncol = 2, byrow = FALSE))

then they will be grouped together anyway (assuming k equals the natural number of clusters; see this comprehensive answer on how to choose a good k). If the natural clusters are uniform in size, you will get clusters of roughly equal size; if they are not, forcing a uniform cluster size will certainly worsen the fit of the clustering solution. If you do not have natural clusters in your data, for example,

 plot(matrix(sample(1:100, 100), ncol = 2))

then forcing the cluster size will either be redundant (if the data are completely random, the cluster sizes will be roughly equal anyway, but then there is not much point in clustering) or, if the data do have good clusters, for example,

 plot(matrix(c(sample(1:15, 15), sample(20:100, 11)), ncol = 2, byrow = TRUE))

then forcing the size will almost certainly break them.
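To make the trade-off concrete, here is a minimal sketch that forces equal sizes on the data from the question and compares the within-cluster sum of squares. It assumes a simple greedy, capacity-constrained reassignment around the k-means centers. This technique is swapped in purely for illustration and does not come from the answer above; `equal_size_assign` and `wss` are hypothetical helper names.

```r
# Hypothetical helper: greedily assign points to the given centers,
# never letting a cluster exceed ceiling(n / k) points.
equal_size_assign <- function(x, centers) {
  k <- nrow(centers)
  n <- nrow(x)
  cap <- ceiling(n / k)                      # maximum points per cluster
  # n x k matrix of squared distances from each point to each center
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- integer(n)                      # 0 means "not yet assigned"
  counts <- integer(k)
  # visit (point, center) pairs from closest to farthest
  for (idx in order(d)) {
    i <- (idx - 1) %% n + 1                  # row index (point)
    j <- (idx - 1) %/% n + 1                 # column index (center)
    if (cluster[i] == 0L && counts[j] < cap) {
      cluster[i] <- j
      counts[j] <- counts[j] + 1L
    }
  }
  cluster
}

# Hypothetical helper: total within-cluster sum of squares,
# measured around each cluster's own mean.
wss <- function(x, cl) {
  sum(sapply(split(as.data.frame(x), cl),
             function(g) sum(scale(g, scale = FALSE)^2)))
}

set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)
km <- kmeans(df, 10)
eq <- equal_size_assign(df, km$centers)

table(km$cluster)     # unconstrained cluster sizes
table(eq)             # 10 points per cluster, since 100 / 10 divides evenly
wss(df, km$cluster)   # fit of the unconstrained solution
wss(df, eq)           # fit after forcing equal sizes
```

The second `wss` value is typically larger than the first, which is the loss of fit described above. A greedy pass like this is only one of several ways to balance cluster sizes, and it makes no claim to optimality.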

However, Ward's method, mentioned in the comments by Jason Aizkalns, will give you more "round" clusters than single linkage, for example, so this may be a way to go (see help(hclust) for the difference between ward.D and ward.D2; it is not arbitrary).
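As a rough illustration of that comparison, the sketch below clusters the data from the question with `hclust` using the `"ward.D2"` and `"single"` methods and cuts both trees into 10 groups. The choice of Euclidean distance and of `k = 10` is an assumption carried over from the question. Neither method guarantees equal sizes.

```r
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)
d <- dist(df)                                   # Euclidean distances

# Cut each dendrogram into 10 groups
ward   <- cutree(hclust(d, method = "ward.D2"), k = 10)
single <- cutree(hclust(d, method = "single"),  k = 10)

table(ward)     # cluster sizes under Ward's criterion
table(single)   # cluster sizes under single linkage
```

Comparing the two tables shows how evenly each linkage spreads the points across clusters on this particular data set.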
0

It's not entirely clear what you are asking, but it is very easy to create random data in R. If your data set has two dimensions, you can do something like this:

 cluster1 = data.frame(x = rnorm(100, mean=5,sd=1), y = rnorm(100, mean=5,sd=1))
 cluster2 = data.frame(x = rnorm(100, mean=15,sd=1), y = rnorm(100, mean=15,sd=1))

This generates normally distributed random data on x and y for 100 data points in each cluster.

Then view it:

 plot(cluster1, xlim = c(0,25), ylim = c(0,25))
 points(cluster2)
-2

Source: https://habr.com/ru/post/1210498/

