The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, beginning with item 1 on the list.
How to determine if my [clustering] algorithms work correctly?
For k-means, like other unsupervised ML techniques, there is no well-established repertoire of diagnostic tests to answer questions such as "are the cluster assignments returned by k-means more meaningful for k = 3 or k = 5?"
Still, there is one widely used test that yields intuitive results and is straightforward to apply. The diagnostic metric is simply this ratio:
inter-centroid separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just asks: how far apart is each cluster from the others (measured by the cluster centers)?
But inter-centroid separation alone doesn't tell the whole story, because two clustering algorithms could return results with the same inter-centroid separation even though one is clearly better, because its clusters are "tighter" (i.e., have smaller radii); in other words, the cluster edges are more widely separated. The second metric, intra-cluster variance, accounts for this. It is simply the mean variance, calculated per cluster.
In sum, the ratio of inter-centroid separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or for comparing results from the same algorithm run under different parameters, e.g., number of iterations, choice of distance metric, or number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple (a small Python sketch follows these steps):
For inter-centroid separation:

calculate the pair-wise distances between the cluster centers; then

take an average (e.g., the mean or median) of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in that cluster from its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the above step; then
average these variance values.
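If it helps, here is a minimal NumPy sketch of that ratio. The function name clustering_ratio and the choice of the median for aggregating the pair-wise center distances are illustrative assumptions on my part, not part of any standard API:

    import numpy as np

    def clustering_ratio(X, labels, centers):
        """Inter-centroid separation divided by intra-cluster variance.
        X: (n_samples, n_features) data; labels: cluster index per row;
        centers: (k, n_features) cluster centers. Higher is better."""
        labels = np.asarray(labels)
        k = len(centers)

        # Inter-centroid separation: median of pair-wise center distances.
        pair_dists = [np.linalg.norm(centers[i] - centers[j])
                      for i in range(k) for j in range(i + 1, k)]
        separation = np.median(pair_dists)

        # Intra-cluster variance: variance of the point-to-center distances
        # within each cluster, averaged over the clusters.
        variances = []
        for c in range(k):
            dists = np.linalg.norm(X[labels == c] - centers[c], axis=1)
            variances.append(np.var(dists))
        dispersion = np.mean(variances)

        return separation / dispersion

You could use it, for example, to compare two values of k fitted with scikit-learn's KMeans:

    from sklearn.cluster import KMeans

    for k in (3, 5):
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        print(k, clustering_ratio(X, km.labels_, km.cluster_centers_))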
This is my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
First, the easier question: is Euclidean distance still a valid metric as the number of dimensions/features increases?
Euclidean distance scales perfectly well - it works for two dimensions or two thousand. For any pair of data points (a short sketch in code follows these steps):
subtract their feature vectors element-wise,

square each element of the resulting vector,

sum those squared values,

take the square root of that scalar.
Nowhere in this sequence of calculations is scale involved.
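To make that concrete, here are those four steps as a small NumPy sketch; the function name euclidean is just for illustration, and the calculation is identical whether the vectors hold 30, 100, or 2,000 features:

    import numpy as np

    def euclidean(a, b):
        diff = np.asarray(a) - np.asarray(b)   # subtract the feature vectors element-wise
        squared = diff ** 2                    # square each element of the result vector
        total = squared.sum()                  # sum those squared values
        return np.sqrt(total)                  # take the square root of that scalar

    print(euclidean([1.0, 2.0], [4.0, 6.0]))   # 2 dimensions -> 5.0
    rng = np.random.default_rng(0)
    print(euclidean(rng.normal(size=2000), rng.normal(size=2000)))  # 2,000 dimensions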
But whether Euclidean distance is an appropriate similarity metric for your problem depends on your data. For example, is it purely numeric (continuous)? Or does it also have discrete (categorical) variables (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's more or less what Euclidean distance would do.
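A toy illustration of that pitfall (the 0/1 encoding below is arbitrary; any fixed numeric coding of a categorical variable has the same problem):

    import numpy as np

    # Encode "current location": San Francisco -> 0, Boston -> 1 (arbitrary codes).
    locations = np.array([0] * 100 + [1] * 100)

    # The mean is 0.5, a "location" that corresponds to no real city, yet this is
    # exactly the kind of in-between value Euclidean distance treats as meaningful.
    print(locations.mean())   # 0.5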
In any case, since we don't know anything about your data, I'll just give you a simple flow chart so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric, given your data:
