I want to do pre-clustering for a set of approx. 500,000 points.
I haven't started yet, but this is what I thought I would do:
- store all the points in a localSOLR index
- determine "natural cluster centers" according to some administrative information (the big cities, for example)
- then compute a cluster for each city:
  - for each city
    - for each zoom level
      - query the index for the points contained within a radius around the city (the radius length depending on the zoom level)
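The loop above can be sketched as follows. This is a minimal illustration, not a real localSOLR integration: the function and parameter names (`cluster_by_cities`, `radii_by_zoom`) are made up, and the linear scan inside would be replaced by a localSOLR spatial query in practice.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cluster_by_cities(points, cities, radii_by_zoom):
    """For each city and zoom level, collect the points within that zoom's radius.

    points: list of (lat, lon) tuples
    cities: dict of city name -> (lat, lon)
    radii_by_zoom: dict of zoom level -> radius in km
    Returns dict of (city, zoom) -> list of member points.
    """
    clusters = {}
    for city, (clat, clon) in cities.items():
        for zoom, radius in radii_by_zoom.items():
            # In the real plan this would be one localSOLR radius query.
            members = [p for p in points
                       if haversine_km(clat, clon, p[0], p[1]) <= radius]
            clusters[(city, zoom)] = members
    return clusters
```

With only ~100 cities and a handful of zoom levels, this is a few hundred radius queries in total, which is why the plan looked cheap at first.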
This should be efficient enough, because there are only about 100 major cities and SOLR queries are very fast. But a bit more thought showed that this approach is flawed:
- there may be clusters of points that are closer to each other than to any city: they should get their own cluster
- at some zoom levels, some points will not be within an acceptable distance of any city, and so will not be counted at all
- some cities are close to each other, so some points will be counted twice (added to both clusters)
There are other approaches:
- examine every point and determine which cluster it belongs to; this fixes problems 2 and 3 above, but not 1, and it is also extremely inefficient
- create a (rectangular) grid (one per zoom level); this works, but results in arbitrary clusters that don't mean anything
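For concreteness, here is a minimal sketch of the grid approach (the name `grid_clusters` and the cell-size parameter are my own). It also demonstrates the weakness: the cell boundaries are arbitrary, so points clustered around a boundary get split.

```python
import math
from collections import defaultdict

def grid_clusters(points, cell_deg):
    """Bucket (lat, lon) points into square grid cells of cell_deg degrees.

    cell_deg would shrink as the zoom level increases.
    Returns dict of cell key -> (centroid, point count).
    """
    cells = defaultdict(list)
    for lat, lon in points:
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cells[key].append((lat, lon))
    result = {}
    for key, pts in cells.items():
        clat = sum(p[0] for p in pts) / len(pts)
        clon = sum(p[1] for p in pts) / len(pts)
        result[key] = ((clat, clon), len(pts))
    return result
```

Two points at latitudes +0.01 and -0.01 sit 2 km apart but fall into different cells when `cell_deg = 1.0`, producing two clusters where a human would see one.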
I think what I'm looking for is a generic geo-clustering algorithm (or ideas for one), and I can't find any.
Edit, in response to the comment from Geert-Jan:
I would like to create "natural" clusters, yes, and yes, I'm afraid that if I use an arbitrary grid it won't reflect the reality of the data. For example, if many events occur around a point located at or near the boundary between two rectangles, I should get just one cluster, but the grid would build two (one in each rectangle).
Initially I wanted to use localSOLR for performance reasons (and because I know it, and have had better experience indexing lots of data into SOLR than loading it into a regular database); but since we are talking about pre-clustering, maybe performance isn't that important (although it shouldn't take days to visualize the result of a new clustering experiment). My first approach of looking up points around a predefined set of "big points" is clearly flawed anyway, and the first reason I mentioned is the strongest: clusters should reflect the reality of the data, not some other bureaucratic definition (the two will obviously overlap to some extent, but the data should come first).
There is a clusterer for live clustering built on top of the core Google Maps API: the Marker Clusterer. I wonder if anyone has tried running it "offline": let it run for however long it needs, then store the results?
Or is there a clusterer that examines each point, one by one, and outputs clusters with their coordinates and a count of the included points, and does so in a reasonable amount of time?
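One well-known shape for such a point-by-point clusterer is single-pass "leader" clustering; the sketch below is my own illustration (the name `leader_cluster` and the radius parameter are assumptions, not from any particular library). Each point joins the first existing cluster whose leader is within the radius, otherwise it starts a new one, and the output is exactly a list of (centroid, count) pairs.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def leader_cluster(points, radius_km):
    """Single-pass leader clustering over (lat, lon) points.

    Each point joins the first cluster whose leader (its founding point)
    lies within radius_km; otherwise it founds a new cluster.
    Returns a list of (centroid, count) pairs.
    """
    leaders = []   # founding point of each cluster
    members = []   # member points of each cluster
    for lat, lon in points:
        for i, (llat, llon) in enumerate(leaders):
            if haversine_km(lat, lon, llat, llon) <= radius_km:
                members[i].append((lat, lon))
                break
        else:
            leaders.append((lat, lon))
            members.append([(lat, lon)])
    out = []
    for pts in members:
        clat = sum(p[0] for p in pts) / len(pts)
        clon = sum(p[1] for p in pts) / len(pts)
        out.append(((clat, clon), len(pts)))
    return out
```

This is O(n * k) for n points and k clusters, so a single offline pass over ~500,000 points stays feasible as long as the cluster count is moderate; the result depends on point order, which is the usual trade-off with this family of algorithms.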