Point Cluster Detection Algorithm

I have a 2D area with “points” distributed over this area. Now I'm trying to detect "clusters" of points, that is, areas with a certain high density of points.

Any thoughts (or links to articles with thoughts) on how to gracefully define these areas?

+43
algorithm data-structures image-processing
Dec 10 '08 at 13:28
17 answers

How about defining an arbitrary resolution for your space and computing, for each cell of that grid, a measure of the distance from that cell to every point? You can then render a "heat map" and apply a threshold to determine the clusters.

This is a nice exercise; maybe I'll post a solution later.

EDIT:

Here it is:

 //load the image
 PImage sample;
 sample = loadImage("test.png");
 size(sample.width, sample.height);
 image(sample, 0, 0);
 int[][] heat = new int[width][height];

 //parameters
 int resolution = 5;  //distance between points in the grid
 int distance = 8;    //distance at which two points are considered near
 float threshold = 0.5;
 int level = 240;     //brightness level below which a pixel counts as a dot
 int sensitivity = 1; //how much each dot matters

 //calculate the "heat" on each point of the grid
 color black = color(0,0,0);
 loadPixels();
 for(int a=0; a<width; a+=resolution){
   for(int b=0; b<height; b+=resolution){
     for(int x=0; x<width; x++){
       for(int y=0; y<height; y++){
         color c = sample.pixels[y*sample.width+x];
         /**
          * the heat should be a function of the brightness and the distance,
          * but this works (tm)
          */
         if(brightness(c)<level && dist(x,y,a,b)<distance){
           heat[a][b] += sensitivity;
         }
       }
     }
   }
 }

 //render the output
 for(int a=0; a<width; ++a){
   for(int b=0; b<height; ++b){
     pixels[b*sample.width+a] = color(heat[a][b],0,0);
   }
 }
 updatePixels();
 filter(THRESHOLD,threshold);

EDIT 2 (less efficient code, but the same output):

 //load the image
 PImage sample;
 sample = loadImage("test.png");
 size(sample.width, sample.height);
 image(sample, 0, 0);
 int[][] heat = new int[width][height];

 int dotQ = 0;
 int[][] dots = new int[width*height][2];
 int X = 0;
 int Y = 1;

 //parameters
 int resolution = 1;  //distance between points in the grid
 int distance = 20;   //distance at which two points are considered near
 float threshold = 0.6;
 int level = 240;     //minimum brightness to detect the dots
 int sensitivity = 1; //how much each dot matters

 //detect all dots in the sample
 loadPixels();
 for(int x=0; x<width; x++){
   for(int y=0; y<height; y++){
     color c = pixels[y*sample.width+x];
     if(brightness(c)<level) {
       dots[dotQ][X] += x;
       dots[dotQ++][Y] += y;
     }
   }
 }

 //calculate heat
 for(int x=0; x<width; x+=resolution){
   for(int y=0; y<height; y+=resolution){
     for(int d=0; d<dotQ; d++){
       if(dist(x,y,dots[d][X],dots[d][Y]) < distance)
         heat[x][y]+=sensitivity;
     }
   }
 }

 //render the output
 for(int a=0; a<width; ++a){
   for(int b=0; b<height; ++b){
     pixels[b*sample.width+a] = color(heat[a][b],0,0);
   }
 }
 updatePixels();
 filter(THRESHOLD,threshold);

 /** This smooths the output at low resolutions
  * for(int i=0; i<10; ++i) filter(DILATE);
  * for(int i=0; i<3; ++i) filter(BLUR);
  * filter(THRESHOLD);
  */

And the result with Kent's (reduced) sample:

+24
Dec 10 '08 at 13:40

I would suggest using a mean-shift kernel to find the density centers of your points.

Mean shift illustration: http://cvr.yorku.ca/members/gradstudents/kosta/compvis/file_mean_shift_path.gif

This figure shows a mean-shift kernel (initially centered on the edge of the cluster) converging on the cluster's point of highest density.

In theory (in a nutshell):

A couple of answers to this question already hint at pieces of this approach:

What you see in the animated picture is a combination of these two suggestions: it uses a moving "block" (i.e., a kernel) to search for locally high density.

Mean shift is an iterative method that uses a pixel neighborhood called a kernel (similar to this one) and uses it to compute the mean of the underlying image data. The mean in this context is the pixel-weighted average of the kernel's coordinates.

In each iteration, the computed mean becomes the kernel's center coordinates for the next iteration; this is called the shift, hence the name mean shift. The stopping condition is reached when the shift distance drops to 0 (i.e., we are at the densest spot in the neighborhood).

A comprehensive introduction to mean shift (both theory and applications) can be found in this ppt presentation.

In practice:

An implementation of mean shift is available in OpenCV:

 int cvMeanShift( const CvArr* prob_image, CvRect window, CvTermCriteria criteria, CvConnectedComp* comp ); 

O'Reilly's Learning OpenCV (Google Books excerpt) also provides a good explanation of how it works. Basically, just feed it your image of dots (prob_image).

In practice, the trick is choosing the right kernel size. The smaller the kernel, the closer to the cluster you need to initialize it. The larger the kernel, the more random your initial position can be. However, if your image has several clusters of points, a large kernel may converge right between them.
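For intuition, here is a minimal pure-Python sketch of the mean-shift iteration described above, using a flat circular kernel over raw 2D points rather than an image (the function name and parameter values are illustrative, not OpenCV's API):

```python
import math

def mean_shift(points, start, radius, max_iter=100, tol=1e-6):
    """Shift a circular window toward the local density peak of 2D points."""
    cx, cy = start
    for _ in range(max_iter):
        # points inside the (flat, circular) kernel
        near = [(x, y) for x, y in points
                if math.hypot(x - cx, y - cy) <= radius]
        if not near:
            break
        # the new center is the mean of the points in the window
        nx = sum(x for x, _ in near) / len(near)
        ny = sum(y for _, y in near) / len(near)
        if math.hypot(nx - cx, ny - cy) < tol:  # shift ~ 0: converged
            break
        cx, cy = nx, ny
    return cx, cy

# a dense clump around (10, 10) plus one far-away outlier
pts = [(9, 9), (10, 10), (11, 10), (10, 11), (11, 11), (30, 30)]
center = mean_shift(pts, start=(8, 8), radius=5)
```

Starting at (8, 8), the window drifts onto the clump and settles at its mean, ignoring the outlier at (30, 30).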

+23
Jan 05 '09 at 18:28

To add a little to Trebs' point, I think it's important to actually define what a cluster is; "points that are close to each other" is rather vague.

Take this sample set I generated; I know there is a cluster shape in there because I created it.

However, identifying this "cluster" programmatically can be hard.

A human might call it one large toroidal cluster, but your automated program will more likely identify it as a series of small clusters in close proximity.

Also, note that there are ultra-high-density areas that, in the context of the larger picture, are just a distraction.

You will need to consider this behavior, and possibly merge clusters of similar density that are separated only by minor voids of lower density, depending on the specific application.

Whatever you develop, I would at least be interested in how it classifies the data in this set.

(I think looking into HDR tone-mapping techniques may be worthwhile, since they work more or less on light density, and there are "local" and "global" tone maps, each giving different results.)

+13
Dec 10 '08 at 13:56

Apply a blur filter to a copy of your 2D area. Something like

 1 2 3 2 1 2 4 6 4 2 3 6 9 6 3 2 4 6 4 2 1 2 3 2 1 

Darker areas now identify point clusters.
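A sketch of this in plain Python: convolve a binary grid of points with the kernel above, and the high responses mark the clusters (a real implementation would use an image library's blur; the grid and point positions here are made up for illustration):

```python
KERNEL = [
    [1, 2, 3, 2, 1],
    [2, 4, 6, 4, 2],
    [3, 6, 9, 6, 3],
    [2, 4, 6, 4, 2],
    [1, 2, 3, 2, 1],
]

def blur_density(grid):
    """Convolve a binary point grid with the 5x5 blur kernel."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ky in range(5):
                for kx in range(5):
                    yy, xx = y + ky - 2, x + kx - 2
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += KERNEL[ky][kx] * grid[yy][xx]
    return out

# three points clustered top-left, one isolated point bottom-right
grid = [[0] * 8 for _ in range(8)]
for x, y in [(1, 1), (2, 1), (1, 2), (6, 6)]:
    grid[y][x] = 1
heat = blur_density(grid)
# the clustered region scores far higher than the lone point
```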

+12
Dec 10 '08 at 13:57

You can try creating a quadtree view of the data. Shorter paths on the graph will correspond to areas with high density.

Or, to put it another way: with a quadtree and a level-order traversal, each low-level node consisting of "points" corresponds to a high-density region. As the level of such nodes rises, the regions they represent have a lower density of "points".
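As an illustrative sketch (not a production implementation): a tiny point quadtree where a node splits once it holds more than `capacity` points, so denser areas end up subdivided to deeper levels. It assumes distinct points (coincident points would split forever):

```python
class QuadTree:
    """A point quadtree over the square [x, x+size) x [y, y+size)."""
    def __init__(self, x, y, size, capacity=2, depth=0):
        self.x, self.y, self.size = x, y, size
        self.capacity, self.depth = capacity, depth
        self.points, self.children = [], []

    def insert(self, px, py):
        if self.children:                      # internal node: recurse
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        if len(self.points) > self.capacity:   # over-full leaf: split
            half = self.size / 2
            self.children = [
                QuadTree(self.x + dx, self.y + dy, half,
                         self.capacity, self.depth + 1)
                for dx in (0, half) for dy in (0, half)]
            for p in self.points:
                self._child_for(*p).insert(*p)
            self.points = []

    def _child_for(self, px, py):
        half = self.size / 2
        i = (2 if px >= self.x + half else 0) + (1 if py >= self.y + half else 0)
        return self.children[i]

    def max_depth(self):
        if not self.children:
            return self.depth
        return max(c.max_depth() for c in self.children)

# four tightly packed points force deep subdivision; a lone point stays shallow
t = QuadTree(0, 0, 16)
for p in [(1, 1), (1.2, 1.1), (0.8, 0.9), (1.1, 1.3), (12, 12)]:
    t.insert(*p)
```

After these inserts, `t.max_depth()` is large because the clump around (1, 1) keeps splitting, while the subtree holding (12, 12) never splits.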

+10
Dec 10 '08 at

There's a great tutorial on clustering algorithms here; it discusses k-means and k-Gaussians.

+6
Dec 11 '08 at 15:19

What about the morphological approach?

Dilate the thresholded image by a certain amount (depending on the target density of points); points in a cluster will then merge into a single object.

OpenCV supports morphological operations (as do a number of image-processing libraries):

http://www.seas.upenn.edu/~bensapp/opencvdocs/ref/opencvref_cv.htm#cv_imgproc_morphology
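To sketch the idea without OpenCV (pure Python with a square structuring element; a real implementation would use cvDilate plus connected-component labeling):

```python
def dilate(grid, r):
    """Grow each set pixel into a (2r+1) x (2r+1) square so nearby points merge."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(grid[yy][xx]
                   for yy in range(max(0, y - r), min(h, y + r + 1))
                   for xx in range(max(0, x - r), min(w, x + r + 1))):
                out[y][x] = 1
    return out

def count_components(grid):
    """Count 4-connected components via flood fill: one per merged cluster."""
    h, w = len(grid), len(grid[0])
    seen, n = set(), 0
    for y in range(h):
        for x in range(w):
            if grid[y][x] and (y, x) not in seen:
                n += 1
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    if ((cy, cx) in seen or not (0 <= cy < h and 0 <= cx < w)
                            or not grid[cy][cx]):
                        continue
                    seen.add((cy, cx))
                    stack += [(cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)]
    return n

grid = [[0] * 10 for _ in range(10)]
for x, y in [(1, 1), (3, 1), (8, 8)]:   # two nearby dots and one far away
    grid[y][x] = 1
```

Before dilation the grid has three separate dots; after dilating by 1 the two nearby dots merge into one object, leaving two components.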

+5
Dec 10 '08 at 13:38

It really sounds like an academic question.

The solution that comes to mind involves R*-trees. These divide your total area into individually sized and possibly overlapping boxes. You could then determine whether each box represents a "cluster" or not by calculating the average distance between its points.

R * Trees

If this approach becomes difficult to implement, you may be better off splitting your data grid into equally sized units and determining whether a cluster occurs in each; you will have to be very mindful of boundary conditions with this approach, though. I would suggest that, after the initial division, you go back through and recombine areas whose data points lie within a certain threshold of a shared edge.

+4
Dec 10 '08 at 14:02
  • Fit a probability density function to the data. I would use a "mixture of Gaussians" and fit it with expectation-maximization, seeded by the k-means algorithm. K-means by itself can sometimes be sufficient without EM. The number of clusters would need to be chosen with a model-order selection algorithm.
  • Then, each point can be scored with p(x) under the model, i.e., the posterior probability that the point was generated by the model.
  • Find the maxima of p(x) to locate the cluster centroids.

This can be coded up very quickly in a tool such as MATLAB using its machine-learning toolboxes. MoG/EM/k-means clustering is widely discussed on the web and in standard texts. My favorite text is "Pattern Classification" by Duda and Hart.
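As a hedged illustration of just the k-means step (Lloyd's algorithm in plain Python, with fixed starting centroids for determinism; a full MoG/EM fit would refine this with covariances and soft assignments):

```python
import math

def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda k: math.dist(p, centroids[k]))
            groups[j].append(p)
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else c
            for g, c in zip(groups, centroids)]
    return centroids

# two obvious clumps; centroids converge to the clump means
pts = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
centers = kmeans(pts, centroids=[(0, 0), (5, 5)])
```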

+4
Feb 04 '09 at 19:06

"Areas with a certain high density" implies you know roughly how many points per unit area you consider high. That leads me toward a grid approach: break your total area into appropriately sized sub-areas, then count the number of points in each. When you find grid areas near your threshold, you can also search their neighboring grid areas.
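A minimal sketch of this grid approach (plain Python; the cell size and density threshold are arbitrary illustrative choices):

```python
from collections import defaultdict

def grid_density(points, cell_size):
    """Count how many points fall into each grid cell."""
    counts = defaultdict(int)
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

# three points land in cell (0, 0); the fourth is off on its own
pts = [(0.5, 0.5), (1.2, 0.8), (0.9, 1.4), (7.5, 7.5)]
counts = grid_density(pts, cell_size=2.0)
dense = [cell for cell, n in counts.items() if n >= 3]  # threshold = 3
```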

+3
Dec 10 '08 at 13:40

Let me organize this as a research article.

a. Problem Statement

To quote Epaga: "I have a 2D area with 'points' distributed over this area. Now I'm trying to detect 'clusters' of points, that is, areas with a certain high density of points."

Note that it is not mentioned anywhere that the points come from an image. (Although they could have been obtained from one.)

b. Method

case 1: The points are just points (point = location in 2D space). In this case you already have the x and y locations of all the points, and the problem reduces to clustering them. Ivan has done a great job of proposing a solution, and he also summarized other answers of a similar flavor. My 2cts in addition to his post: consider whether or not you know the number of clusters a priori; algorithms (supervised vs. unsupervised clustering) can then be chosen accordingly.

case 2: The points come from an image. Here the problem needs clarification. Let me explain using this image: [image] If no distinction is made between the gray values of the points, groups 1, 2, 3, 4, and 5 are all "distinct clusters". However, if a distinction is made based on gray value, cluster 5 is tricky, since its points have different gray values.

Regardless, this problem can be reduced to case 1 by raster-scanning the image and storing the coordinates of the non-zero (non-white) pixels. The clustering algorithms proposed earlier can then be used to compute the number of clusters and the cluster centers.
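The reduction from case 2 to case 1 can be sketched as follows (plain Python; the image is just a list of rows of gray values, and the brightness cutoff of 240 mirrors the `level` parameter used in the Processing answer above):

```python
def image_to_points(image, level=240):
    """Raster-scan a grayscale image (rows of 0..255 values) and collect
    the coordinates of 'ink' pixels, i.e., pixels darker than `level`."""
    points = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value < level:
                points.append((x, y))
    return points

image = [
    [255, 255, 255, 255],
    [255,   0,  10, 255],
    [255, 255, 255,   5],
]
points = image_to_points(image)
```

The resulting coordinate list can be fed straight into any of the point-clustering algorithms from case 1.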

+3
01 Sep 2018-10-01

I think it depends on how much separation there is between the points and the clusters. If the distances are large and irregular, I would initially triangulate the points and then delete/hide all triangles with statistically large edge lengths. The remaining sub-triangulations form clusters of arbitrary shape. Traversing the edges of these sub-triangulations yields polygons that can be used to determine which specific points lie in each cluster. The polygons can also be compared against known shapes, such as Kent Fredric's torus, as required.

IMO, grid methods are good for quick-and-dirty solutions, but they degrade very quickly on sparse data. Quadtrees are better, but TINs are my personal favorite for more complex analysis.

+2
Dec 10 '08 at 14:46

I would calculate the distance from each point to every other point, then sort those distances. Points whose mutual distance is below a threshold are considered near each other, and a group of points that are near one another forms a cluster.

The problem is that a cluster may be obvious to a human looking at a plot, yet it has no clear mathematical definition. You have to define your nearness threshold, probably by adjusting it empirically until the result of your algorithm matches (more or less) what you perceive as clustered.
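A sketch of this in plain Python, using union-find to group mutually near points (the threshold is exactly the empirical knob described above):

```python
import math

def threshold_clusters(points, near):
    """Group points into clusters: any two points closer than `near` are
    linked, and clusters are the connected groups of linked points."""
    n = len(points)
    parent = list(range(n))

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < near:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

# a triangle of near points and a distant pair
pts = [(0, 0), (1, 0), (0.5, 1), (10, 10), (11, 10)]
groups = threshold_clusters(pts, near=2.0)
```

With `near=2.0` this yields two clusters: the three points near the origin and the pair near (10, 10).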

0
Dec 10 '08 at 13:37

You could overlay a logical grid on top of your plane. If a grid cell contains a certain number of points, it is considered "dense" and can then be thinned. This is done in GIS applications when working with cluster tolerances. Using a grid helps partition your thinning algorithm.

0
Dec 10 '08 at 14:02

You could use a genetic algorithm for this. If you define a "cluster" as, say, a rectangular sub-area with a high density of points, you can create an initial set of "solutions", each consisting of some number of randomly generated, non-overlapping rectangular clusters. You then write a "fitness function" that evaluates each solution; in this case you would want the fitness function to minimize the total number of clusters while maximizing the density of points within each cluster.

Your initial set of "solutions" will most likely be terrible, but some are likely to be slightly less terrible than others. You use the fitness function to eliminate the worst solutions, then create the next generation of solutions by crossbreeding the "winners" of the last generation. By repeating this process generation after generation, you should arrive at one or more good solutions to the problem.

For a genetic algorithm to work, the various possible solutions across the problem space must differ incrementally from each other in how well they solve the problem. Point clusters are ideal for this.
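A toy sketch of this idea for a single rectangular cluster (plain Python; the fitness function, population sizes, and mutation scale are all illustrative assumptions, and a real version would evolve sets of non-overlapping rectangles):

```python
import random

random.seed(0)

points = [(x, y) for x in range(3, 6) for y in range(3, 6)]  # a 3x3 blob

def fitness(rect):
    """Score a candidate rectangular cluster: many points, small area."""
    x0, y0, x1, y1 = rect
    if x1 <= x0 or y1 <= y0:
        return 0.0
    inside = sum(x0 <= x <= x1 and y0 <= y <= y1 for x, y in points)
    return inside * inside / ((x1 - x0) * (y1 - y0))

def random_rect():
    x0, y0 = random.uniform(0, 9), random.uniform(0, 9)
    return (x0, y0, x0 + random.uniform(1, 9), y0 + random.uniform(1, 9))

def breed(a, b):
    # crossover: take each coordinate from either parent, then mutate it
    return tuple(random.choice(p) + random.gauss(0, 0.3) for p in zip(a, b))

# seed the population; one all-covering rectangle guarantees a nonzero start
population = [(0.0, 0.0, 9.0, 9.0)] + [random_rect() for _ in range(29)]
for generation in range(40):
    population.sort(key=fitness, reverse=True)
    winners = population[:10]                     # elitist selection
    population = winners + [breed(random.choice(winners),
                                  random.choice(winners))
                            for _ in range(20)]

best = max(population, key=fitness)
```

Because the fittest individuals are carried over unchanged each generation, the best score can only improve, and the surviving rectangle tends to shrink around the blob of points.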

0
Dec 10 '08 at 14:30

Cluster 3.0 includes a library of C routines for statistical clustering. It offers several different methods, which may or may not solve your problem depending on what shape your point clusters take. The library is available here and is distributed under the Python license.

0
Feb 25 '09 at 12:32

Have you tried simple, out-of-the-box solutions like ImagXpress from Accusoft Pegasus?

Its BlobRemoval method can be tuned by pixel count and density to find punch holes, even when they are not contiguous. (You can also try the Dilate function to close gaps.)

With a bit of experimentation, you can probably get the results you need in the vast majority of cases, with very little code or science.

C#:
 public void DocumentBlobRemoval(Rectangle Area, int MinPixelCount, int MaxPixelCount, short MinDensity)

0
Jun 15 '09 at 18:41


