HashSet remove fails after iterating over it

I am writing an agglomerative clustering algorithm in Java and am having trouble with a remove operation. It always seems to fail when the number of clusters reaches half the initial number.

In the sample code below, clusters is a Collection<Collection<Integer>>.

    while (clusters.size() > K) {
        // determine smallest distance between clusters
        Collection<Integer> minclust1 = null;
        Collection<Integer> minclust2 = null;
        double mindist = Double.POSITIVE_INFINITY;

        for (Collection<Integer> cluster1 : clusters) {
            for (Collection<Integer> cluster2 : clusters) {
                if (cluster1 != cluster2 && getDistance(cluster1, cluster2) < mindist) {
                    minclust1 = cluster1;
                    minclust2 = cluster2;
                    mindist = getDistance(cluster1, cluster2);
                }
            }
        }

        // merge the two clusters
        minclust1.addAll(minclust2);
        clusters.remove(minclust2);
    }

After a few runs through the loop, clusters.remove(minclust2) eventually returns false, but I don't understand why.

I tested this code by first creating 10 clusters, each containing a single integer from 1 to 10. Distances are random numbers between 0 and 1. Here is the output after adding a few println statements. After the number of clusters, I print the actual clusters, the merge operation, and the result of clusters.remove(minclust2).

Clustering: 10 clusters
[[3], [1], [10], [5], [9], [7], [2], [4], [6], [8]]
[5] <- [6]
true
Clustering: 9 clusters
[[3], [1], [10], [5, 6], [9], [7], [2], [4], [8]]
[7] <- [8]
true
Clustering: 8 clusters
[[3], [1], [10], [5, 6], [9], [7, 8], [2], [4]]
[10] <- [9]
true
Clustering: 7 clusters
[[3], [1], [10, 9], [5, 6], [7, 8], [2], [4]]
[5, 6] <- [4]
true
Clustering: 6 clusters
[[3], [1], [10, 9], [5, 6, 4], [7, 8], [2]]
[3] <- [2]
true
Clustering: 5 clusters
[[3, 2], [1], [10, 9], [5, 6, 4], [7, 8]]
[10, 9] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4, 5, 6, 4] <- [5, 6, 4]
false

From there, the [10, 9, 5, 6, 4, 5, 6, 4, ...] cluster just keeps growing.

Edit: to clarify, each cluster is a HashSet<Integer>, and clusters is a HashSet<HashSet<Integer>>.


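When you modify a value that is already in a Set (or a key in a Map), it is no longer necessarily stored in the right position, because that position was derived from its old hash code. You need to remove it, modify it, and then re-insert it.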
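A minimal sketch of how remove can start returning false on a HashSet whose elements are themselves mutable sets, and of the remove, merge, re-insert workaround. The class and method names are hypothetical and the values are only loosely modelled on the output in the question. It relies on the fact that a HashSet<Integer>'s hash code is the sum of its elements' hash codes, so adding elements changes it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MutatedElementDemo {

    // Merges into a cluster while it is still stored in the outer HashSet,
    // then tries to remove that cluster. Returns what remove() reports.
    public static boolean removeAfterInPlaceMerge() {
        Set<Integer> target = new HashSet<Integer>(Arrays.asList(5, 6));
        Set<Integer> source = new HashSet<Integer>(Arrays.asList(4));
        Set<Set<Integer>> clusters = new HashSet<Set<Integer>>();
        clusters.add(target);
        clusters.add(source);

        target.addAll(source);          // target.hashCode() changes from 11 to 15
        return clusters.remove(target); // looked up under the new hash code
    }

    // Takes both clusters out first, merges them, then re-inserts the result.
    public static Set<Set<Integer>> mergeByReinserting() {
        Set<Integer> target = new HashSet<Integer>(Arrays.asList(5, 6));
        Set<Integer> source = new HashSet<Integer>(Arrays.asList(4));
        Set<Set<Integer>> clusters = new HashSet<Set<Integer>>();
        clusters.add(target);
        clusters.add(source);

        clusters.remove(target); // removed while its hash code is still current
        clusters.remove(source);
        target.addAll(source);
        clusters.add(target);    // stored again under the new hash code
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(removeAfterInPlaceMerge()); // false

        Set<Set<Integer>> clusters = mergeByReinserting();
        Set<Integer> probe = new HashSet<Integer>(Arrays.asList(4, 5, 6));
        System.out.println(clusters.remove(probe));    // true
    }
}
```

Applied to the loop in the question, this would mean removing both minclust1 and minclust2 from clusters before calling addAll, then adding minclust1 back.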




The obvious problem is that clusters.remove probably uses equals to find the element to remove. Unfortunately, equals on collections generally compares whether the elements are the same, not whether it is the same collection (I believe C# makes a better choice in this regard).

An easy fix is to create clusters as Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>()) (I think).
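A minimal sketch of the identity-based set suggested here, under the assumption that clusters only ever needs to tell cluster objects apart by reference. The class and method names are hypothetical. Collections.newSetFromMap and IdentityHashMap are standard java.util APIs; the resulting set compares elements with == and System.identityHashCode, so changing a cluster's contents does not affect lookups.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class IdentityClustersDemo {

    // Builds a Set that compares its elements by reference instead of equals().
    public static Set<Collection<Integer>> newIdentitySet() {
        return Collections.newSetFromMap(
                new IdentityHashMap<Collection<Integer>, Boolean>());
    }

    // Same in-place merge as in the question, but on an identity-based set.
    public static boolean removeAfterInPlaceMerge() {
        Collection<Integer> target = new HashSet<Integer>(Arrays.asList(5, 6));
        Collection<Integer> source = new HashSet<Integer>(Arrays.asList(4));
        Set<Collection<Integer>> clusters = newIdentitySet();
        clusters.add(target);
        clusters.add(source);

        target.addAll(source);          // contents change, identity does not
        return clusters.remove(target); // found by reference
    }

    // Two distinct clusters with equal contents are both kept.
    public static int sizeWithEqualClusters() {
        Set<Collection<Integer>> clusters = newIdentitySet();
        clusters.add(new HashSet<Integer>(Arrays.asList(1)));
        clusters.add(new HashSet<Integer>(Arrays.asList(1)));
        return clusters.size();
    }

    public static void main(String[] args) {
        System.out.println(removeAfterInPlaceMerge()); // true
        System.out.println(sizeWithEqualClusters());   // 2
    }
}
```

The trade-off is that such a set deliberately violates the general Set contract: it no longer deduplicates clusters with equal contents, which is usually what a clustering loop wants but may surprise code that expects equals-based behaviour.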


Source: https://habr.com/ru/post/1706502/

