NaN in Mahout's Euclidean distance implementation

We use the EuclideanDistanceSimilarity class to calculate item-to-item similarity for a large number of items on Hadoop.

Unfortunately, some items get zero or very few similar items, even though they are very similar to other items.

I think I tracked it down to this line in the EuclideanDistanceSimilarity class:

double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);

The value passed to sqrt is sometimes negative, in which case NaN is returned. I suspect there should be a Math.abs in there somewhere, but my maths is not strong enough to understand how the Euclidean calculation was rearranged, so I'm not sure what the effect would be.

Can someone explain the math and confirm that

double euclideanDistance = Math.sqrt(Math.abs(normA - 2 * dots + normB));

would be an acceptable solution?

1 answer

The code is in org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.EuclideanDistanceSimilarity.

Yes, it is written this way because at this point in the computation it already has the norms of vectors A and B and their dot product, so computing the distance this way is much faster.

The identity is pretty simple. Let C = A - B, and let a, b and c be the lengths of the corresponding vectors. We need c. From the law of cosines, $c^2 = a^2 + b^2 - 2ab\cos\theta$, where $\theta$ is the angle between A and B, and $ab\cos\theta$ is simply the value of the dot product. Note that normA in the code is actually the square of the norm (length); it really should be named better.
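The same identity can be written out directly in the variable names from the code. This is only a sketch of the algebra, assuming normA and normB hold the squared norms as noted above and dots holds the dot product of A and B:

```latex
\begin{aligned}
\lVert A - B \rVert^2 &= (A - B)\cdot(A - B) \\
  &= A\cdot A - 2\,A\cdot B + B\cdot B \\
  &= \text{normA} - 2\cdot\text{dots} + \text{normB}
\end{aligned}
```

The left-hand side is a squared length, so mathematically it can never be negative; a negative value can only come from floating-point rounding.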

Back to the question: you are right, the bug here is that rounding can make the argument negative. The fix is not abs(), but:

 double euclideanDistance = Math.sqrt(Math.max(0.0, normA - 2 * dots + normB)); 

It just needs to be clamped at 0. I can fix this.
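To see the clamp in isolation, here is a minimal, self-contained sketch. The class ClampedEuclideanDistance and its helper methods are hypothetical names made up for illustration, not Mahout code, and the negative argument is produced by nudging the dot product by hand rather than by a real Hadoop job:

```java
public class ClampedEuclideanDistance {

    // Hypothetical helper (not part of Mahout): distance from the two
    // squared norms and the dot product, with the argument clamped at 0.
    public static double distance(double normA, double normB, double dots) {
        // Mathematically >= 0, but rounding can leave a tiny negative value.
        double squared = normA - 2 * dots + normB;
        return Math.sqrt(Math.max(0.0, squared));
    }

    // Plain dot product of two equal-length arrays.
    public static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {3.0, 4.0};
        double[] b = {0.0, 0.0};
        // Ordinary case: the clamp changes nothing.
        System.out.println(distance(dot(a, a), dot(b, b), dot(a, b))); // 5.0

        // Unclamped sqrt of a slightly negative number is NaN.
        System.out.println(Math.sqrt(-1e-12)); // NaN

        // Simulated rounding error: dots is a hair too large, so the
        // argument is about -1e-12; the clamp turns it into a distance of 0.
        System.out.println(distance(1.0, 1.0, 1.0 + 5e-13)); // 0.0
    }
}
```

With Math.abs in place of the clamp, the last case would return roughly 1e-6 instead of 0, which reports a small spurious distance for vectors that are effectively identical.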


Source: https://habr.com/ru/post/1442333/

