Caffe SigmoidCrossEntropyLoss Loss Function

I was looking through the Caffe code and documentation for the SigmoidCrossEntropyLoss layer, and I'm a bit confused. The documentation lists the loss function as log loss (I would reproduce it here, but without LaTeX the formula would be hard to read; check the documentation link, it is right at the top).

However, the code itself (Forward_cpu(...)) shows a different formula:

 Dtype loss = 0;
 for (int i = 0; i < count; ++i) {
   loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
       log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
 }
 top[0]->mutable_cpu_data()[0] = loss / num;

Is this because the sigmoid function has already been applied to the input?

However, even in that case, the (input_data[i] >= 0) fragments also confuse me. They seem to stand in for p_hat from the documented loss formula, which is supposed to be the prediction squashed by the sigmoid function. So why do they just take a binary threshold? It is even more confusing because this loss predicts [0,1] outputs, so (input_data[i] >= 0) will be 1 unless the prediction is completely sure it should not be.

Can someone explain this to me?

1 answer

The SigmoidCrossEntropyLoss layer in Caffe combines two steps (Sigmoid + CrossEntropy) that are applied to input_data in one piece of code:

 Dtype loss = 0;
 for (int i = 0; i < count; ++i) {
   loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
       log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
 }
 top[0]->mutable_cpu_data()[0] = loss / num;

In fact, regardless of whether input_data >= 0 or not, the code above is always mathematically equivalent to the following code:

 Dtype loss = 0;
 for (int i = 0; i < count; ++i) {
   loss -= input_data[i] * (target[i] - 1) - log(1 + exp(-input_data[i]));
 }
 top[0]->mutable_cpu_data()[0] = loss / num;
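
To verify the equivalence, write x = input_data[i] and t = target[i] and check the two cases of the indicator (x >= 0) separately (a quick check of my own, in LaTeX notation):

 x \ge 0:\quad x\,(t - 1) - \log(1 + e^{x - 2x}) = x\,(t - 1) - \log(1 + e^{-x})

 x < 0:\quad x\,t - \log(1 + e^{x}) = x\,t - x - \log(1 + e^{-x}) = x\,(t - 1) - \log(1 + e^{-x})

(the second case uses \log(1 + e^{x}) = x + \log(1 + e^{-x})). Both branches reduce to the same expression, which is what the simplified loop computes.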

This code follows from the mathematical formula you get after applying Sigmoid and then CrossEntropy to input_data and simplifying the combined expression.
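
Spelling that out (my own reconstruction, in LaTeX notation, with x = input_data[i], t = target[i], and \hat{p} = \sigma(x) = 1 / (1 + e^{-x})): the per-element cross-entropy is

 \ell(x, t) = -\big( t \log \sigma(x) + (1 - t) \log(1 - \sigma(x)) \big)

and substituting \log \sigma(x) = -\log(1 + e^{-x}) and \log(1 - \sigma(x)) = -x - \log(1 + e^{-x}) gives

 \ell(x, t) = -\big( x\,(t - 1) - \log(1 + e^{-x}) \big)

which is exactly the term accumulated by loss -= ... in the simplified loop above.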

But the first version of the code (the one Caffe uses) is more numerically stable and less prone to overflow: the exponent input_data[i] - 2 * input_data[i] * (input_data[i] >= 0) is always -|input_data[i]|, so it avoids computing a large exp(input_data) (or exp(-input_data)) when the absolute value of input_data is very large. That's why you see this code in Caffe.
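
Here is a minimal standalone sketch (my own illustration, not Caffe code) of the overflow difference, assuming double precision and a single hypothetical input/target pair:

 // stability_demo.cpp: compare the naive and the Caffe-style stable form
 // of the per-element sigmoid cross-entropy loss for a large-magnitude input.
 #include <cmath>
 #include <cstdio>

 int main() {
   double x = -1000.0;  // large-magnitude input (logit)
   double t = 1.0;      // target label

   // Naive form: exp(-x) = exp(1000) overflows to inf, so the loss is inf.
   double naive = -(x * (t - 1) - std::log(1 + std::exp(-x)));

   // Caffe-style form: the exponent x - 2 * x * (x >= 0) is always -|x|,
   // so exp() never overflows.
   double stable = -(x * (t - (x >= 0)) -
                     std::log(1 + std::exp(x - 2 * x * (x >= 0))));

   std::printf("naive:  %f\nstable: %f\n", naive, stable);
   return 0;
 }

With x = -1000 and t = 1 the naive form returns inf, while the stable form returns the correct loss of about 1000.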


Source: https://habr.com/ru/post/1011905/

