Considerations for Using ReLU as an Activation Function

I am implementing a neural network and wanted to use ReLU as the activation function for the neurons. I train the network with SGD and backpropagation, and I am testing it on the classic XOR problem. It correctly classifies new patterns if I use the logistic function or the hyperbolic tangent as the activation function.

I read about the benefits of using Leaky ReLU as an activation function and implemented it in Python as follows:

    def relu(data, epsilon=0.1):
        return np.maximum(epsilon * data, data)

where np is the NumPy module. The corresponding derivative is implemented as follows:

    def relu_prime(data, epsilon=0.1):
        if 1. * np.all(epsilon < data):
            return 1
        return epsilon
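
As a quick sanity check of my own (with NumPy imported as np), the forward pass itself behaves as expected on a sample array:

    import numpy as np

    # negatives are scaled by epsilon, positives pass through unchanged
    print(relu(np.array([-2.0, 0.0, 3.0])))  # -> -0.2, 0.0, 3.0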

Using this function as an activation, I get incorrect results. For instance:

  • Input = [0, 0] → Output = [0.43951457]

  • Input = [0, 1] → Output = [0.46252925]

  • Input = [1, 0] → Output = [0.34939594]

  • Input = [1, 1] → Output = [0.37241062]

You can see that the outputs are very different from the expected XOR values. So my question is: are there any particular considerations for using ReLU as an activation function?

Please do not hesitate to ask me for more context or code. Thanks in advance.

EDIT: There is an error in the derivative, as it returns only a single float instead of a NumPy array. The correct code should be:

    def relu_prime(data, epsilon=0.1):
        gradients = 1. * (data > epsilon)
        gradients[gradients == 0] = epsilon
        return gradients
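
For example (a quick check of my own, with NumPy imported as np), this version returns one gradient per input element rather than a single float:

    import numpy as np

    # one gradient per element, as backpropagation expects
    print(relu_prime(np.array([-2.0, 0.5, 3.0])))  # -> 0.1, 1.0, 1.0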
2 answers

Your relu_prime function should be:

    def relu_prime(data, epsilon=0.1):
        gradients = 1. * (data > 0)
        gradients[gradients == 0] = epsilon
        return gradients

Note the comparison of each value in the data matrix against 0 instead of epsilon. This follows the standard definition of leaky ReLU, which yields a piecewise gradient of 1 when x > 0 and epsilon otherwise.
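
For example (a quick check of my own, not part of the answer's code), a value that lies between 0 and epsilon only gets the correct gradient of 1 with the comparison against 0:

    import numpy as np

    # relu_prime as defined above, comparing against 0
    x = np.array([0.05])
    print(relu_prime(x))  # -> 1.0, since 0.05 > 0 means the unit is active
    # with the earlier comparison (data > epsilon), 0.05 > 0.1 is False,
    # so the gradient would wrongly come out as epsilon (0.1)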

I cannot comment on whether leaky ReLU is the best choice for the XOR problem, but this should solve your gradient problem.


Short answer

Do not use ReLU with binary values. It is designed to work with much larger values. Also avoid it when there are no negative values, because then it essentially acts as a linear activation function, which is not ideal. It is best suited for convolutional neural networks.

Long answer

I can't say whether anything is wrong with your Python code, since my own code is in Java. But logically, I think using ReLU here is a poor choice. Since we are predicting XOR, the values in your NN fall in the limited range [0, 1], which is also the range of the sigmoid activation function. With ReLU you work with values in [0, infinity), which means there is a huge set of values you will never use, since this is XOR. But ReLU still takes those values into account, so the error you get grows. That is why you get correct answers only about 50% of the time; in fact, this figure can be anywhere from 0% to over 99%. The moral of the story: when deciding which activation function to use, try to match the range of values in your NN with the range of values of the activation function.
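
To illustrate the range argument, here is a small sketch of my own (the sigmoid and leaky_relu helpers below are mine, not code from the question):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def leaky_relu(x, epsilon=0.1):
        return np.maximum(epsilon * x, x)

    z = np.linspace(-5.0, 5.0, 101)                   # a sample of pre-activation values
    print(sigmoid(z).min(), sigmoid(z).max())         # stays inside (0, 1), like the XOR targets
    print(leaky_relu(z).min(), leaky_relu(z).max())   # roughly [-0.5, 5]: not confined to [0, 1]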

