I am writing my thesis on using LSTM neural networks for time series. In my experiments, I found that scaling the data can have a big impact on the results. For example, when I use the tanh activation function and the values are scaled to the range between -1 and 1, the model seems to converge faster, and the validation error does not fluctuate as sharply after each epoch.
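For concreteness, here is a minimal sketch of the kind of scaling meant here, assuming plain min-max scaling to [-1, 1] fitted on the training split only. The helper names (`fit_minmax`, `scale_to_unit_range`, `tanh_grad`) and the toy series are illustrative assumptions, not the code or data from the actual experiment. It also prints the tanh derivative at raw and scaled values, since saturation of tanh on large inputs may be part of what is going on.

```python
import numpy as np

def fit_minmax(train):
    """Return (min, max) computed from the training data only."""
    return float(np.min(train)), float(np.max(train))

def scale_to_unit_range(x, lo, hi):
    """Map values linearly so that lo -> -1 and hi -> 1."""
    return 2.0 * (np.asarray(x, dtype=float) - lo) / (hi - lo) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

# Hypothetical toy series; the real data would be the time series under study.
series = np.array([120.0, 135.0, 150.0, 180.0, 210.0, 240.0])
train, val = series[:4], series[4:]

# Fit the scaling on the training split, then apply it to both splits.
lo, hi = fit_minmax(train)
train_scaled = scale_to_unit_range(train, lo, hi)
val_scaled = scale_to_unit_range(val, lo, hi)  # may fall outside [-1, 1]

print("scaled train:", train_scaled)
print("tanh gradient on raw train:   ", tanh_grad(train))
print("tanh gradient on scaled train:", tanh_grad(train_scaled))
```

On the raw values the tanh derivative is numerically zero, while on the scaled values it stays well away from zero. Note that validation values outside the training range still map outside [-1, 1] with this scheme.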
Does anyone know if there is a mathematical explanation for this? Or are there any papers that already explain this behavior?