by Michele Laurelli
A problem in deep networks where gradients become extremely small, preventing effective learning in early layers.
Vanishing gradients occur when backpropagation repeatedly multiplies derivatives smaller than 1, so gradients shrink exponentially as they flow toward earlier layers. The problem is common in deep RNNs and in networks with sigmoid activations, whose derivative never exceeds 0.25. Common remedies: ReLU activations, LSTM cells, ResNet-style skip connections.
Deep RNNs failing to learn long-range dependencies
Sigmoid activation in deep networks
Very deep networks trained before ResNet introduced skip connections
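The exponential shrinkage described above can be sketched numerically. This is a minimal illustration, not a training run: it multiplies the sigmoid derivative (at its maximum value, 0.25) through successive layers, as a best-case bound on the backpropagated gradient factor. The helper names and depth choices are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Multiply the per-layer derivative through n layers, assuming
# pre-activations near 0 (the most favorable case for sigmoid).
grad = 1.0
for depth in range(1, 31):
    grad *= sigmoid_derivative(0.0)
    if depth in (5, 10, 20, 30):
        print(f"depth {depth:2d}: gradient factor <= {grad:.3e}")
```

Even in this best case the factor falls below 1e-6 within 10 layers, which is why early layers in deep sigmoid networks barely update.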
An algorithm for training neural networks that computes gradients of the loss function with respect to the weights by applying the chain rule backward through the network.
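The chain-rule computation can be sketched by hand for a single sigmoid neuron with squared-error loss. This is a toy worked example; the variable names and input values are hypothetical, chosen only to make the backward pass explicit.

```python
import math

# Forward pass: y = sigmoid(w*x + b), loss L = (y - t)^2
w, b = 0.5, 0.1   # weight and bias (arbitrary values)
x, t = 2.0, 1.0   # input and target (arbitrary values)

z = w * x + b
y = 1.0 / (1.0 + math.exp(-z))
loss = (y - t) ** 2

# Backward pass: chain rule, one factor per forward step.
dL_dy = 2.0 * (y - t)       # d(y - t)^2 / dy
dy_dz = y * (1.0 - y)       # sigmoid derivative
dL_dw = dL_dy * dy_dz * x   # dz/dw = x
dL_db = dL_dy * dy_dz       # dz/db = 1
```

Autodiff frameworks automate exactly this bookkeeping for arbitrary computation graphs.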
An optimization algorithm that iteratively adjusts parameters to minimize a loss function by stepping in the direction opposite the gradient.
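A minimal sketch of the update rule on a one-dimensional function, f(x) = (x - 3)^2, whose minimum is at x = 3. The function, starting point, and learning rate are illustrative assumptions.

```python
# f(x) = (x - 3)^2 has derivative f'(x) = 2*(x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)

x = 0.0    # initial guess
lr = 0.1   # learning rate (step size)
for _ in range(100):
    x -= lr * grad_f(x)  # step opposite the gradient

print(round(x, 4))  # → 3.0
```

Each step shrinks the distance to the minimizer by a constant factor (here 0.8), so the iterates converge geometrically.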
An activation function that outputs the input if positive, otherwise zero: f(x) = max(0, x).
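The definition translates directly into one line of Python; this sketch just applies it to a few sample inputs.

```python
def relu(x):
    """ReLU: pass positive inputs through, zero out the rest."""
    return max(0.0, x)

print([relu(v) for v in (-2.0, -0.5, 0.0, 1.5)])  # → [0.0, 0.0, 0.0, 1.5]
```

Because the derivative is exactly 1 for positive inputs, ReLU avoids the repeated sub-1 multiplications that cause vanishing gradients with sigmoid.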