by Michele Laurelli
Adaptive learning rate optimizer that adapts the rate per parameter based on each parameter's gradient history.
It accumulates the sum of squared gradients for every parameter; the larger a parameter's accumulated value, the smaller its effective learning rate. This makes it well suited to sparse data, where rarely updated parameters keep larger steps, but the accumulator only grows, so the learning rate can shrink toward zero over long training runs.
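The accumulation rule described above matches the classic AdaGrad update. A minimal NumPy sketch (all names here are illustrative, not from the original text):

```python
import numpy as np

def adagrad_update(params, grads, accum, lr=0.01, eps=1e-8):
    """One AdaGrad-style step: add the squared gradient to a per-parameter
    accumulator, then scale each parameter's step by the inverse square
    root of that accumulator. eps avoids division by zero."""
    accum = accum + grads ** 2                       # per-parameter squared-gradient history
    params = params - lr * grads / (np.sqrt(accum) + eps)
    return params, accum

# Toy usage: minimize f(w) = sum((w - 3)^2) from w = 0
w = np.zeros(2)
acc = np.zeros(2)
for _ in range(500):
    g = 2.0 * (w - 3.0)                              # gradient of (w - 3)^2
    w, acc = adagrad_update(w, g, acc, lr=0.5)
```

Note how the denominator grows monotonically with each step, which is exactly why the effective learning rate can diminish over time.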
Best suited for:
- Sparse gradient optimization
- NLP embeddings
- Convex problems