Gradient Descent Optimization: From SGD to Adam
Optimization is at the heart of machine learning. The choice of optimizer can significantly impact training speed, convergence, and final model performance. In this post, we’ll explore the evolution of gradient descent algorithms and understand when to use each one.
The Optimization Problem
In machine learning, we want to minimize a loss function L(θ) with respect to parameters θ:
θ* = argmin_θ L(θ)
Gradient descent iteratively updates parameters in the direction of steepest descent:
θ_{t+1} = θ_t - α ∇L(θ_t)
Where α is the learning rate and ∇L(θ_t) is the gradient of the loss function.
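To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional quadratic loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3); the starting point, learning rate, and number of steps are arbitrary illustrations:

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = 0.0   # initial parameter (illustrative)
alpha = 0.1   # learning rate (illustrative)

for t in range(50):
    grad = 2 * (theta - 3)        # ∇L(θ_t)
    theta = theta - alpha * grad  # θ_{t+1} = θ_t - α ∇L(θ_t)

print(theta)  # converges toward 3, the minimizer
```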
1. Batch Gradient Descent
Vanilla (batch) gradient descent computes the gradient using the entire dataset:
```python
import numpy as np

def batch_gradient_descent(X, y, theta, learning_rate, epochs):
    m = len(X)
    costs = []

    for epoch in range(epochs):
        # Compute predictions
        predictions = X.dot(theta)

        # Compute cost
        cost = (1/(2*m)) * np.sum((predictions - y)**2)
        costs.append(cost)

        # Compute gradients
        gradients = (1/m) * X.T.dot(predictions - y)

        # Update parameters
        theta = theta - learning_rate * gradients

    return theta, costs
```
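As a quick usage sketch (not from the original post), here is the function applied to synthetic linear-regression data with a prepended bias column; the data-generating values are illustrative:

```python
import numpy as np

# Synthetic data: y ≈ 4 + 3x plus noise (illustrative values)
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(0, 0.5, size=100)

X = np.c_[np.ones_like(x), x]   # design matrix with a bias column
theta0 = np.zeros(2)

theta, costs = batch_gradient_descent(X, y, theta0, learning_rate=0.1, epochs=500)
print(theta)                # roughly [4, 3]
print(costs[0], costs[-1])  # the cost decreases monotonically
```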
Advantages:
- Converges to the global minimum for convex functions (given a suitably small learning rate)
- Stable gradient estimates
Disadvantages:
- Slow for large datasets
- Memory intensive
- Cannot handle online learning
2. Stochastic Gradient Descent (SGD)
SGD uses individual samples to compute gradients:
```python
def stochastic_gradient_descent(X, y, theta, learning_rate, epochs):
    m = len(X)
    costs = []

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        epoch_cost = 0
        for i in range(m):
            # Single sample
            xi = X_shuffled[i:i+1]
            yi = y_shuffled[i:i+1]

            # Compute prediction and cost
            prediction = xi.dot(theta)
            cost = 0.5 * (prediction - yi)**2
            epoch_cost += cost

            # Compute gradient
            gradient = xi.T.dot(prediction - yi)

            # Update parameters
            theta = theta - learning_rate * gradient

        costs.append(epoch_cost[0] / m)

    return theta, costs
```
Advantages:
- Fast updates
- Can escape local minima
- Suitable for online learning
Disadvantages:
- Noisy gradients
- May not converge to exact minimum
- Sensitive to learning rate
3. Mini-batch Gradient Descent
Mini-batch gradient descent combines the benefits of batch gradient descent and SGD:
```python
def mini_batch_gradient_descent(X, y, theta, learning_rate, epochs, batch_size=32):
    m = len(X)
    costs = []

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        epoch_cost = 0
        for i in range(0, m, batch_size):
            # Mini-batch
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Compute predictions and cost
            predictions = X_batch.dot(theta)
            cost = (1/(2*len(X_batch))) * np.sum((predictions - y_batch)**2)
            epoch_cost += cost * len(X_batch)

            # Compute gradients
            gradients = (1/len(X_batch)) * X_batch.T.dot(predictions - y_batch)

            # Update parameters
            theta = theta - learning_rate * gradients

        costs.append(epoch_cost / m)

    return theta, costs
```
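To see the three variants side by side, here is a brief sketch that reuses the same kind of synthetic data as the batch example above; all hyperparameters are illustrative:

```python
import numpy as np

# Same synthetic setup as the batch example (illustrative)
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(0, 0.5, size=100)
X = np.c_[np.ones_like(x), x]

theta_b, costs_b = batch_gradient_descent(X, y, np.zeros(2), learning_rate=0.1, epochs=100)
theta_s, costs_s = stochastic_gradient_descent(X, y, np.zeros(2), learning_rate=0.01, epochs=100)
theta_m, costs_m = mini_batch_gradient_descent(X, y, np.zeros(2), learning_rate=0.05, epochs=100,
                                               batch_size=16)

for name, costs in [("batch", costs_b), ("SGD", costs_s), ("mini-batch", costs_m)]:
    print(f"{name}: final epoch cost = {costs[-1]:.4f}")
```

In practice the batch cost curve is smooth, the SGD curve is noisy, and the mini-batch curve sits in between.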
4. Momentum
Momentum accumulates gradients from previous steps to accelerate convergence:
```python
class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None

    def update(self, theta, gradients):
        if self.velocity is None:
            self.velocity = np.zeros_like(theta)

        # Update velocity
        self.velocity = self.momentum * self.velocity - self.learning_rate * gradients

        # Update parameters
        theta = theta + self.velocity
        return theta
```
Mathematical Formula:
v_t = β * v_{t-1} + α * ∇L(θ_t)
θ_{t+1} = θ_t - v_t
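Note that the class above folds the learning rate into the velocity and flips its sign (v ← βv − αg, θ ← θ + v), which is algebraically the same update as the formula. A short usage sketch on the quadratic from earlier, with illustrative hyperparameters:

```python
import numpy as np

# MomentumOptimizer on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
opt = MomentumOptimizer(learning_rate=0.1, momentum=0.9)
theta = np.array([0.0])

for t in range(200):
    grad = 2 * (theta - 3)
    theta = opt.update(theta, grad)

print(theta)  # converges toward 3, typically with some overshoot along the way
```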
Advantages:
- Faster convergence
- Reduces oscillations
- Better at escaping local minima
Disadvantages:
- Additional hyperparameter (momentum coefficient)
- May overshoot minimum
5. RMSprop
RMSprop adapts the learning rate per parameter, dividing by a running average of recent squared gradients so that step sizes do not keep shrinking the way they do with an accumulated sum (as in AdaGrad, covered below):
```python
class RMSpropOptimizer:
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = None

    def update(self, theta, gradients):
        if self.cache is None:
            self.cache = np.zeros_like(theta)

        # Update cache
        self.cache = self.decay_rate * self.cache + (1 - self.decay_rate) * gradients**2

        # Update parameters
        theta = theta - self.learning_rate * gradients / (np.sqrt(self.cache) + self.epsilon)
        return theta
```
Mathematical Formula:
E[g²]_t = β * E[g²]_{t-1} + (1-β) * (∇L(θ_t))²
θ_{t+1} = θ_t - α * ∇L(θ_t) / (√(E[g²]_t) + ε)
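The practical effect is that a parameter with consistently large gradients takes proportionally smaller steps, so the effective step size stays around α regardless of gradient scale. A tiny numeric sketch with the class above (gradient values are illustrative):

```python
import numpy as np

opt = RMSpropOptimizer(learning_rate=0.01, decay_rate=0.9)
theta = np.zeros(2)

# Two parameters with very different gradient magnitudes
grads = np.array([100.0, 0.1])
theta = opt.update(theta, grads)

print(theta)  # both parameters move by about the same amount (~0.03) despite the 1000x gradient gap
```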
6. Adam (Adaptive Moment Estimation)
Adam combines the ideas of momentum and RMSprop: it keeps an exponential moving average of the gradients (first moment) and of the squared gradients (second moment), with bias correction for both:
```python
class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step

    def update(self, theta, gradients):
        self.t += 1

        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)

        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients

        # Update biased second moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients**2

        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1**self.t)

        # Compute bias-corrected second moment estimate
        v_hat = self.v / (1 - self.beta2**self.t)

        # Update parameters
        theta = theta - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return theta
```
Mathematical Formula:
m_t = β₁ * m_{t-1} + (1-β₁) * ∇L(θ_t)
v_t = β₂ * v_{t-1} + (1-β₂) * (∇L(θ_t))²
m̂_t = m_t / (1 - β₁^t)
v̂_t = v_t / (1 - β₂^t)
θ_{t+1} = θ_t - α * m̂_t / (√v̂_t + ε)
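The bias correction compensates for the zero initialization of m and v: at t = 1, m₁ = (1−β₁)∇L and v₁ = (1−β₂)(∇L)², and dividing by (1−β₁) and (1−β₂) recovers m̂₁ = ∇L and v̂₁ = (∇L)², so the very first step has magnitude close to α for every parameter regardless of gradient scale. A quick check with the class above (gradient values are illustrative):

```python
import numpy as np

opt = AdamOptimizer(learning_rate=0.001)
theta = np.zeros(3)

grads = np.array([10.0, -0.5, 0.01])   # gradients of very different scales
theta = opt.update(theta, grads)

print(theta)  # each parameter moves by roughly ±0.001 (the learning rate) on the first step
```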
Comparing Optimizers
Let’s compare these optimizers on a simple 2D optimization problem:
```python
import numpy as np
import matplotlib.pyplot as plt

class SGDOptimizer:
    """Plain gradient-descent step, wrapped in the same update() interface as the classes above."""
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, theta, gradients):
        return theta - self.learning_rate * gradients

def rosenbrock(x, y):
    """The Rosenbrock function - a classic optimization test case"""
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_gradient(x, y):
    """Gradient of the Rosenbrock function"""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

def optimize_rosenbrock(optimizer, steps=1000, start_point=[-1.5, 2.0]):
    """Optimize the Rosenbrock function using the given optimizer"""
    theta = np.array(start_point)
    trajectory = [theta.copy()]

    for _ in range(steps):
        grad = rosenbrock_gradient(theta[0], theta[1])
        theta = optimizer.update(theta, grad)
        trajectory.append(theta.copy())

    return np.array(trajectory)

# Test different optimizers
optimizers = {
    'SGD': SGDOptimizer(learning_rate=0.001),
    'Momentum': MomentumOptimizer(learning_rate=0.001, momentum=0.9),
    'RMSprop': RMSpropOptimizer(learning_rate=0.01),
    'Adam': AdamOptimizer(learning_rate=0.01)
}

trajectories = {}
for name, optimizer in optimizers.items():
    trajectories[name] = optimize_rosenbrock(optimizer)

# Visualize results
fig, ax = plt.subplots(figsize=(12, 8))

# Plot contour
x = np.linspace(-2, 2, 100)
y = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

contour = ax.contour(X, Y, Z, levels=np.logspace(0, 3, 20), colors='lightgray')
ax.clabel(contour, inline=True, fontsize=8)

# Plot trajectories
colors = ['red', 'blue', 'green', 'orange']
for i, (name, trajectory) in enumerate(trajectories.items()):
    ax.plot(trajectory[:, 0], trajectory[:, 1],
            color=colors[i], label=name, linewidth=2, alpha=0.7)
    ax.scatter(trajectory[0, 0], trajectory[0, 1],
               color=colors[i], s=100, marker='o')
    ax.scatter(trajectory[-1, 0], trajectory[-1, 1],
               color=colors[i], s=100, marker='x')

# Mark global minimum
ax.scatter(1, 1, color='red', s=200, marker='*', label='Global Minimum')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimizer Comparison on Rosenbrock Function')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
When to Use Each Optimizer
SGD
- Use when: Simple or convex problems, when simplicity and interpretability matter
- Avoid when: Ill-conditioned or complex loss landscapes where adaptive methods converge much faster
Momentum
- Use when: Loss function has many local minima, oscillatory behavior
- Avoid when: Very noisy gradients
RMSprop
- Use when: Non-stationary objectives, recurrent neural networks, noisy online settings
- Avoid when: Simple convex problems where plain SGD with a well-tuned learning rate is sufficient
Adam
- Use when: General-purpose optimization, deep learning
- Avoid when: Simple convex problems, when you need theoretical guarantees
Advanced Variations
AdaGrad
```python
class AdaGradOptimizer:
    def __init__(self, learning_rate=0.01, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.G = None

    def update(self, theta, gradients):
        if self.G is None:
            self.G = np.zeros_like(theta)

        # Accumulate the sum of squared gradients
        self.G += gradients**2

        theta = theta - self.learning_rate * gradients / (np.sqrt(self.G) + self.epsilon)
        return theta
```
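Unlike RMSprop's decaying average, G only ever grows, so AdaGrad's effective learning rate shrinks monotonically. A brief sketch with a constant gradient (values are illustrative):

```python
import numpy as np

opt = AdaGradOptimizer(learning_rate=0.1)
theta = np.zeros(1)

for t in range(1, 6):
    prev = theta.copy()
    theta = opt.update(theta, np.array([1.0]))   # constant gradient of 1
    print(t, (theta - prev)[0])  # step sizes shrink: -0.1, -0.071, -0.058, -0.05, -0.045
```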
AdamW (Adam with Weight Decay)
```python
class AdamWOptimizer(AdamOptimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, weight_decay=0.01):
        super().__init__(learning_rate, beta1, beta2, epsilon)
        self.weight_decay = weight_decay

    def update(self, theta, gradients):
        # Apply weight decay
        theta = theta * (1 - self.learning_rate * self.weight_decay)

        # Regular Adam update
        return super().update(theta, gradients)
```
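The decay here is decoupled: it is applied directly to the weights before the Adam step rather than added to the gradient, so it is not rescaled by the adaptive denominator. A short usage sketch with illustrative values:

```python
import numpy as np

opt = AdamWOptimizer(learning_rate=0.001, weight_decay=0.01)
theta = np.array([1.0, -2.0, 0.5])

grads = np.array([0.1, -0.3, 0.2])   # gradients of the data loss only (illustrative)
theta = opt.update(theta, grads)     # weights are shrunk slightly, then Adam-updated

print(theta)
```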
Practical Tips
- Start with Adam: It’s generally a good default choice
- Tune learning rate: Most important hyperparameter
- Use learning rate scheduling: Reduce the learning rate during training (see the sketch after this list)
- Monitor convergence: Plot loss curves to understand behavior
- Consider the problem: Different optimizers work better for different problems
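On the scheduling tip above, here is a minimal sketch of two common schedules, step decay and exponential decay, which can be used with any of the optimizer classes in this post by updating optimizer.learning_rate at the start of each epoch; the decay constants are illustrative:

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    """Smoothly decay the learning rate as lr0 * exp(-k * epoch)."""
    return lr0 * np.exp(-k * epoch)

for epoch in [0, 10, 20, 30]:
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))
# e.g. set optimizer.learning_rate = step_decay(0.1, epoch) at the start of each epoch
```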
Conclusion
The evolution from SGD to Adam represents decades of optimization research. Each algorithm addresses specific challenges:
- SGD: Simple but effective for convex problems
- Momentum: Accelerates convergence and reduces oscillations
- RMSprop: Adapts learning rate per parameter
- Adam: Combines momentum and adaptive learning rates
Understanding these algorithms helps you make informed choices for your machine learning problems: know the trade-offs, and experiment with different optimizers for your specific use case.
For interactive experiments and detailed implementations, check out the GitHub repository. Try different optimizers on various loss functions to gain intuition about their behavior!