Gradient Descent Optimization: From SGD to Adam
Optimization is at the heart of machine learning. The choice of optimizer can significantly impact training speed, convergence, and final model performance. In this post, we’ll explore the evolution of gradient descent algorithms and understand when to use each one.
The Optimization Problem
In machine learning, we want to minimize a loss function L(θ) with respect to parameters θ:
θ* = argmin_θ L(θ)
Gradient descent iteratively updates parameters in the direction of steepest descent:
θ_{t+1} = θ_t - α ∇L(θ_t)
Where α is the learning rate and ∇L(θ_t) is the gradient of the loss function.
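To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional quadratic loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3); the starting point, learning rate, and number of steps are arbitrary illustrations:

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = 0.0   # initial parameter (illustrative)
alpha = 0.1   # learning rate (illustrative)

for t in range(50):
    grad = 2 * (theta - 3)        # ∇L(θ_t)
    theta = theta - alpha * grad  # θ_{t+1} = θ_t - α ∇L(θ_t)

print(theta)  # converges toward 3, the minimizer
```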
1. Batch Gradient Descent
Vanilla (batch) gradient descent computes the gradient using the entire dataset:
```python
import numpy as np

def batch_gradient_descent(X, y, theta, learning_rate, epochs):
    m = len(X)
    costs = []

    for epoch in range(epochs):
        # Compute predictions
        predictions = X.dot(theta)

        # Compute cost
        cost = (1/(2*m)) * np.sum((predictions - y)**2)
        costs.append(cost)

        # Compute gradients
        gradients = (1/m) * X.T.dot(predictions - y)

        # Update parameters
        theta = theta - learning_rate * gradients

    return theta, costs
```
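As a quick usage sketch (not from the original post), here is the function applied to synthetic linear-regression data with a prepended bias column; the data-generating values are illustrative:

```python
import numpy as np

# Synthetic data: y ≈ 4 + 3x plus noise (illustrative values)
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(0, 0.5, size=100)

X = np.c_[np.ones_like(x), x]   # design matrix with a bias column
theta0 = np.zeros(2)

theta, costs = batch_gradient_descent(X, y, theta0, learning_rate=0.1, epochs=500)
print(theta)                # roughly [4, 3]
print(costs[0], costs[-1])  # the cost decreases monotonically
```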
Advantages:
- Converges to the global minimum for convex functions (given a suitably small learning rate)
- Stable gradient estimates
Disadvantages:
- Slow for large datasets
- Memory intensive
- Cannot handle online learning
2. Stochastic Gradient Descent (SGD)
SGD uses individual samples to compute gradients:
```python
def stochastic_gradient_descent(X, y, theta, learning_rate, epochs):
    m = len(X)
    costs = []

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        epoch_cost = 0
        for i in range(m):
            # Single sample
            xi = X_shuffled[i:i+1]
            yi = y_shuffled[i:i+1]

            # Compute prediction and cost
            prediction = xi.dot(theta)
            cost = 0.5 * (prediction - yi)**2
            epoch_cost += cost

            # Compute gradient
            gradient = xi.T.dot(prediction - yi)

            # Update parameters
            theta = theta - learning_rate * gradient

        costs.append(epoch_cost[0] / m)

    return theta, costs
```
Advantages:
- Fast updates
- Can escape local minima
- Suitable for online learning
Disadvantages:
- Noisy gradients
- May not converge to exact minimum
- Sensitive to learning rate
3. Mini-batch Gradient Descent
Mini-batch gradient descent combines the benefits of batch gradient descent and SGD:
```python
def mini_batch_gradient_descent(X, y, theta, learning_rate, epochs, batch_size=32):
    m = len(X)
    costs = []

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        epoch_cost = 0
        for i in range(0, m, batch_size):
            # Mini-batch
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Compute predictions and cost
            predictions = X_batch.dot(theta)
            cost = (1/(2*len(X_batch))) * np.sum((predictions - y_batch)**2)
            epoch_cost += cost * len(X_batch)

            # Compute gradients
            gradients = (1/len(X_batch)) * X_batch.T.dot(predictions - y_batch)

            # Update parameters
            theta = theta - learning_rate * gradients

        costs.append(epoch_cost / m)

    return theta, costs
```
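To see the three variants side by side, here is a brief sketch that reuses the same kind of synthetic data as the batch example above; all hyperparameters are illustrative:

```python
import numpy as np

# Same synthetic setup as the batch example (illustrative)
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(0, 0.5, size=100)
X = np.c_[np.ones_like(x), x]

theta_b, costs_b = batch_gradient_descent(X, y, np.zeros(2), learning_rate=0.1, epochs=100)
theta_s, costs_s = stochastic_gradient_descent(X, y, np.zeros(2), learning_rate=0.01, epochs=100)
theta_m, costs_m = mini_batch_gradient_descent(X, y, np.zeros(2), learning_rate=0.05, epochs=100,
                                               batch_size=16)

for name, costs in [("batch", costs_b), ("SGD", costs_s), ("mini-batch", costs_m)]:
    print(f"{name}: final epoch cost = {costs[-1]:.4f}")
```

In practice the batch cost curve is smooth, the SGD curve is noisy, and the mini-batch curve sits in between.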
4. Momentum
Momentum accumulates gradients from previous steps to accelerate convergence:
```python
class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None

    def update(self, theta, gradients):
        if self.velocity is None:
            self.velocity = np.zeros_like(theta)

        # Update velocity
        self.velocity = self.momentum * self.velocity - self.learning_rate * gradients

        # Update parameters
        theta = theta + self.velocity
        return theta
```
Mathematical Formula:
v_t = β * v_{t-1} + α * ∇L(θ_t)
θ_{t+1} = θ_t - v_t
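Note that the class above folds the learning rate into the velocity and flips its sign (v ← βv − αg, θ ← θ + v), which is algebraically the same update as the formula. A short usage sketch on the quadratic from earlier, with illustrative hyperparameters:

```python
import numpy as np

# MomentumOptimizer on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
opt = MomentumOptimizer(learning_rate=0.1, momentum=0.9)
theta = np.array([0.0])

for t in range(200):
    grad = 2 * (theta - 3)
    theta = opt.update(theta, grad)

print(theta)  # converges toward 3, typically with some overshoot along the way
```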
Advantages:
- Faster convergence
- Reduces oscillations
- Better at escaping local minima
Disadvantages:
- Additional hyperparameter (momentum coefficient)
- May overshoot minimum
5. RMSprop
RMSprop adapts the learning rate per parameter, dividing by a running average of recent squared gradients so that step sizes do not keep shrinking the way they do with an accumulated sum (as in AdaGrad, covered below):
```python
class RMSpropOptimizer:
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = None

    def update(self, theta, gradients):
        if self.cache is None:
            self.cache = np.zeros_like(theta)

        # Update cache
        self.cache = self.decay_rate * self.cache + (1 - self.decay_rate) * gradients**2

        # Update parameters
        theta = theta - self.learning_rate * gradients / (np.sqrt(self.cache) + self.epsilon)
        return theta
```
Mathematical Formula:
E[g²]_t = β * E[g²]_{t-1} + (1-β) * (∇L(θ_t))²
θ_{t+1} = θ_t - α * ∇L(θ_t) / (√(E[g²]_t) + ε)
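The practical effect is that a parameter with consistently large gradients takes proportionally smaller steps, so the effective step size stays around α regardless of gradient scale. A tiny numeric sketch with the class above (gradient values are illustrative):

```python
import numpy as np

opt = RMSpropOptimizer(learning_rate=0.01, decay_rate=0.9)
theta = np.zeros(2)

# Two parameters with very different gradient magnitudes
grads = np.array([100.0, 0.1])
theta = opt.update(theta, grads)

print(theta)  # both parameters move by about the same amount (~0.03) despite the 1000x gradient gap
```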
6. Adam (Adaptive Moment Estimation)
Adam combines the ideas of momentum and RMSprop: it keeps an exponential moving average of the gradients (first moment) and of the squared gradients (second moment), with bias correction for both:
```python
class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step

    def update(self, theta, gradients):
        self.t += 1

        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)

        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients

        # Update biased second moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients**2

        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1**self.t)

        # Compute bias-corrected second moment estimate
        v_hat = self.v / (1 - self.beta2**self.t)

        # Update parameters
        theta = theta - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return theta
```
Mathematical Formula:
m_t = β₁ * m_{t-1} + (1-β₁) * ∇L(θ_t)
v_t = β₂ * v_{t-1} + (1-β₂) * (∇L(θ_t))²
m̂_t = m_t / (1 - β₁^t)
v̂_t = v_t / (1 - β₂^t)
θ_{t+1} = θ_t - α * m̂_t / (√v̂_t + ε)
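The bias correction compensates for the zero initialization of m and v: at t = 1, m₁ = (1−β₁)∇L and v₁ = (1−β₂)(∇L)², and dividing by (1−β₁) and (1−β₂) recovers m̂₁ = ∇L and v̂₁ = (∇L)², so the very first step has magnitude close to α for every parameter regardless of gradient scale. A quick check with the class above (gradient values are illustrative):

```python
import numpy as np

opt = AdamOptimizer(learning_rate=0.001)
theta = np.zeros(3)

grads = np.array([10.0, -0.5, 0.01])   # gradients of very different scales
theta = opt.update(theta, grads)

print(theta)  # each parameter moves by roughly ±0.001 (the learning rate) on the first step
```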
Comparing Optimizers
Let’s compare these optimizers on a simple 2D optimization problem:
```python
import numpy as np
import matplotlib.pyplot as plt

class SGDOptimizer:
    """Plain gradient-descent step, wrapped in the same update() interface as the classes above."""
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, theta, gradients):
        return theta - self.learning_rate * gradients

def rosenbrock(x, y):
    """The Rosenbrock function - a classic optimization test case"""
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_gradient(x, y):
    """Gradient of the Rosenbrock function"""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

def optimize_rosenbrock(optimizer, steps=1000, start_point=[-1.5, 2.0]):
    """Optimize the Rosenbrock function using the given optimizer"""
    theta = np.array(start_point)
    trajectory = [theta.copy()]

    for _ in range(steps):
        grad = rosenbrock_gradient(theta[0], theta[1])
        theta = optimizer.update(theta, grad)
        trajectory.append(theta.copy())

    return np.array(trajectory)

# Test different optimizers
optimizers = {
    'SGD': SGDOptimizer(learning_rate=0.001),
    'Momentum': MomentumOptimizer(learning_rate=0.001, momentum=0.9),
    'RMSprop': RMSpropOptimizer(learning_rate=0.01),
    'Adam': AdamOptimizer(learning_rate=0.01)
}

trajectories = {}
for name, optimizer in optimizers.items():
    trajectories[name] = optimize_rosenbrock(optimizer)

# Visualize results
fig, ax = plt.subplots(figsize=(12, 8))

# Plot contour
x = np.linspace(-2, 2, 100)
y = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

contour = ax.contour(X, Y, Z, levels=np.logspace(0, 3, 20), colors='lightgray')
ax.clabel(contour, inline=True, fontsize=8)

# Plot trajectories
colors = ['red', 'blue', 'green', 'orange']
for i, (name, trajectory) in enumerate(trajectories.items()):
    ax.plot(trajectory[:, 0], trajectory[:, 1],
            color=colors[i], label=name, linewidth=2, alpha=0.7)
    ax.scatter(trajectory[0, 0], trajectory[0, 1],
               color=colors[i], s=100, marker='o')
    ax.scatter(trajectory[-1, 0], trajectory[-1, 1],
               color=colors[i], s=100, marker='x')

# Mark global minimum
ax.scatter(1, 1, color='red', s=200, marker='*', label='Global Minimum')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimizer Comparison on Rosenbrock Function')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
When to Use Each Optimizer
SGD
- Use when: Simple or convex problems, when simplicity and interpretability matter
- Avoid when: Ill-conditioned or complex loss landscapes where adaptive methods converge much faster
Momentum
- Use when: Loss function has many local minima, oscillatory behavior
- Avoid when: Very noisy gradients
RMSprop
- Use when: Non-stationary objectives, recurrent neural networks, noisy online settings
- Avoid when: Simple convex problems where plain SGD with a well-tuned learning rate is sufficient
Adam
- Use when: General-purpose optimization, deep learning
- Avoid when: Simple convex problems, when you need theoretical guarantees
Advanced Variations
AdaGrad
```python
class AdaGradOptimizer:
    def __init__(self, learning_rate=0.01, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.G = None

    def update(self, theta, gradients):
        if self.G is None:
            self.G = np.zeros_like(theta)

        # Accumulate the sum of squared gradients
        self.G += gradients**2

        theta = theta - self.learning_rate * gradients / (np.sqrt(self.G) + self.epsilon)
        return theta
```
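Unlike RMSprop's decaying average, G only ever grows, so AdaGrad's effective learning rate shrinks monotonically. A brief sketch with a constant gradient (values are illustrative):

```python
import numpy as np

opt = AdaGradOptimizer(learning_rate=0.1)
theta = np.zeros(1)

for t in range(1, 6):
    prev = theta.copy()
    theta = opt.update(theta, np.array([1.0]))   # constant gradient of 1
    print(t, (theta - prev)[0])  # step sizes shrink: -0.1, -0.071, -0.058, -0.05, -0.045
```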
AdamW (Adam with Weight Decay)
```python
class AdamWOptimizer(AdamOptimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, weight_decay=0.01):
        super().__init__(learning_rate, beta1, beta2, epsilon)
        self.weight_decay = weight_decay

    def update(self, theta, gradients):
        # Apply weight decay
        theta = theta * (1 - self.learning_rate * self.weight_decay)

        # Regular Adam update
        return super().update(theta, gradients)
```
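The decay here is decoupled: it is applied directly to the weights before the Adam step rather than added to the gradient, so it is not rescaled by the adaptive denominator. A short usage sketch with illustrative values:

```python
import numpy as np

opt = AdamWOptimizer(learning_rate=0.001, weight_decay=0.01)
theta = np.array([1.0, -2.0, 0.5])

grads = np.array([0.1, -0.3, 0.2])   # gradients of the data loss only (illustrative)
theta = opt.update(theta, grads)     # weights are shrunk slightly, then Adam-updated

print(theta)
```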
Practical Tips
- Start with Adam: It’s generally a good default choice
- Tune learning rate: Most important hyperparameter
- Use learning rate scheduling: Reduce the learning rate during training (see the sketch after this list)
- Monitor convergence: Plot loss curves to understand behavior
- Consider the problem: Different optimizers work better for different problems
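On the scheduling tip above, here is a minimal sketch of two common schedules, step decay and exponential decay, which can be used with any of the optimizer classes in this post by updating optimizer.learning_rate at the start of each epoch; the decay constants are illustrative:

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    """Smoothly decay the learning rate as lr0 * exp(-k * epoch)."""
    return lr0 * np.exp(-k * epoch)

for epoch in [0, 10, 20, 30]:
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))
# e.g. set optimizer.learning_rate = step_decay(0.1, epoch) at the start of each epoch
```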
Conclusion
The evolution from SGD to Adam represents decades of optimization research. Each algorithm addresses specific challenges:
- SGD: Simple but effective for convex problems
- Momentum: Accelerates convergence and reduces oscillations
- RMSprop: Adapts learning rate per parameter
- Adam: Combines momentum and adaptive learning rates
Understanding these algorithms helps you make informed choices for your machine learning problems: know the trade-offs, and experiment with different optimizers for your specific use case.
For interactive experiments and detailed implementations, check out the GitHub repository. Try different optimizers on various loss functions to gain intuition about their behavior!