
Gradient Descent Optimization: From SGD to Adam

Optimization is at the heart of machine learning. The choice of optimizer can significantly impact training speed, convergence, and final model performance. In this post, we’ll explore the evolution of gradient descent algorithms and understand when to use each one.

The Optimization Problem

In machine learning, we want to minimize a loss function L(θ) with respect to parameters θ:

θ* = argmin_θ L(θ)

Gradient descent iteratively updates parameters in the direction of steepest descent:

θ_{t+1} = θ_t - α ∇L(θ_t)

where α is the learning rate and ∇L(θ_t) is the gradient of the loss function at the current parameters.
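
As a quick illustration (a toy example, not part of the training code below), here are a few iterations of this rule on the one-dimensional loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3):

theta = 0.0      # initial parameter value
alpha = 0.1      # learning rate
for step in range(5):
    grad = 2 * (theta - 3)          # ∇L(θ) for L(θ) = (θ - 3)²
    theta = theta - alpha * grad
    print(step, round(theta, 4))    # θ moves steadily toward the minimum at θ = 3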

1. Batch Gradient Descent

Vanilla (batch) gradient descent computes the gradient over the entire dataset on every update:

import numpy as np

def batch_gradient_descent(X, y, theta, learning_rate, epochs):
    m = len(X)
    costs = []
    
    for epoch in range(epochs):
        # Compute predictions
        predictions = X.dot(theta)
        
        # Compute cost
        cost = (1/(2*m)) * np.sum((predictions - y)**2)
        costs.append(cost)
        
        # Compute gradients
        gradients = (1/m) * X.T.dot(predictions - y)
        
        # Update parameters
        theta = theta - learning_rate * gradients
    
    return theta, costs
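
A minimal usage sketch on synthetic linear-regression data (the data, seed, and hyperparameters here are illustrative assumptions, not part of the original example):

import numpy as np

np.random.seed(0)
m = 100
X = np.c_[np.ones(m), np.random.rand(m)]                     # bias column + one feature
y = X.dot(np.array([4.0, 3.0])) + 0.1 * np.random.randn(m)   # y ≈ 4 + 3x plus noise

theta, costs = batch_gradient_descent(X, y, np.zeros(2), learning_rate=0.5, epochs=500)
print(theta)   # should end up close to [4, 3]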

Advantages:

  • Converges to the global minimum for convex functions (given a suitably small learning rate)
  • Stable gradient estimates

Disadvantages:

  • Slow for large datasets
  • Memory intensive
  • Cannot handle online learning

2. Stochastic Gradient Descent (SGD)

SGD updates the parameters using the gradient of a single, randomly chosen sample at a time:

def stochastic_gradient_descent(X, y, theta, learning_rate, epochs):
    m = len(X)
    costs = []
    
    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        epoch_cost = 0
        
        for i in range(m):
            # Single sample
            xi = X_shuffled[i:i+1]
            yi = y_shuffled[i:i+1]
            
            # Compute prediction and cost
            prediction = xi.dot(theta)
            cost = 0.5 * ((prediction - yi) ** 2).item()
            epoch_cost += cost
            
            # Compute gradient
            gradient = xi.T.dot(prediction - yi)
            
            # Update parameters
            theta = theta - learning_rate * gradient
        
        costs.append(epoch_cost / m)
    
    return theta, costs

Advantages:

  • Fast updates
  • Can escape local minima
  • Suitable for online learning

Disadvantages:

  • Noisy gradients
  • May not converge to exact minimum
  • Sensitive to learning rate

3. Mini-batch Gradient Descent

Mini-batch gradient descent combines the benefits of both: gradients are averaged over small batches (commonly 32–512 samples), giving less noisy updates than pure SGD while staying far cheaper per step than full-batch descent:

def mini_batch_gradient_descent(X, y, theta, learning_rate, epochs, batch_size=32):
    m = len(X)
    costs = []
    
    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        epoch_cost = 0
        
        for i in range(0, m, batch_size):
            # Mini-batch
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            
            # Compute predictions and cost
            predictions = X_batch.dot(theta)
            cost = (1/(2*len(X_batch))) * np.sum((predictions - y_batch)**2)
            epoch_cost += cost * len(X_batch)
            
            # Compute gradients
            gradients = (1/len(X_batch)) * X_batch.T.dot(predictions - y_batch)
            
            # Update parameters
            theta = theta - learning_rate * gradients
        
        costs.append(epoch_cost / m)
    
    return theta, costs

4. Momentum

Momentum accumulates gradients from previous steps to accelerate convergence:

class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None
    
    def update(self, theta, gradients):
        if self.velocity is None:
            self.velocity = np.zeros_like(theta)
        
        # Update velocity
        self.velocity = self.momentum * self.velocity - self.learning_rate * gradients
        
        # Update parameters
        theta = theta + self.velocity
        
        return theta
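
The optimizer classes in the rest of this post all expose the same update(theta, gradients) interface, so they can be dropped into the mini-batch loop above. A minimal sketch (X and y are assumed to be the same linear-regression data as in the earlier examples):

import numpy as np

optimizer = MomentumOptimizer(learning_rate=0.01, momentum=0.9)
theta = np.zeros(X.shape[1])

for epoch in range(100):
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), 32):
        batch = indices[start:start + 32]
        # Mean-squared-error gradient for this mini-batch
        gradients = X[batch].T.dot(X[batch].dot(theta) - y[batch]) / len(batch)
        theta = optimizer.update(theta, gradients)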

Mathematical Formula:

v_t = β * v_{t-1} + α * ∇L(θ_t)
θ_{t+1} = θ_t - v_t

Advantages:

  • Faster convergence
  • Reduces oscillations
  • Better at escaping local minima

Disadvantages:

  • Additional hyperparameter (momentum coefficient)
  • May overshoot minimum

5. RMSprop

RMSprop adapts the learning rate for each parameter by dividing by an exponentially decaying average of squared gradients, which keeps step sizes from shrinking toward zero (the diminishing-learning-rate problem of AdaGrad, covered below):

class RMSpropOptimizer:
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = None
    
    def update(self, theta, gradients):
        if self.cache is None:
            self.cache = np.zeros_like(theta)
        
        # Update cache
        self.cache = self.decay_rate * self.cache + (1 - self.decay_rate) * gradients**2
        
        # Update parameters
        theta = theta - self.learning_rate * gradients / (np.sqrt(self.cache) + self.epsilon)
        
        return theta

Mathematical Formula:

E[g²]_t = β * E[g²]_{t-1} + (1-β) * (∇L(θ_t))²
θ_{t+1} = θ_t - α * ∇L(θ_t) / (√(E[g²]_t) + ε)

6. Adam (Adaptive Moment Estimation)

Adam combines momentum (a first-moment estimate of the gradient) with RMSprop-style per-parameter scaling (a second-moment estimate), plus a bias correction for both:

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step
    
    def update(self, theta, gradients):
        self.t += 1
        
        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)
        
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        
        # Update biased second moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients**2
        
        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1**self.t)
        
        # Compute bias-corrected second moment estimate
        v_hat = self.v / (1 - self.beta2**self.t)
        
        # Update parameters
        theta = theta - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        
        return theta

Mathematical Formula:

m_t = β₁ * m_{t-1} + (1-β₁) * ∇L(θ_t)
v_t = β₂ * v_{t-1} + (1-β₂) * (∇L(θ_t))²
m̂_t = m_t / (1 - β₁^t)
v̂_t = v_t / (1 - β₂^t)
θ_{t+1} = θ_t - α * m̂_t / (√v̂_t + ε)
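
The bias correction matters most in the first few steps, when m and v are still close to their zero initialization. At t = 1 the raw first moment is m₁ = (1 - β₁)·g₁, which understates the gradient by a factor of 10 for β₁ = 0.9; dividing by (1 - β₁¹) recovers g₁ exactly. A quick numeric check (toy values, purely to illustrate the formula):

beta1, g1 = 0.9, 2.0
m1 = (1 - beta1) * g1           # raw first moment after one step: 0.2
m1_hat = m1 / (1 - beta1 ** 1)  # bias-corrected estimate: 2.0, equal to the gradient
print(m1, m1_hat)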

Comparing Optimizers

Let’s compare these optimizers on the Rosenbrock function, a classic 2D test problem whose long, curved valley makes it a useful stress test.
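
The comparison below also uses a plain SGD wrapper with the same update(theta, gradients) interface as the optimizer classes above; a minimal sketch:

class SGDOptimizer:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, theta, gradients):
        # Plain gradient step with no internal state
        return theta - self.learning_rate * gradients

With that in place, the full comparison: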

import numpy as np
import matplotlib.pyplot as plt

def rosenbrock(x, y):
    """The Rosenbrock function - a classic optimization test case"""
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_gradient(x, y):
    """Gradient of the Rosenbrock function"""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

def optimize_rosenbrock(optimizer, steps=1000, start_point=[-1.5, 2.0]):
    """Optimize the Rosenbrock function using given optimizer"""
    theta = np.array(start_point)
    trajectory = [theta.copy()]
    
    for _ in range(steps):
        grad = rosenbrock_gradient(theta[0], theta[1])
        theta = optimizer.update(theta, grad)
        trajectory.append(theta.copy())
    
    return np.array(trajectory)

# Test different optimizers
optimizers = {
    'SGD': SGDOptimizer(learning_rate=0.001),
    'Momentum': MomentumOptimizer(learning_rate=0.001, momentum=0.9),
    'RMSprop': RMSpropOptimizer(learning_rate=0.01),
    'Adam': AdamOptimizer(learning_rate=0.01)
}

trajectories = {}
for name, optimizer in optimizers.items():
    trajectories[name] = optimize_rosenbrock(optimizer)

# Visualize results
fig, ax = plt.subplots(figsize=(12, 8))

# Plot contour
x = np.linspace(-2, 2, 100)
y = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

contour = ax.contour(X, Y, Z, levels=np.logspace(0, 3, 20), colors='lightgray')
ax.clabel(contour, inline=True, fontsize=8)

# Plot trajectories
colors = ['red', 'blue', 'green', 'orange']
for i, (name, trajectory) in enumerate(trajectories.items()):
    ax.plot(trajectory[:, 0], trajectory[:, 1], 
            color=colors[i], label=name, linewidth=2, alpha=0.7)
    ax.scatter(trajectory[0, 0], trajectory[0, 1], 
               color=colors[i], s=100, marker='o')
    ax.scatter(trajectory[-1, 0], trajectory[-1, 1], 
               color=colors[i], s=100, marker='x')

# Mark global minimum
ax.scatter(1, 1, color='red', s=200, marker='*', label='Global Minimum')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimizer Comparison on Rosenbrock Function')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

When to Use Each Optimizer

SGD

  • Use when: Simple problems, interpretability is important, or you want a well-understood baseline
  • Avoid when: Gradients are poorly scaled or the loss landscape is ill-conditioned (adaptive methods handle these better)

Momentum

  • Use when: Loss function has many local minima, oscillatory behavior
  • Avoid when: Very noisy gradients

RMSprop

  • Use when: Non-stationary objectives, recurrent neural networks
  • Avoid when: You also want momentum-style smoothing of the update direction (Adam adds this)

Adam

  • Use when: General-purpose optimization, deep learning
  • Avoid when: Simple convex problems, when you need theoretical guarantees

Advanced Variations

AdaGrad

AdaGrad divides the learning rate by the square root of the accumulated sum of squared gradients. Because that sum only grows, the effective step size steadily shrinks; this is the diminishing-learning-rate problem that RMSprop’s decaying average avoids:

class AdaGradOptimizer:
    def __init__(self, learning_rate=0.01, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.G = None
    
    def update(self, theta, gradients):
        if self.G is None:
            self.G = np.zeros_like(theta)
        
        self.G += gradients**2
        theta = theta - self.learning_rate * gradients / (np.sqrt(self.G) + self.epsilon)
        
        return theta
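
Because G only accumulates, the effective step size learning_rate / (√G + ε) can only shrink. A quick illustration with a constant toy gradient (the values are for demonstration only):

import numpy as np

opt = AdaGradOptimizer(learning_rate=0.1)
theta = np.array([0.0])
for step in range(1, 6):
    theta = opt.update(theta, np.array([1.0]))            # constant gradient of 1
    print(step, 0.1 / (np.sqrt(opt.G[0]) + opt.epsilon))  # 0.1, 0.0707, 0.0577, 0.05, 0.0447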

AdamW (Adam with Weight Decay)

AdamW decouples weight decay from the gradient-based update: the parameters are shrunk directly at each step rather than by adding an L2 penalty to the loss, so the decay is not rescaled by Adam’s adaptive denominator:

class AdamWOptimizer(AdamOptimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, 
                 epsilon=1e-8, weight_decay=0.01):
        super().__init__(learning_rate, beta1, beta2, epsilon)
        self.weight_decay = weight_decay
    
    def update(self, theta, gradients):
        # Apply weight decay
        theta = theta * (1 - self.learning_rate * self.weight_decay)
        
        # Regular Adam update
        return super().update(theta, gradients)

Practical Tips

  1. Start with Adam: It’s generally a good default choice
  2. Tune learning rate: Most important hyperparameter
  3. Use learning rate scheduling: Reduce the learning rate as training progresses (a minimal example follows this list)
  4. Monitor convergence: Plot loss curves to understand behavior
  5. Consider the problem: Different optimizers work better for different problems
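
A minimal sketch of point 3, using a simple step-decay schedule (the decay factor and interval are arbitrary illustrative choices):

def step_decay(initial_lr, epoch, decay_factor=0.5, decay_every=20):
    # Multiply the learning rate by decay_factor every decay_every epochs
    return initial_lr * decay_factor ** (epoch // decay_every)

for epoch in [0, 19, 20, 40, 60]:
    print(epoch, step_decay(0.1, epoch))   # 0.1, 0.1, 0.05, 0.025, 0.0125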

Conclusion

The evolution from SGD to Adam represents decades of optimization research. Each algorithm addresses specific challenges:

  • SGD: Simple but effective for convex problems
  • Momentum: Accelerates convergence and reduces oscillations
  • RMSprop: Adapts learning rate per parameter
  • Adam: Combines momentum and adaptive learning rates

Understanding these algorithms helps you make informed choices for your specific machine learning problems. The key is to understand the trade-offs and experiment with different optimizers based on your specific use case.


For interactive experiments and detailed implementations, check out the GitHub repository. Try different optimizers on various loss functions to gain intuition about their behavior!