gradient descent is a common descent method. It is an iterative, greedy algorithm that usually finds a local minimum of a differentiable objective function by iteratively adjusting in the direction of greatest (steepest) descent.

  • : feature
  • : fixed learning rate, determining the size step
  • cost function
  • : gradient of function , determining the directions of descent descent

General algorithm

Theoretically guaranteed to find local minimum for functions that are , when we’re using line search algorithms for step sizes.

Inputs: function , direction of descent , initial point

  1. Pick the initial, starting point
  2. Calculate the gradient at this point
  3. Find the direction of steepest descent
  4. Find step size using one of these
    1. calculate the step size using a line search algorithm, usually backtracking line search ()
    2. use fixed one (learning rate)
    3. use shrinking step size
  5. Step to the next point , using the step size in the direction
  6. Repeat until the terminal conditions, e.g. for a fixed number of iterations

Outputs: ,

Fixed step size: Pros and Cons

  • Using the learning rate, the gradient descent would take straighter steps to descent towards the minimum. With proper learning rate, , the gradient descent could converge to local minimum with less computational cost. This is the convention for machine learning.
  • The cons is that the function must be more simple and we must have good heuristic for a good learning rate.

Code

import autograd.numpy as np
from autograd import grad, hessian
 
def rosenbrock(x):
    return (1 - x[0])**2 + 5 * (x[1] - x[0]**2)**2
 
def backtracking(f, gradient_func, d, xi, alpha=1, p=0.5, c=1e-4):
    while f(xi + alpha * d) > f(xi) + c * alpha * gradient_func(f, xi) @ d:
        alpha *= p
    return alpha
 
# Gradient descent function with backtracking line search 
def gradient_descent(f, x0, epsilon=1e-6, max_iterations=1000):
    xi = np.array(x0, dtype=float)
    step_counter = 0
    grad = grad
    
    for i in range(max_iterations):
        grad_norm = np.linalg.norm(grad)
 
        # Termination condition: if the gradient norm is small enough, stop
        if grad_norm < epsilon:
            break
 
        # Calculate descent direction
        di = -grad / grad_norm
 
        # Use backtracking line search to find the optimal step size (alpha)
        alpha = backtracking(f, numerical_gradient, di, xi)
 
        # Update position
        x_next = xi + alpha * di
 
        # Check for function value improvement
        if abs(f(x_next) - f(xi)) < epsilon:
            break
        
        # Update for the next iteration
        xi = x_next
        grad = numerical_gradient(f, xi, epsilon)
        step_counter += 1
 
    return xi, f(xi), step_counter
 
# Test the gradient descent function with Rosenbrock function
x_optimal, f_optimal, steps = gradient_descent(rosenbrock, [1, 2])
 
print(f"Optimal x: {x_optimal}")
print(f"Function value at optimal x: {f_optimal}")
print(f"Number of steps: {steps}")

Usage

Type

Algorithm (ML convention)

  1. Declare the learning algorithm
  2. Initialize
    1. parameters (usually set to 0’s or randomized)
    2. learning rate
    3. max number of iterations
  3. Run the gradient descent algorithm for times to update parameters and compute cost function
  4. Plot cost function VS number of training iterations. Check if gradient descent converges

Formula

where each iteration performs simultaneous updates on for all weight parameters

  • is the number of training examples in the data set
  • is the model’s prediction for , while is the actual value

Code

(ML Specialization)

Tips

Feature Scaling

feature scaling

Regularization

After add regularization to cost function, we implement new gradient descent:

The regularized part decreases each iteration by a little bit.