gradient descent is a common descent method. It is an iterative, greedy algorithm that usually finds a local minimum of a differentiable objective function by iteratively adjusting in the direction of greatest (steepest) descent.

: feature
: fixed learning rate, determining the size step
cost function
: gradient of function , determining the directions of descent descent

General ML algorithm

After differentiating the cost function, we get batch gradient descent

: current iteration of the algorithm

: vector of features, with being feature

: the training example

: the feature of the training example

In each iteration gradient descent algorithm simultaneously updates ALL parameters (aka a vector of features). With each step of gradient descent, a parameter comes closer to the optimal values that will achieve the lowest cost

General algorithm

Theoretically guaranteed to find local minimum for functions that are , when we’re using line search algorithms for step sizes.

Inputs: function , direction of descent , initial point

Pick the initial, starting point
Calculate the gradient at this point
Find the direction of steepest descent
Find step size using one of these
1. calculate the step size using a line search algorithm, usually backtracking line search ()
2. use fixed one (learning rate)
3. use shrinking step size
Step to the next point , using the step size in the direction
Repeat until the terminal conditions, e.g. for a fixed number of iterations

Outputs: ,

Intuition

Intuitively, gradient descent is like finding the lowest point in a hilly landscape.

Imagine you’re on a hill, and you want to reach the lowest point (minimum).

Start at any random spot on the hill.

Look around and feel the slope. Go downhill in the steepest direction.

Repeat step 3: move a little, check the slope, and adjust your direction.

Keep doing this until you reach the lowest point, where the slope is nearly flat.

Fixed step size: Pros and Cons

Using the learning rate, the gradient descent would take straighter steps to descent towards the minimum. With proper learning rate, , the gradient descent could converge to local minimum with less computational cost. This is the convention for machine learning.

The cons is that the function must be more simple and we must have good heuristic for a good learning rate.

Code

import autograd.numpy as np
from autograd import grad, hessian
 
def rosenbrock(x):
    return (1 - x[0])**2 + 5 * (x[1] - x[0]**2)**2
 
def backtracking(f, gradient_func, d, xi, alpha=1, p=0.5, c=1e-4):
    while f(xi + alpha * d) > f(xi) + c * alpha * gradient_func(f, xi) @ d:
        alpha *= p
    return alpha
 
# Gradient descent function with backtracking line search 
def gradient_descent(f, x0, epsilon=1e-6, max_iterations=1000):
    xi = np.array(x0, dtype=float)
    step_counter = 0
    grad = grad
    
    for i in range(max_iterations):
        grad_norm = np.linalg.norm(grad)
 
        # Termination condition: if the gradient norm is small enough, stop
        if grad_norm < epsilon:
            break
 
        # Calculate descent direction
        di = -grad / grad_norm
 
        # Use backtracking line search to find the optimal step size (alpha)
        alpha = backtracking(f, numerical_gradient, di, xi)
 
        # Update position
        x_next = xi + alpha * di
 
        # Check for function value improvement
        if abs(f(x_next) - f(xi)) < epsilon:
            break
        
        # Update for the next iteration
        xi = x_next
        grad = numerical_gradient(f, xi, epsilon)
        step_counter += 1
 
    return xi, f(xi), step_counter
 
# Test the gradient descent function with Rosenbrock function
x_optimal, f_optimal, steps = gradient_descent(rosenbrock, [1, 2])
 
print(f"Optimal x: {x_optimal}")
print(f"Function value at optimal x: {f_optimal}")
print(f"Number of steps: {steps}")

Usage

Minimize cost function, equivalent to optimizing regression models

Type

mini-batch gradient descent: split data into small batches of similar number of points, and teach each batch to update parameters

Algorithm (ML convention)

Declare the learning algorithm
Initialize
1. parameters (usually set to 0’s or randomized)
2. learning rate
3. max number of iterations
Run the gradient descent algorithm for times to update parameters and compute cost function
Plot cost function VS number of training iterations. Check if gradient descent converges

How to choose the learning rate

(ML Specialization) (ML Specialization)

Experiment with different learning rates. The following problem is common to descent methods

If it is too small, gradient descent might be slow to converge

If it is too large, gradient descent might overshoot and fail to reach the minimum

Plot the function (e.g. cost function) over a few iterations (~20) to how the gradient descent works

cost function should decrease after every iteration

If learning rate works, increase it by 3X until gradient descent doesn’t work

Use a smaller learning rate when gradient descent is not working, but not too low

When there are multiple local minima, starting values of parameters and learning rate will lead the model to end up at different minimum

Formula

where each iteration performs simultaneous updates on for all weight parameters

is the number of training examples in the data set
is the model’s prediction for , while is the actual value

Code

(ML Specialization)

Tips

Feature Scaling

feature scaling

Regularization

After add regularization to cost function, we implement new gradient descent:

The regularized part decreases each iteration by a little bit.

My (Chiffon) Nguyen

Explorer

gradient descent

General algorithm

Code

Usage

Type

Algorithm (ML convention)

Formula

Code

Tips

Feature Scaling

Regularization

Graph View

Table of Contents

Backlinks