gradient descent is a common descent method. It is an iterative, greedy algorithm that usually finds a local minimum of a differentiable objective function by iteratively adjusting in the direction of greatest (steepest) descent.
: feature : fixed learning rate, determining the size step - cost function
: gradient of function , determining the directions of descent descent
General ML algorithm
After differentiating the cost function, we get batch gradient descent
: current iteration of the algorithm : vector of features, with being feature : the training example : the feature of the training example In each iteration
gradient descent algorithm simultaneously updates ALL parameters (aka a vector of features). With each step of gradient descent, a parameter comes closer to the optimal values that will achieve the lowest cost
General algorithm
Theoretically guaranteed to find local minimum for functions that are , when we’re using line search algorithms for step sizes.
Inputs: function
- Pick the initial, starting point
- Calculate the gradient at this point
- Find the direction of steepest descent
- Find step size
using one of these - calculate the step size
using a line search algorithm, usually backtracking line search () - use fixed one (learning rate)
- use shrinking step size
- calculate the step size
- Step to the next point
, using the step size in the direction - Repeat until the terminal conditions, e.g. for a fixed number of iterations
Outputs:
Intuition
Intuitively, gradient descent is like finding the lowest point in a hilly landscape.
- Imagine you’re on a hill, and you want to reach the lowest point (minimum).
- Start at any random spot on the hill.
- Look around and feel the slope. Go downhill in the steepest direction.
- Repeat step 3: move a little, check the slope, and adjust your direction.
- Keep doing this until you reach the lowest point, where the slope is nearly flat.
Fixed step size: Pros and Cons
- Using the learning rate, the gradient descent would take straighter steps to descent towards the minimum. With proper learning rate,
, the gradient descent could converge to local minimum with less computational cost. This is the convention for machine learning. - The cons is that the function must be more simple and we must have good heuristic for a good learning rate.
Code
import autograd.numpy as np
from autograd import grad, hessian
def rosenbrock(x):
return (1 - x[0])**2 + 5 * (x[1] - x[0]**2)**2
def backtracking(f, gradient_func, d, xi, alpha=1, p=0.5, c=1e-4):
while f(xi + alpha * d) > f(xi) + c * alpha * gradient_func(f, xi) @ d:
alpha *= p
return alpha
# Gradient descent function with backtracking line search
def gradient_descent(f, x0, epsilon=1e-6, max_iterations=1000):
xi = np.array(x0, dtype=float)
step_counter = 0
grad = grad
for i in range(max_iterations):
grad_norm = np.linalg.norm(grad)
# Termination condition: if the gradient norm is small enough, stop
if grad_norm < epsilon:
break
# Calculate descent direction
di = -grad / grad_norm
# Use backtracking line search to find the optimal step size (alpha)
alpha = backtracking(f, numerical_gradient, di, xi)
# Update position
x_next = xi + alpha * di
# Check for function value improvement
if abs(f(x_next) - f(xi)) < epsilon:
break
# Update for the next iteration
xi = x_next
grad = numerical_gradient(f, xi, epsilon)
step_counter += 1
return xi, f(xi), step_counter
# Test the gradient descent function with Rosenbrock function
x_optimal, f_optimal, steps = gradient_descent(rosenbrock, [1, 2])
print(f"Optimal x: {x_optimal}")
print(f"Function value at optimal x: {f_optimal}")
print(f"Number of steps: {steps}")
Usage
- Minimize cost function, equivalent to optimizing regression models
Type
- mini-batch gradient descent: split data into small batches of similar number of points, and teach each batch to update parameters
Algorithm (ML convention)
- Declare the learning algorithm
- Initialize
- parameters (usually set to 0’s or randomized)
- learning rate
- max number of iterations
- Run the gradient descent algorithm for
times to update parameters and compute cost function - Plot cost function VS number of training iterations. Check if gradient descent converges
How to choose the learning rate
(ML Specialization) (ML Specialization)
- Experiment with different learning rates. The following problem is common to descent methods
- If it is too small, gradient descent might be slow to converge
- If it is too large, gradient descent might overshoot and fail to reach the minimum
- Plot the function (e.g. cost function) over a few iterations (~20) to how the gradient descent works
- cost function should decrease after every iteration
- If learning rate works, increase it by 3X until gradient descent doesn’t work
- Use a smaller learning rate
when gradient descent is not working, but not too low - When there are multiple local minima, starting values of parameters and learning rate will lead the model to end up at different minimum
Formula
where each iteration
is the number of training examples in the data set is the model’s prediction for , while is the actual value
Code
Tips
Feature Scaling
Regularization
After add regularization to cost function, we implement new gradient descent:
The regularized part decreases