
In machine learning, the cost function describes how well the current learning algorithm is performing on the given dataset.

The goal is to choose parameters so that the learning algorithm's predictions are as close as possible to the targets in the dataset. In other words, the objective is to minimize the cost function.

Linear models

In linear models, minimizing the cost function is equivalent to minimizing the sum of squared residuals.
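A minimal sketch of this equivalence on toy data (the numbers here are illustrative, not from the notes): the cost is just the sum of squared residuals scaled by 1/(2m).

```python
import numpy as np

# Toy data and parameters (illustrative only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b = 1.5, 0.5

residuals = (w * x + b) - y                      # prediction minus target, per example
cost = np.sum(residuals ** 2) / (2 * x.shape[0])  # scaled sum of squared residuals
print(residuals)  # [ 0.  -0.5 -1. ]
print(cost)       # 0.2083...
```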

Code

(Source)

# Sample cost function for linear regression
import numpy as np
 
def compute_cost(x, y, w, b): 
    """
    Computes the cost function for linear regression.
    
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    
    Returns
        total_cost (float): The cost of using w,b as the parameters for linear regression
               to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0] 
    
    cost_sum = 0 
    for i in range(m): 
        f_wb = w * x[i] + b   
        cost = (f_wb - y[i]) ** 2  
        cost_sum = cost_sum + cost  
    total_cost = (1 / (2 * m)) * cost_sum  
 
    return total_cost
 
# X, y: training data, assumed to be loaded elsewhere
J = compute_cost(X, y, w=0.0, b=0.0)
print('With w = 0, b = 0 \nCost computed = %.2f' % J)
print('Expected cost value (approximately) 32.07\n')
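The explicit loop above can be replaced with vectorized NumPy operations; a sketch of an equivalent version:

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Vectorized equivalent of the looped compute_cost above."""
    m = x.shape[0]
    errors = (w * x + b) - y              # residuals for all m examples at once
    return np.sum(errors ** 2) / (2 * m)
```

For large m this avoids Python-level loop overhead while computing exactly the same value.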
 

With regularization

Cost function with regularization

A regularization term is added directly to the cost function.

For example, mean squared error (cost of linear regression) with Ridge regularization:

$$J(\mathbf{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$

  • $\mathbf{w}$: a vector of weights; $b$: a scalar-valued bias term
  • $m$: number of training examples
  • $f_{\mathbf{w},b}$: learning algorithm to fit a vector of features $\mathbf{x}$
  • $(\mathbf{x}^{(i)}, y^{(i)})$: the $i$-th training example in the dataset
  • $\lambda$: regularization parameter. How much we want to shrink the impact of some predictors. The larger $\lambda$, the more penalty.
  • $n$: number of features
  • $w_j$: weight parameter (to be penalized)

In practice, we might or might not penalize the bias parameter $b$, because it makes little difference. It’s just a number.
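The regularized cost above can be sketched for the multi-feature case as follows; `X` as an (m, n) feature matrix and the name `compute_cost_reg` are illustrative assumptions, not from the source:

```python
import numpy as np

def compute_cost_reg(X, y, w, b, lambda_=1.0):
    """Mean squared error cost with a Ridge (L2) penalty on the weights."""
    m = X.shape[0]
    predictions = X @ w + b                            # f_{w,b}(x) for each example
    mse_term = np.sum((predictions - y) ** 2) / (2 * m)
    reg_term = (lambda_ / (2 * m)) * np.sum(w ** 2)    # note: b is not penalized
    return mse_term + reg_term
```

Setting `lambda_=0` recovers the unregularized cost, which is a quick sanity check when implementing this.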

How to choose $\lambda$

(Source)

  1. Try different values for $\lambda$, each doubling the previous:
    1. Minimize the regularized cost function, just as we do with the normal cost function.
    2. Evaluate the parameters on the validation set.
  2. Pick the parameters with the lowest validation error.
  3. Report the test error (or cross-validation error as an estimate).
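The steps above can be sketched as a simple doubling search. The `fit` and `val_cost` callables, the data splits, and the starting value are all assumptions for illustration; any optimizer that minimizes the regularized cost would plug in here:

```python
def pick_lambda(fit, val_cost, X_train, y_train, X_val, y_val,
                start=0.01, steps=10):
    """Try lambda values, each doubling the previous; keep the best on validation.

    fit(X, y, lam)       -> fitted parameters (minimizes the regularized cost)
    val_cost(X, y, params) -> validation error (evaluated WITHOUT regularization)
    """
    best_lam, best_err, best_params = None, float('inf'), None
    lam = start
    for _ in range(steps):
        params = fit(X_train, y_train, lam)
        err = val_cost(X_val, y_val, params)
        if err < best_err:
            best_lam, best_err, best_params = lam, err, params
        lam *= 2  # each value doubles the previous
    return best_lam, best_err, best_params
```

After picking $\lambda$ this way, the held-out test error is reported as the final estimate, per step 3 above.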