Regression models the relationship, often linear, between one or more predictors and the response variable (true label). We can then use the model to generate predicted values $\hat{y}$; performance is quantified by a function of the difference between the true and predicted values.

Typical statistical representation for true labels:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

Typical machine learning representation for predicted values:

$$\hat{y} = f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

For easier calculation, in practice they're usually laid out as follows for the model:

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b} + \boldsymbol{\epsilon}$$

  • $\mathbf{X}$ is an $m \times n$ matrix, which has $m$ data points and $n$ features/predictors. $\mathbf{x}_j$ is a vector representing feature $j$.
  • $\mathbf{y}$ is an $m \times 1$ vector of true values. $\hat{\mathbf{y}}$ is the vector of predicted values (same dimensions)
  • $\mathbf{w}$ is the $n \times 1$ vector of weights (equiv. slopes $\beta_1, \dots, \beta_n$) for the predictors
  • $\mathbf{X}\mathbf{w}$ is the inner product between $\mathbf{X}$ and $\mathbf{w}$, and the result has the dimensions $m \times 1$
  • $b$ is a scalar value representing the bias term (equiv. intercept $\beta_0$). For the dimensions to align, the value is repeated $m$ times to create a vector $\mathbf{b}$ of dimensions $m \times 1$
  • $\boldsymbol{\epsilon}$ is the $m \times 1$ vector of error terms (which are usually estimated by the residuals)

Note that the notation is consistent here: bold for vectors and matrices, plain for scalar values. However, such consistency is rare in the wild, so read the notation carefully; often, $X$ is enough to mean the matrix of data, and $\hat{y}$ is enough to mean all predicted values.
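
As a quick sanity check on those dimensions, here is a minimal NumPy sketch (the feature values, weights, and bias below are made up for illustration):

```python
import numpy as np

# Made-up example: m = 5 data points, n = 2 features
m, n = 5, 2
X = np.random.rand(m, n)      # (m, n) matrix of predictors
w = np.array([0.5, -1.2])     # (n,) vector of weights
b = 3.0                       # scalar bias, broadcast across all m rows

y_hat = X @ w + b             # inner product Xw plus bias -> (m,) predicted values
print(y_hat.shape)            # (5,)
```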

 

Assumptions (LINH)

(Statology) (Coursera)

  • L - Linearity: There exists a linear relationship between the predictors and the response variable, checked with scatterplots (sns.pairplot)
  • I - Independence of observations: Each observation in the dataset is independent.
  • No correlation between errors. Otherwise, the estimated standard errors will tend to underestimate the true standard errors; as a result, confidence and prediction intervals will be narrower than they should be. Time series data does NOT satisfy this assumption
  • N - Normality of errors: The errors are approximately normally distributed, checked with a Q-Q plot (statsmodels.api.qqplot)
  • H - Homoscedasticity of errors: The errors have constant variance at every level of $x$ (homoscedastic shape). It can be checked by creating a fitted values vs. residuals plot
  • Little to no collinearity (relevant in multiple regression): No or little correlation between pairs of independent variables. This can be quantified using the variance inflation factor (VIF); the checks above are sketched in code after this list
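
A hedged sketch of these checks in Python (assuming a DataFrame named df with a column named "target"; both names are hypothetical):

```python
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df is an assumed DataFrame with predictor columns plus a "target" column
sns.pairplot(df)                                  # L: look for roughly linear scatterplots

X = sm.add_constant(df.drop(columns="target"))
model = sm.OLS(df["target"], X).fit()

sm.qqplot(model.resid, line="45")                 # N: errors should hug the 45-degree line

plt.scatter(model.fittedvalues, model.resid)      # H: no funnel shape in fitted vs. residuals
plt.xlabel("Fitted values"); plt.ylabel("Residuals")

# Collinearity: VIF per column (rule of thumb: values above ~5-10 flag a problem)
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
```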

When an assumption is violated

  • Transform the data (this changes the interpretation of the results)
  • Consider a different kind of model

Linear Algebra: Normal equation

Transclude of normal-equation
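
For quick reference, the normal equation gives the closed-form least-squares solution $\mathbf{w} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$. A minimal NumPy sketch on made-up data (a column of ones is prepended to absorb the bias term):

```python
import numpy as np

# Made-up data: m = 100 points, n = 3 features, true weights [2, -1, 0.5] and bias 4
X = np.random.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * np.random.randn(100)

X_b = np.hstack([np.ones((100, 1)), X])        # prepend a ones column for the intercept
w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y     # normal equation: w = (X^T X)^-1 X^T y
print(w)                                       # roughly [4.0, 2.0, -1.0, 0.5]

# In practice, np.linalg.lstsq (or solve) is preferred over an explicit inverse
```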

Calculus: Gradient Descent

Process

  1. Check assumptions
  2. feature scaling
  3. Compute loss and cost functions
  4. Construct a best-fit line using gradient descent (regularization if needed)
  5. Calculate adjusted R-squared
  6. Calculate p-value
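
A hedged end-to-end sketch of steps 2-5 on made-up data, using scikit-learn's SGDRegressor as the gradient-descent fitter (it minimizes squared error and applies L2 regularization by default):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Made-up data: m = 200 examples, n = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + 0.3 * rng.normal(size=200)

X_scaled = StandardScaler().fit_transform(X)          # step 2: feature scaling
model = SGDRegressor(max_iter=1000, random_state=0)   # steps 3-4: squared-error cost, fit by SGD
model.fit(X_scaled, y)

m, n = X.shape
r2 = model.score(X_scaled, y)                         # R-squared
adj_r2 = 1 - (1 - r2) * (m - 1) / (m - n - 1)         # step 5: adjusted R-squared
print(round(adj_r2, 3))
```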

Loss and cost functions

mean squared error

The loss function used in regression is called mean squared error:

$$L\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right) = \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$$

where:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$$

Incorporating that loss function into the cost function $J(\mathbf{w}, b)$:

$$J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$$
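
A small NumPy sketch of that cost function, following the matrix layout defined earlier (X is (m, n), w is (n,), b is a scalar):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) with the 1/(2m) convention used above."""
    m = X.shape[0]
    errors = X @ w + b - y            # f_wb(x^(i)) - y^(i) for every example at once
    return np.sum(errors ** 2) / (2 * m)
```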

Link to original

Gradient descent

gradient descent for multiple variables:

$$\text{repeat until convergence: } \begin{cases} w_j := w_j - \alpha \dfrac{1}{m} \displaystyle\sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)} & \text{for } j = 1, \dots, n \\[2ex] b := b - \alpha \dfrac{1}{m} \displaystyle\sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) \end{cases}$$

where $n$ is the number of features, the parameters $w_j$ and $b$ are updated simultaneously, and where

  • $m$ is the number of training examples in the data set
  • $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target value
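
A minimal NumPy sketch of those simultaneous updates (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression; X is (m, n), y is (m,)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        errors = X @ w + b - y           # f_wb(x^(i)) - y^(i)
        dj_dw = X.T @ errors / m         # dJ/dw_j for all j at once
        dj_db = errors.sum() / m         # dJ/db
        w = w - alpha * dj_dw            # update w and b simultaneously
        b = b - alpha * dj_db
    return w, b
```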

Regularized linear regression

Regularization

After adding regularization to the cost function $J(\mathbf{w}, b)$ (the extra term $\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$), we implement the new gradient descent:

$$w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} w_j \right] \qquad b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)$$

The regularized part shrinks each weight a little bit on every iteration: rearranging the update, $w_j$ is first multiplied by $\left(1 - \alpha \frac{\lambda}{m}\right)$ before the usual gradient step.
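
The only change from the unregularized sketch above is the extra $(\lambda/m) \, w_j$ term in the weight gradient; a one-step sketch (lambda_ is an assumed name for the regularization strength):

```python
import numpy as np

def regularized_step(X, y, w, b, alpha=0.01, lambda_=1.0):
    """One L2-regularized gradient-descent step; the bias b is not regularized."""
    m = X.shape[0]
    errors = X @ w + b - y
    dj_dw = X.T @ errors / m + (lambda_ / m) * w   # extra regularization term shrinks w
    dj_db = errors.sum() / m
    return w - alpha * dj_dw, b - alpha * dj_db
```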

Link to original

Link to original