Regression models the relationship, often linear, between one or more predictors and the response variable (true label). We can then use the model to generate predicted values $\hat{y}$; performance is quantified by a function of the difference between the true and predicted values.

Typical statistical representation for true labels:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

Typical machine learning representation for predicted values:

$$\hat{y} = f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

For easier calculation, in practice they're usually laid out as follows for the model:

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b} + \boldsymbol{\epsilon}$$

  • $\mathbf{X}$ is an $m \times n$ matrix, which has $m$ data points and $n$ features/predictors. $\mathbf{x}_j$ is a vector representing feature $j$.
  • $\mathbf{y}$ is an $m \times 1$ vector of true values. $\hat{\mathbf{y}}$ is the vector of predicted values (same dimensions)
  • $\mathbf{w}$ is the $n \times 1$ vector of weights (equiv. slopes $\beta_1, \dots, \beta_n$) for the predictors
  • $\mathbf{X}\mathbf{w}$ is the inner product between $\mathbf{X}$ and $\mathbf{w}$, and the result has the dimensions $m \times 1$
  • $b$ is a scalar value representing the bias term (equiv. intercept $\beta_0$). For the dimensions to align, the value is repeated $m$ times to create a vector $\mathbf{b}$ of dimensions $m \times 1$
  • $\boldsymbol{\epsilon}$ is the $m \times 1$ vector of error terms (which are usually estimated by the residuals)

Note that the notation is consistent here: bold for vectors and matrices, plain for scalar values. However, such consistency is rare in the wild, so read the notation carefully; often, $X$ is enough to mean the matrix of data, and $\hat{y}$ is enough to mean all predicted values.
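
As a quick sanity check on those dimensions, here is a minimal NumPy sketch (the feature values, weights, and bias below are made up for illustration):

```python
import numpy as np

# Made-up example: m = 5 data points, n = 2 features
m, n = 5, 2
X = np.random.rand(m, n)      # (m, n) matrix of predictors
w = np.array([0.5, -1.2])     # (n,) vector of weights
b = 3.0                       # scalar bias, broadcast across all m rows

y_hat = X @ w + b             # inner product Xw plus bias -> (m,) predicted values
print(y_hat.shape)            # (5,)
```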

 

Assumptions (LINH)

(Statology) (Coursera)

  • L - Linearity: There exists a linear relationship between the predictors and the response variable, checked with scatterplots (sns.pairplot)
  • I - Independence of observations: Each observation in the dataset is independent.
  • No correlation between errors. Otherwise, the estimated standard errors will tend to underestimate the true standard errors; as a result, confidence and prediction intervals will be narrower than they should be. Time series data does NOT satisfy this assumption
  • N - Normality of errors: The errors are approximately normally distributed, checked with a Q-Q plot (statsmodels.api.qqplot)
  • H - Homoscedasticity of errors: The errors have constant variance at every level of $x$ (homoscedastic shape). It can be checked by creating a fitted values vs. residuals plot
  • Little to no collinearity (relevant in multiple regression): No or little correlation between pairs of independent variables. This can be quantified using the variance inflation factor (VIF); the checks above are sketched in code after this list
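
A hedged sketch of these checks in Python (assuming a DataFrame named df with a column named "target"; both names are hypothetical):

```python
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df is an assumed DataFrame with predictor columns plus a "target" column
sns.pairplot(df)                                  # L: look for roughly linear scatterplots

X = sm.add_constant(df.drop(columns="target"))
model = sm.OLS(df["target"], X).fit()

sm.qqplot(model.resid, line="45")                 # N: errors should hug the 45-degree line

plt.scatter(model.fittedvalues, model.resid)      # H: no funnel shape in fitted vs. residuals
plt.xlabel("Fitted values"); plt.ylabel("Residuals")

# Collinearity: VIF per column (rule of thumb: values above ~5-10 flag a problem)
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
```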

When an assumption is violated

  • Transform the data (this changes the interpretation of the results)
  • Consider a different kind of model

Linear Algebra: Normal equation

Transclude of normal-equation
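
For quick reference, the normal equation gives the closed-form least-squares solution $\mathbf{w} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$. A minimal NumPy sketch on made-up data (a column of ones is prepended to absorb the bias term):

```python
import numpy as np

# Made-up data: m = 100 points, n = 3 features, true weights [2, -1, 0.5] and bias 4
X = np.random.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * np.random.randn(100)

X_b = np.hstack([np.ones((100, 1)), X])        # prepend a ones column for the intercept
w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y     # normal equation: w = (X^T X)^-1 X^T y
print(w)                                       # roughly [4.0, 2.0, -1.0, 0.5]

# In practice, np.linalg.lstsq (or solve) is preferred over an explicit inverse
```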

Calculus: Gradient Descent

Process

  1. Check assumptions
  2. feature scaling
  3. Compute loss and cost functions
  4. Construct a best-fit line using gradient descent (regularization if needed)
  5. Calculate adjusted R-squared
  6. Calculate p-value
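
A hedged end-to-end sketch of steps 2-5 on made-up data, using scikit-learn's SGDRegressor as the gradient-descent fitter (it minimizes squared error and applies L2 regularization by default):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Made-up data: m = 200 examples, n = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + 0.3 * rng.normal(size=200)

X_scaled = StandardScaler().fit_transform(X)          # step 2: feature scaling
model = SGDRegressor(max_iter=1000, random_state=0)   # steps 3-4: squared-error cost, fit by SGD
model.fit(X_scaled, y)

m, n = X.shape
r2 = model.score(X_scaled, y)                         # R-squared
adj_r2 = 1 - (1 - r2) * (m - 1) / (m - n - 1)         # step 5: adjusted R-squared
print(round(adj_r2, 3))
```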

Loss and cost functions

mean squared error

The loss function used in regression is called mean squared error:

$$L\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}\right) = \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$$

where:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$$

Incorporating that loss function into the cost function $J(\mathbf{w}, b)$:

$$J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$$
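
A small NumPy sketch of that cost function, following the matrix layout defined earlier (X is (m, n), w is (n,), b is a scalar):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) with the 1/(2m) convention used above."""
    m = X.shape[0]
    errors = X @ w + b - y            # f_wb(x^(i)) - y^(i) for every example at once
    return np.sum(errors ** 2) / (2 * m)
```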

Link to original

Gradient descent

gradient descent for multiple variables:

$$\text{repeat until convergence: } \begin{cases} w_j := w_j - \alpha \dfrac{1}{m} \displaystyle\sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)} & \text{for } j = 1, \dots, n \\[2ex] b := b - \alpha \dfrac{1}{m} \displaystyle\sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) \end{cases}$$

where $n$ is the number of features, the parameters $w_j$ and $b$ are updated simultaneously, and where

  • $m$ is the number of training examples in the data set
  • $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target value
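
A minimal NumPy sketch of those simultaneous updates (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression; X is (m, n), y is (m,)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        errors = X @ w + b - y           # f_wb(x^(i)) - y^(i)
        dj_dw = X.T @ errors / m         # dJ/dw_j for all j at once
        dj_db = errors.sum() / m         # dJ/db
        w = w - alpha * dj_dw            # update w and b simultaneously
        b = b - alpha * dj_db
    return w, b
```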

Regularized linear regression

Regularization

After adding regularization to the cost function $J(\mathbf{w}, b)$ (the extra term $\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$), we implement the new gradient descent:

$$w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} w_j \right] \qquad b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)$$

The regularized part shrinks each weight a little bit on every iteration: rearranging the update, $w_j$ is first multiplied by $\left(1 - \alpha \frac{\lambda}{m}\right)$ before the usual gradient step.
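
The only change from the unregularized sketch above is the extra $(\lambda/m) \, w_j$ term in the weight gradient; a one-step sketch (lambda_ is an assumed name for the regularization strength):

```python
import numpy as np

def regularized_step(X, y, w, b, alpha=0.01, lambda_=1.0):
    """One L2-regularized gradient-descent step; the bias b is not regularized."""
    m = X.shape[0]
    errors = X @ w + b - y
    dj_dw = X.T @ errors / m + (lambda_ / m) * w   # extra regularization term shrinks w
    dj_db = errors.sum() / m
    return w - alpha * dj_dw, b - alpha * dj_db
```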

Link to original

Link to original