Regression models the relationship, often linear, between one or more predictors and the response variable (the true label). We can then use this model to generate predicted values; performance is quantified by a function of the difference between the true and predicted values.
Typical statistical representation for true labels: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + \varepsilon$
Typical machine learning representation for predicted values: $\hat{y} = \mathbf{w} \cdot \mathbf{x} + b = w_1 x_1 + \cdots + w_m x_m + b$
For easier calculations, in practice they're usually laid out as follows for the model: $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b} + \boldsymbol{\varepsilon}$ (and for predictions, $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + \mathbf{b}$)
$\mathbf{X}$ is an $n \times m$ matrix, which has $n$ data points and $m$ features/predictors. $\mathbf{x}_i$ is a column vector representing feature $i$.
$\mathbf{x}_i$ can be non-linear, such as polynomial features (e.g., $x^2$) used to construct polynomial regression
$\mathbf{y}$ (equiv. $Y$) is a vector of true values of dimensions $n \times 1$. $\hat{\mathbf{y}}$ is the vector of predicted values (same dimensions)
$\mathbf{w}$ is the vector of weights (equiv. slopes $\boldsymbol{\beta}$) for the $m$ predictors, of dimensions $m \times 1$
$\mathbf{X}\mathbf{w}$ is the product of $\mathbf{X}$ and $\mathbf{w}$ (each entry is the inner product between a row of $\mathbf{X}$ and $\mathbf{w}$), and the result has the dimensions $n \times 1$
$b$ is a scalar value representing the bias term (equiv. intercept $\beta_0$). For the dimensions to align, the value is repeated $n$ times to create a vector $\mathbf{b}$ of dimensions $n \times 1$
$\boldsymbol{\varepsilon}$ is the vector of error terms; in practice it is estimated by the residuals $\mathbf{y} - \hat{\mathbf{y}}$
Note that the notations are consistent here: bold for vectors and matrices, plain for scalar values. However, such consistency is rarely guaranteed elsewhere. We need to read notation carefully; oftentimes, $X$ is enough to mean the matrix of data, and $\hat{y}$ is enough to mean all predicted values.
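A minimal NumPy sketch of this layout, using synthetic data (all names here just mirror the notation above; nothing comes from a specific dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

n, m = 100, 3                        # n data points, m features
X = rng.normal(size=(n, m))          # X: the n x m data matrix
true_w = np.array([2.0, -1.0, 0.5])  # weights used to generate the data
true_b = 4.0                         # bias / intercept
eps = rng.normal(scale=0.1, size=n)  # error terms
y = X @ true_w + true_b + eps        # true values, shape (n,)

# Fit ordinary least squares: append a column of ones so the last
# coefficient plays the role of the bias term b
X_aug = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

y_hat = X @ w_hat + b_hat            # predicted values, same shape as y
residuals = y - y_hat                # estimates of the error terms
```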
L - Linearity: There exists a linear relationship between two quantitative variables, tested with a scatterplot (sns.pairplot)
I - Independence of observations: Each observation in the dataset is independent.
No correlation between errors. Otherwise, the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. Time series data does NOT satisfy this assumption, since consecutive errors are typically correlated.
N - Normality of errors: The distribution of the errors is approximately normal, checked with a Q-Q plot (statsmodels.api.qqplot)
H - Homoscedasticity of errors: The errors have constant variance at every level of $x$ (homoscedastic shape). It can be checked by creating a fitted values vs. residuals plot (not a Q-Q plot); all three plot checks are sketched in the code below
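A rough sketch of the three plot-based checks, using a small synthetic DataFrame (the column names x and y are just placeholders):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic data; in practice, replace with your own DataFrame
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 2 * x + 1 + rng.normal(scale=0.5, size=200)})

# L - Linearity: pairwise scatterplots of the variables
sns.pairplot(df)

# Fit an OLS model to get residuals and fitted values
model = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()

# N - Normality of errors: Q-Q plot of the residuals
sm.qqplot(model.resid, line="s")

# H - Homoscedasticity: fitted values vs. residuals plot
plt.figure()
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```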
Transform the data → change the interpretation of the results
Consider a different kind of model
More details: When an assumption is violated
Linearity
Transform one or both of the variables, such as taking the logarithm.
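E.g., a short sketch, assuming a hypothetical DataFrame df with positive-valued columns x and y:

```python
import numpy as np

# Log-transform one or both variables (values must be positive),
# then refit the model and recheck the scatterplot
df["log_y"] = np.log(df["y"])
df["log_x"] = np.log(df["x"])
```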
Normality
Transform one or both variables. Most commonly, this would involve taking the logarithm of the outcome variable.
When the outcome variable is right-skewed, such as income, the normality of the residuals can be affected, so taking the logarithm of the outcome variable can sometimes help with this assumption.
If you transform a variable, you will need to reconstruct the model and then recheck the normality assumption to be sure. If the assumption is still not satisfied, you’ll have to continue troubleshooting the issue.
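A sketch of this rebuild-and-recheck loop with synthetic right-skewed data (names and numbers are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic right-skewed outcome: exponentiating a linear trend plus noise
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=300)})
df["income"] = np.exp(1.0 + 0.5 * df["x"] + rng.normal(scale=0.4, size=300))

# Reconstruct the model on the log of the outcome, then recheck the Q-Q plot
model = sm.OLS(np.log(df["income"]), sm.add_constant(df["x"])).fit()
sm.qqplot(model.resid, line="s")
```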
Independent observations
Take just a subset of the available data, e.g.,
when conducting a survey, just keeping the responses of one person in each household.
perhaps the number of bikes rented out is approximately independent if the data is taken once every 2 hours instead of once every 15 minutes.
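Sketches of both ideas, assuming hypothetical DataFrames rentals (15-minute readings) and survey (with a household_id column):

```python
# rentals: hypothetical DataFrame of bike rentals recorded every 15 minutes;
# keeping every 8th row spaces readings 2 hours apart (8 x 15 min = 2 h)
rentals_2h = rentals.iloc[::8]

# survey: hypothetical DataFrame with a household_id column;
# keep only the first response from each household
survey_one_per_household = survey.drop_duplicates(subset="household_id")
```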
Homoscedasticity
Define a different outcome variable.
If you are interested in understanding how a city's population correlates with its number of restaurants, you know that some cities are much more populous than others. You can then redefine the outcome variable as the ratio of population to restaurants.
Transform the Y variable.
Taking the logarithm of the Y variable, or transforming it in another way, can sometimes fix violations of the homoscedasticity assumption.
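A sketch of both fixes, assuming a hypothetical DataFrame cities with population and restaurants columns:

```python
import numpy as np

# cities: hypothetical DataFrame with "population" and "restaurants" columns

# Option 1: redefine the outcome as the ratio of population to restaurants
cities["pop_per_restaurant"] = cities["population"] / cities["restaurants"]

# Option 2: log-transform the Y variable, then refit and recheck
# the fitted values vs. residuals plot
cities["log_restaurants"] = np.log(cities["restaurants"])
```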