(Code Lab)
Anomaly detection is an unsupervised learning technique for finding unusual events.
What's an anomaly?
An example $x$ is considered an anomaly if, because some of its features take unlikely values, its probability $p(x)$ falls below a given small threshold $\epsilon$ ($x$ is a vector of features).
Assumptions
- Anomalies in data occur only very rarely
- The features of data anomalies are significantly different from those of normal instances
Applications
Anomalous data is linked to some sort of problem or rare event such as
- Hacking
- Fraud
- New, unseen defects (previously seen defects can be detected using supervised learning)
- Textual errors

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of anomalous (positive) examples | Comparable number of positive and negative examples |
| Future anomalies may differ completely from any anomalous example seen in training | Enough positive examples in the training set; future positive examples are similar to them (e.g. spam) |

Algorithm
(ML Spec)
Given a training set $\{x^{(1)}, \dots, x^{(m)}\}$ with $m$ examples, each having $n$ features
Choose features for anomaly detection
(How to choose features)
Choose features $x_j$ that might indicate anomalous examples.
They should
- be normally distributed. When doing exploratory data analysis, plot their histograms.
- take on unusually large or small values in the event of anomalies
Work with non-normal features
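If features are skewed, try transforming them with, for example, $\log(x)$, $\log(x + c)$, $\sqrt{x}$, or $x^{1/3}$ until the histogram looks closer to a normal distribution.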
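Whether a transform actually helps can be checked numerically as well as visually. The sketch below is only an illustration: it uses a synthetic exponential sample as a stand-in for a skewed feature, and the `skewness` helper, the offset `c = 1.0`, and the choice of `log(x + c)` and `sqrt(x)` are assumptions made for the example, not something taken from the course material.

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardized moment); hypothetical helper."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (assumption: stands in for real data)
x = rng.exponential(scale=2.0, size=10_000)

c = 1.0  # illustrative offset that keeps the log defined at x = 0
x_log = np.log(x + c)
x_sqrt = np.sqrt(x)

print(f"skewness raw:        {skewness(x):.2f}")
print(f"skewness log(x + c): {skewness(x_log):.2f}")
print(f"skewness sqrt(x):    {skewness(x_sqrt):.2f}")
```

A skewness closer to 0 suggests a more symmetric, more Gaussian-looking feature; plotting the histogram before and after remains the most direct check.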
Fit parameters
- Estimate the parameters $\mu_j$ (mean) and $\sigma_j^2$ (variance) of the normal distribution for each feature using maximum likelihood estimation
import numpy as np

def estimate_gaussian(X):
    """
    Calculates mean and variance of all features
    in the dataset

    Args:
        X (ndarray): (m, n) Data matrix

    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """
    m, n = X.shape
    mu = 1 / m * np.sum(X, axis=0)
    var = 1 / m * np.sum(np.square(X - mu), axis=0)
    return mu, var
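- Fit parameters for ALL $n$ features: $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$ and $\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$, for $j = 1, \dots, n$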
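As a sanity check on the per-feature estimates with the $1/m$ normalization, they should agree with NumPy's built-in `np.mean` and `np.var` taken along axis 0, because `np.var` uses `ddof=0` by default. The data below is synthetic and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic (m, n) data matrix: 500 examples, 2 features (assumption)
X = rng.normal(loc=[10.0, -3.0], scale=[2.0, 0.5], size=(500, 2))
m, n = X.shape

# Maximum likelihood estimates written out explicitly
mu = X.sum(axis=0) / m
var = ((X - mu) ** 2).sum(axis=0) / m

# Built-in equivalents; np.var defaults to ddof=0, so it divides by m
mu_np = X.mean(axis=0)
var_np = X.var(axis=0)

print(np.allclose(mu, mu_np), np.allclose(var, var_np))
```

Passing `ddof=1` to `np.var` would instead give the unbiased estimate that divides by `m - 1`; for large `m` the difference is negligible.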
Density estimation
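- Compute $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$ for each new example $x$
- Anomaly if $p(x) < \epsilon$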
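The density step can be sketched as follows, assuming each feature is modeled as an independent Gaussian so that the density of an example is the product of the per-feature densities. The function name `gaussian_density`, the parameter values, and the threshold `epsilon = 1e-3` are all illustrative assumptions.

```python
import numpy as np

def gaussian_density(X, mu, var):
    """Density of each row of X under independent per-feature Gaussians."""
    X = np.atleast_2d(X)
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    exponent = -((X - mu) ** 2) / (2.0 * var)
    # Product over features gives one density value per example
    return np.prod(coef * np.exp(exponent), axis=1)

# Illustrative parameters, as if already fitted on a training set
mu = np.array([0.0, 5.0])
var = np.array([1.0, 4.0])

X_new = np.array([
    [0.1, 5.5],    # close to the mean in both features
    [4.0, -3.0],   # far from the mean in both features
])
epsilon = 1e-3  # hypothetical threshold

p = gaussian_density(X_new, mu, var)
is_anomaly = p < epsilon
print(p, is_anomaly)
```

With many features the product of small densities can underflow to zero; summing log-densities and comparing against `log(epsilon)` is a common workaround.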
Evaluation
Real-number evaluation
(ML Spec) When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.
- Fit the model $p(x)$ on a training set of examples labeled as non-anomalous
- Tune the parameters (such as $\epsilon$) on a cross-validation set that contains a few labeled anomalies
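One possible way to tune the threshold on the cross-validation set is to scan candidate values and keep the one with the best F1 score. The choice of F1 as the metric, the function name `select_threshold`, the number of scan steps, and the toy densities below are assumptions made for this sketch.

```python
import numpy as np

def select_threshold(y_val, p_val, n_steps=1000):
    """Scan candidate thresholds and return the one with the highest F1 score."""
    best_epsilon, best_f1 = 0.0, 0.0
    for epsilon in np.linspace(p_val.min(), p_val.max(), n_steps):
        pred = p_val < epsilon  # True means "flagged as anomaly"
        tp = np.sum(pred & (y_val == 1))
        fp = np.sum(pred & (y_val == 0))
        fn = np.sum(~pred & (y_val == 1))
        if tp == 0:
            continue  # no true positives, so F1 is zero or undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_epsilon, best_f1 = epsilon, f1
    return best_epsilon, best_f1

# Toy cross-validation set: densities p(x) and labels (1 = anomaly)
p_val = np.array([0.20, 0.18, 0.25, 0.22, 0.001, 0.19, 0.002, 0.21])
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 0])

epsilon, f1 = select_threshold(y_val, p_val)
print(f"epsilon = {epsilon:.5f}, F1 = {f1:.2f}")
```

Accuracy is usually a poor metric here because anomalies are so rare that a model predicting "normal" for everything would still score highly; precision, recall, and F1 are less affected by that imbalance.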