(Code Lab)

Anomaly detection is an unsupervised learning algorithm for finding unusual events.

What's an anomaly?

An example $x$ is considered an anomaly if its probability $p(x)$ is lower than a given small threshold $\epsilon$, i.e. $p(x) < \epsilon$

($x$ is a vector of features)

Assumptions

  • Anomalies in data occur only very rarely
  • The features of data anomalies are significantly different from those of normal instances

Applications

Anomalous data is linked to some sort of problem or rare event such as

  • Hacking
  • Fraud
  • New, unseen defects (previously seen defects can be detected using supervised learning)
  • Textual errors

Algorithm

(ML Spec) Given a training set $\{x^{(1)}, \dots, x^{(m)}\}$ with $m$ examples; each example $x^{(i)}$ has $n$ features

Choose features for anomaly detection

(How to choose features) Choose features, denoted $x_j$, that might be indicative of anomalies

They should

  • be normally distributed. When doing exploratory data analysis, plot their histograms.
  • take on unusually large or small values in the event of anomalies
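If a histogram shows a feature is heavily skewed rather than bell-shaped, it is common to transform it before fitting the Gaussian. A minimal sketch (the transformations and data below are conventional illustrations, not from the note itself):

```python
import numpy as np

# Skewed, non-negative feature (e.g. exponentially distributed)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)

# Candidate transformations to make the histogram more Gaussian-looking
x_log = np.log1p(x)    # log(1 + x): handles zeros safely
x_sqrt = np.sqrt(x)    # milder alternative
x_pow = x ** 0.3       # tunable power transform

def skewness(v):
    """Sample skewness: 0 for a symmetric (e.g. Gaussian) distribution."""
    return np.mean((v - v.mean()) ** 3) / v.std() ** 3

# The transformed versions are noticeably less skewed than the raw feature
```

Plot the histogram of each candidate and keep the transformation that looks closest to a bell curve.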

Fit parameters

  1. Estimate parameters $\mu_j$, $\sigma_j^2$ of the normal distribution for each feature $x_j$ using maximum likelihood estimation: $$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)} \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$
```python
import numpy as np

def estimate_gaussian(X):
    """
    Calculates mean and variance of all features
    in the dataset

    Args:
        X (ndarray): (m, n) Data matrix

    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """
    m, n = X.shape
    mu = 1 / m * np.sum(X, axis=0)
    var = 1 / m * np.sum(np.square(X - mu), axis=0)
    return mu, var
```
  2. Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$ for ALL $n$ features

Density estimation

  1. Compute $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$ for each new example $x$
  2. Anomaly if $p(x) < \epsilon$
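The two steps above can be sketched as follows (a minimal sketch with made-up data; `mu` and `var` are the per-feature parameters fitted earlier, e.g. by `estimate_gaussian`):

```python
import numpy as np

def compute_p(X, mu, var):
    """Product of per-feature univariate Gaussian densities:
    p(x) = prod_j p(x_j; mu_j, var_j)."""
    densities = np.exp(-np.square(X - mu) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(densities, axis=1)

# Fit on (non-anomalous) training data
X_train = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05]])
mu = np.mean(X_train, axis=0)
var = np.var(X_train, axis=0)

# Score new examples; the second row is clearly unusual
X_new = np.array([[1.0, 2.0], [5.0, -3.0]])
p = compute_p(X_new, mu, var)

epsilon = 1e-3            # small probability threshold
anomalies = p < epsilon   # -> flags the second example
```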

Evaluation

Real-number evaluation

(ML Spec) When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.

  1. Fit the model $p(x)$ on a non-anomalous labeled training set
  2. Tune parameters (e.g. the threshold $\epsilon$) on a cross-validation set containing a few anomalies
$$y=\begin{cases} 1 \quad \text{if } p(x) <\epsilon \; \text{(anomaly)} \\ 0 \quad \text{if } p(x) \geq \epsilon \; \text{(normal)} \end{cases}$$
  3. Test on a test set (including a few anomalies)

> [!example]- Eg. Detect flawed aircraft engines
>
> ![](https://i.imgur.com/H8tQVVn.jpg)

An alternative is to have no test set, with all anomalous examples in the cross-validation set. However, this should only be used when there are very few labeled anomalous examples, because having no test set leads to a higher risk of [[overfitting]].

Error analysis

Common problem:

  • We want $p(x)$ to be large for normal examples $x$ and small for anomalous examples, but $p(x)$ is comparable (e.g. both large) for both types ➡️ the algorithm fails to flag some examples as anomalies

[[error analysis]]:

  • Manually look through should-be anomalous examples
  • **Identify additional, unused features that distinguish anomalies from normal examples**
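One common way to tune $\epsilon$ on the labeled cross-validation set is to sweep candidate thresholds and keep the one with the best F1 score (since anomalies are rare, plain accuracy is misleading). A sketch, assuming `p_val` holds $p(x)$ for the cross-validation examples and `y_val` their 0/1 labels:

```python
import numpy as np

def select_threshold(y_val, p_val):
    """Sweep thresholds between min(p_val) and max(p_val) and return
    the epsilon with the best F1 score on the cross-validation set."""
    best_epsilon, best_f1 = 0.0, 0.0
    for epsilon in np.linspace(p_val.min(), p_val.max(), 1000):
        preds = (p_val < epsilon).astype(int)  # 1 = predicted anomaly
        tp = np.sum((preds == 1) & (y_val == 1))
        fp = np.sum((preds == 1) & (y_val == 0))
        fn = np.sum((preds == 0) & (y_val == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_epsilon = f1, epsilon
    return best_epsilon, best_f1
```

For example, with `p_val = [0.9, 0.8, 0.7, 0.01, 0.02]` and `y_val = [0, 0, 0, 1, 1]`, any threshold between 0.02 and 0.7 separates the two groups perfectly, so the sweep returns an epsilon in that range with F1 = 1.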