Iterative loop of ML development (example)
Choose architecture
Split data
Split the data into training, validation, and test sets (see the sketch below)
- Popular splits (train-validation-test):
  - 70-15-15
  - 80-10-10
  - 60-20-20
- Cross-validation (an alternative to a fixed validation split, useful for small datasets)
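A minimal splitting sketch with scikit-learn, assuming a generic array-like dataset (`X` and `y` here are synthetic placeholders); applying `train_test_split` twice yields a 60-20-20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1000 examples, 10 features.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# First carve out the 60% training set, then divide the remaining
# 40% equally into validation (20%) and test (20%).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
```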
Exploratory data analysis
- Examine:
  - the size of the data (number of features, number of training examples)
  - missing data
  - skewness
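A quick pandas sketch of these checks (the DataFrame below is a hypothetical stand-in for the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset stand-in; replace with e.g. pd.read_csv("data.csv").
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 4.0, 100.0],
    "feature_b": [0.1, 0.2, 0.2, 0.3, 0.1],
})

print(df.shape)                     # (number of examples, number of features)
print(df.isna().sum())              # missing values per column
print(df.skew(numeric_only=True))   # skewness of each numeric feature
```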
Specify model
- Specify the model:
  - parameters
  - learning algorithm
- Determine the cost function, including regularization (a sketch follows)
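As an example, the regularized squared-error cost for linear regression could be sketched like this (the usual ridge-style form; `lam` is an assumed name for the regularization strength):

```python
import numpy as np

def regularized_cost(w, b, X, y, lam):
    """Squared-error cost with an L2 (ridge) regularization term:

    J(w, b) = 1/(2m) * sum((X@w + b - y)^2) + lam/(2m) * sum(w^2)
    """
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m) + lam * (w @ w) / (2 * m)
```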
Train model
We may train multiple candidate models; one of them is chosen in the evaluation stage (see the combined train-and-evaluate sketch after the Evaluate step)
- Train on the training set
- Compute training error
Evaluate
- For each model, compute the validation error (= the cost function evaluated on the validation set, or via cross-validation)
- Pick the model with lowest validation error
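A combined sketch of the train and evaluate steps, assuming the candidates are polynomial regressions of increasing degree on synthetic data (a common illustration, not a prescribed choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D regression data (placeholder for a real dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(1000, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=1000)

# 60-20-20 split (kept as simple index slices for the sketch).
X_train, y_train = X[:600], y[:600]
X_val, y_val = X[600:800], y[600:800]
X_test, y_test = X[800:], y[800:]

# Candidate models: polynomial regressions of increasing capacity.
candidates = {d: make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
              for d in (1, 2, 3, 4)}

val_errors = {}
for degree, model in candidates.items():
    model.fit(X_train, y_train)  # train on the training set only
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_errors[degree] = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree}  train={train_err:.4f}  val={val_errors[degree]:.4f}")

# Pick the candidate with the lowest validation error.
best_degree = min(val_errors, key=val_errors.get)
best_model = candidates[best_degree]
```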
Error diagnostics
- Bias-variance analysis
- Error analysis
- For skewed datasets: precision and recall (sensitivity)
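A minimal sketch of precision and recall on a hypothetical skewed label set:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]  # hypothetical skewed labels
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical predictions

# Precision: of the examples predicted positive, how many are truly positive.
# Recall (sensitivity): of the truly positive examples, how many were found.
print(precision_score(y_true, y_pred))  # 2/3
print(recall_score(y_true, y_pred))     # 2/3
```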
Improve (for the next round)
- Add more data for the categories where errors occur (guided by error analysis)
- Transfer learning
Test
- Confirm results on a test set
- Calculate test error (generalization error)
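Continuing the model-selection sketch above (`best_model`, `X_test`, and `y_test` are defined there), the test error is the same cost evaluated once on the held-out test set:

```python
from sklearn.metrics import mean_squared_error

# The test set was never used for fitting parameters or picking a model,
# so this is a fair estimate of the generalization error.
test_err = mean_squared_error(y_test, best_model.predict(X_test))
print(f"test (generalization) error: {test_err:.4f}")
```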
Why don't we fit any parameters to the test set?
This procedure ensures that we haven't accidentally fit anything to the test set, so the test error gives a fresh, fair, and not overly optimistic estimate of how well the model will generalize to new data. BEST PRACTICE: Make all decisions and tweaks to the learning algorithm before touching the test set.