XGBoost is an optimized gradient boosting library that builds tree ensembles, featuring fast training, effective regularization, and tunable hyper-parameters.
Summary of innovations
XGBoost inherits the advantages of decision trees:
- Interpretable
- Non-parametric
Added properties compared to a single decision tree (see the sketch after this list):
- Tree ensemble
- Gradient boosting
- Regularization
- Missing-value handling
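A minimal sketch (assuming the scikit-learn style interface of the `xgboost` package) of how these properties surface as constructor parameters and native NaN handling; the data and parameter values below are invented for illustration.

```python
import numpy as np
import xgboost as xgb

# Made-up data; np.nan marks missing values, which XGBoost routes to a
# learned default direction at each split.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 1.5],
              [4.0, 0.5],
              [3.0, 2.0],
              [0.5, np.nan]])
y = np.array([0, 1, 0, 1, 1, 0])

model = xgb.XGBClassifier(
    n_estimators=20,     # size of the tree ensemble, grown by gradient boosting
    max_depth=2,
    learning_rate=0.3,
    reg_lambda=1.0,      # L2 penalty on leaf weights (part of the complexity term)
    reg_alpha=0.0,       # L1 penalty on leaf weights
)
model.fit(X, y)
print(model.predict(X))
```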
Algorithm
Additive Training (From Gradient Boosting)
It is intractable to learn all the trees at once. Instead, we use an additive strategy: fix what we have learned, and add one new tree at a time. We write the prediction value at step $t$ as $\hat{y}_i^{(t)}$. Then we have

$$
\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
&\;\;\vdots \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}
$$
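As a toy illustration (not XGBoost's internal code), the sketch below accumulates made-up per-round tree outputs $f_t(x_i)$ into the running prediction, mirroring the additive update above.

```python
import numpy as np

# Hypothetical outputs f_t(x_i) of three already-fitted trees on four samples
# (made-up numbers, just to show the additive update).
tree_outputs = [
    np.array([ 0.50,  0.20, -0.10,  0.40]),  # f_1(x_i)
    np.array([ 0.10,  0.00,  0.30, -0.20]),  # f_2(x_i)
    np.array([ 0.05, -0.10,  0.10,  0.00]),  # f_3(x_i)
]

y_hat = np.zeros(4)                  # \hat{y}_i^{(0)} = 0
for t, f_t in enumerate(tree_outputs, start=1):
    y_hat = y_hat + f_t              # \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)
    print(f"step {t}: {y_hat}")
```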
Which tree do we want at each step? A natural choice is to add the one that optimizes our objective function, comprising the loss function $l$ and a regularization term $\Omega$ that defines the model complexity:

$$
\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k) \\
&= \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant}
\end{aligned}
$$

For many losses of interest (mean squared error being the friendly exception), it is not so easy to expand this into a nice form for optimization. In the general case, we take the Taylor expansion of the loss function up to the second order:
$$
\text{obj}^{(t)} = \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + \text{constant}
$$

where

$$
g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad
h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
$$

Here $g_i$ is the first-order derivative (gradient) of the loss function and $h_i$ is its second-order derivative (Hessian), both taken with respect to the previous prediction. ⭐ After we remove all the constants, the specific objective at step $t$ becomes

$$
\sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)
$$

This becomes our optimization goal for the new tree.
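To see the approximation at work, here is a small numerical check (an illustrative sketch, not library code) for the logistic loss: it computes $g_i$ and $h_i$ at the previous score and compares the exact loss at $\hat{y}_i^{(t-1)} + f_t(x_i)$ with its second-order Taylor estimate; the label, score, and tree output are arbitrary example values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss(y, score):
    # Logistic loss l(y, s) = log(1 + e^s) - y*s, with s the raw score before the sigmoid.
    return np.log(1.0 + np.exp(score)) - y * score

y, prev_score = 1.0, 0.3                              # arbitrary label and \hat{y}^{(t-1)}
g = sigmoid(prev_score) - y                           # g_i: first-order derivative
h = sigmoid(prev_score) * (1 - sigmoid(prev_score))   # h_i: second-order derivative

f_t = 0.2                                             # arbitrary new-tree output f_t(x_i)
exact = logloss(y, prev_score + f_t)
taylor = logloss(y, prev_score) + g * f_t + 0.5 * h * f_t**2
print(exact, taylor)   # the two values should be close for small f_t
```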
One important advantage is that the value of the objective function depends only on $g_i$ and $h_i$. This is how XGBoost supports every loss function, including logistic regression and pairwise ranking, by using exactly the same solver that takes $g_i$ and $h_i$ as input!
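As an illustration of this point, the sketch below plugs a user-defined logistic objective into xgboost's low-level training API via the custom-objective hook, which expects a callback returning per-sample gradients and Hessians. The toy data and parameter values are made up, and the callback signature shown follows the classic `(preds, dtrain)` convention, which may vary slightly across xgboost versions.

```python
import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    """Return per-sample g_i and h_i for the logistic loss."""
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw scores
    grad = p - labels                  # g_i: first-order derivative
    hess = p * (1.0 - p)               # h_i: second-order derivative
    return grad, hess

# Made-up toy data, just to exercise the API.
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = (X[:, 0] > 0.5).astype(float)
dtrain = xgb.DMatrix(X, label=y)

# The solver only ever sees the grad and hess returned by logistic_obj.
booster = xgb.train({"max_depth": 2, "eta": 0.3}, dtrain,
                    num_boost_round=5, obj=logistic_obj)
print(booster.predict(dtrain)[:5])   # raw scores (margins) for the first samples
```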