A walk through Gradient Boosting and Lasso Regression for the RuleFit algorithm
Ever wondered if there had been an algorithm that had the ability to include feature interactions in a model like a decision tree and at the same time had the interpretability and simplicity of a Linear Regression? I present to you the RuleFit algorithm, which takes each tree, from a tree ensemble, disassembles it into decision rules and uses them as additional features in a linear Lasso model. This article is a walkthrough of each of the steps that make up the RuleFit algorithm.
Herein, we will go through the algorithm Gradient Boosting for tree generation. Owing to its lack of direct interpretability, the trees generated by the Gradient Boosting method will be used as features to a Lasso regression model. As an answer to this lack of interpretability, the Lasso model will be introduced and finally, we will take a look at the results of the RuleFit algorithm.
Gradient Boosting for Tree Generation
The first step is to consider an algorithm that creates a lot of trees, so let’s take one of the blackbox models which is difficult to interpret directly – Gradient Boosting. Boosting is the process which combines the output of many “weak” classifiers to produce a powerful “committee”. Initially, boosting methods were only used for classification tasks but were later developed to suit regression as well.
Apply a classifier, named classifier 1, on the training dataset. At this point, each of the observation is given a weight wi=1N, i = 1(1)N, where N is the total number of observations. Observations that were wrongly classified by classifier 1 have their weights increased and the ones which were correctly classified have their weights decreased. Another classifier, classifier 2, is fitted on these weighted observations. Again, the misclassified observations are given more weight and vice versa, and classifier 3 is fitted on these newly weighted observations. Say, this process goes on up to M steps. Then the final classifier is a function of the weighted sum of these classifiers (classifier 1, 2, 3 and so on), where the weights are m, m = 1(1)M. These m weights are higher for more accurate classifiers. Since the final classifier is a weighted sum, it has the property of being an additive model. This algorithm is also known as AdaBoost.M1.
Let’s take the entire algorithm step by step to understand the role of the gradient in Gradient Boosting.
Classifier 1 has weights w1 , w2 , w3 , …, wNassociated with its observations.
Classifier 2 has weights w1+1 , w2+2 , w3+3, …, wN+Nassociated with its observations, where iis the change in weight associated with ithobservation.
So, roughly we can write, classifier 2 = classifier 1 + 1, where 1represents the changes that occur upon the observations.
In general, classifier m = classifier (m-1) + m.
Now in the case of gradient boosted trees, the term mis derived from the negative gradient of a suitably chosen loss function.
Lasso Regression for Interpretability
In the Adaboost.M1. model, there is no direct interpretation of the weights associated with the observations at each step of classification or of the weights associated with each of the classifiers. This lack of interpretability follows in a gradient boosted model, but what we do get is a collection of trees or decision rules which can be used as features for an interpretable regression model, and for this purpose, lasso regression is used.
A lasso regression simply penalizes the weights of the features which are not much relevant to the model. So it performs a kind of feature selection and introduces sparsity. But how does it identify which features to choose? One explanation is the following.
Let us consider two features X1and X2 with a target variable Y. It is expected that Y varies with changes in the predictors.
Consider that the following figure shows the distribution of X1and X2 with the Y-axis sticking out of the figure. The variables X1and X2 can be divided into principal components, where a pair of principal components are the axes of a new coordinate system in which every point has a new (X1, X2 ) value. These principal components do not mean anything physically and are just chosen to give one set of axes lots of variation. This technique is also called principal component analysis or PCA in short.
The different principal components are represented in different colors. The principal component which lies in the direction which shows the maximum variance in the data is known as the largest principal component and the smallest principal component lies in the direction of minimum variance.
Now it is obvious that the response variable will tend to vary the most in the direction in which the input data varies the most, i.e., the direction of the largest principal component and the parameters estimated using this direction will have less variance and hence, be more accurate. The inverse of this situation will follow for the smallest principal component.
Lasso regression protects against the potentially high variance of parameters estimated in the short directions, i.e., the components which carry a small variance of the distribution of the feature variables and shrink the coefficients of such components to zero, more than that of the high variance components.
Observations on RuleFit algorithm
One thing to mention is that the original raw features are also added to the Lasso model, as trees are not able to represent simple linear relationships well. The interpretations of the raw features for a RuleFit algorithm are the same as that of a regression model. For decision rules, if all the conditions of a rule r apply, the predicted outcome changes by k, where k is the learned weight for rules r in the model.
Using the RuleFit algorithm on the Bike Sharing Dataset of the UCI ML repository gave at par results with a default parameterized Gradient Boosting Regressor and Random Forest. This adds weight to the notion that you need not be deprived of accuracy for interpretability. There were 140 rules out of 1729 which got a nonzero weight and the ones to get maximum weights were all interaction features.
Thus, the RuleFit algorithm serves as the perfect bridge between the inclusion of interaction and nonlinearity by decision rules and the interpretability of simple linear models. Yet, it remains one of the most unexplored algorithms of the data science world.