During the fourth week at Zipfian, we implemented a few more basic machine learning algorithms from scratch using Python. One of the more interesting ones for me is the decision tree. As a classification technique, the decision tree has quite a number of appealing features, the chief of which are its simplicity and interpretability. Despite its conceptual simplicity, it is actually not that straightforward to implement from a programming perspective. Hopefully, this post can serve as a brief overview on how to program a simple decision tree.
As with many of the other algorithms we’ve learned, the first step is to define a cost function to minimize. In the case of a decision tree, the cost function measures how much disorder (i.e. how mixed the classes are) there is at each branch of the tree. For our implementation, we used a concept called entropy to quantify this disorder:
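For a set $S$ in which a fraction $p_i$ of the observations belongs to class $i$, entropy is defined as:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$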
In general, entropy measures the amount of disorder in a set, and our goal is to minimize entropy after each split. Given the mathematical definition of entropy above, here is one way to write the entropy function in Python:
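A minimal sketch of such a function, assuming the class labels arrive as a one-dimensional NumPy array (variable names other than prob are illustrative):

```python
import numpy as np

def entropy(y):
    """Return the entropy of an array of class labels."""
    # prob holds the probability of occurrence for each class/label in the set
    _, counts = np.unique(y, return_counts=True)
    prob = counts / float(len(y))
    # Plug the probabilities into the entropy formula
    return -np.sum(prob * np.log2(prob))
```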
In the code above, we first calculate the variable prob, which gives the probability of occurrence for each class/label in the set. Then we plug this probability into the entropy formula to derive the total entropy for the set.
Now that we have a general function for calculating entropy, we can think about how to use it to determine the best split. Our assignment provided pseudocode that serves as a roadmap for how to implement a decision tree.
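In outline, that roadmap looks something like the following sketch of the standard recursive procedure (the assignment’s exact wording may have differed):

```
function BuildTree(data):
    if every observation in data belongs to the same class,
       or there are no features left to split on:
        return a leaf node labeled with the (majority) class
    else:
        find the best feature and value to split the data
        split the data into two subsets on that feature and value
        create a node that stores the chosen feature and value
        for each subset:
            call BuildTree on the subset and
            add the result as a child of the node
        return the node
```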
As can be seen from this pseudocode, an important step within the BuildTree function is to “find the best feature and value to split the data”. At this point, we were introduced to a critical concept called information gain, which determines how we should judge the “effectiveness” of a particular split.
Let $P$ denote the original parent set (i.e. the pre-split set). $C$ comprises the “children” sets produced by splitting $P$, while $c$ represents an individual child branch. In our case, since we’re only dealing with binary splits, we only have two children branches in $C$; call them $C_1$ and $C_2$. The information gain for each split is calculated as follows:

$$IG(P) = H(P) - \frac{|C_1|}{|P|} H(C_1) - \frac{|C_2|}{|P|} H(C_2)$$

where $|P|$, $|C_1|$, and $|C_2|$ are the number of observations in the parent and in each child branch. Here, we’ve finally encountered our splitting criterion. $H(P)$ is the entropy of the parent branch $P$, while $H(C_1)$ and $H(C_2)$ are the entropies of child branch $C_1$ and child branch $C_2$, respectively, each weighted by the fraction of the parent’s observations that falls into it.
In short, this formula says that the information gain of a split is equal to the difference between the entropy of the parent branch (before the split) and the combined entropy of the children branches (after the split).
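Using the entropy function sketched earlier, one way to express this calculation in Python might look like the following (the function name and signature are illustrative, not necessarily the assignment’s interface):

```python
def information_gain(parent, children):
    """Information gain from splitting `parent` into the branches in `children`.

    `parent` is an array of class labels for the pre-split set;
    `children` is a list of label arrays, one per child branch.
    """
    n = float(len(parent))
    # Combined (weighted) entropy of the children branches
    children_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - children_entropy
```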
The lower the combined entropy of the children branches, the higher the information gain. So the best split at any node is the one that results in the lowest combined entropy of the children branches (and therefore the highest information gain).
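Putting these pieces together, a hypothetical best-split search (my own sketch, built on the entropy and information_gain functions above) could loop over every feature and every candidate value, evaluate the information gain of the resulting split, and keep the winner. The sketch assumes X is a 2-D NumPy array of features and y a 1-D array of labels, and splits on equality to a value for simplicity; continuous features would use a threshold instead:

```python
import numpy as np

def best_split(X, y):
    """Return (feature_index, value, gain) for the split with the highest
    information gain, or None if no split improves on the parent."""
    best = None
    best_gain = 0.0
    for feature in range(X.shape[1]):
        for value in np.unique(X[:, feature]):
            # Binary split: rows equal to this value vs. everything else
            mask = X[:, feature] == value
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = information_gain(y, [left, right])
            if gain > best_gain:
                best_gain = gain
                best = (feature, value, gain)
    return best
```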
[Talk about recursion next]