To Grow a (Decision) Tree

During the fourth week at Zipfian, we implemented a few more basic machine learning algorithms from scratch using Python. One of the more interesting ones for me is the decision tree. As a classification technique, the decision tree has quite a number of appealing features, the chief of which are its simplicity and interpretability. Despite its conceptual simplicity, it is actually not that straightforward to implement from a programming perspective. Hopefully, this post can serve as a brief overview of how to program a simple decision tree.

As with many other algorithms we’ve learned, the first step is to define a cost function that we will be minimizing. In the case of a decision tree, the cost function determines how disorder (or impurity) is measured at each branch of the tree. For our implementation, we used a concept called entropy to quantify this disorder:
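
H(S) = - Σ_i p_i · log p_i

where the sum runs over each class i present in the set S, and p_i is the proportion of items in S that belong to class i.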

In general, entropy measures the amount of disorder in a set, and our goal is to minimize entropy after each split. Given the mathematical definition of entropy above, here is one way to write the entropy function in Python:

import pandas as pd
import numpy as np
import math

def entropy(y):
    '''
    INPUT: NUMPY ARRAY
    OUTPUT: FLOAT

    Return the entropy of the array y.
    '''
    total = 0
    for cl in np.unique(y):
        # proportion of samples in y belonging to class cl
        prob = np.sum(y == cl) / float(len(y))
        # accumulate p * log(p); the sign is flipped on return
        total += prob * math.log(prob)
    return -total

In the code above, we first calculate the variable prob, which gives the probability of occurrence of each class/label in the set. Then we plug this probability into the entropy formula to get the total entropy of the set.
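
As a quick sanity check (a toy example of my own, not from the assignment), a set containing only one class should have zero entropy, while an evenly mixed two-class set should have the highest entropy possible:

pure = np.array(['a', 'a', 'a', 'a'])
mixed = np.array(['a', 'a', 'b', 'b'])

print(entropy(pure))    # -0.0, i.e. zero: no disorder when every label is the same
print(entropy(mixed))   # 0.693... = ln(2), the maximum disorder for a 50/50 split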

Now that we have a general function for calculating entropy, we can think about how to use it to determine the best split. Below is the pseudocode from our assignment, which provides a roadmap for implementing a decision tree:

'''
function BuildTree:
    If every item in the dataset is in the same class
    or there is no feature left to split the data:
        return a leaf node with the class label
    Else:
        find the best feature and value to split the data 
        split the dataset
        create a node
        for each split
            call BuildTree and add the result as a child of the node
        return node
'''
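
To make the recursive structure concrete, here is a rough sketch of how this pseudocode might translate into Python. This is my own illustration rather than the assignment’s solution: it assumes a simple dict-based node representation and relies on a find_best_split helper, a version of which is sketched later in this post.

from collections import Counter  # numpy (np) is already imported above

def build_tree(X, y):
    '''
    INPUT: 2D NUMPY ARRAY, NUMPY ARRAY
    OUTPUT: DICT

    Recursively build a decision tree and return it as a nested dict.
    '''
    # Base case: every item belongs to the same class,
    # or there are no features left to split on
    if len(np.unique(y)) == 1 or X.shape[1] == 0:
        return {'leaf': True, 'label': Counter(y).most_common(1)[0][0]}

    # Find the best feature and value to split the data
    feature, value, gain = find_best_split(X, y)
    if feature is None:
        # no split improves on the parent's entropy, so make a leaf
        return {'leaf': True, 'label': Counter(y).most_common(1)[0][0]}

    # Split the dataset and call build_tree on each child branch
    mask = X[:, feature] == value
    return {'leaf': False,
            'feature': feature,
            'value': value,
            'children': [build_tree(X[mask], y[mask]),
                         build_tree(X[~mask], y[~mask])]}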

As can be seen from this pseudocode, an important step within this BuildTree function is to “find the best feature and value to split the data”. At this point, we were introduced to a critical concept called information gain, which determines how we should judge the “effectiveness” of a particular split.

The information gain for each split is calculated as follows:

IG(S, C) = H(S) - Σ_{c ∈ C} (|c| / |S|) · H(c)

where S is the original parent set (i.e. the pre-split set), C comprises the “children” sets produced by the split, and c represents an individual child branch. In our case, since we’re only dealing with binary splits, C contains just two child branches, c1 and c2. Here, we’ve finally encountered our splitting criterion: H(S) is the entropy of the parent branch S, while H(c1) and H(c2) are the entropies of child branches c1 and c2, respectively, each weighted by the fraction |c|/|S| of the parent’s items that it receives.

In short, this formula says that the information gain of a split is equal to the difference between the entropy of the parent branch (before the split) and the combined entropy of the child branches (after the split).
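
Translated into code, the formula might look something like this (my own sketch, reusing the entropy function from earlier; the function name and the list-of-children interface are choices I’ve made for illustration, not part of the assignment):

def information_gain(parent, children):
    '''
    INPUT: NUMPY ARRAY, LIST OF NUMPY ARRAYS
    OUTPUT: FLOAT

    Return the information gain from splitting the parent array
    into the given child arrays.
    '''
    n = float(len(parent))
    # combined entropy of the children: each child's entropy is weighted
    # by the fraction of the parent's items that ended up in that child
    children_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - children_entropy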

The lower the combined entropy of the child branches, the higher the information gain. So the best split at any node is the one that produces the lowest combined child entropy and, therefore, the highest information gain.
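
Putting these pieces together, one way to search for that best split is to try every feature and every unique value, compute the information gain of the resulting binary split, and keep the best one. The sketch below is my own: it assumes X is a 2D NumPy array of features, y is the array of labels, and values are compared with == (a simplification that works for categorical features).

def find_best_split(X, y):
    '''
    INPUT: 2D NUMPY ARRAY, NUMPY ARRAY
    OUTPUT: INT, VALUE, FLOAT

    Return the feature index, split value, and information gain of the best
    binary split, or (None, None, 0.0) if no split reduces entropy.
    '''
    best_feature, best_value, best_gain = None, None, 0.0
    for feature in range(X.shape[1]):
        for value in np.unique(X[:, feature]):
            # binary split: rows equal to the value vs. everything else
            mask = X[:, feature] == value
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = information_gain(y, [left, right])
            if gain > best_gain:
                best_feature, best_value, best_gain = feature, value, gain
    return best_feature, best_value, best_gain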

[Talk about recursion next]
