To Grow a (Decision) Tree

Jun 7th, 2014 | Comments

During the fourth week at Zipfian, we implemented a few more basic machine learning algorithms from scratch using Python. One of the more interesting ones for me is the decision tree. As a classification technique, the decision tree has quite a number of appealing features, the chief of which are its simplicity and interpretability. Despite its conceptual simplicity, it is actually not that straightforward to implement from a programming perspective. Hopefully, this post can serve as a brief overview on how to program a simple decision tree.

Like many other algorithms we’ve learned, the first step is to define a cost function that we will be minimizing. In the case of a decision tree, the cost function determines how variance is calculated at each branch of the tree. For our implementation, we used a concept called entropy to calculate this variance:

$H(y) = -\displaystyle \sum_{i=1}^{m} P(c_i)\log_2(P(c_i))$

Gradient Descent the Python Way

May 27th, 2014 | Comments

During our third week at Zipfian, we implemented the linear regression model by using various Python libraries (e.g. statsmodel and scikit-learn). We also coded the normal equation ( $\beta = (X^TX)^{-1}X^TY$ ) directly as a Python function. In addition to using standard library functions to perform linear regressions, we also implemented an optimization technique called gradient descent to approximate the analytical solution for deriving the coefficients of a linear regression. When a matrix is non-invertible, one would have to use gradient descent to arrive at the coefficients of a linear regression since a unique solution does not exist in this case. In fact, gradient descent is a general optimization technique for finding the local minimum of a function and as such, can be applied to many other machine learning situations where an analytical solution is either too cumbersome or is infeasible. So I thought it’d be quite useful to get a better handle on this important tool.

Here’s the general gradient descent algorithm:

$\begin{align*} \theta_{j+1} = \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta) \end{align*}$

where $J(\theta)$ is a function that we want to minimize and $\alpha$ is the learning rate.

Solving the Monty Hall Problem Using Monte Carlo Simulation

Dec 29th, 2013 | Comments

This exercise is from Harvard’s CS109 course.

In the Monty Hall game show, contestants try to guess which of 3 closed doors contain a cash prize (goats are behind the other two doors). Of course, the odds of choosing the correct door are 1 in 3. As a twist, the host of the show occasionally opens a door after a contestant makes his or her choice. This door is always one of the two the contestant did not pick, and is also always one of the goat doors (note that it is always possible to do this, since there are two goat doors). At this point, the contestant has the option of keeping his or her original choice, or swtiching to the other unopened door. The question is: is there any benefit to switching doors? The answer surprises many people who haven’t heard the question before.

We can answer the problem by running simulations in Python.

Exploring NLP in Python

Dec 29th, 2013 | Comments

Came across a great tutorial on the basics of natural language processing (NLP) and classification.

The example in this tutorial uses a Python library called gensim which (according to its website) is the “the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.” As far as I understand it, it’s a very handy and commonly-used library for NLP.

One important tool/algorithm used for classification is something called the Vector Space Model. Basically, all documents or queries are represented as “vectors of identifiers”, such as an index of words and use the angle (theta) between vectors as a similarity measure (explained further later). Incidentally, if you ranked the similarity of each document to a query, you’d have a search engine.

Blog Archives