Machine Learning with Decision Trees

I've been playing around with scikit-learn, Python's machine learning toolkit, over the last couple of weeks, in conjunction with Georgia Tech's Machine Learning course hosted on Udacity.

The course is broken up into three sections:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

I decided to use scikit-learn to build a decision tree that predicts the political party (democrat or republican) of a voter based on that voter's 16 yes/no votes. This is a classification problem, which falls under supervised learning.

The data

Let's look at the data we have to work with. It comes from UCI's machine learning repository, which provides a bunch of high-quality data sets useful for testing machine learning algorithms.

In house-votes-84.data:

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
....

There are 435 rows of data, but some of them contain a ?, which indicates that the person abstained from voting on that measure. Once we remove the incomplete rows, we are left with 232 rows. That should be enough to build a decision tree.
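If you want to sanity-check that count, here's a minimal sketch of the filtering step, assuming house-votes-84.data sits in the working directory:

with open('house-votes-84.data') as f:
    complete = [line for line in f if '?' not in line]
print(len(complete))  # 232 complete records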

Validation

But wait, how will we know if our decision tree is any good? We should split the 232 rows of clean data equally into two sets: a training set and a validation set. In other words, we'll build the decision tree from 116 data points and then check whether it accurately predicts the political party of the remaining 116 voters.
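As a minimal sketch, assuming the complete records live in a list called rows (a name I'm inventing here), an alternating split looks like:

training = rows[0::2]    # even-indexed rows build the tree
validation = rows[1::2]  # odd-indexed rows test its predictions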

Reading the data into scikit-learn

Let's describe our data in machine learning terms. We have 16 features (the different voting measures, e.g. handicapped-infants, water-project-cost-sharing, adoption-of-the-budget-resolution) and 2 classes (the political parties: republican and democrat). We had 435 data points, but some had missing values, so we ended up with 232 usable instances.

By convention, scikit-learn takes a 2D array X, sparse or dense, of size [n_samples, n_features] holding the training samples, and an array Y of integer values, size [n_samples], holding the class labels for the training samples.

We want to turn

republican,n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n
democrat,n,y,y,n,y,y,y,y,n,n,y,n,y,n,y,y

into something along the lines of

X = [[0,0,0,1,1,1,0,0,0,0,0,1,1,1,0,0], [0,1,1,0,1,1,1,1,0,0,1,0,1,0,1,1]]
Y = [0, 1]

where 0 represents n and 1 represents y in X, and 0 represents republican and 1 represents democrat in Y.
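Here's a minimal sketch of that encoding for a single record; the PARTY and VOTE dictionaries are names I'm introducing for illustration:

PARTY = {'republican': 0, 'democrat': 1}
VOTE = {'n': 0, 'y': 1}

record = 'republican,n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n'
values = record.split(',')
y = PARTY[values[0]]               # 0
x = [VOTE[v] for v in values[1:]]  # [0,0,0,1,1,1,0,0,0,0,0,1,1,1,0,0]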

Training

As it turns out, cleaning and loading the data is the bulk of the work. Building the decision tree itself is fairly simple in scikit-learn:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

scikit-learn uses an optimized version of the Classification and Regression Trees (CART) algorithm. There's more information on the scikit-learn page on decision trees.
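If you want to experiment, the classifier exposes a few parameters; for example, you can swap the default Gini impurity for entropy as the split criterion:

clf = tree.DecisionTreeClassifier(criterion='entropy')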

Predicting

While reading the data in, I set aside every other record as validation data. I named the 2D array of votes X_unseen and the 1D array of classes (democrat, republican) Y_unseen to indicate that the learning algorithm never sees them during training.

predictions = clf.predict(X_unseen)

How did we do?

wrongPredictions = 0
for index, val in enumerate(predictions):
    if Y_unseen[index] != val:
        wrongPredictions += 1
print("Correctly predicted %s out of %s instances"
      % (len(Y_unseen) - wrongPredictions, len(Y_unseen)))
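As an aside, scikit-learn can compute this for us: clf.score returns the mean accuracy on the given samples and labels, so the loop above boils down to:

print(clf.score(X_unseen, Y_unseen))  # fraction of correct predictions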

Let's run the program:

$ python house-votes.py
> Total instances: 232
> Training instances: 116
> Validation instances: 116
> Correctly predicted 112 out of 116 instances

That's a success rate of over 96% when predicting the political party of a voter from his or her 16 votes. Pretty neat.

What might be more interesting is plotting the success rate based on the number of training samples. Let's give that a go:
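Here's a rough sketch of how such a plot could be generated with matplotlib, reusing the X, Y, X_unseen, and Y_unseen arrays built above (the step size of 5 is arbitrary):

import matplotlib.pyplot as plt
from sklearn import tree

sizes = list(range(5, len(X) + 1, 5))
accuracies = []
for n in sizes:
    clf = tree.DecisionTreeClassifier()
    clf.fit(X[:n], Y[:n])  # train on the first n samples only
    accuracies.append(clf.score(X_unseen, Y_unseen))

plt.plot(sizes, accuracies)
plt.xlabel('Number of training samples')
plt.ylabel('Validation accuracy')
plt.show()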

As it turns out, we only need to train the classifier on a relatively small sample to get over 90% accuracy.

Examining the decision tree

We can go further and look at the decision tree that was created. It can be used to determine which of the 16 votes are the most polarizing. In other words, if you had to classify a voter as a republican or a democrat based on only a few votes, which would they be?
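One way to inspect it: scikit-learn can export the fitted tree in Graphviz format, which the dot tool can render as an image (assuming Graphviz is installed):

with open('house-votes.dot', 'w') as f:
    tree.export_graphviz(clf, out_file=f)
# then, from the shell: dot -Tpng house-votes.dot -o house-votes.png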

I've annotated some of the house measures in the decision tree above. You can see that out of the 116 training instances, if a voter said no to the physician fee freeze (in other words, X[3] < 0.5), the tree classifies the voter as a democrat. If they approved the physician fee freeze, there are some other votes we need to examine (such as the synfuels corporation cutback) before we can predict their political party.

Try it yourself

Here's the complete code I used. The file house-votes-84.data can be found here.

    from sklearn import tree

    with open('house-votes-84.data', 'r') as inputFile:
        lines = inputFile.readlines()

    X = []          # training samples (votes)
    Y = []          # training labels (party)
    X_unseen = []   # validation samples
    Y_unseen = []   # validation labels

    count = 0
    for line in lines:
        # Skip any record with a missing vote
        if '?' in line:
            continue

        count += 1
        values = line.strip().split(',')

        # Encode the class label: 0 for republican, 1 for democrat
        klass = 1 if values[0] == 'democrat' else 0

        # Encode the 16 votes: 0 for n, 1 for y
        instanceValues = [1 if vote == 'y' else 0 for vote in values[1:]]

        # Set aside every other sample for validation of our decision tree
        # In other words, don't include these rows in our training data
        if count % 2 == 0:
            Y_unseen.append(klass)
            X_unseen.append(instanceValues)
        else:
            Y.append(klass)
            X.append(instanceValues)

    print("Total instances:", count)
    print("Training instances:", len(X))
    print("Validation instances:", len(X_unseen))

    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X, Y)
    predictions = clf.predict(X_unseen)

    wrongPredictions = 0
    for index, val in enumerate(predictions):
        if Y_unseen[index] != val:
            wrongPredictions += 1
    print("Correctly predicted %s out of %s instances"
          % (len(Y_unseen) - wrongPredictions, len(Y_unseen)))