Support Vector Machine

by Daniel Pollithy

Using a support vector machine to classify workdays

I use the same data set as in my last post which is a list of days (weekday, day of the month, month) from the last years labeled with 1 or 0 depending on whether I went to work on that particular day or not.
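The loader from the last post isn't shown here, so as a sketch, here is one way such rows could be built with the standard library, assuming the simple rule that workdays are Monday through Friday (the real labels come from my calendar, not from this rule):

```python
from datetime import date, timedelta

def build_rows(start, n_days):
    """Encode each day as (weekday, day of month, month), labeled 1 for a workday."""
    rows = []
    for i in range(n_days):
        d = start + timedelta(days=i)
        features = (d.weekday(), d.day, d.month)  # weekday: 0=Monday .. 6=Sunday
        label = 1 if d.weekday() < 5 else 0       # assumed rule: workday = Mon-Fri
        rows.append((features, label))
    return rows

rows = build_rows(date(2017, 1, 2), 7)  # 2017-01-02 is a Monday
```

Splitting the tuples into a feature matrix X and a label vector y then gives the shape scikit-learn expects.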

Objective: Train a model that can discriminate weekdays from days of the weekend (see the last post for more details).

Support Vector Machine

If you want to classify the data points in this image, you can draw a line that defines a border: all points on one side belong to one class, all points on the other side to the other class. In the image above, only the "important" points have a thick border. They are called the support vectors.

The intuition behind the support vector machine is to fit the line (the decision boundary) with the greatest possible distance to the support vectors of both classes.
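This idea can be demonstrated on a tiny synthetic example (the cluster coordinates below are made up for illustration): after fitting a linear SVC, only the points closest to the boundary end up as support vectors.

```python
import numpy as np
from sklearn import svm

# Two linearly separable clusters in 2D (toy data, not the workday set).
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
              [3.0, 3.0], [3.5, 2.5], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# Typically only the points nearest the boundary are kept as support vectors.
print(clf.support_vectors_)
```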

The data

This plot shows the weekdays on the x-axis in relation to the work days (red) and non-work days (blue). It is already visible that the two classes can be differentiated well on weekdays.

Training

We use svm from scikit-learn. (See http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html)
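The linked example also defines two small plotting helpers, make_meshgrid and plot_contours, which the code below relies on; roughly:

```python
import numpy as np

def make_meshgrid(x, y, h=0.02):
    """Create a grid of points covering the data range (for plotting)."""
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    """Predict a class for every grid point and draw filled contours."""
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    return ax.contourf(xx, yy, Z, **params)
```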

from sklearn import svm
from sklearn.metrics import accuracy_score

X, y = get_training_data()
X_test, y_test = get_test_data()

# Use lists (not generator expressions) so the fitted models
# can be iterated over again for scoring and plotting.
models = [svm.SVC(kernel='linear'),
          svm.LinearSVC(),
          svm.SVC(kernel='rbf', gamma=0.7),
          svm.SVC(kernel='poly', degree=3)]
models = [clf.fit(X, y) for clf in models]

scores = [accuracy_score(y_test, clf.predict(X_test)) for clf in models]

Now we have got four different models.

Let’s visualize them:

import matplotlib.pyplot as plt

titles = ('SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel')

# Set up a 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2)

X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

for clf, title, ax, score in zip(models, titles, sub.flatten(), scores):
    plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Days of the week')
    ax.set_ylabel('Day of the month')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title + " accuracy={}".format(score))

plt.show()

Running the training and cross-validation on a total of ~250 rows already takes four minutes on my average laptop.

Evaluation

The result looks really interesting (note that the axis labels should be swapped): the SVC with linear kernel reaches an accuracy of 74%, i.e. an error of 26%.

The linear kernel produces a line that perfectly separates Monday through Friday from Saturday and Sunday. Using this SVM alone would be enough to meet the objective.

The radial basis function (RBF) kernel does an incredibly good job of fitting the small clusters in the data, but I suspect these patterns won't reappear in the test data.
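That suspicion is the classic overfitting behavior of the RBF kernel: its gamma parameter controls how tightly the decision surface can wrap around individual points. A synthetic sketch (made-up data, not the workday set) makes the effect measurable:

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)
y_train = (X_train[:, 0] > 0).astype(int)
flip = rng.rand(200) < 0.1          # inject 10% label noise
y_train[flip] = 1 - y_train[flip]
X_test = rng.randn(200, 2)
y_test = (X_test[:, 0] > 0).astype(int)

# A huge gamma shrinks the kernel's radius until every training point
# sits in its own tiny island -- the model memorizes the noise.
overfit = svm.SVC(kernel='rbf', gamma=100.0).fit(X_train, y_train)
smooth = svm.SVC(kernel='rbf', gamma=0.7).fit(X_train, y_train)

print("train acc (gamma=100):", overfit.score(X_train, y_train))
print("test  acc (gamma=100):", overfit.score(X_test, y_test))
print("test  acc (gamma=0.7):", smooth.score(X_test, y_test))
```

The large-gamma model scores almost perfectly on its own training data yet clearly worse on fresh points, while the moderate gamma generalizes.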

Train and evaluate again

I don't expect a correlation between the day of the month and the weekday, but maybe there is one between the month and the day of the week. This time, three of the support vector classifiers manage to draw a good decision boundary for the weekend, which results in a good score on an artificial weekend data set. However, the error on the normal test set is higher than before.

Last try

The last possible correlation we could explore is between the day of the month and the month. Three of the classifiers fail to find a decision boundary, but the support vector classifier with the radial basis function kernel produces an interesting camouflage pattern that performs as well as the weekend boundary found by the SVC with linear kernel (SVC with RBF kernel: 0.7530864197530864).