A Guideline for Statistical and Machine Learning

June 24, 2014

There is a lot of literature, in books and on the web, about the details of machine learning algorithms: for example, how to calculate the centroids in k-means, the distances in k-NN, or the coefficients of a linear regression. However, there isn’t much material on why and how to pick the best algorithm for a particular use case.

I find this a bit ironic: one of the goals of ML is to let you see the big picture, yet the procedures available today for selecting an ML technique focus on the trees rather than on the forest.

The selection of an ML algorithm or model should be driven by two main things: the input data (the observations in your sample) and the question you would like to answer (your goal). For example, if your observations are all numeric, then more likely than not you will be applying a regression, regardless of anything else. Likewise, if your goal is to spot an anomaly, then using a decision tree won’t be very helpful.
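To make the idea concrete, here is a minimal sketch of how those two drivers could be turned into a shortlist. It encodes only the two rules of thumb from the paragraph above. The function name `shortlist`, the `all_numeric` flag, and the goal label `"spot_anomaly"` are hypothetical names I made up for illustration, not part of any library.

```python
def shortlist(all_numeric: bool, goal: str) -> dict:
    """Return model families to consider and to avoid.

    A rough sketch, not a complete decision procedure: it only
    covers the two examples given in the text.
    """
    consider = []
    avoid = []

    # Rule of thumb 1 (input data): all-numeric observations
    # most likely point to a regression.
    if all_numeric:
        consider.append("regression")

    # Rule of thumb 2 (goal): a decision tree is not much help
    # when the goal is to spot an anomaly.
    # "spot_anomaly" is a hypothetical label for that goal.
    if goal == "spot_anomaly":
        avoid.append("decision tree")

    return {"consider": consider, "avoid": avoid}


if __name__ == "__main__":
    print(shortlist(all_numeric=True, goal="spot_anomaly"))
```

Keeping "consider" and "avoid" as separate lists lets the data and the goal each contribute independently, so neither rule has to override the other.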

I tried to summarize some of these decisions in the following slides:

Further, here is a simple (and somewhat naive) flow chart describing the steps:

This is by no means complete, particularly the section on unsupervised models; it is really just an initial effort. All feedback is very welcome and will be considered.