A Guideline for Statistical and Machine Learning

There is a lot of literature in books and in the web around the details of machine learning algorithms, for example, on how to calculate the centroid of k-means, or the distance of k-NN, or the coefficients of linear regressions, however there isn’t a lot of material around why and how to pick the best algorithm for a particular use-case.

I find this to be a bit of an irony, as one of the goals of ML is to allow you to see the big picture, yet the procedures available today for selecting a ML technique focus on the tree, rather than on the forrest.

The selection of a ML algorithm or model should be driven by two main things: the input data, or observations of your sample, and the question you would like to answer, or goal. For example, if your observations are all numeric, then more likely than not you will be applying a regression regardless of anything else, likewise if your goal is to spot an anomaly, than using a decision tree won’t be very helpful.

I tried to summarize some of these decisions in the following slides:

Further, here is a simple (and somewhat naive) flow chart describing the steps:

This is by no means complete, particularly the unsupervised models section, and really just an initial effort. All feedback is very welcomed, and will be considered.


3 Responses to A Guideline for Statistical and Machine Learning

  1. Prabhu says:


    • Alexandre Alves says:

      Very good, I particularly liked the categorization of ‘looking for structure (clustering)’ in addition to ‘quantity (regression) and category (qualification)’. Thanks for sharing.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: