A Guideline for Statistical and Machine Learning

June 24, 2014

There is a lot of literature in books and in the web around the details of machine learning algorithms, for example, on how to calculate the centroid of k-means, or the distance of k-NN, or the coefficients of linear regressions, however there isn’t a lot of material around why and how to pick the best algorithm for a particular use-case.

I find this to be a bit of an irony, as one of the goals of ML is to allow you to see the big picture, yet the procedures available today for selecting a ML technique focus on the tree, rather than on the forrest.

The selection of a ML algorithm or model should be driven by two main things: the input data, or observations of your sample, and the question you would like to answer, or goal. For example, if your observations are all numeric, then more likely than not you will be applying a regression regardless of anything else, likewise if your goal is to spot an anomaly, than using a decision tree won’t be very helpful.

I tried to summarize some of these decisions in the following slides:

Further, here is a simple (and somewhat naive) flow chart describing the steps:

This is by no means complete, particularly the unsupervised models section, and really just an initial effort. All feedback is very welcomed, and will be considered.


Misconceptions on Machine Learning

March 5, 2014

Computer science is all about modeling. If one has the right model for a problem, then solving the problem becomes an engineering exercise.

Part of the work of creating a model is defining the right set of abstractions to work with, yet here is where the problem lies, if one uses these abstractions without fully understanding the underlying model and foundations. There is no better example of this problem then in the hot field of machine learning.

A simple definition of machine learning is the application of algorithms that attempt to learn with past input to predict future results.

We can see this everywhere today, from Facebook’s suggestions of new friends, Amazon’s product proposals, to LinkedIn’s job offers. Compound this now with Big Data. Due to this high demand, several libraries and tools have been created to support the use of machine learning.

This is all goodness, except that people are using these tools in some cases to provide critical business services without taking care of learning the underlying (statistical) models that support them. I have seen this over and over again. I believe other people also have, such as my friend Opher in this blog.

Let’s take a look at a few naive examples to illustrate the issues.

In our first example, we would like to predict a person’s weight from the person’s height. The height is what is called the explanatory or independent variable, and the weight is our predicted, response, or dependent variable. Following, we have a sample input for this data:

    height (cm) weight (kg)
1  180.0               77.0
2  177.8               69.9
3  175.0               69.5
4  170.0              68.0

Note that I am using real subjects for this data, but I can’t name them otherwise some people (at my immediate family) may get a bit upset. 🙂

There is a general heuristics in science known as Occam’s razor that says one should always try the simplest approach to solving a problem. This may sound ridiculously obvious, but believe me if you are used to statistics, it is not really that surprising that one actually had to explicit state this.

Anyway, adhering to this heuristics, one of the simplest model, which actually works well in general and is widely employed, is what is known as the linear model. A linear model is nothing more than finding a linear equation of the form y = intercept + slope * x (in the case of a single independent variable) that fits well to the input data. There are several algorithms that do this, one of them being the least square algorithm.

Using our example, we can ‘fit’ a linear model to the input data as follows in R:

lm(formula = weight ~ height, data = example1)

And if we drill down into the result:

(Intercept)       height  
-59.8261       0.7452

Or, if replacing in the original formula:

weight = -59.8261 + 0.7452 * height

A careful analysis of this would show that the coefficients are a bit off, however keep in mind that I trained my model using a sample of only 4 entries. This is a very small sample indeed, typically one fits their models using hundreds, or thousands, or even millions of elements. In fact, there is a field in statistics called power analysis that helps you define how large your sample has to be so as that you can have a certain level of confidence, say 95%, in your results. There are many variables to this calculation, but surprisingly enough the minimum number needed is generally not very large, sometimes just in the two digits range.

Regardless, let’s use this model to try to predict the weight of another subject whose height is 150 cm. This is done as follows in R:

predict(fit, list(height = 150))
-> 51.94918

As it stands, I know for a fact that this person’s weight (she may deny it) is around 52-53 kg, so in spite of everything our model has done remarkable well, wouldn’t you say?

Well, let’s now try this approach with a different scenario.

Say we have a sample with the number of frauds that has happened in a particular location as per its zip code. Here is the sample:

  zipcode fraud
1   91333    13
2   91332    10
3   91331     5
4   91330    15

For example, in the zip code 91333, there has been 13 frauds (in the last month). Could we create a linear model for this data, for example, in the form of:

fit2 <- lm(fraud ~ zipcode, example2)

Well, let’s try to use it:

predict(fit2, list(zipcode=91000))
-> 43.9
predict(fit2, list(zipcode=91334))
-> 10.5

Obviously, the results make absolutely no sense. What went wrong? Several things, to begin with there is really no linearity between the variables zip code and the number of frauds. But there is even a more subtle problem, whenever you are using linear models, there is a requirement that the variances (actually, the residuals) of the predicted variable be normally distributed. What this means is that, for example, your predicted variable cannot be a ‘count data’, like we have in this fraud example, or a binary true or false response, or even a categorical value, like high/medium/low, as none of these have a normal distribution. Even more than that, say that in our first example, we had included the measures of children alongside the measures for adults. As the variance for the weight and height correlation is different for adults and children, they would not be normally distributed, and hence wouldn’t work well for our linear model.

What this means is that it is not enough to simply apply a machine learning algorithm to your data, first and foremost you must understand your data and its distribution, which is not a trivial task.

The adoption of machine learning is very welcoming, however one must not forget the human aspect of choosing the right models and algorithms.