Vectorized Functional Programing

January 27, 2014

There is a lot of new-born enthusiasm for functional programming, especially with the introduction of Lambda expressions in Java 8.

Functional programming is indeed quite interesting, but let me introduce this through an example.

Recently, I needed to calculate the correlation between a set of variables representing different characteristics of vehicles. For example, I wanted to find out if the weight (wt) of a vehicle influenced negatively (or positively) its miles per gallon (mpg), or if there is a relationship between the number of cylinders (cyl) of a vehicle and its horse-power (hp).

Having done that, I generated a table whose rows listed the variables that are strongly correlated. Here is an example of the output:

        x      y      corrCoef        prob
1  mpg   cyl            -0.85      6.11e-10
2  mpg disp           -0.85      9.38e-10
3  mpg   wt            -0.87      1.29e-10
4  cyl   disp             0.90      1.80e-12
5  cyl      hp            0.83       3.47e-09
6  cyl       vs          -0.81        1.84e-08
7 disp     wt           0.89        1.22e-11

My next task is to sort this table by decreasing order of correlation. In other words, I would like to see first those variables that correlate the most. Yet, note that it doesn’t matter if it is a positive (direct correlation as in the case of number of cylinders vs HP) or a negative (indirect correlation as in the case of mpg vs weight) correlation, in other words, I want to sort the absolute value of the correlation coefficients (column 3).

Here is where it starts to get interesting. Considering this table to be named as carsCorrTable, I can do this with the following expression:

carsCorrTable <- carsCorrTable[order(abs(carsCorrTable[,3]), decreasing=TRUE),]

Let’s distill this:

  1. abs(carsCorrTable[,3]): apply the absolute function to all members of the third column of carsCorrTable
  2. order(abs(carsCorrTable[,3], decreasing=TRUE): order in decreasing values
  3. carsCorrTable[order(abs(carsCorrTable[,3]), decreasing=TRUE),]: select the ordered rows
  4. carsCorrTable <- carsCorrTable[order(abs(carsCorrTable[,3]), decreasing=TRUE),]: reassign to the original table

Note the conciseness and powerfulness of the original expression! It all boils down to applying several functions to vectors of values. In other words, it is the combination of functional programming and vector-based programming together.

In my mind, a programming model has the right fluency for a task if you are able to code the task in an intuitive manner where you were sure it would not work the first time around, and yet it does! This is exactly what happened to me in this case, even though I hadn’t done this before, I was able to come up with the previous expression rather quickly and (perhaps because of it) was expecting to have to spend the next few hours debugging it, yet it simply worked as expected.

Consider how you would have to do this using your general imperative language. You would need to iterate through each correlation coefficient and calculate its absolute value, then again iterate through the result to sort them, and perhaps do a final iteration to assign the sorted values back to the table. I didn’t have the heart to implement this, but I bet you it wouldn’t look nice in Java 7, C++, etc.

Fortunately, Java 8 introduces lambda expressions and Iterable.forEach(), which should yield nicely clean programs as the previous one. Interestingly, note how in this example I used R, which has been available for over two decades now. (Granted, not sure it would have mattered to introduce these features before, would the community be matured enough to use it? Is it now?)

There is still one very interesting and fundamental question to this, which is how does it compare the execution time (i.e. time complexity) of running this small task in its vectorized functional form and in its imperative form. Would it be the same? Would the vectorized implementation be able to execute in parallel as expected, and therefore yield better results? Or would it eventually all boil down to nested iterations with some polynomial order of growth?