Map-Reduce with JDK8, Clojure, Scala, Spark, and yes, Lisp!

August 19, 2014

With the recent addition of Streams and Lambda expressions in JDK8,  a developer can finally adopt a functional programming style in Java, and, for example, implement algorithms that make use of the popular Map-Reduce design pattern natively with the standard Java SDK.

Starting with the examples provided by the JDK 8 Stream tutorial, let’s explore the use of functional programming and applications of the Map-Reduce design pattern.

First, let’s define a simple list-based collection of persons:

roster = new LinkedList<Person>();
roster.add(new Person("Lucas", Person.Gender.MALE, 30));
roster.add(new Person("Juliana", Person.Gender.FEMALE, 25));
roster.add(new Person("Gabriel", Person.Gender.MALE, 8));

Next, let’s associate two mapping functions to these collection, the first function filters each Person element by its gender, and another mapping function maps a Person to their name (i.e. one could think of this as a projection function):

roster.stream().
    filter(Person::isMale).
        forEach(e -> System.out.println(e.getName()));

The result of running this program is:

Lucas
Gabriel

Rather than iterating through the collection, we traverse its stream representation (i.e. roster.stream()). This decouples the application of our mapping functions from the actual collection data structure, therefore the function application can be done in different ways, for example, serially or concurrently. This is know as the characteristics of the stream and allows for the optimization of the mapping and reduction implementation.

The previous code made use of two mapping functions, the filter() and the forEach() functions. Both of these take a function as their arguments. This is what is known as high-order functions. In the case of the example, we do this in two different forms, by referencing to the isMale() method of the Person class, and by creating an anonymous function at the time it is needed. This latter form is known as a lambda abstraction, and in Java, takes the form of input-arguments -> function-body.

In our next example, let’s incorporate a reduction function into the mix:

double sum =
roster.stream()
    .filter(p -> p.getGender() == Person.Gender.MALE)
        .mapToInt(e -> e.getAge() /*Person::getAge*/)
            .reduce(0, (a, b) -> a + b);

The reduce() function takes a binary function, that is, a function that has two input arguments or operands. The binary function we use is sum. The first argument is the result of a previous sum function or 0 if it is the first application of sum. This allows us to associatively add together the elements of our stream. For example, if the stream has elements 1, 2, 3, 5, 8, then one particular instance of invocations of reduce is:

(0, 1) -> 1
(1, 2) -> 3
(3, 3) -> 6
(6, 5) -> 11
(11, 8) -> 19

As expected, the result of running the previous program is:

38.0

If one looks deeper into the JDK8 API, one can see that it has been really well thought of and organized. There are useful abstractions for Unit-returning (void) functions, called Consumers, Tuple-returning Producers, n-ary functions, etc.

Would one also say this is novel? I would hardly say so; consider how this same example can be implemented in Lisp, a language developed by John McCarthy in the late 1950s (that’s more than 60 years ago!).

First, let’s define our list-based Person container in Lisp:

(defparameter *roster* '((Lucas male 30)(Juliana female 25)(Gabriel male 8)))

And the equivalent of our isMale() method strictly as a function:

(defun ismale (x) (if (eq (second x) 'male) t nil))

And, finally, the equivalent of our map to int and reduce by summing:

(reduce '+
    (mapcar (lambda (x) (third x))
        (remove-if-not 'ismale *roster*)))

The prefix  form may strike some as odd, but the general structure is essentially the same. That is, the application of the reduce of ‘+’ to the result of a mapping to int to the result of a filter to the predicate isMale to the roster collection. In Lisp, a lambda abstraction is defined as lambda (input-arguments) (function-body).

One could even argue that the Lisp program is even cleaner (and more compact) than the Java one!

Lisp is not the only option. In fact, Java itself has at least another two options with Clojure and Scala, as demonstrated next:

Clojure:

(def *roster* '((Lucas male 30)(Juliana female 25)(Gabriel male 8)))
(defn ismale [x] (if (= (compare (second x) 'male) 0) true nil)
(reduce +
    (map (fn [x] (nth x 2))
        (filter ismale *roster*)))

Scala:

var roster = List(List("Lucas", "male", 30),List("Juliana", "female", 25),List("Gabriel", "male", 8))
def isMale(x:List[Any]) = {if (x(1) == "male") true else false}
roster.filter(isMale).
    map((x:List[Any]) => x(2).asInstanceOf[Int]).
        reduce((x:Int, y:Int) => x + y)

The syntax of a lambda expression in Clojure is fn [input-arguments] (function-body) and in Scala it is (input-arguments) => function-body.

Clojure has a closer resemblance to Lisp, where as Scala with its static typing and its infix form reminds us of the imperative programming style in Java.

If one is used to Big Data and Hadoop, then another option is Spark. For sake of completeness, here is the Spark version of the use-case:

val roster = sc.textFile("examples/Roster.txt") // space-delimeted text file with one Person by line
roster.filter(line => line.contains(" male")).
    map(line => new Integer(line.split(" ")(2))).
        reduce((a,b) => a + b)

Historically, whenever the Java SDK picks up a new API, it tends to surpass other existing forms of this same API. For example, see how java.util.logging is more predominant today than log4j. This is only natural, due to the easy availability of the SDK. Considering this, I wonder what will be the implications of java.util.stream to the Apache Hadoop API.

I would definitely appreciate being able to use the SDK to author a Map-Reduce application, and then simply configure its Characteristics to DISTRIBUTED to have it run over a Hadoop cluster, wouldn’t you?


A Guideline for Statistical and Machine Learning

June 24, 2014

There is a lot of literature in books and in the web around the details of machine learning algorithms, for example, on how to calculate the centroid of k-means, or the distance of k-NN, or the coefficients of linear regressions, however there isn’t a lot of material around why and how to pick the best algorithm for a particular use-case.

I find this to be a bit of an irony, as one of the goals of ML is to allow you to see the big picture, yet the procedures available today for selecting a ML technique focus on the tree, rather than on the forrest.

The selection of a ML algorithm or model should be driven by two main things: the input data, or observations of your sample, and the question you would like to answer, or goal. For example, if your observations are all numeric, then more likely than not you will be applying a regression regardless of anything else, likewise if your goal is to spot an anomaly, than using a decision tree won’t be very helpful.

I tried to summarize some of these decisions in the following slides:

Further, here is a simple (and somewhat naive) flow chart describing the steps:
Image

This is by no means complete, particularly the unsupervised models section, and really just an initial effort. All feedback is very welcomed, and will be considered.


Demystifying Linear Regressions as a Tool for Inference

April 17, 2014

Linear regressions are one of the simplest algorithms for predicting quantitative responses. In fact, some people may even consider it dull when compared to other advances approaches, like Support Vector Machines. However, I find that not only linear regressions provide good average prediction results, but, more importantly, its simplicity and transparency makes it an ideal tool for trying to understand the data in itself, that is, the relationship between the predictor (explanatory) variables and the responses. In this sense, linear regressions are better suited as a mechanism for inferring the data, rather predicting it.

The best way of arguing for this is by going through an example. As usual, let’s use R, and load the mtcars data-set (1974 Motor Trend US magazine comprising of the fuel consumption and 10 aspects of automobile design and performance for 32 automobiles):

> head(mtcars)
                   mpg cyl disp hp drat wt    qsec vs am gear carb
 Mazda RX4         21.0 6  160 110 3.90 2.620 16.46 0 1  4    4
 Mazda RX4 Wag     21.0 6  160 110 3.90 2.875 17.02 0 1  4    4
 Datsun 710        22.8 4  108 93  3.85 2.320 18.61 1 1  4    1
 Hornet 4 Drive    21.4 6  258 110 3.08 3.215 19.44 1 0  3    1
 Hornet Sportabout 18.7 8  360 175 3.15 3.440 17.02 0 0  3    2
 Valiant           18.1 6  225 105 2.76 3.460 20.22 1 0  3    1

Next, let’s create a simple linear regression model for this data. We will use just some of the predictor variables that seem more important, like number of cylinders (cyl), horsepower (hp), weight (lb/1000), 1/4 mile time (qsec), and number of gears (gear). We construct the linear regression model by specifying the response variable (mpg) and the predictor variables (cyl + hp + wt + qsec + gear). This is calling fitting the model to the training data. Here is an example:

> fit = lm(mpg ~ cyl + hp + wt + qsec + gear, data=mtcars)
> summary(fit)

Call:
lm(formula = mpg ~ cyl + hp + wt + qsec + gear, data = mtcars)

Residuals:
Min 1Q Median 3Q Max
-3.3969 -1.5852 -0.5171 1.0712 5.5914

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.96517 15.13161 1.782 0.08643 .
cyl -0.45775 0.83952 -0.545 0.59023
hp -0.01808 0.01671 -1.082 0.28923
wt -3.41354 1.02454 -3.332 0.00259 **
qsec 0.38753 0.55312 0.701 0.48975
gear 0.72536 1.13460 0.639 0.52821

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.576 on 26 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8173
F-statistic: 28.74 on 5 and 26 DF, p-value: 8.227e-10


The sheer amount of information is overwhelming, which tends to make people ignore the details. As we shall see, all the data in the fitted model are equally important, and need to be considered. Let’s take this in parts.

First, let’s assess the accuracy of the fitted model by using the metrics located at the end of the summary:

Residual standard error: 2.576 on 26 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8173

Both the residual standard error (RSE) and the R-squared metrics represent how far are the estimated responses from the actual responses (when considering the training data). The RSE value of 2.57 indicates that on average the estimated response deviates in 2.57 mpg from the actual response.

For example, let’s apply this to the first row of the data, which is for ‘Mazda RX4′:

cyl hp wt qsec gear -> mpg
6 110 2.620 16.46 4 -> 21.0

If we apply the coefficients generated by the fitted model to the first row of ‘Mazda RX4″, we get the estimated mpg of 22.5669. The actual response should have been 21.0 (as it can be seen in the data itself), hence its residual (deviation) is 1.5669 (i.e. 22.5669 – 21.0), which is pretty close to the RSE of 2.57.

If you missed it, how did we exactly come up with the estimated mpg response of 22.57? This is an example of a prediction, and really just means applying the values of the predictor variables (i.e. cyl, hp, wt, qsec, gear) to the estimated coefficients provided by the regression model (i.e. -0.45, -0.01, -3.41, 0.38, -0.72). In the case of ‘Mazda RX4′, the values are respectively 6 (cyl), 110 (hp), 2.62 (wt), 16.46 (qsec), and 4 (gear). Hence, the calculation becomes estimated-MazdaRX4-mpg = 6 * -0.45775 + 110 * -0.01808 + 2.62 * -3.41354 + 16.46 * 0.38753 + 4 * 0.72536 + 26.96517. The last number is the intercept and represents the baseline when all other predictor variables are zero. It is like saying that the mpg would be 26.96 should the car have no cyl, hp, wt, etc. In this particular case, it obviously makes no sense, but in other cases, it does actually help you establish a baseline. Of course, there is an automated way of predicting in R, which yields exactly the same value as the above calculation:

> predict(fit, data.frame(cyl=c(6), hp=c(110), wt=c(2.62), qsec=c(16.46), gear=c(4)))
22.5669

Now that you understand the concept of residuals, the Residuals summary in the beginning of the model summary should also make sense:

Min        1Q         Median   3Q      Max
-3.3969 -1.5852 -0.5171   1.0712 5.5914

This is saying that in the worst case there is a deviation of 5.59 in the mpg response, and that most of the estimates deviate between -1.5 to 1.07 mpg from the actual responses.

Let’s get back to the RSE. The issue with RSE is that it is a relative metric. Is a residual of 2.57 good or bad? To answer this question, you need to consider its unit, which in this case is mpg, and the general context of the problem. I don’t know much about cars, so it is hard to say if a 2.57 error is a large error or something that can be ignored. To answer this, let’s look at the R-squared. The R-squared is a proportion, it measures how well the model fits the data by comparing the residual variance to the total variance of the training data. In other words, it verifies if the variance of the estimated response is related to the fitted model or is it already inherent in the response before the regression is performed. The R-squared ranges from 0 to 1. A value close to 1 means that there is a good fit, and conversely a value close to 0 means that perhaps a linear model is not a good model for the data, and some other approach should be tried. In this example, the value of 0.81 indicates that we have the right model, but there is some room for improvements.

Finally, the F-statistics and p-value are used to determine if there is correlation between the predictor variables and the responses. For easy of reference, here are their values in our example:

F-statistic: 28.74 on 5 and 26 DF, p-value: 8.227e-10

What we are trying to establish is if the variance in the estimated response is just a matter of chance, or does it really relate somehow to the predictor variables? In mathematical terms, if there is no correlation, then it means that the coefficients all tend to be zero (i.e. if y = coef * x, and if y doesn’t vary with changes to x, then coef must be equal to zero), and the variances seen in the estimated responses are related to the standard deviation itself of the actual responses. In this case, F-statistics becomes the ratio of the standard deviation (or rather the square) by itself and is equal to 1 or a small number close to 1. If there is correlation, then F-statistics is some other higher number. The p-value is the probability of getting this F-statistics by chance. If it is a very small value, generally below .05, it means that the likelihood of just being unlucky and getting this same value is very low, and therefore unlikely.

To summary it, we are looking for:

  • R-squared close to 1, and
  • F-statistics higher than 1, and
  • p-value very low (i.e. < 0.05)

If this is the case, then it means that there is some (linear) correlation between the predictor variables and the response and the linear regressions is likely doing a good job of fitting the training data.

Having established that the linear regression model is good, next let’s look at what it gives us. We do this by considering the predictor variable coefficients and its parameters. Let’s start with the wt variable:

      Estimate  (coef)  Std. Error    t value   Pr(>|t|)
hp -0.01808              0.01671       -1.082    0.28923

This is saying that (holding all other variables constant) a change to the motor horsepower causes a -0.018 change to miles per gallon consumption. In math terms, this is equivalent to mpg = -0.018 * hp + others. This in itself is very interesting information. It tells us that if we were to increase the motor’s hp by 100 for the Mazda RX4, then its mpg will decrease from 21.0 to 19.2! This gives us a great tool for inference and analysis.

However, we must be careful, and again consider the full set of information provided before taking full measures. For example, the model tells us that there is a chance of error for this coefficient of 0.016. In other words, the correct coefficient could vary, in average, by 0.016 from the starting value of -0.018. The t-value tells us that the standard error is high in terms of the coefficient value we have. This is correct, the error is almost as much as the coefficient itself, hence the t-value close to 1. As rule of thumb, we are looking for high t-values. The expectation is that the coefficient varies in the range of [coef - 2 * SE, coef + 2 * SE]. This represents a confidence level of 95%. If you consider that the t-distribution is similar to a normal distribution, and that in a normal distribution, 95% of the values are within 2 standard deviation, then it makes sense to think that you have a 95% confidence that the coefficient value varies within two standard errors. In this case, it means that the hp coefficient can go from -0.05 to 0.014, which is not very assuring. Further, the Pr value says that there is 28% chance of us having gotten this coefficient by chance, that is, rather than because of correlation between hp and mpg. Again, this is not very assuring, we would generally like a Pr of less than 5%, that is, Pr < 0.05.

Let’s look for other predictors that do better. Luckily, R simplifies this for us by placing increasing levels of ‘*’ next to those variables that seemly correlate better. In our case, this is the wt predictor variable:

       Estimate    Std. Error    t value    Pr(>|t|)
wt  -3.41354     1.02454       -3.332     0.00259 **

The model tells us that there is a 0.2% chance of wt and mpg NOT being correlated, which is very low. And the t-value is also reassuringly high. Hence, it seems safe enough for us to assume that an increase in weight in average causes a decrease of 3.4 in mpg. Again, very powerful and useful information!

To confirm this finding, let’s plot the values of wt versus mpg for the training data, and then draw a line representing the regression model of mpg = intercept + coefficient * wt:

As it can be seen, these two variables are indeed highly correlated and linear in nature.

There is a lot of material in this article, however there are really just two important take-aways.

First, linear regression is like the white-box testing in machine learning, and what it lacks in accuracy, it more than compensates in transparency, allowing us to infer about the data, rather than just do black-box crystal-ball like predictions.

Second, don’t ignore the details, they are there for a reason, and need to be considered.


Misconceptions on Machine Learning

March 5, 2014

Computer science is all about modeling. If one has the right model for a problem, then solving the problem becomes an engineering exercise.

Part of the work of creating a model is defining the right set of abstractions to work with, yet here is where the problem lies, if one uses these abstractions without fully understanding the underlying model and foundations. There is no better example of this problem then in the hot field of machine learning.

A simple definition of machine learning is the application of algorithms that attempt to learn with past input to predict future results.

We can see this everywhere today, from Facebook’s suggestions of new friends, Amazon’s product proposals, to LinkedIn’s job offers. Compound this now with Big Data. Due to this high demand, several libraries and tools have been created to support the use of machine learning.

This is all goodness, except that people are using these tools in some cases to provide critical business services without taking care of learning the underlying (statistical) models that support them. I have seen this over and over again. I believe other people also have, such as my friend Opher in this blog.

Let’s take a look at a few naive examples to illustrate the issues.

In our first example, we would like to predict a person’s weight from the person’s height. The height is what is called the explanatory or independent variable, and the weight is our predicted, response, or dependent variable. Following, we have a sample input for this data:

    height (cm) weight (kg)
1  180.0               77.0
2  177.8               69.9
3  175.0               69.5
4  170.0              68.0

Note that I am using real subjects for this data, but I can’t name them otherwise some people (at my immediate family) may get a bit upset. :-)

There is a general heuristics in science known as Occam’s razor that says one should always try the simplest approach to solving a problem. This may sound ridiculously obvious, but believe me if you are used to statistics, it is not really that surprising that one actually had to explicit state this.

Anyway, adhering to this heuristics, one of the simplest model, which actually works well in general and is widely employed, is what is known as the linear model. A linear model is nothing more than finding a linear equation of the form y = intercept + slope * x (in the case of a single independent variable) that fits well to the input data. There are several algorithms that do this, one of them being the least square algorithm.

Using our example, we can ‘fit’ a linear model to the input data as follows in R:

lm(formula = weight ~ height, data = example1)

And if we drill down into the result:

Coefficients:
(Intercept)       height  
-59.8261       0.7452

Or, if replacing in the original formula:

weight = -59.8261 + 0.7452 * height

A careful analysis of this would show that the coefficients are a bit off, however keep in mind that I trained my model using a sample of only 4 entries. This is a very small sample indeed, typically one fits their models using hundreds, or thousands, or even millions of elements. In fact, there is a field in statistics called power analysis that helps you define how large your sample has to be so as that you can have a certain level of confidence, say 95%, in your results. There are many variables to this calculation, but surprisingly enough the minimum number needed is generally not very large, sometimes just in the two digits range.

Regardless, let’s use this model to try to predict the weight of another subject whose height is 150 cm. This is done as follows in R:

predict(fit, list(height = 150))
-> 51.94918

As it stands, I know for a fact that this person’s weight (she may deny it) is around 52-53 kg, so in spite of everything our model has done remarkable well, wouldn’t you say?

Well, let’s now try this approach with a different scenario.

Say we have a sample with the number of frauds that has happened in a particular location as per its zip code. Here is the sample:

  zipcode fraud
1   91333    13
2   91332    10
3   91331     5
4   91330    15

For example, in the zip code 91333, there has been 13 frauds (in the last month). Could we create a linear model for this data, for example, in the form of:

fit2 <- lm(fraud ~ zipcode, example2)

Well, let’s try to use it:

predict(fit2, list(zipcode=91000))
-> 43.9
predict(fit2, list(zipcode=91334))
-> 10.5

Obviously, the results make absolutely no sense. What went wrong? Several things, to begin with there is really no linearity between the variables zip code and the number of frauds. But there is even a more subtle problem, whenever you are using linear models, there is a requirement that the variances (actually, the residuals) of the predicted variable be normally distributed. What this means is that, for example, your predicted variable cannot be a ‘count data’, like we have in this fraud example, or a binary true or false response, or even a categorical value, like high/medium/low, as none of these have a normal distribution. Even more than that, say that in our first example, we had included the measures of children alongside the measures for adults. As the variance for the weight and height correlation is different for adults and children, they would not be normally distributed, and hence wouldn’t work well for our linear model.

What this means is that it is not enough to simply apply a machine learning algorithm to your data, first and foremost you must understand your data and its distribution, which is not a trivial task.

The adoption of machine learning is very welcoming, however one must not forget the human aspect of choosing the right models and algorithms.


Vectorized Functional Programing

January 27, 2014

There is a lot of new-born enthusiasm for functional programming, especially with the introduction of Lambda expressions in Java 8.

Functional programming is indeed quite interesting, but let me introduce this through an example.

Recently, I needed to calculate the correlation between a set of variables representing different characteristics of vehicles. For example, I wanted to find out if the weight (wt) of a vehicle influenced negatively (or positively) its miles per gallon (mpg), or if there is a relationship between the number of cylinders (cyl) of a vehicle and its horse-power (hp).

Having done that, I generated a table whose rows listed the variables that are strongly correlated. Here is an example of the output:

        x      y      corrCoef        prob
1  mpg   cyl            -0.85      6.11e-10
2  mpg disp           -0.85      9.38e-10
3  mpg   wt            -0.87      1.29e-10
4  cyl   disp             0.90      1.80e-12
5  cyl      hp            0.83       3.47e-09
6  cyl       vs          -0.81        1.84e-08
7 disp     wt           0.89        1.22e-11

My next task is to sort this table by decreasing order of correlation. In other words, I would like to see first those variables that correlate the most. Yet, note that it doesn’t matter if it is a positive (direct correlation as in the case of number of cylinders vs HP) or a negative (indirect correlation as in the case of mpg vs weight) correlation, in other words, I want to sort the absolute value of the correlation coefficients (column 3).

Here is where it starts to get interesting. Considering this table to be named as carsCorrTable, I can do this with the following expression:

carsCorrTable <- carsCorrTable[order(abs(carsCorrTable[,3]), decreasing=TRUE),]

Let’s distill this:

  1. abs(carsCorrTable[,3]): apply the absolute function to all members of the third column of carsCorrTable
  2. order(abs(carsCorrTable[,3], decreasing=TRUE): order in decreasing values
  3. carsCorrTable[order(abs(carsCorrTable[,3]), decreasing=TRUE),]: select the ordered rows
  4. carsCorrTable <- carsCorrTable[order(abs(carsCorrTable[,3]), decreasing=TRUE),]: reassign to the original table

Note the conciseness and powerfulness of the original expression! It all boils down to applying several functions to vectors of values. In other words, it is the combination of functional programming and vector-based programming together.

In my mind, a programming model has the right fluency for a task if you are able to code the task in an intuitive manner where you were sure it would not work the first time around, and yet it does! This is exactly what happened to me in this case, even though I hadn’t done this before, I was able to come up with the previous expression rather quickly and (perhaps because of it) was expecting to have to spend the next few hours debugging it, yet it simply worked as expected.

Consider how you would have to do this using your general imperative language. You would need to iterate through each correlation coefficient and calculate its absolute value, then again iterate through the result to sort them, and perhaps do a final iteration to assign the sorted values back to the table. I didn’t have the heart to implement this, but I bet you it wouldn’t look nice in Java 7, C++, etc.

Fortunately, Java 8 introduces lambda expressions and Iterable.forEach(), which should yield nicely clean programs as the previous one. Interestingly, note how in this example I used R, which has been available for over two decades now. (Granted, not sure it would have mattered to introduce these features before, would the community be matured enough to use it? Is it now?)

There is still one very interesting and fundamental question to this, which is how does it compare the execution time (i.e. time complexity) of running this small task in its vectorized functional form and in its imperative form. Would it be the same? Would the vectorized implementation be able to execute in parallel as expected, and therefore yield better results? Or would it eventually all boil down to nested iterations with some polynomial order of growth?


Detecting Fraud using Event Processing

September 5, 2013

Let’s consider a simple fraud detection scenario, one which I have run into with customers in the past.

Say we want to detect when a particular credit card customer spends more than 10000 dollars in a 24 hour (rolling) period.

Offhand, this appears to be a simple scenario to implement. Let’s get to it. Intuitively, this seems like a good use-case for pattern matching queries:

SELECT

M.clientId, M.txId, “ALARM” AS status

FROM creditcardTransactionStream

MATCH_RECOGNIZE (

PARTITION BY cardNumber

MEASURES cardNumber as clientId, transactionId AS txId

PATTERN (A)

DEFINE A AS price > 10000

) AS M

We can break down this query into the following steps:

  1. Partition the events coming through the credit-card transaction stream by its cardNumber;
  2. Check if the price of the transaction (event) is greater than 10000 dollars. We call this pattern A, or more precisely, the correlation variable A.
  3. When the pattern is detected, a process called matching, you can specify how to represent the matched pattern. These are called measures of the pattern. In this particular case, two measures are defined, one identifying the card number and another for the transaction ID.
  4. The final step is to output the two measures (i.e. cardNumber, transactionId) and a status string set to ‘ALARM’ indicating that the customer has spent more than 10000 dollars.

This pattern matching query is able to identify if a single transaction is above the threshold, but not multiple transactions that individually may cross the threshold.

For example, consider the following table of input and output events:

Time (h) Input Output
0 {cardNumber=10, transactionId=1, price=10000} {clientId=10, txId=1, ‘ALARM’}
1 {cardNumber=10, transactionId=2, price=5000}
2 {cardNumber=10, transactionId=3, price=6000}

In this case, note how there is no output at time t = 2h even though transactions 2 and 3 collectively do cross the 10K threshold.

The following query addresses this issue:

SELECT

M.clientId, M.txId, “ALARM” AS status

FROM creditcardTransactionStream

MATCH_RECOGNIZE (

PARTITION BY cardNumber

MEASURES cardNumber as clientId, transactionId as txId

PATTERN (A+? B) WITHIN 24 HOURS

DEFINE B as SUM(A.price) + B.price > 10000

) as M

One way to look at pattern matching is to consider it as a state diagram, where each correlation variable (e.g. A) represents a state in the state diagram. Typically, you need at least two states, the starting state and the accepting state (i.e. the end state). In this particular scenario, state A represents the individual transactions and state B represents  the condition where collectively the events cross the 10K threshold.

However, there are a few caveats. For example, could we go along with a single state (i.e. A) using the expression ‘A as SUM(A.price)’? No, we can’t define a variable in terms of an aggregation of itself, hence the reason we created variable B, which is then defined as the aggregation of A. Because we intent to aggregate several events as part of A, A must be defined as a collection, that is, a correlation group variable. In this case, we do this by using the regular expression operator of ‘+’ as in ‘A+’. This indicates that the correlation variable A should attempt to match one or more events.

Further, why do we need to add ‘B.price’ to the ‘SUM(a.price)’ as in the expression ‘B as SUM(A.price) + B.price’ ? This is needed because the ending event itself is a transaction and thus its price needs to be considered as part of the equation. If we don’t do that, then the price of the last received event would be ignored.

Why do we need the token ‘?’ as in the expression ‘A+?’ in the query? Typically, correlation group variables, as it is the case of A, are greedy, that is, they try to match as many events as possible. However, because A and B are similar, if we keep A greedy, then B would never match. To avoid this situation, we use the token ‘?’ to indicate A should be reluctant, and match the minimal set of events possible, therefore also allowing B to match and accept the final sequence.

Finally, note that the clause ‘within 24 hours’ specifies the window of interest. This means that the sum of the events are always constrained to a rolling 24 hours window. Here is a state diagram that represents the query:

state-machine

Using this new query, we will get a proper alarm output when the query receives the third event (i.e transactionId = 3). However, there is one new problem, this query ceases to output an alarm when we receive the first event (i.e. transactionId = 1).

Why is that? The problem is that the new query expects at least two events, one that matches variable A and another to match variable B. We can fix this by introducing an alternation in our regular expression, as follows:

PATTERN (C | (A*? B)) WITHIN 24 HOURS

DEFINE B as SUM(A.price) + B.price > 10000, C as price > 10000

In this case, either we match the expression ‘(A+? B)’ or the expression ‘C’, where C is simply the case where we have a single event whose price is already crossing the threshold. Note that we need to specify C as the first variable in our pattern, otherwise the query would only consider the ‘(A+? B)’ expression and never have a chance to match for ‘C’. Also note that any variable that is not defined, as it is the case of A, means that it is always true for all events.

Are we done?

Not yet, consider the case where we want to output the total sum of the transactions that are crossing the threshold. The introduction of an alternation makes this a bit harder, as the query is either matching C or A and B. How can we solve this?

We can address this by using subsets. Subsets are correlation variables that represent the union of other correlation variables. Take a look a look at the modified query:

SELECT M.clientId, M.total, “ALARM” as status

FROM creditcardTransactionStream

MATCH_RECOGNIZE (

PARTITION BY cardNumber

MEASURES cardNumber as clientId, SUM(ABC.price) as total

PATTERN (C | (A+? B)) WITHIN 24 HOURS

SUBSET ABC = (A,B,C)

DEFINE B as SUM(A.price) + B.price > 10000, C as price > 10000

) as M

As you can see, pattern matching queries gives us a lot of power, however this comes at a price, as one needs to consider a lot of things, such as the starting conditions, the accepting conditions, and all the corner-case scenarios, which are never trivial, even for apparently simple use-cases.

It also highlights the crucial but subtle point that as much as one tries to create black-box solutions for event processing scenarios, real-world use-cases are always more complicated and have more subtleties then originally conceived and for this reason there is always going to be a place for computational rich domain-specific event processing languages.


New book on Event Processing and EPL theory

July 4, 2013

In march of this year, I had the pleasure to publish my second book: Getting Started with Oracle Event Processing 11g

As I learned the hard way, authoring a book is really hard stuff, so this time around I had the satisfaction of working with two co-authors, Robin Smith and Lloyd Williams, both outstanding product managers at Oracle.

Although the book is focused primarily on Oracle’s Event Processing product (version 11g), it also includes a lot of material around the underlying fundamentals and concepts of the (complex) event processing technology employed by the product, so that it should make it interesting to the event processing enthusiastics in general, particularly those interested on learning the theory behind event processing languages.

The first two chapters provide a very good landscape of the market for event processing, along the way describing a few important use-cases that are addressed by this technology. Chapter 3 and 4 describe how to get events in and out of the system, and how to model the system using the concept of a Event Processing Network (EPN).

Chapters 5, 8, and 11 provide a very deep description of Oracle’s CQL (Continuous Query Language). Amongst other things, they get into several interesting and advanced topics:

  • The conceptual differences between streams, and relations;
  • How different timing models, such as application-based and system-based time models, influence the results;
  • A formal explanation of how relational algebra in SQL is extended to support streams;
  • Shows how to implement interesting pattern matching scenarios, such as that of missing events, and the W pattern; and
  • Describes how CQL is extended to support JDBC, Java, and Spatial technology, allowing one to not only process events in time, but also in terms of location.

Chapters 6, 7, and 9 describe how to manage the overall system, both in terms of configuration, but also performance, and how to scale-up and scale-out, particularly explaining how a Data Grid can be used in conjunction with event processing to greatly improve scalability and fault tolerance. Finally, chapters 10 and 12 tie everything together with a case study and discusses future directions.

I hope you will have as much fun reading this book as I had writing it.

If you have any questions along the way, feel free to send them to me.


Follow

Get every new post delivered to your Inbox.