In my opinion, the very fact that we are comparing CEP/ESP and rules engines is a sign that these subjects are not yet very well understood.
Part of the reason for the confusion is that CEP/ESP, rules engines, and RETE sit at different levels of abstraction.
CEP/ESP is a technology: a collection of techniques, methodologies, and implementation procedures intended to solve a common set of problems in a domain. In the case of CEP/ESP, that problem is (loosely speaking) the online processing of streaming data (i.e. events) to support, among other things, real-time business decisions.
Although CEP and ESP have different definitions, for the purpose of this article I will use both terms interchangeably.
There are several aspects to CEP: visualization of events, storage of events, programming model for the development of CEP applications, rule (or query) governance, the runtime processing engine, etc.
With respect to the runtime processing engine, there are several implementation options. However, before considering an engine implementation, one must first consider what language, or what language paradigm, should be surfaced to the user so that the user can program against this engine.
One option is to take the well-known imperative programming language paradigm and use any one of the popular procedural or object-oriented languages such as Java as the event processing language of choice that is surfaced to the user. For example, this is the approach taken when one implements their CEP application with hard-coded rules directly in Java.
Another option is to take a more data-driven approach. Considering that one is dealing with events, which are a kind (or extension) of data, this seems an appropriate choice.
Within the realm of data-driven approaches, one option is to use SQL. SQL, needless to say, has proven to be a very convenient and pervasive domain-specific language for dealing with data, which allows one to leverage existing skill-sets. Another common option is to use logic programming. Note that there is an equally intriguing discussion around CEP and SQL currently going on.
Logic programming is a very interesting, but long, subject. In a logic programming language, programs consist of a set of statements that specify what is true about a desired goal. This differs from imperative languages (e.g. Java), where programs specify the set of statements that must be executed to achieve the desired goal.
More specifically, in logic programming, programs contain only logical statements, that is, statements that are either true or false. Logical statements that are given as always true are called axioms. The role of a logic program is to infer new logical statements, called theorems, from the axioms through some automated algorithm; the rules this algorithm applies are called inference rules. Theorems are the goals that represent the desired results.
Pure logic programming is very powerful indeed. In pure logic programming, at no time does one specify how to derive the goals; that is, there are no control statements (e.g. if-then-else). Pure logic programming is also very hard to implement, so people came up with simpler models that support subsets of the full logic programming paradigm. One of these is First Order Predicate Calculus.
In First Order Predicate Calculus, logical statements are formed of constants, variables and their quantifiers (e.g. for all values, there exists a value), predicates, and connectives. Much of the expressiveness and power of First Order Predicate Calculus comes from the connectives. Common connectives are the logic operators AND, OR, and NOT; additional connectives are implication and equivalence. To say that A implies B means that B is true whenever A is true.
It turns out that even this restricted subset, First Order Predicate Calculus, is hard to implement, so people have simplified it further to facilitate implementation. Horn Clauses are the most commonly used subset of First Order Predicate Calculus: they do not support the specification of variable quantifiers or of the equivalence operator, among other limitations. Prolog, probably the most widely used logic programming language, supports Horn Clauses.
Loosely speaking, a rule-based system (or rule-based engine) is an implementation of a logic programming language, usually based upon Horn Clauses, with a few procedural extensions. These systems are sometimes called production systems. In a production system, logical statements that are constant (i.e. always true or false) are called assertions, and together they compose what’s called the working memory. Logical statements that make use of the implication connective, that is, that follow an if-then structure, are called productions, and make up the production memory.
Rules engines execute by following a recognize-act cycle; that is, as new logical statements are made available (i.e. inputted by the user), the engine matches (i.e. recognizes) the left-hand side of all productions against the contents of the working memory. This process results in a set of production instantiations. Next, the engine executes (i.e. acts on) the right-hand sides, called actions, of the matched production instantiations. The actions may place new content in the working memory, and changes to the working memory cause the engine to re-enter the recognize-act cycle, which continues until no more matches occur.
One procedural extension found in rules engines is the ability to directly manipulate the contents of the working memory. This is done in the right-hand side (i.e. the actions) of the productions, where one may create, delete, and modify assertions present in the working memory.
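To make the recognize-act cycle concrete, here is a minimal Python sketch. The fact representation, the production, and its threshold are all hypothetical, and real engines are far more sophisticated than this fixed-point loop:

```python
# Minimal sketch of a production system's recognize-act cycle.
# Facts are tuples in a working-memory set; each production pairs a
# match function (left-hand side) with an action (right-hand side).

def run(working_memory, productions):
    fired = set()  # remember (production, fact) pairs already acted upon
    changed = True
    while changed:
        changed = False
        # Recognize: match every production against every fact (naive).
        for i, (matches, action) in enumerate(productions):
            for fact in list(working_memory):
                if matches(fact) and (i, fact) not in fired:
                    fired.add((i, fact))
                    # Act: the action may assert new facts, which
                    # re-triggers the cycle until no more matches occur.
                    new_facts = action(fact)
                    if new_facts - working_memory:
                        working_memory |= new_facts
                        changed = True
    return working_memory

# Hypothetical production: if a stock trades above 12, assert it is "up".
productions = [
    (lambda f: f[0] == "stock" and f[2] > 12,
     lambda f: {("stockUp", f[1])}),
]
wm = run({("stock", "aaa", 10), ("stock", "ccc", 13)}, productions)
print(("stockUp", "ccc") in wm)  # True: only ccc trades above 12
```

Note how the action manipulates the working memory directly, which is exactly the procedural extension described above.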
Matching through thousands of productions against thousands of assertions can be quite expensive. A naive approach would cost O(productions * assertions). Matching can take around 90% of the processing time of a rules engine.
So let’s consider a way of optimizing this. Essentially, what the engine is trying to do is find out which assertions match the left-hand sides of the productions; in other words, the end goal is to find a set of productions. What if we create a graph of nodes, where the nodes represent the conditions we are trying to match, and as we traverse the graph matching these nodes we eventually reach a terminal node representing the production whose conditions apply?
For example, consider the following four productions P1, P2, P3, P4:
P1: if attr1 = 'value1' and attr2 = 'value2' then action1
P2: if attr1 = 'value1' and attr2 = 'value3' then action2
P3: if attr1 = 'value4' and attr2 = 'value5' then action3
P4: if attr1 = 'value6' and attr2 = 'value7' then action4
The corresponding matching graph is a tree: the first level of nodes tests attr1, the second level tests attr2, and the terminal nodes are the productions P1 through P4.
Now consider that assertion A1 is inputted:
A1(attr1 = 'value1', attr2 = 'value2')
The first step is to match attr1, which takes the engine to the child node n2; the next is to match attr2, reaching the terminal node P1, which is our goal. This takes a total of 4 comparisons in the worst case, whereas a (very) naive implementation would take 8 comparisons in the worst case (2 conditions for each of the 4 productions).
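For illustration, the matching graph for P1 through P4 can be sketched in Python as a nested dictionary, where each lookup plays the role of traversing a node. This is a simplification of the data structures a real engine would use:

```python
# Sketch of the matching graph for productions P1..P4 as a nested dict:
# the first level discriminates on attr1, the second level on attr2,
# and the leaves are the terminal production nodes.
graph = {
    "value1": {"value2": "P1", "value3": "P2"},
    "value4": {"value5": "P3"},
    "value6": {"value7": "P4"},
}

def match(attr1, attr2):
    # Walk the graph: a shared attr1 test is evaluated once, instead of
    # once per production as a naive production-by-production scan would.
    children = graph.get(attr1)
    if children is None:
        return None  # no production's conditions apply
    return children.get(attr2)

print(match("value1", "value2"))  # reaches terminal node P1
```

The saving comes from sharing: P1 and P2 both test attr1 = 'value1', but the graph evaluates that condition only once per assertion.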
Although over-simplified, this is the general idea of the RETE (Latin for network) matching algorithm invented by Dr. Charles Forgy. RETE also includes additional considerations for optimizing the handling of variable bindings and joins, and techniques for keeping the network of matching nodes up-to-date, among other things.
RETE is a very efficient algorithm for performing matching in production systems. Does it look generic enough, in its entirety or in parts, to be used in different contexts? Yes, it certainly does to me. In fact, we use some of its ideas, in combination with hash functions, in the implementation of some aspects of our CEP engine, which is based upon SQL and not on a logic programming language.
To put all of this in perspective, let’s look at an example. Say you want to find out whether a certain stock has gone up more than 20% compared to some previous value. Consider that a stock is defined as:
Stock: SYMBOL, VALUE, TIMESTAMP
First, let’s start with a logic programming language. I will use Prolog in this case:
stockUp(X) :- stock(X, Price1, Time1), stock(X, Price2, Time2),
Time2 > Time1, Price2 > (Price1 * 1.20).
If we feed the following assertions into the Prolog’s working memory:
stock(aaa, 10, 0).
stock(bbb, 15, 1).
stock(ccc, 10, 2).
stock(aaa, 12, 3).
stock(bbb, 14, 4).
stock(ccc, 13, 5).
We will get the following result:
stockUp(X = ccc)
Which means that the last stock assertion (i.e. ‘stock(ccc, 13, 5)’) triggers the creation of the new assertion ‘stockUp’ for symbol ‘ccc’.
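For readers more comfortable with imperative code, the same join can be sketched in Python as a nested loop over the stock tuples. This is illustrative only; neither Prolog nor a production system evaluates the rule this way:

```python
# The Prolog rule as a nested loop over stock(Symbol, Price, Time)
# assertions: pair every two assertions for the same symbol where the
# later price exceeds the earlier price by more than 20%.
stocks = [
    ("aaa", 10, 0), ("bbb", 15, 1), ("ccc", 10, 2),
    ("aaa", 12, 3), ("bbb", 14, 4), ("ccc", 13, 5),
]

stock_up = {
    s1
    for (s1, p1, t1) in stocks
    for (s2, p2, t2) in stocks
    if s1 == s2 and t2 > t1 and p2 > p1 * 1.20
}
print(stock_up)  # {'ccc'}: 13 > 10 * 1.20
```

Note that ‘aaa’ does not qualify: 12 is exactly 20% above 10, and the rule requires strictly more than 20%.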
Next, let’s try with BEA EPL, which mostly follows the SQL idiom:
INSERT INTO StockUp
SELECT a.symbol FROM Stock a, Stock b RETAIN ALL EVENTS
WHERE a.symbol = b.symbol and (a.price > b.price * 1.20)
This EPL statement provides the same result. However, in this case, we will be feeding the engine with Java Bean events rather than assertions. So, instead of creating the assertion:
stock(aaa, 10, 0)
We will create the Java Bean:
new Stock("aaa", 10.0)
It is interesting how events and assertions map seamlessly to each other, at least for this simple scenario. Intuitively, this seems to make sense to me, as both are conceptually immutable and represent some known aspect of the problem domain.
Note how we specify RETAIN ALL EVENTS in the EPL query. Using rule-based systems as an analogy, the RETAIN clause provides a declarative way of managing the CEP engine’s working memory. In the case of this example, it is doing exactly the same thing as the Prolog program, that is, keeping all assertions forever, which is probably not a realistic situation.
Consider this next EPL statement:
INSERT INTO StockUp
SELECT symbol FROM Stock
RETAIN 10 EVENTS WITH LARGEST timestamp PARTITION BY symbol
WHERE price > (prev(1,price) * 1.20)
In the first EPL query, we assumed that the stock events arrive in non-decreasing time order within the stream. This is a fair assumption to make considering that we are dealing with streams executing within an underlying stream infrastructure, whereas we could not really make this assumption with regard to logical statements, hence the need to explicitly include the timestamp in the Prolog predicate clause. Nevertheless, for the sake of comparison, let’s assume that the events are not ordered, which means that the EPL query itself has to take the ordering into account. This is done with the WITH LARGEST timestamp clause, which keeps the stock stream in ascending order of timestamp; in my opinion, WITH LARGEST timestamp is more declarative than its logical counterpart, Time2 > Time1.
Likewise, note how the second EPL statement does not need to join on symbol, a join which is explicitly done both by the logical statement ‘stock(X, Price1, Time1), stock(X, Price2, Time2)’ and by the first EPL statement’s ‘a.symbol = b.symbol’. Instead, the clause PARTITION BY symbol groups the stocks by symbol and allows one to compare (in this case using the prev function) within just that group.
Needless to say, this second EPL query provides the same result as the first one, but in a more declarative manner. The one semantic difference between the two is that the second query only considers 10 events at a time (i.e. RETAIN 10 EVENTS), rather than all events (i.e. RETAIN ALL EVENTS). In other words, we are declaratively restricting the working memory to 10 events in this case.
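Here is my reading of what the second query amounts to at runtime, sketched in Python with per-symbol windows of 10 events. This is a simplification, not BEA’s actual implementation, and for brevity it assumes the events arrive already ordered by timestamp:

```python
from collections import defaultdict, deque

# Rough sketch of the second EPL query's runtime behaviour: per-symbol
# windows of at most 10 events, comparing each arriving price with
# prev(1, price), the previous price retained for that symbol.
windows = defaultdict(lambda: deque(maxlen=10))  # working memory per symbol
stock_up = []

def on_event(symbol, price, timestamp):
    window = windows[symbol]
    if window and price > window[-1][0] * 1.20:
        stock_up.append(symbol)  # INSERT INTO StockUp
    window.append((price, timestamp))

for event in [("aaa", 10, 0), ("bbb", 15, 1), ("ccc", 10, 2),
              ("aaa", 12, 3), ("bbb", 14, 4), ("ccc", 13, 5)]:
    on_event(*event)
print(stock_up)  # ['ccc']: 13 > 10 * 1.20
```

The deque’s maxlen plays the role of RETAIN 10 EVENTS: once a symbol’s window is full, the oldest event is silently evicted, which is precisely the declarative working-memory restriction the query expresses.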
Ok, so one may argue that rule-based systems are better suited to handle inference, which my simple example fails to demonstrate.
That is a valid point, so let’s consider the scenario where stocks are part of a cluster that moves together, that is, if several other stocks within the same cluster go up, it is likely that the remaining ones within the cluster will also go up in the near future.
A simple production rule for this scenario is:
stockUp(aaa) :- stockUp(bbb), stockUp(ccc).
If one creates the following new assertion:
stock(bbb, 25, 6).
One will get the following result, which is expected:
stockUp(X = bbb)
And, in addition, the engine would infer that:
stockUp(X = aaa)
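The inference step itself is simple enough to sketch in Python: once both premises are in the working memory, the rule’s conclusion is asserted. The tuple representation is illustrative:

```python
# Sketch of forward-chaining the cluster rule: derive stockUp(aaa)
# once both stockUp(bbb) and stockUp(ccc) are in the working memory.
working_memory = {("stockUp", "ccc")}      # derived earlier
working_memory.add(("stockUp", "bbb"))     # stock(bbb, 25, 6): 25 > 15 * 1.20

# stockUp(aaa) :- stockUp(bbb), stockUp(ccc).
if {("stockUp", "bbb"), ("stockUp", "ccc")} <= working_memory:
    working_memory.add(("stockUp", "aaa"))  # the inferred theorem

print(("stockUp", "aaa") in working_memory)  # True
```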
Can we do something similar with EPL? Certainly, here is the query:
INSERT INTO StockUp
SELECT 'aaa' AS symbol
FROM StockUp a, StockUp b RETAIN 10 MINUTES
WHERE a.symbol = 'bbb' AND b.symbol = 'ccc'
Again, it is expressed in terms of SQL joins, which arguably is a less natural form of expressing inference, but nonetheless correct.
Thankfully, EPL also supports a pattern matching idiom, in contrast to just a SQL idiom, which, in my opinion, provides better expressiveness for describing inference:
INSERT INTO StockUp
SELECT 'aaa' AS symbol
MATCHING StockUp(symbol = 'bbb') AND StockUp(symbol = 'ccc')
RETAIN 10 MINUTES
So which language is better? SQL-based languages? Rule-based languages?
In my opinion, the discussion around CEP technology is larger than the discussion around which language to use for its engine, and definitely larger than discussing which matching algorithm, RETE or otherwise, this engine should be built upon.
When discussing the CEP technology, I find it important to focus on use-cases.
Regarding programming languages, there is a place for more than one language in CEP, and, equally, one does not want to be restricted by any one language in particular.