Oracle CEP 11gR1 – official support for CQL

July 1, 2009

Today Oracle announced the release of the product Oracle CEP, version 11gR1, which I am gladly part of the team.

Oracle CEP 11gR1 is the next release after 10.3, which is the re-brand of BEA’s Weblogic Event Server.

There are several new features in this release, but the flag-stone is the inclusion of CQL (Continuous Query Language) as the default event processing language.

CQL is important in several aspects:

  • It brings us closer to converging towards a standard language for event processing
  • It is based on a solid theoretical foundation, leveraging relational calculus and extending it to include the concept of stream of events
  • Full support for pattern matching, a new stream-only operator
  • Solid engineering, including several query plan optimizations, for example, a unbounded stream that uses the insert stream operator is converted to a ‘NOW’ window using the relation-stream operator
  • Implementation abstractions, for example, a relation can be bound to either a RDBMS table or to a Coherence cache, without any query re-write.

Other features worth-mentioning are:

And more coming soon…


Best-case analysis is bogus

June 21, 2009

One deals with worst-case analysis of algorithms in a day-to-day basis, especially if one is working with systems that have real-time constraints, such as CEP. As I’ve written in the past, real-time systems do not necessarily need to run fast, that is, they can take days to finish, but if the real-time system states that it will finish a task in 1 hour, then it must. To be able to make such time assertions with certainty, the real-time system developer must know the worst-case time for all algorithms that make up such a task.

As an example, what is the worst-case cost of inserting 1 element in a sorted linked list of size n? A naive approach is to start at the first position and compare the elements in the linked list until you find one that is larger than the element being inserted, in which case then you insert the new element in that position and move the remaining elements back. The worst-case scenario is when the element to be inserted is larger than any other element in the list, thus the cost being the cost of n node traversals (i.e. n*c1) and the cost of the actual insertion at the end (i.e. c2). Or, eliminating the constants, we would arrive at the asymptotic worst-case cost of n.

What about the best-case analysis of this algorithm?

Well, I believe that best-case analysis are rather bogus. It has very limited analytical and statistical value, as it not only adds no insight to the analysis of the algorithm, but it also rather hardly presents a real-world case.

However, the biggest problem of best-case algorithm analysis is that the result can be manufactured. To test for best-case scenarios means that you can choose your best input, which allows the algorithm developer to hard-code the algorithm for this well known best input. For example, the algorithm could do the following:

if (bestInput) then return x;

// otherwise do proper work

In other words, you could have a very complex algorithm whose worst-case cost is n square, but whose best-case cost for this one selected input is 1!

Is this valid? Well, it is a valid algorithm, but, in my opinion, a bogus analysis.


CEP as a Continuous Query Engine

June 10, 2009

CEP is commonly referenced as a continuous query engine.
What exactly is a continuous query engine? How is it different than a non-continuous query engine, such as a database query engine?

First and foremost, CEP is not passively waiting for a customer request to execute work. Instead, the customer registers CEP queries into the engine and the engine executes these queries, even when there are no input events to be processed. Contrast this with the case of a database system, which waits for a DBA to send a query before processing the input data. This is the reason why CEP is sometimes called an inverted database:

In a database, the queries change often, and the input (data) changes less often. In CEP, the input (event) changes often, and the queries changes less often.

Let’s consider different CEP scenarios, where in each one “continuous” has a slightly different meaning:

  • Filtering

In the case of queries performing plain filtering over each single input event, the CEP engine is executing work only when a new event arrives, and otherwise it is idle. If the flow of input events is high, one has the impression that the engine is continuously running.

  • Time-windows

In the case of queries that make use of time-windows, for example, a query that calculates the average price of stocks received in the last 10 seconds, then the CEP engine must purge events out of the time-window as time progresses. In this sense, the engine is actually doing work even if there are no input events.

  • User-defined functions

Consider a query that makes use of a function that is not idempotent, for example, a function that returns the current time:

SELECT getCurrentTime() as time FROM stream

This is an interesting case, should the CEP engine output a new event only when a new input event arrives in the stream, or should the CEP engine output a new event every time getCurrentTime() returns a new value? If we are true to the continuous-aspect of a continuous query engine, then it seems the CEP engine should do the latter case, that is, output a new event every time progression.

Note that the output event does not depend upon the input event, such it is the case of the following example:

SELECT getCurrentTime(), stream.property FROM stream [NOW]

  • State-space

Finally, let’s consider the state-space used by the CEP engines, or, for that matter, by any continuous query engine. The state-space is the set of variables used internally in a system. If the processing is continuous, then one would assume that the variables being used while processing the input values is also of a continuous nature, that is, composed of a vector of real (or complex) numbers, and calculated through differential equations. Instead, this ends up not being the case, the state-space is more likely than not a set of discreet numbers, like integers.

So, as it stands today, most continuous query engines have discreet state-spaces instead of continuous state-spaces.


Javaone 2009 CEP presentation

May 3, 2009

CEP is an innovative technology; it deals with two problems that are prominent and relevant today: real-time processing and high-volume of data. I argue that CEP is here to stay.

Nonetheless, one can’t say it has become mainstream yet.

I believe the reason being is that CEP is not particularly simple; for instance, I am sure one will find several different definitions of the term CEP if one looks around. Partly, I think the culprit  is that we’ve spent too much time trying to define what is CEP, rather than what it does. With that intent on mind, I will be presenting a CEP Design Patterns at the upcoming J1 2009.

The idea is to focus on simple, practical problems around event processing, and then show how these problems can be solved with different CEP design patterns. The material is not yet finalized, however so far I am planning on discussing the following patterns:

  1. Event filtering
  2. New event detection
  3. Partitioned event filtering
  4. Old event detection
  5. Event enrichment
  6. Event aggregation
  7. Batched event aggregation
  8. Event correlation
  9. Missing event detection

The current schedule is to present on Wed at 3:00. Hope to see you there.


CEP glossary

September 11, 2008

The Event Processing Technical Society (EPTS) has updated its glossary of CEP terms.

Having a common language is indeed one of the first steps towards being able to discuss a technology.

Fortunately enough, several of the terms are already well-accepted and used. For instances, Oracle CEP (i.e. former BEA Event Server) treats the following terms as first-class citizens in its programming model: Event Source, Event Sink, Stream, Processor (a.k.a Event Processing Agent), Event Type, Event Processing Network, Event Processing Language, and Rule. This allows the user to author an event processing application by leveraging existing event processing design patterns: define the event types, define the sources of events, define the sinks of events, define the intermediate processors, connect these forming a EPN, configure the processors with EPL rules. It becomes a simple enough task if one understands the glossary.

Nevertheless, I would like to suggest one additional term: a relation. To understand relations, one has to understand streams first.

The glossary defines stream as “a linearly ordered sequence of events”. At first glance, one may think of a stream as the physical pipe, or connection, between sources and sinks. However, that is not entirely correct. Rather, an event channel, which is defined as “a conduit in which events are transmitted…” represents the idea of a connection; a stream extends this idea and adds processing semantic to it, precisely it states that a stream not only contains events, but that these events must be totally ordered.

Let’s consider an example. Consider an event type consisting of two attributes, a integer, and a string.

The following sequence is a stream ordered by the first attribute: {1, “a”}, {2, “b”}, {3, “a”}.
Conversely, the following sequence is NOT a stream: {1, “a”}, {2, “b”}, {1, “a”}. In this latter case, this sequence of events is termed a event cloud. Streams are a type of event clouds.

This may seem rather odd and arbitrary. But this additional semantic is very helpful. For one thing, it allows CEP engines to optimize. Because streams are generally unbounded, i.e. they never end, one has to define a window of time (or length) on top of a stream to be able to do any interesting computation. Having a ordered sequence allows the engine to progress the computation for a window without having to keep old values forever.

Considering the example at hand, let’s say one wants to calculate the number (i.e. count) of similar strings in the last 3 seconds, and that the first attribute provides the time-stamp of the event in seconds.

If the events were not ordered in time, then how would the engine know when it is safe enough to output the correct result?

Consider that time starts at t = 0 and the following input: {2, “a”}, {3, “a”}.

Should we wait for event {1,?} before outputting the result? Was this event lost or delayed? What if we don’t wait, and output the result, and then the event {1,?} ends up arriving afterward, should we output the result again?

As it can be noted, the conceptual model gets very complicated. Better to keep the conceptual model simple, and move this complexity elsewhere, such as an adapter that might be re-ordering events using some time-out strategy.

A “relation” is another type of event cloud. It is also a totally ordered set of events. In addition, it has an attribute that denotes whether the event is an insert, delete, or update event. These three different kind of events allows one to model a table, or more precisely, a instantaneous finite bag of tuples at some instant of time.

Consider a table whose rows contain a single column of type string.

The sequence of events {+, 1, “a”}, {+, 2, “b”} creates the table { {”a”}, {”b”} } at time t = 2.

Then the following event arrives: {-, 3, “a”}. This will result into the table being updated to { {”b”} } at time t = 3.

Keep in mind that a relation, in the context of event processing, and similarly to a stream, is still a sequence of streaming events, which pass through some event channel. However, differently than a stream, these events have additional semantic into them to allow one to represent actual finite tables.

Why is this needed? It is very common for events to be enriched with additional data. Sometimes this data is very dynamic, in which case this is modeled as a join between two streams; sometimes this data is somewhat static, changing less often, this case is better modeled as the join between a stream and a relation. If you think of it in this terms, it almost seems like a stream, seen as containing only insert events, is a sub-type of relation.

Why called it “relation”? Mainly because this is the term used by CQL, the foundation work for data stream management.

The CEP glossary does an excellent job of setting up the base model, and extending it to event patterns. With “relation“, we complete the model by embracing the data stream management piece of it.


A OSGi-Powered Database System

August 17, 2008

On the article One size fits all: A concept whose time has come and gone, Stonebraker comments that in the past some RDBMSs advocated the idea of a single solution to solve all database management needs.

He further points out that data management needs have grown since the inception of the technology in the 1970’s. Database systems started off focusing on OLTP, that is, transaction (i.e. particularly update) intensive applications, moved to OLAP, which is analytical, that is, read-intensive in nature, and hasn’t stopped since. Other interesting examples of new application areas related to data management are stream processing, web search and indexing, XML management, and local and distributed caches.

Today, database systems are often taught in computer science courses, to the point that it is common to sub-divide its study into separate functional areas, such as:

  • Query compiler
  • Query optimizer
  • Transaction manager
  • Concurrency manager
  • Index manager
  • Buffer (Memory) manager
  • Storage manager
  • Logging and recovery manager

Consider a DBMS system that is implemented by the modularization of each one of these functional areas, for example, which could be done by composing each functional area as a separate OSGi bundle.

This DBMS system could then be easily configured for different database management needs. Let’s revisit the application areas mentioned previously, and see how modularization could help us achieve them:

  • Stream processing: extend the query compiler and optimizer to handle stream-based algebra; drop transaction, and storage manager, as there are no atomicity nor durability across events.
  • Web search: use index manager based upon inverse lookups instead of B-trees.
  • Distributed cache: distribute the logging and recovery manager.
Of course, this is a over-simplification of the challenges and work involved, however it does strike me that most of the components already exist, it is just a matter of modularizing and re-using them correctly to assemble the next database management application…

Best Complex Event Processing Solution

July 15, 2008

Recently, Waters published its ranking for best solution and services for the year of 2008.

I am very glad to see WebLogic Event Server, re-branded as Oracle CEP, as the winner of the best complex event processing (CEP) solution.

There is still plenty for us to do, but I do think we have come a long way in the past two/three years, and we have constantly tried to innovate, both at the container level as well as at the programming model.

Interesting enough, there is a separate Best Streaming Data Management Solution category, which was awarded to the company Streambase.

Personally, I do think there is an implementation difference between streaming data management systems (SDMS), whose roots are deep from DBMS technology, and complex event processing systems (CEP), term which I believe was coined by David Luckham, focusing on event relationships (e.g. causality, aggregation). The keyword being implementation difference, as there is a large overlap on the use-cases that both address.

Regardless, I find it intriguing that Waters not only does not state the differences between the categories, but also uses the term CEP several times in the SDMS category.

I guess the verdict is that there is still confusion amongst the experts regarding event and stream processing… And that both products must be very good.


A Leicht Introduction to OSGi

June 23, 2008

Tomorrow, on June 24, Ian Skerrett is organizing a day of Eclipse and OSGi related technical sessions at Google, Mountain View CA.

I will be presenting a gentle introduction to OSGi, focusing on differences between using OSGi’s framework APIs and Spring-DM.

On a related note, I also presented another gentle introduction to OSGi at the 2008 OSGi Community Event at Berlin on June 11, however at this event the focus was on how WebLogic Event Server is built upon the OSGi technology.

I have high hope that someday I will know enough on any of these topics so that I would be able to give an advanced presentation. :-)


Moving on from dev2dev…

June 22, 2008

With the acquisition of BEA Systems by Oracle, I suppose that those who have blogged in the past on dev2dev should move their blogs and articles to Oracle’s Technology Network (OTN) web-site, or look for some personal alternative.

I have decided to do the latter. I will be importing all my published dev2dev blogs into this blog account.

Partially, I was enticed by the additional freedom one has when using one’s own blog account, such as being able to select extra widgets, like calendars, tag clouds, etc; hopefully this offsets the supposedly increased publicity had I opted on moving onward to OTN.


CEP, Rules Engines, and SQL

August 13, 2007

There has been a lot of discussions on the differences between CEP/ESP and rules engines, specifically regarding RETE implementations. I particularly like this entry from Tim Bass on the subject.

In my opinion, the very fact that we are comparing CEP/ESP and rules engines is a sign that these subjects are not yet very well understood.

Partly, the reason for the confusion is that CEP/ESP, rules engines, and RETE are at different levels of abstractions.

CEP/ESP is a technology. It is a collection of techniques, methodologies, and implementation procedures intended on solving a common set of problems of a domain, namely, in the case of CEP/ESP, (loosely speaking) to achieve online processing of streaming data (i.e. events) to support, among other things, real-time business decision.

Although CEP and ESP have different definitions, for the purpose of this article I will use both terms interchangeably.

There are several aspects to CEP: visualization of events, storage of events, programming model for the development of CEP applications, rule (or query) governance, the runtime processing engine, etc.

With respect to the runtime processing engine, there are several implementations options. However, before considering an engine implementation, one must first consider what language, or what language paradigm, should be surfaced to the user so that the user can program to this engine.

One option is to take the well-known imperative programming language paradigm and use any one of the popular procedural or object-oriented languages such as Java as the event processing language of choice that is surfaced to the user. For example, this is the approach taken when one implements their CEP application with hard-coded rules directly in Java.

Another option is to use a more data-driven approach. Considering that one is dealing with events, which is a type or extension of data, this seems an appropriate choice.

Within the realm of data-driven approaches, one option is to use SQL. SQL, needless to say, has proven to be a very convenient and pervasive domain-specific language for dealing with data, which allows one to leverage existing skill-sets. Another common option is to use logic programming. Note that there is an equally intriguing discussion around CEP and SQL currently going on.

Logic programming is a very interesting, but long subject. In a logic programming language, programs consist of a set of statements that specify what is true about a desired goal. This is different than in imperative languages (e.g. Java), where programs specify the set of statements that must be executed to achieve the desired goal.

More specifically, in logic programming, programs contain only logical statements. Logical statements are statements that are either true or false. Logical statements that are always true are called axioms. The role of a logic program is to infer new logical statements called theorems from the axioms through some automated algorithm. The automated algorithm used by the program is called its inference rules. Theorems are the goals that represent the desired results.

Pure logic programming is very powerful indeed. In pure logic programming, at no time one specifies how to derive the goals, that is, there are no control statements (e.g. if-then-else). Pure logic programming is also very hard to implement. Hence people came with simpler models that support sub-sets of the full logic programming paradigm. One of these is called First Order Predicate Calculus.

In First Order Predicate Calculus, logical statements are formed of constants, variables and their quantifiers (e.g. for all values, for any value), predicates, and connectives. Much of the expressiveness and power of First Order Predicate Calculus comes from the connectives. Common connectives are the logic operators AND, OR, NOT. Additional connectives are implication and equivalence. To say that A implies B, means that B is true when A is also true.

It turns out that even the restricted sub-set First Order Predicate Calculus is hard to implement. Thus people have further simplified it in attempts to facilitate implementations; Horn Clauses is the most commonly used sub-set of First Order Predicate Calculus. Horn Clauses do not support the specification of variable quantifiers, and of the equivalence operator, among other limitations. Prolog, probably the most widely used logic programming language, supports Horn Clauses.

Loosely speaking, a rule-based system or rule-based engine is an implementation of a logic programming language, usually based upon Horn Clauses, with some few procedural extensions. These systems are sometimes called production systems. In a production system, logical statements that are constants (i.e. always true or false) are called assertions, and compose what’s called the working memory. Logical statements that make use of the implication connective, that is, follow a if-then structure are called productions, and make up the production memory.

Rules engine execute by following a recognize-act algorithm; that is, as new logical statements are made available (i.e. inputted by the user), the engine matches (i.e. recognizes) the left-hand side of all productions against the contents of the working memory. This process results into a set of production instantiations. Next the engine executes (i.e. acts) the right-hand side, called actions, of the matched instantiated productions. The actions may result in new content in the working memory. Changes to the working memory causes the engine to re-enter the recognize-act cycle, which continues until no more matches occur.

One procedural extension to rules engines is to be able to directly manipulate the content of the working memory. This is done in the right-hand-side (i.e. actions) of the productions, where one may create, delete, and modify assertions present in the working memory.

Matching through thousands of productions against thousands of assertions can be quite expensive. A naive approach would cost O(productions * assertions). Matching can take around 90% of the processing time of a rules engine.

So let’s consider a way of optimizing this. Essentially what the engine is trying to do is to find out which assertions match against the left-hand-side of the productions, in another word, the end goal is to find a set of productions. What if we create a graph of nodes, where the nodes represent the conditions we are trying to match, and as we traverse the graph matching these nodes we eventually reach a terminal node representing the production for which the conditions apply?

For example, consider the following four productions P1, P2, P3, P4:

P1: if attr1 = ‘value1′ and attr2 = ‘value2′ then action1

P2: if attr1 = ‘value1′ and attr2 = ‘value3′ then action2

P3: if attr1 = ‘value4′ and attr2 = ‘value5′ then action3

P4: if attr1 = ‘value6′ and attr2 = ‘value7′ then action4

The corresponding matching graph is:

A RETE Network

Now consider that assertion A is inputted:

A1(attr1 = ‘value1′, attr2 = ‘value2′)

The first step is to match for attr1, which will take the engine to the child node n2, and then to match attr2 to the terminal node P1, which is our goal. This has a total of 4 comparisons in the worst-case. A (very) naive implementation would take 8 comparisons in the worst-case.

Although over-simplified, this is the general idea of the RETE (i.e. Latin for network) matching algorithm invented by Dr. Charles Forgy. RETE also has additional considerations for optimizing the handling of the binding of variables, of joins, and techniques for keeping the network of matching nodes up-to-date, among other things.

RETE is a very efficient algorithm for performing matching in productions systems. Does it look as if it is generic enough in its entirely or parts of it to be used in different contexts? Yes, it certainly does to me. In fact, we use some of the ideas, with combination of hash functions, in the implementation of some aspects of our CEP engine, which is based upon SQL and not on a logical programming language.

To put all of these in perspective, let’s look at an example. Let’s say you want to find out if a certain stock has changed its value more than 20% when compared to a previous value in the past. Consider that a stock is defined as:

Stock: SYMBOL, VALUE, TIMESTAMP

First, let’s start with a logic programming language. I will use Prolog in this case:

stockUp(X)  :-  stock(X, Price1, Time1),  stock(X, Price2, Time2),

                           Time2 > Time1, Price2 > (Price1 * 1.20).

If we feed the following assertions into the Prolog’s working memory:

stock(aaa, 10, 0).

stock(bbb, 15, 1).

stock(ccc, 10, 2).

stock(aaa, 12, 3).

stock(bbb, 14, 4).

stock(ccc, 13, 5).

We will get the following result:

stockUp(X = ccc)

Which means that the last stock assertion (i.e. ’stock(ccc,13,5)’) generates the creation of the new assertion ‘stockUp‘ for symbol ‘ccc’.

Next, let’s try with BEA EPL, which mostly follows the SQL idiom:

INSERT INTO StockUp

SELECT a.symbol FROM Stock a, Stock b RETAIN ALL EVENTS

WHERE a.symbol = b.symbol and (a.price > b.price * 1.20)

This EPL statement provides the same result. However, in this case, we will be feeding the engine with Java Bean events rather than assertions. So, instead of creating the assertion:

stock(aaa, 10, 0)

We will create the Java Bean:

new Stock(”aaa”, 10.0)

It is interesting how events and assertions map seamlessly to each other, at least for this simple scenario. Intuitively, this seems to make sense to me, as both are conceptually immutable and represent some known aspect of the problem domain.

Note how we specify RETAIN ALL EVENTS in the EPL query. Using rule-based systems as an analogy, the RETAIN clause provides a declarative way of managing the CEP engine’s working memory. In the case of this example, it is doing exactly the same thing as the Prolog program, that is, keeping all assertions forever, which is probably not a realistic situation.

Consider this next EPL statement:

INSERT INTO StockUp

SELECT symbol FROM Stock

RETAIN 10 EVENTS WITH LARGEST timestamp PARTITION BY symbol

WHERE price > (prev(1,price) * 1.20)

In the first EPL query, we were considering that the stock events were ordered in non-decreasing time within the stream. This is a fair assumption to make considering that we are dealing with streams executing within an underlying stream infrastructure; whereas we couldn’t really make this assumption with regards to logical statements, and hence the need of explicitly including the timestamp in the Prolog predicate clause. Nevertheless, for sake of comparison, let’s assume that the events are not ordered, which means that the EPL query itself has to take the ordering into account. This is done with the WITH LARGEST timestamp clause. This clause will keep the stock stream in ascending order of timestamp; in my opinion WITH LARGEST timestamp is more declarative than the logical counter-part timestamp2 > timestamp1.

Likewise, note how the second EPL statement does need to join on symbol, which is explicitly done by both the logical statement ’stock(X, Price1, Time1),  stock(X, Price2, Time2)‘ and by the first EPL statement ‘a.symbol = b.symbol‘. Instead, the clause PARTITION BY symbol will group the stocks by symbol and allow one to compare (in this case using the previous function) within just that group.

Needless to say, this second EPL query provides the same result as of the first one, but in a more declarative manner. The one semantic difference between these two is that the 2nd query only considers 10 events at a time (i.e. RETAIN 10 EVENTS), rather than all events (i.e. RETAIN ALL EVENTS). In another word, we are declaratively restricting the working memory to 10 events in this case.

Ok, so one may argue that rule-based systems are better suited to handle inference, which my simple example fails to demonstrate.

That is a valid point, so let’s consider the scenario where stocks are part of a cluster that moves together, that is, if several other stocks within the same cluster go up, it is likely that the remaining ones within the cluster will also go up in the near future.

A simple production rule for this scenario is:

stockUp(aaa) :- stockUp(bbb), stockUp(ccc)

If one creates the following new assertion:

stock(bbb, 25, 6).

One will get the following result, which is expected:

stockUp(X = bbb)

And, in addition, the engine would infer that:

stockUp(X = aaa)

Can we do something similar with EPL? Certainly, here is the query:

INSERT INTO StockUp

SELECT ‘aaa’ AS symbol

FROM StockUp a, StockUp b RETAIN 10 MINUTES

WHERE a.symbol = ‘bbb’ AND b.symbol = ‘ccc’

Again, it is expressed in terms of SQL joins, which arguably is a less natural form of expressing inference, but nonetheless correct.

Thankfully, EPL also supports a pattern matching idiom, in contrast to just a SQL idiom, which, in my opinion, provides better expressiveness for describing inference:

INSERT INTO StockUp

SELECT ‘aaa’ AS symbol

MATCHING StockUp(symbol = ‘bbb’) AND StockUp(symbol = ‘ccc’)

RETAIN 10 MINUTES

So which language is better? SQL-based languages? Rule-based languages?

In my opinion, the discussion around CEP technology is larger than the discussion around which language to use for its engine, and definitely larger than discussing which matching algorithm, RETE or otherwise, this engine should be built upon.

When discussing the CEP technology, I find it important to focus on use-cases.

Regarding programming languages, there is a place for more than one language in CEP, and, equally, one does not want to be restricted by any one language in particular.