“New Event-Processing Design Patterns using CEP” at edBPM

September 1, 2009

Next week in Germany there is a very compelling workshop on event-driven BPM.

Unfortunately I won’t be able to attend this conference for personal reasons; however, I (proudly) did have a paper accepted to it. A paper that has not been presented loses much of its value; nonetheless, I provide it here in an attempt to spawn some discussion and mitigate my absence from the workshop.


Language requirements for CEP and DEBS 2009

July 28, 2009

This year at DEBS (Distributed Event-Based Systems), there is an excellent tutorial on event processing languages (EPLs). I particularly like its categorization of the different styles of EPLs: “inference rules, ECA rules, agent oriented, SQL extensions, state oriented, imperative/script based”.

I was, however, a bit lost on the ‘why’. Why is it that there are so many different styles? Why is it that we need a different language for event processing to begin with? Or, more importantly, why the recent re-energization of event processing?

If I may try to address this, I believe the spike in interest around event technologies like CEP is to a large extent a reflection of two recent developments in the IT world:

  • The order-of-magnitude increase in the volume of events that are generated, and thus need to be processed, today.
  • The desire to process this high volume of events in real time.

The first item is an obvious side effect of increasing bandwidth, connectivity, and computational resources in today’s world.

The second item is less obvious. As markets become global and competition increases, businesses and their processes need to adapt and change ever more quickly and efficiently. This forces IT to move from an offline mode of analysis, where data can be stored and analyzed later, to an online mode, where data in the form of events must be processed in real time as they occur.

These two business requirements dictate, in a simplified form, the following language requirements for an EPL:

  1. To be able to cope with possibly infinite sequences of events, the programming language must provide facilities to bind the event sequence into a workable structure, for example, by creating windows on top of streams.
  2. To facilitate the reduction of the volume of events, the programming language must provide facilities to transform several events into a single summary event, that is, to aggregate simple events into complex events (hence the term ‘complex event processing’).
  3. As processing is executed in real time, time and temporal constraints must be first-class citizens of the programming language.
  4. Event processing with the intent of driving business processes can hardly be done in isolation, without contextual data; hence the programming language must facilitate the handling of events, the retrieval of (static) data, and the easy integration of the two.
  5. Alongside the previous item, finding the relationships amongst events is important; the programming language must allow the matching of complex relationship patterns, such as conjunctions, disjunctions, and negations (e.g. events A and B, A or B, not A); the non-presence of events; temporal matching (e.g. event A before event B); and correlation. In some respects, this is another example where several simple events are synthesized into fewer complex events.

Undoubtedly, I have left out several other important language requirements, such as expressiveness and computability; however, our focus is on CEP, so I will set these aside here, as there is extensive literature on them elsewhere.

Well, having established, and hopefully agreed upon, the language requirements, the next step is to map them into a real language implementation, which I attempt to do below using CQL.

Requirement 1 can be achieved in CQL through the definition of windows:

SELECT * FROM stream [RANGE 10 seconds]

In this example, we bind a window of 10 seconds to a previously unbounded stream of events.

Requirement 2 is supported in several ways, one of which is simply the aggregation of events into a new event, as follows:

SELECT avg(price) FROM stock-stream [RANGE 10 seconds] GROUP BY symbol

Per requirement 3, time is a fundamental piece of CQL, which supports both application time and system time. As one can notice in the first example, the window of events is defined in terms of time. There are other operators that take time as input.

For requirement 4, there is a need to join the stream with an external table that provides the contextual data:

SELECT event.symbol, data.full-name
FROM stock-stream [RANGE 10 seconds] AS event, stock-table AS data
WHERE event.symbol = data.symbol

The stock-table entity may live in a RDBMS, in a distributed cache, or in any other data-provider implementation.

Finally, requirement 5 is achieved through the match_recognize operator. The following example detects that a customer order has not been followed by a shipment within 10 seconds:

SELECT "DELAYED" AS alertType, orders.orderId
FROM salestream MATCH_RECOGNIZE (
  PARTITION BY orderId
  MEASURES
    CustOrder.orderId AS orderId
  PATTERN (CustOrder NotTheShipment*) DURATION 10 SECONDS
  DEFINE
    CustOrder AS (eventType = "ORDER"),
    NotTheShipment AS (NOT (eventType = "SHIPMENT"))
) AS orders

Overall, this is a simple and somewhat naive attempt to illustrate the point; nevertheless, I hope it provides some structure to the requirements of an event processing language.


Oracle CEP 11gR1 – official support for CQL

July 1, 2009

Today Oracle announced the release of Oracle CEP, version 11gR1, a product whose team I am gladly part of.

Oracle CEP 11gR1 is the next release after 10.3, which was the re-branding of BEA’s WebLogic Event Server.

There are several new features in this release, but the cornerstone is the inclusion of CQL (Continuous Query Language) as the default event processing language.

CQL is important in several aspects:

  • It brings us closer to converging on a standard language for event processing.
  • It is based on a solid theoretical foundation, leveraging relational calculus and extending it to include the concept of a stream of events.
  • It fully supports pattern matching, a new stream-only operator.
  • It is solidly engineered, including several query plan optimizations; for example, an unbounded stream that uses the insert stream operator is converted to a ‘NOW’ window using the relation-stream operator.
  • It abstracts away implementation details; for example, a relation can be bound to either an RDBMS table or a Coherence cache without any query re-write.

There are several other features worth mentioning as well; more on those soon…


Best-case analysis is bogus

June 21, 2009

One deals with worst-case analysis of algorithms on a day-to-day basis, especially when working with systems that have real-time constraints, such as CEP. As I’ve written in the past, real-time systems do not necessarily need to run fast; they can take days to finish, but if a real-time system states that it will finish a task in 1 hour, then it must. To be able to make such time assertions with certainty, the real-time system developer must know the worst-case time of all the algorithms that make up such a task.

As an example, what is the worst-case cost of inserting one element into a sorted linked list of size n? A naive approach is to start at the first position and compare against the elements of the linked list until finding one that is larger than the element being inserted, at which point the new element is spliced into that position. The worst-case scenario is when the element to be inserted is larger than every other element in the list, the cost then being n node traversals (i.e. n*c1) plus the cost of the actual insertion at the end (i.e. c2). Eliminating the constants, we arrive at an asymptotic worst-case cost of n.
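For concreteness, here is a minimal Java sketch of this insertion; the class and names are my own illustration, not taken from any particular library. The while loop is what degenerates into n traversals in the worst case:

class SortedLinkedList {
    private static class Node {
        int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    private Node head;

    // Inserts value keeping the list sorted; O(n) worst case, hit when value is the largest.
    void insert(int value) {
        if (head == null || value < head.value) { // insertion at the head
            head = new Node(value, head);
            return;
        }
        Node current = head;
        while (current.next != null && current.next.value <= value) {
            current = current.next; // up to n traversals when value is the largest
        }
        current.next = new Node(value, current.next); // splice in the new node
    }
}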

What about the best-case analysis of this algorithm?

Well, I believe that best-case analysis is rather bogus. It has very limited analytical and statistical value: not only does it add no insight to the analysis of the algorithm, but it also hardly represents a real-world case.

However, the biggest problem with best-case algorithm analysis is that the result can be manufactured. Testing for the best-case scenario means that you get to choose the best input, which allows the algorithm developer to hard-code the algorithm for this well-known best input. For example, the algorithm could do the following:

if (input.equals(bestInput)) return x; // constant-time shortcut for the chosen input

// otherwise do the proper work

In other words, you could have a very complex algorithm whose worst-case cost is n squared, but whose best-case cost, for this one selected input, is 1!

Is this valid? Well, it is a valid algorithm, but, in my opinion, a bogus analysis.


CEP as a Continuous Query Engine

June 10, 2009

CEP is commonly referenced as a continuous query engine.
What exactly is a continuous query engine? How is it different from a non-continuous query engine, such as a database query engine?

First and foremost, a CEP engine is not passively waiting for a customer request to execute work. Instead, the customer registers CEP queries with the engine, and the engine executes these queries even when there are no input events to be processed. Contrast this with the case of a database system, which waits for a user to send a query before processing the input data. This is the reason why CEP is sometimes called an inverted database:

In a database, the queries change often and the input (data) changes less often. In CEP, the input (events) changes often and the queries change less often.

Let’s consider different CEP scenarios, where in each one “continuous” has a slightly different meaning:

  • Filtering

In the case of queries performing plain filtering over each single input event, the CEP engine executes work only when a new event arrives, and is otherwise idle. If the flow of input events is high, one has the impression that the engine is continuously running.

  • Time-windows

In the case of queries that make use of time-windows, for example, a query that calculates the average price of stocks received in the last 10 seconds, the CEP engine must purge events out of the time-window as time progresses. In this sense, the engine is actually doing work even when there are no input events.
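As a minimal sketch of this purging behavior, assuming a simple advanceTime callback driven by a clock (all names here are my own illustration, not engine code), the eviction work happens on time progression regardless of input:

import java.util.ArrayDeque;

// Illustrative time-window: evicts events older than windowMillis as time advances.
class TimeWindow<E> {
    record Entry<T>(long timestampMillis, T payload) {}

    private final long windowMillis;
    private final ArrayDeque<Entry<E>> buffer = new ArrayDeque<>();

    TimeWindow(long windowMillis) { this.windowMillis = windowMillis; }

    void insert(long now, E event) {
        buffer.addLast(new Entry<>(now, event));
        advanceTime(now);
    }

    // Called on every clock tick, even with no input: this is the work done while "idle".
    void advanceTime(long now) {
        while (!buffer.isEmpty() && buffer.peekFirst().timestampMillis() <= now - windowMillis) {
            buffer.removeFirst(); // purge events that have fallen out of the window
        }
    }
}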

  • User-defined functions

Consider a query that makes use of a function that is not deterministic, for example, a function that returns the current time:

SELECT getCurrentTime() as time FROM stream

This is an interesting case: should the CEP engine output a new event only when a new input event arrives on the stream, or should it output a new event every time getCurrentTime() returns a new value? If we are true to the continuous aspect of a continuous query engine, then it seems the CEP engine should do the latter, that is, output a new event on every time progression.

Note that in this query the output event does not depend upon the content of the input event, unlike the following example, where it does:

SELECT getCurrentTime(), stream.property FROM stream [NOW]

  • State-space

Finally, let’s consider the state-space used by CEP engines, or, for that matter, by any continuous query engine. The state-space is the set of variables used internally by a system. If the processing is continuous, then one would assume that the variables used while processing the input values are also of a continuous nature, that is, composed of a vector of real (or complex) numbers and calculated through differential equations. Instead, this ends up not being the case: the state-space is more likely than not a set of discrete values, such as integers.

So, as it stands today, most continuous query engines have discrete state-spaces instead of continuous state-spaces.


Javaone 2009 CEP presentation

May 3, 2009

CEP is an innovative technology; it deals with two problems that are prominent and relevant today: real-time processing and high volumes of data. I argue that CEP is here to stay.

Nonetheless, one can’t say it has become mainstream yet.

I believe the reason is that CEP is not particularly simple; for instance, I am sure one will find several different definitions of the term CEP if one looks around. Partly, I think the culprit is that we’ve spent too much time trying to define what CEP is, rather than what it does. With that intent in mind, I will be presenting CEP Design Patterns at the upcoming J1 2009.

The idea is to focus on simple, practical problems around event processing, and then show how these problems can be solved with different CEP design patterns. The material is not yet finalized; however, so far I am planning on discussing the following patterns:

  1. Event filtering
  2. New event detection
  3. Partitioned event filtering
  4. Old event detection
  5. Event enrichment
  6. Event aggregation
  7. Batched event aggregation
  8. Event correlation
  9. Missing event detection

The current schedule is to present on Wed at 3:00. Hope to see you there.


CEP glossary

September 11, 2008

The Event Processing Technical Society (EPTS) has updated its glossary of CEP terms.

Having a common language is indeed one of the first steps towards being able to discuss a technology.

Fortunately enough, several of the terms are already well accepted and used. For instance, Oracle CEP (i.e. the former BEA Event Server) treats the following terms as first-class citizens of its programming model: Event Source, Event Sink, Stream, Processor (a.k.a. Event Processing Agent), Event Type, Event Processing Network, Event Processing Language, and Rule. This allows the user to author an event processing application by leveraging existing event processing design patterns: define the event types, define the sources of events, define the sinks of events, define the intermediate processors, connect these to form an EPN, and configure the processors with EPL rules. It becomes a simple enough task if one understands the glossary; a rough sketch of this assembly appears below.
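To illustrate that assembly flow, here is a hypothetical Java sketch; the interfaces and names are my own simplification for discussion, not the actual Oracle CEP programming model:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

interface Event {}                                        // an event type
interface EventSink { void onEvent(Event e); }            // consumes events
interface EventSource { void subscribe(EventSink sink); } // produces events

// A processor is both a sink (receives events) and a source (emits derived events).
abstract class Processor implements EventSource, EventSink {
    private final List<EventSink> downstream = new ArrayList<>();
    public void subscribe(EventSink sink) { downstream.add(sink); }
    protected void emit(Event e) { downstream.forEach(s -> s.onEvent(e)); }
}

// A trivial stand-in for an EPL rule: keep only events matching a predicate.
class FilterProcessor extends Processor {
    private final Predicate<Event> rule;
    FilterProcessor(Predicate<Event> rule) { this.rule = rule; }
    public void onEvent(Event e) { if (rule.test(e)) emit(e); }
}

Wiring an EPN is then just source.subscribe(processor) followed by processor.subscribe(sink).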

Nevertheless, I would like to suggest one additional term: a relation. To understand relations, one has to understand streams first.

The glossary defines a stream as “a linearly ordered sequence of events”. At first glance, one may think of a stream as the physical pipe, or connection, between sources and sinks. However, that is not entirely correct. Rather, an event channel, which is defined as “a conduit in which events are transmitted…”, represents the idea of a connection; a stream extends this idea and adds processing semantics to it: precisely, it states that a stream not only contains events, but that these events must be totally ordered.

Let’s consider an example. Consider an event type consisting of two attributes: an integer and a string.

The following sequence is a stream ordered by the first attribute: {1, “a”}, {2, “b”}, {3, “a”}.
Conversely, the following sequence is NOT a stream: {1, “a”}, {2, “b”}, {1, “a”}. In this latter case, the sequence of events is termed an event cloud. Streams are a type of event cloud.

This may seem rather odd and arbitrary, but this additional semantic is very helpful. For one thing, it allows CEP engines to optimize. Because streams are generally unbounded, i.e. they never end, one has to define a window of time (or length) on top of a stream to be able to do any interesting computation. Having an ordered sequence allows the engine to progress the computation for a window without having to keep old values forever.

Considering the example at hand, let’s say one wants to calculate the number (i.e. count) of similar strings in the last 3 seconds, and that the first attribute provides the time-stamp of the event in seconds.

If the events were not ordered in time, how would the engine know when it is safe to output the correct result?

Consider that time starts at t = 0 and the following input: {2, “a”}, {3, “a”}.

Should we wait for the event {1, ?} before outputting the result? Was this event lost, or merely delayed? What if we don’t wait and output the result, and then the event {1, ?} ends up arriving afterward: should we output the result again?

As can be noted, the conceptual model gets very complicated. Better to keep the conceptual model simple and move this complexity elsewhere, such as into an adapter that re-orders events using some time-out strategy.
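To make the optimization concrete, here is a small Java sketch, entirely my own illustration, of the per-string count over the last 3 seconds; because events arrive ordered by timestamp, anything older than the window can be evicted and never reconsidered:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Counts occurrences of each string over the last 3 seconds of a time-ordered stream.
class SlidingCount {
    record Event(long seconds, String value) {}

    private final ArrayDeque<Event> window = new ArrayDeque<>();
    private final Map<String, Integer> counts = new HashMap<>();

    // Precondition: events arrive ordered by their timestamp.
    void onEvent(Event e) {
        // Ordering guarantees no event with timestamp <= e.seconds() - 3 can still arrive,
        // so the evicted state can be discarded for good.
        while (!window.isEmpty() && window.peekFirst().seconds() <= e.seconds() - 3) {
            Event old = window.removeFirst();
            if (counts.merge(old.value(), -1, Integer::sum) == 0) {
                counts.remove(old.value());
            }
        }
        window.addLast(e);
        counts.merge(e.value(), 1, Integer::sum);
    }

    int countOf(String value) { return counts.getOrDefault(value, 0); }
}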

A “relation” is another type of event cloud. It, too, is a totally ordered set of events; in addition, each event has an attribute that denotes whether it is an insert, delete, or update event. These three kinds of events allow one to model a table or, more precisely, an instantaneous finite bag of tuples at some instant of time.

Consider a table whose rows contain a single column of type string.

The sequence of events {+, 1, “a”}, {+, 2, “b”} creates the table { {“a”}, {“b”} } at time t = 2.

Then the following event arrives: {-, 3, “a”}. This results in the table being updated to { {“b”} } at time t = 3.
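A minimal Java sketch of these mechanics, with names of my own choosing, applies insert and delete events to a bag in order to maintain the instantaneous table:

import java.util.HashMap;
import java.util.Map;

// Maintains the instantaneous bag of tuples described by a relation's event sequence.
class Relation {
    enum Kind { INSERT, DELETE }
    record RelationEvent(Kind kind, long time, String tuple) {}

    private final Map<String, Integer> bag = new HashMap<>(); // tuple -> multiplicity

    void apply(RelationEvent e) {
        int delta = (e.kind() == Kind.INSERT) ? 1 : -1;
        bag.merge(e.tuple(), delta, Integer::sum);
        bag.values().removeIf(count -> count <= 0); // drop tuples that were fully deleted
    }
}

Applying the three events above leaves the bag holding only “b”, matching the table at t = 3.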

Keep in mind that a relation, in the context of event processing, is, similarly to a stream, still a sequence of streaming events, which pass through some event channel. However, differently from a stream, these events carry additional semantics that allow one to represent actual finite tables.

Why is this needed? It is very common for events to be enriched with additional data. Sometimes this data is very dynamic, in which case the enrichment is modeled as a join between two streams; sometimes this data is somewhat static, changing less often, in which case it is better modeled as the join between a stream and a relation. If you think of it in these terms, it almost seems as if a stream, seen as containing only insert events, is a sub-type of relation.

Why call it a “relation”? Mainly because this is the term used by CQL, the foundational work on data stream management.

The CEP glossary does an excellent job of setting up the base model and extending it to event patterns. With “relation”, we complete the model by embracing the data stream management piece of it.


An OSGi-Powered Database System

August 17, 2008

In the article One size fits all: A concept whose time has come and gone, Stonebraker comments that in the past some RDBMSs advocated the idea of a single solution to solve all database management needs.

He further points out that data management needs have grown since the inception of the technology in the 1970s. Database systems started off focusing on OLTP, that is, transaction-intensive (particularly update-intensive) applications, moved on to OLAP, which is analytical, that is, read-intensive in nature, and haven’t stopped since. Other interesting examples of new application areas related to data management are stream processing, web search and indexing, XML management, and local and distributed caches.

Today, database systems are often taught in computer science courses, to the point that it is common to sub-divide their study into separate functional areas, such as:

  • Query compiler
  • Query optimizer
  • Transaction manager
  • Concurrency manager
  • Index manager
  • Buffer (Memory) manager
  • Storage manager
  • Logging and recovery manager

Consider a DBMS that is implemented by modularizing each one of these functional areas, for example, by packaging each functional area as a separate OSGi bundle; a sketch follows below.
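As a hedged sketch of what that packaging could look like, here is a hypothetical bundle activator that publishes one functional area as an OSGi service. The QueryOptimizer interface is an assumption of mine, while BundleActivator, BundleContext, and registerService are standard OSGi framework API:

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

// Hypothetical functional-area contract; each area would define one such service interface.
interface QueryOptimizer {
    String optimize(String logicalPlan);
}

// Activator for a query-optimizer bundle: publishes the service on start, retracts it on stop.
public class QueryOptimizerActivator implements BundleActivator {
    private ServiceRegistration<QueryOptimizer> registration;

    public void start(BundleContext context) {
        QueryOptimizer optimizer = plan -> plan; // trivial stand-in implementation
        registration = context.registerService(QueryOptimizer.class, optimizer, null);
    }

    public void stop(BundleContext context) {
        registration.unregister();
    }
}

Swapping in a stream-algebra optimizer, or dropping the storage manager altogether, then becomes a matter of installing or omitting bundles rather than rebuilding the engine.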

This DBMS could then be easily configured for different database management needs. Let’s revisit the application areas mentioned previously and see how modularization could help us achieve them:

  • Stream processing: extend the query compiler and optimizer to handle a stream-based algebra; drop the transaction and storage managers, as there is no atomicity nor durability across events.
  • Web search: use an index manager based upon inverted indexes instead of B-trees.
  • Distributed cache: distribute the logging and recovery manager.

Of course, this is an over-simplification of the challenges and work involved; however, it does strike me that most of the components already exist. It is just a matter of modularizing and re-using them correctly to assemble the next database management application…

Best Complex Event Processing Solution

July 15, 2008

Recently, Waters published its rankings of the best solutions and services for the year 2008.

I am very glad to see WebLogic Event Server, re-branded as Oracle CEP, as the winner of the best complex event processing (CEP) solution.

There is still plenty for us to do, but I do think we have come a long way in the past two or three years, and we have constantly tried to innovate, both at the container level as well as at the programming-model level.

Interestingly enough, there is a separate Best Streaming Data Management Solution category, which was awarded to the company StreamBase.

Personally, I do think there is an implementation difference between streaming data management systems (SDMS), whose roots come from DBMS technology, and complex event processing (CEP) systems, a term I believe was coined by David Luckham, which focus on event relationships (e.g. causality, aggregation). The key phrase is implementation difference, as there is a large overlap in the use cases that both address.

Regardless, I find it intriguing that Waters not only does not state the differences between the categories, but also uses the term CEP several times in the SDMS category.

I guess the verdict is that there is still confusion amongst the experts regarding event and stream processing… And that both products must be very good.


A Gentle Introduction to OSGi

June 23, 2008

Tomorrow, June 24, Ian Skerrett is organizing a day of Eclipse- and OSGi-related technical sessions at Google in Mountain View, CA.

I will be presenting a gentle introduction to OSGi, focusing on the differences between using OSGi’s framework APIs and Spring-DM.

On a related note, I also presented another gentle introduction to OSGi at the 2008 OSGi Community Event in Berlin on June 11; at that event, however, the focus was on how WebLogic Event Server is built upon OSGi technology.

I have high hopes that someday I will know enough about one of these topics to be able to give an advanced presentation. 🙂