Detecting Fraud using Event Processing

September 5, 2013

Let’s consider a simple fraud detection scenario, one which I have run into with customers in the past.

Say we want to detect when a particular credit card customer spends more than 10000 dollars in a rolling 24-hour period.

Offhand, this appears to be a simple scenario to implement. Let’s get to it. Intuitively, this seems like a good use-case for pattern matching queries:

SELECT
  M.clientId, M.txId, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
  PARTITION BY cardNumber
  MEASURES cardNumber AS clientId, transactionId AS txId
  PATTERN (A)
  DEFINE A AS price > 10000
) AS M

We can break down this query into the following steps:

  1. Partition the events coming through the credit-card transaction stream by cardNumber.
  2. Check whether the price of the transaction (event) is greater than 10000 dollars. We call this pattern A or, more precisely, the correlation variable A.
  3. When the pattern is detected, a process called matching, specify how to represent the matched pattern. These representations are called the measures of the pattern. In this particular case, two measures are defined: one identifying the card number and another the transaction ID.
  4. The final step is to output the two measures (i.e. cardNumber, transactionId) and a status string set to 'ALARM', indicating that the customer has spent more than 10000 dollars.

This pattern matching query can identify a single transaction that is above the threshold, but not multiple transactions that only collectively cross it.

For example, consider the following table of input and output events:

Time (h) | Input                                          | Output
0        | {cardNumber=10, transactionId=1, price=11000}  | {clientId=10, txId=1, 'ALARM'}
1        | {cardNumber=10, transactionId=2, price=5000}   |
2        | {cardNumber=10, transactionId=3, price=6000}   |

In this case, note how there is no output at time t = 2h even though transactions 2 and 3 collectively do cross the 10K threshold.

The following query addresses this issue:

SELECT
  M.clientId, M.txId, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
  PARTITION BY cardNumber
  MEASURES cardNumber AS clientId, transactionId AS txId
  PATTERN (A+? B) WITHIN 24 HOURS
  DEFINE B AS SUM(A.price) + B.price > 10000
) AS M

One way to look at pattern matching is to consider it as a state diagram, where each correlation variable (e.g. A) represents a state in the state diagram. Typically, you need at least two states: the starting state and the accepting state (i.e. the end state). In this particular scenario, state A represents the individual transactions and state B represents the condition where collectively the events cross the 10K threshold.

However, there are a few caveats. For example, could we get by with a single state (i.e. A) using the expression 'A AS SUM(A.price)'? No, we can't define a variable in terms of an aggregation of itself, hence the reason we created variable B, which is then defined as the aggregation of A. Because we intend to aggregate several events as part of A, A must be defined as a collection, that is, a correlation group variable. In this case, we do this by using the regular expression operator '+', as in 'A+'. This indicates that the correlation variable A should attempt to match one or more events.

Further, why do we need to add 'B.price' to 'SUM(A.price)', as in the expression 'B AS SUM(A.price) + B.price'? This is needed because the ending event itself is a transaction, and thus its price needs to be considered as part of the equation. If we didn't do that, the price of the last received event would be ignored.

Why do we need the token '?', as in the expression 'A+?' in the query? Typically, correlation group variables, as is the case for A, are greedy; that is, they try to match as many events as possible. However, because A and B are similar, if we keep A greedy, then B would never match. To avoid this situation, we use the token '?' to indicate that A should be reluctant, matching the minimal set of events possible and therefore also allowing B to match and accept the final sequence, as the sketch below illustrates.
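To make the difference concrete, here is a sketch of the two forms (the annotations after each pattern are my explanatory comments, not CQL syntax):

PATTERN (A+ B)   -- greedy: A consumes as many events as it can, so B never gets a chance to match
PATTERN (A+? B)  -- reluctant: A consumes as few events as possible, letting B match and accept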

Finally, note that the clause 'WITHIN 24 HOURS' specifies the window of interest. This means that the sum of the events is always constrained to a rolling 24-hour window. Here is a state diagram that represents the query:

[Figure: state diagram of the pattern matching query]

Using this new query, we get a proper alarm output when the query receives the third event (i.e. transactionId = 3). However, there is one new problem: this query no longer outputs an alarm when we receive the first event (i.e. transactionId = 1), whose price alone already crosses the threshold.

Why is that? The problem is that the new query expects at least two events, one that matches variable A and another to match variable B. We can fix this by introducing an alternation in our regular expression, as follows:

PATTERN (C | (A+? B)) WITHIN 24 HOURS
DEFINE B AS SUM(A.price) + B.price > 10000, C AS price > 10000

In this case, either we match the expression '(A+? B)' or the expression 'C', where C is simply the case of a single event whose price already crosses the threshold. Note that we need to specify C as the first variable in our pattern; otherwise the query would only consider the '(A+? B)' expression and never have a chance to match 'C'. Also note that any variable that is not defined, as is the case for A, is considered true for all events.
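Assembled from the fragments above, the complete intermediate query would look like this:

SELECT
  M.clientId, M.txId, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
  PARTITION BY cardNumber
  MEASURES cardNumber AS clientId, transactionId AS txId
  PATTERN (C | (A+? B)) WITHIN 24 HOURS
  DEFINE B AS SUM(A.price) + B.price > 10000, C AS price > 10000
) AS M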

Are we done?

Not yet, consider the case where we want to output the total sum of the transactions that are crossing the threshold. The introduction of an alternation makes this a bit harder, as the query is either matching C or A and B. How can we solve this?

We can address this by using subsets. Subsets are correlation variables that represent the union of other correlation variables. Take a look at the modified query:

SELECT M.clientId, M.total, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
  PARTITION BY cardNumber
  MEASURES cardNumber AS clientId, SUM(ABC.price) AS total
  PATTERN (C | (A+? B)) WITHIN 24 HOURS
  SUBSET ABC = (A, B, C)
  DEFINE B AS SUM(A.price) + B.price > 10000, C AS price > 10000
) AS M

As you can see, pattern matching queries give us a lot of power, but this power comes at a price: one needs to consider the starting conditions, the accepting conditions, and all the corner-case scenarios, which are never trivial, even for apparently simple use-cases.

It also highlights a crucial but subtle point: as much as one tries to create black-box solutions for event processing scenarios, real-world use-cases are always more complicated and more subtle than originally conceived, and for this reason there will always be a place for computationally rich, domain-specific event processing languages.


New book on Event Processing and EPL theory

July 4, 2013

In March of this year, I had the pleasure of publishing my second book: Getting Started with Oracle Event Processing 11g.

As I learned the hard way, authoring a book is really hard stuff, so this time around I had the satisfaction of working with two co-authors, Robin Smith and Lloyd Williams, both outstanding product managers at Oracle.

Although the book is focused primarily on Oracle’s Event Processing product (version 11g), it also includes a lot of material on the underlying fundamentals and concepts of (complex) event processing technology, so it should be interesting to event processing enthusiasts in general, particularly those interested in learning the theory behind event processing languages.

The first two chapters provide a very good landscape of the market for event processing, describing along the way a few important use-cases that are addressed by this technology. Chapters 3 and 4 describe how to get events in and out of the system, and how to model the system using the concept of an Event Processing Network (EPN).

Chapters 5, 8, and 11 provide a very deep description of Oracle’s CQL (Continuous Query Language). Amongst other things, they get into several interesting and advanced topics:

  • The conceptual differences between streams and relations;
  • How different timing models, such as application-based and system-based time models, influence the results;
  • A formal explanation of how relational algebra in SQL is extended to support streams;
  • How to implement interesting pattern matching scenarios, such as missing events and the W pattern; and
  • How CQL is extended to support JDBC, Java, and Spatial technology, allowing one to process events not only in time, but also in terms of location.

Chapters 6, 7, and 9 describe how to manage the overall system, in terms of both configuration and performance, and how to scale up and scale out, explaining in particular how a Data Grid can be used in conjunction with event processing to greatly improve scalability and fault tolerance. Finally, chapters 10 and 12 tie everything together with a case study and a discussion of future directions.

I hope you will have as much fun reading this book as I had writing it.

If you have any questions along the way, feel free to send them to me.


Answer to CEP quiz: streams and relations

January 4, 2012

Sorry for the long delay in posting an answer to last week’s, or rather, last year’s quiz. It is funny how time seems to stretch out sometimes and a few days turn into weeks.

Well, let’s revisit the queries from the last post:

  1. SELECT * FROM C1[NOW]
  2. ISTREAM (SELECT * FROM C1[NOW])
  3. DSTREAM (SELECT * FROM C1[NOW])

In the first query, the answer is that at time t = 2 the CACHE is empty. Why empty and not 1?

To understand this, consider the following sequence of events:

  • At time t = 0, the NOW (window) operator creates an empty relation.
  • At time t = 1, the NOW operator converts the stream with the single event { p1 = 1 } to a relation containing a single entry { p1 = 1 }. The result of the NOW operator is a relation, and hence in the absence of other conversion operators, the query likewise outputs a relation. As CEP deals with continuous queries, the best way to represent the difference between the empty relation at time t = 0 and the relation at time t = 1 is to output the insertion of the entry { p1 = 1 }, or in other words, an insert event { p1 = 1 }. The CACHE receives this insert event and puts the entry { p1 = 1 } into it.
  • At time t = 2 (or more precisely at the immediate next moment after t = 1), the NOW operator outputs an empty relation, as the event e1 has moved on from the input stream. The difference between the relation at t = 1 and the relation at t = 2 is the deletion of the entry { p1 = 1 }, therefore the query outputs the delete event { p1 = 1 }. The CACHE receives this delete event, and consistently removes the entry { p1 = 1 }, leaving the cache empty.

Next, let’s consider the second query. In this case, the answer is that at the end the CACHE contains a single entry with the value of 1.

Let’s explore this. In this query, we are using an ISTREAM operator after the NOW operator. ISTREAM converts the relation into a stream by keeping the insert events. This means that at time t = 1, the insert event being output from the NOW operator is converted into a stream containing the single event { p1 = 1 }. The CACHE receives this event and stores it. Next, at time t = 2, the delete event output from the NOW operator is ignored (dropped) by the ISTREAM (conversion) operator and never makes it into the CACHE.

The answer for the third query is likewise that at the end the CACHE contains the single entry of 1. The rationale is similar to that of the previous case, though off by one time tick.

At time t = 1, the insert event being output from the NOW operator is ignored by the DSTREAM operator; however, the delete event output at time t = 2 is used and converted to a stream. The conversion is simple: the delete event from the relation becomes an insert event in the stream, as streams only support inserts anyway. The CACHE then picks up this insert event and stores it. Just keep in mind that for this third query this happens at time t = 2, rather than at time t = 1 as in the second query.

Here is a quick summary:

Query                              | CACHE content at t = 2
SELECT * FROM C1[NOW]              | empty
ISTREAM (SELECT * FROM C1[NOW])    | { p1 = 1 }
DSTREAM (SELECT * FROM C1[NOW])    | { p1 = 1 }

I would like to thank those people who pinged me, some of them several times, to gently remind me to post the answer.


A CEP Quiz: streams and relations

October 28, 2011

Last week, we were training some of Oracle’s top consultants and partners on CEP.

The training took place in Reading, UK (near London). Beautiful weather, contrary to what I was told is the common English predicament for October.

At the end, we gave a quiz, which I am reproducing here:

Consider an input channel C1 of the event type E1, defined as having a single property called p1 of type Integer.

An example of events of type E1 are { p1 = 1 } and { p1 = 2 }.

This input channel C1 is connected to a processor P, which is then connected to another (output) channel C2, whose output is sent to cache CACHE. Assume CACHE is keyed on p1.

Next, consider three CQL queries as follows which reside on processor P:

  1. SELECT * FROM C1[NOW]
  2. ISTREAM (SELECT * FROM C1[NOW])
  3. DSTREAM (SELECT * FROM C1[NOW])

Finally, send a single event e1 = { p1 = 1 } to C1.

The question is: what should be the content of the cache at the end for each one of these three queries?

To answer this, a couple of points need to be observed.

First, as I have mentioned in the past, CEP deals with two main concepts: that of a stream and that of a relation.

A stream is a container of events, which is unbounded and only supports inserts. Why only inserts? Well, because there is no such thing as a stream delete; think about it, how could we delete an event that has already happened?!

A relation, on the other hand, is a container of events that is bounded to a certain number of events at any point in time. A relation supports inserts, deletes, and updates.

Second, remember that a cache is treated like a table or, more precisely, like a relation, and therefore supports insert, delete, and update operations. If the query outputs a stream, then the events inserted into the stream are mapped to inserts (or puts) into the cache. If the query outputs a relation, then inserts into the relation are likewise mapped to puts into the cache; however, a delete on the relation becomes a remove of the corresponding entry in the cache.

Third, keep in mind that the operations ISTREAM (i.e. insert stream) and DSTREAM (i.e. delete stream) convert relations to streams. The former converts relation inserts into stream inserts, but ignores the relation updates and deletes. The latter converts relation deletes into stream inserts, and ignores relation inserts and updates (in reality, things are a bit more complicated, but let’s ignore the details for the time being).

Fourth, we want the answer as if time has moved on from ‘now’. For all purposes, say we are measuring time in seconds, we send event e1 at time 1 second, and we want the answer at time 2 seconds.

I will post the answer in a follow up post next week.

The crucial point of this exercise is to understand the difference between two of the most important CEP concepts, the STREAM and the RELATION, and how they relate to each other.


Blending Space and Time in CEP

July 31, 2011

Space and time are two dimensions that are increasingly important in today’s online world. It is therefore no surprise that the blending of CEP and spatial technology is both natural and ever more important.

Recently at DEBS 2011, I presented our work on integrating Oracle Spatial with Oracle CEP, where we seamlessly reference spatial functions and types in CQL (i.e. our event processing language). This allows us to implement geo-fencing and telematics with the real-time responsiveness of CEP.

For example, consider the following query:

SELECT shopId, customerId
FROM Location-Stream [NOW] AS loc, Shop
WHERE contains@spatial(Shop.geometry, loc.point)

A few points worth noting:

  • Location-Stream is a stream of events containing the customer’s location as a point, perhaps being emitted by a GPS.
  • Shop is a relation defining the physical location of shops as a geometry.
  • The contains predicate function verifies whether a point (event) from the stream is contained by any of the geometries (rows) of the relation.

This single query selects an event in time (i.e. now) and joins it with a spatial table!

[Figure: joining a point stream with a geometry relation]

The join happens in memory, aided by an R-tree (i.e. region-tree) data structure, which is also provided by the spatial library, or, as we call it, the spatial cartridge.

Further, as detailed in the presentation, CQL accomplishes this in a pluggable form using links and cartridges, but that is the subject of a future post…


New Point Release for Oracle CEP 11gR1

May 6, 2010

This week, Oracle announced the release of Oracle CEP 11gR1 11.1.1.3.

Even though it is a point release, there are noteworthy improvements and features:

Integration of CQL with Java

CQL (like any other event processing language) allows the authoring of event processing applications at a higher level of abstraction. This makes it less suitable for low-level tasks, such as String manipulation and other programming-in-the-small problems, and it lacks the rich libraries of general-purpose programming languages (e.g. Java), which have been built up over many years of usage.

In this new release of Oracle CEP, we solve this problem by fully integrating the Java programming language into CQL. This is done at the type-system level, rather than through User-Defined Functions or call-outs, allowing the usage of Java classes (e.g. constructors, methods, fields) directly in CQL in a blended form.

[Figure: CQL and Java example]

In this example, we make use of the Java class Date, by invoking its constructor, and then we invoke the instance method toString() on the new object.
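The figure is not reproduced here, but the query would be along these lines (a sketch; the stream name eventStream is hypothetical, and it assumes the cartridge allows direct constructor invocation, as described above):

SELECT java.util.Date().toString() AS creationTime
FROM eventStream [NOW]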

The JDK has several useful utility classes, such as Date, String, and regular expressions, making it a perfect complement to CQL.

Integration of CQL and Spatial

Location tracking and CEP go hand-in-hand. One example of a spatial-related CEP application is automobile traffic monitoring, where the automobile location is received as a continuous event stream.

Oracle CEP now supports the direct usage of spatial types (e.g. Geometry) and spatial functions in CQL, as shown by the next example, which verifies if “the current location of a bus is contained within a pre-determined arrival location”.

 

[Figure: CQL and Spatial example]
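The figure is not reproduced here; modeled on the geo-fencing query from the “Blending Space and Time in CEP” post above, the query would be something along these lines (the stream and relation names are hypothetical):

SELECT busId, "ARRIVED" AS status
FROM BusPositionStream [NOW] AS loc, ArrivalStation
WHERE contains@spatial(ArrivalStation.geometry, loc.point)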

One very important aspect of this integration is that indexing of the spatial types (e.g. Geometry) is also handled in the appropriate form. In other words, not only is a user able to leverage the spatial package, but OCEP also takes care of using the right indexing mechanism for the spatial data, such as an R-tree instead of a hash-based index.

High-Availability Adapters

CEP applications are characterized by their quick response time. This is also applicable to highly available CEP applications, hence a common approach for achieving high availability in CEP systems is to use an active/active architecture.

In the previous release of OCEP, several APIs were made available for OCEP developers to create their active/active HA CEP solutions.

[Figure: HA OCEP application]

In this new release, we take a step further and provide several built-in adapters to aid in the creation of HA OCEP applications. Among these are HA adapters that synchronize time in the upstream portions of the EPN and synchronize events in the downstream portions of the EPN, as illustrated in the previous figure.

Much More…

There is much more; please check the documentation for the full set of features and their details, but here are a few other examples:

  • Visualizer’s Event Injection and Trace, which allows a user to easily and dynamically send events into, and receive events from, a running application without having to write any code
  • Management of the revision history of an OCEP configuration
  • Deployment of OCEP application libraries, facilitating re-use of adapters, JDBC drivers, and event-types
  • Support for a WITHIN clause in CQL’s pattern matching to limit the amount of time to wait for a pattern to be detected
  • Aliases in CQL, facilitating the creation and management of complex queries
  • Support for TABLE functions in CQL, allowing a query to invoke a function that returns a full set of rows (i.e. a table)

Dealing with different timing models in CEP

March 20, 2010

Time can be a subtle thing to understand in CEP systems.

In the following, we have a scenario that yields different results depending on how one interprets time.

First, let’s define the use-case:

Consider a stream of events arriving at a rate of one event every 500 milliseconds. Each event contains a single value property of type int. We want to calculate the average of this value property for the last 1 second of events. Note how simple the use-case is.

Is the specification of this use-case complete? Not yet: so far we have described the input and the event processing (EP) function, but we have not specified when to output. Let’s do so: we want to output every time the calculated average value changes.

As an illustration, the following CQL implements this EP function:

ISTREAM( SELECT AVG(value) AS average FROM stream [RANGE 1 second] )

The final question is: how should we interpret time? Say we let the CEP system timestamp the events as they arrive using the wall clock (i.e. CPU clock) time. We shall call this the system-timestamped timing model.

Table 1 shows the output of this use-case for a set of input events when applying the system-timestamped model:

What’s particularly interesting in this scenario is the output event o4. A CEP system that supports the system-timestamped model can progress time as the wall clock progresses. Let’s say that our system has a heartbeat of 300 milliseconds. This means that at time 1300 milliseconds (i.e. i3 + heartbeat) the CEP system is able to automatically update the stream window by expiring the event i1. Note that this only happens when the stream window is full and thus events can be expired.

Next, let’s assume that time is defined by the application itself, which is done by including a timestamp property in the event. Let’s look at what happens when we input the same set of events at exactly the same times as when we were using the wall clock:

Interestingly enough, now we only get four output events; that is, event o4 is missing. When time is defined by the application, the CEP system itself does not know how to progress time. In other words, even though the wall clock may have progressed several minutes, the application time may not have changed at all. This means that the CEP system will only be able to determine that time has moved when it receives a new input event with an updated timestamp property. Hence we don’t get a situation like the previous case, where the CEP system itself was able to expire events automatically.

In summary, in this simple example we went through two different timing models: system-timestamped and application-timestamped. Timing models are very important in CEP, as they allow great flexibility; however, you must be aware of the caveats.


Language requirements for CEP and DEBS 2009

July 28, 2009

This year at DEBS (Distributed Event-Based Systems), there is an excellent tutorial on event processing languages (EPLs). I particularly like its categorization of the different styles of EPLs: “inference rules, ECA rules, agent oriented, SQL extensions, state oriented, imperative/script based”.

I was, however, a bit lost on the ‘why‘. Why is it that there are so many different styles? Why is it that we need a different language for event processing to begin with? Or, more importantly, why the recent re-energization of event processing?

If I may try to address this, I believe the spike in interest around event technologies like CEP is to a large extent a reflection of two recent developments in the IT world:

  • The order-of-magnitude increase in the volume of events that are generated and thus need to be processed today.
  • The desire to process these high volumes of events in real-time.

The first item is an obvious side effect of increasing bandwidth, connectivity, and computational resources in today’s world.

The second item is less obvious. As markets become global and competition increases, businesses and their processes need to adapt and change ever more quickly and efficiently, forcing IT to move from an offline mode of data analysis, where data can be stored and then analyzed, to an online mode, where data in the form of events needs to be processed in real-time as it occurs.

These two business requirements dictate, presented here in a simplified form, the following language requirements on an EPL:

  1. To be able to cope with possibly infinite sequences of events, the programming language must provide facilities to bound the event sequence into a workable structure, for example, by creating windows on top of streams.
  2. To facilitate the reduction of the volume of events, the programming language must provide facilities to transform several events into a single summary event, that is, to aggregate simple events into complex events (and hence the term ‘complex event processing’)
  3. As processing is executed in real-time, “time” and temporal constraints must be first-class citizens of the programming language.
  4. Event processing with the intent of driving business processes hardly can be done in isolation without contextual data; hence the programming language must facilitate the handling of both events and the retrieval of (static) data and easy integration of the two.
  5. Alongside the previous item, finding the relationships amongst events is important; the programming language must allow the matching of complex relationship patterns, such as conjunctions, disjunctions, and negations (e.g. events A and B, A or B, not A); non-presence of events; temporal matching (e.g. event A before B); and correlation. In some respects, this is another example where several simple events are synthesized into fewer complex events.

Undoubtedly, I have left out several other important language requirements, such as expressiveness and computability; however, our focus is on CEP, so we will set these aside, as there is extensive literature on them elsewhere.

Well, having established and hopefully agreed upon the language requirements, the next step is to map them into a real language implementation, which I attempt to do next using CQL.

Requirement 1 can be achieved in CQL through the definition of windows:

SELECT * FROM stream [RANGE 10 seconds]

In this example, we bind a window of 10 seconds to a previously unbounded stream of events.

Requirement 2 is supported in several ways, one of which is simply the aggregation of events into a new event, as follows:

SELECT avg(price) FROM stock-stream [RANGE 10 seconds] GROUP BY symbol

Per requirement 3, time is a fundamental piece of CQL, which supports both application time and system time. As one can notice in the first example, the window of events is defined in terms of time. There are other operators that take time as input.

In requirement 4, there is a need to join the stream with an external table that provides the contextual data:

SELECT symbol, full-name
FROM stock-stream [RANGE 10 seconds] AS event, stock-table AS data
WHERE event.symbol = data.symbol

The stock-table entity may live in a RDBMS, in a distributed cache, or in any other data-provider implementation.

Finally, requirement 5 is achieved through the match_recognize operator. The following example detects that a customer order has not been followed by a shipment within 10 seconds:

SELECT "DELAYED" AS alertType, orders.orderId
FROM salestream
MATCH_RECOGNIZE (
  PARTITION BY orderId
  MEASURES CustOrder.orderId AS orderId
  PATTERN (CustOrder NotTheShipment*) DURATION 10 SECONDS
  DEFINE
    CustOrder AS (eventType = 'ORDER'),
    NotTheShipment AS (NOT (eventType = 'SHIPMENT'))
) AS orders

Overall, this is a simple and somewhat naive attempt to illustrate the point, but, nevertheless, I hope it provides some structure to the requirements of an event processing language.


Oracle CEP 11gR1 – official support for CQL

July 1, 2009

Today Oracle announced the release of Oracle CEP, version 11gR1, and I am glad to be part of the team behind it.

Oracle CEP 11gR1 is the next release after 10.3, which was the re-brand of BEA’s WebLogic Event Server.

There are several new features in this release, but the flagship one is the inclusion of CQL (Continuous Query Language) as the default event processing language.

CQL is important in several aspects:

  • It brings us closer to converging towards a standard language for event processing
  • It is based on a solid theoretical foundation, leveraging relational calculus and extending it to include the concept of streams of events
  • It fully supports pattern matching, a new stream-only operator
  • It reflects solid engineering, including several query plan optimizations; for example, an unbounded stream queried through the insert stream operator is converted to a ‘NOW’ window using the relation-stream operator (see the sketch after this list)
  • It abstracts the implementation; for example, a relation can be bound to either an RDBMS table or a Coherence cache, without any query re-write
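A sketch of that rewrite, using a hypothetical stream S (this is my paraphrase of the optimization described above, not official documentation):

ISTREAM (SELECT * FROM S [RANGE UNBOUNDED])

can be evaluated as the equivalent

RSTREAM (SELECT * FROM S [NOW])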

Other features worth mentioning are:

And more coming soon…


CEP glossary

September 11, 2008

The Event Processing Technical Society (EPTS) has updated its glossary of CEP terms.

Having a common language is indeed one of the first steps towards being able to discuss a technology.

Fortunately enough, several of the terms are already well accepted and used. For instance, Oracle CEP (i.e. the former BEA Event Server) treats the following terms as first-class citizens in its programming model: Event Source, Event Sink, Stream, Processor (a.k.a. Event Processing Agent), Event Type, Event Processing Network, Event Processing Language, and Rule. This allows the user to author an event processing application by leveraging existing event processing design patterns: define the event types, define the sources of events, define the sinks of events, define the intermediate processors, connect these to form an EPN, and configure the processors with EPL rules. It becomes a simple enough task if one understands the glossary.

Nevertheless, I would like to suggest one additional term: a relation. To understand relations, one has to understand streams first.

The glossary defines a stream as “a linearly ordered sequence of events”. At first glance, one may think of a stream as the physical pipe, or connection, between sources and sinks. However, that is not entirely correct. Rather, an event channel, which is defined as “a conduit in which events are transmitted…”, represents the idea of a connection; a stream extends this idea and adds processing semantics to it: precisely, it states that a stream not only contains events, but that these events must be totally ordered.

Let’s consider an example. Consider an event type consisting of two attributes: an integer and a string.

The following sequence is a stream ordered by the first attribute: {1, “a”}, {2, “b”}, {3, “a”}.
Conversely, the following sequence is NOT a stream: {1, “a”}, {2, “b”}, {1, “a”}. In this latter case, the sequence of events is termed an event cloud. Streams are a type of event cloud.

This may seem rather odd and arbitrary, but this additional semantic is very helpful. For one thing, it allows CEP engines to optimize. Because streams are generally unbounded, i.e. they never end, one has to define a window of time (or length) on top of a stream to be able to do any interesting computation. Having an ordered sequence allows the engine to progress the computation for a window without having to keep old values forever.

Considering the example at hand, let’s say one wants to calculate the number (i.e. count) of similar strings in the last 3 seconds, given that the first attribute provides the time-stamp of the event in seconds.
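In CQL, such a query might look like the following sketch (the stream name S and the attribute name str are hypothetical, and the window is assumed to progress using the events’ time-stamps):

SELECT str, COUNT(*) AS total
FROM S [RANGE 3 seconds]
GROUP BY str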

If the events were not ordered in time, then how would the engine know when it is safe enough to output the correct result?

Consider that time starts at t = 0 and the following input: {2, “a”}, {3, “a”}.

Should we wait for event {1, ?} before outputting the result? Was this event lost, or merely delayed? What if we don’t wait and output the result, and then event {1, ?} ends up arriving afterward; should we output the result again?

As can be noted, the conceptual model gets very complicated. Better to keep the conceptual model simple and move this complexity elsewhere, such as to an adapter that re-orders events using some time-out strategy.

A “relation” is another type of event cloud. It is also a totally ordered set of events. In addition, it has an attribute that denotes whether the event is an insert, delete, or update event. These three different kinds of events allow one to model a table or, more precisely, an instantaneous finite bag of tuples at some instant of time.

Consider a table whose rows contain a single column of type string.

The sequence of events {+, 1, “a”}, {+, 2, “b”} creates the table { {“a”}, {“b”} } at time t = 2.

Then the following event arrives: {-, 3, “a”}. This results in the table being updated to { {“b”} } at time t = 3.

Keep in mind that a relation, in the context of event processing, and similarly to a stream, is still a sequence of streaming events, which pass through some event channel. However, unlike a stream, these events carry additional semantics that allow one to represent actual finite tables.

Why is this needed? It is very common for events to be enriched with additional data. Sometimes this data is very dynamic, in which case the enrichment is modeled as a join between two streams; sometimes the data is somewhat static, changing less often, in which case it is better modeled as a join between a stream and a relation. If you think of it in these terms, it almost seems as if a stream, seen as containing only insert events, is a sub-type of relation.

Why call it a “relation”? Mainly because this is the term used by CQL, the foundational work for data stream management.

The CEP glossary does an excellent job of setting up the base model and extending it to event patterns. With “relation”, we complete the model by embracing the data stream management piece of it.