Event Processing in Big Data: Creating Fast Data

January 15, 2013

Big Data has been getting a lot of attention recently. Not only is it a very interesting subject in itself, but it is becoming ever more important to me, as there is a direct relationship between Big Data and event processing (CEP). So much so that I have written an article describing how event processing can be used to speed up, or rather provide a real-time answer to, Big Data queries. In other words, it shows how one can combine event processing technology, which is inherently real-time, with Big Data technology, which is generally batch-oriented, to come up with a new Fast Data platform that is not only deep, but also fast.

The article is part of the first issue of the new technical magazine Real-Time Business Insights: Event Processing in Action.

Let me also take the opportunity to say that I gladly accepted a position as one of the editors for the magazine, alongside several other prestigious experts from industry and academia. Please feel free to download it and send me feedback and new ideas.

We are looking for new articles to be promoted in the following issues!

 

 


Answer to CEP quiz: streams and relations

January 4, 2012

Sorry for the long delay in posting an answer to last week’s, or rather, last year’s quiz. It is funny how time seems to stretch out sometimes and a few days turn into weeks.

Well, let’s revisit the queries from the last post:

  1. SELECT * FROM C1[NOW]
  2. ISTREAM (SELECT * FROM C1[NOW])
  3. DSTREAM (SELECT * FROM C1[NOW])

In the first query, the answer is that at time t = 2 the CACHE is empty. Why empty, and not containing the single entry { p1 = 1 }?

To understand this, consider the following sequence of events (a small code sketch of the same timeline follows the list):

  • At time t = 0, the NOW (window) operator creates an empty relation.
  • At time t = 1, the NOW operator converts the stream with the single event { p1 = 1 } into a relation containing a single entry { p1 = 1 }. The result of the NOW operator is a relation, and hence, in the absence of other conversion operators, the query likewise outputs a relation. As CEP deals with continuous queries, the best way to represent the difference between the empty relation at time t = 0 and the relation at time t = 1 is to output the insertion of the entry { p1 = 1 }, or in other words, an insert event { p1 = 1 }. The CACHE receives this insert event and puts the entry { p1 = 1 } into the cache.
  • At time t = 2 (or more precisely, at the instant immediately after t = 1), the NOW operator outputs an empty relation, as the event e1 has moved on from the input stream. The difference between the relation at t = 1 and the relation at t = 2 is the deletion of the entry { p1 = 1 }, therefore the query outputs the delete event { p1 = 1 }. The CACHE receives this delete event and accordingly removes the entry { p1 = 1 }, leaving the cache empty.
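
To make the timeline concrete, here is a minimal sketch in Python (not CQL or the Oracle CEP API) that replays those two relation deltas against a dictionary standing in for the CACHE, keyed on p1; the event and delta representations are simplified for illustration.

# Minimal sketch: replay the relation deltas produced by "SELECT * FROM C1[NOW]"
# against a dictionary standing in for the CACHE (keyed on p1).

cache = {}

# Deltas emitted by the continuous query over time (illustrative only):
# t = 1: the NOW window contains e1, so the relation gains { p1 = 1 }  -> insert
# t = 2: the NOW window is empty again, so the relation loses { p1 = 1 } -> delete
deltas = [
    (1, "insert", {"p1": 1}),
    (2, "delete", {"p1": 1}),
]

for t, kind, event in deltas:
    if kind == "insert":
        cache[event["p1"]] = event          # relation insert -> cache put
    elif kind == "delete":
        cache.pop(event["p1"], None)        # relation delete -> cache remove
    print(f"t = {t}: cache = {cache}")

# Output:
# t = 1: cache = {1: {'p1': 1}}
# t = 2: cache = {}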

Next, let’s consider the second query. In this case, the answer is that at the end the CACHE contains a single entry with the value of 1.

Let’s explore this. In this query, we are using an ISTREAM operator after the NOW operator. ISTREAM converts the relation into a stream by keeping only the insert events. This means that at time t = 1, the insert event output by the NOW operator is converted into a stream containing the single event { p1 = 1 }. The CACHE receives this event and puts it in. Next, at time t = 2, the delete event output by the NOW operator is ignored (dropped) by the ISTREAM (conversion) operator and never makes it to the CACHE.

The answer for the third query is likewise that at the end the CACHE contains the single entry { p1 = 1 }. The rationale is similar to that of the previous case, although off by one time tick.

At time t = 1, the insert event output by the NOW operator is ignored by the DSTREAM operator; however, the delete event output at time t = 2 is kept and converted to a stream. The conversion is simple: the delete event from the relation becomes an insert event in the stream, as streams only support inserts anyway. The CACHE then picks up this insert event and puts it in. Just keep in mind that for this third query this happens at time t = 2, rather than at time t = 1 as in the second query.
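
Under the same simplified model, ISTREAM and DSTREAM can be pictured as filters applied to the relation deltas before they reach the cache. The sketch below is again illustrative Python, not the actual Oracle CEP machinery:

# Sketch: ISTREAM keeps only relation inserts, DSTREAM turns relation deletes
# into stream inserts; anything that reaches the cache through a stream is a put.

deltas = [
    (1, "insert", {"p1": 1}),   # NOW window gains e1
    (2, "delete", {"p1": 1}),   # NOW window loses e1
]

def run(convert):
    cache = {}
    for t, kind, event in deltas:
        for stream_event in convert(kind, event):
            cache[stream_event["p1"]] = stream_event   # stream event -> cache put
    return cache

istream = lambda kind, event: [event] if kind == "insert" else []
dstream = lambda kind, event: [event] if kind == "delete" else []

print("ISTREAM:", run(istream))   # {1: {'p1': 1}}  -- the put happens at t = 1
print("DSTREAM:", run(dstream))   # {1: {'p1': 1}}  -- the put happens at t = 2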

Here is a quick summary:

  • Query 1 (relation output): the CACHE is empty at the end.
  • Query 2 (ISTREAM): the CACHE contains the single entry { p1 = 1 }, put at time t = 1.
  • Query 3 (DSTREAM): the CACHE contains the single entry { p1 = 1 }, put at time t = 2.

I would like to thank those people who pinged me, some of them several times, to gently remind me to post the answer.


A CEP Quiz: streams and relations

October 28, 2011

Last week, we were training some of Oracle’s top consultants and partners on CEP.

The training took place in Reading, UK (near London). Beautiful weather, contrary to what I was told is the common English predicament for October.

At the end, we gave a quiz, which I am reproducing here:

Consider an input channel C1 of the event type E1, defined as having a single property called p1 of type Integer.

Examples of events of type E1 are { p1 = 1 } and { p1 = 2 }.

This input channel C1 is connected to a processor P, which is then connected to another (output) channel C2, whose output is sent to cache CACHE. Assume CACHE is keyed on p1.

Next, consider three CQL queries as follows which reside on processor P:

  1. SELECT * FROM C1[NOW]
  2. ISTREAM (SELECT * FROM C1[NOW])
  3. DSTREAM (SELECT * FROM C1[NOW])

Finally, send a single event e1 = { p1 = 1 } to C1.

The question is: what should be the content of the cache at the end for each one of these three queries?

To answer this, a few points need to be observed.

First, as I have mentioned in the past, CEP deals with two main concepts: that of a stream and that of a relation.

A stream is a container of events that is unbounded and supports only inserts. Why only inserts? Well, because there is no such thing as a stream delete; think about it, how could we delete an event that has already happened?!

A relation, on the other hand, is a container of events that is bounded to a certain number of events at any point in time. A relation supports inserts, deletes, and updates.

Second, remember that a cache is treated like a table, or more precisely like a relation, and therefore supports insert, delete, and update operations. If the query outputs a stream, then the events inserted into the stream are mapped to inserts (or puts) into the cache. If the query outputs a relation, then inserts into the relation are likewise mapped to puts into the cache; however, a delete on the relation becomes a remove of the corresponding entry in the cache.

Third, keep in mind that the operators ISTREAM (i.e. insert stream) and DSTREAM (i.e. delete stream) convert relations to streams. The former converts relation inserts into stream inserts, but ignores relation updates and deletes. The latter converts relation deletes into stream inserts, and ignores relation inserts and updates (in reality, things are a bit more complicated, but let’s ignore the details for the time being).

Fourth, we want the answer as if time has moved on from ‘now’. For all purposes, say we are measuring time in seconds: we send event e1 at time t = 1 and want the answer at time t = 2.

I will post the answer in a follow up post next week.

The crucial point of this exercise is to understand the difference between two of the most important CEP concepts, that of a STREAM and that of a RELATION, and how they relate to each other.


Oracle OpenWorld 2011

September 10, 2011

For those attending OOW this year, I will be co-presenting two sessions:

  • Complex Event Processing and Business Activity Monitoring Best Practices (Venue / Room: Marriott Marquis – Salon 3/4, Date and Time: 10/3/11, 12:30 – 13:30)

In this first session, we talk about how to best integrate CEP and BAM. BAM (Business Activity Monitoring) is a great fit for CEP, as it can serve as the CEP dashboard for visualizing and acting on complex events that are found to be business-related.

  • Using Real-Time GPS Data with Oracle Spatial and Oracle Complex Event Processing (Venue / Room: Marriott Marquis – Golden Gate C3, Date and Time: 10/3/11, 19:30 – 20:15)

In this second talk, we walk through our and our customers’ real-world experience of using GPS together with Oracle Spatial and CEP. The combination of CEP and Spatial has become an important trend and a very useful scenario.

If you are in San Francisco at the time, please stop by to chat.


Blending Space and Time in CEP

July 31, 2011

Space and time are two dimensions that are increasingly important in today’s online world. It is therefore no surprise that the blending of CEP and spatial is both natural and ever more important.

Recently, at DEBS 2011, I presented our work on integrating Oracle Spatial with Oracle CEP, where we seamlessly reference spatial functions and types in CQL (i.e. our event processing language). This allows us to implement geo-fencing and telematics within the real-timeness of CEP.

For example, consider the following query:

SELECT shopId, customerId
FROM Location-Stream [NOW] AS loc, Shop
WHERE contains@spatial(Shop.geometry, loc.point)

A few things worth noting:

  • Location-Stream is a stream of events containing the customer’s location as a point, perhaps being emitted by a GPS.
  • Shop is a relation defining the physical location of shops as a geometry.
  • The contains predicate function verifies if a point (event) from the stream is contained by any of the geometries (row) of the relation.

This single query selects an event in time (i.e. now) and joins it with a spatial table!

(Figure: joining a point stream with a geometry relation)

The join happens in memory, aided by an R-Tree (i.e. region tree) data structure, which is also provided by the spatial library, or, as we call it, the spatial cartridge.
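
As a rough mental model of the join (leaving the real R-Tree index and Oracle Spatial geometry types aside), the sketch below checks each incoming location event against a set of shop geometries, simplified here to axis-aligned rectangles; the shop names, coordinates, and function names are invented for illustration.

# Sketch of the point-in-geometry join, with shop geometries simplified to
# axis-aligned rectangles (minx, miny, maxx, maxy). A real implementation
# would use true geometries and an R-Tree index instead of a linear scan.

shops = {
    "shop-1": (0.0, 0.0, 10.0, 10.0),
    "shop-2": (20.0, 20.0, 30.0, 30.0),
}

def contains(rect, point):
    minx, miny, maxx, maxy = rect
    x, y = point
    return minx <= x <= maxx and miny <= y <= maxy

def on_location_event(customer_id, point):
    # Emulates: the NOW window holds just this event, joined against the Shop relation.
    return [(shop_id, customer_id) for shop_id, rect in shops.items()
            if contains(rect, point)]

print(on_location_event("cust-42", (5.0, 5.0)))    # [('shop-1', 'cust-42')]
print(on_location_event("cust-42", (50.0, 50.0)))  # []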

Further, as detailed in the presentation, CQL accomplishes this in a pluggable form using links and cartridges, but that is the subject of a future post…


Event Processing Patterns at DEBS 2011

July 13, 2011

This week, I co-presented a tutorial on event processing design patterns and their mapping to the EPTS reference architecture at DEBS 2011. For this presentation, I had the pleasure to work together with Paul Vincent (TIBCO), Adrian Paschke (Freie Universitaet Berlin), and Catherine Moxey (IBM).

Event processing design patterns are an interesting and new topic, one which I have talked about in the past. However, we added an interesting aspect this time: illustrating the event processing patterns using different event processing languages. Specifically, we showed how to realize filtering, aggregation, enrichment, and routing using both a stream-based EPL and a rule-based (i.e. logic-programming) EPL.

This not only made it easier to understand and learn these patterns, but also, in my opinion, showed how both approaches are largely similar in their fundamentals and achieve similar results.

For example, there is a trivial and obvious mapping between a stream event and a rule’s fact. Likewise, the specification of predicate conditions (e.g. a < b and c = d) is practically identical in both.

Nonetheless, there are a few areas where one approach is a better fit. For example, aggregation is typically easier to express in the stream-based idiom, whereas semantic reasoning over events is best expressed in the logic-programming style, as I learned using Prova.
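
To make the aggregation point concrete, here is a small illustrative sketch of a sliding-window average over an event stream; in a stream-based EPL this is essentially a single windowed query (for example, a three-row window with AVG), whereas a fact/rule style tends to require the bookkeeping spelled out below. The event values are made up.

from collections import deque

# Sketch: a 3-event sliding-window average over a stream of prices.
# The deque does the window bookkeeping that a stream-based EPL hides.

window = deque(maxlen=3)

def on_event(price):
    window.append(price)              # oldest event falls out automatically
    return sum(window) / len(window)  # current windowed average

for price in [10, 20, 30, 40]:
    print(on_event(price))            # 10.0, 15.0, 20.0, 30.0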

The intent of this work is ultimately to be published as a book, and with that in mind we continue to extend it with additional event processing patterns.

We welcome feedback, particularly around new event processing patterns to be documented and included in this project.


Concurrency at the EPN

January 3, 2011

Opher, from IBM, recently posted an interesting article explaining the need for an EPN.

As I have posted in the past, an Event Processing Network (EPN) is a directed graph that specifies the flow of events from and to event processing agents (EPAs), or, for short, processors.

Opher’s posting raised the valid question of why we need an EPN at all: couldn’t we instead just let all processors receive all events and drop those that are not of interest?

He pointed out two advantages of using an EPN: first, it improves usability, and second, it improves efficiency.

I believe there is a third advantage: the EPN allows one to specify concurrency.

For example, the following EPN specifies three sequential processors, A, B, and C (that is, A → B → C). It is clear from the graph that processor C only commences processing its events after B has finished, which likewise only processes its events after they have been processed by A.

Conversely, in the following EPN (A → {B, C}), processors B and C execute in parallel, only after the events have been processed by A.
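
As a rough illustration of the difference (a sketch only, not Oracle CEP configuration), the two topologies can be emulated with plain functions and a thread pool: in the sequential EPN each processor runs only after the previous one finishes, while in the second EPN B and C are submitted concurrently once A is done.

from concurrent.futures import ThreadPoolExecutor, wait

# Sketch only: processors modelled as plain functions, the EPN wiring as the
# order in which they are invoked. Names A, B, C mirror the text above.

def A(events): return [f"A({e})" for e in events]
def B(events): return [f"B({e})" for e in events]
def C(events): return [f"C({e})" for e in events]

events = ["e1", "e2"]

# Sequential EPN: A -> B -> C
print(C(B(A(events))))

# Parallel EPN: A -> {B, C}
with ThreadPoolExecutor() as pool:
    after_a = A(events)
    futures = [pool.submit(B, after_a), pool.submit(C, after_a)]
    wait(futures)
    print([f.result() for f in futures])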

Events are concurrent by nature; therefore, being able to specify concurrency is a very important aspect of designing a CEP system. Surely there are cases where the concurrency model can be inferred from the queries (i.e. rules) themselves by looking at their dependencies; however, that is not always the case, or rather, that may not in itself be enough.

By the way, do you see any resemblance between an EPN and a Petri net? This is not merely coincidental, but that is the subject of a later posting.