Detecting Fraud using Event Processing

September 5, 2013

Let’s consider a simple fraud detection scenario, one which I have run into with customers in the past.

Say we want to detect when a particular credit card customer spends more than 10000 dollars in a 24 hour (rolling) period.

Offhand, this appears to be a simple scenario to implement. Let’s get to it. Intuitively, this seems like a good use-case for pattern matching queries:

SELECT

M.clientId, M.txId, “ALARM” AS status

FROM creditcardTransactionStream

MATCH_RECOGNIZE (

PARTITION BY cardNumber

MEASURES cardNumber as clientId, transactionId AS txId

PATTERN (A)

DEFINE A AS price > 10000

) AS M

We can break down this query into the following steps:

  1. Partition the events coming through the credit-card transaction stream by its cardNumber;
  2. Check if the price of the transaction (event) is greater than 10000 dollars. We call this pattern A, or more precisely, the correlation variable A.
  3. When the pattern is detected, a process called matching, you can specify how to represent the matched pattern. These are called measures of the pattern. In this particular case, two measures are defined, one identifying the card number and another for the transaction ID.
  4. The final step is to output the two measures (i.e. cardNumber, transactionId) and a status string set to ‘ALARM’ indicating that the customer has spent more than 10000 dollars.

This pattern matching query is able to identify if a single transaction is above the threshold, but not multiple transactions that individually may cross the threshold.

For example, consider the following table of input and output events:

Time (h) Input Output
0 {cardNumber=10, transactionId=1, price=10000} {clientId=10, txId=1, ‘ALARM’}
1 {cardNumber=10, transactionId=2, price=5000}
2 {cardNumber=10, transactionId=3, price=6000}

In this case, note how there is no output at time t = 2h even though transactions 2 and 3 collectively do cross the 10K threshold.

The following query addresses this issue:

SELECT

M.clientId, M.txId, “ALARM” AS status

FROM creditcardTransactionStream

MATCH_RECOGNIZE (

PARTITION BY cardNumber

MEASURES cardNumber as clientId, transactionId as txId

PATTERN (A+? B) WITHIN 24 HOURS

DEFINE B as SUM(A.price) + B.price > 10000

) as M

One way to look at pattern matching is to consider it as a state diagram, where each correlation variable (e.g. A) represents a state in the state diagram. Typically, you need at least two states, the starting state and the accepting state (i.e. the end state). In this particular scenario, state A represents the individual transactions and state B represents  the condition where collectively the events cross the 10K threshold.

However, there are a few caveats. For example, could we go along with a single state (i.e. A) using the expression ‘A as SUM(A.price)’? No, we can’t define a variable in terms of an aggregation of itself, hence the reason we created variable B, which is then defined as the aggregation of A. Because we intent to aggregate several events as part of A, A must be defined as a collection, that is, a correlation group variable. In this case, we do this by using the regular expression operator of ‘+’ as in ‘A+’. This indicates that the correlation variable A should attempt to match one or more events.

Further, why do we need to add ‘B.price’ to the ‘SUM(a.price)’ as in the expression ‘B as SUM(A.price) + B.price’ ? This is needed because the ending event itself is a transaction and thus its price needs to be considered as part of the equation. If we don’t do that, then the price of the last received event would be ignored.

Why do we need the token ‘?’ as in the expression ‘A+?’ in the query? Typically, correlation group variables, as it is the case of A, are greedy, that is, they try to match as many events as possible. However, because A and B are similar, if we keep A greedy, then B would never match. To avoid this situation, we use the token ‘?’ to indicate A should be reluctant, and match the minimal set of events possible, therefore also allowing B to match and accept the final sequence.

Finally, note that the clause ‘within 24 hours’ specifies the window of interest. This means that the sum of the events are always constrained to a rolling 24 hours window. Here is a state diagram that represents the query:

state-machine

Using this new query, we will get a proper alarm output when the query receives the third event (i.e transactionId = 3). However, there is one new problem, this query ceases to output an alarm when we receive the first event (i.e. transactionId = 1).

Why is that? The problem is that the new query expects at least two events, one that matches variable A and another to match variable B. We can fix this by introducing an alternation in our regular expression, as follows:

PATTERN (C | (A*? B)) WITHIN 24 HOURS

DEFINE B as SUM(A.price) + B.price > 10000, C as price > 10000

In this case, either we match the expression ‘(A+? B)’ or the expression ‘C’, where C is simply the case where we have a single event whose price is already crossing the threshold. Note that we need to specify C as the first variable in our pattern, otherwise the query would only consider the ‘(A+? B)’ expression and never have a chance to match for ‘C’. Also note that any variable that is not defined, as it is the case of A, means that it is always true for all events.

Are we done?

Not yet, consider the case where we want to output the total sum of the transactions that are crossing the threshold. The introduction of an alternation makes this a bit harder, as the query is either matching C or A and B. How can we solve this?

We can address this by using subsets. Subsets are correlation variables that represent the union of other correlation variables. Take a look a look at the modified query:

SELECT M.clientId, M.total, “ALARM” as status

FROM creditcardTransactionStream

MATCH_RECOGNIZE (

PARTITION BY cardNumber

MEASURES cardNumber as clientId, SUM(ABC.price) as total

PATTERN (C | (A+? B)) WITHIN 24 HOURS

SUBSET ABC = (A,B,C)

DEFINE B as SUM(A.price) + B.price > 10000, C as price > 10000

) as M

As you can see, pattern matching queries gives us a lot of power, however this comes at a price, as one needs to consider a lot of things, such as the starting conditions, the accepting conditions, and all the corner-case scenarios, which are never trivial, even for apparently simple use-cases.

It also highlights the crucial but subtle point that as much as one tries to create black-box solutions for event processing scenarios, real-world use-cases are always more complicated and have more subtleties then originally conceived and for this reason there is always going to be a place for computational rich domain-specific event processing languages.


New book on Event Processing and EPL theory

July 4, 2013

In march of this year, I had the pleasure to publish my second book: Getting Started with Oracle Event Processing 11g

As I learned the hard way, authoring a book is really hard stuff, so this time around I had the satisfaction of working with two co-authors, Robin Smith and Lloyd Williams, both outstanding product managers at Oracle.

Although the book is focused primarily on Oracle’s Event Processing product (version 11g), it also includes a lot of material around the underlying fundamentals and concepts of the (complex) event processing technology employed by the product, so that it should make it interesting to the event processing enthusiastics in general, particularly those interested on learning the theory behind event processing languages.

The first two chapters provide a very good landscape of the market for event processing, along the way describing a few important use-cases that are addressed by this technology. Chapter 3 and 4 describe how to get events in and out of the system, and how to model the system using the concept of a Event Processing Network (EPN).

Chapters 5, 8, and 11 provide a very deep description of Oracle’s CQL (Continuous Query Language). Amongst other things, they get into several interesting and advanced topics:

  • The conceptual differences between streams, and relations;
  • How different timing models, such as application-based and system-based time models, influence the results;
  • A formal explanation of how relational algebra in SQL is extended to support streams;
  • Shows how to implement interesting pattern matching scenarios, such as that of missing events, and the W pattern; and
  • Describes how CQL is extended to support JDBC, Java, and Spatial technology, allowing one to not only process events in time, but also in terms of location.

Chapters 6, 7, and 9 describe how to manage the overall system, both in terms of configuration, but also performance, and how to scale-up and scale-out, particularly explaining how a Data Grid can be used in conjunction with event processing to greatly improve scalability and fault tolerance. Finally, chapters 10 and 12 tie everything together with a case study and discusses future directions.

I hope you will have as much fun reading this book as I had writing it.

If you have any questions along the way, feel free to send them to me.


Event Processing in Big Data: Creating Fast Data

January 15, 2013

Big Data has been getting a lot of attention recently. Not only is this a very interesting subject by itself, but I am finding it ever so more important to me as there is a direct relationship between Big Data and Event Processing (CEP). So much so, that I have written an article describing how event processing can be used to speed up, or rather, provide a real-time answer to Big Data queries. In other words, how one can combine event processing technology, which is inherently real-time, with Big Data technology, which is generally batch-oriented, to come with a new Fast Data platform, that is not only deep, but also fast.

The article is part of the first issue of the new technical magazine Real-Time Business Insights: Event Processing in Action.

Let me also take the opportunity to say that I gladly accepted a position as one of the editors for the magazine, alongside several other prestigious experts in the industry and academia. Please, feel free to download it, and send me feedback, and new ideas.

We are looking for new articles to be promoted in the following issues!

 

 


Answer to CEP quiz: streams and relations

January 4, 2012

Sorry for the long delay on posting an answer to last week’s, or rather, last year’s quiz. It is funny how time seems to stretch out sometimes and a few days turn into weeks.

Well, let’s revisit the queries from the last post:

  1. SELECT * FROM C1[NOW]
  2. ISTREAM (SELECT * FROM C1[NOW])
  3. DSTREAM (SELECT * FROM C1[NOW])

In the first query, the answer is that at the time t = 2 the CACHE is empty. Why empty and not 1?

To understand this, consider the following sequence of events:

  • At time t = 0, the NOW (window) operator creates an empty relation.
  • At time t = 1, the NOW operator converts the stream with the single event { p1 = 1 } to a relation containing a single entry { p1 = 1 }. The result of the NOW operator is a relation, and hence in the absence of other conversion operators, the query likewise outputs a relation. As CEP deals with continuous queries, the best way to represent the difference between the empty relation at time t = 0 and the relation at time t = 1 is to output the insertion of the entry { p1 = 1 }, or in other words, an insert event { p1 = 1 }. The CACHE receives this insert event, and puts the entry { p1 = 1} into it.
  • At time t = 2 (or more precisely at the immediate next moment after t = 1), the NOW operator outputs an empty relation, as the event e1 has moved on from the input stream. The difference between the relation at t = 1 and the relation at t = 2 is the deletion of the entry { p1 = 1 }, therefore the query outputs the delete event { p1 = 1 }. The CACHE receives this delete event, and consistently removes the entry { p1 = 1 }, leaving the cache empty.

Next, let’s consider the second query. In this case, the answer is that at the end the CACHE contains a single entry with the value of 1.

Let’s explore this. In this query, we are using an ISTREAM operator after the NOW operator. The ISTREAM converts the relation into a stream by keeping the insert events. This means that at time t = 1, the insert event being output from the NOW operator is converted into a stream containing the single event { p1 = 1 }. The CACHE receives this event and puts it into it. Next, at time t = 2, the delete event output from the NOW operator is ignored (dropped) by the ISTREAM (convert) operator and never makes it into the CACHE.

The answer for the third query is likewise that at the end the CACHE contains the single entry of 1. The rationale is similar to that of the previous case, however off by one time tick.

At time t = 1, the insert event being output from the NOW operator is ignored by the DSTREAM operator, however the delete event output at time t = 2 is used and converted to a stream. The conversion is simple, the delete event from the relation becomes an insert event in the stream, as the streams only support inserts anyway. The CACHE then picks up this insert event and puts the event into it. Just keep in mind that for this third query this happens at time t = 2, rather than at time t = 1 as it is the case of the second query.

Here is a quick summary:

I would like to thank those people who pinged me, some of them several times, to gently remind me to post the answer.


A CEP Quiz: streams and relations

October 28, 2011

Last week, we were training some of Oracle’s top consulting and partners on CEP.

The training was realized in Reading, UK (near London). Beautiful weather, contrary to what I was told is the common English predicament for October.

At the end, we gave a quiz, which I am reproducing here:

Consider an input channel C1 of the event type E1, defined as having a single property called p1 of type Integer.

An example of events of type E1 are { p1 = 1 } and { p1 = 2 }.

This input channel C1 is connected to a processor P, which is then connected to another (output) channel C2, whose output is sent to cache CACHE. Assume CACHE is keyed on p1.

Next, consider three CQL queries as follows which reside on processor P:

  1. SELECT * FROM C1[NOW]
  2. ISTREAM (SELECT * FROM C1[NOW])
  3. DSTREAM (SELECT * FROM C1[NOW])

Finally, send a single event e1 = { p1 = 1 } to S1.

The question is: what should be the content of the cache at the end for each one of these three queries?

To answer this, a couple of points need to be observed.

First, as I have mentioned in the past, CEP deals with two main concepts: that of a stream and that of a relation.

A stream is container of events, which is unbounded, and only supports inserts. Why only inserts? Well, because there is no such thing as a stream delete, think about it, how could we delete an event that has already happened?!

Whereas a relation is a container of events that is bounded by a certain number of events. A relation supports inserts, deletes, and updates.

Second, remember that a cache is treated like a table, or more precisely, like a relation, and therefore supports insert, delete, and update operations. In the case the query outputs a stream, then the events inserted into the stream are mapped to inserts (or puts) into the cache. If the query outputs a relation, then inserts into the relation are likewise mapped into puts into the cache, however a delete on the relation, becomes a remove of an entry in the cache.

Third, keep in mind that the operations ISTREAM (i.e. insert stream) and DSTREAM (i.e. delete stream) convert relations to streams. The former converts relation inserts into stream inserts, but ignores the relation updates and deletes. The latter converts relation deletes into stream inserts, and ignores relation inserts and updates (in reality, things are a bit more complicated, but let’s ignore the details for the time being).

Fourth, we want the answer as if time has moved on from ‘now’. For all purpose, say we measuring time in seconds, and we sent event e1 at time 1 second and want the answer at time 2 seconds.

I will post the answer in a follow up post next week.

The crucial point of this exercise is to understand the difference between two of the most important CEP concepts: that of a STREAM and RELATION, and how they relate to each other.


Oracle OpenWorld 2011

September 10, 2011

For those attending OOW this year, I will be co-presenting two sessions:

  • Complex Event Processing and Business Activity Monitoring Best Practices (Venue / Room: Marriott Marquis – Salon 3/4, Date and Time: 10/3/11, 12:30 – 13:30)

In this first session, we talk about how to best integrate CEP and BAM. BAM (Business Activity Monitoring) is a great fit to CEP, as it can serve as the CEP dashboard for visualizing and acting on complex events that are found to be business related.

  • Using Real-Time GPS Data with Oracle Spatial and Oracle Complex Event Processing (Venue / Room: Marriott Marquis – Golden Gate C3, Date and Time: 10/3/11, 19:30 – 20:15)

In this following talk, we walk through ours and our customers’ real-world experience on using GPS together with Oracle Spatial and CEP. The combination of CEP and Spatial has become an important trend and a very useful scenario.

If you are at San Francisco at this time, please stop by to chat.


Event Processing Patterns at DEBS 2011

July 13, 2011

This week, I co-presented a tutorial on event processing design patterns and their mapping to the EPTS reference architecture at DEBS 2011. For this presentation, I had the pleasure to work together with Paul Vincent (TIBCO), Adrian Paschke (Freie Universitaet Berlin), and Catherine Moxey (IBM).

Event processing design patterns is an interesting and new topic, one which I have talked about in the past. However, we have added an interesting aspect to this, which is the illustration of the event processing patterns using different event processing languages. Specifically, we showed how to realize filtering, aggregation, enrichment, and routing using both a stream-based EPL, as well as a rule-based (i.e. logical-programming) EPL.

This not only made it easier to understand and learn these patterns, but also in my opinion showed how both approaches are mostly similar in their fundamentals and achieve similar results.

For example, there is a trivial and obvious mapping between a stream event and a rule’s fact. Likewise, the specification of predicate conditions (e.g. a < b and c = d) are but identical.

Nonetheless, there are a few areas where there is a best fit. For example, aggregation is typically easier to express in the stream-based idiom, whereas semantic reasoning on events is best expressed in the logical-programming style, as I learned using Prova.

The intent of this work is ultimately to be published as a book, and with that in mind we continue to extend it with additional event processing patterns.

We welcome feedback, particularly around new event processing patterns to be documented and included in this project.