Detecting Fraud using Event Processing

September 5, 2013

Let’s consider a simple fraud detection scenario, one which I have run into with customers in the past.

Say we want to detect when a particular credit card customer spends more than 10000 dollars in a 24-hour (rolling) period.

Offhand, this appears to be a simple scenario to implement, so let’s get to it. Intuitively, it seems like a good use-case for pattern matching queries:

SELECT M.clientId, M.txId, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
    PARTITION BY cardNumber
    MEASURES cardNumber AS clientId, transactionId AS txId
    PATTERN (A)
    DEFINE A AS price > 10000
) AS M

We can break down this query into the following steps:

  1. Partition the events coming through the credit-card transaction stream by their cardNumber;
  2. Check whether the price of the transaction (event) is greater than 10000 dollars. We call this pattern A or, more precisely, the correlation variable A;
  3. When the pattern is detected, a process called matching, specify how to represent the matched pattern. These are called the measures of the pattern. In this particular case, two measures are defined: one identifying the card number and another the transaction ID;
  4. Finally, output the two measures (i.e. clientId and txId) and a status string set to "ALARM", indicating that the customer has spent more than 10000 dollars.

This pattern matching query can identify a single transaction above the threshold, but not a set of transactions that individually stay below the threshold yet collectively cross it.

For example, consider the following table of input and output events:

Time (h) | Input                                         | Output
0        | {cardNumber=10, transactionId=1, price=11000} | {clientId=10, txId=1, "ALARM"}
1        | {cardNumber=10, transactionId=2, price=5000}  | (none)
2        | {cardNumber=10, transactionId=3, price=6000}  | (none)

In this case, note how there is no output at time t = 2h even though transactions 2 and 3 collectively do cross the 10K threshold (5000 + 6000 = 11000).

The following query addresses this issue:

SELECT M.clientId, M.txId, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
    PARTITION BY cardNumber
    MEASURES cardNumber AS clientId, transactionId AS txId
    PATTERN (A+? B) WITHIN 24 HOURS
    DEFINE B AS SUM(A.price) + B.price > 10000
) AS M

One way to look at pattern matching is as a state diagram, where each correlation variable (e.g. A) represents a state. Typically, you need at least two states: the starting state and the accepting state (i.e. the end state). In this particular scenario, state A represents the individual transactions and state B represents the condition where the events collectively cross the 10K threshold.

However, there are a few caveats. For example, could we get along with a single state (i.e. A) using the expression ‘A AS SUM(A.price)’? No; we can’t define a variable in terms of an aggregation of itself, which is why we created variable B, defined as the aggregation of A. Because we intend to aggregate several events as part of A, A must be defined as a collection, that is, a correlation group variable. In this case, we do this by using the regular expression operator ‘+’, as in ‘A+’, which indicates that the correlation variable A should attempt to match one or more events.

Further, why do we need to add ‘B.price’ to ‘SUM(A.price)’, as in the expression ‘B AS SUM(A.price) + B.price’? This is needed because the ending event is itself a transaction, so its price must be considered as part of the equation; otherwise, the price of the last received event would be ignored. In our example, A matches transaction 2 (5000) and B matches transaction 3 (6000), and the condition evaluates 5000 + 6000 > 10000.

Why do we need the token ‘?’, as in the expression ‘A+?’, in the query? Typically, correlation group variables, as is the case with A, are greedy: they try to match as many events as possible. However, because A and B are similar, if we keep A greedy, then B would never match. To avoid this situation, we use the token ‘?’ to indicate that A should be reluctant, matching the minimal set of events possible, thereby also allowing B to match and accept the final sequence.
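To make the difference concrete, here are the two variants side by side (the parenthesized annotations are for illustration only and are not part of the query syntax):

PATTERN (A+ B) WITHIN 24 HOURS    (greedy: A consumes as many events as possible, starving B)
PATTERN (A+? B) WITHIN 24 HOURS   (reluctant: A matches as few events as possible, letting B test each new event)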

Finally, note that the clause ‘WITHIN 24 HOURS’ specifies the window of interest. This means that the sum of the events is always constrained to a rolling 24-hour window.
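As an aside, if all we needed were the rolling sum itself, a window-based aggregation could express similar semantics without pattern matching. The following is a hypothetical sketch using CQL-style window syntax; note that it lacks the per-match measures and transaction identifiers that MATCH_RECOGNIZE gives us:

SELECT cardNumber AS clientId, SUM(price) AS total
FROM creditcardTransactionStream [RANGE 24 HOURS]
GROUP BY cardNumber
HAVING SUM(price) > 10000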

Here is a state diagram that represents the query:

[State diagram: start → A (one or more events, reluctant) → B (accepting: running sum crosses the 10K threshold)]

Using this new query, we get a proper alarm output when the query receives the third event (i.e. transactionId = 3). However, there is one new problem: this query no longer outputs an alarm when we receive the first event (i.e. transactionId = 1), whose price alone crosses the threshold.

Why is that? The problem is that the new query expects at least two events: one to match variable A and another to match variable B. We can fix this by introducing an alternation in our regular expression, as follows:

PATTERN (C | (A+? B)) WITHIN 24 HOURS
DEFINE B AS SUM(A.price) + B.price > 10000, C AS price > 10000

In this case, we match either the expression ‘(A+? B)’ or the expression ‘C’, where C is simply the case of a single event whose price already crosses the threshold. Note that we need to specify C as the first variable in our pattern; otherwise the query would only consider the ‘(A+? B)’ expression and never have a chance to match ‘C’. Also note that any variable left undefined, as is the case with A here, is considered true for all events.
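For completeness, here is the full query with the alternation in place, assembled from the fragments above:

SELECT M.clientId, M.txId, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
    PARTITION BY cardNumber
    MEASURES cardNumber AS clientId, transactionId AS txId
    PATTERN (C | (A+? B)) WITHIN 24 HOURS
    DEFINE B AS SUM(A.price) + B.price > 10000, C AS price > 10000
) AS M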

Are we done?

Not yet. Consider the case where we want to output the total sum of the transactions that cross the threshold. The introduction of the alternation makes this a bit harder, as the query matches either C, or A and B. How can we solve this?

We can address this by using subsets. Subsets are correlation variables that represent the union of other correlation variables. Take a look at the modified query:

SELECT M.clientId, M.total, "ALARM" AS status
FROM creditcardTransactionStream
MATCH_RECOGNIZE (
    PARTITION BY cardNumber
    MEASURES cardNumber AS clientId, SUM(ABC.price) AS total
    PATTERN (C | (A+? B)) WITHIN 24 HOURS
    SUBSET ABC = (A, B, C)
    DEFINE B AS SUM(A.price) + B.price > 10000, C AS price > 10000
) AS M
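With the sample events from the earlier table, this query would now raise two alarms: one for the first event, matched through C (total = 11000), and one for events 2 and 3, matched through A and B (total = 5000 + 6000 = 11000). Each output would look something like:

{clientId=10, total=11000, "ALARM"}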

As you can see, pattern matching queries give us a lot of power, but this power comes at a price: one needs to consider the starting conditions, the accepting conditions, and all the corner-case scenarios, which are never trivial, even for apparently simple use-cases.

It also highlights a crucial but subtle point: as much as one tries to create black-box solutions for event processing scenarios, real-world use-cases are always more complicated and have more subtleties than originally conceived, and for this reason there will always be a place for computationally rich, domain-specific event processing languages.


New book on Event Processing and EPL theory

July 4, 2013

In March of this year, I had the pleasure of publishing my second book: Getting Started with Oracle Event Processing 11g.

As I learned the hard way, authoring a book is really hard stuff, so this time around I had the satisfaction of working with two co-authors, Robin Smith and Lloyd Williams, both outstanding product managers at Oracle.

Although the book focuses primarily on Oracle’s Event Processing product (version 11g), it also includes a lot of material on the underlying fundamentals and concepts of (complex) event processing technology, so it should be interesting to event processing enthusiasts in general, particularly those interested in learning the theory behind event processing languages.

The first two chapters provide a good landscape of the market for event processing, describing along the way a few important use-cases addressed by this technology. Chapters 3 and 4 describe how to get events in and out of the system, and how to model the system using the concept of an Event Processing Network (EPN).

Chapters 5, 8, and 11 provide a very deep description of Oracle’s CQL (Continuous Query Language). Amongst other things, they get into several interesting and advanced topics:

  • The conceptual differences between streams and relations;
  • How different timing models, such as application-based and system-based time models, influence the results;
  • A formal explanation of how relational algebra in SQL is extended to support streams;
  • How to implement interesting pattern matching scenarios, such as missing events and the W pattern; and
  • How CQL is extended to support JDBC, Java, and Spatial technology, allowing one to process events not only in time but also in terms of location.

Chapters 6, 7, and 9 describe how to manage the overall system, in terms of both configuration and performance, and how to scale up and scale out, explaining in particular how a Data Grid can be used in conjunction with event processing to greatly improve scalability and fault tolerance. Finally, chapters 10 and 12 tie everything together with a case study and a discussion of future directions.

I hope you will have as much fun reading this book as I had writing it.

If you have any questions along the way, feel free to send them to me.


Event Processing Patterns at DEBS 2011

July 13, 2011

This week, I co-presented a tutorial on event processing design patterns and their mapping to the EPTS reference architecture at DEBS 2011. For this presentation, I had the pleasure of working with Paul Vincent (TIBCO), Adrian Paschke (Freie Universitaet Berlin), and Catherine Moxey (IBM).

Event processing design patterns are an interesting and new topic, one which I have talked about in the past. However, we added an interesting aspect to this work: illustrating the patterns using different event processing languages. Specifically, we showed how to realize filtering, aggregation, enrichment, and routing using both a stream-based EPL and a rule-based (i.e. logic-programming) EPL.

This not only made the patterns easier to understand and learn, but also, in my opinion, showed how the two approaches are mostly similar in their fundamentals and achieve similar results.

For example, there is a trivial and obvious mapping between a stream event and a rule’s fact. Likewise, the specifications of predicate conditions (e.g. a < b and c = d) are all but identical.

Nonetheless, there are a few areas where one approach is a better fit. For example, aggregation is typically easier to express in the stream-based idiom, whereas semantic reasoning on events is best expressed in the logic-programming style, as I learned using Prova.
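For instance, a per-symbol running average over the last five minutes is nearly a one-liner in a stream-based EPL. The following is a hypothetical sketch in CQL-style syntax, assuming a stockStream with symbol and price fields:

SELECT symbol, AVG(price) AS avgPrice
FROM stockStream [RANGE 5 MINUTES]
GROUP BY symbol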

The ultimate intent of this work is to publish it as a book, and with that in mind we continue to extend it with additional event processing patterns.

We welcome feedback, particularly around new event processing patterns to be documented and included in this project.