Fraud Detection for Israel BigThings Meetup


Real-Time Anomaly Detection: Patterns and Reference Architectures

Gwen Shapira, System Architect

©2014 Cloudera, Inc. All rights reserved.

Overview
• Intro
• Review Problem
• Quick overview of key technology
• High-level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch


Gwen Shapira
• 15 years of moving data
• Formerly consultant, engineer
• System Architect @ Confluent
• Kafka Committer
• @gwenshap

There’s a Book on That

Founded by creators of Kafka - @jaykreps, @nehanarkhede, @junrao

We help you gather, transport, organize, and analyze all of your stream data

What we offer
• Confluent Platform
  • Kafka plus critical bug fixes not yet applied in the Apache release
  • Kafka ecosystem projects
• Enterprise support
• Training and Professional Services


The Problem


Credit Card Transaction Fraud


Coupon Fraud


Video Game Strategy


Health Insurance Fraud


How Do We React? The Human Brain at Tennis
• Muscle memory
• Reaction thought
• Reflective meditation


Overview of Key Technologies


Kafka


The Basics

• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster; nodes are called brokers
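The topic / partition / log model described on the next few slides can be sketched in a few lines. This is an illustrative Python model, not the real Kafka client: a topic is a set of partitions, each an append-only log addressed by offset.

```python
# Illustrative model of Kafka's storage layout: a topic is a set of
# partitions, and each partition is an append-only log addressed by offset.
class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Producer side: push a message onto one partition's log."""
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1  # offset of the new message

    def read(self, partition, offset):
        """Consumer side: pull messages from a given offset onward."""
        return self.partitions[partition][offset:]

topic = Topic("transactions", num_partitions=3)
topic.append(0, "swipe-1")
topic.append(0, "swipe-2")
topic.append(1, "swipe-3")

print(topic.read(0, 0))  # ['swipe-1', 'swipe-2']
print(topic.read(0, 1))  # ['swipe-2']
```

Because a partition is just a log, a consumer can re-read any range simply by asking for an earlier offset.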


Topics, Partitions and Logs


Each partition is a log


Each Broker has many partitions

(Diagram: partitions 0, 1, and 2 spread, with copies, across the brokers in the cluster)


Producers load balance between partitions

(Diagram: a producer client spreading writes across partitions 0, 1, and 2 on the brokers)



Consumers

(Diagram: a Kafka cluster with one topic of partitions A, B, and C, each partition backed by a file; consumer groups X and Y each read every partition, with each group's offsets tracked separately)

• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
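The "offsets kept per consumer group" point can be sketched in plain Python (hypothetical names, not the real consumer API): two groups read the same partition log independently because each tracks only its own position.

```python
# One partition's log, shared by every consumer group.
log = ["txn-1", "txn-2", "txn-3", "txn-4"]

# Offsets are kept per consumer group: each group remembers
# how far *it* has read, independently of every other group.
offsets = {"group-X": 0, "group-Y": 0}

def poll(group, max_records=2):
    """Return the next batch for this group and advance only its offset."""
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)
    return batch

print(poll("group-X"))  # ['txn-1', 'txn-2']
print(poll("group-X"))  # ['txn-3', 'txn-4']
print(poll("group-Y"))  # ['txn-1', 'txn-2'] -- group Y is unaffected by X
```

This is why you can attach a brand-new consumer group (say, a batch re-processor) to a live topic without disturbing the NRT consumers already reading it.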

Consumer-Producer Pattern

Keeping Things Simple
• Consume records from a Kafka topic
• Filter, transform, join, look up, aggregate
• Write to another Kafka topic
• https://github.com/confluentinc/examples/tree/master/specific-avro-consumer
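The consume-filter-produce loop above can be sketched as follows. The two lists stand in for the input and answer Kafka topics, and the fraud rule is a hypothetical example, not anything from the talk:

```python
# Consumer-producer pattern: read from one topic, filter/transform,
# write to another. Lists stand in for the two Kafka topics.
input_topic = [
    {"card": "1234", "amount": 25.0},
    {"card": "5678", "amount": 9400.0},
    {"card": "1234", "amount": 8100.0},
]
output_topic = []

def is_suspicious(txn):
    # Hypothetical rule: flag unusually large transactions.
    return txn["amount"] > 5000

for record in input_topic:                              # consume
    if is_suspicious(record):                           # filter / transform
        output_topic.append({**record, "flag": "review"})  # produce

print(output_topic)
```

The real version replaces the lists with a Kafka consumer and producer, but the processing logic in the middle is exactly this simple.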

Kafka Makes Streams Easy
• Producers partition the data
• Consumers load balance partitions
• Add / remove consumers any way you want
• Will work with any framework (or none!)

Coming Soon to Kafka Near You
• Kafka Connect - export / import for Kafka - 0.9.0 (it's here!)
• KStreams:
  • Consumer-producer client - Processor (0.10.0 - April?)
  • DSLs:
    • KStream (a bit like Spark) - (0.10.0 - April?)
    • SQL - ???

KConnect - It's a Thing
• Easy to add connectors to Kafka
• Existing connectors:
  • JDBC
  • HDFS
  • MySQL * 2
  • ElasticSearch * 4
  • Cassandra
  • S3 * 2
  • MQTT
  • Twitter
• Kafka Connectors:
  • http://www.confluent.io/developers/connectors
  • http://docs.confluent.io/2.0.0/connect/index.html
• KStreams:
  • https://github.com/gwenshap/kafka-examples/blob/master/KafkaStreamsAvg

Spark Streaming


Spark Example

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)  // RDDs have no print(); collect, then print


Spark Streaming Example

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
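What the streaming job does each second can be sketched without Spark. Assuming each micro-batch is simply a list of lines, the per-batch counts mirror flatMap + map + reduceByKey, and the running total mirrors the stateful RDD shown on the next slides:

```python
from collections import Counter

def process_batch(lines):
    """One micro-batch: split lines into words and count them,
    like flatMap + map + reduceByKey in the Spark job."""
    words = [w for line in lines for w in line.split(" ")]
    return Counter(words)

# Two one-second micro-batches arriving on the stream.
batches = [["to be or not"], ["to be"]]

state = Counter()            # the 'stateful RDD' carried between batches
for batch in batches:
    counts = process_batch(batch)
    state.update(counts)     # fold this batch's counts into the state
    print(dict(counts))

print(dict(state))           # running totals across both batches
```

The key point of the diagrams that follow is exactly this split: a stateless single pass per batch, plus an optional stateful accumulator carried from batch to batch.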

Spark Streaming

(Diagram: per micro-batch, a receiver turns source data into an RDD, and a single pass of filter, count, and print runs over it - shown as a sequence of DStreams for the pre-first, first, and second batches)

(Diagram: the stateful variant - each batch's counts are folded into a stateful RDD that carries running state from one batch to the next)

High Level Architecture


Real-Time Event Processing Approach

(Diagram: clients swipe cards and web apps feed Flume agents into Kafka; on Hadoop Cluster I, Spark Streaming adjusts NRT statistics using HBase / memory and a local cache; an HDFS event sink and a SolR sink land the data on Hadoop Cluster II for storage, search, Hive/Impala, MapReduce, and Spark; batch-time adjustments plus automated & manual review of NRT changes and counters feed analytical adjustments, pattern detection, and profile fetching & updating back into the stream layer)

Yarn / Mesos Analytics Layer

(Diagram: the Kafka-centric variant - clients and web apps write transactions to Kafka; KStream processors on Yarn / Mesos, each with a local store backed by a redo-log topic, fetch & update profiles, adjust NRT stats, and emit decisions; connectors move data out to HDFS, NoSQL, SolR, and a DWH, while profile updates, model updates, and batch-time analytical adjustments and pattern detection flow back through Kafka)

NRT Processing


Focus on NRT First

(Diagram: the same architecture as above with the NRT path highlighted - clients through Kafka into the stream processor, which adjusts NRT statistics against HBase / memory and its local cache before the batch layers take over)

Streaming Architecture – NRT Event Processing

(Diagram: Kafka initial-events topic → Kafka consumer → event processing logic, backed by local memory and an HBase client talking to HBase → Kafka producer → Kafka answer topic)

Able to respond within tens of milliseconds
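The low latency comes from answering out of local memory whenever possible. A sketch of that cache-with-fallback lookup, where get_from_hbase is a hypothetical stand-in for the HBase client call:

```python
# Local in-memory cache in front of the profile store. Answering from
# the cache is what keeps response times in the tens of milliseconds;
# only a miss pays for a round trip to HBase.
local_cache = {}

def get_from_hbase(card):
    """Hypothetical stand-in for the HBase client lookup."""
    return {"card": card, "avg_spend": 120.0}

def get_profile(card):
    if card not in local_cache:        # cache miss: go to HBase
        local_cache[card] = get_from_hbase(card)
    return local_cache[card]           # cache hit: local memory only

profile = get_profile("4111-1111")
print(profile["avg_spend"])
print("4111-1111" in local_cache)      # True -- the next lookup is local
```

The catch, addressed on the next slide, is that a plain cache only helps if the same card keeps hitting the same processor.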


Partitioned NRT Event Processing

(Diagram: Kafka initial-events topic with partitions A, B, and C → one Kafka consumer per partition → event processing logic with a local cache and an HBase client talking to HBase → Kafka producer → Kafka answer topic)

Each producer routes events through a custom partitioner, so all events for a given key land in the same partition

Better use of local memory
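The custom partitioner idea can be sketched as follows. Hashing the card number means every event for one card reaches the same partition, hence the same processor and its local cache. The key choice is a hypothetical example; md5 is used here only to get a deterministic hash:

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(card_number):
    """Route every event for one card to the same partition, so the
    processor that owns that partition can keep the card's profile
    warm in local memory."""
    digest = hashlib.md5(card_number.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Repeated swipes of the same card always land together.
p1 = partition_for("4111-1111")
p2 = partition_for("4111-1111")
print(p1 == p2)  # True
```

With this routing in place, a cache hit rate close to 100% is possible for active cards, because no other processor ever sees their events.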


Questions?
• http://confluent.io
• @confluentInc
• @gwenshap
• gwen@confluent.io
