Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans

  • Published on
    21-Apr-2017


Transcript

  • Realtime Risk Management: Using Kafka, Python, and Spark Streaming

  • 2

    UNDERWRITING CREDIT CARD TRANSACTIONS IS RISKY

  • 3

    WE NEED TO BE QUICK AT UNDERWRITING

  • 4

    WE ALSO NEED TO AVOID LOSING MONEY

  • 5

    Some Numbers

    $12Bn CUMULATIVE PROCESSED

    200k+ MERCHANTS

    14k EVENTS / SECOND

    7 RISK ANALYSTS

  • 6

    Risk Analysts

  • 7

    WE NEEDED TO BUILD SOMETHING THAT STOPS THE HOPE FACTOR

  • 8

  • 9

    SPARK STREAMING ALLOWS US TO DO REAL-TIME DATA PROCESSING.

    WE CAN DECIDE WHICH EVENTS NEED A CLOSER LOOK

  • 10

    Intro to Kafka, Zookeeper,

    and Spark Streaming

  • 11

    Apache Kafka

    Image taken from Kafka docs

  • 12

    Apache Zookeeper

    Image taken from Zookeeper docs

  • 13

    Apache Spark Streaming

    Image taken from mapr.com

  • 14

    OLD WAY: RECEIVERS

  • 15

    [Diagram: Kafka → Receiver → Spark Engine. Events arrive continuously along a timeline (t0, t1, t2, t3); the receiver builds an RDD from the events received in each batch interval (t0 to t1, then t1 to t2), and the Spark engine then processes each RDD in turn.]

  • 16

    Problems w/ Receivers

    The only way to get at-least-once delivery makes it hard to deploy new code

    Zookeeper is updated with the offsets to start from when data is received, not when it is processed

    We're actually duplicating Kafka

  • 17

    NEW WAY: RECEIVERLESS

  • 18

    [Diagram: Kafka → Spark Engine, with no receiver. Along the same timeline (t0 to t3), the Spark engine reads the events for each interval directly from Kafka, processing an RDD with the events from t0 to t1, then one with the events from t1 to t2.]

  • 19

    General Structure

    Load Kafka offsets from Zookeeper

    Tell Spark Streaming to create a DStream that consumes from Kafka, starting at the specified offsets

    Define your processing step (e.g. filter out non-risky events)

    Define your output step (e.g. POST the data to the case management software)

    Save the Kafka offset of the most recently processed event to Zookeeper

    Start your streaming application, and grab some popcorn!
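    The offset bookkeeping in the first and fifth steps can be sketched in plain Python. The helper names (`offsets_to_json`, `offsets_from_json`) and the znode payload format are assumptions for illustration; in the real job the loaded offsets would be passed to Spark Streaming's direct Kafka stream, and the save step would run after each batch completes.

    ```python
    import json

    def offsets_to_json(offsets):
        """Serialize {(topic, partition): next_offset} into the string
        we would store in a Zookeeper znode after each processed batch."""
        return json.dumps(
            [{"topic": t, "partition": p, "offset": o}
             for (t, p), o in sorted(offsets.items())]
        )

    def offsets_from_json(payload):
        """Deserialize the znode payload back into the dict that tells
        the direct Kafka stream where to resume consuming."""
        return {(e["topic"], e["partition"]): e["offset"]
                for e in json.loads(payload)}

    # After processing a batch that ended at these offsets...
    checkpoint = offsets_to_json({("transactions", 0): 1042,
                                  ("transactions", 1): 998})
    # ...a restarted job loads them and resumes exactly where it left off.
    resumed = offsets_from_json(checkpoint)
    ```

    Because the save happens only after processing, a crash replays the last batch rather than skipping it, which is the at-least-once behavior the receiver approach struggled to provide.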

  • 20

    Example Filtering: Risky Products

    hair extensions

    pharmacy

    vaporizer

    gateway card

    wifi pineapple

    iPhone

    gucci

    cannabis

    travel package

  • 21

    Risky Products

    RDD for Time 0: hair extensions, gucci, cannabis, sweet bag, nice shoes, taylor swift t-shirt

    Filter: hair extensions, gucci, cannabis

    Map: {title: hair extensions}, {title: gucci}, {title: cannabis}

    HTTPS Post → Case Management Software
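    The Filter → Map → POST pipeline above can be illustrated in plain Python. The function names and event shape are assumptions; in the real job these would be the filter and map steps applied to each batch's RDD before the output step.

    ```python
    RISKY_TERMS = {"hair extensions", "gucci", "cannabis"}

    def is_risky(event):
        """Filter step: keep only events whose title mentions a risky term."""
        title = event["title"].lower()
        return any(term in title for term in RISKY_TERMS)

    def to_payload(event):
        """Map step: reduce the event to the payload POSTed to case management."""
        return {"title": event["title"]}

    batch = [
        {"title": "hair extensions", "amount": 120},
        {"title": "sweet bag", "amount": 45},
        {"title": "gucci", "amount": 800},
        {"title": "nice shoes", "amount": 60},
        {"title": "cannabis", "amount": 30},
        {"title": "taylor swift t-shirt", "amount": 25},
    ]

    # filter -> map, as the RDD pipeline does for each batch interval
    payloads = [to_payload(e) for e in batch if is_risky(e)]
    # payloads would then be POSTed to the case management software
    ```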

  • 22

  • 23

    TIME-WINDOWED AGGREGATIONS
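    Spark Streaming provides windowed operations (such as reduceByKeyAndWindow) for this. As a plain-Python illustration of the idea only, here is a count of events per key over a sliding window of the last few batch intervals; the class name and window size are hypothetical.

    ```python
    from collections import Counter, deque

    class SlidingWindowCounter:
        """Count events per key over the last `window` batch intervals,
        mimicking the effect of a windowed aggregation over a DStream."""

        def __init__(self, window):
            self.window = window
            self.batches = deque()  # one Counter per batch interval

        def add_batch(self, keys):
            self.batches.append(Counter(keys))
            if len(self.batches) > self.window:
                self.batches.popleft()  # slide: drop the oldest interval

        def totals(self):
            agg = Counter()
            for c in self.batches:
                agg += c
            return dict(agg)

    # e.g. transactions per merchant over a 3-interval window
    w = SlidingWindowCounter(window=3)
    w.add_batch(["m1", "m1", "m2"])   # t0 to t1
    w.add_batch(["m2"])               # t1 to t2
    w.add_batch(["m1"])               # t2 to t3
    w.add_batch(["m3"])               # t3 to t4: the t0-t1 batch falls out
    ```

    A rule like "more than N risky transactions from one merchant in the last hour" is exactly this shape, which is why windowed functions top the list of future work below.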

  • 24

    The Future

    Time-Windowed Functions: a necessity for most non-trivial jobs

    Performance Tweaks: we haven't spent any time on this, so there's lots of potential for gains

    Machine Learning: we could use the Risk Analysts' decisions to build an ML model

    Improved Monitoring: we are only monitoring the basics right now

    Apache Cassandra: others use it as a fast key/value store for their jobs

    Improved Receiverless API: an API to access Kafka / Zookeeper without the hard work

  • 25

    Icon Credits

    Credit Card by Rediffusion from the Noun Project

    Money Bag by icon 54 from the Noun Project

  • 26

    WE'RE HIRING: SHOPIFY.COM/CAREERS

  • 27

    THANK YOU!

    @n_e_evans
