NEW YORK STORM USERS GROUP Using Storm with MapR M7 for Real-Time Predictive Modeling January 28, 2014

New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling


DESCRIPTION

Velos provides Predictive Analytics lifecycle and scaling solutions for Enterprise companies. Formerly Sociocast, a SaaS solution with use-case specific ad tech and e-commerce models on our own hardware. Velos provides an on-premise platform supporting any models on various production runtimes, such as Hadoop, Storm, Spark and others. We will discuss the evolution from a Hadoop-only system to an architecture consisting of Storm, Play, Kafka, Redis, MapR M3, and MapR M7 (HBase) to meet our requirements. An overview of the different types of topologies created by Sociocast will be given, with an in-depth review of the topology used for real-time probabilistic and absolute counting. Performance metrics of the platform will be shared, as well as a development road map for the platform.


Page 1: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

N E W Y O R K S T O R M U S E R S G R O U P

Using Storm with MapR M7 for Real-Time Predictive Modeling, January 28, 2014

Page 2: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Introductions • About Velos • Our Use Cases • Requirements • Why Storm? • Why MapR M7? • How Did We Get Here? • Architecture • Quick Storm Introduction • Our Topologies • Performance & Learnings • Road Map • Q&A

A G E N D A


Page 3: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

Gna Phetsarath, Director of Technology, @sourigna, http://www.linkedin.com/in/sourignaphetsarath/

I N T R O D U C T I O N S


Page 4: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Velos provides Predictive Analytics lifecycle and scaling solutions for Enterprise companies

• Formerly Sociocast, a SaaS solution with use-case specific ad tech and e-commerce models on our own hardware

• Velos provides an on-premise platform supporting any models on various production runtimes, such as Hadoop, Storm, Spark and others

• Customers can easily automate ETL, feature engineering, model evaluation and production deployment and monitoring, as well as relearning and adaptation

• Plug in existing Python, Java, and R models

A B O U T V E L O S


Page 5: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Real-Time Predictive Modeling • Real-Time Metrics

• Atomic Counters • Unique Probabilistic Counting (HyperLogLog Plus) • Group Membership (Bloom Filters)

• Page Parsing - NLP Feature Extraction • Event/Entity Attribute Maintenance

O U R U S E C A S E S
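As an illustration of the group-membership use case, here is a minimal Bloom filter sketch in Python. It is not Velos's implementation: the class name, the SHA-256 double-hashing scheme, and the `user-N` IDs are assumptions made for this example. It shows the basic trade-off: members are never missed, while non-members are occasionally reported as present at roughly the configured error rate.

```python
import hashlib
import math


class BloomFilter:
    """Set-membership sketch: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing formulas for m bits and k hash functions.
        self.m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


# Hypothetical segment of 1,000 user IDs.
segment = BloomFilter(capacity=1000, error_rate=0.01)
for i in range(1000):
    segment.add(f"user-{i}")

print("user-42" in segment)  # True: added members are always found
```

With these settings the filter needs a little over 1 KB of bits for 1,000 members, which is why the structure suits membership checks over very large key sets.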


Page 6: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• < 50 ms response time • Random access to large data set > 1B keys • Near Real-time/streaming • Distributed • Scalable • Fault Tolerant • Reliable

R E Q U I R E M E N T S


Page 7: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Simple API • Scalable • Fault tolerant • Guarantees data processing • Handles parallelization, partitioning, and retrying on failures when necessary • Easy to deploy and operate • Free and open source

W H Y S T O R M ?


Page 8: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Configuration is simpler than with HBase • No region servers • No compactions, since it is a read-write file system • Recovery from cold starts is easier. If HBase goes down and has to be restarted, it takes a long time (hours), whereas M7 takes minutes. We haven't had to experience that, but we did have a ZooKeeper (ZK) failure and had to bounce each node, and it was quick. • NFS Gateway is very useful • There are plenty of features we haven't taken advantage of yet • MapR Admin UI is easy to use

W H Y M A P R M 7 ?


Page 9: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Amazon Elastic MapReduce • Cloudera Hadoop on Amazon Web Services • MapR M3 (Hadoop MapReduce) on Managed Hosting Service • MapR M3, Riak, Storm, Kafka, Redis, Play on Managed Hosting Service • MapR M3, MapR M7 (HBase), Storm, Kafka, Redis, Play on Managed Hosting Service

H O W D I D W E G E T H E R E ?


Page 10: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

A R C H I T E C T U R E - Q 4 2 0 1 3


API - Play Framework

Dashboard - Play Framework

Kafka

Storm Topologies

Redis

MapR M7

MapR M3

PostgreSQL

Page 11: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

S T O R M C O N C E P T S


Spout

Bolt

Topology

Tuples

(key,fields,...)

Page 12: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Tuple - a named list of values • Streams - unbounded sequences of tuples • Spouts - a source of streams • Bolts - process any number of input streams and produce any number of output streams • Topologies - a network of spouts and bolts • Reliability - guarantees that every tuple will be fully processed • Workers - execute a subset of a topology • Tasks - executed by workers for bolts/spouts

Q U I C K S T O R M O V E R V I E W
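To make the vocabulary concrete, the following is a toy, single-process model of a spout, two bolts, and a topology, written in plain Python. It does not use Storm's API: every class and function name here (`ListSpout`, `ParseBolt`, `CountBolt`, `run_topology`) is hypothetical, and the sketch leaves out reliability (acking), workers, and tasks entirely.

```python
from collections import Counter


class ListSpout:
    """Spout: a source of a stream. Here it replays an in-memory list."""

    def __init__(self, lines):
        self._lines = list(lines)

    def tuples(self):
        for line in self._lines:
            yield {"line": line}  # a tuple is a named list of values


class ParseBolt:
    """Bolt: turns one input tuple into zero or more output tuples."""

    def process(self, tup):
        parts = tup["line"].split(",")
        if len(parts) == 2:  # malformed input produces no output tuple
            yield {"entity": parts[0], "event": parts[1]}


class CountBolt:
    """Bolt with state: keeps a running count per event type."""

    def __init__(self):
        self.counts = Counter()

    def process(self, tup):
        self.counts[tup["event"]] += 1
        yield {"event": tup["event"], "count": self.counts[tup["event"]]}


def run_topology(spout, bolts):
    """Topology: push every spout tuple through the bolts in order."""
    results = []
    for tup in spout.tuples():
        stream = [tup]
        for bolt in bolts:
            stream = [out for t in stream for out in bolt.process(t)]
        results.extend(stream)
    return results


counter = CountBolt()
output = run_topology(
    ListSpout(["u1,view", "u2,click", "bad", "u1,view"]),
    [ParseBolt(), counter],
)
print(dict(counter.counts))  # {'view': 2, 'click': 1}
```

In a real Storm cluster the bolts run as parallel tasks spread across workers, and the grouping between components decides which task receives each tuple; this sketch runs everything sequentially to keep the data flow visible.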


Page 13: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

• Entity Observe • Kafka Spout • Bot Detection Bolt • Entity Observe Bolt • Real-time Counter Bolt • Predictive Model Update Bolt

• NLP Feature Extraction of HTML Content • Entity/Event Attribute Maintenance

O U R T O P O L O G I E S


Page 14: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

E N T I T Y O B S E R V E T O P O L O G Y


Page 15: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

R E A L - T I M E C O U N T E R B O L T
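The slide's diagram did not survive extraction, but the talk description mentions real-time probabilistic and absolute counting. The sketch below shows one way the two could sit side by side: an exact event total per key plus an approximate distinct count. It is an assumption-laden illustration, not the actual bolt: it uses plain HyperLogLog rather than the HyperLogLog Plus variant named in the use cases, keeps state in memory instead of MapR M7, and all names (`HyperLogLog`, `RealTimeCounter`, the `page:/home` key) are hypothetical.

```python
import hashlib
import math
from collections import Counter


class HyperLogLog:
    """Plain HyperLogLog with 2**p registers (not the HLL++ variant)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # small-range correction
        return round(estimate)


class RealTimeCounter:
    """Exact totals plus approximate unique counts, both keyed by metric name."""

    def __init__(self):
        self.totals = Counter()
        self.uniques = {}

    def observe(self, key: str, entity_id: str) -> None:
        self.totals[key] += 1
        self.uniques.setdefault(key, HyperLogLog()).add(entity_id)

    def stats(self, key: str):
        hll = self.uniques.get(key)
        unique_estimate = hll.count() if hll is not None else 0
        return self.totals[key], unique_estimate


rt_counter = RealTimeCounter()
for i in range(60000):
    rt_counter.observe("page:/home", f"user-{i % 20000}")

total, unique_estimate = rt_counter.stats("page:/home")
print(total, unique_estimate)  # exact total, approximate number of distinct users
```

With 4,096 registers the expected relative error of the distinct count is around 1.6%, while the sketch stays the same size no matter how many distinct IDs it sees.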


Page 16: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

P E R F O R M A N C E M E T R I C S


Play / Kafka ~ 3000 ops/node

Kafka / Storm ~ 1650 ops/node

Storm / MapR M7 ~ 5000 ops/node
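A back-of-the-envelope reading of these figures, as a short Python calculation. It rests on assumptions the slide does not state: that "ops" means operations per second, that the three hops sit in series, and that throughput scales linearly with node count. The 10,000 ops/s target is a made-up number for illustration.

```python
import math

# Per-node throughput from the slide; "ops" is assumed to mean operations per second.
OPS_PER_NODE = {
    "Play / Kafka": 3000,
    "Kafka / Storm": 1650,
    "Storm / MapR M7": 5000,
}


def nodes_needed(target_ops: int) -> dict:
    """Nodes required per hop to sustain target_ops, assuming linear scaling."""
    return {hop: math.ceil(target_ops / rate) for hop, rate in OPS_PER_NODE.items()}


# The hop with the lowest per-node rate limits an evenly sized pipeline.
bottleneck = min(OPS_PER_NODE, key=OPS_PER_NODE.get)

plan = nodes_needed(10_000)  # hypothetical target load
print(bottleneck)  # Kafka / Storm
print(plan)        # {'Play / Kafka': 4, 'Kafka / Storm': 7, 'Storm / MapR M7': 2}
```

Under these assumptions the Kafka-to-Storm hop is the one that needs the most nodes for a given load.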

Page 17: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

1M Put 1,900 ops/n 15,000 ops/n

1M RW 2,000 ops/n 5,000 ops/n

1B Load N/A 7,000 ops/n

C O M P A R I N G M 7 W I T H C A S S A N D R A

YCSB benchmark on a 5-node cluster with 24 cores, 192 GB RAM, and 24 disks per node. Cassandra 2.0.x; MapR M7 Pre-Release 3.00

closer to what we see in production


Page 18: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

T A M I N G S T O R M


• Use monit to keep Nimbus & Supervisors running smoothly • Local queues that periodically write operational stats to Redis (e.g. processing throughput) & alert Ops team • Shaded jars & deployment scripts to keep topologies up to date • ScBaseRichBolt • Write your own base classes to trap framework exceptions and handle them properly • Reduce boilerplate code • Use Murmur Hash to make jobs more efficient by distributing keys more evenly (true for MapReduce as well) • The Storm UI is not reliable (v0.8.2), so you need to roll your own stats; the Storm 0.9 UI should be more reliable • DataDog used for dashboards and alerts
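The Murmur Hash tip can be illustrated with a small Python sketch that assigns keys to partitions. The slide does not say which Murmur variant was used; this example assumes MurmurHash3 (x86, 32-bit) and implements it in pure Python so that it runs without third-party packages. The `partition_for` helper, the `user-N` keys, and the choice of 8 partitions are hypothetical.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_len = n - (n % 4)

    # Body: mix each little-endian 4-byte block into the state.
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: force all bits of the state to avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def partition_for(key: str, partitions: int) -> int:
    """Pick a partition (or task, or reducer) for a key."""
    return murmur3_32(key.encode("utf-8")) % partitions


# Spread 10,000 sequential, similar-looking keys over 8 partitions.
spread = Counter(partition_for(f"user-{i}", 8) for i in range(10000))
print(sorted(spread.items()))
```

Sequential or similarly prefixed keys are where a weak hash tends to produce skewed partitions; a well-mixed hash keeps each partition close to its fair share of about 1,250 keys here.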

Page 19: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

F E A T U R E S

• Deep learning for feature detection • Anomaly detection • Automation of full data science lifecycle, from exploration and modeling to production and relearning • R and Python custom algorithm support • Automated model training and optimization

R O A D M A P


Page 20: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

T E C H N O L O G Y

• Storm 0.9.0 • Kafka 0.8.0 • Apache Spark • Play 2.2.x • Cascading • Spring XD - eXtreme Data • Spring Reactor • Spring Boot

R O A D M A P


Page 21: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

Q&A

Page 22: New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

Thank you!