DESCRIPTION
This session will describe how to resolve the processing limitations by placing the streaming and data store interfaces in-memory as well, through an in-memory computing platform, and also how to resolve the complexity challenge by implementing a DevOps approach that abstracts all the underlying infrastructure and provides single-click management of all the application tiers and services, on any environment (private/public cloud, bare metal…). And the best news is that all this optimization can be implemented seamlessly, with no code change to your apps.
Real Time Big Data With Storm, Cassandra, and In-Memory Computing
DeWayne Filppi @dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
2 ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
The Flavors of Big Data Analytics
Counting, Correlating, Research
Analytics @ Twitter – Counting
§ How many signups, tweets, retweets for a topic?
§ What’s the average latency?
§ Demographics
  § Countries and cities
  § Gender
  § Age groups
  § Device types
  § …
Analytics @ Twitter – Correlating
§ What devices fail at the same time?
§ What features get users hooked?
§ What places on the globe are “happening”?
Analytics @ Twitter – Research
§ Sentiment analysis
  § “Obama is popular”
§ Trends
  § “People like to tweet after watching American Idol”
§ Spam patterns
  § How can you tell when a user spams?
It’s All about Timing
“Real time” (< few seconds)
Reasonably Quick (seconds - minutes)
Batch (hours/days)
It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss :)
VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
In Memory Data Grid Review
§ RAM is the new disk
§ Data partitioned across a cluster
§ Large “virtual” memory space
§ Transactional
§ Highly available
§ Code collocated with data
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster async to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
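The flow in this slide resembles a write-behind pattern. Below is a minimal Python sketch of that idea, not the XAP implementation: `WriteBehindGrid`, `persist`, `keep`, and `enrich` are hypothetical names, and `persist` stands in for whatever call actually writes to Cassandra.

```python
import queue
import threading


class WriteBehindGrid:
    """In-memory store that persists asynchronously to a slower backing store."""

    def __init__(self, persist, keep=lambda event: True, enrich=lambda event: event):
        self._data = {}                # stand-in for the in-memory cluster
        self._pending = queue.Queue()  # writes waiting to reach the backing store
        self._persist = persist        # hypothetical callback, e.g. a Cassandra insert
        self._keep = keep              # optional filter
        self._enrich = enrich          # optional enrichment
        self._listeners = []
        threading.Thread(target=self._drain, daemon=True).start()

    def add_listener(self, listener):
        self._listeners.append(listener)

    def write(self, key, event):
        if not self._keep(event):
            return False
        event = self._enrich(event)
        self._data[key] = event          # result is readable immediately
        self._pending.put((key, event))  # persistence happens later
        for listener in self._listeners:
            listener(key, event)
        return True

    def read(self, key):
        return self._data.get(key)

    def flush(self):
        """Block until every queued write has been handed to the backing store."""
        self._pending.join()

    def _drain(self):
        while True:
            key, event = self._pending.get()
            try:
                self._persist(key, event)
            finally:
                self._pending.task_done()
```

Readers see the value as soon as `write` returns; the backing store catches up in the background, which is the trade the slide describes.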
Simplified Event Flow
Grid – Cassandra Interface
§ Hector and CQL based interface
§ In memory data must be mapped to column families
  § Configurable class to column family mapping
§ Must serialize individual fields
  § Fixed fields can use defined types
  § Variable fields (for schemaless in-memory mode) need serializers
§ Object model flattening
  § By default, nested fields are flattened
  § Can be overridden by custom serializer
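To make "object model flattening" concrete, here is a small Python sketch of turning a nested object into a single level of column names. The `flatten` helper and the underscore separator are assumptions for illustration; the slides do not say how XAP actually names the flattened columns.

```python
def flatten(obj, prefix="", sep="_"):
    """Flatten nested dicts into a single level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Nested field: recurse and carry the parent name as a prefix.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns
```

For example, `flatten({"id": 7, "user": {"name": "ann"}})` yields the columns `id` and `user_name`.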
Virtues and Limita=ons
§ Could be faster: high availability has a cost
§ Complex flows not easy to assemble or understand with simple event handlers
§ Complete stack, not just two tools of many
§ Fast
  § Microsecond latencies for in memory operations
  § Fast enough for almost anybody
§ Highly available/self healing
§ Elastic
§ Popular open source, real time, in-memory, streaming computation platform
§ Includes distributed runtime and intuitive API for defining distributed processing flows
§ Scalable and fault tolerant
§ Developed at BackType, and open sourced by Twitter
Storm Background
§ Streams
  § Unbounded sequence of tuples
§ Spouts
  § Source of streams (Queues)
§ Bolts
  § Functions, Filters, Joins, Aggregations
§ Topologies
Storm Abstractions
[Diagram labels: Spout, Bolt, Topologies]
Streaming word count with Storm
§ Storm has a simple builder interface for creating stream processing topologies
§ Storm delegates persistence to external providers
  § Cassandra, because of its write performance, is commonly used
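The code from the word count slide did not survive extraction. As a rough illustration of the spout, bolt, and topology roles, here is a plain Python sketch built on generators. It does not use Storm's API, and every function name in it is hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits a stream of one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: turns each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after each update."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split -> count and return the latest count per word."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        final[word] = count
    return final
```

In a real topology each stage runs distributed and in parallel, and the running counts would be handed to an external store such as Cassandra rather than kept in a local dict.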
Storm : Op=mis=c Processing
§ Storm (quite rationally) assumes success is normal
§ Storm uses batching and pipelining for performance
§ Therefore the spout must be able to replay tuples on demand in case of error
§ Any kind of quasi-queue like data source can be fashioned into a spout
§ No persistence is ever required, and speed is attained by minimizing network hops during topology processing
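The replay requirement can be sketched as a spout that remembers each emitted tuple until it is acknowledged. This is an illustrative Python model only: the method names loosely echo Storm's spout callbacks, but `ReplayableSpout` and `process_all` are hypothetical and nothing here calls Storm.

```python
from collections import deque


class ReplayableSpout:
    """Keeps emitted tuples until acked so failed ones can be replayed."""

    def __init__(self, source):
        self._queue = deque(source)
        self._in_flight = {}  # message id -> payload awaiting ack or fail
        self._next_id = 0

    def next_tuple(self):
        if not self._queue:
            return None
        self._next_id += 1
        payload = self._queue.popleft()
        self._in_flight[self._next_id] = payload
        return self._next_id, payload

    def ack(self, msg_id):
        # Success: the tuple no longer needs to be remembered.
        self._in_flight.pop(msg_id, None)

    def fail(self, msg_id):
        # Failure: put the tuple back at the front so it is emitted again.
        if msg_id in self._in_flight:
            self._queue.appendleft(self._in_flight.pop(msg_id))


def process_all(spout, handler):
    """Drive the spout, acking on success and failing on any exception."""
    processed = []
    while True:
        item = spout.next_tuple()
        if item is None:
            return processed
        msg_id, payload = item
        try:
            processed.append(handler(payload))
            spout.ack(msg_id)
        except Exception:
            spout.fail(msg_id)
```

A tuple that always fails would be replayed forever in this sketch; a real source would cap retries or divert such tuples.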
Fast. Want to go faster?
§ Eliminate non-memory components
  § Replace the disk based queue with a reliable in-memory queue
  § Replace disk based state persistence with in-memory persistence
  § Asynchronously update disk based state (C*)
Sample Architecture
References
§ Try the Cloudify recipe
  § Download Cloudify: http://www.cloudifysource.org/
  § Download the Recipe (apps/xapstream, services/xapstream):
    – https://github.com/CloudifySource/cloudify-recipes
§ XAP – Cassandra Interface Details:
  § http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
§ Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github:
  § https://github.com/Gigaspaces/storm-integration
§ For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  § Part 3 coming soon.
Twitter Storm With Cassandra
Challenge – Word Count
[Diagram labels: Tweets, Count?, Word:Count]
• Hottest topics
• URL mentions
• etc.
Supercharging Storm
§ Storm doesn’t supply persistence, but provides for it
§ Storm optimizes IO to slow persistence (e.g. databases) using batching
§ Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
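Batching writes to a slow store can be illustrated with a small buffer that turns many single updates into a few bulk writes. This is a generic Python sketch, not Storm's internal mechanism; `BatchingWriter` and its `write_batch` callback are hypothetical.

```python
class BatchingWriter:
    """Buffers updates and hands them to a slow store in batches."""

    def __init__(self, write_batch, batch_size=100):
        self._write_batch = write_batch  # hypothetical callback doing one bulk write
        self._batch_size = batch_size
        self._buffer = []

    def add(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Write whatever is buffered, even if the batch is not full."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._write_batch(batch)
```

With `batch_size=3`, seven records produce two full batches, and a final `flush()` writes the remaining one.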
[Diagram labels: “Tweets, events, whatever…”, XAP Real Time Analytics]
Two Layer Approach
§ Advantage: Minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
§ Grid layer serves as an in memory cache for interactive requests.
§ Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
[Diagram labels: In Memory Compute Cluster, NoSQL Cluster, Raw Event Stream, Real Time Events, Raw And Derived Events, Reporting Engine, Scale]
Simplified Architecture
§ Flowing event streams through memory for side effects
§ Event driven architecture executing in-memory
§ Raw events flushed, aggregations/derivations retained
§ All layers horizontally scalable
§ All layers highly available
§ Real-time analytics & cached batch analytics on same scalable layer
§ Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
Key Concepts
Keep Things In Memory
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
Take Aways
§ A data grid can serve different needs for big data analytics:
  § Supercharge a dedicated stream processing cluster like Storm.
    – Provide fast, reliable, transactional tuple streams and state
  § Provide a general purpose analytics platform
    – Roll your own
  § Simplify overall architecture while enhancing scalability
    – Ultra high performance/low latency
    – Dynamically scalable processing and in-memory storage
    – Eliminate messaging tier
    – Eliminate or minimize need for RDBMS
References
§ Realtime Analytics with Storm and Hadoop
  § http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
§ Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
§ Twitter Storm: http://storm-project.net
§ XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/