C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by DeWayne Filppi


DESCRIPTION

This session describes how to resolve the processing limitations by placing the streaming and data store interfaces in memory as well, through an in-memory computing platform. It also covers how to resolve the complexity challenge by implementing a DevOps approach that abstracts all the underlying infrastructure and provides single-click management of all the application tiers and services, on any environment (private/public cloud, bare metal…). The best news is that all this optimization can be implemented seamlessly, with no code changes to your apps.


Real Time Big Data With Storm, Cassandra, and In-Memory Computing

DeWayne Filppi @dfilppi

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY

2 ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved


The Two Vs of Big Data

• Velocity
• Volume

We’re Living in a Real Time World…

• Homeland Security
• Real Time Search
• Social
• eCommerce
• User Tracking & Engagement
• Financial Services


The Flavors of Big Data Analytics

• Counting
• Correlating
• Research


Analytics @ Twitter – Counting

• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
  – Countries and cities
  – Gender
  – Age groups
  – Device types
  – …


Analytics @ Twitter – Correlating

• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?


Analytics @ Twitter – Research

• Sentiment analysis
  – “Obama is popular”
• Trends
  – “People like to tweet after watching American Idol”
• Spam patterns
  – How can you tell when a user spams?


It’s All about Timing

• “Real time” (< few seconds)
• Reasonably quick (seconds - minutes)
• Batch (hours/days)


It’s All about Timing

• Event driven / stream processing
  – High resolution: every tweet gets counted
• Ad-hoc querying
  – Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
  – Low resolution (trends & patterns)


This is what we’re here to discuss :)

VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA


In Memory Data Grid Review

• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data
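To make the partitioning and collocation ideas concrete, here is a minimal sketch in plain Python. The `Grid` and `Partition` classes, the hash routing scheme, and the sample data are all hypothetical illustrations, not the API of any particular data grid product.

```python
from collections import Counter
from zlib import crc32


class Partition:
    """One node's slice of the grid, held entirely in memory."""

    def __init__(self):
        self.data = {}

    def execute(self, task):
        # The task runs next to the data instead of pulling the data out.
        return task(self.data)


class Grid:
    """Hypothetical in-memory grid that routes keys to partitions by hash."""

    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

    def _route(self, key):
        # crc32 is stable across runs, unlike Python's salted str hash.
        index = crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def write(self, key, value):
        self._route(key).data[key] = value

    def read(self, key):
        return self._route(key).data.get(key)

    def map_reduce(self, task, reducer):
        # Ship the task to every partition, then combine the partial results.
        return reducer(p.execute(task) for p in self.partitions)


def merge(partials):
    """Combine per-partition counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


grid = Grid(partition_count=4)
for user, country in [("ann", "US"), ("bob", "UK"), ("cho", "US"), ("dee", "IL")]:
    grid.write(user, country)

# Count users per country without moving raw entries off their partitions.
totals = grid.map_reduce(lambda data: Counter(data.values()), merge)
```

Only the small per-partition counters cross partition boundaries here, which is the point of collocating code with data.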


Data Grid + Cassandra: A Complete Solution

• Data flows through the in-memory cluster asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
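The flow described on this slide resembles a write-behind pattern, which can be sketched as follows. The `WriteBehindGrid` class, its `keep` and `enrich` hooks, and the dict standing in for Cassandra are assumptions made for illustration. A real deployment would flush on a background thread rather than through an explicit call.

```python
from collections import deque


class WriteBehindGrid:
    """Hypothetical grid that answers from memory and persists asynchronously."""

    def __init__(self, store, keep=lambda event: True, enrich=lambda event: event):
        self.store = store          # stand-in for a Cassandra column family
        self.keep = keep            # optional filter
        self.enrich = enrich        # optional enrichment
        self.memory = {}
        self.pending = deque()      # writes not yet flushed to the store
        self.listeners = []

    def write(self, key, event):
        if not self.keep(event):
            return False
        event = self.enrich(event)
        self.memory[key] = event            # result is instantly available
        self.pending.append((key, event))   # persistence happens later
        for listener in self.listeners:
            listener(key, event)            # notify event listeners
        return True

    def flush(self):
        """Drain queued writes to the backing store and return the count."""
        flushed = 0
        while self.pending:
            key, event = self.pending.popleft()
            self.store[key] = event
            flushed += 1
        return flushed


cassandra = {}   # a plain dict standing in for the real database
seen = []
grid = WriteBehindGrid(
    cassandra,
    keep=lambda e: e["text"] != "",
    enrich=lambda e: {**e, "length": len(e["text"])},
)
grid.listeners.append(lambda key, event: seen.append(key))

grid.write("t1", {"text": "hello"})
grid.write("t2", {"text": ""})       # filtered out before it reaches memory
in_memory_before_flush = grid.memory["t1"]
persisted_before_flush = dict(cassandra)
flushed = grid.flush()
```

The write is readable from memory before anything reaches the backing store, so the caller never waits on the database.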


Simplified Event Flow


Grid – Cassandra Interface

• Hector and CQL based interface
• In memory data must be mapped to column families.
  – Configurable class to column family mapping
• Must serialize individual fields
  – Fixed fields can use defined types
  – Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
  – By default, nested fields are flattened.
  – Can be overridden by custom serializer.
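As an illustration of what flattening a nested object into columns can look like, here is a small sketch. The `flatten` and `to_row` helpers, the underscore separator, and the per-field serializer map are hypothetical. The actual XAP mapping rules and column naming may differ.

```python
def flatten(obj, prefix="", sep="_"):
    """Flatten nested dicts into a single level of column name to value pairs."""
    columns = {}
    for field, value in obj.items():
        name = f"{prefix}{sep}{field}" if prefix else field
        if isinstance(value, dict):
            columns.update(flatten(value, name, sep))
        else:
            columns[name] = value
    return columns


def to_row(obj, serializers=None):
    """Serialize each flattened field individually; str() is the fallback."""
    serializers = serializers or {}
    return {
        name: serializers.get(name, str)(value)
        for name, value in flatten(obj).items()
    }


tweet = {
    "id": 42,
    "user": {"name": "dfilppi", "location": {"country": "US"}},
    "tags": ["storm", "cassandra"],
}

# The "tags" field gets a custom serializer; everything else uses str().
row = to_row(tweet, serializers={"tags": ",".join})
```

Each nested path becomes one column name, and a field without a fixed type can be given its own serializer.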


Virtues and Limitations

Virtues
• Complete stack, not just two tools of many
• Fast
  – Microsecond latencies for in memory operations
  – Fast enough for almost anybody
• Highly available/self healing
• Elastic

Limitations
• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers

Storm Background

• Popular open source, real time, in-memory, streaming computation platform.
• Includes distributed runtime and intuitive API for defining distributed processing flows.
• Scalable and fault tolerant.
• Developed at BackType, and open sourced by Twitter

• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (Queues)
• Bolts
  – Functions, Filters, Joins, Aggregations
• Topologies


Storm Abstractions

[Diagram: Spout, Bolt, Topologies]


Streaming word count with Storm

• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
  – Cassandra, because of its write performance, is commonly used
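The shape of a streaming word count can be sketched without Storm itself. The following is a plain Python imitation of the spout, split bolt, and count bolt pipeline using generators. It does not use Storm's actual builder API, and the function names and the `Counter` standing in for external state are assumptions.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, state):
    """Bolt: updates the running count and emits (word, count) tuples."""
    for (word,) in stream:
        state[word] += 1
        yield (word, state[word])


def run_topology(sentences, state):
    # Wiring the stages mirrors a topology definition: spout, split, count.
    return list(count_bolt(split_bolt(sentence_spout(sentences)), state))


counts = Counter()   # stand-in for external state such as a Cassandra table
emitted = run_topology(["the cow jumped", "the moon"], counts)
```

In a real topology each stage would run in parallel across a cluster, and the counts would be written to an external store.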


Storm: Optimistic Processing

• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error.
• Any kind of quasi-queue like data source can be fashioned into a spout.
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
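The replay requirement can be illustrated with a small sketch. The `ReplayableSpout` class below is hypothetical. Its `next_tuple`, `ack`, and `fail` methods are loosely modeled on the callbacks a Storm spout implements, but this is standalone Python, not Storm code.

```python
from collections import deque


class ReplayableSpout:
    """Hypothetical spout over a queue-like source that can replay failed tuples."""

    def __init__(self, source):
        self.queue = deque(source)
        self.in_flight = {}     # message id mapped to tuple awaiting ack or fail
        self.next_id = 0

    def next_tuple(self):
        if not self.queue:
            return None
        value = self.queue.popleft()
        msg_id = self.next_id
        self.next_id += 1
        self.in_flight[msg_id] = value
        return msg_id, value

    def ack(self, msg_id):
        # Success is the normal case: just forget the tuple.
        del self.in_flight[msg_id]

    def fail(self, msg_id):
        # On error, put the tuple back so it is emitted again.
        self.queue.append(self.in_flight.pop(msg_id))


def process(spout, handler):
    """Drain the spout, acking successes and failing errors for replay."""
    results = []
    while (emitted := spout.next_tuple()) is not None:
        msg_id, value = emitted
        try:
            results.append(handler(value))
            spout.ack(msg_id)
        except ValueError:
            spout.fail(msg_id)
    return results


attempts = {}


def flaky_upper(word):
    # Fails the first time it sees "b", then succeeds on replay.
    attempts[word] = attempts.get(word, 0) + 1
    if word == "b" and attempts[word] == 1:
        raise ValueError("transient failure")
    return word.upper()


spout = ReplayableSpout(["a", "b", "c"])
results = process(spout, flaky_upper)
```

The failed tuple is processed again after the others, so every input is eventually handled, though not necessarily in its original order.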


Fast. Want to go faster?

• Eliminate non-memory components
• Replace the disk based queue with a reliable in-memory queue
• Replace disk based state persistence with in-memory persistence
• Asynchronously update disk based state (C*)


Sample Architecture


References

• Try the Cloudify recipe
  – Download Cloudify: http://www.cloudifysource.org/
  – Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details:
  – http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on GitHub:
  – https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  – Part 3 coming soon.


Twitter Storm With Cassandra


Storm Overview

• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (Queues)
• Bolts
  – Functions, Filters, Joins, Aggregations
• Topologies


Storm Concepts

[Diagram: Spouts, Bolt, Topologies]

Challenge – Word Count

[Diagram: Tweets, Count ?, Word:Count]

• Hottest topics
• URL mentions
• etc.


Streaming word count with Storm


Supercharging Storm

• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching.
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.

[Diagram: Tweets, events, whatever… / XAP Real Time Analytics]


Two Layer Approach

• Advantage: Minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in memory cache for interactive requests.
• Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.

[Diagram: In Memory Compute Cluster, NoSQL Cluster, Raw Event Stream, Real Time Events, Raw And Derived Events, Reporting Engine, SCALE]


Simplified Architecture

• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
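The idea of flushing raw events while retaining aggregations can be sketched as below. The `EventAggregator` class, its batch size, and the list standing in for the NoSQL store are hypothetical choices for illustration only.

```python
from collections import defaultdict


class EventAggregator:
    """Hypothetical in-memory layer: keeps aggregates, flushes raw events."""

    def __init__(self, archive, batch_size):
        self.archive = archive            # stand-in for the NoSQL store
        self.batch_size = batch_size
        self.raw = []                     # raw events waiting to be flushed
        self.counts = defaultdict(int)    # derived data retained in memory

    def on_event(self, event):
        self.counts[event["topic"]] += 1  # side effect computed in memory
        self.raw.append(event)
        if len(self.raw) >= self.batch_size:
            self.flush()

    def flush(self):
        self.archive.extend(self.raw)     # raw events leave memory in a batch
        self.raw.clear()


archive = []
aggregator = EventAggregator(archive, batch_size=2)
for topic in ["storm", "cassandra", "storm"]:
    aggregator.on_event({"topic": topic})
```

Memory use for raw events stays bounded by the batch size, while the much smaller aggregates remain available for real-time queries.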


Key Concepts

Keep Things In Memory

Facebook keeps 80% of its data in Memory (Stanford research)

RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
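As a rough worked example, the quoted latencies can be turned into a speedup ratio and a lookup rate. The helper names are hypothetical, and the calculation assumes strictly sequential random lookups with no caching or batching. Taken literally, the two quoted latencies give a ratio above the 100-1000x headline figure, which presumably rests on different assumptions.

```python
DISK_SEEK_MS = (5.0, 10.0)   # random seek range quoted on the slide
RAM_ACCESS_MS = 0.001        # approximate RAM access time quoted on the slide


def speedup(disk_ms, ram_ms=RAM_ACCESS_MS):
    """How many RAM accesses fit in the time of one disk seek."""
    return disk_ms / ram_ms


def lookups_per_second(latency_ms):
    """Sequential random lookups per second at a given per-lookup latency."""
    return 1000.0 / latency_ms


low, high = (round(speedup(ms)) for ms in DISK_SEEK_MS)
disk_rate = lookups_per_second(DISK_SEEK_MS[0])
ram_rate = lookups_per_second(RAM_ACCESS_MS)
```

At the faster quoted seek time, one thread manages a few hundred random disk lookups per second, against roughly a million from RAM.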

Take Aways

• A data grid can serve different needs for big data analytics:
  • Supercharge a dedicated stream processing cluster like Storm.
    – Provide fast, reliable, transactional tuple streams and state
  • Provide a general purpose analytics platform
    – Roll your own
  • Simplify overall architecture while enhancing scalability
    – Ultra high performance/low latency
    – Dynamically scalable processing and in-memory storage
    – Eliminate messaging tier
    – Eliminate or minimize need for RDBMS

References

• Realtime Analytics with Storm and Hadoop
  – http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/

