The Art of Big Data


Slides for my talk at the Naval Postgraduate School PhD Seminar


Krishna Sankar, http://doubleclix.wordpress.com

EC4000–PhD Guest Seminar, Naval Post Graduate School

Nov 29, 2011

The road lies plain before me;--'tis a theme

Single and of determined bounds; …

- Wordsworth, The Prelude

What is Big Data ?

Big Data to smart data

Big Data Pipeline

Analytic Algorithms

Storage - NOSQL

Processing - Hadoop …

Analytics/Modeling

R

Visualization

Agenda
o To cover the broad picture
o Understand the waypoints &
o Drill down into one area (NOSQL)
o Can do others later
o Of the Big Data domain …

Thanks to … the giants whose shoulders I am standing on

Special thanks to:
• Peter Ateshian, NPS
• Prof Murali Tummala, NPS
• Shirley Bailes, O’Reilly
• Ed Dumbill, O’Reilly
• Jeff Barr, AWS
• Jenny Kohr Chynoweth, AWS

When I think of my own native land, In a moment I seem to be there;

But, alas! recollection at hand Soon hurries me back to despair.

- Cowper, The Solitude Of Alexander Selkirk

What is Big Data ?

“Big data” is data that becomes large enough that it cannot be processed using conventional methods. @twitter

Ref: http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

“Big data” is less about size, more about flow & velocity - persisting petabytes per year is easier than processing terabytes per hour. @twitter

What is Big Data ?

Ref: http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/

Vinod Khosla’s Cool Dozen !
① Consumers : “Widespread innovation in technologies that reduce data overload for users” ~ Data Reduction
② Businesses : “Simple solutions to handle the deluge of data generated from various sources …” ~ Big Data Analytics

TV 2.0, Education, Social NEXT, Tools for sharing interest, Publishing, …

① Volume
   o Scale
② Velocity
   o Data change rate vs. decision window
③ Variety
   o Different sources & formats
   o Structured vs. Unstructured
④ Variability
   o Breadth of interpretation &
   o Depth of analytics
⑤ Contextual
   o Dynamic variability
   o Recommendation
⑥ Connectedness

EBC322

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf


I. Two Main Types – based on collection
   i. Big Data Streams
      o Data in “motion”
      o Twitter fire hose, Facebook, G+
   ii. Big Data Logs
      o Data “at rest”
      o Logs, DW, external market data, POS, …
II. Typically, Big Data has a non-deterministic angle as well …
      o Creative Discovery
      o Iterative, Model based Analytics
      o Explore questions to ask
III. Smart Data = Big Data + context + embedded/interactive (inference, reasoning) models
      o Model Driven
      o Declaratively Interactive

http://www.slideshare.net/leonsp/hadoop-slides-11-what-is-big-data
http://www.slideshare.net/Dataversity/wed-1550-bacvanskivladimircolor

Twitter
§ 200 million tweets/day
§ Peak 10,000/second
§ How would you handle the fire hose for social network analytics ?

http://goo.gl/dcBsQ

Storage
§ 4 U box = 40 TB,
§ 1 PB = 25 boxes !
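A quick back-of-envelope check of the Twitter and storage figures, as a rough sketch. It assumes decimal units (1 PB = 1000 TB), which the slide does not state, and the variable names are illustrative only.

```python
# Back-of-envelope arithmetic for the figures quoted on the slides.
# Assumption (not from the slides): decimal units, 1 PB = 1000 TB.

TWEETS_PER_DAY = 200_000_000
PEAK_TWEETS_PER_SECOND = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if the daily volume were spread evenly over the day.
average_tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY
# How much headroom the peak needs compared with the average.
peak_to_average = PEAK_TWEETS_PER_SECOND / average_tweets_per_second

TB_PER_BOX = 40
TB_PER_PB = 1000
boxes_per_pb = TB_PER_PB // TB_PER_BOX

print(f"average rate: {average_tweets_per_second:,.0f} tweets/second")
print(f"peak is {peak_to_average:.1f}x the average")
print(f"4U boxes needed for 1 PB: {boxes_per_pb}")
```

The gap between the average and the peak rate is the reason a fire hose is sized for velocity rather than for daily volume.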

Zynga
§ “Analytics company, not a gaming company!”
§ Harvests data : 15 TB/day
   § Test new features
   § Target advertising
§ 230 million players/month

AWS – 600 Billion objects!

• 6 Billion Messages per day
• 2 PB (w/compression) online
• 6 PB w/ replication
• 250 TB/Month growth
• HBase Infrastructure

Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf

Path Analysis, A/B Testing
50 TB/Day
240 nodes, 84 PB Teradata Installation
Very systematic
Diagram speaks volumes!

• “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
• 15 TB memory, across 90 IBM 760 servers, in 10 racks
• 1 TB of dataset
• 200 Million pages processed by Hadoop
• This is a good example of Connected data
   – Contextual w/ variability
   – Breadth of interpretation
   – Analytics depth

http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/

[Diagram: the Big Data landscape. Labels: Storage, Parallelism, Inference, NOSQL, HPC, Map/Reduce, Object Store, Block Store, Analytics, Web Analytics, Log Analytics, Social Media, Social Graph, Knowledge Graph, Distributed Applications, Warehouse-style Applications, Recommendation/Inference Engines, Machine Learning, Classification, Clustering, Search, Indexing, Mahout, Cloud Architecture, Big Data]

Big Data to Smart Data

“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”

- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979

Big data to smart data • summary
1 Don’t throw away any data !
2 Be ready for different ways of organizing the data

http://goo.gl/fGw7r

Big Data Pipeline

If a problem has no solution, it is not a problem, but a fact, not to be solved but to be coped with, over time …

- Peres’s Law

Big Data Pipeline • Stages
o Collect
o Store
o Transform & Analyze
o Model & Reason
o Predict, Recommend & Visualize

• Different systems have different characteristics
   o Infrastructure optimization based on application/hardware attributes correlation (short term)
      • Hadoop, Splunk, internal Dashboard
   o Application performance trends (medium term)
      • Analytics, Modeling, …
   o Product Metrics
      • Feature set vs. usage, what is important to users, stratification
      • Modeling using R, Visualization layers like Tableau

[Diagram: Big Data Pipeline. Attributes: Volume, Velocity, Variety, Variability, Connectedness, Context, Model, Infer-ability. Actions: Decomplexify! Contextualize! Network! Reason! Infer!]

Tools shown along the pipeline:
o Logs, Scribe, Flume, Hadoop …
o SQL, NOSQL, HDFS, XML, files, …
o SQL, BI Tools, Hadoop, Pig, Hive, .NET Dryad, various other tools
o Hand coded programs, R, Mahout, …
o Internal dashboards, Tableau

Ref: http://goo.gl/Mm83k

The NOSQL !

I AM monarch of all I survey; My right there is none to dispute;

From the centre all round to the sea I am lord of the fowl and the brute

- Cowper, The Solitude Of Alexander Selkirk

Build to Fail - “It is working” is not binary

Agenda
• Opening Gambit
   – NOSQL : Toil, Tears & Sweat !
• The Pragmas
   – ABCs of NOSQL [ACID, BASE & CAP]
• The Mechanics
   – Algorithmics & Mechanisms (For reference)

Referenced Links @ http://doubleclix.wordpress.com/2010/06/20/nosql-talk-references/

What is NOSQL Anyway ?

• NOSQL != NoSQL or NOSQL != (!SQL)
• NOSQL = Not Only SQL
• Can be traced back to Eric Evans[2]!
   – You can ask him during the afternoon session!
• Unfortunate name, but is stuck now
• Non Relational could have been better
• Usually Operational, Definitely Distributed
• NOSQL has certain semantics – need not stay that way

Key Value | Column | Document | Graph

Ref: [22,51,52]

[Diagram: NOSQL product map. Products shown: Neo4j, FlockDB, InfiniteGraph, CouchDB, MongoDB, Lotus Domino, Riak, Google BigTable, HBase, Cassandra, HyperTable, SimpleDB, Memcached, Redis, Tokyo Cabinet, Dynamo, Voldemort, Azure TS. The diagram also carries the labels In-memory and Disk Based.]

WHAT WORKS NOSQL Tales from the field

When I think of my own native land, In a moment I seem to be there;

But, alas! recollection at hand Soon hurries me back to despair.

- Cowper, The Solitude Of Alexander Selkirk

• Designer Augmenting RDBMS with a Distributed Key Value Store [40 : A good talk by Geir]
• Invitation only designer brand sales
• Limited inventory sales – start at 12:00, members have 10 min to grab them. 500K mails every day
• Keeps brand value, hidden from search
• Interesting load properties
• Each item a row in DB – BUY NOW reserves it
   – Can't order more
• Started out as a Rails app
   – shared nothing
• Narrow peaks – half of revenue

Christian Louboutin Effect
• ½ amz for Louboutin
• Use Voldemort
• Inventory, Shopping Cart, Checkout
• Partition by prod ID
• Shared infrastructure – “fog” not “cloud” - Joyent!
• In-memory inventory
• Not afraid of sale anymore!

And SQL DBs are still relevant !

Typical NOSQL Example: Bit.ly
• Bit.ly URL shortening service, uses MongoDB
• User, title, URL, hash, labels[I-5], sort by time
• Scale – ~50M users, ~10K concurrent, ~1.25B shortens per month
• Criteria:
   – Simple, Zippy FAST, Very Flexible, Reasonable Durability, Low cost of ownership
• Sharded by userid

• New kind of “dictionary”: a word repository, GPS for English – context, pronunciations, twitter … developer API
• Characteristics[I-6, Tony Tam’s presentation]
   – RO-centric, 10,000 reads for every write
   – Hit a wall with MySQL (4B rows)
   – MongoDB read was so good that memcached layer was not required
   – MongoDB used 4 times MySQL storage
• Another example :
   – Voldemort – Unified Communications, IP-Phone data stored keyed off of phone number. Data relatively stable

Large Hadron Collider@CERN
• DAS is part of giant data management enterprise (CMS)
   – Polyglot Persistence (SQL + NOSQL, Mongo, Couch, memcache, HDFS, Lustre, Oracle, MySQL, …)
• Data Aggregation System [I-1,I-2,I-3,I-4]
   – Uses MongoDB
   – Distributed Model, 2-6 PB data
   – Combine info. from different metadata sources, query without knowing their existence; user has domain knowledge – but shouldn’t deal with various formats, interfaces and query semantics
   – DAS aggregates, caches and presents data as JSON documents – preserving security & integrity

And SQL DBs are still relevant !

Scaling Twitter

• Digg
   – RDBMS places more burden on reads than writes[I-8]
   – Looked at NOSQL, selected Cassandra
      • Column oriented, so more structure than key-value
• Heard from noSQL Boston [http://twitter.com/#search?q=%23nosqllive]
   – Baidu: 120 node HyperTable cluster managing 600TB of data
   – StumbleUpon uses HBase for Analytics
   – Twitter’s current Cassandra cluster: 45 nodes

• Adobe is an HBase shop[I-10,I-11,2]
• Adobe SaaS Infrastructure – tagging, content aggregation, search, storage and so forth
• Dynamic schema & huge number of records[I-5]
• 40 million records in 2008 to 1 billion with 50 ms response
• NOSQL not mature in 2008, now good enough
• Prod Analytics: 40 nodes, largest has 100 nodes

• BBC is a CouchDB shop[I-13]
• Sweet spot:
   • Multi-master, multi datacenter replication
   • Interactive Mediums
• Old data to CouchDB
• Thus free up DB to do work!

• Cloudkick is a Cassandra shop[I-12]
• Cloudkick offers cloud management services
• Store metrics data
• Linear scalability for write load
• Massive write performance
   • Memory table & serial commit log
• Low operational costs
• Data Structure
   – Metrics, Rolled-up data, Statuses at time slice : all indexed by timestamp

• Guardian/UK
   – Runs on Redis[I-14] !
   – “Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. … the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.”
   – NOSQL can increase performance of relational data by offloading specific data and tasks

And SQL DBs are still relevant !
"The evil that SQL DBs do lives after them; the good is oft interred with their bones..."

NOSQL at Netflix
• Netflix is fully in the cloud
• Uses NOSQL across the globe
• Customer Profiles, watchlog, usage logging (see next slide)
   – No multi-record locking
• No DBA !
• Easier Schema Changes
• Less complex, Highly Available data store
• Joins happen in the applications

http://www.hpts.ws/sessions/nosql-ecosystem.pdf
http://www.hpts.ws/sessions/GlobalNetflixHPTS.pdf

21 NOSQL Themes
• Web Scale
• Scale Incrementally/continuous growth
• Oddly shaped & exponentially connected
• Structure data as it will be used – i.e. read, query
• Know your queries/updates in advance[96], but you can change them later
• Compute attributes at run time
• Create a few large entities with optional parts
   – Normalization creates many small entities
• Define Schemas in models (not in databases)
• Avoid impedance mismatch
• Narrow down & solve your core problem
• Solve the right problem with the right tool

Ref: [I-8]

21 NOSQL Themes
• Existing solutions are clunky[1] (in certain situations)
• Scale automatically, “becoming prohibitively costly (in terms of manpower) to operate” Twitter[I-9]
• Distribution & partitioning are built into NOSQL
• RDBMS distribution & sharding is not fun and is expensive
   – Lose most functionality along the way
• Data at the center, Flexible schema, Fewer joins
• The value of NOSQL is in flexibility as much as it is in “Big Data”

21 NOSQL Themes
• Requirements[3]
   – Data will not fit in one node
      • And so need data partition/distribution by the system
   – Nodes will fail, but data needs to be safe – replication!
   – Low latency for real-time use
• Data Locality
   – Row based structures will need to read whole row, even for a column
   – Column based structures need to scan for each row
• Solution : Column storage with Locality
   – Keep data that is read together, don’t read what you don’t care about
      • For example friends – other data

Ref: 3

ABCs of NOSQL - ACID, BASE & CAP

The woods are lovely, dark, and deep, But I have promises to keep,

And miles to go before I sleep, And miles to go before I sleep.

-Frost

CAP Principle

[Diagram: triangle with the labels Consistency, Availability, Partition]

“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
o C-A, No P → Single DB server, no network partition
o C-P, No A → Block transaction in case of partition failure
o A-P, No C → Expiration based caching, voting majority
   – Interesting (& controversial) from NOSQL perspective

Which feature to discard depends on the nature of your system[41]

ABCs of NOSQL
• ACID
   o Atomicity, Consistency, Isolation & Durability – fundamental properties of SQL DBMS
• BASE[35,39]
   o Basically Available, Soft state (Scalable), Eventually Consistent
• CAP[36,39]
   o Consistency, Availability & Partitioning
   o This C is ~A+C
      • i.e. Atomic Consistency[36]

ACID
• Atomicity
   o All or nothing
• Consistent
   o From one consistent state to another
      • e.g. Referential Integrity
   o But it is also application dependent
      • e.g. min account balance
      • Predicates, invariants, …
• Isolation
• Durability

CAP Pragmas
• Preconditions
   o The domain is scalable web apps
   o Low latency for real time use
   o A small sub-set of SQL functionality
   o Horizontal Scaling
• Pritchett[35] talks about relaxing consistency across functional groups rather than within functional groups
• Idempotency to consider
   o Updates inc/dec are rarely idempotent
   o Order preserving trx are not idempotent either
   o MVCC is an answer for this (CouchDB)
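The idempotency point can be illustrated with a small sketch. The `Account` class and its methods are hypothetical, and the operation-id bookkeeping shown here is a common alternative technique, not the MVCC approach the slide attributes to CouchDB.

```python
# Sketch: a retried "set" is harmless, a retried "increment" is not.
# Account and its method names are made up for illustration.

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.applied = set()  # ids of operations already applied

    def increment(self, amount):
        # Not idempotent: every delivery changes the state again.
        self.balance += amount

    def set_balance(self, value):
        # Idempotent: repeating the call leaves the same state.
        self.balance = value

    def increment_once(self, op_id, amount):
        # Made idempotent by remembering which operation ids were applied.
        if op_id not in self.applied:
            self.applied.add(op_id)
            self.balance += amount


def deliver_twice(operation, *args):
    """Simulate a retry after a lost acknowledgement."""
    operation(*args)
    operation(*args)


plain = Account(100)
deliver_twice(plain.increment, 10)

absolute = Account(100)
deliver_twice(absolute.set_balance, 110)

deduplicated = Account(100)
deliver_twice(deduplicated.increment_once, "op-1", 10)

print(plain.balance, absolute.balance, deduplicated.balance)
```

Only the plain increment ends up with the wrong balance after the retry, which is why inc/dec updates need extra care in systems that redeliver messages.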

Consistency
• Strict Consistency
   o Any read on Data X will return the most recent write on X[42]
• Sequential Consistency
   o Maintains sequential order from multiple processes (No mention of time)
• Linearizability
   o Add timestamp from loosely synchronized processes

Consistency
• Write availability, not read availability[44]
• Even load distribution is easier in eventually consistent systems
• Multi-data center support is easier in eventually consistent systems
• Some problems are not solvable with eventually consistent systems
• Code is sometimes simpler to write in strongly consistent systems

CAP Essentials – 1 of 3
• “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
   o C-A, No P → Single DB server, no network partition
   o C-P, No A → Block transaction in case of partition failure
   o A-P, No C → Expiration based caching, voting majority
• Which feature to discard depends on the nature of your system[41]

CAP Essentials – 2 of 3
• Yield vs. Harvest[37]
   o Yield → Probability of completing a request
   o Harvest → Fraction of data reflected in the response
• Some systems tolerate < 100% harvest (e.g. search, i.e. approximate answers OK); others need 100% harvest (e.g. Trx, i.e. correct behavior = single well defined response)
• For sub-systems that tolerate harvest degradation, CAP makes sense
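Yield and harvest can be made concrete with a toy sharded search. This is a sketch only: the shard layout, the `search` function and the sample documents are invented for illustration.

```python
# Sketch: a search that keeps answering (yield) while some shards are
# down, at the cost of reflecting less of the data (harvest).

def yield_fraction(completed_requests, total_requests):
    """Yield: share of requests that were completed."""
    return completed_requests / total_requests


def harvest_fraction(shards_answered, shards_total):
    """Harvest: share of the data reflected in one response."""
    return shards_answered / shards_total


def search(shards, query):
    """Query every shard; a shard that is down is represented by None."""
    hits = []
    answered = 0
    for shard in shards:
        if shard is None:
            continue  # skip the failed shard instead of failing the request
        answered += 1
        hits.extend(doc for doc in shard if query in doc)
    return hits, harvest_fraction(answered, len(shards))


shards = [["big data", "small data"], None, ["data flow"], ["nosql"]]
hits, harvest = search(shards, "data")
print(hits, harvest)
print(yield_fraction(990, 1000))
```

The request completes with three of four shards reflected, which is acceptable for search but would not be for a transaction.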

CAP Essentials – 3 of 3
• Trading Harvest for yield – AP
• Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
   o Hence NotOnly SQL not No SQL
   o Intelligent homing to tolerate partition failures[44]
   o Multi zones in a region (150 miles - 5 ms)
   o Twitter tweets in Cassandra & MySQL
   o BBC using MongoDB for offloading DBMS
   o Polyglot persistence at LHC@CERN

Most important point in the whole presentation

Eventual Consistency & AMZ
• Distribution Transparency[38]
• In larger distributed systems, network partitions are a given
• Consistency Models
   o Strong
   o Weak
      • Has an inconsistency window between the update and the guaranteed view
   o Eventual
      • If no new updates, all will see the value, eventually

Eventual Consistency & AMZ
• Guarantee variations[38]
   o Read-Your-writes
   o Session consistency
   o Monotonic Read consistency
      • Access will not return previous value
   o Monotonic Write consistency
      • Serialize write by the same process
• Guarantee order (vector clocks, mvcc)
   o Example : Amz Cart merger (let cart add even with partial failure)
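The vector-clock idea behind the cart merger can be sketched as follows. The node names, the cart contents and the union-based merge are assumptions made for illustration; this is not Amazon's actual code.

```python
# Sketch: vector clocks detect concurrent updates to a shopping cart,
# and a semantic merge (set union) reconciles them.

def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"


def merge_clocks(a, b):
    """Element-wise maximum of two clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


base_clock = increment({}, "n1")
clock_a = increment(base_clock, "n1")   # cart updated via node n1
clock_b = increment(base_clock, "n2")   # cart updated via node n2
cart_a = {"book", "towel"}
cart_b = {"book", "tea"}

relation = compare(clock_a, clock_b)
if relation == "concurrent":
    merged_cart = cart_a | cart_b
    merged_clock = merge_clocks(clock_a, clock_b)
print(relation, sorted(merged_cart), merged_clock)
```

A union merge never loses an added item, which matches the "let cart add even with partial failure" behaviour; handling removals needs more machinery than this sketch shows.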

Eventual Consistency & AMZ - SimpleDB
• SimpleDB strong consistency semantics [49,50]
   o Until Feb 2010, SimpleDB only supported eventual consistency, i.e. GetAttributes after PutAttributes might not be the same for some time (1 second)
   o On Feb 24, AWS added ConsistentRead=True attribute for read
   o Read will reflect all writes that got 200OK till that time!

Eventual Consistency & AMZ - SimpleDB
• SimpleDB strong consistency semantics [49,50]
   o Also added conditional put/delete
   o Put attribute has a specified value (Expected.1.Value=) or (Expected.1.Exists = true/false)
   o Same conditional check capability for delete also
   o Only on one attribute !
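The semantics of a conditional put can be modelled in a few lines of plain Python. This is an in-memory sketch of the idea only: `Item`, `conditional_put` and `ConditionFailed` are hypothetical names and do not represent the SimpleDB API or any client library.

```python
# Sketch: a put that succeeds only if ONE expected attribute matches,
# used here for optimistic concurrency with a "version" attribute.

class ConditionFailed(Exception):
    """Raised when the expected attribute state does not match."""


class Item:
    def __init__(self):
        self.attributes = {}

    def conditional_put(self, name, value, expected_name,
                        expected_value=None, expected_exists=True):
        present = expected_name in self.attributes
        if expected_exists:
            # The attribute must exist and hold the expected value.
            if not present or self.attributes[expected_name] != expected_value:
                raise ConditionFailed(expected_name)
        elif present:
            # The attribute was expected to be absent.
            raise ConditionFailed(expected_name)
        self.attributes[name] = value


item = Item()
# Create only if no version exists yet.
item.conditional_put("version", "1", "version", expected_exists=False)
# Update only if the version is still "1".
item.conditional_put("version", "2", "version", expected_value="1")
# A writer holding the stale version "1" is rejected.
try:
    item.conditional_put("version", "3", "version", expected_value="1")
    stale_write_accepted = True
except ConditionFailed:
    stale_write_accepted = False
print(item.attributes, stale_write_accepted)
```

Because the check covers a single attribute, a version counter like the one above is the usual way to protect a multi-attribute update.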

Eventual Consistency & AMZ – S3
• S3 is an eventual consistency system
   o Versioning
   o “S3 PUT & COPY synchronously store data across multiple facilities before returning SUCCESS”
   o Repair lost redundancy, repair bit-rot
   o Reduced Redundancy option for data that can be reproduced (99.999999999% vs. 99.99%)
      • Approx 1/3rd less
   o CloudFront for caching
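To get a feel for the two durability figures, here is a small arithmetic sketch. It assumes the percentages are per-object durability over one year and uses a made-up inventory of 10 million objects; neither assumption comes from the slide.

```python
# Sketch: expected number of objects lost per year at each durability level.
# Decimal is used so that 100 - 99.999999999 is computed exactly.
from decimal import Decimal


def expected_annual_loss(object_count, durability_percent):
    """Expected objects lost per year, given durability as a percent string."""
    loss_probability = (Decimal(100) - Decimal(durability_percent)) / Decimal(100)
    return object_count * loss_probability


OBJECTS = 10_000_000  # hypothetical inventory size

standard = expected_annual_loss(OBJECTS, "99.999999999")
reduced = expected_annual_loss(OBJECTS, "99.99")

print(f"standard storage: {standard} objects/year")
print(f"reduced redundancy: {reduced} objects/year")
```

Under these assumptions the standard tier loses about one object every 10,000 years, while the reduced tier loses about a thousand per year, which is why it suits only data that can be reproduced.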

!SQL ?
• “We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines.”[43]
• “Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap”
• “The 1970 - 1985 period was a time of intense debate, a myriad of ideas, & considerable upheaval. We predict the next fifteen years will have the same feel”

Further deliberation
• Daniel Abadi[45], Mike Stonebraker[46], James Hamilton[47], Pat Helland[48] are all good reads for further deliberation

NOSQL Internals & Algorithmics

Caveats
• A representative subset of the mechanics and mechanisms used in the NOSQL world
• Being refined & newer ones are being tried
• At a system level – to show how the techniques play a part to deliver a capability
• The NOSQL Papers and other references for further deliberation
• Even if we don’t cover fully, it is OK. I want to introduce some of the concepts so that you get an appreciation …

NOSQL Mechanics
• Horizontal Scalability
   – Gossip (Cluster membership)
   – Failure Detection
   – Consistent Hashing
   – Replication Techniques
      • Hinted Handoff
      • Merkle Trees
   – Sharding MongoDB
   – Regions in HBase
• Performance
   – SSTables/memtables
   – LSM w/Bloom Filter
• Integrity/Version reconciliation
   – Timestamps
   – Vector Clocks
   – MVCC
   – Semantic vs. syntactic reconciliation

Consistent Hashing
• Origin: web caching, “To decrease ‘hot spots’”
• Three goals[87]
   – Smooth evolution
      • When a new machine joins, minimum rebalance work and impact
   – Spread
      • Objects assigned to a min number of nodes
   – Load
      • # of distinct objects assigned to a node is small

Consistent Hashing
• Hash Keyspace/Token is divided into partitions/ranges
• Cassandra – choice
   – OrderPreserving partitioner – key = token (for range queries)
   – Also saw a CollatingOrderPreservingPartitioner
• Partitions assigned to nodes that are logically arranged in a circle topology
• Amz (dynamo) – assign sets of (random) multiple points to different machines depending on load
• Cassandra – monitor load & distribute
• Specific join & leave protocols
• Replication – next 3 consecutive
• Cassandra – Rack-aware, Datacenter-aware
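The ring described above can be sketched in a few lines. This is a simplified illustration, not Dynamo's or Cassandra's implementation: the MD5-based `token` function, the `Ring` class, the number of points per node and the node names are all assumptions.

```python
# Sketch: a consistent-hash ring with several points per node and
# replication to the next 3 distinct nodes clockwise from the key.
import bisect
import hashlib


def token(key):
    """Map a string onto the ring (a large integer keyspace)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, points_per_node=8, replicas=3):
        self.replicas = replicas
        # Each node owns several pseudo-random points on the ring.
        self.points = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self.tokens = [t for t, _ in self.points]

    def nodes_for(self, key):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_right(self.tokens, token(key))
        chosen = []
        for offset in range(len(self.points)):
            node = self.points[(start + offset) % len(self.points)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.replicas:
                break
        return chosen


keys = [f"user{i}" for i in range(1000)]
before = Ring(["a", "b", "c", "d"])
after = Ring(["a", "b", "c", "d", "e"])
moved = sum(1 for k in keys
            if before.nodes_for(k)[0] != after.nodes_for(k)[0])
print(f"{moved} of {len(keys)} keys changed primary after node e joined")
```

When node e joins, a key either keeps its primary or moves to e; no key shuffles between the existing nodes, which is the "smooth evolution" goal.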

Consistent Hashing - Hinted-handoff
• What happens when a node is not available ?
   – May be under load
   – May be network partition
• Sloppy Quorum & Hinted-handoff
• R/W performed on the 1st n healthy nodes
• Replica sent to a host node with hint in metadata & then transferred when the actual node is up
• Burdens neighboring nodes
• Cassandra 0.6.2 default is disabled (I think)

Consistent Hashing - Replication
• What happens when a new node joins ?
   – It gets one or more partitions
   – Dynamo : Copy the whole partition
   – Cassandra : Replicate keyset
   – Cassandra : working on a bit torrent type protocol to copy from replicas

Anti-entropy
• Merge and reconciliation operations
   – Operate on two states and return a new state[86]
• Merkle Trees
   – Dynamo uses Merkle trees to detect inconsistencies between replicas
   – AntiEntropy in Cassandra exchanges Merkle trees and if they disagree, range repair via compaction[91,92]
   – Cassandra uses the Scuttlebutt Reconciliation[86]
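How a Merkle tree narrows down replica differences can be sketched like this. The sketch assumes a power-of-two number of key ranges and SHA-256 hashing; both are simplifications chosen for illustration and are not taken from Dynamo or Cassandra.

```python
# Sketch: two replicas compare Merkle trees top-down and only descend
# into subtrees whose hashes differ.
import hashlib


def digest(data):
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last.

    Assumes len(leaves) is a power of two.
    """
    level = [digest(leaf.encode("utf-8")) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b):
    """Indices of leaves that differ, found by walking down from the root."""
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(a) - 1, 0, -1):
        next_suspects = []
        for i in suspects:
            if a[depth][i] != b[depth][i]:
                next_suspects += [2 * i, 2 * i + 1]  # children one level down
        suspects = next_suspects
    return [i for i in suspects if a[0][i] != b[0][i]]


replica_a = ["k0=v0", "k1=v1", "k2=v2", "k3=v3"]
replica_b = ["k0=v0", "k1=v1", "k2=stale", "k3=v3"]

out_of_sync = differing_leaves(build_tree(replica_a), build_tree(replica_b))
print(out_of_sync)
```

Identical replicas are confirmed by comparing a single root hash, so only the ranges that actually disagree need to be repaired.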

Gossip
• Membership & Failure detection
• Based on emergence without rigidity – pulse coupled oscillators, biological systems like fireflies ![90]
• Also used for state propagation
   – Used in Dynamo/Cassandra

Gossip
• Cassandra exchanges heartbeat state, application state and so forth
• Every second, picks a random live node and a random unreachable node and exchanges key-value structures
• Some nodes play the part of seeds
• Seed/initial contact points in static conf file (storage.conf file)
• Could also come from a configuration service like zookeeper
• To guard against node flap, explicit membership join and leave – now you know why hinted handoff was added

Membership & Failure detection
• Consensus & Atomic Broadcast - impossible to solve in a distributed system[88,89]
   – Cannot differentiate between a slow system and a crashed system
• Completeness
   – Every system that crashed will be eventually detected
• Correctness
   – A correct process is never suspected
• In short, if you are dead somebody will notice it and if you are alive, nobody will mistake you for dead !

Ø Accrual Failure Detector
• Not a Boolean value but a probabilistic number that “accrues” over an exponential scale
• Captures the degree of confidence that a corresponding monitored process has crashed[94]
   – Suspicion Level
   – Ø = 1 -> prob(error) 10%
   – Ø = 2 -> prob(error) 1%
   – Ø = 3 -> prob(error) 0.1%
• If process is dead,
   – Ø is monotonically increasing & Ø → ∞ as t → ∞
• If process is alive and kicking, Ø = 0
• Account for lost messages, network latency and actual crash of system/process
• Well known heartbeat period Δi, then network latency Δtr can be tracked by inter-arrival time modeling
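The suspicion level can be sketched with a deliberately simple model. The sketch assumes heartbeat inter-arrival times follow an exponential distribution, which may differ from the distribution used in [94]; the class and method names are illustrative.

```python
# Sketch: Ø = -log10(P(heartbeat still arrives later than the time already
# waited)). With exponential inter-arrival times, P = exp(-elapsed / mean),
# so Ø = elapsed / (mean * ln 10).
import math


class AccrualFailureDetector:
    def __init__(self):
        self.intervals = []          # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Closed form of -log10(exp(-elapsed / mean)); avoids underflow.
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):       # heartbeats one second apart
    detector.heartbeat(t)

phi_at_arrival = detector.phi(3.0)
phi_one = detector.phi(3.0 + math.log(10))
phi_two = detector.phi(3.0 + 2 * math.log(10))
print(phi_at_arrival, round(phi_one, 3), round(phi_two, 3))
```

A caller picks a threshold on Ø instead of a fixed timeout: suspecting at Ø = 1 accepts roughly a 10% chance of a mistake, at Ø = 2 roughly 1%.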

Write/Read Mechanisms
• Read & Write to a random node (StorageProxy)
• Proxy coordinates the read and write strategy (R/W = any, quorum et al)
• Memtables/SSTables from Bigtable
• Bloom Filter/Index
• LSM Trees

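[Diagram: read and write paths between nodes. In memory: MemTable. On disk: Commit Logs and several SSTables, each with a Bloom Filter (BF) and an Index. The MemTable is flushed to SSTables. SSTable: immutable, compaction, maintain Index & Bloom Filter.]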

Hbase – WAL, Memstore, HDFS File system
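The memtable/SSTable write and read paths can be illustrated with a toy store. `TinyLSM` is a hypothetical in-memory model: the commit log is a Python list rather than a file on disk, and the flush threshold is arbitrary.

```python
# Sketch: writes go to a commit log and a memtable; a full memtable is
# flushed to an immutable, sorted SSTable; reads check newest data first.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory and mutable
        self.sstables = []     # flushed tables, oldest first, never modified

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is a sorted, immutable snapshot of the memtable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all SSTables; values from newer tables overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


store = TinyLSM()
store.write("a", 1)
store.write("b", 2)      # memtable full, flushed to the first SSTable
store.write("a", 3)
store.write("c", 4)      # flushed to the second SSTable
tables_before = len(store.sstables)
value_a = store.read("a")
store.compact()
print(tables_before, value_a, store.sstables)
```

Writes never modify data in place, which is what makes them fast; the cost is paid later by compaction and by reads that may have to consult several tables.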

How… does HBase work again?

http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/

Bloom Filter
• The BloomFilter answers the question
• “Might there be data for this key in this SSTable?” [Ref: Cassandra/HBase mailer]
   – “Maybe” or
   – “Definitely not”
   – When the BloomFilter says “maybe” we have to go to disk to check out the content of the SSTable
• Depends on implementation
   – Redone in Cassandra
   – HBase 0.20.x removed, will be back in 0.90 with a “jazzy” implementation
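The “maybe” versus “definitely not” behaviour follows directly from how a Bloom filter stores keys. Below is a minimal sketch; the bit-array size, the number of hash functions and the SHA-256 based hashing are arbitrary choices for illustration, not what Cassandra or HBase use.

```python
# Sketch: a Bloom filter never gives a false "definitely not", but can
# give a false "maybe".
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits)   # one byte per bit, for clarity

    def _positions(self, key):
        # Derive hash_count positions by salting the key with a seed.
        for seed in range(self.hash_count):
            raw = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(raw[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means "definitely not"; True means "maybe, go check disk".
        return all(self.bits[position] for position in self._positions(key))


sstable_keys = ["row1", "row2", "row3"]
bloom = BloomFilter()
for key in sstable_keys:
    bloom.add(key)

print([bloom.might_contain(key) for key in sstable_keys])
print(BloomFilter().might_contain("row1"))
```

Every key that was added reports “maybe”, so a “definitely not” answer lets the read path skip that SSTable without touching the disk.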

Was it a vision, or a waking dream? Fled is that music:—do I wake or sleep?

-Keats, Ode to a Nightingale

• http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge---8-ze.php
• http://www.crn.com/news/data-center/232200061/efficiency-or-bust-data-centers-drive-for-low-power-solutions-prompts-channel-growth.htm
• http://www.quantumforest.com/2011/11/do-we-need-to-deal-with-big-data-in-r/
• http://www.forbes.com/special-report/2011/migration.html
• http://www.mercurynews.com/bay-area-news/ci_19368103
• http://www.businessinsider.com/apple-new-data-center-north-carolina-created-50-jobs-2011-11
