Near-Realtime Processing over HBase, Ryan Brush (@ryanbrush)

Near Realtime Processing over HBase


Page 1: Near Realtime Processing over HBase

Near-Realtime Processing over HBase
Ryan Brush
@ryanbrush

Page 2: Near Realtime Processing over HBase

Topics
- The story so far
- Complementing MapReduce with stream-based processing
- Techniques and lessons
- Query and search
- The future

Page 3: Near Realtime Processing over HBase

The story so far...

Page 4: Near Realtime Processing over HBase

Chart Search

Page 5: Near Realtime Processing over HBase

Chart Search
- Information extraction
- Semantic markup of documents
- Related concepts in search results
- Processing latency: tens of minutes

Page 6: Near Realtime Processing over HBase

Medical Alerts

Page 7: Near Realtime Processing over HBase

Medical Alerts
- Detect health risks in incoming data
- Notify clinicians to address those risks
- Quickly include new knowledge
- Processing latency: single-digit minutes

Page 8: Near Realtime Processing over HBase

Exploring live data

Page 9: Near Realtime Processing over HBase

Exploring live data
- Novel ways of exploring records
- Pre-computed models matching users' access patterns
- Very fast load times
- Processing latency: seconds or faster

Page 10: Near Realtime Processing over HBase

And many others
- Population analytics
- Care coordination
- Personalized health plans

- Data sets growing at hundreds of GBs per day
- Approaching 1 petabyte total data
- Rate is increasing; expecting multi-petabyte data sets

Page 11: Near Realtime Processing over HBase

A trend towards competing needs
- Analyze all data holistically
- Quickly apply incremental updates

Page 12: Near Realtime Processing over HBase

A trend towards competing needs

MapReduce
- (Re-)process all data
- Move computation to data
- Output is a pure function of the input
- Assumes set of static input

Stream
- Incremental updates
- Move data to computation
- Needs to clean up outdated state
- Input may be incomplete or out of order

Both processing models are necessary, and the underlying logic must be the same.

Page 13: Near Realtime Processing over HBase

A trend towards competing needs

Speed Layer

Batch Layer

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Page 14: Near Realtime Processing over HBase

A trend towards competing needs

Speed Layer
- Low Latency (seconds to process)
- Move data to computation
- Hours of data
- Incremental updates

Batch Layer
- High Latency (minutes or hours to process)
- Move computation to data
- Years of data
- Bulk loads

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Page 15: Near Realtime Processing over HBase

A trend towards competing needs

Realtime Layer
- Stream-based
- Storm

Batch Layer
- MapReduce
- Hadoop

Page 16: Near Realtime Processing over HBase

Into the rabbit hole
- A ride through the system
- Techniques and lessons learned along the way

Page 17: Near Realtime Processing over HBase

Data ingestion

- Stream data into HTTPS service
- Content stored as Protocol Buffers
- Mirror the raw data as simply as possible

[Diagram: Source System 1, Source System 2, ... Source System N -> HTTPS -> Collector Service -> HBase]

HBase row keys:
/source:1/document:123
/source:2/allergy:345
/source:2/document:456
/source:2/order:234
…
/source:n/prescription:789

Page 18: Near Realtime Processing over HBase

Process incoming data
- Initially modeled after Google Percolator
- "Notification" records indicate changes
- Scan for notifications

(Diagram label: Scan for updates)

Data Table
source:1/document:123
source:2/allergy:345
source:2/document:456
. . .
source:150/order:71

Notification Table
source:1/document:123
source:150/order:71

Page 19: Near Realtime Processing over HBase

But there's a catch…
- Percolator-style notification records require external coordination
- More infrastructure to build, maintain
- …so let's use HBase's primitives

Page 20: Near Realtime Processing over HBase

Process incoming data

(Diagram label: Scan for updates)

- Consumers scan for items to process
- Atomically claim lease records (CheckAndPut)
- Clear the record and notifications when done
- ~3000 notifications per second per node

Row Key | Qualifiers (lease record and keys of updated items)
split:0 | 0000_LEASE, source:2/allergy:345, source:150/order:71, …
split:1 | 0000_LEASE, source:4/problem:78, source:205/document:52, …
. . .
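
The lease-claim step on this slide can be sketched with an in-memory stand-in rather than a real HBase client. The `FakeTable` class and its `check_and_put` method are hypothetical; they only mimic the compare-and-set behaviour that HBase's CheckAndPut provides. The row and qualifier names follow the table above.

```python
import threading


class FakeTable:
    """In-memory stand-in for an HBase table (hypothetical, illustration only)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def check_and_put(self, row, qualifier, expected, value):
        """Write value only if the cell currently equals expected (None = absent)."""
        with self._lock:
            cells = self._rows.setdefault(row, {})
            if cells.get(qualifier) != expected:
                return False
            cells[qualifier] = value
            return True

    def delete(self, row, qualifier):
        """Remove a cell if it exists."""
        with self._lock:
            self._rows.get(row, {}).pop(qualifier, None)


LEASE = "0000_LEASE"


def try_claim(table, split_row, consumer_id):
    """Atomically claim the lease for one split; only one consumer can win."""
    return table.check_and_put(split_row, LEASE, None, consumer_id)


def release(table, split_row):
    """Clear the lease once the split's notifications have been processed."""
    table.delete(split_row, LEASE)


table = FakeTable()
first = try_claim(table, "split:0", "consumer-a")   # wins the lease
second = try_claim(table, "split:0", "consumer-b")  # loses: lease already held
release(table, "split:0")
third = try_claim(table, "split:0", "consumer-b")   # wins after release
```

A production lease would presumably also carry an expiry so that a crashed consumer does not hold a split forever; that detail is not shown on the slide and is omitted here.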

Page 21: Near Realtime Processing over HBase

Advantages
- No additional infrastructure
- Leverages HBase guarantees
  - No lost data
  - No stranded data due to machine failure
- Robust to volume spikes of tens of millions of records

Page 22: Near Realtime Processing over HBase

Downsides
- Weak ordering guarantees
- Processing must be idempotent
- Lots of garbage from deleted cells
  - Schedule major compactions!
- Must split to avoid hot regions
- Potentially better options emerging
  - Apache Kafka with replication
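
The "must split to avoid hot regions" point can be illustrated with a small sketch. It assumes notifications are spread over a fixed number of `split:N` rows by hashing the item key. The hash function and the `NUM_SPLITS` value are assumptions for illustration, not details from the talk.

```python
import hashlib

NUM_SPLITS = 16  # assumed value; would be tuned to the number of regions/consumers


def split_row_for(item_key: str, num_splits: int = NUM_SPLITS) -> str:
    """Map an item key to a stable notification row such as 'split:3'."""
    digest = hashlib.md5(item_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_splits
    return f"split:{bucket}"


# Spread a batch of illustrative item keys over the split rows.
keys = [f"source:{s}/order:{o}" for s in range(1, 51) for o in range(20)]
rows = {split_row_for(k) for k in keys}
```

Because the mapping depends only on the key, repeated notifications for the same item land in the same split row, while different items are distributed across rows and therefore across regions.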

Page 23: Near Realtime Processing over HBase

Measure Everything

- Instrumented HBase client to see effective performance
- We use Coda Hale's Metrics API and Graphite Reporter
- Revealed impact of hot HBase regions on clients

Page 24: Near Realtime Processing over HBase

The story so far

[Diagram: Source System 1, Source System 2, ... Source System N -> HTTPS -> Collector Service -> HBase (Data, Notifications); Incremental Processors scan for updates and load data]

Page 25: Near Realtime Processing over HBase

Into the Storm
- Storm: scalable processing of data in motion
- Complements HBase and Hadoop
- Guaranteed message processing in a distributed environment
- Notifications scanned by a Storm Spout

Page 26: Near Realtime Processing over HBase

Processing with Storm

[Diagram: Source System 1, Source System 2, ... Source System N -> HTTPS -> Collector Service -> HBase (Raw Data) -> Spout -> Bolts -> HBase (Processed Data) -> Apps, Services]

Page 27: Near Realtime Processing over HBase

Challenges of incremental updates

- Incomplete data
- Outdated state
- Difficult to reason about changing state and timing conditions

Page 28: Near Realtime Processing over HBase

Handling Incomplete Data

Row Key | Summary Family | Staging Family
document:1 | | page:1

Incoming data

- Process (map) components into a staging family

Page 29: Near Realtime Processing over HBase

Handling Incomplete Data

Row Key | Summary Family | Staging Family
document:1 | | page:1, page:3

Incoming data

- Process (map) components into a staging family

Page 30: Near Realtime Processing over HBase

Handling Incomplete Data

Row Key | Summary Family | Staging Family
document:1 | | page:1, page:2, page:3

Incoming data

- Process (map) components into a staging family

Page 31: Near Realtime Processing over HBase

Handling Incomplete Data

Row Key | Summary Family | Staging Family
document:1 | document_summary | page:1, page:2, page:3

- Process (map) components into a staging family
- Merge (reduce) components when everything is available
- Many cases need no merge phase; consuming apps simply read all of the components

Incoming data
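
The staging-and-merge steps from these slides can be sketched in plain Python. A dictionary stands in for one HBase row with a summary family and a staging family. The function names and the assumption that the expected page count is known up front are illustrative, not taken from the talk.

```python
def stage_page(row, page_no, content):
    """Map step: store one incoming component in the staging family."""
    row["staging"][f"page:{page_no}"] = content


def merge_if_complete(row, expected_pages):
    """Reduce step: build the summary only once every page has arrived."""
    needed = [f"page:{n}" for n in range(1, expected_pages + 1)]
    if not all(q in row["staging"] for q in needed):
        return False
    row["summary"]["document_summary"] = " ".join(row["staging"][q] for q in needed)
    return True


row = {"summary": {}, "staging": {}}
results = []
# Pages arrive out of order, as in the slides: 1, then 3, then 2.
for page_no, text in [(1, "alpha"), (3, "gamma"), (2, "beta")]:
    stage_page(row, page_no, text)
    results.append(merge_if_complete(row, expected_pages=3))
```

The merge only fires on the third arrival, when the staging family finally holds every page; until then readers see the staged components but no summary.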

Page 32: Near Realtime Processing over HBase

Outdated State

Incoming Data
- Time 0: Alice lives in Chicago
- Time 1: Alice lives in New York

Processed Data
- Chicago resident index
- New York resident index

- Big Data
  - MapReduce: rebuild processed data
  - Outdated state is simply ignored
- Fast Updates
  - ACID database: simply update Alice's location
- Big and Fast: it gets complicated

Page 33: Near Realtime Processing over HBase

Outdated State: Reconcile on Read

[Diagram: Historical Data (MapReduce Output) and Incremental Updates -> Merge -> Application]

- Akin to Marz's Lambda Architecture
- Data stores optimized for specific workloads
- Keeps processing models independent
- Adds complexity at read time, but simpler overall
- Not available in commodity app stacks
- Probably best approach when and if higher-level abstractions emerge
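
A minimal sketch of reconcile-on-read, assuming two dictionaries stand in for the batch (MapReduce) output and the incremental store. The function name and the "incremental wins" rule are assumptions for illustration, and the "bob" record is made-up sample data.

```python
def reconcile_on_read(batch_view, incremental_view, key):
    """Prefer the incremental value when present, else fall back to batch output."""
    if key in incremental_view:
        return incremental_view[key]
    return batch_view.get(key)


batch_view = {"alice": "Chicago", "bob": "Boston"}   # historical MapReduce output
incremental_view = {"alice": "New York"}             # updates since the last batch run

alice_city = reconcile_on_read(batch_view, incremental_view, "alice")
bob_city = reconcile_on_read(batch_view, incremental_view, "bob")
```

The two stores never need to know about each other; the merge cost is paid by the reader on every lookup.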

Page 34: Near Realtime Processing over HBase

Outdated State: Reconcile on Write

Incoming Data
- Time 0: Alice lives in Chicago
- Time 1: Alice lives in New York

Processed Data
- Chicago resident index
- New York resident index

- Keep history of your incoming data
- When the event at Time 1 occurs, read that history and update both indexes
- Works with many existing data stores
- Adds complexity to processing logic
- Data store must handle MapReduce and realtime loads -- may not be optimal
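
Reconcile-on-write for the Alice example can be sketched as follows. In-memory dictionaries stand in for the history of incoming data and the per-city resident indexes. All names here are hypothetical.

```python
from collections import defaultdict

history = {}                        # person -> last known city
resident_index = defaultdict(set)   # city -> set of residents


def apply_location_event(person, city):
    """Read the person's history, then update both the old and the new index."""
    previous = history.get(person)
    if previous is not None and previous != city:
        resident_index[previous].discard(person)   # clean up outdated state
    resident_index[city].add(person)
    history[person] = city


apply_location_event("Alice", "Chicago")    # Time 0
apply_location_event("Alice", "New York")   # Time 1
```

The processing logic carries the extra work of looking up the prior state, which is the added complexity the slide refers to.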

Page 35: Near Realtime Processing over HBase

Different models, same logic
- Incremental updates like a rolling MapReduce
- Functions are the center of the universe (not InputFormats or Messages)
- Write logic as pure functions, coordinate with higher libraries
  - Storm
  - Apache Crunch
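
The idea of one pure function driven by two processing models can be sketched without Storm or Crunch. The two driver functions below are simplified stand-ins for a batch job and a stream consumer. The names and record layout are assumptions for illustration.

```python
def extract_cities(record):
    """Pure function: same output for the same input, no I/O or shared state."""
    return [(record["person"], record["city"].strip().title())]


def run_batch(records):
    """Batch-style driver: apply the function to the whole data set at once."""
    out = []
    for record in records:
        out.extend(extract_cities(record))
    return out


def run_stream(record_iter):
    """Stream-style driver: apply the same function one record at a time."""
    for record in record_iter:
        yield from extract_cities(record)


records = [
    {"person": "Alice", "city": " chicago "},
    {"person": "Alice", "city": "new york"},
]
batch_out = run_batch(records)
stream_out = list(run_stream(iter(records)))
```

Because the logic lives in `extract_cities` rather than in either driver, both paths produce the same results for the same input.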

Page 36: Near Realtime Processing over HBase

Getting complicated?
- Incremental logic is complex and error prone
- Use MapReduce as a failsafe

[Diagram: Source System 1, Source System 2, ... Source System N -> HTTPS -> Collector Service -> HBase (Raw Data) -> Spout -> Bolts -> HBase (Processed Data) -> Apps, Services; MapReduce also shown alongside the Storm path]

Page 37: Near Realtime Processing over HBase

Reprocess during uptime

- Deploy new incremental processing logic
- "Older" timestamps produced by MapReduce
- The most recently written cell in HBase need not be the logical newest

Row Key | Document Family
document:1 | {doc, ts=50}
document:2 | {doc, ts=100}

Real-time incremental update: {doc, ts=300}

MapReduce outputs: {doc, ts=200}, {doc, ts=200}
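
The timestamp point can be sketched with a dictionary standing in for the versions of one cell. The timestamps echo the slide (100, 200, 300), but which row each cell belongs to is an assumption, as are the function names.

```python
def put(cells, value, ts):
    """Store a version keyed by its explicit timestamp; write order is irrelevant."""
    cells[ts] = value


def latest(cells):
    """Return the version with the highest timestamp, not the last one written."""
    ts = max(cells)
    return cells[ts], ts


doc2 = {}
put(doc2, "original", 100)
put(doc2, "realtime update", 300)     # written by the incremental path
put(doc2, "mapreduce rebuild", 200)   # written later, but logically older

newest_value, newest_ts = latest(doc2)
```

The MapReduce output arrives last yet does not shadow the real-time update, because readers pick the cell with the highest timestamp.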

Page 38: Near Realtime Processing over HBase

Completing the Picture

[Diagram: Source System 1, Source System 2, ... Source System N -> HTTPS -> Collector Service -> HBase (Raw Data) -> Spout -> Bolts -> HBase (Processed Data) -> Apps, Services; MapReduce also shown alongside the Storm path]

Page 39: Near Realtime Processing over HBase

Completing the Picture

[Diagram: Source System 1, Source System 2, ... Source System N -> HTTPS -> Collector Service -> HBase (Raw Data) -> Spout -> Bolts -> HBase (Processed Data) -> Apps, Services; MapReduce also shown alongside the Storm path; Search Indexes added]

Page 40: Near Realtime Processing over HBase

Building indexes with MapReduce

- A shard per task
- Build index in Hadoop
- Copy to index hosts

[Diagram: three Map Tasks, each with Embedded Solr and an Index Shard]

Page 41: Near Realtime Processing over HBase

Pushing incremental updates
- POST new records
- Bursts can overwhelm target hosts
- Consumers must deal with transient failures

[Diagram: Processor -> data stream -> three Solr Shards, each with a Replica]

Page 42: Near Realtime Processing over HBase

Pulling indexes from HBase
- Custom Solr plugin scans a range of HBase rows
- Time-based scan to get only updates
- Pulls items to index from HBase
- Cleanly recovers from volume spikes and transient failures

[Diagram: HBase rows person:1, person:2, . . . person:n, person:n + 1, …, person:m; each Solr Shard scans its own range of rows]
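
The pull model can be sketched with an in-memory table of `(value, timestamp)` pairs. The `ShardIndexer` class is a hypothetical stand-in for the custom Solr plugin, which is not shown in the talk. Row ranges are half-open and compared as strings, as HBase does with row keys.

```python
def scan_updates(rows, start_row, stop_row, min_ts):
    """Yield (row_key, value) for rows in [start_row, stop_row) changed at or after min_ts."""
    for key in sorted(rows):
        if start_row <= key < stop_row:
            value, ts = rows[key]
            if ts >= min_ts:
                yield key, value


class ShardIndexer:
    """Hypothetical stand-in for a Solr shard pulling its own key range."""

    def __init__(self, start_row, stop_row):
        self.start_row, self.stop_row = start_row, stop_row
        self.last_ts = 0
        self.index = {}

    def pull(self, rows, now):
        """Index everything changed since the previous pull, then advance the mark."""
        for key, value in scan_updates(rows, self.start_row, self.stop_row, self.last_ts):
            self.index[key] = value
        self.last_ts = now


rows = {"person:1": ("v1", 10), "person:2": ("v1", 12), "person:7": ("v1", 11)}
shard = ShardIndexer("person:1", "person:5")
shard.pull(rows, now=20)
first_index = dict(shard.index)

rows["person:2"] = ("v2", 25)   # a later update inside this shard's range
changed = list(scan_updates(rows, shard.start_row, shard.stop_row, shard.last_ts))
```

If the shard falls behind or fails, nothing is lost: the next pull simply scans from the last recorded timestamp, which is what lets this approach absorb spikes and transient failures.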

Page 43: Near Realtime Processing over HBase

A note on schema: simplify it!

- Heterogeneous row keys efficient but hard to reason about
- Must inspect row key to know what it is
- Mismatches tools like Pig or Hive

Row Key | Qualifiers
person:1/name | <content>
person:1/address | <content>
person:1/friend:1 | <content>
person:1/friend:2 | <content>
person:2/name | <content>
person:n/name | <content>
person:n/friend:m | <content>

Page 44: Near Realtime Processing over HBase

Logical parent per row

- The row is the unit of locality
- Tabular layout is easy to understand
- No lost efficiency for most cases
- HBase Schema Design -- Ian Varley at 2012 HBaseCon

Row Key | Qualifiers
person:1 | name:<…>  address:<…>  friend:1:<…>  friend:2:<…>
person:2 | name:<…>  address:<…>  friend:1:<…>
. . .
person:n | name:<…>  address:<…>  friend:1:<…>
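
The move from heterogeneous row keys to one row per logical parent can be sketched as a regrouping of keys. The function name is hypothetical, and the input mirrors the `person:1/name` style of keys from the previous slide.

```python
def to_parent_rows(flat_rows):
    """Regroup 'person:1/friend:2'-style row keys into one row per logical parent."""
    parents = {}
    for row_key, content in flat_rows.items():
        parent, _, qualifier = row_key.partition("/")
        parents.setdefault(parent, {})[qualifier] = content
    return parents


flat = {
    "person:1/name": "<content>",
    "person:1/address": "<content>",
    "person:1/friend:1": "<content>",
    "person:2/name": "<content>",
}
tabular = to_parent_rows(flat)
```

After regrouping, each row key identifies exactly one kind of entity, and what used to be the key suffix becomes the column qualifier.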

Page 45: Near Realtime Processing over HBase

The path forward

Page 46: Near Realtime Processing over HBase

This pattern has been successful… but complexity is our biggest enemy

Page 47: Near Realtime Processing over HBase

We may be in the assembly language era of big data

Page 48: Near Realtime Processing over HBase

Higher-level abstractions for these patterns will emerge

It's going to be fun

Page 49: Near Realtime Processing over HBase

Questions?
@ryanbrush

https://engineering.cerner.com