Big Data App Server Lance Riedel

Big Data App Server by Lance Riedel, CTO, The Hive, for The Hive India event


Big Data App Server

Lance Riedel

Big Data App Server

A new application framework for the 4 V's:
• Volume of raw data (petabytes)
• Velocity at which it is being generated/ingested
• Variety of data sources and schemas
• Advanced data sciences and analytics that can be applied to extract Value

 


Big Data App Server Use Cases

• Log/Machine Analytics
• Security/Fraud Detection
• Sensor Data Analytics
• Financial Analytics
• Retail Analytics
• Ad Targeting
• Recommendation (e.g. Netflix, Amazon)

Components

Big Data Platform

APP SERVER COMPONENTS

Storage and Compute

Big Data Platform

Storage and Compute

Motivation
Google needed to capture the web and process it efficiently
• Calculate importance of pages, words, domains against each other
• The more cost-effective they could make it, the more they could process, index, understand

 

Storage/Compute: Centralized

• Centralized doesn't scale!
• Move a lot of data – bottleneck

Storage/Compute: Sharding

• Sharding is splitting the problem into isolated chunks
• Sharding scales, but fails when you need to look across the data
• E.g. how to calculate term weights or top pages across shards?
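A minimal sketch of why isolated shards cannot answer a cross-corpus question on their own. The shard contents and page names are made up for illustration; the point is only that each shard's local "top page" differs from the answer obtained once the counts are combined.

```python
from collections import Counter

# Hypothetical inbound-link counts seen by three isolated shards.
shards = [
    Counter({"a.com": 5, "b.com": 4}),
    Counter({"c.com": 5, "b.com": 4}),
    Counter({"d.com": 5, "b.com": 4}),
]

def top_page(counts):
    """Return the page with the most inbound links in one set of counts."""
    return counts.most_common(1)[0][0]

# Each shard answers correctly for its own chunk of the data...
local_winners = [top_page(shard) for shard in shards]

# ...but the corpus-wide answer needs the counts combined across shards.
global_counts = sum(shards, Counter())
global_winner = top_page(global_counts)

print(local_winners)  # ['a.com', 'c.com', 'd.com']
print(global_winner)  # b.com
```

No shard ever reports b.com, yet it is the top page overall, so the per-shard answers cannot simply be concatenated.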

DFS, MapReduce

• Used a new programming model to distribute computation AND data (NOT sharding)
• Runs on commodity hardware
• Failure resilience using software control
• Easy to calculate across corpus
• Two parts of a complete solution:
  • Distributed File System – DFS
  • MapReduce


Distributed File System

MapReduce

• Process where the data resides (data and compute are local to each other)
• Map (read the data, emit a key and a value)
• Reduce (group all values per key, perform another operation)
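The map and reduce steps can be sketched in a single process as a word count. This is an illustration of the programming model only, not Hadoop's API; the function names are chosen here, and the grouping step that a real cluster performs between map and reduce is written out as `shuffle`.

```python
from collections import defaultdict

def map_phase(line):
    """Map: read one record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key (the framework does this on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data app server", "big data platform"])
print(counts["big"], counts["server"])  # 2 1
```

Because each `map_phase` call only needs its own line, the map work can run wherever that line is stored; only the grouped keys need to move.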

Hadoop

• Open source implementation of Google's DFS and MapReduce whitepapers
• Huge ecosystem
• Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, +1000's more!

Management

Big Data Platform

Data Ingestion

Motivation
• Data originating from a variety of sources
• Some data more valuable than others:
  • Time-to-live (TTL)
  • Guarantees on delivery

Data Ingestion: Apache Flume

• A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
• Configurable delivery guarantees: routing, replication, failover
• Extensible sources and sinks allow for pluggable data sources
• Scales out horizontally – 100k's messages/sec
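The difference between failover and replication delivery can be sketched in a few lines. This is a generic illustration and not Flume's configuration or API; the sink names, the `broken_sink` helper, and the use of `ConnectionError` to signal an unavailable sink are all assumptions made for the example.

```python
def deliver_failover(event, sinks):
    """Failover: try sinks in priority order, stop at the first that accepts."""
    for name, send in sinks:
        try:
            send(event)
            return [name]
        except ConnectionError:
            continue  # this sink is down, try the next one
    raise RuntimeError("no sink accepted the event")

def deliver_replicated(event, sinks):
    """Replication: send a copy to every sink, report which ones accepted."""
    accepted = []
    for name, send in sinks:
        try:
            send(event)
            accepted.append(name)
        except ConnectionError:
            pass  # a failed replica does not block the others
    return accepted

def broken_sink(event):
    """Hypothetical sink that is always unavailable."""
    raise ConnectionError("sink unavailable")

hdfs_store, search_store = [], []
sinks = [
    ("broken", broken_sink),
    ("hdfs", hdfs_store.append),
    ("search", search_store.append),
]

print(deliver_failover("event-1", sinks))    # ['hdfs']
print(deliver_replicated("event-2", sinks))  # ['hdfs', 'search']
```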

Workflow

Motivation
Transforming, storing, and joining data can take a lot of steps that need to be repeatable and traceable – the programming model for data

Workflow: Oozie

A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
• Jobs triggered by time (frequency) and data availability
• Integrated with the rest of the Hadoop stack
• Scalable, reliable and extensible system
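A rough sketch of what "understanding the dependency graph" involves, using Python's standard `graphlib` rather than Oozie itself. The step names are hypothetical, and time or data-availability triggers are left out; the sketch only shows ordering by dependencies, replaying from a partially completed run, and a per-step report.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "join": {"clean"},
    "index": {"clean"},
    "report": {"join", "index"},
}

def run(workflow, completed=frozenset()):
    """Walk steps in dependency order; skip steps that already completed."""
    log = []
    for step in TopologicalSorter(workflow).static_order():
        status = "skipped" if step in completed else "ran"
        log.append((step, status))
    return log

first_run = run(workflow)
# Replay after a failure: steps that already succeeded are not repeated.
replay = run(workflow, completed={"ingest", "clean"})

print(first_run)
print(replay)
```

The relative order of `join` and `index` is not fixed by the graph, which is exactly the freedom a scheduler can use to run them in parallel.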

Schema Management

Motivation
As data sources explode, the need to understand the data schemas becomes a principal concern

Schema: HCatalog

• A table and storage management layer for Hadoop
• Enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid

       

Schema: Avro

• A data serialization system
• When Avro data is stored in a file, its schema is stored with it
• Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved
• Most technologies in the Hadoop stack understand Avro – interoperability/data passing
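The field resolution described above can be sketched in plain Python. This is a simplified illustration and does not use the Avro library; the `REQUIRED` marker, the `resolve` function, and the field names are assumptions. It covers three cases: same-named fields are copied, fields missing from the data fall back to the reader's default, and extra written fields are ignored.

```python
REQUIRED = object()  # marker: the reader declares no default for this field

def resolve(record, reader_fields):
    """Project a written record onto a reader's view of the schema.

    reader_fields maps field name -> default value, or REQUIRED if none.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]  # same-named field: take the written value
        elif default is not REQUIRED:
            resolved[name] = default       # missing in the data: use the reader default
        else:
            raise KeyError(f"field {name!r} is missing and has no default")
    return resolved                        # extra written fields are dropped

written = {"user": "ana", "clicks": 3, "legacy_flag": True}
reader = {"user": REQUIRED, "clicks": 0, "country": "unknown"}

print(resolve(written, reader))
# {'user': 'ana', 'clicks': 3, 'country': 'unknown'}
```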

     

Data Access, Querying

Big Data Platform

Data Access

Motivation
Various data access patterns require data stores beyond just the DFS files. An example is a key-value store that needs random access to data.

Solution(s)
There are a number of solutions depending on the use case.
• Google's BigTable whitepaper
• SQL has been adapted to Hadoop

Data Access: HBase

• The Hadoop database - a distributed, scalable, big data store (sorted map) – from Google's BigTable, backed by Hadoop DFS
• Linear and modular scalability
• Automatic and configurable sharding of tables
• Automatic failover support
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
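The "sorted map" idea can be illustrated with a toy in-memory store. This is not the HBase client API; the `SortedMap` class and the row-key layout are assumptions for the sketch. Keeping row keys sorted is what makes both random access by key and range scans over a key prefix cheap.

```python
import bisect

class SortedMap:
    """Toy in-memory sorted key-value store with point gets and range scans."""

    def __init__(self):
        self._keys = []    # row keys kept in sorted order
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

table = SortedMap()
table.put("user2#2013-01-05", "login")
table.put("user1#2013-01-02", "purchase")
table.put("user1#2013-01-01", "login")

print(table.get("user2#2013-01-05"))       # login
print(list(table.scan("user1#", "user2#")))
# [('user1#2013-01-01', 'login'), ('user1#2013-01-02', 'purchase')]
```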

Data Access: SQL – Hive, Impala

• SQL querying of raw data on the distributed file system
• Impala – query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
• Hive – provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems
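To give a feel for the kind of query meant here (SELECT, JOIN, and an aggregate), the sketch below runs one against an in-memory SQLite database. SQLite is only a local stand-in, not Hive or Impala, and the table and column names are invented for the example; the query shape is what carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE page_views (user_id INTEGER, url TEXT);
    INSERT INTO users VALUES (1, 'IN'), (2, 'US');
    INSERT INTO page_views VALUES (1, '/home'), (1, '/cart'), (2, '/home');
""")

# Join raw page views to user records and aggregate views per country.
rows = conn.execute("""
    SELECT u.country, COUNT(*) AS views
    FROM page_views v
    JOIN users u ON v.user_id = u.user_id
    GROUP BY u.country
    ORDER BY u.country
""").fetchall()

print(rows)  # [('IN', 2), ('US', 1)]
```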

Analytics

Big Data Platform

Data Analytics

Motivation
• Discover the latent value of the data. The core motivation behind Big Data!
• Clustering, Machine Learning, Correlations, Modeling – the guts of Data Science – often extremely diverse use cases.

Solution(s)
A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case

Data Analytics: Example Frameworks

• Mahout
  • Machine learning, clustering
• Pattern - Machine Learning DSL for Hadoop from Cascading
• 0xData
  • Open source math and prediction engine for big data
• Sample Algorithms
  • Random Forest algorithm
  • K-Means Clustering
  • Hierarchical Clustering
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines
  • Artificial Neural Networks
  • Association Rule Learning

Serving

Big Data Platform

Serving

Motivation
• Powering applications for end users
• Search/browse and recommendation engines allow real-time access to data

Serving: Search – Solr Cloud

• Builds indexes on top of Hadoop
• Horizontally scalable, fault tolerant
• Incredible flexibility in indexing options
  • Tokenization
  • Field types
  • Data storage
• Search options just as flexible
  • AND, OR, NOT, wildcard
  • Facets (counts from a derived ontology)
  • Extensive algorithm and weighting pluggability
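The indexing and search options listed above can be illustrated with a tiny inverted index. This is a conceptual sketch and not Solr's API; the documents, the whitespace tokenizer, and the helper names are assumptions. It shows tokenization into an index, boolean AND/OR/NOT queries, and facet counts over the hits.

```python
from collections import Counter, defaultdict

# Hypothetical documents with a text field and a category field.
docs = {
    1: {"text": "hadoop map reduce", "category": "compute"},
    2: {"text": "hadoop distributed file system", "category": "storage"},
    3: {"text": "solr search index", "category": "serving"},
    4: {"text": "hadoop search index", "category": "serving"},
}

# Tokenize on whitespace and build term -> set of document ids.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for term in doc["text"].split():
        index[term].add(doc_id)

def search_and(*terms):
    """Documents containing every term."""
    return set.intersection(*(index.get(t, set()) for t in terms))

def search_or(*terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))

def search_not(result, term):
    """Remove documents containing the term from an existing result."""
    return result - index.get(term, set())

def facet(result, field):
    """Count hits per value of a field."""
    return Counter(docs[doc_id][field] for doc_id in result)

hits = search_not(search_or("hadoop", "solr"), "reduce")
print(sorted(hits))                            # [2, 3, 4]
print(sorted(search_and("hadoop", "search")))  # [4]
print(dict(facet(hits, "category")))
```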

Serving: Manas – Matching Engine

• The Hive's massively scalable matching engine
• Handles 100's of millions to billions of documents efficiently while matching against 100's to 1000's of features
• Nothing exists today in the Open Source community that has these capabilities

EXAMPLE APP USE-CASE


App Server Data Flow


SecurityX on App Server