48
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark

Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Shark  

Cliff  Engle,  Antonio  Lupher,  Reynold  Xin,  Matei  Zaharia,  Michael  Franklin,  Ion  Stoica,  

Scott  Shenker  

Hive  on  Spark  

Page 2: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Agenda  

•  Intro  to  Spark  •  Apache  Hive  •  Shark  •  Shark’s  Improvements  over  Hive  •  Demo  •  Alpha  status  •  Future  directions  

Page 3: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

What  Spark  Is  

•  Not  a  wrapper  around  Hadoop  •  Separate,  fast,  MapReduce-­‐like  engine  –  In-­‐memory  data  storage  for  very  fast  iterative  queries  

– Powerful  optimization  and  scheduling  –  Interactive  use  from  Scala  interpreter  

•  Compatible  with  Hadoop’s  storage  APIs  – Can  read/write  to  any  Hadoop-­‐supported  system,  including  HDFS,  HBase,  SequenceFiles,  etc  

Page 4: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Project  History  

•  Started  in  summer  of  2009  •  Open  sourced  in  early  2010  •  In  use  at  UC  Berkeley,  UCSF,  Princeton,  Klout,    Conviva,  Quantifind,  Yahoo!  Research  

Page 5: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Spark  Programming  Model  

Resilient  distributed  datasets  (RDDs)  Distributed  collections  of  Scala  objects  Can  be  cached  in  memory  across  cluster  nodes  •  Manipulated  like  local  Scala  collections  Automatically  rebuilt  on  failure  

Page 6: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Example:  Log  Mining  

Load  error  messages  from  a  log  into  memory,  then  interactively  search  for  various  patterns  

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

cachedMsgs = messages.cache()

Block  1  

Block  2  

Block  3  

Worker  

Worker  

Worker  

Driver  

cachedMsgs.filter(_.contains(“foo”)).count

cachedMsgs.filter(_.contains(“bar”)).count

. . .

tasks  

results  

Cache  1  

Cache  2  

Cache  3  

Base  RDD  Transformed  RDD  

Action  

Result:  full-­‐text  search  of  Wikipedia  in  <1  sec  (vs  20  sec  for  on-­‐disk  data)  

Result:  scaled  to  1  TB  data  in  5-­‐7  sec  (vs  170  sec  for  on-­‐disk  data)  

Page 7: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Apache  Hive  

•  Data  warehouse  solution  developed  at  Facebook  

•  SQL-­‐like  language  called  HiveQL  to  query  structured  data  stored  in  HDFS  

•  Queries  compile  to  Hadoop  MapReduce  jobs  

Page 8: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,
Page 9: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Hive  Principles  

• SQL  provides  a  familiar  interface  for  users  • Extensible  types,  functions,  and  storage  formats  • Horizontally  scalable  with  high  performance  on  large  datasets  

Page 10: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Hive  Applications  

•  Reporting  •  Ad  hoc  analysis  •  ETL  for  machine  learning  …  

Page 11: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Hive  Downsides  •  Not  interactive  – Hadoop  startup  latency  is  ~20  seconds,  even  for  small  jobs  

•  No  query  locality  –  If  queries  operate  on  the  same  subset  of  data,  they  still  run  from  scratch  

– Reading  data  from  disk  is  often  bottleneck  

•  Requires  separate  machine  learning  dataflow  

Page 12: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Shark  Motivation  •  Exploit  temporal  locality:  working  set  of  data  can  often  fit  in  memory  to  be  reused  between  queries  •  Provide  low  latency  on  small  queries  •  Integrate  distributed  UDF’s  into  SQL  

Page 13: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Introducing  Shark  •  Shark  =  Spark  +    

           Hive  •  Run  HiveQL  queries  through  Spark  with  Hive  UDF,  UDAF,  SerDe  • Utilize  Spark’s  in-­‐memory  RDD  caching  and  flexible  language  capabilities  

Page 14: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Shark  in  the  AMP  Stack  

Mesos  

Spark  

Private  Cluster   Amazon  EC2  

…  Had

oop  

MPI  

Bagel  (Pregel  on  Spark)  

Shark  …  

Debug  Tools  

Streaming  Spark  

Page 15: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Shark  

•  Physical  operators  using  Spark  •  Rely  on  Spark’s  fast  execution,  fault  tolerance,  and  in-­‐memory  RDD’s  

•  Reuse  as  much  Hive  code  as  possible  – Convert  logical  query  plan  generated  from  Hive  into  Spark  execution  graph  

•  Compatible  with  Hive  – Run  HiveQL  queries  on  existing  HDFS  data  using  Hive  metadata,  without  modifications  

Page 16: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,
Page 17: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

0.1  alpha:  84%  Hive  tests  passing  (575  out  of  683)  

http://github.com/amplab/shark  

Page 18: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

0.1  alpha  •  Experimental    SQL/RDD  integration  •  User  selected  caching  with  CTAS  •  Columnar  RDD  cache  •  Some  performance  improvements  

Page 19: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Caching  

•  User  selected  caching  with  CTAS  

•  CREATE  TABLE  mytable_cached  AS  SELECT  *  FROM  mytable  WHERE  count  >  10;  

•  mytable_cached  becomes  an  in-­‐memory  materialized  view  that  can  be  used  to  answer  queries.  

Page 20: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

SQL/Spark  Integration  

•  Allow  users  to  implement  sophisticated  algorithms  as  UDFs  in  Spark  

•  Query  processing  UDFs  are  streamlined    val rdd = sc.sql2rdd( "select foo, count(*) c from pokes group by foo")println(rdd.count)println(rdd.mapRows(_.getInt(“c”)).reduce(_+_))  

Page 21: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Performance  optimizations  

•  Hash-­‐based  shuffle  (speeds  up  group-­‐bys)  – Hadoop  does  sort-­‐based  shuffle  which  can  be  expensive.  

•  Limit  push-­‐down  in  order  by  – select  *  from  pokes  order  by  foo  limit  10  

•  Columnar  cache  

Page 22: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Large  Caches  in  JVM  •  Java  objects  have  large  overhead  (2-­‐4x  the  actual  data  size)  • Boxing/Unboxing  is  slow  • GC  is  a  bottleneck  if  we  have  many  long-­‐lived  objects  

Page 23: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Example:  Caching  Rows  Int,  Double,  Boolean  25   10.7   True  

24  +  24  +  16  =  64  bytes*          Should  be  ~  13  bytes  

*Estimated  on  64  bit  VM.  Implementations  may  vary.  

Page 24: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Possible  Solutions  1.  Store  deserialized  Java  objects  2.  Serialize  objects  to  in-­‐memory  

byte  array  3.  Use  memory  slab  external  to  

JVM  4.  Store  data  in  primitive  arrays  

Page 25: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Deserialized  Java  Objects  

+  Fast  (requires  little  CPU)          (Up  to  5X  speedup)  -­‐  Poor  memory  usage  (3X  worse)  -­‐  Lots  of  Garbage  Collection  overhead    

 

Page 26: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Serialize  to  Bytes  +  Best  memory  efficiency  +  Efficient  GC  -­‐  CPU  usage  to  serialize/deserialize  each  obj    

 

Page 27: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Store  in  External  Slab  +  Efficient  memory  +  No  GC  -­‐  CPU  usage  to  serialize/deserialize  each  obj  -­‐  More  difficult  to  implement  

 

Page 28: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Primitive  Column  Arrays  

+  Efficient  memory    (similar  to  byte  arrays)  

+  Efficient  GC  +  No  extra  CPU  overhead  

Page 29: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Column  vs  Row  Storage  

1  

Column   Row  2   3   1  

john   mike   sally  

4.1   3.5   6.4  

john   4.1  

2   mike   3.5  

3   sally   6.4  

Page 30: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Columnar  Benefits  • Better  compression  • Fewer  objects  to  track  • Only  materialize  necessary  columns  =>  Less  CPU  

Page 31: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Columnar  in  Shark  • Store  all  primitive  columns  in  primitive  arrays  on  each  node  • Less  deserialization  overhead  • Space  efficient  and  easier  to  garbage  collect  

Page 32: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Result  Achieved  efficient  storage  space  without  sacrificing  performance  

Page 33: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Limit  Pushdown  in  Sort  

SELECT  *  FROM  table  ORDER  BY  key  LIMIT  10;  

Page 34: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Main  Idea  • The  limit  can  be  applied  before  sorting  all  data  to  increase  performance.  • Each  mapper  computes  its  Top  K  and  then  a  single  reducer  merges  these  

Page 35: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Possible  Implementations  

1.      PriorityQueue  (Heap)  to  keep  running  top  k  in  one  traversal  on  each  mapper.  

   O(n  logk)    

Page 36: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Possible  Implementations  

2.      QuickSelect  algorithm.  Essentially  quicksort,  but  prunes  elements  on  one  side  of  pivot  when  possible.  

           O(n)    

Page 37: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Result  • We  chose  to  use  Google  Guava’s  implementation  of  QuickSelect.    • We  observed  much  better  performance  than  Hive  which  applies  the  limit  after  sorting  on  each  reducer.  

Page 38: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Demo  

•  3-­‐node  EC2  large  cluster  (2  virtual  cores,  7GB  of  RAM)  

•  Synthetic  TPC-­‐H  data  (2x  scale  factor)  

Page 39: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Some  numbers  while  we  wait  for  Hive  to  finish  running…  

•  Brown/Stonebraker  benchmark  70GB1  

– Also  used  on  Hive  mailing  list2  

•  10  Amazon  EC2  High  Memory  Nodes  (30GB  of  RAM/node)  

•  Naively  cache  input  tables  •  Compare  Shark  to  Hive  0.7  

1  http://database.cs.brown.edu/projects/mapreduce-­‐vs-­‐dbms/  2  https://issues.apache.org/jira/browse/HIVE-­‐396  1  

Page 40: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Benchmarks:  Query  1  

SELECT * FROM grep WHERE field LIKE ‘%XYZ%’;

•  30GB  input  table  

Page 41: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Benchmark:  Query  2  

•  5  GB  input  table  

SELECT pagerank, pageURL FROM rankings WHERE pagerank > 10;

Page 42: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Status  

•  Passing  575  Hive  tests  out  of  683.  

•  You  can  help  us  decide  what  to  implement  next.  

Page 43: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Unsupported  Hive  Features  

•  ADD  FILE  (use  Hadoop  distributed  cache  to  distribute  user-­‐defined  scripts)  

•  STREAMTABLE  hint  not  supported  •  Outer  join  with  non-­‐equi  join  filter  •  Table  statistics  (piggyback  file  sink)  •  Table  with  buckets  •  ORDER  BY  with  multiple  reducers  has  a  bug  (will  be  fixed  this  week!)  

•  MapJoin  has  a  serialization  bug  (will  be  fixed  this  week!)  

Page 44: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Unsupported  Hive  Features  

•  Automatically  convert  joins  to  mapjoins  •  Sort-­‐merge  mapjoins  •  Partitions  with  different  input  formats  •  Unique  joins  •  Writing  to  two  or  more  tables  in  a  single  SELECT  INSERT  

•  Archiving  data  •  virtual  columns  (INPUT__FILE__NAME,  BLOCK__OFFSET__INSIDE__FILE)  

•  Merge  small  files  from  output  

Page 45: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

What’s  coming  up  (in  a  month)?  

•  Performance  improvements  (groupbys,  joins)  

•  Demo  of  Shark  on  a  100-­‐node  cluster  in  SIGMOD  2012  (May  22,  Scottsdale,  AZ)  mining  Wikipedia  visit  log  data  

Page 46: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

What’s  coming  up  (a  year)?  

•  Shark-­‐specific  query  optimizer  •  Automatic  tuning  of  parameters  (e.g.  number  of  reducers)  

•  Off-­‐heap  caching  (e.g.  EhCache)  •  Automatic  caching  based  on  query  analysis  •  Shared  caching  infrastructure  

Page 47: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Conclusion  

•  Shark  is  a  warehouse  solution  compatible  with  Apache  Hive  

•  Can  be  orders  of  magnitude  faster  •  We  are  looking  for  people  to  try  and  we  are  happy  to  provide  support  

Page 48: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,

Thank  you!  

Questions?    

http://shark.cs.berkeley.edu  (will  be  updated  tonight)