Yahoo & Hadoop: Using and Improving Apache Hadoop at Yahoo!
Eric Baldeschwieler, VP, Hadoop Software

hadoop @ Ibmbigdata

DESCRIPTION

Eric Baldeschwieler's talk about Hadoop at Yahoo, given at IBM Big Data Symposium

Page 1: hadoop @ Ibmbigdata

Eric Baldeschwieler, VP, Hadoop Software

YAHOO & HADOOP

USING AND IMPROVING APACHE HADOOP AT YAHOO!

Page 2: hadoop @ Ibmbigdata

AGENDA

•  Brief Overview
•  Hadoop @ Yahoo!
•  Hadoop Momentum
•  The Future of Hadoop

Page 3: hadoop @ Ibmbigdata

WHAT'S HAPPENING

Big Data is here!
-  unstructured data
-  petabyte scale
-  operationally critical

Flickr : sub_lime79

Page 4: hadoop @ Ibmbigdata

TURNING DATA INTO INSIGHTS

machine learning

time series

content clustering

factorization models

logistic regression

algorithms

user interest prediction

ad inventory modeling

Flickr : NASA Goddard Photo and Video

Page 5: hadoop @ Ibmbigdata

MAKING YAHOO RELEVANT

Flickr : ogimogi

Page 6: hadoop @ Ibmbigdata

HADOOP: POWERING YAHOO!

science + big data + insight = personal relevance = VALUE

Flickr : DDFic

Page 7: hadoop @ Ibmbigdata

WHAT IS HADOOP?

Hadoop stack (top to bottom):
•  Programming Languages: Pig, Hive
•  Computation: MapReduce
•  Storage: HDFS
•  Commodity: Computers, Network

Focus on
•  Simplicity
•  Redundancy
•  Scale
•  Availability

Transforms commodity equipment into a service that:
•  HDFS – Stores petabytes of data reliably
•  MapReduce – Allows huge distributed computations

Key Attributes
•  Redundant and reliable – Doesn't stop or lose data even as hardware fails
•  Easy to program – Our rocket scientists use it directly!
•  Very powerful – Allows the development of big data algorithms & tools
•  Batch processing centric
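To make the MapReduce layer concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the example. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions made up for this sketch, and a real cluster would run the map and reduce steps in parallel over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each map call sees only one line and each reduce call sees only one key, both steps can in principle be spread across many machines, which is the property the slide refers to as "huge distributed computations".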

Page 8: hadoop @ Ibmbigdata

WHAT HADOOP ISN’T

•  A replacement for relational and data warehouse systems
•  A transactional / online / serving system
•  A low latency or streaming solution

Page 9: hadoop @ Ibmbigdata

HADOOP IN THE ENTERPRISE

RDBMS   EDW   Data Marts

HADOOP CLUSTER(S)

Transactions, Structured Data

Business Applications

Web Logs, Server Logs, Social Media, etc.

Interactions, Semi-Structured or Un-Structured Data

Business Intelligence Applications

Page 10: hadoop @ Ibmbigdata

HADOOP @ YAHOO!

Page 11: hadoop @ Ibmbigdata

HADOOP @ YAHOO!
"Where Science meets Data"

HADOOP CLUSTERS
•  Tens of thousands of servers
•  10s of Petabytes

PRODUCTS
•  Data Analytics
•  Content Optimization
•  Content Enrichment
•  Yahoo! Mail Anti-Spam
•  Advertising Products
•  Ad Optimization
•  Ad Selection
•  Big Data Processing & ETL

APPLIED SCIENCE
•  User Interest Prediction
•  Ad inventory prediction
•  Machine learning - search ranking
•  Machine learning - ad targeting
•  Machine learning - spam filtering

Page 12: hadoop @ Ibmbigdata

FROM PROJECT TO CORE PLATFORM

[Chart, 2006 to 2010. Axes: Thousands of Servers, Petabytes. Phases: Research, Science Impact, Daily Production]

"Behind every click"

•  40K+ Servers
•  170 PB Storage
•  5M+ Monthly Jobs

Page 13: hadoop @ Ibmbigdata

HADOOP POWERS THE YAHOO! NETWORK

advertising optimization

ad selection

Yahoo! Homepage

machine learning search ranking

ad inventory prediction

Yahoo! Mail anti-spam

user interest prediction

audience, ad and search pipelines

advertising data systems

Content Optimization

data analytics

Page 14: hadoop @ Ibmbigdata

CASE STUDY YAHOO! HOMEPAGE

Personalized for each visitor

Result: twice the engagement

+160% clicks vs. one size fits all

+79% clicks vs. randomly selected

+43% clicks vs. editor selected

Recommended links, News Interests, Top Searches

Page 15: hadoop @ Ibmbigdata

CASE STUDY YAHOO! HOMEPAGE

•  Serving Maps
•  Users - Interests
•  Five Minute Production
•  Weekly Categorization models

SCIENCE HADOOP CLUSTER
»  Machine learning to build ever better categorization models
»  USER BEHAVIOR in; CATEGORIZATION MODELS out (weekly)

PRODUCTION HADOOP CLUSTER
»  Identify user interests using Categorization models
»  USER BEHAVIOR in; SERVING MAPS out (every 5 minutes)

SERVING SYSTEMS
»  Build customized home pages with latest data (thousands / second)
»  ENGAGED USERS
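The flow on this slide (a categorization model applied to user behavior to produce a user-to-interests serving map) can be sketched in a few lines. This is a toy illustration under loud assumptions: the keyword table `CATEGORY_MODEL`, the functions `categorize` and `build_serving_map`, and the event format are all hypothetical stand-ins, not Yahoo!'s models or code, which the slides do not describe.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a weekly categorization model: keyword -> category.
CATEGORY_MODEL = {"playoffs": "sports", "election": "politics", "stocks": "finance"}


def categorize(headline, model):
    """Return the categories whose keyword appears in a clicked headline."""
    words = headline.lower().split()
    return [category for keyword, category in model.items() if keyword in words]


def build_serving_map(click_events, model, top_n=2):
    """Aggregate (user, headline) click events into a user -> top interests map."""
    counts = defaultdict(Counter)
    for user, headline in click_events:
        counts[user].update(categorize(headline, model))
    return {
        user: [category for category, _ in counter.most_common(top_n)]
        for user, counter in counts.items()
    }
```

In the architecture on the slide, a job like `build_serving_map` would correspond to the production cluster's five-minute run, while whatever replaces `CATEGORY_MODEL` would be rebuilt weekly on the science cluster.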

Page 16: hadoop @ Ibmbigdata

CASE STUDY YAHOO! MAIL

Enabling quick response in the spam arms race

•  450M mail boxes
•  5B+ deliveries/day
•  Antispam models retrained every few hours on Hadoop

"40% less spam than Hotmail and 55% less spam than Gmail"

SCIENCE

PRODUCTION

Page 17: hadoop @ Ibmbigdata

YAHOO! & APACHE HADOOP

Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business
•  Yahoo! benefits from open source eco-system around Hadoop
•  Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid
•  We invest heavily in core Hadoop development
•  We focus on scalability, reliability, availability

We fix bugs before you see them
•  We run very large clusters
•  We have a large QA effort
•  We run a huge variety of workloads

We are good Apache Hadoop citizens
•  We contribute our work to Apache
•  We share the exact code we run

Page 18: hadoop @ Ibmbigdata

HADOOP MOMENTUM

Page 19: hadoop @ Ibmbigdata

HADOOP IS GOING MAINSTREAM

[Timeline figure: 2007, 2008, 2009, 2010]

Source: The Datagraph Blog

Page 20: hadoop @ Ibmbigdata

THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM

Apache Hadoop
•  [logo] and other Early Adopters: Scale and productize Hadoop
•  Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
•  Service Providers: Grow ecosystem - Training, support, enhancements
•  Mainstream / Enterprise adoption: Drive further development, enhancements

Enhance Hadoop Ecosystem

Virtuous Circle!
•  Investment -> Adoption
•  Adoption -> Investment

Page 21: hadoop @ Ibmbigdata

THE FUTURE OF HADOOP

Page 22: hadoop @ Ibmbigdata

WHAT'S NEXT: MAKING HADOOP ENTERPRISE-READY

Hadoop is far from "done"
•  Current implementation is showing its age
•  Need to address several deficiencies in scalability, flexibility, ease of use & performance

Yahoo! is working on Next Generation of Hadoop
•  MapReduce: Rewrite to improve performance; pluggable support for new programming models
•  HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS

Apache should remain the hub of Hadoop ecosystem
•  Yahoo! contributes all Hadoop changes back to Apache Hadoop
•  Everyone benefits from shared neutral foundation

Page 23: hadoop @ Ibmbigdata

Questions?