hadoop @ Ibmbigdata


Eric Baldeschwieler's talk about Hadoop at Yahoo!, given at the IBM Big Data Symposium


Eric Baldeschwieler VP, Hadoop Software

YAHOO & HADOOP

USING AND IMPROVING APACHE HADOOP AT YAHOO!

AGENDA

•  Brief Overview
•  Hadoop @ Yahoo!
•  Hadoop Momentum
•  The Future of Hadoop

WHAT’S HAPPENING

-  Big Data is here!
-  unstructured data
-  petabyte scale
-  operationally critical

Flickr : sub_lime79

TURNING DATA INTO INSIGHTS

•  machine learning
•  time series
•  content clustering
•  factorization models
•  logistic regression
•  algorithms
•  user interest prediction
•  ad inventory modeling

Flickr : NASA Goddard Photo and Video

MAKING YAHOO RELEVANT

Flickr : ogimogi

HADOOP: POWERING YAHOO!

science + big data + insight = personal relevance = VALUE

Flickr : DDFic

WHAT IS HADOOP?

•  Programming Languages: Pig, Hive
•  Computation: MapReduce
•  Storage: HDFS

Commodity
•  Computers
•  Network

Focus on
•  Simplicity
•  Redundancy
•  Scale
•  Availability

Transforms commodity equipment into a service that:
•  HDFS – Stores petabytes of data reliably
•  MapReduce – Allows huge distributed computations

Key Attributes
•  Redundant and reliable – Doesn’t stop or lose data even as hardware fails
•  Easy to program – Our rocket scientists use it directly!
•  Very powerful – Allows the development of big data algorithms & tools
•  Batch processing centric
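To make the MapReduce bullet concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the example job. It is an illustration only and does not use the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster the framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    logs = ["hadoop stores data", "hadoop processes data"]
    print(word_count(logs))
```

Because each map call sees one line and each reduce call sees one key, both steps can in principle be spread over many commodity machines, which is the property the slide refers to.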

WHAT HADOOP ISN’T

•  A replacement for relational and data warehouse systems
•  A transactional / online / serving system
•  A low latency or streaming solution

HADOOP IN THE ENTERPRISE

Data sources
•  Business Applications: Transactions, Structured Data
•  Web Logs, Server Logs, Social Media, etc.: Interactions, Semi-Structured or Unstructured Data

Data systems
•  RDBMS, EDW, Data Marts
•  Hadoop Cluster(s)

Consumers
•  Business Intelligence Applications

HADOOP @ YAHOO!


HADOOP @ YAHOO!
“Where Science meets Data”

HADOOP CLUSTERS
•  Tens of thousands of servers
•  10s of Petabytes

PRODUCTS
•  Data Analytics
•  Content Optimization
•  Content Enrichment
•  Yahoo! Mail Anti-Spam
•  Advertising Products
•  Ad Optimization
•  Ad Selection
•  Big Data Processing & ETL

APPLIED SCIENCE
•  User Interest Prediction
•  Ad inventory prediction
•  Machine learning - search ranking
•  Machine learning - ad targeting
•  Machine learning - spam filtering

FROM PROJECT TO CORE PLATFORM

[Chart: growth of Hadoop at Yahoo!, 2006 to 2010. Axes: Thousands of Servers (0 to 90) and Petabytes (0 to 250). Phases: Research, Science Impact, Daily Production.]

“Behind every click”

•  40K+ Servers
•  170 PB Storage
•  5M+ Monthly Jobs

HADOOP POWERS THE YAHOO! NETWORK

advertising optimization

ad selection

Yahoo! Homepage

machine learning search ranking

ad inventory prediction

Yahoo! Mail anti-spam

user interest prediction

audience, ad and search pipelines

advertising data systems

Content Optimization

data analytics


CASE STUDY YAHOO! HOMEPAGE

Personalized for each visitor
Result: twice the engagement

Modules: Recommended links, News Interests, Top Searches

•  +160% clicks vs. one size fits all
•  +79% clicks vs. randomly selected
•  +43% clicks vs. editor selected

CASE STUDY YAHOO! HOMEPAGE

SCIENCE HADOOP CLUSTER
»  Machine learning to build ever better categorization models
»  Categorization models (weekly)

PRODUCTION HADOOP CLUSTER
»  Identify user interests using categorization models
»  Serving maps (every 5 minutes)

SERVING SYSTEMS
»  Build customized home pages with latest data (thousands / second)

USER BEHAVIOR from ENGAGED USERS feeds the clusters.

•  Serving Maps: Users - Interests
•  Five Minute Production
•  Weekly Categorization models
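The production step on this slide (identify user interests using categorization models, then publish serving maps) can be sketched roughly as follows. Everything here is an assumption made for illustration: the talk does not describe the model format, so the categorization model is reduced to a plain keyword-to-category lookup, and the names `CATEGORIZATION_MODEL`, `identify_interests` and `build_serving_map` are hypothetical. The real models are machine learned and far richer.

```python
from collections import Counter

# Hypothetical stand-in for a weekly categorization model:
# maps a content keyword to an interest category.
CATEGORIZATION_MODEL = {
    "nba": "sports",
    "nfl": "sports",
    "stocks": "finance",
    "movies": "entertainment",
}


def identify_interests(clicked_keywords, model, top_n=2):
    """Categorize one user's recent clicks and keep the most frequent categories."""
    counts = Counter(model[k] for k in clicked_keywords if k in model)
    return [category for category, _ in counts.most_common(top_n)]


def build_serving_map(user_behavior, model):
    """Build a user -> interests map, the kind of artifact a serving system could look up."""
    return {
        user: identify_interests(clicks, model)
        for user, clicks in user_behavior.items()
    }


if __name__ == "__main__":
    behavior = {
        "user_a": ["nba", "nfl", "stocks"],
        "user_b": ["movies", "unknown-topic"],
    }
    print(build_serving_map(behavior, CATEGORIZATION_MODEL))
```

In this sketch the model changes rarely while the serving map is cheap to rebuild from fresh behavior, which mirrors the weekly versus five-minute cadence on the slide.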

CASE STUDY YAHOO! MAIL
Enabling quick response in the spam arms race

•  450M mail boxes
•  5B+ deliveries/day
•  Antispam models retrained every few hours on Hadoop

“40% less spam than Hotmail and 55% less spam than Gmail”

[Diagram labels: SCIENCE, PRODUCTION]

YAHOO! & APACHE HADOOP

Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business
•  Yahoo! benefits from open source ecosystem around Hadoop
•  Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid
•  We invest heavily in core Hadoop development
•  We focus on scalability, reliability, availability

We fix bugs before you see them
•  We run very large clusters
•  We have a large QA effort
•  We run a huge variety of workloads

We are good Apache Hadoop citizens
•  We contribute our work to Apache
•  We share the exact code we run

HADOOP MOMENTUM

HADOOP IS GOING MAINSTREAM

[Timeline: 2007, 2008, 2009, 2010. Credit: The Datagraph Blog]

THE PLATFORM EFFECT
BIRTH OF AN ECOSYSTEM

Apache Hadoop

•  and other Early Adopters: Scale and productize Hadoop
•  Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
•  Service Providers: Grow ecosystem - Training, support, enhancements
•  Mainstream / Enterprise adoption: Drive further development, enhancements

Enhance Hadoop Ecosystem

Virtuous Circle!
•  Investment -> Adoption
•  Adoption -> Investment

THE FUTURE OF HADOOP

WHAT’S NEXT
MAKING HADOOP ENTERPRISE-READY

Hadoop is far from “done”
•  Current implementation is showing its age
•  Need to address several deficiencies in scalability, flexibility, ease of use & performance

Yahoo! is working on Next Generation of Hadoop
•  MapReduce: Rewrite to improve performance; pluggable support for new programming models
•  HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS

Apache should remain the hub of Hadoop ecosystem
•  Yahoo! contributes all Hadoop changes back to Apache Hadoop
•  Everyone benefits from shared neutral foundation

Questions?
