Thursday, May 13, 2010
Evolving a New Analytical Platform
What Works and What’s Missing
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
May 13, 2010
My Background
Thanks for Asking
▪ [email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers

▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”
Presentation Outline
▪ Architectures for large scale data analysis
▪ Reference architecture: ETL, DW, BI, Analytics
▪ New foundations: HDFS and MapReduce

▪ SQL Server 2008 R2
▪ The new platform emerges

▪ Building a new platform
▪ Motivations
▪ Implementation

▪ Questions and Discussion
Summary of the Presentation
(I have a short attention span, too)
▪ The abstractions provided by a relational database are no longer useful on their own for analytical data management.
▪ The abstraction layer needs to be redrawn to include the functionality provided by ETL, MDM, stream management, reporting, OLAP, and search tools, with a unified user interface for collaboration on investigation and results.
▪ I don’t think the cloud has much to do with the above, except to kill “scale up” once and for all.
Experiences at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site

▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
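The "Python scripts pull data" approach on this slide can be illustrated with a minimal sketch. It is not the actual Facebook code: sqlite3 stands in for both the MySQL shards and the central database, and the `users` table, its columns, and the helper names `make_shard` and `pull_into_warehouse` are hypothetical. The point is only that once rows from every partition sit in one place, a historical question becomes a single query.

```python
import sqlite3

SCHEMA = "CREATE TABLE IF NOT EXISTS users (user_id INTEGER PRIMARY KEY, signup_date TEXT)"

def make_shard(rows):
    """Build an in-memory stand-in for one horizontal partition (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    conn.commit()
    return conn

def pull_into_warehouse(shards, warehouse):
    """Copy every row from every shard into one central table; return rows copied."""
    warehouse.execute(SCHEMA)
    total = 0
    for shard in shards:
        rows = shard.execute("SELECT user_id, signup_date FROM users").fetchall()
        warehouse.executemany("INSERT OR REPLACE INTO users VALUES (?, ?)", rows)
        total += len(rows)
    warehouse.commit()
    return total

# Two partitions, each holding a disjoint slice of the users.
shards = [
    make_shard([(1, "2006-01-03"), (3, "2006-01-04")]),
    make_shard([(2, "2006-01-03"), (4, "2006-02-10")]),
]
warehouse = sqlite3.connect(":memory:")
copied = pull_into_warehouse(shards, warehouse)

# With all partitions in one place, historical analysis is one query.
signups_by_month = warehouse.execute(
    "SELECT substr(signup_date, 1, 7) AS month, COUNT(*) "
    "FROM users GROUP BY month ORDER BY month"
).fetchall()
print(copied, signups_by_month)
```

A full copy like this works at small volumes and stops scaling as data grows, which is the pressure the following slides describe.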
Facebook Data Infrastructure
2007
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth
[Diagram labels: Oracle Database Server; Data Collection Server; MySQL Tier; Scribe Tier]
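A hand-coded daily "ETL" script of the kind this slide mentions might look roughly like the sketch below. The tab-separated log format, the event names, and the function names are assumptions made for illustration; the slides do not show the real scripts. The crontab entry in the comment is likewise hypothetical.

```python
from collections import Counter

# Hypothetical crontab entry to run this once every 24 hours:
#   0 2 * * * python etl_daily.py

def parse_line(line):
    """Split one log line into (day, event); assumes 'ISO-timestamp<TAB>event'."""
    timestamp, event = line.rstrip("\n").split("\t")
    return timestamp[:10], event

def daily_counts(lines, day):
    """Count events by type for one day, given as an ISO date string."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        line_day, event = parse_line(line)
        if line_day == day:
            counts[event] += 1
    return dict(counts)

# Sample input standing in for one collected log file.
log = [
    "2007-06-01T00:00:05\tpage_view\n",
    "2007-06-01T00:00:09\timpression\n",
    "2007-06-01T13:42:00\tpage_view\n",
    "2007-06-02T00:00:01\tpage_view\n",
]
result = daily_counts(log, "2007-06-01")
print(result)
```

A single-process script like this reads every line serially, so it becomes a bottleneck well before log volume reaches the rates quoted on the slide.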
Facebook Data Infrastructure
2008
[Diagram labels: MySQL Tier; Scribe Tier; Hadoop Tier; Oracle RAC Servers]
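The Hadoop tier added in 2008 rests on the MapReduce programming model named in the outline. The sketch below is a single-process illustration of that model's map, shuffle, and reduce steps, not Hadoop's API; the record layout and function names are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Emit one (key, value) pair per record; the key is the event type."""
    _user_id, event = record
    yield event, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

records = [(1, "click"), (2, "view"), (1, "view"), (3, "view")]
totals = run_job(records)
print(totals)
```

Because each map call sees one record and each reduce call sees one key, a real framework can spread both phases across many machines, which is what makes the model suited to scale-out.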
SQL Server 2008 R2
Old Features
▪ ETL: SQL Server Integration Services
▪ DW: SQL Server
▪ Reporting: SQL Server Reporting Services
▪ Analytics: SQL Server Analysis Services
▪ Search: Full-Text Search
SQL Server 2008 R2
New Features
▪ Stream management: StreamInsight
▪ OLAP: PowerPivot
▪ Collaboration: SharePoint
▪ MDM: Master Data Services
▪ Scale-out: Parallel Data Warehouse
A New Foundation
Motivations and Implementation
▪ Orders of magnitude growth in data volumes and complexity
▪ Often from machine-generated logs
▪ Complex data is vast majority of data

▪ Built by consumer web teams and not enterprise software firms
▪ Open source
▪ Modular collection of tools, not an opaque abstraction
▪ Applications, not just analysis
▪ Solve user needs, don’t implement a spec
(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.