20100513brown

Thursday, May 13, 2010

Evolving a New Analytical Platform
What Works and What’s Missing

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
May 13, 2010


My Background
Thanks for Asking

▪ [email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers

▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”



Presentation Outline

▪ Architectures for large scale data analysis
▪ Reference architecture: ETL, DW, BI, Analytics
▪ New foundations: HDFS and MapReduce

▪ SQL Server 2008 R2
▪ The new platform emerges

▪ Building a new platform
▪ Motivations
▪ Implementation

▪ Questions and Discussion


Summary of the Presentation
(I have a short attention span, too)

▪ The abstractions provided by a relational database are no longer useful on their own for analytical data management.

▪ The abstraction layer needs to be redrawn to include the functionality provided by ETL, MDM, stream management, reporting, OLAP, and search tools, with a unified user interface for collaboration on investigation and results.

▪ I don’t think the cloud has much to do with the above, except to kill “scale up” once and for all.


Experiences at Facebook
Early 2006: The First Research Scientist

▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site

▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle

▪ ...and then we turned on impression logging


Facebook Data Infrastructure
2007

▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth

[Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
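
The 2007 pipeline above (a scheduled job, hand-coded Python "ETL" scripts, a single warehouse) can be illustrated with a minimal sketch. It is not Facebook's actual code: the table names, the column layout, and the use of in-memory SQLite in place of both the MySQL shards and the Oracle warehouse are all assumptions made so that the example runs on its own.

```python
import sqlite3


def make_shard(rows):
    """Hypothetical stand-in for one horizontally partitioned source shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, day TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


def make_warehouse():
    """Hypothetical stand-in for the central warehouse database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_actions (day TEXT, action TEXT, n INTEGER)")
    return conn


def run_daily_etl(shards, warehouse, day):
    """Extract one day's rows from every shard, aggregate them, load the result."""
    totals = {}
    for shard in shards:
        # Extract: each shard only sees its own slice of the users.
        cursor = shard.execute(
            "SELECT action, COUNT(*) FROM events WHERE day = ? GROUP BY action",
            (day,),
        )
        # Transform: merge the per-shard counts into one total per action.
        for action, n in cursor:
            totals[action] = totals.get(action, 0) + n
    # Load: write one summary row per action into the warehouse.
    warehouse.executemany(
        "INSERT INTO daily_actions VALUES (?, ?, ?)",
        [(day, action, n) for action, n in sorted(totals.items())],
    )
    warehouse.commit()
    return totals
```

A cron entry that invokes such a script once every 24 hours would complete the picture. Every query runs through a single collection process, which is the part that comes under strain as log volume grows.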


Facebook Data Infrastructure
2008

[Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
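
The Hadoop tier introduced in 2008 is built on the MapReduce programming model. As a rough illustration only, the sketch below counts log events by type with explicit map, shuffle, and reduce steps in plain Python. The tab-separated log format and every function name are hypothetical, and everything runs in one process, whereas Hadoop distributes the same steps across many machines reading from HDFS.

```python
from collections import defaultdict


def map_line(line):
    """Map: emit (event_type, 1) for one log line.

    Assumed (hypothetical) format: timestamp<TAB>user_id<TAB>event_type.
    Malformed lines are skipped.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:
        yield fields[2], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: collapse one key's values into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())
```

Because each map call looks at one line and each reduce call at one key, both steps can be spread over many workers without coordination beyond the shuffle.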


SQL Server 2008 R2
Old Features

▪ ETL: SQL Server Integration Services
▪ DW: SQL Server
▪ Reporting: SQL Server Reporting Services
▪ Analytics: SQL Server Analysis Services
▪ Search: Full-Text Search


SQL Server 2008 R2
New Features

▪ Stream management: StreamInsight
▪ OLAP: PowerPivot
▪ Collaboration: SharePoint
▪ MDM: Master Data Services
▪ Scale-out: Parallel Data Warehouse


A New Foundation
Motivations and Implementation

▪ Orders of magnitude growth in data volumes and complexity
▪ Often from machine-generated logs
▪ Complex data is vast majority of data

▪ Built by consumer web teams and not enterprise software firms
▪ Open source
▪ Modular collection of tools, not an opaque abstraction
▪ Applications, not just analysis
▪ Solve user needs, don’t implement a spec


(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.

