Thursday, May 13, 2010

20100513brown



Page 2: 20100513brown

Evolving a New Analytical Platform
What Works and What’s Missing

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
May 13, 2010


Page 3: 20100513brown

My Background
Thanks for Asking

[email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers

▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”


Page 4: 20100513brown

Presentation Outline
▪ Architectures for large scale data analysis
▪ Reference architecture: ETL, DW, BI, Analytics
▪ New foundations: HDFS and MapReduce

▪ SQL Server 2008 R2
▪ The new platform emerges

▪ Building a new platform
▪ Motivations
▪ Implementation

▪ Questions and Discussion


Page 5: 20100513brown

Summary of the Presentation
(I have a short attention span, too)

▪ The abstractions provided by a relational database are no longer useful on their own for analytical data management.

▪ The abstraction layer needs to be redrawn to include the functionality provided by ETL, MDM, stream management, reporting, OLAP, and search tools, with a unified user interface for collaboration on investigation and results.

▪ I don’t think the cloud has much to do with the above, except to kill “scale up” once and for all.


Page 6: 20100513brown

Experiences at Facebook
Early 2006: The First Research Scientist

▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site

▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle

▪ ...and then we turned on impression logging


Page 7: 20100513brown

Facebook Data Infrastructure
2007
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth

Oracle Database Server

Data Collection Server

MySQL Tier
Scribe Tier
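The 2007 setup on this slide (cron-scheduled collection, once every 24 hours, with hand-coded Python "ETL" scripts pulling from a partitioned source tier) can be illustrated with a minimal sketch. This is not the original Facebook code: the table names (`events`, `events_fact`), the function names, and the crontab line are hypothetical, and in-memory SQLite databases stand in for both the MySQL shards and the Oracle warehouse so the example is self-contained.

```python
import sqlite3
from datetime import date

# Hypothetical crontab entry that would run this once every 24 hours:
#   0 2 * * * python daily_etl.py


def extract_day(shard, day):
    """Pull one day's rows from a single source shard."""
    cur = shard.execute(
        "SELECT user_id, event, day FROM events WHERE day = ?",
        (day.isoformat(),),
    )
    return cur.fetchall()


def load_day(warehouse, rows):
    """Append the extracted rows to the warehouse fact table."""
    warehouse.executemany(
        "INSERT INTO events_fact (user_id, event, day) VALUES (?, ?, ?)",
        rows,
    )
    warehouse.commit()


def run_daily_etl(shards, warehouse, day):
    """Visit every shard in turn and return the number of rows loaded."""
    total = 0
    for shard in shards:
        rows = extract_day(shard, day)
        load_day(warehouse, rows)
        total += len(rows)
    return total


def make_demo_shard(rows):
    """Stand-in for one horizontally partitioned source database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, day TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def make_demo_warehouse():
    """Stand-in for the central warehouse database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events_fact (user_id INTEGER, event TEXT, day TEXT)"
    )
    return conn
```

The sketch visits the shards one at a time and funnels everything into a single warehouse connection, which is the shape of pipeline the slide describes as coming under pressure once volumes approached 1 TB per day.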


Page 8: 20100513brown

Facebook Data Infrastructure
2008

MySQL Tier
Scribe Tier

Hadoop Tier

Oracle RAC Servers


Page 9: 20100513brown

SQL Server 2008 R2
Old Features

▪ ETL: SQL Server Integration Services
▪ DW: SQL Server
▪ Reporting: SQL Server Reporting Services
▪ Analytics: SQL Server Analysis Services
▪ Search: Full-Text Search


Page 10: 20100513brown

SQL Server 2008 R2
New Features

▪ Stream management: StreamInsight
▪ OLAP: PowerPivot
▪ Collaboration: SharePoint
▪ MDM: Master Data Services
▪ Scale-out: Parallel Data Warehouse


Page 11: 20100513brown

A New Foundation
Motivations and Implementation

▪ Orders of magnitude growth in data volumes and complexity
▪ Often from machine-generated logs
▪ Complex data is vast majority of data

▪ Built by consumer web teams and not enterprise software firms
▪ Open source
▪ Modular collection of tools, not an opaque abstraction
▪ Applications, not just analysis
▪ Solve user needs, don’t implement a spec


Page 12: 20100513brown

(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.
