Monday, March 1, 2010

20100301icde

Page 1: 20100301icde

Page 2: 20100301icde

Open Questions for Building an Enterprise Data Platform on the Cloud

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
March 1, 2010

Page 3: 20100301icde

Presentation Outline
▪ Who am I and what am I talking about?
▪ My Background
▪ Open Questions
▪ Data Platforms
▪ The Cloud

▪ Research Challenges
▪ Infrastructure
▪ Interface
▪ Migration
▪ Build something!

Page 4: 20100301icde

My Background: Thanks for Asking

[email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers

▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”

Page 5: 20100301icde

Open Questions: Some Context

▪ I don’t have a PhD
▪ In fact, I don’t have a publication history
▪ But I read a lot?

▪ Have deployed (and sometimes built) several distributed systems
▪ Oracle RAC
▪ Hadoop + Hive
▪ Cassandra
▪ New things at Cloudera

▪ Sort of like the Cubs GM asking a Cubs fan for advice

Page 6: 20100301icde

Data Platforms: Circumscribing our Focus

▪ Primarily concerned with infrastructure for analytics
▪ To borrow a phrase from Ralph Kimball
▪ Operational systems “turn the wheels”
▪ Analytical systems “watch the wheels turn”

▪ Reference architecture
▪ ETL/Data Integration
▪ DW
▪ BI
▪ Complex Analytics

Page 7: 20100301icde

Data Platforms: Another Perspective

▪ Analytical infrastructure as a platform
▪ Infrastructure providers
▪ Hardware and systems software

▪ Platform providers
▪ Suite of software tools to collect, store, manage, and analyze data

▪ Content providers
▪ Application developers
▪ End users

Page 8: 20100301icde

The Cloud: Some Terminology

▪ Layers of providers (looks familiar)
▪ Infrastructure as a Service (IaaS)
▪ Platform as a Service (PaaS)
▪ Software as a Service (SaaS)

▪ Where is it deployed?
▪ Public cloud
▪ Private cloud
▪ Hybrid cloud

Page 9: 20100301icde

The Cloud: Current State

▪ Many infrastructure and software providers
▪ Rackspace, Terremark, SoftLayer, and friends in infrastructure
▪ Salesforce and Workday in traditional enterprise applications
▪ SnapLogic, Cast Iron Systems in ETL
▪ Kognitio in DW
▪ LucidEra, PivotLink, Quantivo, and friends in BI

▪ Less developed PaaS market for analytics
▪ RightScale + Talend + Vertica + Jaspersoft partnership

Page 10: 20100301icde

Research Challenges: Problem Statement

What are the research challenges we’ll encounter moving from today’s architectures for enterprise analytics to an integrated platform-as-a-service model built on public, private, or hybrid cloud infrastructure?

Page 11: 20100301icde

Research Challenges: Infrastructure

▪ Server and data center design
▪ Servers for WSCs project at Michigan
▪ FAWN at CMU: low-power CPU and SSD for storage
▪ Making use of multi-core and GPUs
▪ Power management projects all over
▪ Data center design projects
▪ Evolution of containers
▪ Yahoo!’s “chicken coop”

▪ OpenFlow, Vyatta, Arista, and Nicira in networking

Page 12: 20100301icde

Research Challenges: Infrastructure

▪ How to achieve isolation while maintaining performance?
▪ Failure isolation
▪ Performance isolation
▪ Security isolation

▪ Many interesting projects
▪ Process Groups/Containers: Solaris Zones, LXC, Job Objects
▪ Lowered VM startup time via cloning: SnowFlock
▪ Data locality for VM scheduling: Tashi
▪ Resource management for grids: Nexus

Page 13: 20100301icde

Research Challenges: Infrastructure

▪ Configuration Management
▪ Lots of work in industry: cfengine, bcfg2, Puppet, Chef
▪ Not a lot of research on the topic!

▪ Scheduling
▪ Benchmarks for concurrent queries and almost-full systems
▪ Hybrid cloud (“cloudbursting”) scheduling
▪ Scheduling in the presence of variable performance
▪ Continuous version of fault tolerance?

Page 14: 20100301icde

Research Challenges: Infrastructure

▪ Bulk data transfer
▪ Moving data over the WAN is scary
▪ Aspera, FastSoft, WAM!NET built companies out of this research
▪ UDT proposed as a protocol from Chicago
▪ Incremental progress indicators and restart would be nice

▪ Latency-sensitive requests
▪ Lower variability: better DNS?
▪ Lower latency: SPDY?
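
The "incremental progress indicators and restart" wish in the bulk data transfer bullets can be sketched in a few lines. This is only an illustration under loud assumptions: local files stand in for the two ends of a WAN link, the names `resumable_copy`, `chunk_size`, and `on_progress` are hypothetical, and the bytes already present at the destination are trusted to be a correct prefix of the source (a real tool would verify them with checksums).

```python
import os
from typing import Callable, Optional


def resumable_copy(
    src: str,
    dst: str,
    chunk_size: int = 1 << 20,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> int:
    """Copy src to dst in chunks, resuming from whatever dst already holds.

    Returns the number of bytes present in dst when the copy finishes.
    Assumption: existing bytes in dst are a valid prefix of src.
    """
    total = os.path.getsize(src)
    done = os.path.getsize(dst) if os.path.exists(dst) else 0
    if done > total:
        # Destination cannot be a prefix of the source, so start over.
        done = 0
    # "r+b" keeps the existing prefix; "wb" creates or truncates the file.
    mode = "r+b" if done else "wb"
    with open(src, "rb") as fin, open(dst, mode) as fout:
        fin.seek(done)
        fout.seek(done)
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            done += len(chunk)
            if on_progress is not None:
                # Report (bytes transferred so far, total bytes).
                on_progress(done, total)
    return done
```

If the transfer is interrupted, calling `resumable_copy` again with the same paths picks up at the current size of the destination instead of restarting from byte zero, and the callback can drive a progress bar.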

Page 15: 20100301icde

Research Challenges: Interface

▪ Application Developers
▪ Incremental query progress visualization
▪ Run time simulation and prediction
▪ ILLUSTRATE command for sample tuple generation
▪ Compile-time rather than run-time checking
▪ Libraries of basic operations which present higher-order APIs
▪ Performance optimization suggestions
▪ Distributed debugging utilities
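
To make the "sample tuple generation" bullet concrete, here is a rough sketch of the idea: run each step of a dataflow on a handful of rows and show the intermediate tuples, so the developer sees what every operator does before launching the full job. The function `illustrate`, the stage list, and the sample data are all hypothetical, and this is a deliberate simplification: it only takes the first few input rows, whereas a production implementation would also need to pick or synthesize rows so that no step ends up with empty output.

```python
from typing import Callable, List, Tuple

Row = Tuple
Stage = Tuple[str, Callable[[List[Row]], List[Row]]]


def illustrate(rows: List[Row], stages: List[Stage], sample_size: int = 3):
    """Run every stage on a small sample and record the intermediate rows."""
    current = rows[:sample_size]
    trace = [("input", current)]
    for name, stage in stages:
        current = stage(current)
        trace.append((name, current))
    return trace


# Hypothetical (user, site, hour) visit records.
visits = [
    ("amy", "cnn.com", 8),
    ("amy", "bbc.com", 10),
    ("fred", "cnn.com", 12),
    ("fred", "nyt.com", 3),
]

# A two-step pipeline: keep late visits, then project two columns.
stages = [
    ("filter hour >= 10", lambda rs: [r for r in rs if r[2] >= 10]),
    ("project (user, site)", lambda rs: [(r[0], r[1]) for r in rs]),
]

trace = illustrate(visits, stages)
for step_name, step_rows in trace:
    print(step_name, step_rows)
```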

Page 16: 20100301icde

Research Challenges: Interface

▪ New data models: when to use them and how do they interact?
▪ Multi-dimensional hash maps with locality groups: BigTable, HBase
▪ Documents: CouchDB, MongoDB, Riak (MarkLogic?)
▪ Arrays: SciDB
▪ Graphs: SHS
▪ Trajectories: TrajStore

▪ Cross-language serialization and RPC frameworks
▪ ASN.1, XDR, CORBA, ICE, Thrift, Etch, Protocol Buffers (PBs), DataSeries, Avro
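
For readers who have not met the first data model in the list above, a toy in-memory sketch may help. It assumes the commonly described shape of that model: values are addressed by (row, column, timestamp), columns are named `family:qualifier`, and column families are assigned to locality groups that are stored together. The class `SparseTable` and its methods are hypothetical names for illustration, not any system's real API, and nothing here is distributed or persistent.

```python
from collections import defaultdict


class SparseTable:
    """Toy map of (row, "family:qualifier", timestamp) -> value."""

    def __init__(self, locality_groups):
        # locality_groups: {"group name": ["column family", ...]}
        self._group_of = {
            family: group
            for group, families in locality_groups.items()
            for family in families
        }
        # One store per locality group: row -> column -> {timestamp: value}
        self._stores = {
            group: defaultdict(lambda: defaultdict(dict))
            for group in locality_groups
        }

    def _store_for(self, column):
        family = column.split(":", 1)[0]
        return self._stores[self._group_of[family]]

    def put(self, row, column, timestamp, value):
        self._store_for(column)[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        versions = self._store_for(column).get(row, {}).get(column, {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def scan_group(self, group):
        """Yield cells in row/column order, touching one locality group only."""
        store = self._stores[group]
        for row in sorted(store):
            for column in sorted(store[row]):
                for ts in sorted(store[row][column], reverse=True):
                    yield row, column, ts, store[row][column][ts]


table = SparseTable({"meta": ["anchor", "language"], "content": ["contents"]})
table.put("com.cnn.www", "contents:", 3, "<html>v3")
table.put("com.cnn.www", "contents:", 5, "<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
```

A scan over the `meta` group never reads the page contents, which is the point of grouping families: small, frequently scanned columns are kept apart from large ones.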

Page 17: 20100301icde

Research Challenges: Interface

▪ Query languages
▪ Programmer time-to-learn and productivity analysis for:
▪ Various MapReduce implementations
▪ Sawzall, Pig Latin, SCOPE, Hive, DryadLINQ, ScalaQL
▪ Existing stuff: PL/SQL, T-SQL, SQL*Loader, XQuery, XPath, etc.?
▪ Languages for analytics: R, S, SAS, SPSS, Matlab

▪ Can these all target a single execution layer?
▪ Should we be embedding our queries in a host language?
▪ LINQ, ScalaQL, Ferry
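
One way to picture the "embed queries in a host language" question is a small query object written in the host language itself. The sketch below is an assumption-laden toy, not the API of any of the systems named above: the `Query` class and its `where` and `select` methods are hypothetical. The relevant property is that a query is just a recorded list of steps, so the same description could in principle be handed to a different execution layer instead of being run locally as it is here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Query:
    """An immutable, lazily evaluated query over any iterable."""

    source: Iterable[Any]
    steps: Tuple[Tuple[str, Callable], ...] = ()

    def where(self, predicate: Callable[[Any], bool]) -> "Query":
        # Record the step; nothing is evaluated yet.
        return Query(self.source, self.steps + (("where", predicate),))

    def select(self, projection: Callable[[Any], Any]) -> "Query":
        return Query(self.source, self.steps + (("select", projection),))

    def __iter__(self) -> Iterator[Any]:
        # Local execution: interpret the recorded steps as filter/map.
        rows: Iterator[Any] = iter(self.source)
        for kind, fn in self.steps:
            rows = filter(fn, rows) if kind == "where" else map(fn, rows)
        return rows


# Hypothetical request log.
events = [
    {"user": "amy", "ms": 120},
    {"user": "fred", "ms": 950},
    {"user": "amy", "ms": 40},
]

slow_users = (
    Query(events)
    .where(lambda e: e["ms"] > 100)
    .select(lambda e: e["user"])
)
```

Iterating `slow_users` runs the query; until then it is only a description, which is what would let a compiler or planner inspect it.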

Page 18: 20100301icde

Research Challenges: Interface

▪ Collaborative analytics
▪ User profiles, news feed, message inboxes, recommendations

▪ Improve the browser
▪ Interactive visualization libraries in JavaScript
▪ What does HTML5 mean for the data analyst?

▪ How can we leverage multi-touch interfaces?
▪ What do new mobile devices mean for data analysts?
▪ Netbooks, iPhone, Android phones, Kindle, Nook, etc.

Page 19: 20100301icde

Research Challenges: Migration

▪ How do we get there from here?
▪ Workload analysis to identify what can be moved to PaaS first
▪ Ethnographic studies of what’s hard for data analysts today
▪ Privacy and security considerations
▪ Integration with third-party data sources
▪ Retention policies

▪ Cloud interoperability!
▪ Tools to prototype locally and deploy to platform later
▪ New university courses to build these skills

Page 20: 20100301icde

Research Challenges: Build Something!

▪ “A man who carries a cat by the tail...”
▪ Participate in an open source community
▪ Build a website and make the data available (e.g. MovieLens)
▪ Experience the joys of
▪ installation
▪ configuration
▪ deployment
▪ monitoring
▪ performance tuning, debugging, upgrades, and more!

Page 21: 20100301icde

(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
