Monday, March 1, 2010
Open Questions for Building an Enterprise Data Platform on the Cloud
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
March 1, 2010
Presentation Outline
▪ Who am I and what am I talking about?
  ▪ My Background
  ▪ Open Questions
  ▪ Data Platforms
  ▪ The Cloud
▪ Research Challenges
  ▪ Infrastructure
  ▪ Interface
  ▪ Migration
  ▪ Build something!
My Background: Thanks for Asking
▪ [email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
  ▪ Nearly 30 amazing engineers and data scientists
  ▪ Several open source projects and research papers
▪ Founder of Cloudera
  ▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”
Open Questions: Some Context
▪ I don’t have a PhD
  ▪ In fact, I don’t have a publication history
  ▪ But I read a lot?
▪ Have deployed (and sometimes built) several distributed systems
  ▪ Oracle RAC
  ▪ Hadoop + Hive
  ▪ Cassandra
  ▪ New things at Cloudera
▪ Sort of like the Cubs GM asking a Cubs fan for advice
Data Platforms: Circumscribing our Focus
▪ Primarily concerned with infrastructure for analytics
▪ To borrow a phrase from Ralph Kimball
  ▪ Operational systems “turn the wheels”
  ▪ Analytical systems “watch the wheels turn”
▪ Reference architecture
  ▪ ETL/Data Integration
  ▪ DW
  ▪ BI
  ▪ Complex Analytics
Data Platforms: Another Perspective
▪ Analytical infrastructure as a platform
▪ Infrastructure providers
  ▪ Hardware and systems software
▪ Platform providers
  ▪ Suite of software tools to collect, store, manage, and analyze data
▪ Content providers
▪ Application developers
▪ End users
The Cloud: Some Terminology
▪ Layers of providers (looks familiar)
  ▪ Infrastructure as a Service (IaaS)
  ▪ Platform as a Service (PaaS)
  ▪ Software as a Service (SaaS)
▪ Where is it deployed?
  ▪ Public cloud
  ▪ Private cloud
  ▪ Hybrid cloud
The Cloud: Current State
▪ Many infrastructure and software providers
  ▪ Rackspace, Terremark, SoftLayer, and friends in infrastructure
  ▪ Salesforce and Workday in traditional enterprise applications
  ▪ SnapLogic, Cast Iron Systems in ETL
  ▪ Kognitio in DW
  ▪ LucidEra, PivotLink, Quantivo, and friends in BI
▪ Less developed PaaS market for analytics
  ▪ RightScale + Talend + Vertica + Jaspersoft partnership
Research Challenges: Problem Statement
What are the research challenges we’ll encounter moving from today’s architectures for enterprise analytics to an integrated platform-as-a-service model built on public, private, or hybrid cloud infrastructure?
Research Challenges: Infrastructure
▪ Server and data center design
  ▪ Servers for WSCs project at Michigan
  ▪ FAWN at CMU: low-power CPU and SSD for storage
  ▪ Making use of multi-core and GPUs
  ▪ Power management projects all over
  ▪ Data center design projects
    ▪ Evolution of containers
    ▪ Yahoo!’s “chicken coop”
▪ OpenFlow, Vyatta, Arista, and Nicira in networking
Research Challenges: Infrastructure
▪ How to achieve isolation while maintaining performance?
  ▪ Failure isolation
  ▪ Performance isolation
  ▪ Security isolation
▪ Many interesting projects
  ▪ Process Groups/Containers: Solaris Zones, LXC, Job Objects
  ▪ Lowered VM startup time via cloning: SnowFlock
  ▪ Data locality for VM scheduling: Tashi
  ▪ Resource management for grids: Nexus
Research Challenges: Infrastructure
▪ Configuration Management
  ▪ Lots of work in industry: cfengine, bcfg2, Puppet, Chef
  ▪ Not a lot of research on the topic!
▪ Scheduling
  ▪ Benchmarks for concurrent queries and almost-full systems
  ▪ Hybrid cloud (“cloudbursting”) scheduling
  ▪ Scheduling in the presence of variable performance
  ▪ Continuous version of fault tolerance?
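To make the “cloudbursting” bullet concrete, here is a deliberately naive baseline: fill private capacity first and send whatever does not fit to the public cloud. It is only a sketch for discussion, not a proposal from the talk. The function name `place_jobs`, the notion of "slots", and the job names are hypothetical, and it ignores data locality, cost, and variable performance, which are exactly the open questions listed above.

```python
def place_jobs(jobs, private_slots):
    """Greedy first-fit placement for a hybrid cloud (illustrative only).

    jobs: list of (name, slots_needed) tuples in arrival order.
    private_slots: total capacity of the private cluster, in slots.
    Returns a dict mapping each job name to "private" or "public".
    """
    placement = {}
    free = private_slots
    for name, need in jobs:
        if need <= free:
            # The job fits in the remaining private capacity.
            placement[name] = "private"
            free -= need
        else:
            # Not enough private capacity left: burst to the public cloud.
            placement[name] = "public"
    return placement


if __name__ == "__main__":
    demo = [("etl", 4), ("report", 3), ("adhoc", 2), ("tiny", 1)]
    for job, where in place_jobs(demo, private_slots=8).items():
        print(job, "->", where)
```

Even this toy version shows why the problem is interesting: a later small job can still land on private capacity after a larger one has been burst, so arrival order alone changes cost.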
Research Challenges: Infrastructure
▪ Bulk data transfer
  ▪ Moving data over the WAN is scary
  ▪ Aspera, FastSoft, WAM!NET built companies out of this research
  ▪ UDT proposed as a protocol from Chicago
  ▪ Incremental progress indicators and restart would be nice
▪ Latency-sensitive requests
  ▪ Lower variability: better DNS?
  ▪ Lower latency: SPDY?
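The wish for incremental progress indicators and restart in bulk transfers can be sketched in a few lines. This is a minimal illustration over local files, not the method of UDT or of any vendor named above. The function name `resumable_copy`, its parameters, and the `max_chunks` knob (used here only to simulate an interrupted transfer) are assumptions made for the example. It also assumes the bytes already at the destination are a correct prefix of the source; a real tool would verify that, for instance with checksums.

```python
import os


def resumable_copy(src, dst, chunk_size=1 << 20, on_progress=None, max_chunks=None):
    """Copy src to dst in chunks, resuming from the bytes dst already holds.

    on_progress, if given, is called as on_progress(bytes_done, bytes_total)
    after every chunk. max_chunks is a hypothetical knob that stops early,
    standing in for a dropped connection.
    Returns the number of bytes present in dst when the call ends.
    """
    total = os.path.getsize(src)
    done = os.path.getsize(dst) if os.path.exists(dst) else 0
    if done > total:
        raise ValueError("destination is larger than source; refusing to resume")

    sent = 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(done)  # skip what a previous attempt already transferred
        while max_chunks is None or sent < max_chunks:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            done += len(chunk)
            sent += 1
            if on_progress:
                on_progress(done, total)
    return done
```

Calling it a second time with the same arguments picks up where the first call stopped, and the callback gives the incremental progress signal the slide asks for.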
Research Challenges: Interface
▪ Application Developers
  ▪ Incremental query progress visualization
  ▪ Run time simulation and prediction
  ▪ ILLUSTRATE command for sample tuple generation
  ▪ Compile-time rather than run-time checking
  ▪ Libraries of basic operations which present higher-order APIs
  ▪ Performance optimization suggestions
  ▪ Distributed debugging utilities
Research Challenges: Interface
▪ New data models: when to use them and how do they interact?
  ▪ Multi-dimensional hash maps with locality groups: BigTable, HBase
  ▪ Documents: CouchDB, MongoDB, Riak (MarkLogic?)
  ▪ Arrays: SciDB
  ▪ Graphs: SHS
  ▪ Trajectories: TrajStore
▪ Cross-language serialization and RPC frameworks
  ▪ ASN.1, XDR, CORBA, ICE, Thrift, Etch, PBs, DataSeries, Avro
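As a reminder of what the serialization frameworks listed above have in common, here is a hand-rolled, XDR-style encoding of a two-field record: big-endian 4-byte integers and a length-prefixed string padded to a multiple of four bytes. It is a sketch only, using Python's standard `struct` module rather than any of the named frameworks. The record layout and the names `encode_record` and `decode_record` are hypothetical, and UTF-8 is an assumption of this example.

```python
import struct


def encode_record(user_id, name):
    """Encode (int, str) as: int32, uint32 length, bytes, zero padding to 4."""
    raw = name.encode("utf-8")
    pad = (-len(raw)) % 4  # bytes needed to reach a 4-byte boundary
    return struct.pack(">iI", user_id, len(raw)) + raw + b"\x00" * pad


def decode_record(buf):
    """Inverse of encode_record; ignores the trailing padding bytes."""
    user_id, length = struct.unpack_from(">iI", buf, 0)
    name = buf[8:8 + length].decode("utf-8")
    return user_id, name


if __name__ == "__main__":
    wire = encode_record(7, "hadoop")
    print(wire.hex(), decode_record(wire))
```

Because the byte order, field widths, and padding are fixed by convention rather than by the host language, any language that agrees on the layout can read the bytes back. The frameworks above differ mainly in how that layout is described and evolved.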
Research Challenges: Interface
▪ Query languages
  ▪ Programmer time-to-learn and productivity analysis for:
    ▪ Various MapReduce implementations
    ▪ Sawzall, PigLatin, SCOPE, Hive, DryadLINQ, ScalaQL
    ▪ Existing stuff: PL/SQL, TSQL, SQL*Loader, XQuery, XPath, etc.?
    ▪ Languages for analytics: R, S, SAS, SPSS, Matlab
▪ Can these all target a single execution layer?
▪ Should we be embedding our queries in a host language?
  ▪ LINQ, ScalaQL, Ferry
Research Challenges: Interface
▪ Collaborative analytics
  ▪ User profiles, news feed, message inboxes, recommendations
▪ Improve the browser
  ▪ Interactive visualization libraries in JavaScript
  ▪ What does HTML5 mean for the data analyst?
▪ How can we leverage multi-touch interfaces?
▪ What do new mobile devices mean for data analysts?
  ▪ Netbooks, iPhone, Android phones, Kindle, Nook, etc.
Research Challenges: Migration
▪ How do we get there from here?
  ▪ Workload analysis to identify what can be moved to PaaS first
  ▪ Ethnographic studies of what’s hard for data analysts today
  ▪ Privacy and security considerations
  ▪ Integration with third-party data sources
  ▪ Retention policies
▪ Cloud interoperability!
▪ Tools to prototype locally and deploy to platform later
▪ New university courses to build these skills
Research Challenges: Build Something!
▪ “A man who carries a cat by the tail...”
▪ Participate in an open source community
▪ Build a website and make the data available (e.g. MovieLens)
▪ Experience the joys of
  ▪ installation
  ▪ configuration
  ▪ deployment
  ▪ monitoring
  ▪ performance tuning, debugging, upgrades, and more!
(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0