Multiprocessor system where processors share resources : • Operating System (OS)• Memory• I/O devices connected using a common bus
SCALE OUT (MPP)
• Multiple processing nodes• OS• RAM• Network
+(n)
Add nodes to the clusterUpgrade components or buy bigger server each time
Scaling
Up vs. Out
2002 2003 2004 2005 2006 2007 2008 2009
Doug Cutting & Mike Cafarellastarted working on Nutch
Doug Cutting adds DFS &MapReduce support to Nutch
NY Times converts 4TB of image archives over 100 EC2s
Fastest sort of a TB,62 secs over 1,460 nodesSorted a PB in 16.25 hours
over 3,658 nodes
Fastest sort of a TB, 3.5 minsover 910 nodes
Google publishes GFS & MapReduce papers
Yahoo! Hires Cutting,Hadoop spins out of Nutch
Facebooks launches Hive:SQL Support for Hadoop Hadoop Summit 2009,
750 attendees
Doug Cuttingjoins ClouderaFounded
Innovation
Timeline
Presenter
Presentation Notes
Led by Chad Cook, second presenters will likely be from local offices.
THE
FUNDAMENTALS OF HADOOP
• Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s
• Hadoop consists of:• MapReduce• Hadoop Distributed File System
(HDFS)
WHAT’S
NEW…
BASICS OF
MPP
1 bill/ sec = 400 Seconds400 bills
BASICS OF
MPP
Total = 200 Seconds
1 bill/ sec = 200 Seconds200 bills
1 bill/ sec = 200 Seconds200 bills
BASICS OF
MPP
Total = 100 Seconds
= 100 Seconds100 Bills 1 bill/ sec
= 100 Seconds100 Bills 1 bill/ sec
= 100 Seconds100 Bills 1 bill/ sec
= 100 Seconds100 Bills 1 bill/ sec
HDFS &
MAPREDUCEThe Main Node: runs the Job tracker and the name node controls the files.Each node runs two processes: Task Tracker and Data Node
Job Tracker Task Tracker Task Tracker
Name Node Data Node Data Node
Map Reduce
HDFS Cluster
1 N
BASICS OF
MAPREDUCEThe Main Node: runs the Job tracker and the name node controls the files.Each node runs two processes: Task Tracker and Data Node
Query
Result
Name Node/Job TrackerQuery
Data Nodes/Task Trackers
EXECUTION UNITS
MAPREDUCE
SOME DISTRIBUTIONS OF
APACHEHADOOP
Apache Foundation
Sandbox
Hortonworks
MAPREDUCE
PIG & HIVE
MAPREDUCE
• Java
• Write many lines of code
PIG
• Mostly used by Yahoo
• Most used for data processing
• Shares some constructs w/ SQL
• Is more Verbose• Needs a lot of training for users with
limited procedural programming background
• Offers control over the flow of data
HIVE
• Mostly used by Facebook for analytic purposes
• Used for analytics• Relatively easier for developers w/
SQL experience• Less control over optimization of data
flows compared to Pig
Not as efficient as MapReduceHigher productivity for data scientists and developers
THE
EXPLOSION OFHADOOP
17
THE
HISTORY OFSPARK
2006 2010 2013
2009 2011 20142004
MapReduce
Spark Paper
BSD Open Source
Top Level
Apache
SPARK
SHAREDLIBRARIES
SPARK
THE UNIFIED PLATFORMFOR BIG DATA
APIs for :• Scala• Java • Python• R
SparkSQL
SparkStreaming
MLlib(machinelearning)
GraphX(graph)
Spark Core
SPARK
BENEFITSPerformance
Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests).
Can be used for batch and real-time data processing.
Unified Engine
Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.
A single application can combine all types of processing.
Developer Productivity
Easy-to-use APIs for processing large datasets.
Includes 100+ operators for transforming.
Ecosystem
Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.
Runs on top of the Apache YARN resource manager.
ANALYTICS
CORTANA
SQL Server
Big Data Optimizations
APSSQL Server
Presenter
Presentation Notes
KEY POINT To convey that the major pillars of the Analytics Platform System with key points. To help organizations with a simple and smooth seamlessly transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS) – the only, no-compromise modern data warehouse solution that brings both Hadoop and RDBMS in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility to all their users through some of the most widely used BI tools in the industry. Enterprise-ready Big Data Microsoft APS combines Microsoft’s industry leading RDBMS platform, the Parallel Data warehouse Appliance (PDW), with Microsoft’s Hadoop Distribution, HDInsight, for non-relational data to offer an all-in Big Data Analytics appliance. Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS. Your Modern Data Warehouse in One Turnkey Appliance: APS integrates PDW and HDInsight to operate seamlessly together in a single appliance Integrated Querying across All Data Types Using T-SQL: PolyBase allows Hadoop data to be queried using rich featured T-SQL , while taking advantage of Hadoop processing, without additional Hadoop-based skills or training. Enterprise-Ready Hadoop: HDInsight is Microsoft’s Hadoop-based distribution with end-user authentication via Active Directory and managed by IT using System Center Big Data Insights to Any User: Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel Next-generation performance at scale APS was built to scale into multi-petabytes, handling both RDBMS and the data stored in Hadoop, to deliver the performance that meets today’s near real-time sand rapid insights requirements. Scale-Out to accommodate your Growing Data: APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out Remove DW bottlenecks with MPP SQL Server: Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server. Real-Time Performance with In-Memory: Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore Concurrency that Supports High Adoption: Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads. Optimal architecture More than just a converged system, PDW has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value: APS provides the industry’s lowest DW Price/TB Lower cost while maintaining performance using WS2012 Storage Spaces that replace SAN with economical Windows Storage Spaces Save up to 70% of PDW storage with up to 15x compression via updateable in-memory columnstore Value through single appliance solution Reduce hardware footprint by having PDW and HDInsight within a single appliance Remove the need for costly integration efforts Value through flexible hardware options Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
Base Unit Scale UnitExtension Base Unit
SQL Server
APS Growth Topology
Presenter
Presentation Notes
HP example – Dell is slightly different
Azure SQL DWSQL Server
Presenter
Presentation Notes
KEY POINT To convey that the major pillars of the Analytics Platform System with key points. To help organizations with a simple and smooth seamlessly transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS) – the only, no-compromise modern data warehouse solution that brings both Hadoop and RDBMS in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility to all their users through some of the most widely used BI tools in the industry. Enterprise-ready Big Data Microsoft APS combines Microsoft’s industry leading RDBMS platform, the Parallel Data warehouse Appliance (PDW), with Microsoft’s Hadoop Distribution, HDInsight, for non-relational data to offer an all-in Big Data Analytics appliance. Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS. Your Modern Data Warehouse in One Turnkey Appliance: APS integrates PDW and HDInsight to operate seamlessly together in a single appliance Integrated Querying across All Data Types Using T-SQL: PolyBase allows Hadoop data to be queried using rich featured T-SQL , while taking advantage of Hadoop processing, without additional Hadoop-based skills or training. Enterprise-Ready Hadoop: HDInsight is Microsoft’s Hadoop-based distribution with end-user authentication via Active Directory and managed by IT using System Center Big Data Insights to Any User: Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel Next-generation performance at scale APS was built to scale into multi-petabytes, handling both RDBMS and the data stored in Hadoop, to deliver the performance that meets today’s near real-time sand rapid insights requirements. Scale-Out to accommodate your Growing Data: APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out Remove DW bottlenecks with MPP SQL Server: Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server. Real-Time Performance with In-Memory: Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore Concurrency that Supports High Adoption: Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads. Optimal architecture More than just a converged system, PDW has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value: APS provides the industry’s lowest DW Price/TB Lower cost while maintaining performance using WS2012 Storage Spaces that replace SAN with economical Windows Storage Spaces Save up to 70% of PDW storage with up to 15x compression via updateable in-memory columnstore Value through single appliance solution Reduce hardware footprint by having PDW and HDInsight within a single appliance Remove the need for costly integration efforts Value through flexible hardware options Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
Deployment options and hybrid solutionsSQL Server
Presenter
Presentation Notes
KEY POINT To convey that the major pillars of the Analytics Platform System with key points. To help organizations with a simple and smooth seamlessly transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS) – the only, no-compromise modern data warehouse solution that brings both Hadoop and RDBMS in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility to all their users through some of the most widely used BI tools in the industry. Enterprise-ready Big Data Microsoft APS combines Microsoft’s industry leading RDBMS platform, the Parallel Data warehouse Appliance (PDW), with Microsoft’s Hadoop Distribution, HDInsight, for non-relational data to offer an all-in Big Data Analytics appliance. Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS. Your Modern Data Warehouse in One Turnkey Appliance: APS integrates PDW and HDInsight to operate seamlessly together in a single appliance Integrated Querying across All Data Types Using T-SQL: PolyBase allows Hadoop data to be queried using rich featured T-SQL , while taking advantage of Hadoop processing, without additional Hadoop-based skills or training. Enterprise-Ready Hadoop: HDInsight is Microsoft’s Hadoop-based distribution with end-user authentication via Active Directory and managed by IT using System Center Big Data Insights to Any User: Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel Next-generation performance at scale APS was built to scale into multi-petabytes, handling both RDBMS and the data stored in Hadoop, to deliver the performance that meets today’s near real-time sand rapid insights requirements. Scale-Out to accommodate your Growing Data: APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out Remove DW bottlenecks with MPP SQL Server: Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server. Real-Time Performance with In-Memory: Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore Concurrency that Supports High Adoption: Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads. Optimal architecture More than just a converged system, PDW has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value: APS provides the industry’s lowest DW Price/TB Lower cost while maintaining performance using WS2012 Storage Spaces that replace SAN with economical Windows Storage Spaces Save up to 70% of PDW storage with up to 15x compression via updateable in-memory columnstore Value through single appliance solution Reduce hardware footprint by having PDW and HDInsight within a single appliance Remove the need for costly integration efforts Value through flexible hardware options Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
Uses the power of MPP to enhance query execution performance
Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
SQL ServerParallel DataWarehouseMicrosoft Azure
HDInsight
PolyBase
Microsoft HDInsight
Hortonworks for Windows and Linux
Cloudera
Connecting Islands of Data with PolyBaseResult
setSelect
…
SQL Server
Presenter
Presentation Notes
KEY POINT PolyBase is available only within the Microsoft Analytics Platform System. TALK TRACK PolyBase simplifies this by allowing Hadoop data to be queried with standard Transact-SQL (T-SQL) query language without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level. Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source. It then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera. No difficult learning curve: Standard T-SQL can be used to query Hadoop data. Users are not required to learn MapReduce to execute the query. Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a Hybrid Cloud solution to the data warehouse