HDInsight on Azure and Map-Reduce
Richard Conway, Windows Azure MVP, Elastacloud Limited
Agenda
Introduction
Big Data with HDInsight
Introduction
Solving problems through distribution
Some challenges become bound by hardware capacity; 24 hours on 1 machine can be 1 hour on 24 machines.
These 24 machines require orchestration; jobs are divided into tasks and the tasks are distributed across a cluster.
Software systems are required to facilitate the distribution; examples are Hadoop and HPC Server.
We will now provision a Hadoop cluster on Windows Azure.
Big Data vs Big Compute
Compute Bound: HPC Server, Open MPI
IO Bound: Hadoop
All distributed compute works on the basis of taking a large JOB and breaking it into many smaller TASKS, which are then parallelised.
Hadoop: Name Node, Name Node, Data Nodes
HPC: Head Node, Broker Node, Worker Nodes
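The JOB and TASK split described on this slide can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the job (summing squares over a range) and the names `splitIntoTasks`, `runTask` and `runJob` are hypothetical, and the tasks run one after another in a single process, whereas a cluster scheduler would dispatch each task to a different node.

```javascript
// Hypothetical job: sum the squares of every integer in [start, end).
// splitIntoTasks, runTask and runJob are illustrative names, not a Hadoop or HPC API.

// Divide the job's input range into at most `taskCount` contiguous tasks.
function splitIntoTasks(start, end, taskCount) {
  const size = Math.ceil((end - start) / taskCount);
  const tasks = [];
  for (let from = start; from < end; from += size) {
    tasks.push({ from: from, to: Math.min(from + size, end) });
  }
  return tasks;
}

// One task: an independent unit of work that any worker node could execute.
function runTask(task) {
  let sum = 0;
  for (let n = task.from; n < task.to; n++) {
    sum += n * n;
  }
  return sum;
}

// Stand-in for the scheduler: tasks run sequentially here; on a cluster
// each task would be sent to a separate node and the partial results combined.
function runJob(start, end, taskCount) {
  const partials = splitIntoTasks(start, end, taskCount).map(runTask);
  return partials.reduce((a, b) => a + b, 0);
}

console.log(runJob(0, 24000, 24));
```

Because each task touches only its own slice of the input, the result does not depend on how many tasks the job is cut into, which is what makes the work safe to parallelise.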
Understanding Big Data
KEY TRENDS
Cheap Storage: $100 gets you 3 million times more storage than it did 30 years ago
Inexpensive Computing: 1980: 10 MIPS/$; 2005: 10M MIPS/$
Device Explosion: >5.5 billion (70+% of global population)
Social Networks: >2 billion users
Ubiquitous Connection: web traffic of 130 exabytes (10^18 bytes) in 2010; 1.6 zettabytes (10^21 bytes) in 2015
Sensor Networks: >10 billion
What is Big Data?
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Velocity - Variety - Variability
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Storage/GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07
Big Data, BIG OPPORTUNITY
Big Data is a top priority for institutions
49% of CEOs and CIOs are planning big data projects
Software Growth (billions of $): 2012: 1.8; 2013: 2.5; 2014: 3.4; 2015: 4.6
34% compound annual growth rate (source 2)
Services Growth (billions of $): 2012: 2.7; 2013: 3.9; 2014: 5.1; 2015: 6.5
39% compound annual growth rate (source 2)
1. McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012
2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012
Devices: Internet and Internet of things
Internet of things
Invisible devices
Trillions of networked nodes
Low-bandwidth last-mile connection (100 kbit/sec)
Mostly addressed by local schemes
Machine-centric, sensing-focus
Trillions of computer-enabled devices which are part of the IoT
Internet
Laptops / tablets / smartphones
Billions of networked devices
High-bandwidth access (Cable: 10 Mbps+, Fiber: 50-100 Mbps)
Global addressing
User-centric, communication-focus
6+ billion people; 1.5 billion use net
US: 4.3 devices per adult
Big Data Scenarios
Short History of Hadoop
Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale
Hadoop started as a part of the Nutch project
In Jan 2006 Doug Cutting started working on Hadoop at Yahoo
Factored out of Nutch in Feb 2006
First release of Apache Hadoop in September 2007
Jan 2008: Hadoop became a top-level Apache project
Hadoop Distributed Architecture
MapReduce: Move Code to the Data
FIRST, STORE THE DATA
[Diagram: files stored across several servers]
SECOND, TAKE THE PROCESSING TO THE DATA
So How Does It Work?
// Map Reduce function in JavaScript

// map: split each input line into words and emit (word, 1) for every word
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: add up all the counts emitted for one word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
[Diagram: the RUNTIME distributes the Code to each Server]
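To see what the runtime does with the word-count functions on this slide, they can be driven by a small in-memory stand-in. This is a sketch, not the HDInsight runtime: `runLocally` is a hypothetical helper, and it assumes only what the slide's code itself relies on, namely a `context` with a `write` method and a `values` object with `hasNext` and `next`.

```javascript
// The map and reduce functions from the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical in-memory stand-in for the runtime (illustration only).
function runLocally(lines) {
  // Map phase: collect every (key, value) pair the mapper writes, grouped by key.
  var groups = new Map();
  var mapContext = {
    write: function (k, v) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    }
  };
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  // Shuffle and sort, then reduce: one reduce call per key, in key order.
  var output = [];
  var reduceContext = { write: function (k, v) { output.push([k, v]); } };
  Array.from(groups.keys()).sort().forEach(function (k) {
    var vals = groups.get(k);
    var pos = 0;
    var iterator = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; }
    };
    reduce(k, iterator, reduceContext);
  });
  return output;
}

console.log(runLocally(["the quick brown fox", "The lazy dog and the fox"]));
```

On a real cluster the map calls run on the nodes that hold the input blocks, and the grouping step moves data between nodes; here everything happens in one process.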
Traditional RDBMS vs. NoSQL
                TRADITIONAL RDBMS           HADOOP
Data Size       Gigabytes (Terabytes)       Petabytes (Exabytes)
Access          Interactive and Batch       Batch
Updates         Read / Write many times     Write once, Read many times
Structure       Static Schema               Dynamic Schema
Integrity       High (ACID)                 Low
Scaling         Nonlinear                   Linear
DBA Ratio       1:40                        1:3000
Reference: Tom White’s Hadoop: The Definitive Guide
Windows Azure HDInsight Service
Creating an HDInsight Cluster: Demo
MICROSOFT CONFIDENTIAL – INTERNAL ONLY
HDINSIGHT / HADOOP Eco-System
Distributed Storage (HDFS)
Distributed Processing (MapReduce)
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Event Pipeline (Flume)
Pipeline / workflow (Oozie)
Active Directory (Security)
Monitoring & Deployment (System Center)
C#, F#, .NET
JavaScript
Azure Storage Vault (ASV)
PDW PolyBase
Business Intelligence (Excel, Power View, SSAS)
World's Data (Azure Data Marketplace)
Event Driven Processing
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
Storing Data with HDInsight
HDFS on Azure: Tale of two File Systems
DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node, Data Node, …, HDFS API
Azure Storage (ASV): Front ends, Partition Layer, Stream Layer
Azure Blob Storage
Azure Storage (ASV)
• Default file system for HDInsight Service
• Provides sharable, persistent, highly-scalable storage with high availability (Azure Blob Store)
• Azure storage itself does not provide compute
• Fast access from compute nodes to data in same data center
• Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires storage key in core-site.xml:
<property>
  <name>fs.azure.account.key.accountname</name>
  <value>enterthekeyvaluehere</value>
</property>
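The addressing scheme and the configuration entry on this slide can be assembled programmatically. The sketch below is illustrative only: `asvUri` and `storageKeyProperty` are hypothetical helper names, the container, account and path values are made up, and it assumes the URI takes the `scheme://container@account.blob.core.windows.net/path` form with `asvs` as the secure variant.

```javascript
// Hypothetical helpers; these names are illustrative and not part of any HDInsight SDK.

// Build an ASV URI from its parts; `secure` selects the asvs scheme.
// Assumes the scheme://container@account.blob.core.windows.net/path layout.
function asvUri(container, account, path, secure) {
  const scheme = secure ? "asvs" : "asv";
  const cleanPath = String(path).replace(/^\/+/, ""); // avoid a doubled slash
  return scheme + "://" + container + "@" + account + ".blob.core.windows.net/" + cleanPath;
}

// Render the core-site.xml property that carries the storage key,
// following the property name pattern shown on the slide.
function storageKeyProperty(account, key) {
  return [
    "<property>",
    "  <name>fs.azure.account.key." + account + "</name>",
    "  <value>" + key + "</value>",
    "</property>"
  ].join("\n");
}

// Made-up example values.
console.log(asvUri("mycontainer", "myaccount", "/AllSessions/part-0.gz", false));
console.log(storageKeyProperty("myaccount", "enterthekeyvaluehere"));
```

Keeping the key in configuration rather than in the URI means job code only ever refers to paths, and the same job can be pointed at a different storage account by changing configuration alone.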
Map Reduce
Examples in C#
Map/Reduce
Map/Reduce is a programming model for efficient distributed computing
Input > Map > Shuffle & Sort > Reduce > Output
Efficiency from streaming through data, reducing seeks
A good fit for a lot of applications:
Log processing
Web index building
Data mining and machine learning
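The Input > Map > Shuffle & Sort > Reduce > Output pipeline can be followed stage by stage on a tiny log-processing example, one of the applications listed above. This is a sketch in plain JavaScript rather than Hadoop code: the log format and the function names `mapStage`, `shuffleAndSort` and `reduceStage` are assumptions made for illustration.

```javascript
// Hypothetical log format: "<method> <path> <status>" per line.
const input = [
  "GET /index.html 200",
  "GET /missing 404",
  "POST /login 200",
  "GET /index.html 200"
];

// Map: one (status, 1) pair per log line.
function mapStage(lines) {
  return lines.map(line => [line.trim().split(/\s+/)[2], 1]);
}

// Shuffle & Sort: group the values by key, then order the groups by key.
function shuffleAndSort(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return Array.from(groups.entries())
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Reduce: collapse each group to a single (status, count) pair.
function reduceStage(groups) {
  return groups.map(([key, values]) => [key, values.reduce((a, b) => a + b, 0)]);
}

const mapped = mapStage(input);
const shuffled = shuffleAndSort(mapped);
const output = reduceStage(shuffled);

console.log(mapped);   // intermediate pairs, in input order
console.log(shuffled); // one entry per status code with all of its values
console.log(output);   // final counts per status code
```

Each stage reads its input once from start to finish, which is the "streaming through data" property the slide credits for the model's efficiency.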
Hadoop SDK
C# integration
Remote Data & Jobs
Hive in C#
Serialization
http://hadoopsdk.codeplex.com
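using Microsoft.Hadoop.MapReduce; // assumed namespace of the Hadoop SDK types used below

// Job definition: wires the mapper and reducer together and sets input and output locations.
public class FrenchSessionsJob : HadoopJob<FrenchSessionsMapper, SessionsReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration()
        {
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/FrenchSessions/"
        };
        return config;
    }
}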
Jobs
// Mapper: emits ("FR", "1") for every input line that belongs to a French session.
public class FrenchSessionsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        if (inputLine.Contains("Country=France"))
        {
            context.IncrementCounter("FrenchSession");
            context.EmitKeyValue("FR", "1");
        }
    }
}
Mapper
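using System.Collections.Generic;
using System.Linq;

// Reducer: every value is "1", so the number of values is the number of sessions for the key.
public class SessionsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}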
Reducer
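The data flow behind the FrenchSessions mapper and reducer can be traced with plain text lines. The sketch below is written in JavaScript, the other language used in this deck, and rests on two assumptions: that session records are text lines containing `Country=France`, and that key/value pairs travel as tab-separated lines, which is the default format of Hadoop Streaming. How the Hadoop SDK moves data internally is not shown on the slides, and `mapLines`, `reduceLines` and the sample records are hypothetical.

```javascript
// Illustrative only: the same filter-and-count logic over tab-separated text lines.

// Mapper: emit "FR\t1" for every input line that mentions Country=France.
function mapLines(inputLines) {
  const out = [];
  for (const line of inputLines) {
    if (line.includes("Country=France")) {
      out.push("FR\t1");
    }
  }
  return out;
}

// Reducer: input lines arrive sorted by key, so equal keys are adjacent.
// Count the lines in each run of identical keys.
function reduceLines(sortedLines) {
  const out = [];
  let currentKey = null;
  let count = 0;
  for (const line of sortedLines) {
    const key = line.split("\t")[0];
    if (key !== currentKey) {
      if (currentKey !== null) out.push(currentKey + "\t" + count);
      currentKey = key;
      count = 0;
    }
    count++;
  }
  if (currentKey !== null) out.push(currentKey + "\t" + count);
  return out;
}

// Hypothetical session records.
const sessions = [
  "SessionId=1;Country=France",
  "SessionId=2;Country=Spain",
  "SessionId=3;Country=France"
];

const result = reduceLines(mapLines(sessions).sort());
console.log(result);
```

Counting values, as the C# reducer does, is only correct while every emitted value is "1"; summing the parsed values instead would also stay correct if a combiner pre-aggregated the mapper output.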
Navigating the HDInsight portal: Demo
C# and Map/Reduce: Demo
https://elastastorage.blob.core.windows.net/hdinsight/Map-Reduce HDInsight Lab.pdf
Questions?