HDInsight on Azure and Map-Reduce Richard Conway Windows Azure MVP Elastacloud Limited


Agenda

Introduction

Big Data with HDInsight

Introduction

Solving problems through distribution

Some challenges become bound by hardware capacity; 24 hours on 1 machine can be 1 hour on 24 machines.

These 24 machines require orchestration; jobs are divided into tasks, and tasks are distributed across a cluster.

Software systems are required to facilitate the distribution; examples are Hadoop and HPC Server.

We will now provision a Hadoop cluster on Windows Azure.

Big Data vs Big Compute

Compute bound: HPC Server, Open MPI

IO bound: Hadoop

All distributed compute works by taking a large JOB and breaking it into many smaller TASKS, which are then parallelised.
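As a rough illustration of the job/task split described above, the sketch below divides one job (summing the integers 1 to 24) into four tasks and combines the partial results. The helper names `splitIntoTasks` and `runTask` are made up for this example, and the tasks run one after another in a single process; on a real cluster a scheduler would place each task on a different node.

```javascript
// Hypothetical helper: split one large job (an array of work items)
// into tasks of at most `taskSize` items each.
function splitIntoTasks(items, taskSize) {
  const tasks = [];
  for (let i = 0; i < items.length; i += taskSize) {
    tasks.push(items.slice(i, i + taskSize));
  }
  return tasks;
}

// The work done by one task: here, summing its slice of the input.
function runTask(task) {
  return task.reduce((sum, n) => sum + n, 0);
}

// The job: sum the integers 1..24.
const job = Array.from({ length: 24 }, (_, i) => i + 1);
const tasks = splitIntoTasks(job, 6);

// Each task produces a partial result; the job result combines them.
const partialResults = tasks.map(runTask);
const jobResult = partialResults.reduce((a, b) => a + b, 0);

console.log(tasks.length);                   // 4
console.log(JSON.stringify(partialResults)); // [21,57,93,129]
console.log(jobResult);                      // 300
```

Because each task only touches its own slice, the tasks do not depend on one another, which is what makes them safe to run in parallel.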

Hadoop: Name Nodes, Data Nodes

HPC: Head Node, Broker Node, Worker Nodes

Understanding Big Data

KEY TRENDS
• Cheap Storage: $100 gets you 3 million times more storage in 30 years
• Inexpensive Computing: 1980: 10 MIPS/$; 2005: 10M MIPS/$
• Device Explosion: >5.5 billion (70+% of global population)
• Social Networks: >2 billion users
• Ubiquitous Connection: web traffic of 130 exabytes (10^18 bytes) in 2010 and 1.6 zettabytes (10^21 bytes) in 2015
• Sensor Networks: >10 billion

Data sources
• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
• WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
• Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Chart axes: Volume, from gigabytes (10^9) through terabytes (10^12) and petabytes (10^15) to exabytes (10^18), against Velocity, Variety and Variability

Storage cost per GB
• 1980: $190,000
• 1990: $9,000
• 2000: $15
• 2010: $0.07
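The "3 million times more storage" claim can be checked against the storage cost per GB figures quoted on this slide ($190,000 in 1980, $0.07 in 2010). The short calculation below is only a sanity check on those two numbers; the helper name `gigabytesFor` is made up for the example.

```javascript
// Storage cost per gigabyte in US dollars, as quoted on the slide.
const costPerGb = { 1980: 190000, 1990: 9000, 2000: 15, 2010: 0.07 };

// Gigabytes that a fixed budget buys in a given year.
function gigabytesFor(budgetDollars, year) {
  return budgetDollars / costPerGb[year];
}

const gb1980 = gigabytesFor(100, 1980);
const gb2010 = gigabytesFor(100, 2010);
const improvement = gb2010 / gb1980;

console.log(gb1980.toFixed(5));                          // 0.00053 (about half a megabyte)
console.log(gb2010.toFixed(0));                          // 1429
console.log(Math.round((improvement / 1e6) * 10) / 10);  // 2.7 (million times)
```

The ratio comes out at roughly 2.7 million, which the slide rounds to 3 million.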

What is Big Data?

Big Data, BIG OPPORTUNITY

Big Data is a top priority for institutions

49% of CEOs and CIOs are planning big data projects

Software Growth (billions $)
• 2012: 1.8
• 2013: 2.5
• 2014: 3.4
• 2015: 4.6
34% compound annual growth rate (2)

Services Growth (billions $)
• 2012: 2.7
• 2013: 3.9
• 2014: 5.1
• 2015: 6.5
39% compound annual growth rate (2)

1. McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012
2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012

Devices: Internet and Internet of things

Internet of things
• Invisible devices
• Trillions of networked nodes
• Low-bandwidth last-mile connection: 100 kbit/s
• Mostly addressed by local schemes
• Machine-centric, sensing focus
• Trillions of computer-enabled devices which are part of the IoT

Internet
• Laptops / tablets / smartphones
• Billions of networked devices
• High-bandwidth access: cable 10 Mb/s+, fiber 50-100 Mb/s
• Global addressing
• User-centric, communication focus
• 6+ billion people, 1.5 billion use the net
• US: 4.3 devices per adult

Big Data Scenarios

Short History of Hadoop

• Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale
• Hadoop started as a part of the Nutch project
• In Jan 2006 Doug Cutting started working on Hadoop at Yahoo
• Factored out of Nutch in Feb 2006
• First release of Apache Hadoop in September 2007
• Jan 2008: Hadoop became a top-level Apache project

Hadoop Distributed Architecture

FIRST, STORE THE DATA

[Diagram: files stored across several servers]

MapReduce: Move Code to the Data

SECOND, TAKE THE PROCESSING TO THE DATA

So How Does It Work?

// Map Reduce function in JavaScript

// Map: split each line into words and emit (word, 1) for every non-empty word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: add up all the counts emitted for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
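To see what these two functions do without a cluster, they can be driven by a small local harness. The sketch below is an assumption-laden stand-in for the runtime: `makeContext`, `makeIterator` and `runLocally` are hypothetical helpers that only mimic the `context.write`, `hasNext` and `next` calls used on the slide, and the grouping step imitates what the runtime does between map and reduce.

```javascript
// The map and reduce functions from the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

// Hypothetical stand-in for the runtime's context: collects written pairs.
function makeContext() {
  var pairs = [];
  return { pairs: pairs, write: function (k, v) { pairs.push([k, v]); } };
}

// Hypothetical stand-in for the runtime's value iterator.
function makeIterator(array) {
  var index = 0;
  return {
    hasNext: function () { return index < array.length; },
    next: function () { return array[index++]; }
  };
}

// Run map over every line, group the emitted values by key, then reduce.
function runLocally(lines) {
  var mapContext = makeContext();
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  var groups = Object.create(null); // no inherited keys such as "constructor"
  mapContext.pairs.forEach(function (pair) {
    (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
  });

  var reduceContext = makeContext();
  Object.keys(groups).sort().forEach(function (key) {
    reduce(key, makeIterator(groups[key]), reduceContext);
  });
  return reduceContext.pairs;
}

var counts = runLocally(["Big Data, big compute", "data at scale"]);
console.log(JSON.stringify(counts));
// [["at",1],["big",2],["compute",1],["data",2],["scale",1]]
```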

[Diagram: the RUNTIME distributes the code to four servers]

Traditional RDBMS vs. NoSQL

            | TRADITIONAL RDBMS         | HADOOP
Data Size   | Gigabytes (Terabytes)     | Petabytes (Exabytes)
Access      | Interactive and Batch     | Batch
Updates     | Read / Write many times   | Write once, Read many times
Structure   | Static Schema             | Dynamic Schema
Integrity   | High (ACID)               | Low
Scaling     | Nonlinear                 | Linear
DBA Ratio   | 1:40                      | 1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Windows Azure HDInsight Service

Creating an HDInsight Cluster Demo


HDINSIGHT / HADOOP Eco-System

• Distributed Storage (HDFS)
• Query (Hive)
• Distributed Processing (MapReduce)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / SQOOP / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• JavaScript
• Pipeline / workflow (Oozie)
• Azure Storage Vault (ASV)
• PDW Polybase
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing

Legend
• Red = Core Hadoop
• Blue = Data processing
• Purple = Microsoft integration points and value adds
• Orange = Data Movement
• Green = Packages

Storing Data with HDInsight


HDFS on Azure: Tale of two File Systems

[Diagram labels]
• HDFS API
• DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Nodes
• Azure Storage (ASV) on Azure Blob Storage: Front end, Partition Layer, Stream Layer


Azure Storage (ASV)
• Default file system for HDInsight Service
• Provides sharable, persistent, highly scalable storage with high availability (Azure Blob Store)
• Azure Storage itself does not provide compute
• Fast access from compute nodes to data in the same data center
• Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires storage key in core-site.xml:
  <property>
    <name>fs.azure.account.key.accountname</name>
    <value>enterthekeyvaluehere</value>
  </property>
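As a small illustration of the addressing scheme and the key property above, the sketch below assembles both strings from their parts. It assumes the usual `scheme://authority/path` form for the address; the helpers `asvUri` and `accountKeyProperty`, and the container, account and key values, are placeholders invented for this example and are not part of any SDK.

```javascript
// Hypothetical helper: build an ASV address from its parts.
// `secure` selects the asvs scheme instead of asv.
function asvUri(container, account, path, secure) {
  const scheme = secure ? "asvs" : "asv";
  const cleanPath = path.replace(/^\/+/, ""); // avoid a double slash
  return `${scheme}://${container}@${account}.blob.core.windows.net/${cleanPath}`;
}

// Hypothetical helper: render the core-site.xml property shown on the slide,
// with the account name appended to fs.azure.account.key.
function accountKeyProperty(account, key) {
  return [
    "<property>",
    `  <name>fs.azure.account.key.${account}</name>`,
    `  <value>${key}</value>`,
    "</property>",
  ].join("\n");
}

// Placeholder values, for illustration only.
const uri = asvUri("sessions", "myaccount", "/AllSessions/part-0.gz", true);
const property = accountKeyProperty("myaccount", "enterthekeyvaluehere");

console.log(uri);
// asvs://sessions@myaccount.blob.core.windows.net/AllSessions/part-0.gz
console.log(property);
```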

Map Reduce

Examples in C#

Map/Reduce

Map/Reduce is a programming model for efficient distributed computing

Input > Map > Shuffle & Sort > Reduce > Output

Efficiency from streaming through data, reducing seeks

A good fit for a lot of applications
• Log processing
• Web index building
• Data mining and machine learning
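The Input > Map > Shuffle & Sort > Reduce > Output pipeline can be sketched as three plain functions, one per stage, applied to a log processing job. This is a single-process illustration only: the function names and the log format are made up for the example, and a real framework would run the map and reduce stages on many nodes.

```javascript
// Map: turn each input record into zero or more [key, value] pairs.
function mapStage(lines, mapper) {
  return lines.flatMap((line) => mapper(line));
}

// Shuffle & sort: bring all values for a key together, keys in sorted order.
function shuffleAndSort(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Reduce: collapse each key's values into one output record.
function reduceStage(grouped, reducer) {
  return grouped.map(([key, values]) => [key, reducer(key, values)]);
}

// Example job: count requests per HTTP status code.
// The log format below is invented for illustration.
const input = [
  "GET /index.html 200",
  "GET /missing 404",
  "POST /login 200",
  "GET /index.html 200",
];
const statusMapper = (line) => [[line.split(" ")[2], 1]];
const sumReducer = (key, values) => values.reduce((a, b) => a + b, 0);

const mapped = mapStage(input, statusMapper);
const grouped = shuffleAndSort(mapped);
const output = reduceStage(grouped, sumReducer);

console.log(JSON.stringify(grouped)); // [["200",[1,1,1]],["404",[1]]]
console.log(JSON.stringify(output));  // [["200",3],["404",1]]
```

Keeping the shuffle as its own function makes the contract visible: a reducer only ever sees one key together with all of that key's values.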

Hadoop SDK

• C# integration
• Remote Data & Jobs
• Hive in C#
• Serialization

http://hadoopsdk.codeplex.com

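using Microsoft.Hadoop.MapReduce;

// Ties the mapper and reducer together and tells Hadoop where to read and write.
public class FrenchSessionsJob : HadoopJob<FrenchSessionsMapper, SessionsReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration()
        {
            // Read every gzipped session file; results go to /FrenchSessions/.
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/FrenchSessions/"
        };
        return config;
    }
}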

Jobs

public class FrenchSessionsMapper : MapperBase
{
    // Called once per input line; emits ("FR", "1") for each French session.
    public override void Map(string inputLine, MapperContext context)
    {
        if (inputLine.Contains("Country=France"))
        {
            context.IncrementCounter("FrenchSession");
            context.EmitKeyValue("FR", "1");
        }
    }
}

Mapper

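using System.Collections.Generic;
using System.Linq; // for Count()

public class SessionsReducer : ReducerCombinerBase
{
    // Each value is a "1" emitted by the mapper, so counting the values
    // gives the number of sessions for the key.
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        // EmitKeyValue is assumed to take string values, so convert the count.
        context.EmitKeyValue(key, values.Count().ToString());
    }
}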

Reducer

Navigating the HDInsight portal Demo

C# and Map/Reduce Demo

https://elastastorage.blob.core.windows.net/hdinsight/Map-Reduce HDInsight Lab.pdf

Questions?
