
Introduction to Big Data and Hadoop on Windows Azure

Source: files.meetup.com/1624468/big data and cloud at Microsoft.pdf


Wenming Ye

Sr. Research Program Manager

Microsoft Research Connections

Twitter: @wenmingye


http://www.windowsazure.com/en-us/develop/nodejs/how-to-guides/command-line-tools/


Gallery Images Available

Microsoft

• Windows Server 2008 R2

• SQL Server Eval 2012

• Windows Server 2012

• BizTalk Server 2013 Beta

Open Source

• OpenSUSE 12.2

• CentOS 6.3

• Ubuntu 12.04/12.10

• SUSE Linux Enterprise Server 11 SP2


VM with persistent drive


Server Rack 1 Server Rack 2


Blobs, Disks, Tables and Queues

8.5 trillion stored objects

900K requests/sec on average (2.3+ trillion per month)
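As a quick sanity check, the per-second and per-month figures can be related with simple arithmetic. The sketch below assumes a 30-day month, which the slide does not state.

```python
# Back-of-the-envelope check of the request-rate figures quoted on the slide.
REQUESTS_PER_SECOND = 900_000        # "900K requests/sec on average"
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30                  # assumption: a 30-day month

requests_per_month = REQUESTS_PER_SECOND * SECONDS_PER_DAY * DAYS_PER_MONTH
print(f"{requests_per_month / 1e12:.2f} trillion requests per month")
```

Under that assumption the product comes to about 2.33 trillion, in line with the "2.3+ trillion per month" on the slide.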


# Create container (legacy azure.storage BlobService API, as used on the slide)
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)  # your storage account name and key
blob_service.create_container('taskcontainer')

# Upload
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
with open('task1-upload.txt', 'rb') as f:  # open() replaces the Python 2-only file()
    blob_service.put_blob('taskcontainer', 'task1', f.read(), 'BlockBlob')

# Download
from azure.storage import BlobService
blob_service = BlobService(account_name, account_key)
blob = blob_service.get_blob('taskcontainer', 'task1')


Data centers


Account

• Container → Blobs: https://<account>.blob.core.windows.net/<container>

• Table → Entities: https://<account>.table.core.windows.net/<table>

• Queue → Messages: https://<account>.queue.core.windows.net/<queue>
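The endpoint patterns above can be captured in a small helper. This is only an illustrative sketch: `storage_url`, `myaccount`, and `taskcontainer` are hypothetical names, not part of any Azure SDK.

```python
# URL pattern taken from the slide: https://<account>.<service>.core.windows.net/<resource>
ENDPOINT_TEMPLATE = "https://{account}.{service}.core.windows.net/{resource}"
SERVICES = ("blob", "table", "queue")


def storage_url(account, service, resource):
    """Build the endpoint URL for a blob container, table, or queue."""
    if service not in SERVICES:
        raise ValueError("unknown storage service: " + service)
    return ENDPOINT_TEMPLATE.format(account=account, service=service, resource=resource)


# Hypothetical account and container names, for illustration only.
container_url = storage_url("myaccount", "blob", "taskcontainer")
print(container_url)
```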


Design Goals

• “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency”, ACM Symposium on Operating Systems Principles (SOSP), Oct. 2011


[Diagram: Windows Azure Storage architecture. Access blob storage via the URL: http://<account>.blob.core.windows.net/. The Storage Location Service and data access both reach two Storage Stamps. Each Storage Stamp sits behind a load balancer (LB) and consists of Front-Ends, a Partition Layer, and a DFS Layer with intra-stamp replication. Inter-stamp (Geo) replication runs between the two stamps.]


Index

Partition Layer


• Does not move data around, only reassigns what part of the index a partition server is responsible for

Partition Layer

Index
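The load-balancing point on this slide (only the index assignment changes, the stored data stays put) can be illustrated with a toy model. Everything here is a hypothetical sketch: `PartitionMap`, the key ranges, and the server names `PS1` to `PS3` are invented for illustration and do not reflect the actual Windows Azure Storage implementation.

```python
import bisect


class PartitionMap:
    """Toy index map: each key range of the index is served by one partition server."""

    def __init__(self, lower_bounds, servers):
        # lower_bounds[i] is the inclusive lower bound of range i, sorted ascending.
        if len(lower_bounds) != len(servers):
            raise ValueError("one server per range is required")
        self.lower_bounds = list(lower_bounds)
        self.servers = list(servers)

    def server_for(self, key):
        """Return the server currently responsible for the range containing key."""
        i = bisect.bisect_right(self.lower_bounds, key) - 1
        if i < 0:
            raise KeyError(key)
        return self.servers[i]

    def reassign(self, range_index, new_server):
        """Hand a range to another server; only the map changes, no data is copied."""
        self.servers[range_index] = new_server


# The stored objects live elsewhere (here: a plain dict) and are never touched.
stored_objects = {"apple": 1, "mango": 2, "zebra": 3}
snapshot = dict(stored_objects)

pmap = PartitionMap(["", "h", "r"], ["PS1", "PS2", "PS3"])
owner_before = pmap.server_for("mango")   # range ["h", "r") is on PS2
pmap.reassign(1, "PS3")                   # rebalance: PS3 takes over that range
owner_after = pmap.server_for("mango")
print(owner_before, "->", owner_after)
```

The reassignment is a metadata update only, which is why it can be done quickly in response to load.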


• “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency”, ACM Symposium on Operating Systems Principles (SOSP), Oct. 2011


and Queues (NEW)

Geo-replication region pairs:

• North Europe and West Europe

• South Central US and North Central US

• East Asia and South East Asia

• West US and East US


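[Diagram: geo-failover between West US and East US. A DNS lookup of http://account.blob.core.windows.net/ against Azure DNS maps the hostname account.blob.core.windows.net to West US, where data access then takes place. Data is geo-replicated to East US. On failover, DNS is updated to point to East US.]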
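The failover flow on this slide (look up the hostname, access the primary, update DNS on failover) can be mimicked with a toy lookup table. This is a minimal sketch under stated assumptions: `DnsTable` and `fail_over` are hypothetical stand-ins, and real DNS returns IP addresses rather than region names.

```python
class DnsTable:
    """Toy stand-in for Azure DNS: maps a hostname to the location serving it."""

    def __init__(self):
        self._records = {}

    def update(self, hostname, location):
        self._records[hostname] = location

    def lookup(self, hostname):
        return self._records[hostname]


def fail_over(dns, hostname, secondary):
    """Point the hostname at the geo-replicated secondary location."""
    dns.update(hostname, secondary)


dns = DnsTable()
host = "account.blob.core.windows.net"
dns.update(host, "West US")            # primary location

before = dns.lookup(host)              # clients resolve to the primary
fail_over(dns, host, "East US")        # failover: update DNS
after = dns.lookup(host)               # same URL now resolves to the secondary
print(before, "->", after)
```

The point of the design is that clients keep using the same URL; only the DNS mapping behind it changes.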


Windows Azure Storage


[Chart: Average of TransactionCount and Average of TPS; axis ranges 180 to 200 and 660000 to 700000]


[Chart: Average of TransactionCount and Average of TPS over time, from 6/24/2013 to 6/24/2013 1:00 in 3-minute steps; axis ranges 0 to 350 and 0 to 20000]


JSON


http://www.nuget.org/packages/WindowsAzure.Storage


XL VM Uploading 512, 256MB Blobs (Total upload size = 128GB)

• C=1, P=1 => Averaged ~ 13.2 MB/s

• C=1, P=30 => Averaged ~ 50.72 MB/s

• C=30, P=1 => Averaged ~ 96.64 MB/s

• Single TCP connection is bound by TCP rate control & RTT

• P=30 vs. C=30: Test completed almost twice as fast!

• Single Blob is bound by the limits of a single partition

• Accessing multiple blobs concurrently scales
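The "almost twice as fast" observation follows directly from the quoted throughputs. The sketch below recomputes the total transfer times from the slide's numbers; it assumes 1 GB = 1024 MB, and it assumes C means the number of blobs transferred concurrently and P the degree of parallelism within a single blob, which the slide does not define.

```python
# Figures quoted on the slide.
BLOB_COUNT = 512
BLOB_SIZE_MB = 256
total_mb = BLOB_COUNT * BLOB_SIZE_MB      # 131072 MB, i.e. 128 GB at 1024 MB per GB

throughput_mb_s = {
    "C=1, P=1": 13.2,
    "C=1, P=30": 50.72,
    "C=30, P=1": 96.64,
}

# Time to move the whole data set at each average throughput.
durations = {cfg: total_mb / rate for cfg, rate in throughput_mb_s.items()}
for cfg, seconds in durations.items():
    print(f"{cfg}: about {seconds:,.0f} s")

# How much sooner the C=30 run finishes compared with the P=30 run.
speedup = durations["C=1, P=30"] / durations["C=30, P=1"]
print(f"C=30 finishes about {speedup:.2f}x faster than P=30")
```

The resulting times are roughly 9,930 s, 2,584 s, and 1,356 s, a ratio of about 1.9 between the last two.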

[Chart: upload time in seconds (axis 0 to 10000) for P=1, C=1; P=30, C=1; and P=1…]


• XL VM Downloading 50, 256MB Blobs (Total download size = 12.5GB)

• C=1, P=1 => Averaged ~ 96 MB/s

• C=30, P=1 => Averaged ~ 130 MB/s
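One way to picture the C setting is as a bound on the number of transfers in flight. The sketch below uses a thread pool with a simulated download; `fetch_blob` is a hypothetical stand-in that only sleeps, so the timings illustrate the pattern and say nothing about real Azure throughput. It also assumes C means the number of blobs transferred concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_blob(name):
    """Hypothetical stand-in for a blob download; sleeps to mimic network latency."""
    time.sleep(0.02)
    return name


def download_all(names, concurrency):
    """Fetch every blob with at most `concurrency` transfers in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch_blob, names))


names = ["blob-%d" % i for i in range(10)]

start = time.perf_counter()
serial = download_all(names, concurrency=1)
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
concurrent = download_all(names, concurrency=10)
concurrent_seconds = time.perf_counter() - start

print(f"one at a time: {serial_seconds:.2f} s, ten at a time: {concurrent_seconds:.2f} s")
```

Because each blob is an independent transfer, raising the concurrency overlaps the waiting time instead of queueing it.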

[Chart: download time in seconds (axis 0 to 140) for C=1, P=1 and C=30, P=1]


[Chart: data sources plotted by Volume against Velocity - Variety - Variability]

Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Storage/GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07


Big Data, BIG OPPORTUNITY

Big Data is a top priority for institutions

49% of CEOs and CIOs are planning big data projects

Software Growth (billions $), 34% compound annual growth rate (source 2)

2012: 1.8, 2013: 2.5, 2014: 3.4, 2015: 4.6

Services Growth (billions $), 39% compound annual growth rate (source 2)

2012: 2.7, 2013: 3.9, 2014: 5.1, 2015: 6.5

1. McKinsey&Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012

2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012


How do I optimize my services based on patterns of weather, traffic?

How do I build a recommendation engine?

What’s the social sentiment of my product?

How do I better predict future outcomes?


[Diagram: Hadoop stack components, including Distributed Storage (HDFS), Distributed Processing (MapReduce), Query (Hive), and ODBC]

Legend

• Red = Core Hadoop

• Blue = Data processing

• Purple = Microsoft integration points and value adds

• Orange = Data Movement

• Green = Packages
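For readers new to the MapReduce box in this diagram, the programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases with made-up input, not Hadoop code; a real job would distribute each phase across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


# Made-up input lines, for illustration only.
lines = ["big data big opportunity", "big data on azure"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```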


[Diagram: two storage options behind the HDFS API. DFS (1 Data Node per Worker Role) and Compute Cluster: a Name Node and Data Nodes. Azure Storage (ASV): Azure Blob Storage, with Front ends, a Partition Layer, and a Stream Layer.]


Hive, Pig, Mahout, Cascading, Scalding, Scoobi, Pegasus…

C#, F# Map/Reduce, LINQ to Hive, .NET management clients

JavaScript Map/Reduce, Browser hosted console, Node.js management clients

PowerShell, Cross Platform CLI tools


Deploying and Interacting With HDInsight Service

demo


 | Batch Processing | Interactive analysis | Stream processing
Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending
Data volume | TBs to PBs | GBs to PBs | Continuous stream
Programming model | MapReduce | Queries | DAG
Users | Developers | Analysts and developers | Developers
Originating project | Google MapReduce | Google Dremel | Twitter Storm
Open source project | Hadoop / Spark | Drill / Shark / Impala / HBase / Cassandra | Storm / Apache S4 / Kafka


http://www.windowsazure.com/en-us/develop/net/

http://blogs.msdn.com/b/windowsazurestorage/

http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx


Windows Azure Python SDK

Windows Azure

How to use Service Management from Python

http://www.windowsazure.com/en-us/manage/linux/other-resources/command-line-tools/

http://research.microsoft.com/en-us/projects/azure/