Big Data on Azure: What do I need to know as a developer to make it worthwhile?
Mihai Nadăș, Chief Technology Officer, Yonder; Most Valuable Professional, Microsoft

Iasi Code Camp, 20 April 2013


Page 1: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Big Data on Azure

What do I need to know as a developer to make it worthwhile?

Mihai Nadăș, Chief Technology Officer, Yonder; Most Valuable Professional, Microsoft

Page 2: Iasi code camp 20 april 2013 mihai nadas hadoop azure

About Me

@mihainadas | blog.mihainadas.com

Page 3: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Agenda

Why Big Data?

Understanding the Basics

Microsoft and Hadoop

Page 4: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Two Big Data examples

1. Google Flu Trends

2. Farecast

Page 5: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Why Big Data?

Pages 6–9: (image slides; no text content)
Page 10: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Gartner’s Hype Cycle on Big Data

Page 11: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Key Technologies

• Accessible (non-relational) storage in the cloud: Amazon S3, Azure Blob & Table storage, Google Cloud Storage
• In-memory databases & grids: MemSQL, XAP (GigaSpaces), SAP HANA
• Parallel processing frameworks: Hadoop
• Online analytics frameworks: Google BigQuery, Hive
• Data stream processing: Twitter Storm
• Complex event processing: Oracle CEP Server, Microsoft StreamInsight
• Sentiment analysis: Radian6

Page 12: Iasi code camp 20 april 2013 mihai nadas hadoop azure

It’s BIG

Page 13: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Example Scenario

Page 14: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Traditional E-Commerce Data Flow (diagram): operational data (new user registry, new purchase, new product) passes through ETL, which loads only some of the data into the data warehouse; the excess data and the logs are left out.

Page 15: Iasi code camp 20 april 2013 mihai nadas hadoop azure

New E-Commerce Big Data Flow (diagram): operational data (new user registry, new purchase, new product) and logs are all captured in a raw data "store it all" cluster, which then feeds the data warehouse. A question this makes answerable: how much do views for certain products increase when our TV ads run?

Page 16: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Viktor Mayer-Schönberger, Professor at Oxford
Kenneth Cukier, Editor, The Economist

Page 17: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Big Data Principles

1. More: store over trash
2. Messy: quantity over quality
3. Correlation: what over why

Page 18: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Understanding the Basics: Move the Compute to the Data

Page 19: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Characteristics of Big Data

Page 20: Iasi code camp 20 april 2013 mihai nadas hadoop azure

MapReduce

Page 21: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Think of the following problem...

Page 22: Iasi code camp 20 april 2013 mihai nadas hadoop azure

What if we parallelize?

Page 23: Iasi code camp 20 april 2013 mihai nadas hadoop azure

What if we parallelize?

Page 24: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Welcome, MapReduce
(diagram: Map and Reduce phases)

Page 25: Iasi code camp 20 april 2013 mihai nadas hadoop azure

So How Does It Work?

Page 26: Iasi code camp 20 april 2013 mihai nadas hadoop azure

MapReduce – Workflow
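To make the map and reduce steps concrete, here is the canonical word-count job in Java; a minimal sketch against the org.apache.hadoop.mapreduce API, with illustrative class names and input/output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in the input split, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the 1s emitted for the same word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Job.getInstance is the Hadoop 2.x style; on older 1.x clusters, new Job(conf, name) works too.
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, a job like this is exactly what gets submitted to the cluster in the walkthrough later in the deck.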

Page 27: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Hadoop

Page 28: Iasi code camp 20 april 2013 mihai nadas hadoop azure

The Hadoop Ecosystem (diagram: Hadoop core surrounded by ETL tools, BI reporting, and RDBMS)

Reference: Tom White’s Hadoop: The Definitive Guide

Page 29: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Traditional RDBMS vs. MapReduce

              TRADITIONAL RDBMS          MAPREDUCE

Data Size     Gigabytes (terabytes)      Petabytes (exabytes)
Structure     Static schema              Dynamic schema
Integrity     High (ACID)                Low
Scaling       Nonlinear                  Linear
DBA Ratio     1:40                       1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Page 30: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Microsoft and Hadoop

Page 32: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Deploying and Interacting With a Hadoop Cluster on Azure

A step-by-step walkthrough

Page 33: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Objectives

1. Run a basic Java MapReduce program using a Hadoop jar file
2. Import data from the Windows Azure Marketplace into a Hadoop on Azure cluster using the Interactive Hive Console

Page 34: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Prerequisites

1. Access to a Hadoop on Azure account
2. Request an invitation to the Preview Feature

Page 35: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Creating a new HDInsight Cluster (I)

Page 36: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Creating a new HDInsight Cluster (II)

Page 37: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Cluster Management Interface

Page 38: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Hadoop Sample Gallery

Page 39: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Objective #1: Basic MapReduce Task

• We will use the Pi Estimator sample job
• A distributed Pi Estimator run with 16 maps, each computing 10 million samples

Page 40: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Pi Estimator

• Uses the Monte Carlo simulation method to compute π
(diagram: a circle of radius r = 1 inscribed in a square of side 2r)
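The diagram carries the whole idea: points drawn uniformly at random inside the square fall inside the inscribed circle with probability equal to the ratio of the two areas, which turns a counting exercise into an estimate of π:

\[
\frac{\text{circle area}}{\text{square area}} = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}
\qquad\Rightarrow\qquad
\pi \approx 4 \cdot \frac{N_{\text{inside}}}{N_{\text{total}}}
\]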

Page 41: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Pi Estimator

• Uses the Monte Carlo simulation method to compute π
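Each map task in the distributed sample essentially runs a sampling loop like the one below and reports how many of its points landed inside the circle; the reduce step adds the counts up and applies the 4 × inside / total formula. This single-threaded Java sketch only mirrors the per-mapper work and is not the actual Hadoop PiEstimator source (on most distributions the real job is launched as something like "hadoop jar hadoop-examples.jar pi 16 10000000", with the jar name varying by version):

import java.util.Random;

public class LocalPiEstimate {
    public static void main(String[] args) {
        long samples = 10000000L; // what a single one of the 16 mappers would compute
        long inside = 0;
        Random rnd = new Random(42);

        for (long i = 0; i < samples; i++) {
            // Draw a point uniformly in the unit square [0,1) x [0,1).
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            // Count it if it falls inside the quarter circle of radius 1
            // (same area ratio as the full circle inscribed in the 2r square).
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }

        // pi/4 ~= inside / samples, so pi ~= 4 * inside / samples.
        double piEstimate = 4.0 * inside / samples;
        System.out.println("pi ~= " + piEstimate);
    }
}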

Page 42: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Pi Estimator: Running the Job

Page 43: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Pi Estimator: And the result is...

• 160,000,000 random points
• 16 mappers
• 10,000,000 samples per map
• Computed in 65.108 seconds

Page 44: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Objective #2: Import Data from the Windows Azure Marketplace into a Hadoop on Azure Cluster

• Windows Azure Marketplace is a cloud one-stop shop for premium data and applications
• We will use the "2006 – 2008 Crime in the US" dataset to experiment with Hive on Hadoop

Page 45: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Windows Azure Marketplace

Page 46: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Apache Hive

• Data Warehouse infrastructure built on top of Hadoop

• Provides data summarization, query and analysis

• Initially developed by Facebook, now an Apache project

Page 47: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Apache Hive: Features

• Analysis of large datasets stored in Hadoop-compatible file-systems

• Provides a SQL-like language called HiveQL while maintaining full support for map/reduce

• By default, stores its metadata (the metastore) in an embedded Apache Derby database

Page 48: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Importing data to Hadoop on Azure

Page 49: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Importing data to Hadoop on Azure

Page 50: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Querying huge datasets using Hive

Page 51: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Querying huge datasets using Hive

Page 52: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Querying huge datasets using Hive
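The screenshots above issue HiveQL in the Interactive Hive Console; the same kind of query can also be run programmatically over JDBC. The sketch below is only an illustration of that route: it assumes a HiveServer2 endpoint and the standard Hive JDBC driver, and the table name crime_data and its columns are hypothetical placeholders rather than the Marketplace dataset's actual schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCrimeQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (hive-jdbc on the classpath); older HiveServer1
        // deployments use org.apache.hadoop.hive.jdbc.HiveDriver and jdbc:hive:// URLs.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, user and password are placeholders for the cluster's own values.
        String url = "jdbc:hive2://<cluster-host>:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL, but Hive compiles it into MapReduce jobs on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT state, SUM(violent_crime) AS total " +
                "FROM crime_data GROUP BY state ORDER BY total DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString("state") + "\t" + rs.getLong("total"));
            }
        }
    }
}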

Page 53: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Hadoop on Windows: Differentiation

• Insights to all users by activating new types of data
• Integration with Microsoft Business Intelligence
• Choice of deployment on Windows Server + Windows Azure
• Integration with Windows components (AD, System Center)
• Easy installation and configuration of Hadoop on Windows
• Simplified programming with .NET & JavaScript integration
• Integration with SQL Server data warehousing

Page 54: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Microsoft Big Data Roadmap

• To accelerate the delivery of its Hadoop-based solution for Windows Server and service for Windows Azure, Microsoft is announcing a partnership with Hortonworks.
• Microsoft is committed to broadening the accessibility and usage of Hadoop for end users, developers, and IT professionals in organizations of all sizes.
• Microsoft is announcing an end-to-end roadmap for Big Data that embraces Apache Hadoop™ by distributing enterprise-class Hadoop-based solutions on both Windows Server and Windows Azure.
• Microsoft is extending its leadership in business intelligence and data warehousing to provide insights to all users by activating new types of data of any size.

Page 55: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Things to Do

1. Get a trial of Windows Azure
2. Subscribe to the Preview Program of Hadoop on Azure
3. Write your first MapReduce job
4. Give a talk this autumn at CodeCamp about your experience with Big Data

Page 56: Iasi code camp 20 april 2013 mihai nadas hadoop azure

Thank you