Analyzing StackExchange data with Azure Data Lake

Page 1: Analyzing StackExchange data with Azure Data Lake

Sponsored & Brought to you by

Analyzing StackExchange data with Azure Data Lake

Tom Kerkhove

http://www.twitter.com/TomKerkhove

https://be.linkedin.com/in/tomkerkhove

Page 2: Analyzing StackExchange data with Azure Data Lake

Analysing StackExchange data with Azure Data Lake

Page 3: Analyzing StackExchange data with Azure Data Lake

Nice to meet you

Tom KERKHOVE
➔ Integration Professional
➔ IoT Competency Lead
➔ Windows Development & Microsoft Azure MVP

[email protected]
+32 473 701 [email protected]
/in/tomkerkhove
github.com/tomkerkhove

Page 4: Analyzing StackExchange data with Azure Data Lake

Agenda
• Why should we care about Big Data?
• Big Data in Azure
• Azure Data Lake
• Demo
• Q & A

Page 5: Analyzing StackExchange data with Azure Data Lake


Page 6: Analyzing StackExchange data with Azure Data Lake

Integration of Things

Internet of Things

Page 7: Analyzing StackExchange data with Azure Data Lake

Business Systems

Connect and scale with efficiency

Analyze and act on new data

Integrate and transform business processes

Page 8: Analyzing StackExchange data with Azure Data Lake

Event producers & gateways

Ingestion & transformation

Report, Act, Predict

Page 9: Analyzing StackExchange data with Azure Data Lake

Microsoft Patterns & Practices – IoT Journey

Page 10: Analyzing StackExchange data with Azure Data Lake

Page 11: Analyzing StackExchange data with Azure Data Lake

Cluster Management

Page 12: Analyzing StackExchange data with Azure Data Lake

Languages

Page 13: Analyzing StackExchange data with Azure Data Lake

Platform Services

Web and Mobile: Web Apps, Mobile Apps, API Management, API Apps, Logic Apps, Notification Hubs

Media & CDN: Content Delivery Network (CDN), Media Services

Integration: BizTalk Services, Hybrid Connections, Service Bus, Storage Queues

Hybrid Operations: Backup, StorSimple, Azure Site Recovery, Import/Export

Data: SQL Database, DocumentDB, Redis Cache, Azure Search, Storage Tables, Data Warehouse

Compute Services: Cloud Services, Batch, RemoteApp, Service Fabric

Developer Services: Visual Studio, App Insights, Azure SDK, VS Online

Analytics & IoT: HDInsight, Machine Learning, Stream Analytics, Data Factory, Event Hubs, Mobile Engagement, Data Lake, IoT Hub, Data Catalog

Security & Management: Azure Active Directory, Multi-Factor Authentication, Automation, Portal, Key Vault, Store/Marketplace, VM Image Gallery & VM Depot, Azure AD B2C, Scheduler, Azure AD Health Monitoring, AD Privileged Identity Management, Operational Analytics, Domain Services

Infrastructure Services

OS/Server Compute Storage: Virtual Machines, Container Service, BLOB Storage, Azure Files, Premium Storage

Networking: Virtual Network, ExpressRoute, Traffic Manager, App Gateway, DNS, VPN Gateway, Load Balancer

Datacenter Infrastructure (24 Regions, 22 Online)

Page 14: Analyzing StackExchange data with Azure Data Lake

Overview in Azure

DocumentDB, Data Factory, Stream Analytics, Data Lake (Store & Analytics), HDInsight, Virtual Machine, IoT Hub, SQL Data Warehouse, SQL Database, Storage, Event Hubs

Data Ingestion, Data Storage, Data Pipelines, Machine Learning, Data Analytics

Page 15: Analyzing StackExchange data with Azure Data Lake

Personal Digital Assistant – Cortana

Perceptual Intelligence

Preconfigured Solutions

Dashboards and Visualizations

Machine Learning and Analytics

Big Data Store

Information Management

Cortana Analytics Suite

Page 16: Analyzing StackExchange data with Azure Data Lake

Page 17: Analyzing StackExchange data with Azure Data Lake

Analysing Big Data in Azure

Azure Data Lake Family

HDInsight
• Managed cluster service
• Open-source technology
• Runs on Windows or Linux

Data Lake Store
• Unlimited storage
• WebHDFS Store

Data Lake Analytics
• Managed job service
• U-SQL batch-processing

Page 18: Analyzing StackExchange data with Azure Data Lake

Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
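"WebHDFS compatible" means the store can be addressed with the WebHDFS REST conventions: a path under /webhdfs/v1/ plus an op query parameter. Below is a minimal Python sketch of how such request URLs are formed. The account name myadls, the file path and the azuredatalakestore.net host pattern are assumptions for illustration, not taken from the slides. Authentication is left out and no request is sent.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for a path in a Data Lake Store account."""
    # Assumed host pattern; replace it with the real endpoint of your account.
    host = f"{account}.azuredatalakestore.net"
    # WebHDFS selects the operation through the 'op' query parameter.
    query = urlencode({"op": op, **params})
    return f"https://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical account and paths, for illustration only.
print(webhdfs_url("myadls", "/stackexchange", "LISTSTATUS"))
print(webhdfs_url("myadls", "/stackexchange/Posts.xml", "OPEN"))
```

Because the addressing scheme is the standard WebHDFS one, tools that already speak WebHDFS should only need a different host and credentials.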

Page 19: Analyzing StackExchange data with Azure Data Lake

Data Lake vs Data Warehousing

Characteristics

➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data has already been transformed

➔ Data Lake
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, it turns into a data swamp pretty fast
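The difference between the two approaches comes down to when the schema is applied. A rough Python sketch, using made-up JSON records and hypothetical function names: the warehouse path transforms to a fixed schema before storing, while the lake path stores the raw lines and interprets them only at read time.

```python
import json

# Made-up raw records; the second one carries a field the schema does not know.
RAW_EVENTS = [
    '{"user": "alice", "score": "12"}',
    '{"user": "bob", "score": "7", "badge": "gold"}',
]


def load_into_warehouse(raw_lines):
    """Schema-on-write: transform to a fixed schema (user, score) before storing."""
    return [
        {"user": rec["user"], "score": int(rec["score"])}
        for rec in map(json.loads, raw_lines)
    ]


def read_from_lake(raw_lines, field, default=None):
    """Schema-on-read: keep raw lines and interpret one field only when asked."""
    return [json.loads(line).get(field, default) for line in raw_lines]


warehouse = load_into_warehouse(RAW_EVENTS)  # 'badge' is lost in the ETL step
lake = list(RAW_EVENTS)                      # stored as-is

print(warehouse)
print(read_from_lake(lake, "badge"))
```

The lake still answers a question nobody planned for (who has a badge?), but only because the reader knows which field names to look for. That knowledge is the metadata a swamp is missing.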

Page 20: Analyzing StackExchange data with Azure Data Lake

Martin Fowler on Data Lake & Data Warehouses (link)

Page 21: Analyzing StackExchange data with Azure Data Lake

Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
  ➔ Don’t worry about scale
➔ Written in U-SQL
  ➔ SQL Syntax
  ➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only

Page 22: Analyzing StackExchange data with Azure Data Lake

Writing U-SQL scripts

Extract from a data source by using built-in or custom extractors.

Transform / analyse the data using SQL syntax, in-line C# or C# method calls.

Output the result to a data source by using built-in or custom outputters.
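The three stages can be mimicked outside U-SQL. The following is a Python analogy, not U-SQL and not how Data Lake Analytics runs a job: the sample rows, the column layout and the function names are all made up for illustration.

```python
import csv
import io

# Made-up tab-separated sample: post id, user, score.
SAMPLE_TSV = "1\talice\t12\n2\tbob\t7\n3\talice\t5\n"


def extract(text):
    """Extract: turn raw tab-separated text into typed rows (the 'extractor')."""
    for post_id, user, score in csv.reader(io.StringIO(text), delimiter="\t"):
        yield {"id": int(post_id), "user": user, "score": int(score)}


def transform(rows):
    """Transform: total score per user, highest first (a GROUP BY with ORDER BY)."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["score"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def output(results):
    """Output: serialise the result set as CSV text (the 'outputter')."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(results)
    return buffer.getvalue()


print(output(transform(extract(SAMPLE_TSV))))
```

Keeping the stages separate is what makes custom extractors and outputters possible: the transform step does not care where the rows came from or where they go.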

Page 23: Analyzing StackExchange data with Azure Data Lake

Page 24: Analyzing StackExchange data with Azure Data Lake

Data Lake Analytics - Data Sources

U-SQL Query (Azure Data Lake Analytics): Query / Write

Azure Storage Blobs
Azure Data Lake Store
Azure SQL Database
Azure SQL Data Warehouse
Azure SQL in VMs

Page 25: Analyzing StackExchange data with Azure Data Lake

Page 26: Analyzing StackExchange data with Azure Data Lake

Meet StackExchange
➔ Over 280 subwebsites
➔ 150+ GB of open-source data
➔ Different kinds of data
  ➔ Posts
  ➔ Users
  ➔ Votes
  ➔ ...
➔ A big data sample data set
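To get a feel for one of these kinds of data, here is a small Python sketch that counts questions and answers in a Posts file. It assumes the public data dump layout, which the slides do not show: XML with one row element per post and attributes such as Id, PostTypeId and Score. The inline sample is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample in the assumed layout of the data dump's Posts file.
SAMPLE_POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" Score="12" />
  <row Id="2" PostTypeId="2" Score="7" />
  <row Id="3" PostTypeId="2" Score="3" />
</posts>"""

# Assumed meaning of PostTypeId values in the dump.
POST_TYPES = {"1": "question", "2": "answer"}


def count_post_types(xml_text):
    """Count posts per type; unknown type ids are grouped under 'other'."""
    root = ET.fromstring(xml_text)
    return Counter(
        POST_TYPES.get(row.get("PostTypeId"), "other") for row in root.iter("row")
    )


print(count_post_types(SAMPLE_POSTS_XML))
```

Parsing a whole string like this only works for small samples. For files in the 150+ GB range, a streaming parser or a distributed job is the realistic option.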

Page 27: Analyzing StackExchange data with Azure Data Lake

What Are We Going To Do?

Acquiring The Data
• Downloading the original data set

Moving The Data
• Upload data set to Azure
• Determine what service to use

Visualizing The Data
• Visualize what we’ve learned

Page 28: Analyzing StackExchange data with Azure Data Lake

Azure Data Lake tools for Visual Studio
➔ Projects / Solutions / Source control
➔ Store Explorer
  ➔ Browse store
  ➔ Download complete / subset of file
  ➔ Preview
➔ Job Visualizer
  ➔ Determine bottlenecks by using heatmaps
  ➔ Playback jobs based on telemetry
➔ Query optimization
➔ Job Profiler
➔ Off-Line execution

Page 29: Analyzing StackExchange data with Azure Data Lake

Integration with Azure Services
➔ Integrate in your data pipelines in Azure Data Factory
  ➔ Move data from Azure Data Lake Store to another store
  ➔ Move data to Azure Data Lake Store
  ➔ Run U-SQL query within pipeline
➔ Integration with Azure Data Catalog
  ➔ Register your Azure Data Lake Store assets

Page 30: Analyzing StackExchange data with Azure Data Lake

Pricing
➔ Data Lake Store
  ➔ $0.08/GB stored per month
  ➔ $0.14 per 1M transactions
    • 1 transaction is a block of up to 128 kB
  ➔ Egress will be billed, but the price is not known yet
➔ Data Lake Analytics
  ➔ $0.05 per job
  ➔ $0.05 per minute per Analytics Unit for processing time
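A quick back-of-the-envelope calculation with these rates, as a Python sketch. It assumes simple linear billing with no rounding, free tiers or egress charges. The workload figures (150 GB stored, 2 million transactions, a 30-minute job on 10 Analytics Units) are made up.

```python
from decimal import Decimal

# Rates as listed on the slide (USD).
STORE_PER_GB_MONTH = Decimal("0.08")
STORE_PER_MILLION_TX = Decimal("0.14")
ANALYTICS_PER_JOB = Decimal("0.05")
ANALYTICS_PER_AU_MINUTE = Decimal("0.05")


def store_cost(gb_stored, transactions):
    """Monthly Data Lake Store cost: storage plus transactions (egress excluded)."""
    storage = Decimal(gb_stored) * STORE_PER_GB_MONTH
    tx = Decimal(transactions) / Decimal(1_000_000) * STORE_PER_MILLION_TX
    return storage + tx


def job_cost(minutes, analytics_units):
    """Cost of one Data Lake Analytics job: flat fee plus AU-minutes."""
    processing = ANALYTICS_PER_AU_MINUTE * Decimal(minutes) * Decimal(analytics_units)
    return ANALYTICS_PER_JOB + processing


print(store_cost(150, 2_000_000))  # 12.28
print(job_cost(30, 10))            # 15.05
```

With figures like these, a single half-hour job costs more than a month of storage, so the number of Analytics Units is the setting to watch.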

Page 31: Analyzing StackExchange data with Azure Data Lake

Azure Data Lake Store vs Blob Storage

No Limitations
Store whatever you want in any format

Security
Built-in Azure Active Directory support

Pricing
More expensive than Storage RA-GRS

Redundancy
It’s there but no control over it

Built for Scale
Optimized for high-scale reads

Integration
With Data Factory, Data Catalog & HDInsight

Page 32: Analyzing StackExchange data with Azure Data Lake

Page 33: Analyzing StackExchange data with Azure Data Lake

Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
  ➔ Analyse today & explore tomorrow
  ➔ Data Swamps
➔ Data Lake Analytics
  ➔ No cluster management
  ➔ Re-use existing skills
  ➔ Pay for what we use
➔ Big Data in Azure? Azure Data Lake family and it’s easy!
