Analyzing StackExchange data with Azure Data Lake


Tom Kerkhove

http://www.twitter.com/TomKerkhove

https://be.linkedin.com/in/tomkerkhove


Nice to meet you

Tom KERKHOVE
➔ Integration Professional
➔ IoT Competency Lead
➔ Windows Development & Microsoft Azure MVP

tom.kerkhove@codit.eu
+32 473 701 074
@TomKerkhove
be.linkedin.com/in/tomkerkhove
github.com/tomkerkhove

Agenda
• Why should we care about Big Data?
• Big Data in Azure
• Azure Data Lake
• Demo
• Q & A


Integration of Things

Internet of Things

Connect and scale with efficiency

Analyze and act on new data

Integrate and transform business processes

Business Systems

Event producers & gateways

Ingestion & transformation

Report, Act, Predict

Microsoft Patterns & Practices – IoT Journey

Cluster Management


Languages

[Figure: overview of the Microsoft Azure platform, layered as Platform Services, Infrastructure Services (OS/Server, Compute, Storage) and Datacenter Infrastructure (24 Regions, 22 Online). Service groups shown: Web and Mobile, Media & CDN, Integration, Hybrid Operations, Networking, Data, Compute, Developer Services, Analytics & IoT, Security & Management. The Analytics & IoT group lists HDInsight, Machine Learning, Stream Analytics, Data Factory, Event Hubs, Mobile Engagement, Data Lake, IoT Hub and Data Catalog.]

Overview in Azure

Categories: Data Ingestion, Data Storage, Data Pipelines, Machine Learning, Data Analytics

Services shown: Event Hubs, IoT Hub, Storage, SQL Database, DocumentDB, SQL Data Warehouse, Data Lake (Store & Analytics), Data Factory, Stream Analytics, HDInsight, Virtual Machine

Cortana Analytics Suite
➔ Personal Digital Assistant – Cortana
➔ Perceptual Intelligence
➔ Preconfigured Solutions
➔ Dashboards and Visualizations
➔ Machine Learning and Analytics
➔ Big Data Store
➔ Information Management

Analysing Big Data in Azure

Azure Data Lake Family

HDInsight
• Managed cluster service
• Open-source technology
• Runs on Windows or Linux

Data Lake Store
• Unlimited storage
• WebHDFS Store

Data Lake Analytics
• Managed job service
• U-SQL batch-processing

Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security

➔ The big data store in Azure
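WebHDFS compatibility means the store can be addressed through the standard WebHDFS REST conventions: a `/webhdfs/v1/` path prefix plus an `op` query parameter. The sketch below only builds such URLs and does not call the service. The account name `mydatalake`, the folder layout, and the `<account>.azuredatalakestore.net` host pattern are assumptions made for illustration, and a real request would also need an Azure Active Directory bearer token.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style REST URL for a Data Lake Store account."""
    # Assumption: the account is reachable at <account>.azuredatalakestore.net.
    host = f"{account}.azuredatalakestore.net"
    # WebHDFS paths live under /webhdfs/v1; keep '/' separators unescaped.
    clean_path = quote(path.strip("/"), safe="/")
    # The operation (LISTSTATUS, OPEN, ...) is passed as the 'op' parameter.
    query = urlencode({"op": op, **params})
    return f"https://{host}/webhdfs/v1/{clean_path}?{query}"


# Hypothetical account and paths, used only to show the URL shape.
list_url = webhdfs_url("mydatalake", "/stackexchange/raw", "LISTSTATUS")
open_url = webhdfs_url(
    "mydatalake", "/stackexchange/raw/Posts.xml", "OPEN", offset="0", length="1024"
)

print(list_url)
print(open_url)
```

Because the addressing scheme is the generic WebHDFS one, tools that already speak WebHDFS should need little more than a different host and authentication to work against the store.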


Characteristics
➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data has already been transformed

Data Lake vs Data Warehousing
➔ Data Lake
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, it turns into a data swamp pretty fast

Martin Fowler on Data Lake & Data Warehouses (link)

Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
  ➔ Don’t worry about scale
➔ Written in U-SQL
  ➔ SQL Syntax
  ➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only

Writing U-SQL scripts


Extract from a data source by using built-in or custom extractors.

Transform / analyse the data using SQL syntax, in-line C# or C# method calls.

Output the result to a data source by using built-in or custom outputters.
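The three stages above can be mimicked in plain Python to make the flow concrete. This is not U-SQL and not how the service executes jobs; it is a minimal sketch of the same extract, transform, output shape. The tab-separated sample rows and the column names `id`, `tag` and `score` are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-separated input standing in for a file in the store.
RAW = "1\tazure\t10\n2\tpython\t4\n3\tazure\t7\n"


def extract(text):
    """Extract: turn raw text into typed rows (the role of an extractor)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [{"id": int(r[0]), "tag": r[1], "score": int(r[2])} for r in reader]


def transform(rows):
    """Transform: total score per tag, similar to GROUP BY tag with SUM(score)."""
    totals = Counter()
    for row in rows:
        totals[row["tag"]] += row["score"]
    # Highest total first; ties broken alphabetically by tag.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def output(results):
    """Output: serialise the result as CSV (the role of an outputter)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["tag", "total_score"])
    writer.writerows(results)
    return buf.getvalue()


result_csv = output(transform(extract(RAW)))
print(result_csv)
```

Keeping each stage a separate function mirrors why custom extractors and outputters are useful: the transformation in the middle does not need to know which file format sits on either side.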


Data Lake Analytics - Data Sources

[Diagram: a U-SQL query running in Azure Data Lake Analytics against five data sources: Azure Storage Blobs, Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse and Azure SQL in VMs. The arrows are labelled Query and Write.]

Meet StackExchange
➔ Over 280 sub-sites
➔ 150+ GB of open-source data
➔ Different kinds of data
  ➔ Posts
  ➔ Users
  ➔ Votes
  ➔ ...
➔ A big data sample data set
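To give a feel for the data, here is a minimal Python sketch that reads a few post records. It assumes the dump ships each kind of data as an XML file of `<row .../>` elements with attributes such as `PostTypeId` (1 for questions, 2 for answers) and `Tags` stored as `<tag1><tag2>`. The three sample rows are made up for illustration and are not taken from the real data set.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hypothetical sample in the assumed shape of a Posts.xml file.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" Score="12" Tags="&lt;azure&gt;&lt;u-sql&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="5" />
  <row Id="3" PostTypeId="1" Score="3" Tags="&lt;azure&gt;" />
</posts>"""

POST_TYPES = {"1": "question", "2": "answer"}


def summarize(xml_text):
    """Count posts per type and how often each tag is used."""
    type_counts, tag_counts = Counter(), Counter()
    for row in ET.fromstring(xml_text).iter("row"):
        type_counts[POST_TYPES.get(row.get("PostTypeId"), "other")] += 1
        # Tags arrive as "<a><b>"; answers carry no Tags attribute at all.
        tags = row.get("Tags", "")
        for tag in tags.replace("><", " ").strip("<>").split():
            tag_counts[tag] += 1
    return type_counts, tag_counts


type_counts, tag_counts = summarize(SAMPLE)
print(dict(type_counts))
print(dict(tag_counts))
```

Loading a whole file with `ET.fromstring` only works for small samples; for files of many gigabytes a streaming parser such as `xml.etree.ElementTree.iterparse`, or a distributed job, is the more realistic choice.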

What Are We Going To Do?

Acquiring The Data
• Downloading the original data set

Moving The Data
• Upload data set to Azure
• Determine what service to use

Visualizing The Data
• Visualize what we’ve learned

Azure Data Lake tools for Visual Studio
➔ Projects / Solutions / Source control
➔ Store Explorer
  ➔ Browse store
  ➔ Download complete / subset of file
  ➔ Preview
➔ Job Visualizer
  ➔ Determine bottlenecks by using heatmaps
  ➔ Playback jobs based on telemetry
➔ Query optimization
➔ Job Profiler
➔ Off-line execution

Integration with Azure Services
➔ Integrate in your data pipelines in Azure Data Factory
  ➔ Move data from Azure Data Lake Store to another store
  ➔ Move data to Azure Data Lake Store
  ➔ Run U-SQL query within pipeline
➔ Integration with Azure Data Catalog
  ➔ Register your Azure Data Lake Store assets

Pricing
➔ Data Lake Store
  ➔ $0.08/GB stored per month
  ➔ $0.14 per 1M transactions
    • 1 transaction is a block of up to 128 kB
  ➔ Egress will be billed, but the price is not known yet
➔ Data Lake Analytics
  ➔ $0.05 per job
  ➔ $0.05 per minute per Analytics Unit for processing time
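The list prices above can be turned into a rough estimate. The Python sketch below is a back-of-the-envelope calculation, not an official calculator. It assumes 1 kB = 1024 bytes, that transactions are charged pro rata rather than per started million, and it ignores egress because that price is not given. The workload figures (150 GB read once, 10 jobs of 4 minutes on 5 Analytics Units) are hypothetical.

```python
import math

# Prices as listed on the slide (USD).
STORAGE_PER_GB_MONTH = 0.08
PRICE_PER_MILLION_TX = 0.14
TX_BLOCK_KB = 128           # one transaction covers a block of up to 128 kB
PRICE_PER_JOB = 0.05
PRICE_PER_AU_MINUTE = 0.05


def store_cost(gb_stored, kb_transferred):
    """Monthly store cost: storage plus transactions (egress not included)."""
    # Every started 128 kB block counts as one transaction.
    transactions = math.ceil(kb_transferred / TX_BLOCK_KB)
    # Assumption: transactions are billed pro rata, not per started million.
    tx_cost = transactions / 1_000_000 * PRICE_PER_MILLION_TX
    return gb_stored * STORAGE_PER_GB_MONTH + tx_cost


def analytics_cost(jobs, analytics_units, minutes_per_job):
    """Analytics cost: a flat fee per job plus AU-minutes of processing."""
    per_job = PRICE_PER_JOB + analytics_units * minutes_per_job * PRICE_PER_AU_MINUTE
    return jobs * per_job


# Hypothetical month: keep 150 GB in the store and read all of it once.
monthly_store = store_cost(150, 150 * 1024 * 1024)
# Hypothetical workload: 10 jobs, each 4 minutes on 5 Analytics Units.
monthly_jobs = analytics_cost(10, 5, 4)

print(round(monthly_store, 2))  # 12.17
print(round(monthly_jobs, 2))   # 10.5
```

In this example the transactions add only about $0.17 to $12.00 of storage, while the analytics bill is driven almost entirely by AU-minutes rather than the per-job fee.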


Azure Data Lake Store vs Blob Storage

No Limitations: Store whatever you want in any format
Security: Built-in Azure Active Directory support
Pricing: More expensive than Storage RA-GRS
Redundancy: It’s there but no control over it
Built for Scale: Optimized for high-scale reads
Integration: With Data Factory, Data Catalog & HDInsight

Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
  ➔ Analyse today & explore tomorrow
  ➔ Data Swamps
➔ Data Lake Analytics
  ➔ No cluster management
  ➔ Re-use existing skills
  ➔ Pay for what we use
➔ Big Data in Azure? Azure Data Lake family and it’s easy!
