SQLBits 2016 Azure Data Lake & U-SQL Michael Rys, @MikeDoesBigData http://www.azure.com/datalake {mrys, usql}@microsoft.com

Azure Data Lake Intro (SQLBits 2016)

Page 1: Azure Data Lake Intro (SQLBits 2016)

SQLBits 2016

Azure Data Lake & U-SQL
Michael Rys, @MikeDoesBigData

http://www.azure.com/datalake
{mrys, usql}@microsoft.com

Page 2: Azure Data Lake Intro (SQLBits 2016)

The Data Lake Approach

Page 3: Azure Data Lake Intro (SQLBits 2016)

Growth of data

[Figure: growth of data over time, 1985 to 2020. Labels: ANALOG, DIGITAL, CONNECTED, INTERNET, MOBILE, CLOUD]

Page 4: Azure Data Lake Intro (SQLBits 2016)

Traditional data warehousing approach

Data sources
ETL
Data warehouse
BI and analytics
Dashboards
Reporting

• Understand Corporate Strategy
• Gather Requirements
• Business Requirements
• Technical Requirements
• Implement Data Warehouse
• Dimension Modelling
• Physical Design
• ETL Design
• ETL Development
• Setup Infrastructure
• Install and Tune
• Reporting & Analytics Design
• Reporting & Analytics Development

Page 5: Azure Data Lake Intro (SQLBits 2016)

The Data Lake approach

Ingest all data regardless of requirements

Store all data in native format without schema definition

Do analysis using analytic engines like Hadoop

Interactive queries

Batch queries

Machine Learning

Data warehouse

Real-time analytics

Devices

Page 6: Azure Data Lake Intro (SQLBits 2016)

MICROSOFT DOUBLES SEARCH SHARE

[Chart: search share by year. 2009: 9%, 2010: 11%, 2011: 15%, 2012: 16%, 2013: 18%, 2014: 19%, 2015: 20%]

Source: ComScore 2009-2015 Search Report US

How Microsoft has used Big Data
We needed to better leverage data and analytics to win in search.
We changed our approach:
• More experiments by more people!

So we…
• Built an exabyte-scale data lake for everyone to put their data.
• Built tools approachable by any developer.
• Built machine learning tools for collaborating across large experiment models.

Page 7: Azure Data Lake Intro (SQLBits 2016)

Introducing Azure Data Lake
Big Data Made Easy

Page 8: Azure Data Lake Intro (SQLBits 2016)

Cortana Analytics Suite
Big Data & Advanced Analytics

DATA
• Business apps
• Custom apps
• Sensors and devices

INTELLIGENCE

ACTION
• People
• Automated Systems

Information Management
• Azure Data Factory
• Azure Data Catalog
• Azure Event Hub

Big Data Stores
• Azure SQL Data Warehouse
• Azure Data Lake Store

Machine Learning and Analytics
• Azure Machine Learning
• Azure Stream Analytics

Azure Data Lake Analytics

Azure Data Lake managed clusters

Dashboards and Visualizations
• Power BI

Personal Digital Assistant
• Cortana

Perceptual Intelligence
• Face, vision
• Speech, text

Business Scenarios
• Recommendations, customer churn, forecasting, etc.

Page 9: Azure Data Lake Intro (SQLBits 2016)

Analytics

Storage

HDInsight (“managed clusters”)

Azure Data Lake Analytics

Azure Data Lake Storage

Azure Data Lake

Page 10: Azure Data Lake Intro (SQLBits 2016)

Azure Data Lake Storage Service

Page 11: Azure Data Lake Intro (SQLBits 2016)

No limits to SCALE

Store ANY DATA in its native format

HADOOP FILE SYSTEM (HDFS) for the cloud

ENTERPRISE GRADE access control, encryption at rest

Optimized for analytic workload PERFORMANCE

Azure Data Lake Store
A hyper scale repository for big data analytics workloads

IN PREVIEW

Page 12: Azure Data Lake Intro (SQLBits 2016)

Data Lake Store: Built for the cloud

• Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
• Native format: Must permit data to be stored in its ‘native format’ to track lineage and for data provenance.
• Low latency: Must have low latency for high-frequency operations.
• Multiple analytic frameworks: Must support multiple analytic frameworks: Batch, Real-time, Streaming, Machine Learning, etc. No one analytic framework can work for all data and all types of analysis.
• Details: Must be able to store data with all details; aggregation may lead to loss of details.
• Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
• Reliable: Must be highly available and reliable (no permanent loss of data).
• Scalable: Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up.
• All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.

Page 13: Azure Data Lake Intro (SQLBits 2016)

Four pillars of security and compliance

• Authentication: Azure Active Directory, OAuth
• Authorization: Role-based access control, POSIX ACLs
• Auditing: Audit logs, Forensic analysis
• Data Protection: Transparent encryption, Key management

Page 14: Azure Data Lake Intro (SQLBits 2016)

Scenario: Securing a Big Data pipeline

[Diagram labels: Social, Clickstream, Web; Contoso; Acme.com]

CONTOSO
• Retail company with large market presence.
• Records sales transactions, user interactions on website, social data, etc.

RETAIL ANALYTICS (Acme.com)
• Provides social media-based sentiment analysis.
• Hired by Contoso to:
  • Develop insights from social-media information combined with user activity on Contoso portal.

TASK FOR CONTOSO IT ADMIN:
• Provide Acme.com employees access to Contoso data.
• Allow Acme.com employees to submit U-SQL jobs.
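
The access task above can be pictured with a small sketch of POSIX-style ACL entries. This is a simplified illustration, not the Data Lake Store implementation: it ignores groups and the ACL mask, and the names `AclEntry`, `parse_acl`, `allows`, and the user `acme_analyst` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AclEntry:
    scope: str  # "user", "group", or "other"
    name: str   # empty string means the owning user or group
    perms: str  # three characters such as "r-x"


def parse_acl(spec: str) -> List[AclEntry]:
    """Parse a comma-separated ACL string such as 'user:alice:r-x,other::---'."""
    entries = []
    for part in spec.split(","):
        scope, name, perms = part.strip().split(":")
        entries.append(AclEntry(scope, name, perms))
    return entries


def allows(entries: List[AclEntry], user: str, wanted: str) -> bool:
    """Simplified check: a named-user entry wins, otherwise fall back to 'other'.

    Assumption: groups and the mask entry are ignored in this sketch.
    """
    for entry in entries:
        if entry.scope == "user" and entry.name == user:
            return wanted in entry.perms
    for entry in entries:
        if entry.scope == "other":
            return wanted in entry.perms
    return False


# Hypothetical ACL on a Contoso folder: one Acme.com analyst may read and list.
acl = parse_acl("user::rwx,user:acme_analyst:r-x,group::r--,other::---")
```

With this ACL, `allows(acl, "acme_analyst", "r")` is true, while write access for the same user and any access for an unlisted user are refused.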

Page 15: Azure Data Lake Intro (SQLBits 2016)

FULLY SUPPORTED Hadoop for the cloud

Available on LINUX and WINDOWS

Works on AZURE STORAGE or DATA LAKE STORE

100% OPEN SOURCE Apache Hadoop (HDP 2.3)

Clusters up and RUNNING IN MINUTES

Use familiar BI TOOLS FOR ANALYSIS like Excel

Azure HDInsight
Hadoop Platform as a Service on Azure

Page 16: Azure Data Lake Intro (SQLBits 2016)

Azure Data Lake Analytics Service

Page 17: Azure Data Lake Intro (SQLBits 2016)

Azure Data Lake (Store, HDInsight, Analytics)

[Diagram labels. Storage: Store, WebHDFS. Analytics: YARN, ADL Analytics (U-SQL), ADL HDInsight (Hive)]

Page 18: Azure Data Lake Intro (SQLBits 2016)

ADLA complements HDInsight
Target the same scenarios, tools, and customers

HDInsight
• For developers familiar with the Open Source: Java, Eclipse, Hive, etc.
• Clusters offer customization, control, and flexibility in a managed Hadoop cluster

ADLA
• Enables customers to leverage existing experience with C#, SQL & PowerShell
• Offers convenience, efficiency, automatic scale, and management in a “job service” form factor

Page 19: Azure Data Lake Intro (SQLBits 2016)

No limits to SCALE

Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#

Optimized to work with ADL STORE

FEDERATED QUERY across Azure data sources

ENTERPRISE GRADE role-based access control and auditing

Pay PER QUERY and scale PER QUERY

Azure Data Lake Analytics
A distributed analytics service built on Apache YARN that dynamically scales to your needs

IN PREVIEW

Page 20: Azure Data Lake Intro (SQLBits 2016)

ADL and SQLDW

[Diagram labels: XML, JSON, TEXT; Unknown Value Data; High Value Data; Batch, Ad-hoc, Batch]

Preparation
• Pre-process
• Transpose
• Re-format

Model
• Load
• Transform
• Aggregate
• Consume

Page 21: Azure Data Lake Intro (SQLBits 2016)

Work across all cloud data

Azure Data Lake Analytics

Azure SQL DW

Azure SQL DB

Azure Storage Blobs

Azure Data Lake Store

SQL DB in an Azure VM

Page 22: Azure Data Lake Intro (SQLBits 2016)

Demo
Show me ADL!

Page 23: Azure Data Lake Intro (SQLBits 2016)

Simplified management and administration

• Web-based management in Azure Portal
• Automate tasks using PowerShell
• Role-based access control with Azure AD
• Monitor service operations and activity

Page 24: Azure Data Lake Intro (SQLBits 2016)

Get started

1. Log in to Azure
2. Create an ADLA account
3. Write and submit an ADLA job with U-SQL (or Hive/Pig)
4. The job reads and writes data from storage

30 seconds

Storage: ADLS, Azure Blobs, Azure DB, …

Page 25: Azure Data Lake Intro (SQLBits 2016)

Azure Data Lake SDK/CLI

Page 26: Azure Data Lake Intro (SQLBits 2016)

ADL Store (ADLS) feature set (not exhaustive)

Account Management
• Create new account
• List accounts
• Update account properties
• Delete account

Transferring Data
• Upload into store from local disk
• Download from store to local disk

Files and Folders
• List contents of folder
• Create
• Move
• Delete
• Does file exist

Security
• Get ACLs
• Update ACLs
• Get Owner
• Set Owner

File Content
• Set file content
• Append file content
• Get file content
• Merge files

Page 27: Azure Data Lake Intro (SQLBits 2016)

ADL Analytics (ADLA) feature set (not exhaustive)

Account Management
• Create new account
• List accounts
• Update account properties
• Delete account

Data Sources
• Add a data source
• List data sources
• Update data source
• Delete data source

Compute
• List jobs
• Submit job
• Cancel job

Catalog Items
• List items in U-SQL catalog
• Update item

Catalog Secrets
• Create catalog secret
• List catalog secrets
• Delete catalog secrets

Page 28: Azure Data Lake Intro (SQLBits 2016)

SDKs/APIs: development options

Your application

• ADL .NET SDKs
• ADL PowerShell
• ADL XPlat CLI
• ADL Node.js SDK
• ADL Java SDK

Azure and ADL REST APIs

Page 29: Azure Data Lake Intro (SQLBits 2016)

Five REST API endpoints

Analytics
• Management: Create and manage ADLA accounts
• Jobs: Submit and manage jobs
• Catalog: Explore catalog items

Store
• Management: Create and manage ADLS accounts
• File System (WebHDFS): Upload, download, list, delete, rename, append
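
The File System endpoint speaks WebHDFS, so each operation is an HTTP request with an `op` query parameter. The sketch below only builds request URLs and sends nothing. The operation names and HTTP methods are standard WebHDFS. The host pattern `<account>.azuredatalakestore.net`, the helper name `webhdfs_url`, and the account name `contoso` are assumptions for illustration.

```python
from urllib.parse import quote, urlencode

# Standard WebHDFS operations and the HTTP method each one uses.
WEBHDFS_METHODS = {
    "LISTSTATUS": "GET",   # list
    "OPEN": "GET",         # download
    "CREATE": "PUT",       # upload
    "APPEND": "POST",      # append
    "RENAME": "PUT",       # rename
    "DELETE": "DELETE",    # delete
}


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS URL for a store account (host pattern is an assumption)."""
    if op not in WEBHDFS_METHODS:
        raise ValueError(f"unsupported WebHDFS operation: {op}")
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    query = urlencode({"op": op, **params})
    return f"{base}{quote(path.lstrip('/'))}?{query}"


list_url = webhdfs_url("contoso", "/logs/2016", "LISTSTATUS")
rename_url = webhdfs_url("contoso", "/logs/2016", "RENAME",
                         destination="/archive/2016")
```

A real call would also need an `Authorization: Bearer <token>` header carrying an Azure AD token.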

Page 30: Azure Data Lake Intro (SQLBits 2016)

Developer landscape: .NET SDKs

Analytics .NET SDK
• Management
• Catalog
• Jobs

Store .NET SDK
• Management
• Filesystem
• Uploader

SDKs NuGet packages
Available from NuGet (nuget.org – search for DataLake)
• Microsoft.Azure.Management.DataLake.Store
• Microsoft.Azure.Management.DataLake.Analytics
• Microsoft.Azure.Management.DataLake.Uploader

Page 31: Azure Data Lake Intro (SQLBits 2016)

Using the SDKs

Workflow
1. Authenticate using OAuth 2.0 grant flow
   Get an OAuth token from Azure Active Directory (Azure AD, AAD)
2. Setup
   Create a service client object
3. Do work
   Call methods on the client object
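
The three workflow steps (authenticate, set up a client, do work) can be sketched without any SDK. This is a rough illustration using only the Python standard library, not the ADL SDK API. It assumes the client-credentials grant against the Azure AD token endpoint and the Azure management resource URI. `build_token_request` and `SketchClient` are hypothetical names, and the requests are built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str,
                        resource: str = "https://management.core.windows.net/") -> Request:
    """Step 1: prepare an OAuth 2.0 client-credentials token request for Azure AD."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": resource,
    }).encode("ascii")
    return Request(url, data=body, method="POST")


class SketchClient:
    """Step 2: a stand-in for a service client object that carries the token."""

    def __init__(self, access_token: str) -> None:
        self._headers = {"Authorization": f"Bearer {access_token}"}

    def get(self, url: str) -> Request:
        """Step 3: 'do work' by preparing an authorized GET request."""
        return Request(url, headers=self._headers, method="GET")


# Placeholder values; a real token comes from the JSON response to token_request.
token_request = build_token_request("my-tenant", "my-app-id", "my-secret")
client = SketchClient("placeholder-token")
work_request = client.get("https://example.invalid/jobs")
```

Passing either prepared request to `urllib.request.urlopen` would actually send it. The SDKs wrap these steps behind typed client classes.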

Page 32: Azure Data Lake Intro (SQLBits 2016)

http://aka.ms/AzureDataLake