Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
A first look at U-SQL on Azure Data Lake Public Preview
Jason Brugger (@JasonLBrugger)
MCSE: Data Platform, MCSE: Business Intelligence
July 16, 2016

Page 2: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

This presentation has been modified from its original format. Animations have been removed and it has been reformatted for publication on Slideshare.net.

Page 3: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Assumptions

• You are familiar with the differences between a traditional RDBMS and a Big Data solution.

• You are familiar with both T-SQL and C#.

Page 4: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

What is a data lake? What are Azure Data Lake Store and Azure Data Lake Analytics?

• A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. – Margaret Rouse (SearchAWS, TechTarget)

• Pentaho CTO James Dixon has generally been credited with coining the term “data lake”. He describes a data mart (a subset of a data warehouse) as akin to a bottle of water… “cleansed, packaged and structured for easy consumption” while a data lake is more like a body of water in its natural state. – Chris Campbell, Blue Granite

• Data Lake Analytics is an Azure Big Data computation service that lets you use data to drive your business using the insights gained from your data in the cloud, regardless of where it is and regardless of its size. – Ed Macauley, Microsoft

Data Lake Store

Data Lake Analytics

Page 5: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

ADLA vs. HDInsight (e.g. Hadoop)

• HDInsight (Cluster as a service)
  • Provision cluster of n nodes
  • Run your queries
  • Delete cluster
  • (Repeat)

• ADLA (Query as a service)
  • Don’t provision anything
  • Specify node count (parallelism) at job submission time
  • Pay per query

Page 6: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Getting Started – What’s Needed?

• Azure subscription
• Sign up for the ADL preview
• Visual Studio 2015 + Azure Data Lake Tools for Visual Studio
• Microsoft Azure PowerShell (1.0+ via Web Platform Installer)
  • Not to be confused with the version of Windows PowerShell itself, e.g. 5.0.
• Microsoft Azure SDK for .NET (Optional)

Page 9: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Tools and Navigation

• Azure Portal
• Visual Studio
  • Server Explorer
  • Project Templates
• PowerShell
• SDKs
  • C#
  • Node.js

Page 11: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Getting data into ADL

• Portal
• PowerShell
  • Login-AzureRmAccount
  • Import-AzureRmDataLakeStoreItem
• Connecting to External Data (Demo #2)
• SSIS
• ADF

Page 12: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

The Data (NOAA Weather observations)

Station      Datekey   Element  Value  Mflag  Qflag  Sflag  TimeKey
US1FLSL0019  20150101  PRCP     173                  N
US1TXTV0133  20150101  PRCP     119                  N
USC00178998  20150101  TMAX     -33                         700
USC00178998  20150101  TMIN     -167                        700
USC00178998  20150101  TOBS     -67                         700
USC00178998  20150101  PRCP     0                           700
USC00178998  20150101  SNOW     0
USC00178998  20150101  SNWD     0
USR0000CSNR  20150101  TMAX     194    H      D      U

Notice sparse data with many null values

Multiple observation types, per site, per day

Temperature values (TMAX, TMIN, TOBS) are in tenths of a degree C

Correlates to external data uploaded to Azure SQL Database
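
The sparse flag and time columns in the observation data map naturally onto nullable (optional) fields. The following is a minimal Python sketch of that idea, not U-SQL. It assumes a headerless, comma-separated layout with the eight columns in the order shown in the table; the `Observation` class, the `parse_observations` helper, and the placement of the sample values in specific columns are illustrative assumptions.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    station: str
    datekey: int
    element: str
    value: int
    mflag: Optional[str]    # frequently empty in the source data
    qflag: Optional[str]    # frequently empty in the source data
    sflag: Optional[str]    # frequently empty in the source data
    timekey: Optional[int]  # frequently empty in the source data


def _none_if_blank(text: str) -> Optional[str]:
    """Treat an empty field as a missing (null) value."""
    return text if text else None


def parse_observations(csv_text: str) -> List[Observation]:
    """Parse headerless, comma-separated observation rows (assumed layout)."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text)):
        station, datekey, element, value, mflag, qflag, sflag, timekey = rec
        rows.append(Observation(
            station, int(datekey), element, int(value),
            _none_if_blank(mflag), _none_if_blank(qflag), _none_if_blank(sflag),
            int(timekey) if timekey else None,
        ))
    return rows


# Hypothetical sample lines modelled on the table above.
SAMPLE = (
    "US1FLSL0019,20150101,PRCP,173,,,N,\n"
    "USC00178998,20150101,TMAX,-33,,,,700\n"
    "USC00178998,20150101,SNOW,0,,,,\n"
)
```

Declaring the optional columns explicitly plays the same role as choosing nullable C# types when schematizing the file: a blank field becomes a null instead of a parse error.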

Page 16: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Basic U-SQL query

• Load .csv file from Data Lake Store using built-in Extractor
• Schematize using C# data types, note nullability
• Output schematized rows to a table variable
• SELECT using familiar SQL-like query
• Output query result to Data Lake Store using built-in Outputter

Page 17: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Key Takeaways & ‘gotchas’

• U-SQL keywords (SELECT, FROM, etc.) MUST be uppercase
• Header rows not currently supported by Extractor
  • e.g. skipFirstNRows:1 not currently supported
• Be mindful of nullability in C# types
• Built-in operators include support for .Csv(), .Tsv(), & .Text()
  • Various options such as delimiter
• Build custom extractors by inheriting IExtractor

Page 18: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

DEMO #1

• Demo local execution
• Simple aggregation of 10,000 rows down to 43, by element type

Page 19: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Persisted schema with metadata object model

Familiar CREATE DATABASE statement

Familiar CREATE VIEW statement; the view maintains the extractor and schema definitions, so from now on we can just select from the view.

Note data kept in its native compressed (.gz) format. Extractor handles decompression in this case.

Wildcard {*} yields file-set of all matching files
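
As a rough analogy for the file-set idea above, the Python sketch below treats every gzip file matching a wildcard as one logical stream of rows and decompresses on the fly. It is not U-SQL and says nothing about how the extractor is implemented; `read_fileset` and the glob pattern are hypothetical names chosen for illustration.

```python
import glob
import gzip
from typing import Iterator


def read_fileset(pattern: str) -> Iterator[str]:
    """Yield decompressed text lines from every .gz file matching a glob pattern."""
    for path in sorted(glob.glob(pattern)):
        # "rt" opens the compressed file as text, so callers never see raw bytes.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

A call such as `read_fileset("data/*.csv.gz")` (path assumed) would then iterate over all yearly files as if they were one input, which is the convenience the view-plus-wildcard pattern provides.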

Page 20: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Combining with external data

• Create catalog secret using PowerShell (specifies remote host, credentials, and ADLA catalog)
  • New-AzureRmDataLakeAnalyticsCatalogSecret
• Create credential (in turn, references catalog secret)
• Create data source (in turn, references credential, specifies data source type (e.g. Azure SQL Db), and specifies remote catalog)

CREATE EXTERNAL TABLE denotes underlying table resides remotely

Schema using C# types

The data source name

Remote table name

Page 21: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

External data with federated query

SELECT FROM EXTERNAL data source EXECUTE

Embedded query executes remotely at data source. This is T-SQL, not U-SQL

Table variable contains only rows returned

id           name
US1FLHB0090  TAMPA 10.2 NNW
US1FLHB0048  GREATER NORTHDALE 0.4 ENE
USW00012810  MACDILL AFB
USC00088890  TEMPLE TERRACES
US1FLHB0007  TAMPA 8.4 NW
US1FLHB0025  CARROLLWOOD 1.7 SE
US1FLHB0040  UNIVERSITY WEST 2.0 WNW
USC00088786  TAMPA
US1FLHB0028  WEST PARK 0.4 S
US1FLHB0055  TAMPA 5.0 NNE
US1FLHB0012  CARROLLWOOD 0.5 WNW
US1FLHB0096  TAMPA 5.4 SSW
US1FLHB0005  CITRUS PARK 1.3 ENE
USC00080520  BAY LAKE
US1FLHB0093  TEMPLE TERRACE 1.5 SE
US1FLHB0039  CARROLLWOOD 2.0 SSE
US1FLHB0087  TAMPA 7.9 N
US1FLHB0071  TAMPA 6.1 N
USW00012842  TAMPA INTL AP
US1FLHB0010  TAMPA 5.1 S
US1FLHB0036  TAMPA 4.4 SSW
US1FLHB0029  TAMPA 6.5 NNE
US1FLHB0064  TAMPA 4.7 NW
US1FLHB0051  LUTZ 2.2 SSE

Page 23: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Data relationship exhibit

[Diagram labels: Azure SQL Db (dbo.station, dbo.calendar); Federated Query; Azure Data Lake Analytics (@station_tpa, dbo.calendar, dbo.observation); Result; Azure Data Lake Store; U-SQL]

Page 24: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Complex types in U-SQL & EXPLODE

• SQL.ARRAY
  • Like a List or Array in C#
  • Can be used in conjunction with String.Split()
• SQL.MAP
  • Key-Value pairs
  • Like a Dictionary (Hash table) in C#
• EXPLODE
  • Expands to rows

ID (int)  Data (SQL.MAP)
1         ((“A”,25),(“B”,35),(“C”,45))
2         ((“A”,27),(“B”,38),(“C”,42))

EXPLODE

ID
1   A  25
1   B  35
1   C  45
2   A  27
2   B  38
2   C  42
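
The EXPLODE exhibit above can be modelled in a few lines of Python. This is only a sketch of the row-expansion semantics, with a `dict` standing in for SQL.MAP; the `explode` function and `TABLE` name are illustrative and are not part of U-SQL.

```python
from typing import Dict, Iterable, Iterator, Tuple


def explode(rows: Iterable[Tuple[int, Dict[str, int]]]) -> Iterator[Tuple[int, str, int]]:
    """Expand each (id, map) row into one (id, key, value) row per map entry."""
    for row_id, data in rows:
        for key, value in data.items():
            yield row_id, key, value


# The two input rows from the exhibit.
TABLE = [
    (1, {"A": 25, "B": 35, "C": 45}),
    (2, {"A": 27, "B": 38, "C": 42}),
]

for row in explode(TABLE):
    print(row)
```

Two input rows with three entries each become six output rows, each carrying the original ID alongside one key and one value.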

Page 25: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Tying it all together with U-SQL

Familiar JOIN syntax; note double equals “==”, the only comparison operator supported in the JOIN condition

Familiar WHERE and GROUP BY syntax

Page 28: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Tying it all together with U-SQL

CROSS APPLY the value (recall this was in tenths of degrees C)

Declare a new SQL.MAP with conversion factors by C, F, and K

EXPLODE the SQL.MAP into rows and new columns scale, temp
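
To make the conversion step concrete, here is a minimal Python sketch, assuming the standard Celsius, Fahrenheit, and Kelvin formulas (the deck's exact expressions are not shown in the text). It builds a per-row map keyed by scale and then expands it into (element, scale, temp) rows; `temps_by_scale` and `explode_scales` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def temps_by_scale(tenths_c: int) -> Dict[str, float]:
    """Convert a reading in tenths of a degree C into a map keyed by scale."""
    celsius = tenths_c / 10.0
    return {
        "C": celsius,
        "F": celsius * 9.0 / 5.0 + 32.0,
        "K": celsius + 273.15,
    }


def explode_scales(element: str, tenths_c: int) -> List[Tuple[str, str, float]]:
    """Expand one observation into one (element, scale, temp) row per scale."""
    return [(element, scale, temp) for scale, temp in temps_by_scale(tenths_c).items()]


# One TMAX reading of 194 tenths of a degree C becomes three rows.
for row in explode_scales("TMAX", 194):
    print(row)
```

Each observation fans out into three rows, which is why the final result size in the demo is multiplied by the number of temperature scales.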

Page 29: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Tying it all together with U-SQL

Familiar aggregation AVG using exploded column temp

Page 30: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Tying it all together with U-SQL

Using String.Concat .NET method to build description of derived column e.g. “AVG_TMAX_F”

AVG_TMAX_C, AVG_TMAX_K, AVG_TMIN_F, AVG_TMIN_C, AVG_TMIN_K
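
The averaging and labelling steps can be sketched together in Python. This is an illustration of the idea only: it groups exploded (element, scale, temp) rows by a label of the form AVG_<element>_<scale> and averages each group. The real query presumably also groups by year and month, which is omitted here; `average_by_label` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_by_label(rows: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Average temps per (element, scale), keyed by a label such as 'AVG_TMAX_F'."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for element, scale, temp in rows:
        # Plain string concatenation, standing in for String.Concat in the deck.
        label = "".join(["AVG_", element, "_", scale])
        sums[label] += temp
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}
```

For example, two TMAX readings of 70.0 and 80.0 on the F scale would produce a single entry labelled AVG_TMAX_F with the value 75.0.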

Page 34: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Tying it all together with U-SQL

CREATE TABLE AS SELECT (CTAS) – Conceptually similar to SELECT INTO

No HEAPs in ADLA; Clustered index required

Partitioning by Round Robin distributes data evenly.

No update or merge support
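
Why round robin spreads data evenly is easy to see in a small model. The Python sketch below is conceptual only and makes no claim about how ADLA implements it: row i simply goes to bucket i modulo the bucket count, so bucket sizes never differ by more than one. `round_robin` is an illustrative name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def round_robin(rows: Sequence[T], bucket_count: int) -> List[List[T]]:
    """Deal rows into buckets in turn, like dealing cards to players."""
    buckets: List[List[T]] = [[] for _ in range(bucket_count)]
    for index, row in enumerate(rows):
        buckets[index % bucket_count].append(row)
    return buckets


# Ten rows over four buckets.
sizes = [len(bucket) for bucket in round_robin(list(range(10)), 4)]
print(sizes)  # [3, 3, 2, 2]
```

Because placement ignores the row's content, no value skew can produce a hot partition; the trade-off is that rows with the same key are not co-located.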

Page 35: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

DEMO #2

• Data Lake data set consists of daily readings from 98,035 stations over 5 years
  • ~32,720,048 rows per file
  • About 164M rows total
• 24 Tampa stations
• Filtering and aggregating it down to 5 years x 12 months x 2 elements x 3 temperature scales, or 360 rows
• Monitor job execution status
  • Streams, Vertices, Display avg execution time (heat map), Diagnostics, History, Script*
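
The row counts quoted for the demo can be checked with simple arithmetic. The Python sketch below assumes one file per year (consistent with the by_year data source listed in the references) and uses the approximate per-file figure from the slide.

```python
YEARS = 5
ROWS_PER_FILE = 32_720_048   # approximate per-file figure quoted on the slide
MONTHS = 12
ELEMENTS = 2                 # TMAX and TMIN
SCALES = 3                   # C, F, and K

# Assumes one file per year, so five files in total.
total_input_rows = YEARS * ROWS_PER_FILE
result_rows = YEARS * MONTHS * ELEMENTS * SCALES

print(total_input_rows)  # 163600240, i.e. about 164M
print(result_rows)       # 360
```

The job therefore reduces roughly 164 million input rows to 360 output rows.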

Page 36: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Working with Assemblies & Libraries

• Code-behind file
  • Convenient, simple
  • Assembly created and referenced automatically
  • No support for NuGet, but manually add references…

OR:

• Class library
  • Right-click, register assembly to ADLA
  • Option to automatically copy to DLS
  • NuGet supported normally

Page 38: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Example: Simple Linear Regression to predict temp

.dlls for statistics library copied to Data Lake Store

CREATE ASSEMBLY from file; (Can also create from binary)

Custom C# class method signature Noaa.Predict.Regress(int, SqlMap<int, decimal?>) : decimal?

MAP_AGG() function returns SQL.MAP – like a reverse EXPLODE, which we pass as function parameter

Page 39: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

MAP_AGG() Exhibit

Month  Year  Avg Temp
1      2011  34
1      2012  36
1      2013  33
1      2014  35
1      2015  37
2      2011  41
2      2012  39
12     2015  26
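
The exhibit rows can be used to illustrate what a map aggregate does. The Python sketch below models the semantics only: it groups rows by month and folds each group's (year, avg temp) pairs into one map per month, the reverse of EXPLODE. `map_agg` here is a plain Python function named after the U-SQL aggregate, not the aggregate itself.

```python
from typing import Dict, Iterable, Tuple


def map_agg(rows: Iterable[Tuple[int, int, int]]) -> Dict[int, Dict[int, int]]:
    """Group (month, year, avg_temp) rows into {month: {year: avg_temp}}."""
    grouped: Dict[int, Dict[int, int]] = {}
    for month, year, avg_temp in rows:
        grouped.setdefault(month, {})[year] = avg_temp
    return grouped


# The rows shown in the exhibit.
EXHIBIT = [
    (1, 2011, 34), (1, 2012, 36), (1, 2013, 33), (1, 2014, 35), (1, 2015, 37),
    (2, 2011, 41), (2, 2012, 39),
    (12, 2015, 26),
]

print(map_agg(EXHIBIT)[1])  # {2011: 34, 2012: 36, 2013: 33, 2014: 35, 2015: 37}
```

Each month ends up as a single row holding its whole yearly series, which is the shape a per-month regression function needs as input.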

Page 41: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

.CS code-behind file

Return predicted Temp

Year to predict, e.g. 2016

SqlMap contains series against which regression is performed

Referenced Library namespace
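
The code-behind itself was shown as a screenshot and relies on a referenced statistics library, so it cannot be reproduced here. As a stand-in, the Python sketch below fits a hand-rolled ordinary least squares line through a year-to-temperature series and evaluates it at the year to predict. The `regress` function mirrors the shape of the signature quoted earlier (year, map of year to nullable temp, nullable result), but its body is an assumption, not the author's implementation.

```python
from typing import Mapping, Optional


def regress(year_to_predict: int, series: Mapping[int, Optional[float]]) -> Optional[float]:
    """Predict a value for year_to_predict from a {year: temp} series via least squares."""
    # Skip missing readings, mirroring the nullable values in the original signature.
    points = [(x, y) for x, y in series.items() if y is not None]
    if len(points) < 2:
        return None
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # all readings share one year, so no slope can be fitted
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * year_to_predict


# January series from the MAP_AGG exhibit.
january = {2011: 34, 2012: 36, 2013: 33, 2014: 35, 2015: 37}
print(regress(2016, january))  # 36.5
```

For the January series the fitted slope is 0.5 degrees per year around a mean of 35 in 2013, which gives 36.5 for 2016.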

Page 42: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

DEMO #3

• Pivoting our existing averages on Month & aggregating Year & Temp into Key-Value pairs (SQL.MAP) which we pass as parameter to custom function
• Passing predictive year (2016) as a parameter
• Limit our selection to just AVG_TMAX_F
• Result adds another 12 rows of predicted temps to our existing 360 row result table

Page 43: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Additional homework subjects, not covered

• Extractor (UDO) by inheriting IExtractor
• IOutputter
• IProcessor – transform single row, read one, output one
• IReducer – read n rows, output 1 row
• ICombiner – like a user-defined Join
• IApplier – input one row, output n rows
• User-defined Aggregators (IAggregate) – AGG keyword
• ARRAY_AGG()
• Blob as External storage
• No Primary Keys
• No columnstore (yet)
• Table-Valued Functions - YES, but not with CROSS APPLY
• No support for R, but leverage .NET libraries as demo’d
• User-defined Types
• Partitioning by Hash, Direct Hash, Range
  • http://www.slideshare.net/MichaelRys/usql-partitioned-data-and-tables-sqlbits-2016

Page 44: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Reference

• Code
  • https://github.com/SQL-Jason/NOAA_USQL_Demo.git
• Data
  • http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
• Blog
  • http://jasonbrugger.wordpress.com

Page 45: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Attribution

The Accord.NET Framework
Copyright (c) 2009-2014, César Roberto de Souza

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

The copyright holders provide no reassurances that the source code provided does not infringe any patent, copyright, or any other intellectual property rights of third parties. The copyright holders disclaim any liability to any recipient for claims brought against recipient by any third party for infringement of that parties intellectual property rights.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

National Oceanic and Atmospheric Administration (NOAA)
README FILE FOR DAILY GLOBAL HISTORICAL CLIMATOLOGY NETWORK (GHCN-DAILY) Version 3.22

How to cite:

Note that the GHCN-Daily dataset itself now has a DOI (Digital Object Identifier) so it may be relevant to cite both the methods/overview journal article as well as the specific version of the dataset used.

The journal article describing GHCN-Daily is:
Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.

To acknowledge the specific version of the dataset used, please cite:
Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12]. NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ [access date].

Page 46: Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)

Bibliography

• Campbell, C. “Top Five Differences between Data Lakes and Data Warehouses.” Business Insights. Blue Granite, 26 Jan 2015. Web. https://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
• Gopalan, R. (21 Jun 2016). U-SQL Part 4: Use custom code to extend U-SQL [Webinar]. PASS Big Data Virtual Chapter.
• Macauley, E. “Overview of Microsoft Azure Data Lake Analytics.” Microsoft Azure. Microsoft, 16 May 2016. Web. https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-overview/
• Reddy, S. (31 May 2016). Introduction to Azure Data Lake [Webinar]. PASS Big Data Virtual Chapter.
• Rossello, Justin. “Querying Azure SQL Database from an Azure Data Lake Analytics U-SQL Script.” eat{Code}live. 21 Nov 2015. Web. http://eatcodelive.com/2015/11/21/querying-azure-sql-database-from-an-azure-data-lake-analytics-u-sql-script/
• Rouse, M. “Definition Data Lake.” SearchAws. TechTarget, May 2015. Web. http://searchaws.techtarget.com/definition/data-lake
• Rys, M. (8 Mar 2016). Introducing U-SQL; Part 2 of 2: Scaling U-SQL and doing SQL in U-SQL [Webinar]. PASS Big Data Virtual Chapter. Retrieved from http://www.youtube.com/channel/UCkOKmMW_LEsACOqE8C1RWdw
• Rys, M. (16 Feb 2016). Introducing U-SQL; Part 1 of 2: Introduction and C# extensibility [Webinar]. PASS Big Data Virtual Chapter. Retrieved from http://www.youtube.com/channel/UCkOKmMW_LEsACOqE8C1RWdw
• Rys, M., et al. Azure/usql, (2016), GitHub repository, https://github.com/Azure/usql
• “U-SQL Language Reference.” Microsoft Azure. Microsoft, 28 Oct 2015. Web. https://msdn.microsoft.com/en-US/library/azure/mt591959(Azure.100).aspx