Trivadis TechEvent 2016 Polybase challenges Hive relational access to non-relational HDFS by Olaf Nimz

Page 1: Trivadis TechEvent 2016 Polybase challenges Hive relational access to non-relational HDFS by Olaf Nimz

BASLE BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA

HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Polybase challenges Hive: relational access to non-relational HDFS

Olaf Nimz

Page 2

Agenda

Proposed marriage between SQL Server and Hadoop

Building Bridges to HDFS

Distributed query processing

Sensible Hybrid Scenarios

Page 3

Take Home Message

1. Access to non-relational world is easier with Polybase

T-SQL only

Unstructured data is still complex, e.g. nested JSON structures

2. Hybrid solutions

Fact Extractor - IoT

Staging Area for DWH – keep entire history

Dirty data source files

Near real-time

3. Scenarios

Swiss Air - Flight Logs

Swisscom - Call Data Records

Archiving (c)old DWH Facts

Page 4

Polybase

Page 5

Polybase

Requirements

– Java (64-bit Oracle JRE 7 Update 51 or higher)

– Azure storage account or Hadoop (not HDInsight)

> Hortonworks Data Platform (HDP 1.3, 2.0 – 2.3)

> Cloudera’s CDH (4.3, 5.1 – 5.5)

Installation Check

– SELECT SERVERPROPERTY ('IsPolybaseInstalled'); returns 1?

Configuration external data source

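– EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7; RECONFIGURE;

– The new value only takes effect after SQL Server (and with it the PolyBase services) has been restarted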

Page 6

Data Movement Services

Page 7

Feature | SQL Server 2016 | Azure SQL Data Warehouse | APS Appliance - PDW
Query Hadoop data with Transact-SQL | yes | no | yes
Query Azure blob storage with Transact-SQL | yes | yes | yes
Import data from Hadoop | yes | no | yes
Import data from Azure blob storage | yes | yes | yes
Export data to Hadoop | yes | no | yes
Export data to Azure blob storage | yes | yes | yes
Run PolyBase queries from Microsoft's BI tools | yes | yes | yes
Push down query computations to Hadoop | yes | no | yes

Page 8

Objects for Polybase

Page 9

2015 © Trivadis

Define external objects

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';

CREATE DATABASE SCOPED CREDENTIAL HadoopUser
WITH IDENTITY = '<hadoop_user_name>', SECRET = '<hadoop_password>';

CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH ( TYPE = HADOOP,
       LOCATION = 'hdfs://10.xxx.xx.xxx:xxxx',
       RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx',
       CREDENTIAL = HadoopUser );
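The requirements slide lists an Azure storage account as an alternative to a Hadoop cluster. As a rough sketch of what the same two objects could look like for Azure blob storage: the names AzureStorageCredential and AzureStorage are made up for illustration, and the container, account and key placeholders must be filled in with real values. PolyBase ignores the IDENTITY string for blob storage and uses the storage account key given as SECRET.

```sql
-- Hypothetical names; a database master key must already exist.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user',                          -- any string, not used for authentication
     SECRET   = '<azure_storage_account_key>';   -- placeholder: the storage account key

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH ( TYPE = HADOOP,                            -- blob storage also uses TYPE = HADOOP
       LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net',
       CREDENTIAL = AzureStorageCredential );
```

No RESOURCE_MANAGER_LOCATION is given here, since there is no Hadoop cluster to push computation down to.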

Page 10

Define external objects

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE) );

CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
    [SensorKey] int NOT NULL, [CustomerKey] int NOT NULL,
    [GeographyKey] int NULL, [Speed] float NOT NULL,
    [YearMeasured] int NOT NULL )
WITH ( LOCATION = '/Demo/',
       DATA_SOURCE = HadoopCluster,
       FILE_FORMAT = TextFileFormat );
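For the "dirty data source files" scenario, an external table can be told how many malformed rows to tolerate before a query fails. The following is only a sketch: the table name CarSensor_Data_Tolerant and the thresholds are illustrative assumptions, while the data source and file format are the ones defined above.

```sql
-- Hypothetical variant of the table above that tolerates some bad rows.
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data_Tolerant] (
    [SensorKey] int NOT NULL, [CustomerKey] int NOT NULL,
    [GeographyKey] int NULL, [Speed] float NOT NULL,
    [YearMeasured] int NOT NULL )
WITH ( LOCATION = '/Demo/',
       DATA_SOURCE = HadoopCluster,
       FILE_FORMAT = TextFileFormat,
       REJECT_TYPE = PERCENTAGE,      -- alternative: VALUE (absolute row count)
       REJECT_VALUE = 5,              -- assumed threshold: fail above 5 % rejected rows
       REJECT_SAMPLE_VALUE = 1000 );  -- recompute the percentage every 1000 rows
```

Rows that do not match the declared schema are skipped silently up to the threshold, so a low REJECT_VALUE is the safer default while the table definition is still being tuned.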

Page 11

Query external data

SELECT DISTINCT Insured_Customers.FirstName
     , Insured_Customers.LastName
     , Insured_Customers.YearlyIncome
     , CarSensor_Data.Speed
FROM Insured_Customers
   , CarSensor_Data -- comma join; the WHERE clause below turns it into an inner join
WHERE Insured_Customers.CustomerKey = CarSensor_Data.CustomerKey
  AND CarSensor_Data.Speed > 35
ORDER BY CarSensor_Data.Speed DESC
OPTION (FORCE EXTERNALPUSHDOWN);
-- or OPTION (DISABLE EXTERNALPUSHDOWN)
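The feature matrix also lists importing Hadoop data into SQL Server. One possible way to do this, sketched here with a hypothetical target table dbo.Fast_Customers and index name, is SELECT ... INTO over the same two tables, written with explicit join syntax:

```sql
-- Materialise the joined result as a local table (hypothetical name).
SELECT DISTINCT ic.FirstName
     , ic.LastName
     , ic.YearlyIncome
INTO dbo.Fast_Customers
FROM Insured_Customers AS ic
INNER JOIN CarSensor_Data AS cs        -- external table on HDFS
        ON ic.CustomerKey = cs.CustomerKey
WHERE cs.Speed > 35;

-- Optional: columnstore storage for later analytical queries.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Fast_Customers
    ON dbo.Fast_Customers;
```

After the import, queries against dbo.Fast_Customers no longer touch the Hadoop cluster at all.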

Page 12

Export Data to Hadoop

CREATE EXTERNAL TABLE [dbo].[FastCustomers2009] ( … );

Move cold data to Hadoop/Blob while keeping it query-able via an external table:

INSERT INTO dbo.FastCustomers2009
SELECT *
FROM Insured_Customers T1
JOIN CarSensor_Data T2
  ON (T1.CustomerKey = T2.CustomerKey)
WHERE T2.YearMeasured = 2009
  AND T2.Speed > 40;
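Writing through an external table is disabled by default in SQL Server 2016, so the INSERT above fails until export has been switched on at instance level. A minimal sketch of that prerequisite:

```sql
-- Allow INSERT INTO <external table> for this instance.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- Check the value currently in use.
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'allow polybase export';
```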

Page 13

Polybase

Objects in SSMS

Page 14

Dynamic Management Views

Monitor and troubleshoot PolyBase queries using the DMVs.

longest running queries

longest running step of the distributed query

execution progress of the longest running step

- of a SQL step

- XML remote query plan

- of a DMS step

Find information about external DMS operations

- View the PolyBase query plan

- XML remote query plan (node properties)
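The steps listed above map roughly onto the PolyBase DMVs as follows. This is a sketch, not the queries from the talk: 'QID1234' stands for an execution_id taken from the first result set, and the step_index values are placeholders to be replaced by the step under investigation.

```sql
-- 1. Longest running distributed queries
SELECT dr.execution_id, st.text, dr.total_elapsed_time
FROM sys.dm_exec_distributed_requests AS dr
CROSS APPLY sys.dm_exec_sql_text(dr.sql_handle) AS st
ORDER BY dr.total_elapsed_time DESC;

-- 2. Longest running step of one distributed query
SELECT execution_id, step_index, operation_type, location_type,
       status, total_elapsed_time, command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1234'          -- placeholder id
ORDER BY total_elapsed_time DESC;

-- 3a. Progress of a SQL step
SELECT execution_id, step_index, distribution_id, status,
       total_elapsed_time, row_count, command
FROM sys.dm_exec_distributed_sql_requests
WHERE execution_id = 'QID1234' AND step_index = 1;

-- 3b. Progress of a DMS step
SELECT execution_id, step_index, dms_step_index, status,
       type, bytes_processed, total_elapsed_time
FROM sys.dm_exec_dms_workers
WHERE execution_id = 'QID1234'
ORDER BY total_elapsed_time DESC;

-- 4. External DMS operations (reads from / writes to Hadoop or blob storage)
SELECT execution_id, step_index, dms_step_index, compute_node_id,
       type, input_name, length, total_elapsed_time, status
FROM sys.dm_exec_external_work
WHERE execution_id = 'QID1234'
ORDER BY total_elapsed_time DESC;
```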

Page 15

JSON Format

Parse JSON text and read or modify values.

Transform arrays of JSON objects into table format.

Use any Transact SQL query on the converted JSON objects.

Format the results of Transact-SQL queries in JSON format.
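Two of these capabilities, modifying a value and formatting a result set as JSON, can be tried without any table. The literals below are made-up sample data; JSON_MODIFY and FOR JSON PATH are the SQL Server 2016 built-ins.

```sql
-- Modify (here: add) a value inside a JSON text.
SELECT JSON_MODIFY(N'{"name":"John"}', '$.surname', N'Smith') AS modified;

-- Format a query result as JSON; dotted column aliases become nested objects.
SELECT 2        AS id,
       N'John'  AS [info.name],
       N'Smith' AS [info.surname]
FOR JSON PATH;
```

The dotted aliases are the counterpart of the '$.info.name' paths used when reading JSON, which keeps export and import symmetric.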

Page 16

JSON

Page 17

Parse «unstructured» JSON cell content

stored in the jsonCol column:

[ { "name": "John", "skills": [ "SQL", "C#", "Azure" ] }, { "name": "Jane", "surname": "Doe" } ]

SELECT Name, Surname,
       JSON_VALUE(jsonCol, '$.info.address.PostCode') AS PostCode,
       JSON_VALUE(jsonCol, '$.info.address."Address Line 1"') + ' ' +
       JSON_VALUE(jsonCol, '$.info.address."Address Line 2"') AS Address,
       JSON_QUERY(jsonCol, '$.info.skills') AS Skills
FROM PeopleCollection
WHERE ISJSON(jsonCol) > 0
  AND JSON_VALUE(jsonCol, '$.info.address.town') = 'Belgrade'
  AND Status = 'Active'
ORDER BY JSON_VALUE(jsonCol, '$.info.address.PostCode')
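To try the path expressions without a PeopleCollection table, the same functions can be applied to a variable. The document below is invented sample data shaped to match the paths in the query; JSON_VALUE returns scalars, JSON_QUERY returns objects or arrays.

```sql
DECLARE @jsonInfo nvarchar(max) = N'{
  "info": {
    "address": {
      "PostCode": "11000",
      "Address Line 1": "Sample Street 1",
      "Address Line 2": "2nd floor",
      "town": "Belgrade"
    },
    "skills": [ "SQL", "C#", "Azure" ]
  }
}';

SELECT ISJSON(@jsonInfo)                                        AS IsValidJson,
       JSON_VALUE(@jsonInfo, '$.info.address.PostCode')         AS PostCode,
       JSON_VALUE(@jsonInfo, '$.info.address."Address Line 1"') AS AddressLine1,
       JSON_VALUE(@jsonInfo, '$.info.address.town')             AS Town,
       JSON_QUERY(@jsonInfo, '$.info.skills')                   AS Skills;
```

Keys containing spaces, such as "Address Line 1", have to be double-quoted inside the path.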

Page 18

Convert «unstructured» JSON to table

DECLARE @json nvarchar(max);

SET @json = N'[
  { "id" : 2, "info": { "name": "John", "surname": "Smith" }, "age": 25 },
  { "id" : 5, "info": { "name": "Jane", "surname": "Smith" }, "dob": "2005-11-04T12:00:00" }
]';

SELECT *
FROM OPENJSON(@json)
WITH (id int 'strict $.id',
      firstName nvarchar(50) '$.info.name', lastName nvarchar(50) '$.info.surname',
      age int, dateOfBirth datetime2 '$.dob');
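Nested arrays are where the table definition gets harder. One way to flatten them, sketched here on the small people/skills sample from the previous slide, is to expose the array as a JSON column and expand it with a second OPENJSON call:

```sql
DECLARE @people nvarchar(max) = N'[
  { "name": "John", "skills": [ "SQL", "C#", "Azure" ] },
  { "name": "Jane", "surname": "Doe" }
]';

SELECT p.name, p.surname, s.value AS skill
FROM OPENJSON(@people)
     WITH (name    nvarchar(50)  '$.name',
           surname nvarchar(50)  '$.surname',
           skills  nvarchar(max) '$.skills' AS JSON) AS p   -- keep the array as JSON text
OUTER APPLY OPENJSON(p.skills) AS s;                        -- one row per array element
```

OUTER APPLY rather than CROSS APPLY keeps people without a skills array in the result, with NULL in the skill column.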

Page 19

Performance Scaling

Page 20

Take Home Message

1. Access to non-relational world is easier with Polybase

T-SQL only

Unstructured data is still complex, e.g. nested JSON structures

2. Hybrid solutions

Fact Extractor - IoT

Staging Area for DWH – keep entire history

Dirty data source files

Near real-time

3. Scenarios

Swiss Air - Flight Logs

Swisscom - Call Data Records

Archiving (c)old DWH Facts

Page 21

Outlook

Table definition remains challenging

Push down computation

Scale-out the SQL Server side

– using e.g. idle Fail Over Instance

see Blog Post with Code Examples
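Scaling out the SQL Server side is done with a PolyBase scale-out group: additional instances join a head node as compute nodes. The following is a hedged sketch only; the head node name HEADNODE01 and the instance name MSSQLSERVER are assumptions, and 16450 is the default DMS control channel port.

```sql
-- Run on the instance that should become a compute node
-- (for example an otherwise idle failover instance).
EXEC sp_polybase_join_group
     @head_node_address = N'HEADNODE01',                   -- hypothetical machine name
     @dms_control_channel_port = 16450,                    -- default port
     @head_node_sql_server_instance_name = N'MSSQLSERVER'; -- assumed default instance

-- Afterwards restart the PolyBase Engine and Data Movement services
-- on the compute node, then verify on the head node:
SELECT * FROM sys.dm_exec_compute_nodes;
```

A node can be detached again with sp_polybase_leave_group, followed by the same service restart.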

Page 22

BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN

THANK YOU. Trivadis AG

Olaf Nimz

Sägereistrasse 29

8152 Glattbrugg

Tel. +41-44-808 70 20

Fax +41-44-808 70 21

www.trivadis.com