Michael Rys, Principal Program Manager, Big Data @ Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com

U-SQL User-Defined Operators (UDOs) (SQLBits 2016)




Extend U-SQL with C#/.NET

Built-in operators, functions, aggregates

C# expressions (in SELECT expressions)

User-defined aggregates (UDAGGs)

User-defined functions (UDFs)

User-defined operators (UDOs)


What are UDOs?

User-Defined Extractors

User-Defined Outputters

User-Defined Processors
• Take one row and produce one row
• Pass-through versus transforming

User-Defined Appliers
• Take one row and produce 0 to n rows
• Used with OUTER/CROSS APPLY

User-Defined Combiners
• Combines rowsets (like a user-defined join)

User-Defined Reducers
• Take n rows and produce 1 row

Called with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
• EXTRACT
• OUTPUT
• PROCESS
• COMBINE
• REDUCE
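As a rough illustration of the explicit syntax listed above, the following U-SQL sketch passes UDO instances to PROCESS, REDUCE and OUTPUT. The assembly name, the MyNamespace classes, the file paths and the column names are all hypothetical placeholders, not taken from the talk.

```usql
// Hypothetical assembly that contains the UDO classes used below.
REFERENCE ASSEMBLY MyAssembly;

// Built-in extractor, used here only to obtain an input rowset.
@input =
    EXTRACT name string,
            score int
    FROM "/input/scores.csv"
    USING Extractors.Csv();

// PROCESS takes an instance of a user-defined processor (hypothetical class).
@cleaned =
    PROCESS @input
    PRODUCE name string,
            score int
    USING new MyNamespace.MyProcessor();

// REDUCE takes an instance of a user-defined reducer (hypothetical class).
@reduced =
    REDUCE @cleaned
    ON name
    PRODUCE name string,
            score int
    USING new MyNamespace.MyReducer();

// OUTPUT takes an instance of a user-defined outputter (hypothetical class).
OUTPUT @reduced
TO "/output/result.txt"
USING new MyNamespace.MyOutputter();
```

Each `new ...()` expression is an ordinary C# constructor call, so constructor arguments can be used to parameterize the operator.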


UDO/UDT Tips and Warnings

• Use:
  • READONLY clause to allow pushing predicates through UDOs
  • REQUIRED clause to allow column pruning through UDOs
  • PRESORT (coming)
• Use SELECT with UDFs instead of PROCESS
• Use user-defined aggregators instead of REDUCE
• Hint cardinality if you use CROSS APPLY and it chooses the wrong plan
• Learn to use windowing functions (OVER expression)
• Use SQL.MAP and SQL.ARRAY instead of C# Dictionary and array

• Some use cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON document for the data in the row when the columns are not known a priori.
  • Your UDF-based solution creates too much memory pressure and you can write more memory-efficient code in a UDO.
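The dynamic-schema use case above could look roughly like the following processor, which walks the input schema at run time and writes one JSON-like string per row. This is a minimal sketch: the class name, the namespace and the output column name "json" are assumptions, and the hand-rolled escaping is deliberately naive (a real implementation would more likely use a JSON library).

```csharp
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

namespace MyNamespace // hypothetical namespace
{
    [SqlUserDefinedProcessor]
    public class RowToJsonProcessor : IProcessor // hypothetical class name
    {
        public override IRow Process(IRow input, IUpdatableRow output)
        {
            ISchema columns = input.Schema;
            var fields = new List<string>();

            // The columns are discovered at run time, not hard-coded.
            for (int i = 0; i < columns.Count; i++)
            {
                object value = input.Get<object>(i);
                string text = value == null
                    ? "null"
                    : "\"" + Escape(value.ToString()) + "\""; // every value emitted as a string
                fields.Add("\"" + Escape(columns[i].Name) + "\":" + text);
            }

            // Assumes the PRODUCE clause declares a single column: json string.
            output.Set<string>("json", "{" + string.Join(",", fields) + "}");
            return output.AsReadOnly();
        }

        // Naive escaping of backslashes and double quotes only.
        private static string Escape(string s)
        {
            return s.Replace("\\", "\\\\").Replace("\"", "\\\"");
        }
    }
}
```

From U-SQL, such a processor would be invoked along the lines of `PROCESS @rows PRODUCE json string USING new MyNamespace.RowToJsonProcessor();`.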


What are UDFs and UDAGGs?

• UDFs are user-defined C# scalar functions that can be called like any scalar C# function

• UDAGGs are user-defined aggregators
  • Called by special syntax AGG<…>
  • Enables templatized user-defined aggregators

• UDFs, UDAGGs and UDOs must be provided by a referenced assembly
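To make the UDF point concrete, here is a minimal sketch of a scalar function and how it might be called. The namespace, class, method and assembly names are hypothetical.

```csharp
namespace MyNamespace // hypothetical namespace
{
    public static class StringFunctions // hypothetical class
    {
        // A UDF is just a public static C# method.
        public static string NormalizeName(string name)
        {
            return string.IsNullOrWhiteSpace(name)
                ? null
                : name.Trim().ToUpperInvariant();
        }
    }
}

// Possible use from a U-SQL script (MyAssembly is a hypothetical assembly name):
//
//   REFERENCE ASSEMBLY MyAssembly;
//
//   @result =
//       SELECT MyNamespace.StringFunctions.NormalizeName(name) AS name
//       FROM @X;
```

Because the function is called with its fully qualified C# name inside a SELECT expression, no dedicated U-SQL statement is needed, in contrast to UDOs.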


Show me U-SQL UDOs!


UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
  • By position
  • By name

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using Microsoft.Analytics.Interfaces;

    [SqlUserDefinedExtractor]
    public class DriverExtractor : IExtractor
    {
        private byte[] _row_delim;
        private string _col_delim;
        private Encoding _encoding;

        // Define a non-default constructor since I want to pass in my own parameters
        public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
        {
            _encoding = encoding == null ? Encoding.UTF8 : encoding;
            _row_delim = _encoding.GetBytes(row_delim);
            _col_delim = col_delim;
        } // DriverExtractor

        // Converting text to target schema
        private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
        {
            var schema = outputrow.Schema;

            if (schema[i].Type == typeof(int))
            {
                var tmp = Convert.ToInt32(c);
                outputrow.Set(i, tmp); // set the result by position
            }
            ...
        } // OutputValueAtCol_I

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
        {
            // Split the input stream into rows, then each row into columns
            foreach (var row in input.Split(_row_delim))
            {
                using (var s = new StreamReader(row, _encoding))
                {
                    int i = 0;
                    foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                    {
                        OutputValueAtCol_I(c, i++, outputrow);
                    } // foreach
                } // using
                yield return outputrow.AsReadOnly();
            } // foreach
        } // Extract
    } // class DriverExtractor
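A script that uses this extractor might look like the sketch below. The assembly name, the namespace, the file paths and the column list are assumptions made for illustration. The column list is also limited to int columns, since that is the only conversion the code above shows.

```usql
// Hypothetical assembly containing DriverExtractor.
REFERENCE ASSEMBLY MyAssembly;

@drivers =
    EXTRACT driver_id int,
            trip_count int
    FROM "/input/drivers.txt"
    // Constructor parameters are passed like any C# arguments;
    // here the column delimiter is overridden with a tab.
    USING new MyNamespace.DriverExtractor(col_delim: "\t");

OUTPUT @drivers
TO "/output/drivers.csv"
USING Outputters.Csv();
```

The schema declared in the EXTRACT clause is what the extractor sees as `outputrow.Schema`, which is why the code can pick a conversion per column type at run time.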


UDAGG model
• UDAGG extends IAggregate interface
• Requires implementation of Init(), Accumulate(), and Terminate() methods
• Can have multiple arguments
• Can be generic
• Called with special syntax to provide support for generic UDAGGs

    public class MyCountAggregate : IAggregate<int, long>
    {
        private int count;
        public override void Init() { count = 0; }
        public override void Accumulate(int i) { count += i; }
        public override long Terminate() { return count; }
    }

    public class MyTwoArgAggregate : IAggregate<string, long, int>
    {
        public override void Init() {…}
        public override void Accumulate(string s, long l) {…}
        public override int Terminate() {…}
    }

    public class GenericListAggregate<T1, TResult> : IAggregate<T1, TResult>
        where TResult : IList<T1>, new()
    {
        private TResult result;
        public override void Init() { this.result = new TResult(); }
        public override void Accumulate(T1 t1) { this.result.Add(t1); }
        public override TResult Terminate() { return this.result; }
    }

    SELECT AGG<MyNamespace.MyCountAggregate>(a) AS ms FROM @X;


Additional Resources

Documentation
• U-SQL UDO Expressions: https://msdn.microsoft.com/en-us/library/azure/mt621319.aspx
• U-SQL OUTPUT Statement: https://msdn.microsoft.com/en-us/library/azure/mt621334.aspx
• U-SQL UDO Programmer’s Guide: Under development
• U-SQL Performance Presentation: http://www.slideshare.net/MichaelRys/usql-query-execution-and-performance-tuning

Sample Projects
• https://github.com/Azure/usql/tree/master/Examples/AmbulanceDemos/AmbulanceDemos/2-Ambulance-Structured%20Data
• https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis


http://aka.ms/AzureDataLake