Michael Rys
Principal Program Manager, Big Data @ Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com
U-SQL User-Defined Operators (UDOs)
Extend U-SQL with C#/.NET
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
What are UDOs?
• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors
  • Take one row and produce one row
  • Pass-through versus transforming
• User-Defined Appliers
  • Take one row and produce 0 to n rows
  • Used with OUTER/CROSS APPLY
• User-Defined Combiners
  • Combine rowsets (like a user-defined join)
• User-Defined Reducers
  • Take n rows and produce 1 row
Called with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
• EXTRACT
• OUTPUT
• PROCESS
• COMBINE
• REDUCE
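The statements listed above might appear in a script roughly as follows. This is a minimal sketch, not taken from the deck: the assembly `MyAssembly`, the classes `MyNamespace.MyProcessor` and `MyNamespace.MyReducer`, the file paths, and the column names are all hypothetical placeholders. `Extractors.Csv()` and `Outputters.Csv()` are the built-in extractor and outputter; a user-defined one is passed the same way. COMBINE is left out for brevity.

```usql
// Hypothetical assembly that contains the UDO classes used below.
REFERENCE ASSEMBLY MyAssembly;

// EXTRACT: an extractor instance turns a file into a rowset.
@trips =
    EXTRACT driver_id int,
            city string,
            distance double
    FROM "/input/trips.csv"
    USING Extractors.Csv();

// PROCESS: a processor instance is called once per input row.
@cleaned =
    PROCESS @trips
    PRODUCE driver_id int,
            city string,
            distance double
    USING new MyNamespace.MyProcessor();   // hypothetical processor

// REDUCE: a reducer instance is called once per group of rows sharing driver_id.
@totals =
    REDUCE @cleaned
    ON driver_id
    PRODUCE driver_id int,
            total_distance double
    USING new MyNamespace.MyReducer();     // hypothetical reducer

// OUTPUT: an outputter instance serializes the rowset to a file.
OUTPUT @totals
TO "/output/totals.csv"
USING Outputters.Csv();
```

In each statement the expression after USING is an ordinary C# `new` expression, so constructor arguments can be supplied there.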
UDO/UDT Tips and Warnings
• Use:
  • READONLY clause to allow pushing predicates through UDOs
  • REQUIRED clause to allow column pruning through UDOs
  • PRESORT (coming)
• Use SELECT with UDFs instead of PROCESS
• Use user-defined aggregators instead of REDUCE
• Hint cardinality if you use CROSS APPLY and the optimizer chooses the wrong plan
• Learn to use windowing functions (OVER expression)
• Use SQL.MAP and SQL.ARRAY instead of C# Dictionary and array
• Some use cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON document for the data in a row whose columns are not known a priori.
  • Your UDF-based solution creates too much memory pressure, and you can write more memory-efficient code in a UDO.
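Several of these tips can be illustrated in one short script. The sketch below makes assumptions: the rowset contents, the assembly `MyAssembly`, and the processor `MyNamespace.LabelProcessor` are hypothetical, and the placement of the READONLY and REQUIRED clauses and the ROWCOUNT value reflect one reading of the U-SQL grammar rather than anything shown in the deck.

```usql
REFERENCE ASSEMBLY MyAssembly;   // hypothetical assembly

// Sample rowset; column names and values are illustrative only.
@sales =
    SELECT * FROM
        (VALUES
            (1, "north", 10.0, "a,b"),
            (1, "south", 20.0, "c"),
            (2, "north",  5.0, "d,e")
        ) AS T(id, region, amount, tags);

// READONLY: id and region pass through unchanged, so a later predicate on
// them can be pushed below the UDO. REQUIRED: the UDO only reads amount,
// so unused input columns can be pruned.
@labeled =
    PROCESS @sales
    PRODUCE id int,
            region string,
            amount double,
            label string
    READONLY id, region
    REQUIRED amount
    USING new MyNamespace.LabelProcessor();   // hypothetical processor

@north =
    SELECT * FROM @labeled
    WHERE region == "north";

// A windowing function instead of a custom reducer.
@with_totals =
    SELECT id,
           region,
           amount,
           SUM(amount) OVER(PARTITION BY id) AS total_per_id
    FROM @sales;

// SQL.ARRAY instead of a C# array, exploded with CROSS APPLY,
// plus a cardinality hint for the optimizer (the value 100 is arbitrary).
@arrays =
    SELECT id,
           new SQL.ARRAY<string>(tags.Split(',')) AS tag_list
    FROM @sales;

@exploded =
    SELECT id,
           tag
    FROM @arrays
         CROSS APPLY EXPLODE(tag_list) AS t(tag)
    OPTION(ROWCOUNT = 100);

OUTPUT @north TO "/output/north.csv" USING Outputters.Csv();
OUTPUT @with_totals TO "/output/with_totals.csv" USING Outputters.Csv();
OUTPUT @exploded TO "/output/exploded.csv" USING Outputters.Csv();
```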
What are UDFs and UDAGGs?
• UDFs are user-defined C# scalar functions that can be called like any scalar C# function
• UDAGGs are user-defined aggregators
  • Called with the special syntax AGG<…>
  • Enables templatized user-defined aggregators
• UDFs, UDAGGs and UDOs must be provided by a referenced assembly
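As a rough illustration of calling a UDF, the sketch below references an assembly and invokes a static C# method by its fully qualified name. `MyAssembly` and `MyNamespace.MyHelpers.Normalize` are hypothetical names, assumed here to be a static method taking and returning a string.

```usql
// Hypothetical assembly that provides MyNamespace.MyHelpers.
REFERENCE ASSEMBLY MyAssembly;

// Sample rowset; values are illustrative only.
@people =
    SELECT * FROM
        (VALUES
            (1, " Alice "),
            (2, "bob")
        ) AS T(id, name);

// The UDF is called like any other scalar C# function.
@normalized =
    SELECT id,
           MyNamespace.MyHelpers.Normalize(name) AS name_normalized
    FROM @people;

// For comparison: a built-in .NET method needs no referenced assembly.
@upper =
    SELECT id,
           name.Trim().ToUpperInvariant() AS name_upper
    FROM @people;

OUTPUT @normalized TO "/output/normalized.csv" USING Outputters.Csv();
OUTPUT @upper TO "/output/upper.csv" USING Outputters.Csv();
```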
Show me U-SQL UDOs!
UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
  • By position
  • By name
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor since I want to pass in my own parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Converting text to target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;

        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);   // set the result by position
        }
        ...
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        // Split the input stream into rows, then each row into columns
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
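The extractor above could be invoked from a script along these lines. This is a sketch under assumptions: the assembly name `DriverAssembly`, the file paths, and the column names are hypothetical, and only `int` columns are used because that is the only type conversion the slide shows (the other branches are elided).

```usql
// Hypothetical assembly that contains DriverExtractor.
REFERENCE ASSEMBLY DriverAssembly;

// The constructor parameters are passed as C# named arguments;
// encoding is omitted and therefore defaults to UTF-8.
@drivers =
    EXTRACT driver_id int,
            vehicle_id int
    FROM "/input/drivers.txt"
    USING new DriverExtractor(row_delim: "\n", col_delim: "|");

OUTPUT @drivers
TO "/output/drivers.csv"
USING Outputters.Csv();
```

The column list in EXTRACT becomes the schema that the extractor sees through `outputrow.Schema`, which is how the same class can serve different target schemas.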
UDAGG model
• UDAGG extends the IAggregate base class
• Requires implementation of Init(), Accumulate(), and Terminate() methods
• Can have multiple arguments
• Can be generic
• Called with special syntax to provide support for generic UDAGGs
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

public class MyCountAggregate : IAggregate<int, long>
{
    private int count;
    public override void Init() { count = 0; }
    public override void Accumulate(int i) { count += i; }
    public override long Terminate() { return count; }
}

public class MyTwoArgAggregate : IAggregate<string, long, int>
{
    public override void Init() {…}
    public override void Accumulate(string s, long l) {…}
    public override int Terminate() {…}
}

public class GenericListAggregate<T1, TResult> : IAggregate<T1, TResult>
    where TResult : IList<T1>, new()
{
    private TResult result;
    public override void Init() { this.result = new TResult(); }
    public override void Accumulate(T1 t1) { this.result.Add(t1); }
    public override TResult Terminate() { return this.result; }
}

Calling the aggregator from U-SQL:

SELECT AGG<MyNamespace.MyCountAggregate>(a) AS ms FROM @X;
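A fuller call site for these aggregators might look like the sketch below. It assumes the classes live in `MyNamespace` inside a hypothetical assembly `MyAssembly`, that the elided bodies of `MyTwoArgAggregate` have been filled in, and that the sample values are purely illustrative.

```usql
// Hypothetical assembly that contains the aggregator classes.
REFERENCE ASSEMBLY MyAssembly;

// Sample rowset: a is int, s is string, l is long, matching the
// Accumulate() signatures of the two aggregators.
@X =
    SELECT * FROM
        (VALUES
            ("k1", 1, "x", 10L),
            ("k1", 2, "y", 20L),
            ("k2", 3, "z", 30L)
        ) AS T(k, a, s, l);

// Each AGG<...> call creates an aggregator per group: Init() once,
// Accumulate() per row, Terminate() to produce the group's value.
@per_key =
    SELECT k,
           AGG<MyNamespace.MyCountAggregate>(a) AS ms,
           AGG<MyNamespace.MyTwoArgAggregate>(s, l) AS two_arg
    FROM @X
    GROUP BY k;

OUTPUT @per_key
TO "/output/per_key.csv"
USING Outputters.Csv();
```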
Additional Resources
Documentation
• U-SQL UDO Expressions: https://msdn.microsoft.com/en-us/library/azure/mt621319.aspx
• U-SQL OUTPUT Statement: https://msdn.microsoft.com/en-us/library/azure/mt621334.aspx
• U-SQL UDO Programmer’s Guide: Under development
• U-SQL Performance Presentation: http://www.slideshare.net/MichaelRys/usql-query-execution-and-performance-tuning
Sample Projects
• https://github.com/Azure/usql/tree/master/Examples/AmbulanceDemos/AmbulanceDemos/2-Ambulance-Structured%20Data
• https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
http://aka.ms/AzureDataLake