Upload
jordan-halterman
View
1.000
Download
48
Embed Size (px)
Citation preview
><
INTRODUCTION TO
INTRODUCTION TO APACHE CALCITE
APACHE CALCITEJORDAN HALTERMAN
1
WHAT IS APACHE CALCITE?
next 2
><INTRODUCTION TO APACHE CALCITE 3
What is Apache Calcite?
• A framework for building SQL databases • Developed over more than ten years • Written in Java • Previously known as Optiq • Previously known as Farrago • Became an Apache project in 2013 • Led by Julian Hyde at Hortonworks
><INTRODUCTION TO APACHE CALCITE 4
Projects using Calcite
• Apache Hive • Apache Drill • Apache Flink • Apache Phoenix • Apache Samza • Apache Storm • Apache everything…
><INTRODUCTION TO APACHE CALCITE 5
What is Apache Calcite?
• SQL parser • SQL validation • Query optimizer • SQL generator • Data federator
><INTRODUCTION TO APACHE CALCITE
ParseQueries are parsed using a JavaCC generated parser
ValidateQueries are validated against known database metadata
OptimizeLogical plans are optimized and converted into physical expressions
ExecuteP h y s i c a l p l a n s a r e converted into application-specific executions
01 02 03 04
Stages of query execution
6
COMPONENTS next 7
><INTRODUCTION TO APACHE CALCITE 8
Components of Calcite
• Catalog - Defines metadata and namespaces that can be accessed in SQL queries
• SQL parser - Parses valid SQL queries into an abstract syntax tree (AST)
• SQL validator - Validates abstract syntax trees against metadata provided by the catalog
• Query optimizer - Converts AST into logical plans, optimizes logical plans, and converts logical expressions into physical plans
• SQL generator - Converts physical plans to SQL
CATALOG next 9
><INTRODUCTION TO APACHE CALCITE 10
Calcite Catalog
• Defines namespaces that can be accessed in Calcite queries
• Schema
• A collection of schemas and tables
• Can be arbitrarily nested
• Table
• Represents a single data set
• Fields defined by a RelDataType
• RelDataType
• Represents fields in a data set
• Supports all SQL data types, including structs and
><INTRODUCTION TO APACHE CALCITE 11
Schema
• A collection of schemas and tables
• Schemas can be arbitrarily nested
><INTRODUCTION TO APACHE CALCITE 12
Schema
• A collection of schemas and tables
• Schemas can be arbitrarily nested
><INTRODUCTION TO APACHE CALCITE 13
Table
• Represents a single data set
• Fields are defined by a RelDataType
><INTRODUCTION TO APACHE CALCITE 14
Table
• Represents a single data set
• Fields are defined by a RelDataType
><INTRODUCTION TO APACHE CALCITE 15
RelDataType
• Represents the data type of an object
• Supports all SQL data types, including structs and arrays
• Similar to Spark’s DataType
><INTRODUCTION TO APACHE CALCITE 16
RelDataType
><INTRODUCTION TO APACHE CALCITE 17
RelDataType
data type enum
><INTRODUCTION TO APACHE CALCITE 18
Statistic
• Provide table statistics used in optimization
><INTRODUCTION TO APACHE CALCITE 19
Statistic
• Provide table statistics used in optimization
><INTRODUCTION TO APACHE CALCITE 20
Usage of the Calcite catalog
><INTRODUCTION TO APACHE CALCITE 21
Usage of the Calcite catalog
schema
><INTRODUCTION TO APACHE CALCITE 22
Usage of the Calcite catalog
schema table
><INTRODUCTION TO APACHE CALCITE 23
Usage of the Calcite catalog
schema table
data type
><INTRODUCTION TO APACHE CALCITE 24
Usage of the Calcite catalog
schema table
data typedata type field
SQL PARSER next 25
><INTRODUCTION TO APACHE CALCITE 26
Calcite SQL parser
• LL(k) parser written in JavaCC
• Input queries are parsed into an abstract syntax tree (AST)
• Tokens are represented in Calcite by SqlNode
• SqlNode can also be converted back to a SQL string via the unparse method
><INTRODUCTION TO APACHE CALCITE 27
JavaCC
• Java Compiler Compiler
• Created in 1996 at Sun Microsystems
• Generates Java code from a domain-specific language
• ANTLR is the modern alternative used in projects like Hive and Drill
• JavaCC has sparse documentation
><INTRODUCTION TO APACHE CALCITE 28
JavaCC
><INTRODUCTION TO APACHE CALCITE 29
JavaCC
><INTRODUCTION TO APACHE CALCITE 30
JavaCC
tokens
><INTRODUCTION TO APACHE CALCITE 31
JavaCC
tokens
or
><INTRODUCTION TO APACHE CALCITE 32
JavaCC
tokens
or
function call
><INTRODUCTION TO APACHE CALCITE 33
JavaCC
tokens
Java code
or
function call
><INTRODUCTION TO APACHE CALCITE 34
SqlNode
• SqlNode represents an element in an abstract syntax tree
><INTRODUCTION TO APACHE CALCITE 35
SqlNode
• SqlNode represents an element in an abstract syntax tree
select
><INTRODUCTION TO APACHE CALCITE 36
SqlNode
• SqlNode represents an element in an abstract syntax tree
identifiersselect
><INTRODUCTION TO APACHE CALCITE 37
SqlNode
• SqlNode represents an element in an abstract syntax tree
identifiersselect operator
><INTRODUCTION TO APACHE CALCITE 38
SqlNode
• SqlNode represents an element in an abstract syntax tree
identifiersselect operator identifier
><INTRODUCTION TO APACHE CALCITE 39
SqlNode
• SqlNode represents an element in an abstract syntax tree
identifiersselect operator identifier
data type
><INTRODUCTION TO APACHE CALCITE 40
SqlNode
• SqlNode represents an element in an abstract syntax tree
identifiersselect operator identifier
data typeidentifier
><INTRODUCTION TO APACHE CALCITE 41
SqlNode
• SqlNode’s unparse method converts a SQL element back into a string
><INTRODUCTION TO APACHE CALCITE 42
SqlNode
• SqlNode’s unparse method converts a SQL element back into a string
><INTRODUCTION TO APACHE CALCITE 43
SqlNode
><INTRODUCTION TO APACHE CALCITE 44
SqlNode
• SqlDialect indicates the capitalization and quoting rules of specific databases
><INTRODUCTION TO APACHE CALCITE 45
SqlNode
• SqlDialect indicates the capitalization and quoting rules of specific databases
QUERY OPTIMIZER next 46
><INTRODUCTION TO APACHE CALCITE 47
Query Plans
• Query plans represent the steps necessary to execute a query
><INTRODUCTION TO APACHE CALCITE 48
Query Plans
• Query plans represent the steps necessary to execute a query
><INTRODUCTION TO APACHE CALCITE 49
Query Plans
• Query plans represent the steps necessary to execute a query
table scan
table scan
><INTRODUCTION TO APACHE CALCITE 50
Query Plans
• Query plans represent the steps necessary to execute a query
inner join table scan
table scan
><INTRODUCTION TO APACHE CALCITE 51
Query Plans
• Query plans represent the steps necessary to execute a query
filterinner join table scan
table scan
><INTRODUCTION TO APACHE CALCITE 52
Query Plans
• Query plans represent the steps necessary to execute a query
filterinner join
project
table scan
table scan
><INTRODUCTION TO APACHE CALCITE 53
Query Plans
• Query plans represent the steps necessary to execute a query
filterinner join
project
table scan
table scan
><INTRODUCTION TO APACHE CALCITE 54
Query Optimization
• Optimize logical plan
• Goal is typically to try to reduce the amount of data that must be processed early in the plan
• Convert logical plan into a physical plan
• Physical plan is engine specific and represents the physical execution stages
><INTRODUCTION TO APACHE CALCITE 55
Query Optimization
• Prune unused fields
• Merge projections
• Convert subqueries to joins
• Reorder joins
• Push down projections
• Push down filters
><INTRODUCTION TO APACHE CALCITE 56
Query Optimization
><INTRODUCTION TO APACHE CALCITE 57
Query Optimization
><INTRODUCTION TO APACHE CALCITE 58
Query Optimization
><INTRODUCTION TO APACHE CALCITE 59
Query Optimization
push down project
><INTRODUCTION TO APACHE CALCITE 60
Query Optimization
push down project
push down filter
><INTRODUCTION TO APACHE CALCITE 61
Query Optimization
><INTRODUCTION TO APACHE CALCITE 62
Key Concepts
Relational algebra Row expressions Traits Conventions Rules Planners Programs
><INTRODUCTION TO APACHE CALCITE 63
Key Concepts
Relational algebra Row expressions Traits Conventions Rules Planners Programs
RelNode RexNode RelTrait Convention RelOptRule RelOptPlanner Program
><INTRODUCTION TO APACHE CALCITE 64
Relational Algebra
• RelNode represents a relational expression
• Largely equivalent to Spark’s DataFrame methods
• Logical algebra
• Physical algebra
><INTRODUCTION TO APACHE CALCITE 65
Relational Algebra
TableScan
Project
Filter
Aggregate
Join
Union
Intersect
Sort
><INTRODUCTION TO APACHE CALCITE 66
Relational Algebra
TableScan
Project
Filter
Aggregate
Join
Union
Intersect
Sort
SparkTableScan
SparkProject
SparkFilter
SparkAggregate
SparkJoin
SparkUnion
SparkIntersect
SparkSort
><INTRODUCTION TO APACHE CALCITE 67
Row Expressions
• RexNode represents a row-level expression
• Largely equivalent to Spark’s Column functions
• Projection fields
• Filter condition
• Join condition
• Sort fields
><INTRODUCTION TO APACHE CALCITE 68
Row Expressions
Input column ref Literal Struct field access Function call Window expression
><INTRODUCTION TO APACHE CALCITE 69
Row Expressions
Input column ref Literal Struct field access Function call Window expression
RexInputRef RexLiteral
RexFieldAccess RexCall RexOver
><INTRODUCTION TO APACHE CALCITE 70
Row Expressions
><INTRODUCTION TO APACHE CALCITE 71
Row Expressions
input ref
><INTRODUCTION TO APACHE CALCITE 72
Row Expressions
input ref
function call
><INTRODUCTION TO APACHE CALCITE 73
Traits
• Defined by the RelTrait interface
• Represent a trait of a relational expression that does not alter execution
• Traits are used to validate plan output
• Three primary trait types:
• Convention
• RelCollation
• RelDistribution
><INTRODUCTION TO APACHE CALCITE 74
Conventions
• Convention is a type of RelTrait
• A Convention is associated with a RelNode interface
• SparkConvention, JdbcConvention, EnumerableConvention, etc
• Conventions are used to represent a single data source
• Inputs to a relational expression must be in the same convention
><INTRODUCTION TO APACHE CALCITE 75
Conventions
><INTRODUCTION TO APACHE CALCITE 76
Conventions
Spark convention
><INTRODUCTION TO APACHE CALCITE 77
Conventions
Spark convention
JDBC convention
><INTRODUCTION TO APACHE CALCITE 78
Conventions
Spark convention
JDBC convention
converter
><INTRODUCTION TO APACHE CALCITE 79
Rules
• Rules are used to modify query plans
• Defined by the RelOptRule interface
• Two types of rules: converters and transformers
• Converter rules implement Converter and convert from one convention to another
• Rules are matched to elements of a query plan using pattern matching
• onMatch is called for matched rules
• Converter rules applied via convert
><INTRODUCTION TO APACHE CALCITE 80
Converter Rule
><INTRODUCTION TO APACHE CALCITE 81
Converter Rule
expression type
><INTRODUCTION TO APACHE CALCITE 82
Converter Rule
expression type
input convention
><INTRODUCTION TO APACHE CALCITE 83
Converter Rule
expression type
input convention
converted convention
><INTRODUCTION TO APACHE CALCITE 84
Converter Rule
expression type
input convention
converted convention
converter function
><INTRODUCTION TO APACHE CALCITE 85
Pattern Matching
><INTRODUCTION TO APACHE CALCITE 86
Pattern Matching
><INTRODUCTION TO APACHE CALCITE 87
Pattern Matching
no match :-(
><INTRODUCTION TO APACHE CALCITE 88
Pattern Matching
no match :-(
><INTRODUCTION TO APACHE CALCITE 89
Pattern Matching
match!
no match :-(
><INTRODUCTION TO APACHE CALCITE 90
Planners
• Planners implement the RelOptPlanner interface
• Two types of planners:
• HepPlanner
• VolcanoPlanner
><INTRODUCTION TO APACHE CALCITE 91
Heuristic Optimization
• HepPlanner is a heuristic optimizer similar to Spark’s optimizer
• Applies all matching rules until none can be applied
• Heuristic optimization is faster than cost-based optimization
• Risk of infinite recursion if rules make opposing changes to the plan
><INTRODUCTION TO APACHE CALCITE 92
Cost-based Optimization
• VolcanoPlanner is a cost-based optimizer
• Applies matching rules iteratively, selecting the plan with the cheapest cost on each iteration
• Costs are provided by relational expressions
• Not all possible plans can be computed
• Stops optimization when the cost does not significantly improve through a determinable number of iterations
><INTRODUCTION TO APACHE CALCITE 93
Cost-based Optimization
• Cost is provided by each RelNode
• Cost is represented by RelOptCost
• Cost typically includes row count, I/O, and CPU cost
• Cost estimates are relative
• Statistics are used to improve accuracy of cost estimations
• Calcite provides utilities for computing various resource-related statistics for use in cost estimations
><INTRODUCTION TO APACHE CALCITE 94
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 95
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 96
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 97
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 98
Cost-based Optimization
PUTTING IT ALL TOGETHER
next 99
><INTRODUCTION TO APACHE CALCITE 100
Putting it all together
><INTRODUCTION TO APACHE CALCITE 101
Putting it all together
><INTRODUCTION TO APACHE CALCITE 102
Putting it all together
><INTRODUCTION TO APACHE CALCITE 103
Putting it all together
><INTRODUCTION TO APACHE CALCITE 104
Putting it all together
><INTRODUCTION TO APACHE CALCITE 105
Putting it all together
><INTRODUCTION TO APACHE CALCITE 106
Putting it all together
><INTRODUCTION TO APACHE CALCITE 107
Putting it all together
><INTRODUCTION TO APACHE CALCITE 108
Putting it all together
><INTRODUCTION TO APACHE CALCITE 109
Putting it all together