THE PUSHDOWN OF EVERYTHING Stephan Kessler Santiago Mola

The Pushdown of Everything by Stephan Kessler and Santiago Mola

Page 1: The Pushdown of Everything by Stephan Kessler and Santiago Mola

THE PUSHDOWN OF EVERYTHING

Stephan Kessler, Santiago Mola

Page 2: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Who we are?

Stephan Kessler
Developer @ SAP, Walldorf
o SAP HANA Vora team
o Integration of the Vora query engine with Apache Spark.
o Bringing new features and performance improvements to Apache Spark.
o Before joining SAP: PhD and M.Sc. at the Karlsruhe Institute of Technology.
o Research on privacy in databases and sensor networks.

Santiago Mola
Developer @ Stratio, Madrid
o Working with the SAP HANA Vora team
o Focus on Apache Spark SQL extensions and data sources implementation.
o Bootstrapped Stratio Sparkta, worked on Stratio Ingestion and helped customers to build stream processing solutions.
o Previously: CTO at Bitsnbrains, M.Sc. at Polytechnic University of Valencia.

Page 3: The Pushdown of Everything by Stephan Kessler and Santiago Mola

SAP HANA Vora
• SAP HANA Vora is a SQL-on-Hadoop solution based on:
  – In-memory columnar query execution engine with built-in query compilation
  – Spark SQL extensions (will be Open Source soon!):
    • OLAP extensions
    • Hierarchy queries
    • Extended Data Sources API (‘Push Down Everything’)

Page 4: The Pushdown of Everything by Stephan Kessler and Santiago Mola

[Diagram labels: Spark Core Engine; Spark SQL, MLlib, Streaming, …; Data Sources API; Data Sources: CSV, HANA, HANA Vora]

Page 5: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Motivation
• “The fastest way of processing data is not processing it at all!”
• The Data Sources API allows deferring the computation of filters and projections to the ‘source’
  – Less I/O spent reading
  – Less memory spent
• But: data sources can also be full-blown databases
  – Deferring parts of the logical plan leads to additional benefits

→ The Pushdown of Everything

[Figure: operations pushed down to the source. Project: Column1; Filter: Column2 > 20; Average: Column2]

Page 6: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Implementing a Data Source
1. Create a ‘DefaultSource’ class that implements the trait (Schema)RelationProvider

    trait SchemaRelationProvider {
      def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String],
          schema: StructType): BaseRelation
    }

2. The returned “BaseRelation” can implement the following traits
  – TableScan
  – PrunedScan
  – PrunedFilteredScan

Page 7: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Full Scan
• The most basic form of reading data: read it all, sequentially.
• Implement trait TableScan

    trait TableScan {
      def buildScan(): RDD[Row]
    }

• SQL: SELECT * FROM table
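
A minimal sketch of what a full scan hands back, written in plain Scala without Spark so that it runs standalone. `Row`, `ToyTableScan` and `AttendeesRelation` are hypothetical stand-ins introduced for illustration; a real source would return an `RDD[Row]` rather than a `Seq`.

```scala
object FullScanSketch {
  // Stand-in for Spark's Row: an ordered sequence of column values.
  type Row = Seq[Any]

  // Toy counterpart of TableScan; Seq replaces RDD to stay dependency-free.
  trait ToyTableScan {
    def schema: Seq[String]
    def buildScan(): Seq[Row]
  }

  // Hypothetical in-memory relation holding the visible rows of the attendees table.
  object AttendeesRelation extends ToyTableScan {
    val schema: Seq[String] = Seq("name", "age", "hometown")

    private val data: Seq[Row] = Seq(
      Seq("Peter", 23, "London"),
      Seq("John", 30, "New York"),
      Seq("Stephan", 72, "Karlsruhe")
    )

    // SELECT * FROM table: every row, every column, no arguments to narrow the read.
    def buildScan(): Seq[Row] = data
  }
}
```

Because `buildScan()` takes no arguments, the source has no way to learn which columns or rows the query needs; all filtering and aggregation is left to the caller.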

Page 8: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Pruned Scan
• Read all rows, only a few columns
• Implement trait PrunedScan

    trait PrunedScan {
      def buildScan(requiredColumns: Array[String]): RDD[Row]
    }

• SQL: SELECT <column list> FROM table

Page 9: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Pruned Filtered Scan
• Can filter which rows are fetched (predicate push down).
• Implement trait PrunedFilteredScan

    trait PrunedFilteredScan {
      def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
    }

• SQL: SELECT <column list> FROM table WHERE <predicate>
• Spark SQL allows basic predicates here (e.g. EqualTo, GreaterThan).
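
A rough sketch of how a source might honor `requiredColumns` and `filters`, again in plain Scala without Spark. The `Filter` cases below are simplified toy versions of the predicates named above, and `ToyRelation` is a hypothetical in-memory table; none of this is Spark's actual implementation.

```scala
object PrunedFilteredScanSketch {
  type Row = Seq[Any]

  // Toy versions of two basic predicates a source can be handed.
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter

  // Hypothetical in-memory table: column names plus rows in that column order.
  final class ToyRelation(columns: Seq[String], data: Seq[Row]) {
    private def matches(row: Row, filter: Filter): Boolean = filter match {
      case EqualTo(attr, v) => row(columns.indexOf(attr)) == v
      case GreaterThan(attr, v) =>
        row(columns.indexOf(attr)) match {
          case i: Int => i > v
          case _      => false
        }
    }

    // Drop non-matching rows first, then keep only the requested columns,
    // in the order they were requested.
    def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Row] = {
      val indices = requiredColumns.map(c => columns.indexOf(c))
      data
        .filter(row => filters.forall(f => matches(row, f)))
        .map(row => indices.map(i => row(i)))
    }
  }

  val attendees = new ToyRelation(
    Seq("name", "age", "hometown"),
    Seq(
      Seq("Peter", 23, "London"),
      Seq("John", 30, "New York"),
      Seq("Stephan", 72, "Karlsruhe")
    )
  )
}
```

For example, `attendees.buildScan(Seq("age", "hometown"), Seq(GreaterThan("age", 25)))` corresponds to `SELECT age, hometown FROM attendees WHERE age > 25` and returns only the rows for John and Stephan, without the name column.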

Page 10: The Pushdown of Everything by Stephan Kessler and Santiago Mola

How does it work?
Assume the following table attendees:

    Name     Age  Hometown
    Peter    23   London
    John     30   New York
    Stephan  72   Karlsruhe
    …        …    …

Query:
    SELECT hometown, AVG(age)
    FROM attendees
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown

Page 11: The Pushdown of Everything by Stephan Kessler and Santiago Mola

How does it work?
Query:
    SELECT hometown, AVG(age)
    FROM attendees
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown

The query is parsed into this Logical Plan (root first):

    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        Relation (datasource) Attendees
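
The plan for this query can be pictured as a small tree of operators. The sketch below models it with hypothetical toy case classes in plain Scala; Catalyst's real `LogicalPlan` nodes carry resolved expressions and much more detail, so treat this only as an illustration of the shape.

```scala
object LogicalPlanSketch {
  // Toy plan nodes: each operator wraps the operator it reads from.
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Aggregate(groupBy: Seq[String], aggregates: Seq[String], child: Plan) extends Plan

  // SELECT hometown, AVG(age) FROM attendees WHERE hometown = 'Amsterdam' GROUP BY hometown
  val plan: Plan =
    Aggregate(Seq("hometown"), Seq("AVG(age)"),
      Filter("hometown = 'Amsterdam'",
        Relation("attendees")))

  // Render the tree root first, one node per line, indented by depth.
  def render(p: Plan, depth: Int = 0): String = {
    val pad = "  " * depth
    p match {
      case Relation(n) =>
        s"${pad}Relation($n)"
      case Filter(c, child) =>
        s"${pad}Filter($c)\n" + render(child, depth + 1)
      case Aggregate(g, a, child) =>
        s"${pad}Aggregate(${g.mkString(", ")}; ${a.mkString(", ")})\n" + render(child, depth + 1)
    }
  }
}
```

Data flows from the leaf (the relation) upward, so the filter runs before the aggregate even though the aggregate is the root of the tree.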

Page 12: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Example with TableScan

Logical plan (root first):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        Relation (datasource) Attendees

Physical plan (after planning):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        PhysicalRDD (full scan)

SQL representation:
  Executed by the data source:
    SELECT name, age, hometown
    FROM attendees
  Executed by Spark:
    SELECT hometown, AVG(age)
    FROM source
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown

Page 14: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Example with PrunedScan

Logical plan (root first):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        Relation (datasource) Attendees

Physical plan (after planning):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        PhysicalRDD (pruned: age, hometown)

SQL representation:
  Executed by the data source:
    SELECT age, hometown
    FROM attendees
  Executed by Spark:
    SELECT hometown, AVG(age)
    FROM source
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown

Page 15: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Example with PrunedFilteredScan

Logical plan (root first):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        Relation (datasource) Attendees

Physical plan (after planning):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        PhysicalRDD (pruned: age, hometown; filtered: hometown = 'Amsterdam')

SQL representation:
  Executed by the data source:
    SELECT age, hometown
    FROM attendees
    WHERE hometown = 'Amsterdam'
  Executed by Spark:
    SELECT hometown, AVG(age)
    FROM source
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown

Page 16: The Pushdown of Everything by Stephan Kessler and Santiago Mola

How can we improve this?
• There are sources doing more than filtering and pruning
  – aggregation, joins, ...
• Some sources can execute more complex filters and functions
  – Example: SELECT col1 + 1 WHERE col2 + col3 < col4.
• Default Data Sources API cannot push down these things
  – They might be trivial for the data source to execute.
• This leads to unnecessary work
  – Fetching more data
  – Not using optimizations of the source

Page 17: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Enter the Catalyst Source API
• We implemented a new interface that data sources can implement to signal that they can push down complex queries.
• Complexity of pushed down queries is arbitrary
  – functions, set operators, joins, deeply nested subqueries, …
  – (even data source UDFs that are not supported in Spark).

    trait CatalystSource {
      def isMultiplePartitionExecution(relations: Seq[CatalystSource]): Boolean
      def supportsLogicalPlan(plan: LogicalPlan): Boolean
      def supportsExpression(expr: Expression): Boolean
      def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
    }
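
One plausible way a planner could use a check like `supportsLogicalPlan` is sketched below in plain Scala: walk down from the root and push down the first subtree the source accepts, keeping every operator above it in Spark. The toy `Plan` classes, `ToyCatalystSource`, the two example sources and `split` are all hypothetical names for illustration; this is not the authors' planner, and it models only the plan-support check, not the other three methods of the trait.

```scala
object CatalystSourceSketch {
  // Toy plan nodes standing in for Catalyst's LogicalPlan classes.
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Aggregate(groupBy: String, function: String, child: Plan) extends Plan

  // Toy counterpart of CatalystSource, reduced to the plan-support check.
  trait ToyCatalystSource {
    def supportsLogicalPlan(plan: Plan): Boolean
  }

  // Hypothetical source that can run filters but no aggregates.
  object FilterOnlySource extends ToyCatalystSource {
    def supportsLogicalPlan(plan: Plan): Boolean = plan match {
      case Relation(_)        => true
      case Filter(_, child)   => supportsLogicalPlan(child)
      case Aggregate(_, _, _) => false
    }
  }

  // Hypothetical source that accepts any plan, e.g. a full SQL database.
  object FullSqlSource extends ToyCatalystSource {
    def supportsLogicalPlan(plan: Plan): Boolean = true
  }

  // Returns (operators left to Spark, subtree pushed to the source).
  def split(plan: Plan, source: ToyCatalystSource): (List[String], Plan) =
    if (source.supportsLogicalPlan(plan)) (Nil, plan)
    else plan match {
      case Relation(_) => (Nil, plan) // a bare relation can always be scanned
      case Filter(c, child) =>
        val (local, pushed) = split(child, source)
        (s"Filter($c)" :: local, pushed)
      case Aggregate(g, f, child) =>
        val (local, pushed) = split(child, source)
        (s"Aggregate($g, $f)" :: local, pushed)
    }

  val plan: Plan =
    Aggregate("hometown", "AVG(age)",
      Filter("hometown = 'Amsterdam'", Relation("attendees")))
}
```

With `FullSqlSource` the whole plan is pushed down and nothing remains for Spark; with `FilterOnlySource` the filter and relation are pushed down and the aggregate stays in Spark. A source can also return `false` for plans it could run but that would be too expensive, which keeps them in Spark.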

Page 18: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Partitioned and Holistic sources
• Data sources that can compute queries that operate on a holistic data set
  – HANA, Cassandra, PostgreSQL, MongoDB
• Data sources that can compute queries that operate only over each partition
  – Vora, Parquet, ORC, PostgreSQL instances in Postgres XL
• Some can do both (to some degree)
• Our planner extensions allow optimizing push down for both cases if the data source implements the Catalyst Source API.

Page 19: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Partitioned vs. Holistic Sources

[Diagram labels: HDFS; Physical Node (×3); Data Node (×3); Vora Engine (×3); Spark Worker (×4); SAP HANA; PostgreSQL]

Page 20: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Example with CatalystSource (partitioned execution)

Logical plan (root first):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        Relation (datasource) Attendees

Physical plan (after planning):
    Aggregate (hometown, SUM(PartialSum) / SUM(PartialCount))
      PhysicalRDD (CatalystSource)

SQL representation:
  Executed by the data source, per partition:
    SELECT hometown, SUM(age) AS PartialSum, COUNT(age) AS PartialCount
    FROM attendees
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown
  Executed by Spark:
    SELECT hometown, SUM(PartialSum) / SUM(PartialCount)
    FROM source
    GROUP BY hometown
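
The rewrite of AVG into a partial sum and a partial count can be sketched in plain Scala as follows. The case class, function names and the two sample partitions are hypothetical and chosen only for illustration; the ages are made-up values, not data from the attendees table.

```scala
object PartialAggregationSketch {
  // What each partition returns: one partial result per hometown.
  final case class Partial(hometown: String, partialSum: Long, partialCount: Long)

  // Per-partition step, rows given as (age, hometown):
  // SELECT hometown, SUM(age), COUNT(age) ... WHERE hometown = ? GROUP BY hometown
  def partialAggregate(partition: Seq[(Int, String)], town: String): Seq[Partial] =
    partition
      .filter { case (_, hometown) => hometown == town }
      .groupBy { case (_, hometown) => hometown }
      .map { case (hometown, rows) =>
        Partial(hometown, rows.map(_._1.toLong).sum, rows.size.toLong)
      }
      .toSeq

  // Spark-side step:
  // SELECT hometown, SUM(PartialSum) / SUM(PartialCount) ... GROUP BY hometown
  def combine(partials: Seq[Partial]): Map[String, Double] =
    partials.groupBy(_.hometown).map { case (hometown, ps) =>
      hometown -> (ps.map(_.partialSum).sum.toDouble / ps.map(_.partialCount).sum)
    }

  // Hypothetical data split over two partitions (made-up ages).
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((20, "Amsterdam"), (40, "Amsterdam"), (23, "London")),
    Seq((60, "Amsterdam"), (30, "New York"))
  )

  val result: Map[String, Double] =
    combine(partitions.flatMap(p => partialAggregate(p, "Amsterdam")))
}
```

Here the partitions return (60, 2) and (60, 1), which combine to 120 / 3 = 40. Averaging the per-partition averages (30 and 60) would give 45, which is why the sum and the count are pushed down separately instead of AVG itself.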

Page 21: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Example with CatalystSource (holistic source)

Logical plan (root first):
    Aggregate (hometown, AVG(age))
      Filter hometown = 'Amsterdam'
        Relation (datasource) Attendees

Physical plan (after planning):
    PhysicalRDD (CatalystSource)

SQL representation:
  Executed by the data source:
    SELECT hometown, AVG(age)
    FROM attendees
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown

Page 22: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Returned Rows
Assumption: Table Size is Rows

• TableScan / PrunedScan: returns every row of the table
    SELECT name, age, hometown
    FROM attendees
• PrunedFilteredScan: returns only the rows with hometown = 'Amsterdam'
    SELECT age, hometown
    FROM attendees
    WHERE hometown = 'Amsterdam'
• CatalystSource: returns #distinct ‘hometowns’ rows
    SELECT hometown, SUM(age) AS PartialSum, COUNT(age) AS PartialCount
    FROM attendees
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown

Page 23: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Advantages
• A single interface covers all queries.
• CatalystSource subsumes TableScan, PrunedScan, PrunedFilteredScan.
• Fine-grained control of features supported by the data source
• Incremental implementation of a data source possible
  – Start with supporting projects and filters and continue with more
• Opens the door to tighter integration with all kinds of databases.
  – Dramatic performance improvements possible.

Page 24: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Current disadvantages and limitations

• Implementing CatalystSource for a rich data source (e.g., supporting SQL) is a considerably complex task.
• Current implementation relies on (some) Spark APIs that are unstable.
  – Backwards compatibility is not guaranteed.
• Pushing down a complex query could be slower than not pushing it down
  – Examples:
    • it overloads the data source
    • it generates a result larger than its input tables
  – CatalystSource implementors can work around this by marking such queries as unsupported

Page 25: The Pushdown of Everything by Stephan Kessler and Santiago Mola

What are the next steps?
• Improve the API to make it simpler for implementors
  – add utilities to generate SQL,
  – matchers to simplify working with logical plans
• Provide a stable API
  – CatalystSource implementations should work with different Spark versions without modification.
• Provide a common trait to reduce boilerplate code
  – Example: A data source implementing CatalystSource should not need to implement TableScan, PrunedScan or PrunedFilteredScan.

Page 26: The Pushdown of Everything by Stephan Kessler and Santiago Mola

Summary
• Extension of the Data Sources API to push down arbitrary logical plans
• Leveraging functionality of the source to process less data
• Part of SAP HANA Vora
• We will release it as Open Source