
Rapid Adoption of Cloud Data Warehouse Technology Using Datometry Hyper-Q

Lyublena Antova, Derrick Bryant, Tuan Cao, Michael Duller, Mohamed A. Soliman, Florian M. Waas
Datometry, Inc.

[email protected]

ABSTRACT

The database industry is about to undergo a fundamental transformation of unprecedented magnitude as enterprises start trading their well-established database stacks on premises for cloud database technology in order to take advantage of the economics cloud service providers have long promised. Industry experts and analysts expect the next years to prove a watershed moment in this transformation, as cloud databases finally reach critical mass and maturity.

Enterprises eager to move to the cloud face a significant dilemma: while moving the content of their databases to the cloud is a well-studied problem, making existing applications work with new database platforms is an enormously costly undertaking that calls for rewriting and adjusting hundreds if not thousands of applications.

In this paper, we present a next-generation virtualization technology that lets existing applications run natively on cloud-based database systems. Using this platform, enterprises can move rapidly to the cloud and innovate and create competitive advantage in a matter of months instead of years. We describe technology and application scenarios and demonstrate the effectiveness and performance of the approach through actual customer use cases.

CCS CONCEPTS

• Information systems → Database query processing; • Information systems → Mediators and data integration; • Information systems → Data warehouses;

KEYWORDS

Adaptive Data Warehouse Virtualization, Cloud Data Warehouses, Query Processing

ACM Reference Format:
Lyublena Antova, Derrick Bryant, Tuan Cao, Michael Duller, Mohamed A. Soliman, Florian M. Waas. 2018. Rapid Adoption of Cloud Data Warehouse Technology Using Datometry Hyper-Q. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3183713.3190652

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGMOD'18, June 10–15, 2018, Houston, TX, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-4703-7/18/06...$15.00
https://doi.org/10.1145/3183713.3190652

1 INTRODUCTION

Over the course of the next decade, the database market, a $40 billion industry, is about to face fundamental disruption [13]. Cloud databases such as Microsoft SQL Data Warehouse [23], Amazon Redshift [19], Google BigQuery [7], and others promise to be functionally equivalent to their counterparts on premises, yet provide enterprises with unprecedented flexibility and elasticity. All while being highly cost-effective: instead of considerable up-front expenses in the form of hardware and software license costs, cloud databases offer a pay-as-you-go model that reduces databases effectively to a set of APIs. They let users query or manipulate data freely, yet shield them from the burden and headaches of having to maintain and operate their own database.

The prospect of this new paradigm of data management is extremely powerful and well received by IT departments across all industries [2]. However, it comes with significant adoption challenges. Through applications that depend on the specific databases they were originally written for, database technology has over time established some of the strongest vendor lock-in in all of IT. Moving to a cloud database requires adjusting or even rewriting existing applications; migrations may take up to several years, cost millions, and are heavily fraught with risk.

CIOs and IT leaders find themselves increasingly in a conundrum, weighing the benefits of moving to a cloud database against the substantial switching cost. The situation is further complicated by a myriad of parameters and configurations to choose from when moving to the cloud: different systems offer different characteristics and represent technology at different levels of maturity. This further compounds the risk inherent in any re-platforming initiative.

In this paper, we present Adaptive Data Virtualization (ADV), based on a concept originally developed to bridge data silos in real-time data processing [14], and develop a platform approach that solves the problem of adopting cloud databases effectively and at a fraction of the cost of other approaches. ADV lets applications, originally developed for a specific on-premises database, run natively in the cloud, without requiring modifications to the application. The key principle of ADV is to intercept queries as they are emitted by an application and to translate them on the fly from the language used by the application (SQL-A over wire protocol WP-A) into the language provided by the cloud database (SQL-B); see Figure 1. The translation must handle syntactic differences, reconcile semantic differences between the systems, and compensate for missing functionality.

The overwhelmingly positive reception by customers and technology partners alike underlines the benefits of ADV over a conventional migration approach along three dimensions:


Figure 1: Application/database communication before re-platforming (a) and after (b). Applications remain unchanged and continue to use query language SQL-A. In (a), the client application, using dialect SQL-A and an existing ODBC/JDBC connector, sends SQL-A over wire protocol WP-A directly to the on-premises data warehouse and receives query results over WP-A. In (b), Datometry Hyper-Q sits between the unchanged application and a modern data warehouse (on-premises or cloud): it performs real-time translation and workload management, forwards SQL-B over wire protocol WP-B, and converts results back in real time.

1. Short time to value: ADV can be deployed instantly and requires little or no build-out and implementation. This eliminates the time for adjusting, rewriting, or otherwise making existing applications compatible with the new target platform.

2. Low cost: Much of the cost in migration projects is incurred by the manual translation of SQL embedded in applications and subsequent iterations of testing the rewrites. Being a software solution, ADV eliminates both manual interaction and time-consuming and expensive test phases entirely.

3. Reduced risk: Since ADV leaves existing applications unchanged, new database technology may not only be adopted rapidly, but also without having to commit significant resources or time. This reduces the risk to the enterprise substantially, as projects can be fully tested in advance and, in the unlikely event that results do not meet expectations, switching back to the original stack is instant and inexpensive.

In this paper we present use cases and technical underpinnings of ADV and discuss the trade-offs consumers of the technology need to be aware of when deciding when to deploy it. We present Hyper-Q, an industry first, which enables applications to run natively on a different database stack. Hyper-Q supports a variety of source and target databases and is able to automatically translate queries and result sets. We present a case study with several real customer workloads, which exposes the versatility of their queries in terms of query features, and show that Hyper-Q supports all of them out of the box.

Roadmap. The remainder of the paper is organized as follows. Section 2 reviews the state of the art of database re-platforming today and outlines standard procedures. In Section 3 we develop a catalog of desiderata, or requirements, for ADV as a general concept. Section 4 provides a detailed overview of the technology and relevant implementation choices. Sections 5 and 6 describe the powerful query rewriting capabilities of Hyper-Q. Section 7 reports the results of a comprehensive experimental study of using Hyper-Q for both real-world and synthetic workloads. We conclude the paper with a review of related work in Section 8.

2 CONVENTIONAL DATA WAREHOUSE MIGRATION

Migrating or re-platforming a database and its dependent applications is a highly involved process around which an entire industry has been established. For the reader's benefit we detail the intricacies of what is considered state of the art in Appendix A. We focus here on the data warehouse application migration problem.

2.1 Application Migration

Adjusting applications to work with the new database is a multi-faceted undertaking. On a technical level, drivers, connectivity, etc. need to be established. Then SQL, be it embedded or in stand-alone scripts, needs to be rewritten, and, closely related to the previous, enclosing application logic may need to be adjusted accordingly. To demonstrate some of the challenges when translating queries from one SQL dialect to another, consider the following example query in the Teradata SQL dialect:

Example 1

SEL PRODUCT_NAME,
    SALES AS SALES_BASE,
    SALES_BASE + 100 AS SALES_OFFSET
FROM PRODUCT
QUALIFY
    10 < SUM(SALES) OVER (PARTITION BY STORE)
ORDER BY STORE, PRODUCT_NAME
WHERE CHARS(PRODUCT_NAME) > 4;

There are several vendor-specific constructs which make the query non-portable. For example, SEL is a shortcut for SELECT. QUALIFY filters the output based on the result of a window operation. Moreover, the order of the various clauses does not follow the standard: ORDER BY precedes WHERE, which is not accepted


Figure 2: Support for select Teradata features across major cloud databases

by most other systems. Built-in functions and operators, such as computing the length of a string or date arithmetic, frequently differ in syntax across vendors. Another widely used construct not typically available in other systems is the ability to reference named expressions in the same query block. For example, the definition of SALES_OFFSET uses the expression SALES_BASE defined in the same block.
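For readers less familiar with these constructs, the following sketch shows one possible ANSI-compliant rendering of Example 1. It is illustrative only: the table and column names are taken from the example, CHAR_LENGTH is assumed as the portable counterpart of Teradata's CHARS, and the named expression SALES_BASE and the QUALIFY predicate are folded into a derived table.

-- Illustrative ANSI rendering of Example 1 (assumes CHAR_LENGTH
-- replaces CHARS; the window predicate moves to an outer WHERE):
SELECT PRODUCT_NAME, SALES_BASE, SALES_BASE + 100 AS SALES_OFFSET
FROM (
    SELECT PRODUCT_NAME,
           SALES AS SALES_BASE,
           STORE,
           SUM(SALES) OVER (PARTITION BY STORE) AS STORE_SALES
    FROM PRODUCT
    WHERE CHAR_LENGTH(PRODUCT_NAME) > 4
) AS T
WHERE 10 < STORE_SALES
ORDER BY STORE, PRODUCT_NAME;

Even this small query thus requires restructuring rather than simple keyword substitution.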

Figure 2 shows a selection of query features supported by Teradata and frequently used in analytics workloads, contrasted with the percentage of leading cloud databases supporting them as of the time of writing. While this list is not meant to be complete or representative of Teradata workloads, the shown features include the ones that we have encountered while working with customers to re-platform their applications from Teradata to cloud databases. Some of these features are non-standard SQL extensions introduced by Teradata. For example, QUALIFY is a non-standard clause that combines window functions with predicates. Similarly, implicit joins refers to the ability to implicitly join tables not explicitly referenced in the FROM clause.

Others are standard SQL features. However, they are either not implemented at all, or only partially supported by cloud databases. Those include the ability to specify column names in a derived table alias or the MERGE operation, which inserts and/or updates records in the same statement. While query rewriting may address some of these feature gaps, there are many cases where it may not be possible to do so. Some features need extensive infrastructure on the database side to be functional. We discuss how our solution addresses these different scenarios in Section 4.

We distinguish three broad classes of difficulty when it comes to rewriting SQL terms and expressions:

Translation. This class contains fairly simple constructs, such as keywords which differ between systems, slight variations between standard built-in function names, date formats that need adjusting, etc. The required changes are rather straightforward and often highly localized; many can even be addressed with textual substitution using regular expressions.

Transformation. The next category requires a complete understanding of the structure of a query, including proper name resolution and type derivation. Examples of this category include Teradata's QUALIFY clause, non-standard arithmetic operations between data types such as date/integer comparisons, or the use of ordinals instead of column names in GROUP BY or ORDER BY clauses. While a rewrite is often possible, it can often not be determined by looking purely at the syntax, and may involve restructuring the query in a non-local manner, which is difficult and error-prone, specifically in the case of complex queries. In this category we also place constructs that may be syntactically supported as-is on the target system, but have a different behavior, such as the default ordering of NULL in ORDER BY expressions. These pose a particular difficulty as they may go undetected during the migration project: the original SQL code compiles and executes on the new system and may even produce correct results in some cases. However, correctness has been compromised, leading to subtle defects that are hard to spot and correct.
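As a small illustration of this category, consider ordinals in GROUP BY (table and column names below are hypothetical): resolving the ordinal requires name resolution against the select list, which is exactly why a purely textual rewrite falls short.

-- Hypothetical source query using ordinals:
SELECT STORE, PRODUCT_NAME, SUM(SALES)
FROM PRODUCT
GROUP BY 1, 2;

-- Possible rewrite for a target that does not accept ordinals,
-- after resolving them against the select list:
SELECT STORE, PRODUCT_NAME, SUM(SALES)
FROM PRODUCT
GROUP BY STORE, PRODUCT_NAME;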

Emulation. In some cases, the new database system lacks critical functionality, e.g., control flow, which may require pushing parts of the queries to the application layer. This class of features also includes proprietary vendor features which have no direct correspondence in the target system, such as the MERGE statement, querying the system catalog, or informational commands such as HELP SESSION, which returns the settings of the current user session. Those features cannot be rewritten and need to be emulated. Emulation typically uses constructs provided by the new target system as well as state information maintained in the application layer to support these features.

The rewritten SQL expression needs to be placed back into the application, which may lead to further complications: if a single SQL statement leads to a rewrite consisting of multiple statements for the new system, the control flow of the application needs to be adjusted accordingly.
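To illustrate how a single statement can expand into several, the following sketch shows one possible emulation of a simple MERGE (upsert) on a target that lacks the statement. Table and column names are hypothetical, and the UPDATE ... FROM form assumes the target supports it; the application's control flow must now account for two statements instead of one.

-- Original single-statement upsert:
--   MERGE INTO TGT USING SRC ON (TGT.ID = SRC.ID)
--   WHEN MATCHED THEN UPDATE SET AMOUNT = SRC.AMOUNT
--   WHEN NOT MATCHED THEN INSERT (ID, AMOUNT) VALUES (SRC.ID, SRC.AMOUNT);

-- Possible two-statement emulation:
UPDATE TGT
SET AMOUNT = SRC.AMOUNT
FROM SRC
WHERE TGT.ID = SRC.ID;

INSERT INTO TGT (ID, AMOUNT)
SELECT SRC.ID, SRC.AMOUNT
FROM SRC
WHERE NOT EXISTS (SELECT 1 FROM TGT WHERE TGT.ID = SRC.ID);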

2.2 Discussion

The previous sections touched on a number of technical issues and are intended to give readers an overall sense of the complexities of migrations. Talking to practitioners, we noticed there are a number of common misconceptions from a technical point of view; we feel two deserve to be called out explicitly.

2.2.1 Migration Paradox. Paradoxically, the most difficult task in database migrations is actually not the migrating of the database, i.e., the content transfer, but the adjusting of applications. In fact, the transfer of the content of the database is supported by a variety of tools that make the transfer of database content appear simple per se. However, the manual adaptation of applications may require rewriting thousands of lines of SQL, significant portions of the code in which SQL scripts are embedded, etc. Once the magnitude of the changes needed and the risk associated with them is understood, the price tag for a migration appears justified, even when it runs into the millions.

2.2.2 Assumed Independence. Closely related to the “Migration Paradox” is the second most common misconception, which assumes independence between transferring content and adjusting applications. However, transcribing the schema and/or data cannot be dealt with independently from the migration of applications. Consider the case of missing support for data types, for example the


PERIOD type, which describes the start and stop of a time range. Few database systems support it. A simple translation would be breaking it into two separate fields for the two components. However, all queries that reference the original PERIOD field need to be rewritten to deal explicitly with the start and stop fields. More complex constructs such as Global Temporary Tables are effectively impossible to translate. As a consequence, migration tools frequently error out in these situations and leave users with an incomplete translation.
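As an illustrative sketch (table and column names are hypothetical, and END() is Teradata's accessor for the end bound of a PERIOD), the split might look as follows, with every query touching the column rewritten accordingly:

-- Original Teradata DDL:
--   CREATE TABLE PROMO (PROMO_ID INTEGER, VALIDITY PERIOD(DATE));

-- Possible representation on a target without PERIOD support:
CREATE TABLE PROMO (
    PROMO_ID       INTEGER,
    VALIDITY_BEGIN DATE,
    VALIDITY_END   DATE
);

-- A query such as
--   SELECT PROMO_ID FROM PROMO WHERE END(VALIDITY) > DATE '2018-01-01';
-- must then be rewritten against the split columns:
SELECT PROMO_ID FROM PROMO WHERE VALIDITY_END > DATE '2018-01-01';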

3 DESIDERATA FOR DATABASE VIRTUALIZATION

In the following we lay out the requirements such a solution needs to address to be successful. We base our requirements on a variety of use cases we have seen in practice. Those are presented in Section B of the Appendix. We refer to the original database system as DB-A and its language as SQL-A, and to the replacement database as DB-B and SQL-B accordingly. We organize the desiderata in the following groups:

3.1 Functionality

Requirements in this category refer primarily to establishing equivalent behavior between the original and the emulated database.

• Statements in SQL-A need to be translated into zero, one, or more terms using SQL-B that produce an equivalent response. In some cases, the original statement may be eliminated altogether.

• DDL commands that create, drop, or alter the structure of objects need to be supported. DDL that does not have immediately corresponding equivalents may require more elaborate workarounds, e.g., tables originally defined with set semantics may get represented as tables with unique primary keys in systems that do not natively support set semantics (see the sketch after this list).

• Support for a variety of data types, including ODBC types, as well as user-defined types or compound data types, e.g., PERIOD. Those may need to be represented as combinations of basic data types, or serialized as strings; in some cases additional metadata may need to be stored to capture the full semantics.

• Support for native wire protocols is needed to ensure the original applications do not need to be modified or altered. This solves the large-scale problem of having to replace tens if not hundreds of drivers and connectors across the enterprise.

• Representation and translation of procedural constructs such as calling of functions or stored procedures is required in order to support a large number of established workloads. In the context of this work, we focus on call semantics only and assume DB-B does have appropriate support for representing functions. Research on emulation of procedures and functions is outside the scope of this paper. See also Section 9.
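The following sketch illustrates the set-semantics workaround mentioned in the DDL item above; the table and column names are hypothetical, and the exact mechanism (a unique primary key versus deduplicating inserts) depends on what the target system actually enforces.

-- Original Teradata DDL; a SET table silently discards duplicate rows:
--   CREATE SET TABLE CUSTOMER (CUST_ID INTEGER, NAME VARCHAR(100));

-- Possible target DDL approximating set semantics with a primary key:
CREATE TABLE CUSTOMER (
    CUST_ID INTEGER PRIMARY KEY,
    NAME    VARCHAR(100)
);

-- Where the target does not enforce uniqueness, inserts can instead be
-- rewritten to deduplicate explicitly:
INSERT INTO CUSTOMER (CUST_ID, NAME)
SELECT DISTINCT S.CUST_ID, S.NAME
FROM STAGING S
WHERE NOT EXISTS (SELECT 1 FROM CUSTOMER C WHERE C.CUST_ID = S.CUST_ID);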

The above list is akin to a task list for migrating manually between databases: the underlying gaps in expressivity need to be addressed with workarounds and implementations. We did not list correctness, as we consider it a basic, non-negotiable requirement.

3.2 Operational Aspects

For ADV to meet the standards of modern IT, a number of operational properties must be established. These pertain to questions around efficiency, safe operations, and integration with existing adjacent systems.

• Latency and throughput must not be impacted significantly by ADV's presence in the data path. Most virtualization technologies incur a small but measurable overhead; in compute virtualization, for example, an overhead of 5-8% is typical. It stands to reason that ADV should have a similar impact.

• Optimality is desirable, i.e., queries translated as a result of ADV must not perform substantially slower or be in any other way suboptimal when compared to a hand-crafted translation. We note that this property is hard to define and even harder to measure, but we list it here for completeness as it does come up as a customer concern in practice.

• Integration with the existing IT ecosystem is required, including authentication mechanisms and systems such as Kerberos, Active Directory, or LDAP; monitoring infrastructure; deployment frameworks; load balancers; and mechanisms for fault tolerance and high availability.

• Scalability to satisfy concurrency and throughput requirements is necessary. A successful design needs to provide the ability to scale horizontally, by replicating instances of the ADV solution across independent machines, as well as vertically, by leveraging additional CPU cores transparently.

This second category of desiderata puts emphasis on integration, and some of the points above are clearly dependent on DB-B, the database that is chosen as a replacement.

In summary, the desiderata above are not meant as a strict requirements analysis but rather as guidelines for development. In working with customers, we have come to understand that, depending on the system and application scenarios at hand, some of these are negotiable, while others may be hard requirements.

4 HYPER-Q PLATFORM

In this section, we provide a technical deep dive into the architecture and capabilities of Hyper-Q and how its design addresses the desiderata outlined in Section 3. Hyper-Q automates database functionality mismatch discovery as well as schema and data transfer across databases. However, we focus in this section on the application re-platforming aspect.

Hyper-Q intercepts the network traffic between applications and databases and translates the language and communication protocol on the fly to allow applications to run seamlessly on a completely different database. Figure 3 shows the Hyper-Q architecture. Applications are typically built on top of database libraries, including ODBC/JDBC drivers as well as native database vendor libraries such as Postgres' libpq. These libraries abstract the details of database communication from the application by providing simple APIs to submit query requests and consume responses.

Hyper-Q intercepts the incoming traffic from the application and rewrites it to match the expectations of a completely different on-premises or in-the-cloud database. When an application request is received, it gets mapped in real time to equivalent requests that


Figure 3: Architecture of Hyper-Q. The components shown include the Gateway Manager and Protocol Handler facing the applications and their native DB libraries; the Cross Compiler consisting of Parser, Binder (Algebrizer), Transformer, and Serializer operating on AST and XTRA representations; the ODBC Server connecting to the target database and returning results in TDF; and the Result Converter and Result Store, which convert results into the original database's binary format.

the new database can comprehend and process. After processing is done, Hyper-Q also does the reverse translation, where the database query response is converted into the binary format of the original database. This is crucial to allow applications to work as expected by providing query results that are bit-identical to the original database.

An important aspect of the Hyper-Q architecture is its ability to rapidly incorporate support for new frontend and backend systems. This allows the system to be easily extended without disrupting the existing functionality. Applications written for a new frontend system A are supported by adding language and wire-protocol parsers to allow Hyper-Q to communicate with the application and translate its commands. Once system A is supported, Hyper-Q can run A's applications against all supported backend systems.

A new backend system B is supported by adding a language serializer and B's ODBC driver to allow Hyper-Q to create commands that conform to B's language, and to receive results after execution. Once system B is supported, Hyper-Q can run applications written for any supported frontend system against system B.

By relying on its powerful abstraction of database system touchpoints, Hyper-Q does not need to implement the full support matrix to re-platform M frontends against N backends. As we show in [14], Hyper-Q is also able to run non-SQL applications against MPP data warehouses. We further extend the application/database surface here by showing how we run legacy SQL applications against modern cloud data warehouses.

In the next subsections, we describe the details of query and result translation inside Hyper-Q.

4.1 Protocol Handler

Network traffic between application and database includes an exchange of network messages designed according to the original database's specifications. Different constructs need to be implemented to emulate the native communication protocol of the original database. This includes the authentication handshake to establish a secure connection between the application and the database, network

message types and binary formats, as well as the representation of different query elements, data types, and query responses. Hyper-Q implements these details, providing applications with a bit-identical interface to data even though the application effectively runs on a different target database system.

The Protocol Handler component of Hyper-Q is responsible for intercepting the network message flow submitted by the application, extracting important pieces of information on the fly, e.g., application credentials or payloads of application requests, and passing this information down to the Hyper-Q engine for further processing. When a response is ready to be sent back to the application, the Protocol Handler component packages it into the binary message format expected by the application. This completely abstracts the application-database communication details from the rest of the Hyper-Q components.

The Protocol Handler component provides native support for communication protocols of different flavors, including ODBC/JDBC protocols, as well as native database communication libraries. For historical reasons, widely different protocol implementations may be introduced at different points in time by the original database to deal with new client types. In many cases, database clients become non-functional with the slightest difference in behavior of the database server. Emulating the protocol by producing network traffic identical to that of the original database is therefore crucial to keep clients functioning. The message exchange with the new database becomes bit-identical to the message exchange with the original database.

4.2 Algebrizer

The statements submitted by an application are expressed according to the query language of the original database. Even when an application is built using standard ODBC/JDBC APIs, the SQL text submitted through these APIs is specified according to the syntax and feature set of the original database.


As mentioned above, the original database language (or SQL dialect) is typically widely different from the query language supported by the new database. The queries embedded in the application logic would thus almost certainly be broken if executed without changes on a new database. Query rewriting is necessary to port applications written for the original database to the new target database.

The Algebrizer component in Hyper-Q is a system-specific plugin implemented according to the language specifications of the original database. It includes a rule-based parser that implements the full query surface of the original database and a universal, language-agnostic query representation called eXtended Relational Algebra (XTRA), used to capture different query constructs. Query algebrization is performed in two phases: (1) parsing the incoming request into an Abstract Syntax Tree (AST) capturing the high-level query structure, and (2) binding the generated AST into an XTRA expression, where more involved operations such as metadata lookup and query normalization are performed.

4.3 Transformer

XTRA is a powerful representation that lends itself to performing several query transformations transparently for query correctness and performance purposes. The Transformer component is the driver responsible for triggering different transformation rules under given pre-conditions. The transformations are pluggable components that can be shared across different databases and application requests. For example, the original database may allow direct comparison of two data types by implicitly converting one to the other. The new database may not have the same capability, and would thus require the conversion to be spelled out explicitly in the query request. In this case, a transformation that converts a comparison of two data types into an equivalent comparison after explicitly converting one of the types to the other needs to be applied.
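As a hedged illustration of such a correctness transformation (table and column names are hypothetical): if the source database implicitly compares an integer column with a character literal while the target does not, the conversion is spelled out explicitly.

-- Accepted by the source database via implicit conversion:
--   SELECT * FROM ORDERS WHERE ORDER_NO = '42';

-- Possible rewrite for a target that requires an explicit cast:
SELECT * FROM ORDERS WHERE ORDER_NO = CAST('42' AS INTEGER);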

Transformations can also be used to improve the performance of generated queries. For example, if the target database incurs a large overhead in executing single-row DML requests, a transformation that groups a large number of contiguous single-row DML statements into one large statement could be applied.
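For instance, a run of contiguous single-row INSERT statements could be grouped into one multi-row statement, assuming the target supports multi-row VALUES lists (table name hypothetical):

-- Original request stream:
--   INSERT INTO SALES VALUES (1, 10.0);
--   INSERT INTO SALES VALUES (2, 20.0);
--   INSERT INTO SALES VALUES (3, 30.0);

-- Possible batched form sent to the target:
INSERT INTO SALES VALUES (1, 10.0), (2, 20.0), (3, 30.0);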

Controlling which transformations are triggered for a given query request is done by maintaining, for each target database system, a map associating different XTRA operators with their corresponding transformations. Given an algebrized XTRA expression that includes these operators, the Transformer automatically triggers the relevant transformations to generate a new XTRA expression after applying the transformation logic. Transformations can be cascaded, where the output of one transformation represents a valid input to another transformation. The Transformer takes care of running all relevant transformations repeatedly until reaching a fixed point, where no further modifications to the XTRA expression via transformations are possible.

4.4 Serializer

The synthesis of the target queries is implemented using the Serializer component of Hyper-Q. Each target database has its own Serializer implementation. These different serializers share a common interface: the input is an XTRA expression, and the output is the serialized SQL statement of that XTRA. Serialization takes place by walking through the XTRA expression, generating a SQL block for each operator, and then formatting the generated blocks according to the specific keywords and query constructs of the target database.

4.5 ODBC Server

The ODBC Server component is an abstraction of ODBC APIs that allows Hyper-Q to communicate with different target database systems using their corresponding ODBC drivers. The APIs provide means to submit different kinds of requests to the target database for execution, ranging from simple queries/DML requests to multi-statement requests, parameterized queries, and stored procedure calls.

The results of these requests are retrieved by the ODBC Server on demand in one or more batches depending on the result size. Result batches are packaged according to Hyper-Q's binary data representation, called Tabular Data Format (TDF), which is designed to be an extensible binary format that is able to handle arbitrarily large nested data. By relying on the TDF abstraction, the ODBC Server provides means for query result retrieval in a number of non-straightforward scenarios, including very wide rows and extremely large result sets.

4.6 Result Converter

The query results in TDF representation cannot be consumed directly by the application, since the application expects query results to be formatted in a specific way, as defined by the binary representation of the original database. Hyper-Q needs to do on-the-fly result conversion from TDF into the binary representation the application expects. The Result Converter is the driver for this operation.

TDF packets are unwrapped by the Result Converter to extract result rows and convert them into the binary format of the original database. This conversion happens in parallel by starting a number of processes, where each process handles the conversion of a subset of the result rows. If the original database allows streaming query results to the application, the converted results are streamed directly to the Protocol Handler, which packages them into network messages to be sent back to the application.

Alternatively, the original database may not allow streaming the results. For example, some databases require that the total number of results is sent to the application before sending any actual result. In this case, the Result Converter needs to buffer all result rows until they are fully consumed from the target database. When the result size is very large, the buffered results may not fit in memory. In this case, the Result Converter spills the buffered results to disk and maintains the set of generated spill files until result consumption is done. At this point, the converted results are sent to the Protocol Handler to be packaged into messages returned to the application.

5 QUERY REWRITING

In this section, we walk through the architecture of Hyper-Q to illustrate its powerful query rewriting capabilities.


For illustration, consider a query that finds SALES records with specific dates that satisfy a given criterion on the AMOUNT column, with returned records having one of the top 10 AMOUNT values while preserving ties.

Example 2

SEL *
FROM SALES
WHERE
    SALES_DATE > 1140101
    AND (AMOUNT, AMOUNT * 0.85) > ANY (SEL GROSS, NET
                                       FROM SALES_HISTORY)
QUALIFY RANK(AMOUNT DESC) <= 10;

The previous query uses a number of proprietary features introduced by Teradata that do not exist in cloud databases:

• Comparison between DATE and INT data types. This can be specified explicitly in Teradata SQL since dates are represented internally as integers. For example, '1140101' is the integer representation of the date value '2014-01-01'.

• Vectorized subquery construct that assumes the following comparison semantics: (AMOUNT, AMOUNT * .85) > (GROSS, NET) ⇐⇒ (AMOUNT > GROSS) ∨ (AMOUNT = GROSS ∧ AMOUNT * .85 > NET). That is, the construct retains all sales with amounts exceeding any gross amount in the sales history, such that ties are broken using the net sales values.

• Window function extensions QUALIFY and RANK. QUALIFY combines the evaluation of window functions and comparison expressions, while RANK is a variation of the ANSI RANK window function, where the order expression is given as a function argument.

Query rewriting and normalization take place at different stages depending on the type of rewrite that needs to be carried out. In the next subsections, we illustrate the different query rewriting stages using Example 2. Some features lend themselves to implementation in multiple possible components. While there are no strict rules and the framework allows certain flexibility when choosing where to implement a rewrite, we base our decisions on the following guidelines:

• Features that exist only in the source system's language tend to be rewritten in either the parser or the binder.

• Names of otherwise standard features can be dealt with in the system-specific serializer (e.g., int8 vs. bigint, or dateadd vs. add_date).

• A feature natively supported by multiple systems often deserves its own XTRA operator, even if it can be expressed with existing ones.

5.1 Query Rewriting during Parsing

The parser uses different grammar rules to cover the query surface of the input language. The query surface includes standard features as well as proprietary features introduced exclusively by the source database system. Simple query rewriting/normalization operations can be done during the parsing phase. Special keywords are normalized and mapped to standard query clauses. For example, the SEL

+-td_qualify
|-ansi_select
| |-ansi_get(SALES)
| +-ansi_boolexpr(AND)
| |-ansi_cmp(GT)
| | |-td_ident(SALES_DATE)
| | +-ansi_const(1140101)
| +-ansi_subq(ANY, GT, [GROSS, NET])
| |-ansi_get(SALES_HISTORY)
| +-ansi_list
| |-td_ident(AMOUNT)
| +-ansi_arith(*)
| |-td_ident(AMOUNT)
| +-ansi_const(0.85)
+-ansi_cmp(LTE)
|-td_rank(AMOUNT, DESC)
+-ansi_const(10)

Figure 4: Generated AST for Example 2

keyword is a shortcut for the SELECT keyword and is used to define the start of a query block. These rewrites belong to the category of Syntactic Rewrites described in Section 2.

The parser generates an AST composed of a mix of generic and specific parse nodes that delineate all characteristics of the input query. The generic nodes are designed to capture standard ANSI SQL query elements, e.g., the WHERE clause, whereas the specific nodes capture vendor-specific query elements, e.g., the Teradata QUALIFY clause, that deviate from or extend the ANSI SQL standard. Each AST node is associated with a code module that contains the binding implementation to convert the AST node into a corresponding XTRA node. These code modules are typed and versioned to allow customizing the Hyper-Q parser to process queries of specific versions of the source database language. The binding implementation of generic parse nodes can be shared across other supported source systems that provide the same construct. This is important to allow Hyper-Q to add support for new source systems by building on the existing implementation of standard features, and adding new implementations only for the features that deviate from the standard.

Figure 4 shows the generated AST of the query in Example 2. Some of the AST nodes are generic ANSI nodes, e.g., ansi_select, which captures the standard WHERE clause, while other nodes are vendor-specific nodes, e.g., td_qualify, which captures the Teradata QUALIFY clause.

After the parsing stage is complete, the remaining query rewriting operations can either be Semantic Rewrites, where a transformation is needed to modify and/or re-order query elements to preserve the semantics of different constructs, or Emulation, where we need to break down query elements into smaller components and maintain the state of query execution inside the virtualization layer. We discuss Semantic Rewrites in Sections 5.2 and 5.3, and Emulation in Section 6.

5.2 Query Rewriting during Binding

While binding the AST into XTRA, certain transformations can take place to normalize special constructs found in the input query. One example is the comparison between date and integer data types in Teradata SQL. Teradata represents dates internally as integers. This


+-select
|-window(RANK, DESC, AMOUNT)
| +-select
| |-get(SALES)
| +-boolexpr(AND)
| |-comp(GT)
| | |-arith(+)
| | | |-extract(DAY, SALES_DATE)
| | | |-arith(*)
| | | | |-extract(MONTH, SALES_DATE)
| | | | +-const(100)
| | | +-arith(*)
| | | |-arith(-)
| | | | |-extract(YEAR, SALES_DATE)
| | | | +-const(1900)
| | | +-const(10000)
| | +-const(1140101)
| +-subq(ANY, GT, [GROSS, NET])
| |-get(SALES_HISTORY)
| +-list
| |-ident(AMOUNT)
| +-arith(*)
| |-ident(AMOUNT)
| +-const(0.85)
+-comp(LTE)
|-ident(AMOUNT)
+-const(10)

Figure 5: Generated XTRA for Example 2 after algebrization

is not the case with other database systems. Comparisons between dates and integers are thus not directly portable to other database systems. In order to port this construct, the comparison needs to trigger an implicit conversion of the date side of the comparison into an integer value. Since the integer representation of dates is unique to the Teradata system and does not exist in any of the target cloud databases, applying this rewrite as early as possible is important to create a normalized representation of the query before further processing takes place. Binding is an appropriate stage for such transformations since the rewrite does not require knowledge of the target database system that will eventually execute the transformed query.

The comparison rewrite operation is triggered by the Transformer component. The Transformer component manages the application and execution of different transformations until reaching a fixed point. Each XTRA node can be associated with one or more transformations. In this case, an XTRA comp node that captures the comparison is associated with a transformation module comp_date_to_int that expands the date side of the comparison into the following arithmetic expression DAY + (MONTH * 100) + (YEAR - 1900) * 10000, which gives the equivalent integer value; for example, the date 2014-01-01 maps to 1 + 1 * 100 + (2014 - 1900) * 10000 = 1140101.

The full generated XTRA expression after the binding phase is given in Figure 5, which is a normalized relational algebra expression that captures the input query in Example 2. XTRA builds on a uniform algebraic model, where the output of a given operator depends on the operator's inputs as well as the operator's type. Different constructs in the query body are captured using different XTRA operators with well-defined semantics.

+-select
|-window(RANK, DESC, AMOUNT)
| +-select
| |-get(SALES)
| +-boolexpr(AND)
| |-comp(GT)
| | |-arith(+)
| | | |-extract(DAY, SALES_DATE)
| | | |-arith(*)
| | | | |-extract(MONTH, SALES_DATE)
| | | | +-const(100)
| | | +-arith(*)
| | | |-arith(-)
| | | | |-extract(YEAR, SALES_DATE)
| | | | +-const(1900)
| | | +-const(10000)
| | +-const(1140101)
| +-subq(EXISTS)
| +-select
| |-remap consts: (1)
| | +-get(SALES_HISTORY 'S2')
| +-boolexpr(OR)
| |-comp(GT)
| | |-ident(S1.AMOUNT)
| | +-ident(S2.GROSS)
| +-boolexpr(AND)
| |-comp(EQ)
| | |-ident(S1.AMOUNT)
| | +-ident(S2.GROSS)
| +-comp(GT)
| |-arith(*)
| | |-ident(S1.AMOUNT)
| | +-const(0.85)
| +-ident(S2.NET)
+-comp(LTE)
|-ident(AMOUNT)
+-const(10)

Figure 6: Final XTRA for Example 2

5.3 Query Rewriting during Serialization

For some target databases, the generated XTRA in Figure 5 cannot be used directly to generate SQL in its current form. The reason is that such target databases do not understand the vector comparison in the subquery construct (AMOUNT, AMOUNT * 0.85) > ANY (SEL GROSS, NET FROM SALES_HISTORY).

This transformation is system-specific, since it is designed to match the capabilities of a particular target database system, and hence it needs to be triggered right before serialization of the final XTRA into SQL. For target database systems that support vector comparisons in subqueries, this transformation would not be triggered. This is why it cannot be applied earlier in the algebrization stage and has to be deferred to the serialization stage. The transformation detects patterns where a subq node involves a vector comparison and replaces such a subquery pattern with a semantically equivalent existential correlated subquery.

The final XTRA is shown in Figure 6, which is now ready to be serialized into SQL compatible with the target database, as given by the following query:

Example 3

SELECT T.AMOUNT, T.SALES_DATE
FROM (
    SELECT *,
           RANK() OVER (ORDER BY AMOUNT DESC) AS R
    FROM SALES S1
    WHERE EXTRACT(DAY FROM SALES_DATE) +
          EXTRACT(MONTH FROM SALES_DATE) * 100 +
          (EXTRACT(YEAR FROM SALES_DATE) - 1900) * 10000 > 1140101
      AND EXISTS (
          SELECT 1
          FROM SALES_HISTORY S2
          WHERE (S1.AMOUNT > S2.GROSS)
             OR (S1.AMOUNT = S2.GROSS AND S1.AMOUNT * 0.85 > S2.NET)
      )
) AS T
WHERE R <= 10;

We include in Figure 11 of the Appendix the full rewrite of all constructs in Example 2 for porting the given Teradata query to a target cloud database system. Hyper-Q is a principled framework that can handle many other query rewriting scenarios using its powerful transformation engine.

6 EMULATION

The features used by queries in client applications may be completely missing in the target database. Sophisticated data warehouse features (e.g., stored procedures, recursive queries, and updatable views) that are missing in target databases are still supported in Hyper-Q by emulation. For these features, straightforward query rewriting is not a viable option. Hyper-Q breaks down these sophisticated features into smaller units such that running these units in combination gives the application exactly the same behavior as the original feature. We describe how emulation of missing features is enabled by Hyper-Q using recursive queries as an example.

Example 4 Given an employee relation EMP(EMPNO, MGRNO), the following recursive query computes all employees reporting either directly or indirectly to employee e10:

WITH RECURSIVE REPORTS (EMPNO, MGRNO) AS
(
    SELECT EMPNO, MGRNO
    FROM EMP
    WHERE MGRNO = 10
    UNION ALL
    SELECT EMP.EMPNO, EMP.MGRNO
    FROM EMP, REPORTS
    WHERE REPORTS.EMPNO = EMP.MGRNO
)
SELECT EMPNO FROM REPORTS ORDER BY EMPNO;

Figure 7 shows hierarchical employee-manager sample data for the EMP relation: {(e1,e7), (e7,e8), (e8,e10), (e9,e10), (e10,e11)} □

Figure 7: Emulating recursive query in Example 4

Recursive queries are introduced in the SQL standard as a way to handle queries over hierarchical data. By introducing a common table expression that includes self-references (e.g., REPORTS.EMPNO inside the definition of REPORTS), the database system is required to run the query recursion until reaching a fixed point, where further recursion does not introduce any new results.

When query recursion is not natively supported by the target database, there is no direct translation that could be utilized to generate a single SQL request equivalent to a recursive query. However, a deeper analysis shows that the computation of recursive queries relies on simpler constructs that might be readily available in the target database. In particular, using a sequence of temporary table operations, recursion can be fully emulated.

Hyper-Q emulates query recursion using two temporary tables: TempTable, which stores the results of the current recursive step, and WorkTable, which stores the full results of previous recursive steps. At each recursive step, the contents of TempTable are appended to WorkTable. Recursion ends when TempTable contains no results (a SQL sketch of the emitted statements follows the steps below).

Figure 7 shows the sequence of temporary table operations performed by Hyper-Q to emulate the execution of the previous recursive query:

(1) Initialize both WorkTable and TempTable with the results of the seed expression, which is the first UNION ALL input: {(e8, e10), (e9, e10)}.

(2) Execute the recursive expression by joining EMP with TempTable. Append the results {(e7, e8)} to WorkTable, which becomes {(e8, e10), (e9, e10), (e7, e8)}. TempTable = {(e7, e8)}. Recursion needs to continue.

(3) Execute the recursive expression by joining EMP with TempTable. Append the results {(e1, e7)} to WorkTable, which now becomes {(e8, e10), (e9, e10), (e7, e8), (e1, e7)}. TempTable = {(e1, e7)}. Recursion needs to continue.

(4) Execute the recursive expression by joining EMP with TempTable. The result is empty, so recursion stops. Drop TempTable.

(5) Substitute references to REPORTS with WorkTable in the main query, and execute the modified query SELECT EMPNO FROM WorkTable ORDER BY EMPNO;.

(6) Return the results of the main query to the client application. Drop WorkTable.
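A hedged SQL sketch of the request sequence follows; the temporary-table syntax is generic and the exact DDL varies by target system. The loop over steps (2)-(4) is driven by the virtualization layer, which inspects the number of rows produced by each recursive step.

-- Step 1: seed both tables with the first UNION ALL input.
CREATE TEMPORARY TABLE WorkTable AS
    SELECT EMPNO, MGRNO FROM EMP WHERE MGRNO = 10;
CREATE TEMPORARY TABLE TempTable AS
    SELECT EMPNO, MGRNO FROM EMP WHERE MGRNO = 10;

-- Steps 2-4: issued repeatedly until the recursive expression
-- produces no new rows.
CREATE TEMPORARY TABLE StepTable AS
    SELECT EMP.EMPNO, EMP.MGRNO
    FROM EMP, TempTable
    WHERE TempTable.EMPNO = EMP.MGRNO;
DELETE FROM TempTable;
INSERT INTO TempTable SELECT EMPNO, MGRNO FROM StepTable;
INSERT INTO WorkTable SELECT EMPNO, MGRNO FROM StepTable;
DROP TABLE StepTable;

-- Steps 5-6: once recursion terminates, answer the main query from
-- WorkTable and clean up.
SELECT EMPNO FROM WorkTable ORDER BY EMPNO;
DROP TABLE TempTable;
DROP TABLE WorkTable;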


Customer   Sector   Total (Distinct) Number of Queries
1          Health   39731 (3778)
2          Telco    192753 (10446)

Table 1: Overview of customers and workloads

In the previous example, Hyper-Q generates multiple query requests and maintains the state of the recursion by inspecting the results of these requests. From the application's standpoint the behavior remains the same: the results of the recursive query are obtained exactly as they would be from the original database system.

While the technique described in this section relies on temporary table operations to provide the required recursion behavior, this is not necessarily needed in other emulation or query rewriting scenarios. For example, emulation of stored procedures inside Hyper-Q requires only maintaining the execution state (e.g., variable scopes) and driving the procedure execution by breaking its control flow into multiple SQL requests.
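To make this concrete, the sketch below shows, for a hypothetical procedure, how its control flow might be broken into individual SQL requests; the virtualization layer evaluates the condition itself and keeps any variable state in the mid-tier.

-- Hypothetical source procedure called by the application:
--   CREATE PROCEDURE ADJUST_PRICE (IN P_ID INTEGER)
--   BEGIN
--     IF (SELECT COUNT(*) FROM SALES WHERE PRODUCT_ID = P_ID) = 0 THEN
--       UPDATE PRODUCT SET PRICE = PRICE * 0.9 WHERE PRODUCT_ID = P_ID;
--     END IF;
--   END;

-- Possible request sequence for CALL ADJUST_PRICE(42) on a target
-- without stored procedures:
SELECT COUNT(*) FROM SALES WHERE PRODUCT_ID = 42;  -- evaluate the IF condition
-- issued only if the previous result was 0:
UPDATE PRODUCT SET PRICE = PRICE * 0.9 WHERE PRODUCT_ID = 42;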

Section C in the Appendix gives a high-level overview of the implementation of multiple features in Hyper-Q.

7 EXPERIMENTS

In this section we report on a variety of findings when running Hyper-Q on both real-world customer and synthetic workloads.

7.1 Customer Workload Study

The first study shows the extent of the application migration problem by exploring the various characteristics of two real customer workloads that make them non-portable across systems. Table 1 gives an overview of the selected customers and workloads; in both cases the customers were moving from a traditional on-premises vendor to a cloud database.

We instrumented Hyper-Q's query rewrite engine to track a selection of 27 commonly used non-standard features observed in customer workloads from each of the three categories presented in Section 2.1 (translation, transformation, and features that require emulation in the mid-tier); we chose 9 features of each class. While this feature list is not meant to be exhaustive, it gives a good starting point to study the diversity of customer workloads, and is a good indicator of how involved a migration project can be if that project were done manually. As a matter of fact, tracking more features would expose even more intricacies that need to be addressed when porting the application logic to a new database stack.

Figure 8 (a) shows how many of the tracked features appear at least once in each of the customer workloads. One can see that each class of features is well represented. In particular, Workloads 1 and 2 have more than 77% and 66%, respectively, of the 9 tracked features that require transformation.

(a) Percentage of tracked features contained in each workload:
    Translation:    Workload 1: 55.6%   Workload 2: 22.2%
    Transformation: Workload 1: 77.8%   Workload 2: 66.7%
    Emulation:      Workload 1: 33.3%   Workload 2: 33.3%

(b) Percentage of queries affected by each feature class:
    Translation:    Workload 1: 1.4%    Workload 2: 0.2%
    Transformation: Workload 1: 33.6%   Workload 2: 4.0%
    Emulation:      Workload 1: 0.2%    Workload 2: 79.1%

Figure 8: Customer workload characteristics.

In Figure 8 (b) we show what percentage of distinct queries in each workload are affected by each of the tracked classes of rewrites. Within each class a query is counted at most once, even if it has more than one of the tracked features of that class, but a query may belong to two different rewriting categories, for example if it contains both a syntactic feature as well as one that requires a semantic transformation. While the two customer workloads have different compositions of features and classes, a key observation is that very few differences are due to keyword translation. The majority of queries require more involved rewrites: about one third of the queries in the first workload require one or more semantic transformations, and almost 80% of the second workload requires emulation. The latter is due to the fact that Customer 2 has chosen to wrap a large portion of their business logic in macros (named parametrized sequences of SQL statements supported by the original but not by the new database), and queries simply call these macros with different parameters.
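For illustration, the sketch below shows a hypothetical Teradata-style macro of the kind Customer 2 relies on; the macro name, its body, and the parameter are invented, but the structure (a named, parametrized sequence of SQL statements invoked via EXEC) follows the description above. Since the target database has no macro object, Hyper-Q stores the definition and expands each call into the underlying statements with the arguments substituted:

    -- Hypothetical macro definition in the source dialect:
    --   CREATE MACRO daily_sales (d INTEGER) AS (
    --     SELECT SALES_DATE, SUM(AMOUNT)
    --     FROM SALES
    --     WHERE SALES_DATE = :d
    --     GROUP BY SALES_DATE;
    --   );
    --
    -- Application request:
    --   EXEC daily_sales (1140101);
    --
    -- Query emitted by Hyper-Q to the target database:
    SELECT SALES_DATE, SUM(AMOUNT)
    FROM SALES
    WHERE SALES_DATE = 1140101
    GROUP BY SALES_DATE;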

The results are consistent with what we have observed in other customer workloads. In summary, this experiment demonstrates that:

• Since very few of the detected features are syntactic in nature, a purely textual replacement-based solution will not work in practice.

• Query workloads often have multiple features that require a non-local rewrite or even emulation, and a large number of queries are affected by these. This suggests that a manual workload migration project will be very costly.

• Hyper-Q handles all those features automatically.

7.2 Overhead of Hyper-Q

In this section we provide a performance evaluation of our framework, and show that Hyper-Q introduces negligible overhead to the execution of analytical queries in a cloud database.


For this experiment we ran the 22 queries of the TPC-H benchmark¹ on 1 TB of data stored in one of the leading cloud databases. The cloud database was provisioned as a 2-node cluster, where each node had 32 virtual cores, 244 GiB of main memory and 2.56 TB of SSD storage. We used Teradata's bteq client to submit queries to Hyper-Q, and we cleared the database cache before every execution.

For each query we measured the following times:

• Query translation time: Time spent by Hyper-Q to translate the query from the original SQL dialect to the SQL dialect used by the cloud database, including parsing, binding, backend-specific transformations and emitting the final query into the target language

• Execution time: Time taken by the cloud database to execute the query and return the results

• Result transformation time: Time taken by Hyper-Q to transform the results to Teradata's native format

Figure 9(a) shows that the total overhead introduced by Hyper-Q with respect to the end-to-end execution time is less than 2%, with around 0.5% for query translation and around 1% for result transformation. While most analytical queries return small aggregated results, we are nevertheless working on improving our data processing pipeline, including building an ODBC bridge to avoid redundant marshalling of data types in cases where the original application uses ODBC as well.

7.3 Stress Test

This experiment mimics a real-world application scenario for a Fortune 10 customer that used Hyper-Q to re-platform applications from Teradata to a leading cloud data warehouse. A customer application starts ten simultaneous sessions to connect to the underlying database. Each session continuously sends queries through Hyper-Q. The cloud data warehouse was configured to use the most powerful available specifications, and was manually tuned to achieve the highest possible performance for the workload. The test lasted for 17.5 hours, submitting a total of 967,506 queries capturing the customer's peak load for generating analytical reports. The submitted queries had a rich set of features, including many of the features in Figure 2, such as the QUALIFY clause, named expressions, implicit joins and vector subqueries.

To mimic that scenario, we used a setup similar to the one in Section 7.2. However, instead of using a single Teradata bteq client, we used ten clients, each repeatedly sending TPC-H queries through Hyper-Q for execution on the cloud data warehouse. We let the experiment run for several hours and then collected the aggregated query translation time, execution time and result transformation time.

Figure 9(b) shows that in our stress test scenario, the Hyper-Q overhead is a tiny fraction of the end-to-end execution time. In particular, the total overhead of Hyper-Q is between 0.1% and 0.2% of the total execution time. This is due to the fact that while the query execution time increases substantially with the level of query concurrency, Hyper-Q only introduces a small constant overhead per query.

¹ http://www.tpc.org/tpch/

(a) Aggregated elapsed time for single sequential run
(b) Aggregated elapsed time for concurrent stress test

Figure 9: Overhead of Hyper-Q below 2% and 0.3% of total query response time, respectively. Each chart breaks down the elapsed time into execution, result transformation, and query translation.

8 RELATED WORK

The problem of database migration is as old as databases themselves. Naturally, a number of approaches, systems and strategies have been developed over the past decades.

The prevalent technique is manual migration, often executed by third-party consultants. Any manual migration is inherently non-scalable, i.e., the number of person-years required is proportional to the number of applications and lines of SQL code.

Database Migration Utilities. A variety of proprietary tools have been developed in this space to aid in the otherwise manual process [3, 5, 8]. These tools primarily address the content transfer and as such address the lesser of the actual challenges. Several of these systems also provide general SQL rewrite capabilities, as needed to rewrite certain schema features such as views, as mechanisms to rewrite application logic. They are advertised as aiding in manual migrations but do not claim to be fully automated.

Remote Query Extensions. Instead of addressing the challenges of migrations, a class of database or database-like systems exists that effectively reduces the need for migrations by making other databases accessible through the original database. That way, a new database can be integrated as a subordinate database. This kind of technology has long existed in most commercial databases [9, 10] and most recently in Teradata Query Grid [11]. In all cases, the old database remains the main access point. New databases are only added but will not replace the old database. These should be viewed as a tactical move by database vendors to preserve their footprint. The main drawback of these systems is that they still require the old database to remain functional indefinitely.

Query Federation. Similar to the previous category, query federation systems address the problem of integrating new data sources and have seen a lot of interest in the research community since the mid 1980s [17, 18, 20, 21, 24], where the focus is on source selection and pushing computation as close to the source as possible. Commercial implementations are few and far between [4, 16, 22]. However, they cannot close the gap between applications and the federation system in that a migration to the federation system is necessary. As a result, federation systems have yet to establish a credible footprint in the industry.

Database Virtualization. Adopting elements of a competitor's feature set has a long tradition in the database industry. Several commercial database systems feature compatibility modes to accommodate language features of competitors. Examples include EnterpriseDB, which mimics Oracle [6], or ANTs, a now defunct company that enabled IBM DB2 to mimic Sybase [1]. These approaches are limited to language features only, and while they mitigate some of the rewriting effort they do not eliminate the problem. Most notably, the application still needs to be converted to use the new drivers and connectivity libraries. These approaches generally do not go far enough and do not break the non-scalable nature of a manual migration.

[14] laid out the basic principles of our ADV approach, where we showed how to enable the execution of kdb applications in relational databases. In the current paper, we introduce the world's first functional SQL-to-SQL ADV system that fulfills sound desiderata on database virtualization. We give full details on the architecture and the query rewriting framework, illustrating how different query transformations are implemented and how feature gaps can be closed by emulation. In addition, we describe the components of the data processing pipeline, including ODBC abstraction, binary encoding of the results and result spooling to disk. This paper summarizes our research and development efforts to build a fully functional virtualization engine for Teradata, one of the most prevalent data warehousing systems across Fortune 500 companies, that allows running Teradata applications natively and instantly against major cloud databases.

9 SUMMARY

We presented Hyper-Q, a complete and production-ready implementation of ADV to address what has become one of the biggest obstacles in the adoption of new data management technology: how to preserve long-standing investments in application development when moving to the cloud and replacing conventional databases with cloud-native systems. ADV is a holistic vision built on the key insight that intercepting the communication between application and database leads to a fundamentally different and significantly more cost-effective approach compared to conventional re-platforming. Hyper-Q implements a general and extensible query rewriting framework to support a large variety of source and target systems.

Future Work. Supported by the positive reception by customers and prospects, we believe ADV has much broader applicability. Specifically, ADV is not only a means to facilitate re-platforming but can evolve into an integral part of the IT stack. Becoming a universal connector between applications and databases can break open data silos, make data freely usable across the enterprise, and accomplish all this without the need to restructure or rewrite applications. With a broad agenda to add functionality and support for additional and emerging technologies, ADV should be seen as a de facto insurance policy for the enterprise and its data strategy. We have chosen data warehousing as a starting point and first candidate based on market demand and the availability of cloud-based database systems with appropriate capabilities. Going forward we plan to broaden the scope to also include transactional systems, as we see customers increasingly inquire about these systems.

Acknowledgments

We are grateful to Prof. Joseph Hellerstein for his valuable comments on an earlier draft of this paper.

REFERENCES
[1] 2008. Emulate Sybase in DB2 using ANTs Software. (2008). https://ibm.co/2eMvpUL
[2] 2016. By 2020 Cloud Shift Will Affect More Than 1 Trillion in IT Spending, Press Release. (2016). http://www.gartner.com/newsroom/id/3384720
[3] 2017. AWS Database Migration Service. (2017). https://aws.amazon.com/dms/
[4] 2017. Data Federation in Composite Software. (2017). http://bit.ly/2eMF79V
[5] 2017. DBBest Technologies. (2017). https://www.dbbest.com/
[6] 2017. EnterpriseDB. (2017). https://www.enterprisedb.com/
[7] 2017. Google BigQuery. (2017). https://cloud.google.com/bigquery/
[8] 2017. Ispirer. (2017). http://www.ispirer.com/
[9] 2017. Linked Servers in Microsoft SQL Server. (2017). http://bit.ly/2xDy9vJ
[10] 2017. Oracle Distributed Queries. (2017). http://bit.ly/2vS57pM
[11] 2017. Teradata QueryGrid. (2017). http://bit.ly/2wuwuZ4
[12] 2017. XtremeData. (2017). https://www.xtremedata.com/
[13] 2018. Database market set for seismic shift driven by accelerated cloud and SaaS adoption. (2018). https://451research.com/report-short?entityId=93718
[14] Lyublena Antova, Rhonda Baldwin, Derrick Bryant, Tuan Cao, Michael Duller, John Eshleman, Zhongxian Gu, Entong Shen, Mohamed A. Soliman, and F. Michael Waas. 2016. Datometry Hyper-Q: Bridging the Gap Between Real-Time and Historical Analytics. In SIGMOD.
[15] Benoît Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In SIGMOD.
[16] Denodo. [n. d.]. http://www.denodo.com/
[17] Amol Deshpande and Joseph M. Hellerstein. 2002. Decoupled Query Optimization for Federated Database Systems. In ICDE.
[18] Aaron J. Elmore et al. 2015. A Demonstration of the BigDAWG Polystore System. PVLDB (2015).
[19] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler Data Warehouses. In SIGMOD.
[20] Dennis Heimbigner and Dennis McLeod. 1985. A Federated Architecture for Information Management. ACM Transactions on Information Systems 3 (1985), 253–278. Issue 3.
[21] San-Yih Hwang, Ee-Peng Lim, H.-R. Yang, S. Musukula, K. Mediratta, M. Ganesh, Dave Clements, J. Stenoien, and Jaideep Srivastava. 1994. The MYRIAD Federated Database Prototype. In SIGMOD.
[22] Holger Kache, Wook-Shin Han, Volker Markl, Vijayshankar Raman, and Stephan Ewen. 2006. POP/FED: Progressive Query Optimization for Federated Queries in DB2. In VLDB.
[23] Srinath Shankar, Rimma V. Nehme, Josep Aguilar-Saborit, Andrew Chung, Mostafa Elhemali, Alan Halverson, Eric Robinson, Mahadevan Sankara Subramanian, David J. DeWitt, and César A. Galindo-Legaria. 2012. Query Optimization in Microsoft SQL Server PDW. In SIGMOD.
[24] Alkis Simitsis, Kevin Wilkinson, Malú Castellanos, and Umeshwar Dayal. 2012. Optimizing analytic data flows for multiple execution engines. In SIGMOD.


A CONVENTIONAL DATA WAREHOUSE MIGRATION

We consider data warehouses as they are the most prevalent and among the costliest databases to deal with. Conversely, they offer the biggest benefits when moving to the cloud. Also, we focus on the practical case of moving relational data warehouses, which has emerged as the industry's most pressing use case.

For the purpose of the exposition, we assume the decision to move to the new database has been made, i.e., the necessary investigations into performance, availability, etc., are complete. These are significant tasks in and of themselves; however, they are not in the scope of this paper.

Migration projects have multiple stages. Some of these can be executed in parallel, others require serialization. For the purpose of this paper, we discuss only technical challenges and skip project management challenges, even though they are vital to the success of a re-platforming project.

A.1 Discovery Phase

In order to compile a realistic project plan, a full inventory of all technologies used, i.e., all applications connected to the database, is needed. This activity needs to look at every single component and determine, among other things:

(1) Inventory. Application type and technology for every single client need to be evaluated, including Business Intelligence (BI) and reporting, but also ad-hoc uses. This includes proprietary and third-party clients, including embedded use cases where seemingly innocuous business applications such as Microsoft Excel emit queries.

(2) Connectors. Drivers and libraries used by every connecting component need to be cataloged to understand if they can be replaced. This step often discovers surprising incompatibilities with respect to the new target system and may require custom builds of connectors.

(3) SQL language. A gap analysis determines what features, specifically complex features, are being used. The understanding of which features may not be available on the downstream system, e.g., recursive queries, as well as which proprietary features predate standardization, e.g., pre-ISO rank implementations, drives the design and implementation of equivalent queries and expressions based only on primitives of the new database system (see the example below).
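As a small illustration of such a pre-ISO feature, consider Teradata's QUALIFY clause with its legacy RANK syntax and its ANSI window-function counterpart; the query below is invented for illustration and only reuses the SALES table and AMOUNT column from Example 2:

    -- Proprietary, pre-ISO form: the ordering is embedded in RANK and the
    -- predicate is attached via the QUALIFY clause.
    SELECT SALES_DATE, AMOUNT
    FROM SALES
    QUALIFY RANK(AMOUNT DESC) <= 10;

    -- Equivalent expression using only standard primitives: an ANSI window
    -- function wrapped in a derived table, since the predicate cannot
    -- reference the window function directly.
    SELECT SALES_DATE, AMOUNT
    FROM (
      SELECT SALES_DATE, AMOUNT,
             RANK() OVER (ORDER BY AMOUNT DESC) AS R
      FROM SALES
    ) AS T
    WHERE R <= 10;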

The discovery phase is a function of the size and complexity of the ecosystem. This phase of a migration typically takes 3-6 months, or more.

A.2 Database Content Transfer

The most prominent task, though frequently neither the most challenging nor the costliest, is the transfer of the content of the original database to the cloud-native database.

Schema Conversion. Transferring the schema from one system to another needs to replicate all relevant structures in the schema definition language of the new system. While most databases by now offer equivalent structures, there are certain structural features that may not be transferable due to limitations in the new system, e.g., Global Temporary Tables or Stored Procedures.

In contrast to logical design features, physical database design, including indexes or other physical structures, does not necessarily need to be transferred or require separate treatment once the database is transferred. Frequently, physical design choices are not portable or do not even apply to the new database system and can therefore be ignored. A more recent trend is to eliminate physical design choices altogether, see, e.g., Snowflake [15] and XtremeData [12].

Treating schema conversion as a separate discipline has severe shortcomings: certain features that cannot be transcribed directly require a workaround that all applications need to be made aware of, i.e., simply transcribing the schema in isolation is bound to generate functionally incomplete results. See Section 2.1 for details.
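One hypothetical example of such a workaround, based on the case-insensitive column properties listed in Table 2 (the table and column names here are made up for illustration): a Teradata column declared NOT CASESPECIFIC compares text without regard to case, and if the target system has no equivalent column property, every application query touching the column must compensate:

    -- Original schema (source system): comparisons on NAME ignore case.
    CREATE TABLE CUSTOMER (
      ID   INTEGER,
      NAME VARCHAR(100) NOT CASESPECIFIC
    );

    -- Transcribed schema on a target without case-insensitive columns:
    CREATE TABLE CUSTOMER (
      ID   INTEGER,
      NAME VARCHAR(100)
    );

    -- Every predicate on NAME now needs an explicit workaround, e.g.
    --   WHERE NAME = 'smith'
    -- must become
    --   WHERE UPPER(NAME) = UPPER('smith');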

Data Transfer. Once a schema is transferred, data needs to be transferred in bulk or incrementally. This is typically a logistical issue rather than a semantic one: for large amounts of data, most cloud service providers offer bulk transfer through off-line means such as disk arrays shipped via courier services or similar. With today's standard bandwidth available between data centers and the widespread network of cloud data centers, data transfer, except for very large data warehouses, rarely constitutes a problem.

B VIRTUALIZATION USE CASES

During our engagement with customers and prospects, we have identified several major approaches to moving to the cloud, which we detail next. These are visually depicted in Figure 10.

B.1 Complete drop-in replace

In the past months we have seen multiple cases where enterprises across different verticals have undertaken campaigns to reduce their data center footprint and vendor licensing fees and move their operations to the cloud. Due to the difficulties outlined in Sections 1 and 2, most of those enterprises have been reluctant to pursue such a project, as it often meant a costly multi-month investment, which often exceeds the cost savings of the cloud and poses high risks to the business. A small fraction of those customers have also tried to look at the challenge positively: a rewrite of the applications to enable portability across databases also means those applications can be modernized or otherwise optimized to keep up with ever-changing technologies. However, even those optimistic customers have more often than not given up on the "application modernization" idea after they realize the magnitude of the migration effort involved in simply porting the application without any change.

ADV enables instantaneous deployment of existing applications on the cloud, as shown in Figure 10 (a). Modernization of applications can happen gradually, if the enterprise wants to invest in this. What is more, developers now have the choice of which query language to use for their new applications: if they feel more comfortable with the old query syntax, they can keep using that, or they can switch to the language of the new database, while keeping the guarantee that their application will not only work, but also be portable in the future.


Figure 10: Database Virtualization Use Cases. (a) Drop-in Replace, (b) Disaster Recovery, (c) Scaling Out

B.2 Disaster recovery

In several cases we have encountered customers who are happy with their existing on-premise database and want to keep it, but want to add a cloud database to the mix for disaster recovery purposes, as shown in Figure 10 (b). The cloud database in those scenarios is often not fully utilized, thus incurring significantly lower costs compared to maintaining a second on-premise installation. This scenario poses significant challenges to the enterprise, as the same applications now need to run on two potentially different database installations. While multiple database vendors are now coming up with cloud offerings of their solutions, the latter are not always fully compatible with the on-premise distribution, and many features may have been disabled for performance reasons. Thus, the only remaining solution used to be maintaining two different versions of the application for the different database vendors. In addition to the pure migration challenges outlined before, this has additional drawbacks, as any modification to the application (for example bug fixes or new feature development) now needs to be applied in two separate codebases. This is obviously highly undesirable.

ADV easily enables disaster recovery deployments across different stacks as it does not require the original application to be changed in order to have it execute on a different platform. Application evolution can happen naturally on the original code and does not need to be reimplemented for the secondary database.

B.3 Scaling out applications

When moving from an on-premise to a cloud-based data warehouse, customers do not want to experience performance regressions. In one case, a customer's throughput requirements could not be met even when they used the largest available instance provided by the cloud database. This could be partially attributed to the fact that the existing on-premise solution ran on dedicated hardware, and the customer's workload was highly optimized for that database.

A common solution for such a case is to maintain multiple replicas of the data warehouse and load-balance queries across them, as shown in Figure 10 (c). The ADV solution on top can then automatically route the queries to the different replicas, without sacrificing consistency, and without requiring changes to the application logic. We are currently working on extending Hyper-Q to handle this scenario.

B.4 Performance evaluation of cloud databases

As discussed earlier, cloud databases come with different performance and scalability characteristics compared to traditional on-premise solutions. While customers are often aware of these and are willing to sacrifice performance in return for all the benefits a cloud database gives them, they often have performance and throughput criteria which need to be met before they can justify migrating their applications to a new database stack. For example, in one of our engagements a customer was willing to tolerate up to 3x performance degradation as measured in completed jobs per hour, as that would still allow them to finish their reporting jobs in the allocated time. Other customers may have other requirements, and they often have testing harnesses around those.

Hyper-Q allows customers to test a possible migration path before they commit to a particular solution: the customer only needs to migrate the portion of their data needed for testing and direct the tested applications to connect via Hyper-Q to the new database, saving the valuable time and money otherwise spent rewriting queries. In addition, customers can compare side-by-side how their workloads perform on a variety of potential target databases, which can be used to guide their decision of where to migrate to.

C QUERY REWRITING EXAMPLES

Table 2 lists a sample of the tracked features, together with their category, description and a short synopsis of a possible rewrite as implemented by Hyper-Q.

Figure 11 shows the full query rewrite operations needed to port the query given in Example 2 from Teradata SQL to a target cloud database system.


The original Teradata request of Example 2:

    SEL *
    FROM SALES
    WHERE SALES_DATE > 1140101
      AND (AMOUNT, AMOUNT * 0.85) > ANY (SEL GROSS, NET FROM SALES_HISTORY)
    QUALIFY RANK(AMOUNT DESC) <= 10;

is rewritten and serialized for the target cloud database as:

    SELECT T.AMOUNT, T.SALES_DATE
    FROM (
      SELECT *,
             RANK() OVER (ORDER BY AMOUNT DESC) AS R
      FROM SALES S1
      WHERE EXTRACT(DAY FROM SALES_DATE) +
            EXTRACT(MONTH FROM SALES_DATE) * 100 +
            (EXTRACT(YEAR FROM SALES_DATE) - 1900) * 10000 > 1140101
        AND EXISTS (
          SELECT 1
          FROM SALES_HISTORY S2
          WHERE (S1.AMOUNT > S2.GROSS)
             OR (S1.AMOUNT = S2.GROSS AND S1.AMOUNT * .85 > S2.NET)
        )
    ) AS T
    WHERE R <= 10;

Figure 11: Full Query Rewrite in Example 2

Feature | Category | Description | Hyper-Q implementation | Component
SEL/DEL/INS/UPD | Translation | Shortcut for SELECT/DELETE/INSERT/UPDATE | Replace by the corresponding non-abbreviated keyword | Parser
QUALIFY | Transformation | Query clause specifying a predicate involving window functions | Add a Project node to compute the window functions and transform the original predicate into one which refers to the computed columns | Parser
Implicit joins | Transformation | Tables referenced by the different clauses in the query but not appearing in the FROM clause | Expand FROM clause with referenced tables | Binder
Chained projections | Transformation | Named scalar expressions defined and used in the same level | Replace the referenced name by its definition | Binder
Ordinal GROUP BY | Transformation | GROUP BY column position list | Replace column positions by the corresponding expression | Binder
OLAP grouping extensions | Transformation | GROUP BY rollup, cube and grouping sets | Expand to a union all over simple GROUP BYs | Transformer
Date arithmetics | Transformation | Arithmetic operations between dates and integer or interval values | Replace by DATEADD function | Transformer
Date-Integer comparison | Transformation | Comparison between dates and integers | Expand the date side of the comparison into an arithmetic expression that returns the equivalent integer value | Transformer
Macros | Emulation | Execution of procedural objects | Emulate macro code execution in the mid-tier | Binder
DML on views | Emulation | DML operations on view objects | Express DML operation on the base table of the view | Binder
Unsupported column properties | Emulation | Column properties not supported on the target system, such as non-constant default values, case-insensitive text columns, etc. | Store column properties in the DTM catalog and use them when the corresponding column is referenced in a query | Binder/Transformer

Table 2: Implementation details for a selection of features in Hyper-Q, including choice of implementing component.
