23
9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language PIG Jaql CQL N1QL AQL SQL-on-Hadoop NoSQL Others Semistructured Databases have Arrived in the form of JSON… { location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }

SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

1

with the support of National Science Foundation, Google, Informatica, Couchbase

and its use in the FORWARD

framework

SQL++ Query Language

SQL-on-Hadoop

PIG

Jaql

CQL

N1QL

AQL

SQL-on-Hadoop NoSQL

Others

Semistructured Databases have Arrived

in the form of JSON…

{ location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }

Page 2: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

2

Semistructured Query Languages come along…

JSON: Semistructured

Data

SQL: Expressive Power and Adoption

but in a Tower of Babel…

SELECT AVG(temp) AS tavg FROM readings GROUP BY sid

SQL

db.readings.aggregate( {$group: {_id: "$sid", tavg: {$avg:"$temp"}}})

MongoDB

readings -> group by sid = $.sid into { tavg: avg($.temp) };

Jaql

a = LOAD 'readings' AS (sid:int, temp:float); b = GROUP a BY sid; c = FOREACH b GENERATE AVG(temp); DUMP c;

Pig

for $r in collection("readings") group by $r.sid return { tavg: avg($r.temp) }

JSONiq

with limited query abilities…

Growing out of key-value databases or SQL with special type UDFs

Far from the full-fledged power of prior non-relational query languages – OQL, Xquery *academic projects being the exception (eg, ASTERIX, FORWARD)

Page 3: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

3

in lieu of formal semantics…

Query languages were not meant for Ninjas

• What is SQL++ ?

• How we use SQL++ in UCSD’s FORWARD ?

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop

Outline

SQL++ : Data model + query language for

querying both structured data (SQL) and

semistructured data (native JSON)

Full fledged, declarative ,

aggressively backwards-compatible with SQL

Formal Schemaless Semantics:

adjusting tuple calculus to the

richness of JSON

Configurable SQL++: Formal configuration

options towards choosing semantics, understanding

repercussions

Unifying: 11 popular databases/QLs captured by appropriate settings of Configurable

SQL++

Page 4: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

4

How is it used in the Big Data industry • ASTERIXDB

• CouchDB

heading towards SQL++

based on a schemaless SQL++ algebra

Automatic Incremental View Maintenance

Middleware Query Optimization & Execution: FORWARD virtual database

query plans based on SQL++ algebra

How do we use it in UCSD

{ location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }

• Array nested inside a tuple

• Heterogeneous tuples in collections

• Arbitrary compositions of array, bag, tuple

SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON

• Extend SQL with arrays + nesting + heterogeneity

Page 5: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

5

[ [5, 10, 3], [21, 2, 15, 6], [4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6] ]

• Arbitrary compositions of array, bag, tuple

SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON

• Extend SQL with arrays + nesting + heterogeneity

+ permissive structures

A superset of SQL and JSON

• Extend SQL with arrays + nesting + heterogeneity

+ permissive structures

• Extend JSON with bags and enriched types

{ location: 'Alpine', readings: {{ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } }} }

• Bags {{ … }}

• Enriched types

SQL++ Data Model (also JSON++ data model)

Exercise: Why the SQL++ data model is a superset of the SQL data model?

• a SQL table corresponds to a JSON bag that has homogeneous tuples

Page 6: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

6

A superset of SQL and JSON

• Extend SQL with arrays + nesting + heterogeneity

+ permissive structures + permissive schema

• Extend JSON with bags and enriched types

{ location: 'Alpine', readings: {{ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } }} }

• With schema

• Or, schemaless

• Or, partial schema

SQL++ Data Model (also JSON++ data model)

{ location: string, readings: {{ { time: timestamp, ozone: any, * } }} }

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL

From SQL to SQL++

{{ { sid : 2 } }}

Backwards Compatibility with SQL

SELECT DISTINCT r.sid FROM readings AS r WHERE r.temp < 50

Find sensors that recorded a temperature below 50

readings : {{ { sid: 2, temp: 70.1 }, { sid: 2, temp: 49.2 }, { sid: 1, temp: null } }}

sid temp

2 70.1

2 49.2

1 null

sid

2

Page 7: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

7

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL

• Adjust SQL syntax/semantics for heterogeneity, complex input/output values

• Achieved with few extensions and mostly by SQL limitation removals

From SQL to SQL++

[ { co: 0.8 }, { co: 0.7 } ]

readings : [ 1.3, 0.7, 0.3, 0.8 ]

From Tuple Variables to Element Variables

SELECT r AS co FROM readings AS r WHERE r < 1.0 ORDER BY r DESC LIMIT 2

Find the highest two sensor readings that are below 1.0

FROM readings AS r

SELECT r AS co

WHERE r < 1.0

BoutFROM = Bin

WHERE = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

BoutWHERE = Bin

ORDERBY = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

[ { co: 0.8 }, { co: 0.7 } ]

readings : [ 1.3, 0.7, 0.3, 0.8 ]

ORDER BY r DESC

LIMIT 2

BoutORDERBY = Bin

LIMIT = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]

BoutLIMIT = Bin

SELECT = [ ⟨ r : 0.8 ⟩ ⟨ r : 0.7 ⟩ ]

Element Variables

Page 8: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

8

[ { co: 0.9 }, { co: 0.8 } ]

sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]

Correlated Subqueries Everywhere

SELECT r AS co FROM sensors AS s, s AS r WHERE r < 1.0 ORDER BY r DESC LIMIT 2

Find the highest two sensor readings that are below 1.0

FROM sensors AS s, s AS r

BoutFROM = Bin

WHERE = {{ ⟨ s : [1.3, 2] , r : 1.3 ⟩, ⟨ s : [1.3, 2] , r : 2 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 1.1 ⟩, ⟨ s : [0.7, 1.4] , r : 0.7 ⟩, ⟨ s : [0.7, 1.4] , r : 1.4 ⟩ }}

Correlated Subqueries Everywhere

WHERE r < 1.0

SELECT r AS co

ORDER BY r DESC

LIMIT 2

sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]

FROM sensors AS s, s AS r

BoutWHERE = Bin

ORDERBY = {{ ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, ⟨ s : [0.7, 1.4] , r : 0.7 ⟩ }}

Correlated Subqueries Everywhere

WHERE r < 1.0

SELECT r AS co

ORDER BY r DESC

LIMIT 2

sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]

Page 9: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

9

sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]

FROM sensors AS s, s AS r

[ { co: 0.9 }, { co: 0.8 } ]

Correlated Subqueries Everywhere

WHERE r < 1.0

SELECT r AS co

ORDER BY r DESC

LIMIT 2

BoutORDERBY = Bin

LIMIT = [ ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 1.4] , r : 0.7 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩ ]

BoutLIMIT = Bin

SELECT = [ ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩ ]

[ { co: 0.9, x: 3, y: 2 }, { co: 0.8, x: 2, y: 3 } ]

sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]

Utilization of Input Order

SELECT r AS co, rp AS x, sp AS y FROM sensors AS s AT sp, s AS r AT rp WHERE r < 1.0 ORDER BY r DESC LIMIT 2

Find the highest two sensor readings that are below 1.0

x

y

FROM sensors AS s

{{ { sensor : 1, readings: {{ {co:0.4}, {co:0.2} }} }, { sensor : 2, readings: {{ {co:0.3} }} } }}

ℾ0 = ⟨ sensors : {{ {sensor: 1}, {sensor: 2} }} , logs: {{ {sensor: 1, co: 0.4}, {sensor: 1, co: 0.2}, {sensor: 2, co: 0.3}, }} ⟩

SELECT s.sensor AS sensor, ( SELECT l.co AS co FROM logs AS l WHERE l.sensor = s.sensor ) AS readings

BoutFROM = Bin

SELECT = {{ ⟨ s : {sensor: 1} ⟩, ⟨ s : {sensor: 2} ⟩ }}

Constructing Nested Results – Subqueries Again

Page 10: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

10

{{ {co:0.4}, {co:0.2} }}

ℾ1 = ⟨ s : {sensor : 1}, sensors : {{ {sensor: 1}, {sensor: 2} }} , logs: {{ {sensor: 1, co: 0.4}, {sensor: 1, co: 0.2}, {sensor: 2, co: 0.3}, }} ⟩

BoutFROM = Bin

SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩, ⟨ l : {sensor: 2, co: 0.3} ⟩ }}

Constructing Nested Results

FROM logs AS l

WHERE l.sensor = s.sensor

SELECT l.co AS co

BoutFROM = Bin

SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩ }}

Iterating over Attribute-Value pairs

INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN

ℾ = ⟨ readings : { co : [], no2: [0.7], so2: [0.5, 2] } ⟩

FROM readings AS {g:v}, v AS n

SELECT ELEMENT { gas: g, val: v }

{{ ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }}

[ { gas: "co", num: null }, { gas: "no2", num: 0.7 }, { gas: "so2", num: 0.5, }, { gas: "so2", num: 2 } ]

Page 11: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

11

INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN

ℾ = ⟨ readings : { co : [], no2: [0.7], so2: [0.5, 2] } ⟩

FROM readings AS {g:v} INNER CORRELATE v AS n

SELECT ELEMENT { gas: g, val: v }

{{ ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }}

[ { gas: "co", num: null }, { gas: "no2", num: 0.7 }, { gas: "so2", num: 0.5, }, { gas: "so2", num: 2 } ]

INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN

ℾ = ⟨ readings : { co : [], no2: [0.7], so2: [0.5, 2] } ⟩

FROM readings AS {g:v} OUTER CORRELATE v AS n

SELECT ELEMENT { gas: g, val: v }

{{ ⟨ g : "co", v : [], n : missing ⟩, ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }}

[ { gas: "co", num: null }, { gas: "no2", num: 0.7 }, { gas: "so2", num: 0.5, }, { gas: "so2", num: 2 } ]

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL

• Adjust SQL syntax/semantics for heterogeneity, complex values

• Achieved with few extensions and SQL limitation removals • Element Variables

• Correlated Subqueries (in FROM and SELECT clauses)

• Optional utilization of Input Order

• Grouping indendent of Aggregation Functions

• ... and a few more details

• Configurability: making SQL++ a unifying and configurable language

The SQL++ Extensions to SQL

Page 12: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

12

{{ {co:0.4}, {co:0.2} }}

ℾ1 = ⟨ s : {sensor : 1}, sensors : {{ {sensor: 1}, {sensor: 2} }} , logs: {{ {sensor: 1, co: 0.4}, {sensor: 1, co: 0.2}, {sensor: 2, co: 0.3}, }} ⟩

BoutFROM = Bin

SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩, ⟨ l : {sensor: 2, co: 0.3} ⟩ }}

Algebras do not need schemaful inputs

FROM logs AS l

WHERE l.sensor = s.sensor

SELECT l.co AS co

BoutFROM = Bin

SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩ }}

sensor co

1 0.4

1 0.2

2 0.3

• What is SQL++ ?

• How we use SQL++ in UCSD’s FORWARD ? • SQL++ uses in integration

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop

Outline

SQL++ Query Processor

SQL Database

NewSQL Database

NoSQL Database

SQL-on-Hadoop

Database

Client

Java In-Memory

Objects

SQL++ Virtual Views of Sources

Federated SQL++ Queries SQL++ Results

FORWARD Virtual Database

SQL++ based Virtual Database

SQL Wrapper

NewSQL Wrapper

NoSQL Wrapper

SQL-on-Hadoop Wrapper

Java Objects Wrapper

SQL++ Virtual View

SQL++ Virtual View

SQL++ Virtual View

SQL++ Virtual View

SQL++ Virtual View

Native queries

Native results

Page 13: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

13

SQL++ Query Processor

SQL Database

NewSQL Database

NoSQL Database

SQL-on-Hadoop

Database

Client

Java In-Memory

Objects

SQL++ Queries SQL++ Results

FORWARD Integration Database

SQL++ Virtual and/or Materialized Views

SQL++ Virtual View

SQL++ Virtual View

SQL++ Virtual View

SQL++ Virtual View

SQL++ Virtual View

SQL++ Integrated Views

SQL++ Integrated

View

SQL++ Integrated

View

SQL++ Added Value

View

SQL++ View Definitions

measurements: {{ { sid: 2, temperatures: [70.1, 70.2] }, { sid: 1, temperatures: [71.0] } }}

BI Tool / Report Writer

Couchbase

FORWARD Virtual Database

or its SQL++ equivalent

SQL query SQL result

sid temp 2 70.1

2 70.2

1 71.0

Couchbase query Couchbase results

measurements: {{ {sid: 2, temp: 70.1}, {sid: 2, temp: 70.2}, {sid: 1, temp: 71.0} }}

Use Case 1: Hide the ++ source features behind SQL views

SQL View

measurements:

SELECT m.sid, t AS temp FROM measurements AS m, m.temperatures AS t

measurements: {{ { sid: 2, temp: 70.1 }, { sid: 2, temp: 49.2 }, { sid: 1, temp: null } }}

MongoDB

SQL++ query SQL++ result

MongoDB query MongoDB results

Use Case 2: Semantics-Aware Pushdown

MongoDB Virtual Database

Virtual View (Identical)

MongoDB Wrapper

SELECT DISTINCT m.sid FROM measurements AS m WHERE m.temp < 50

"Sensors that recorded a temperature below 50"

SQL++ semantics for m.temp < 50

... { $match: {temp: {$lt: 50}}}, { $match: {temp: {$not: null}}} ...

MongoDB semantics for temp $lt 50

Page 14: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

14

• How to simulate incompatible semantics/missing features?

• Sources like SQL and MongoDB cannot execute the entirety of SQL++

• How to efficiently push down computation?

Issues automatically handled by the query processor.

Plenty of query rewriting problems, including novel ones on semi-structured operators

Push-Down Challenges: Limited capabilities & semantic variations

• Semantics of "less-than" comparison are different across sources:

<sql , <mongodb , etc.

• Config Parameters to capture these variations

SQL++ captures Semantics Variations

@lt { MongoDB } ( x < y )

@lt { complex : deep, type_mismatch: false, null_lt_null : false, null_lt_value: false, ... } ( x < y )

In NoSQL, NewSQL, SQL-on-Hadoop variation of semantics for:

• Paths

• Equality

• Comparisons

And all the operators that use them:

• Selections

• Grouping

• Ordering

• Set operations

Each of these features has a set of config parameters in SQL++

Semantics Variations

Page 15: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

15

Don’t give a standard Teach how to reach a standard

• Fast evolving space

• Standard would be premature

• Configuration options itemize and explain the space options

• Rationalize discussion

Advise on consistent settings of the configuration options

• Not every configuration option setting is sensible

• Configuration options have to subscribe to common sense

iff x = y then [x] = y

if x = y and y = z then x = z

• which enables rewritings

• Have to be consistent with each other Eg, if true AND missing is missing,

then false OR missing is missing

FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer

• The Incremental View Maintenance functionality: • SQL++ (Materialized) View Definition

• Stream of inserts, deletes, updates on base data

• The Incremental View Maintenance module • Automatically and efficiently updates the materialized view to reflect the

stream of changes

• SQL++ can also enable automatic Incremental View Maintenance! • With attention to replication of data in views

• Opportunities by keys

Eg, Couchbase has a JSON web log showing {{ {user, list of displayed products } }}

and FORWARD produces materialized view {{ {product category, count, products: [{product, count}]} }}

DB Before

DB After

View Before

View After

Stream

IVM Stream

Page 16: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

16

Custom dashboards, interactive pages & apps

• The data models of visualization components (e.g. Google Maps) can be nicely captured with JSON models

• The pages are SQL++ (JSON) views! • Mashups of the components views

• SQL++ feeds and incrementally updates the page views

Use case

• From data to visualization with just SQL++ & markup • Ajax/Javascript visuals with no Ajax/Javascript mess

• How to easily connect to today’s JS libraries

• Custom Ajax visualizations & interfaces for IT personnel

FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer

(part of) the Google Map model

<% unit google.maps.Maps %> { markers: [ { position: { latitude : number, longitude: number } ... } ] } <% end unit %>

FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer

• What is SQL++ ?

• How we use SQL++ in UCSD’s FORWARD ? • SQL++ uses in integration and (live) analytics applications

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop

Outline

Page 17: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

17

SQL-on-Hadoop

PIG

Jaql

CQL

N1QL

AQL

MongoDB driver

SQL-on-Hadoop

SQL & NewSQL

NoSQL

Others

Hive • SQL-like syntax • Hadoop clusters

Pig • Algebraic syntax • Hadoop clusters

Jaql (IBM) • Algebraic syntax • Hadoop cluster(s)

CQL (Cassandra) • Created at Facebook • SQL-like syntax

MongoDB • JSON-like syntax • Most used NoSQL database

JSONiq • Based on XQuery • 28msec

N1QL (Couchbase) • SQL-like syntax • No production release

SQL • SQL standard • Also comprises NewSQL

AQL (AsterixDB) • FLWOR syntax • Research database (UCI)

BigQuery (Google) • Previously "Dremel" • SQL-like syntax

MongoJDBC • SQL-like syntax • MongoDB via JDBC driver

Surveyed Databases

SELECT AVG(temp) AS tavg FROM readings GROUP BY sid

SQL

db.readings.aggregate( {$group: {_id: "$sid", tavg: {$avg:"$temp"}}})

MongoDB

readings -> group by sid = $.sid into { tavg: avg($.temp) };

Jaql

a = LOAD 'readings' AS (sid:int, temp:float); b = GROUP a BY sid; c = FOREACH b GENERATE AVG(temp); DUMP c;

Pig

for $r in collection("readings") group by $r.sid return { tavg: avg($r.temp) }

JSONiq

• SQL++ covers SQL, N1QL and QL research prototypes (e.g., UCI’s ASTERIX)

• Removing the current “Tower of Babel” effect

• Providing formal syntax and semantics

SQL++ Removes Superficial Differences

15 feature matrices (1-11 dimensions each) classifying:

• Data values

• Schemas

• Access and construct nested data

• Missing information

• Equality semantics

• Ordering semantics

• Aggregation

• Joins

• Set operators

• Extensibility

Surveyed features

Page 18: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

18

• What is SQL++ ?

• How we use SQL++ in UCSD’s FORWARD ?

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop • Methodology

• Example 1: data model (data values)

• Example 2: query language (SELECT clause)

• Example 3: semantics (path)

• Example 4: semantics (equality function)

Outline

Methodology

For each feature:

1. A formal definition of the feature in SQL++

2. A SQL++ example

3. A feature matrix that classifies each dimension of the feature

4. A discussion of the results, partial support and unexpected behaviors

All the results are empirically validated

Example: Data values

1. SQL++ example:

{ location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }

1 2 3 4 5 6 7 8 9

10 11 12 13 14

Page 19: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

19

Example: Data values

2. SQL++ BNF for values:

Example: Data values

3. Feature matrix:

Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives

Hive Bag of tuples No Yes No No Partial Yes Yes

Jaql Any Value Yes Yes No No No Yes Yes

Pig Bag of tuples Partial No Partial No Partial Yes Yes

CQL Bag of tuples No Partial No Partial Partial No Yes

JSONiq Any Value Yes Yes No No No Yes Yes

MongoDB Bag of tuples Yes Yes No No No Yes Yes

N1QL Bag of tuples Yes Yes No No No Yes Yes

SQL Bag of tuples No No No No No No Yes

AQL Any Value Yes Yes Yes No No Yes Yes

BigQuery Bag of tuples No No No No No Yes Yes

MongoJDBC Bag of tuples Yes Yes No No No Yes Yes

SQL++ Any Value Yes Yes Yes Partial Yes Yes Yes

4. Discussion of the results:

Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives

Hive Bag of tuples No Yes No No Partial Yes Yes

Jaql Any Value Yes Yes No No No Yes Yes

Pig Bag of tuples Partial No Partial No Partial Yes Yes

CQL Bag of tuples No Partial No Partial Partial No Yes

JSONiq Any Value Yes Yes No No No Yes Yes

MongoDB Bag of tuples Yes Yes No No No Yes Yes

N1QL Bag of tuples Yes Yes No No No Yes Yes

SQL Bag of tuples No No No No No No Yes

AQL Any Value Yes Yes Yes No No Yes Yes

BigQuery Bag of tuples No No No No No Yes Yes

MongoJDBC Bag of tuples Yes Yes No No No Yes Yes

SQL++ Any Value Yes Yes Yes Partial Yes Yes Yes

• Column-by-column comparison • Partial support (65k scalar elements) • Identify clusters (who supports JSON?)

Example: Data values

Page 20: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

20

1. SQL++ example: • Projecting nested collections:

• Projecting non-tuples:

SELECT ELEMENT ozone FROM readings

SELECT TUPLE s.lat, s.long, (SELECT r.ozone FROM readings AS r WHERE r.location = s.location) FROM sensors AS s

"Position and (nested) ozone readings of each sensor?"

"Bag of all the (scalar) ozone readings?"

Example: SELECT clause

Projecting tuples containing nested collections Projecting non-tuples

Hive Partial No

Jaql Yes Yes

Pig Partial No

CQL No No

JSONiq Yes Yes

MongoDB Partial Partial

N1QL Partial Partial

SQL No No

AQL Yes Yes

BigQuery No No

MongoJDBC No No

SQL++ Yes Yes

Example: SELECT clause

3. Feature matrix:

Projecting tuples containing nested collections Projecting non-tuples

Hive Partial No

Jaql Yes Yes

Pig Partial No

CQL No No

JSONiq Yes Yes

MongoDB Partial Partial

N1QL Partial Partial

SQL No No

AQL Yes Yes

BigQuery No No

MongoJDBC No No

SQL++ Yes Yes

• Not well supported features

• 3 languages support them

entirely (same cluster as for

data values)

4. Discussion of the results:

Example: SELECT clause

Page 21: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

21

We use config parameters to encompass and compare various semantics of a feature:

• Minimal number of independent dimensions

• 1 dimension = 1 config parameter

• SQL++ formalism parametrized by the config parameters

• Feature matrix classifies the values of each config parameter

Config Parameters

• Config parameters for tuple navigation:

@tuple_nav { absent : missing, type_mismatch: error, } ( x.y )

Example: Paths

• The feature matrix classifies, for each language, the value of each config parameter:

Missing Type mismatch

Hive Error Error

Jaql Null Error#

Pig Error Error

CQL Error Error

JSONiq Missing# Missing#

MongoDB Missing Missing

N1QL Missing Missing

SQL Error Error

AQL Null Error

BigQuery Error Error

MongoJDBC Missing# Missing#

SQL++ @path @path

Example: Paths

Page 22: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

22

• Config parameters for equality:

Example: Equality Function

@eq { complex : error, type_mismatch : false, null_eq_null : null, null_eq_value : null, missing_eq_missing: missing, missing_eq_value : missing, missing_eq_null : missing } ( x = y )

• The feature matrix classifies, for each language, the value of each config parameter:

Complex Type

mismatch Null = Null Null = Value

Missing =

Missing

Missing =

Value

Missing =

Null

Hive (=, <=>) Err, Err Err, Err Null, True Null, False N/A N/A N/A

Jaql Boolean Null Null Null N/A N/A N/A

Pig Boolean partial Err Null Null N/A N/A N/A

CQL Err Err Err False/Null N/A N/A N/A

JSONiq (=, deep-equal()) Err, Boolean Err, False True, True False, False Missing, True Missing, False Missing, False

MongoDB Boolean False True False True False False

N1QL Boolean False Null Null Missing Missing Missing

SQL N/A Err Null Null N/A N/A N/A

AQL Err Err Null Null N/A N/A N/A

BigQuery Err Err Null Null N/A N/A N/A

MongoJDBC Boolean False True False N/A False False

SQL++ @equal @equal @equal @equal @equal @equal @equal

• No real cluster • Some languages have multiple (incompatible) equality functions • Some edge cases cannot happen due to other limitations (SQL has no complex values)

Example: Equality Function

• As a database user: • Understand the semantics of a (often underspecified) query language / be

aware of the limitation of a database

• As a designer/architect of a database • Produce formal specification of your query language

• Align semantics with SQL's

• As a database researcher • The results might change, but the survey methodology stays

• As a designer/architect of database middleware • Understand what capability variations need to be encapsulated and

simulated

How to use this survey?

Page 23: SQL++ Query Language · 9/9/2015 1 with the support of National Science Foundation, Google, Informatica, Couchbase and its use in the FORWARD framework SQL++ Query Language

9/9/2015

23

• The marketing clusters do not correspond to real capabilities

• Limited capabilities: matrices are sparse and fragmented (more pressure on source-specific rewriters and distributor)

The survey shows:

The Future is Semi-Structured and Declarative

• Scalability

• Flexibility of Semistructured Data

• Power of Declarative:

Automatic Optimization, View Maintenance

• The primary operational and the secondary analytics application out of semistructured, declarative platforms