Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
9/9/2015
1
with the support of National Science Foundation, Google, Informatica, Couchbase
and its use in the FORWARD
framework
SQL++ Query Language
SQL-on-Hadoop
PIG
Jaql
CQL
N1QL
AQL
SQL-on-Hadoop NoSQL
Others
Semistructured Databases have Arrived
in the form of JSON…
{ location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }
9/9/2015
2
Semistructured Query Languages come along…
JSON: Semistructured
Data
SQL: Expressive Power and Adoption
but in a Tower of Babel…
SELECT AVG(temp) AS tavg FROM readings GROUP BY sid
SQL
db.readings.aggregate( {$group: {_id: "$sid", tavg: {$avg:"$temp"}}})
MongoDB
readings -> group by sid = $.sid into { tavg: avg($.temp) };
Jaql
a = LOAD 'readings' AS (sid:int, temp:float); b = GROUP a BY sid; c = FOREACH b GENERATE AVG(temp); DUMP c;
Pig
for $r in collection("readings") group by $r.sid return { tavg: avg($r.temp) }
JSONiq
with limited query abilities…
Growing out of key-value databases or SQL with special type UDFs
Far from the full-fledged power of prior non-relational query languages – OQL, Xquery *academic projects being the exception (eg, ASTERIX, FORWARD)
9/9/2015
3
in lieu of formal semantics…
Query languages were not meant for Ninjas
• What is SQL++ ?
• How we use SQL++ in UCSD’s FORWARD ?
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop
Outline
SQL++ : Data model + query language for
querying both structured data (SQL) and
semistructured data (native JSON)
Full fledged, declarative ,
aggressively backwards-compatible with SQL
Formal Schemaless Semantics:
adjusting tuple calculus to the
richness of JSON
Configurable SQL++: Formal configuration
options towards choosing semantics, understanding
repercussions
Unifying: 11 popular databases/QLs captured by appropriate settings of Configurable
SQL++
9/9/2015
4
How is it used in the Big Data industry • ASTERIXDB
• CouchDB
heading towards SQL++
based on a schemaless SQL++ algebra
Automatic Incremental View Maintenance
Middleware Query Optimization & Execution: FORWARD virtual database
query plans based on SQL++ algebra
How do we use it in UCSD
{ location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }
• Array nested inside a tuple
• Heterogeneous tuples in collections
• Arbitrary compositions of array, bag, tuple
SQL++ Data Model (also JSON++ data model)
A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity
9/9/2015
5
[ [5, 10, 3], [21, 2, 15, 6], [4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6] ]
• Arbitrary compositions of array, bag, tuple
SQL++ Data Model (also JSON++ data model)
A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity
+ permissive structures
A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity
+ permissive structures
• Extend JSON with bags and enriched types
{ location: 'Alpine', readings: {{ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } }} }
• Bags {{ … }}
• Enriched types
SQL++ Data Model (also JSON++ data model)
Exercise: Why the SQL++ data model is a superset of the SQL data model?
• a SQL table corresponds to a JSON bag that has homogeneous tuples
9/9/2015
6
A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity
+ permissive structures + permissive schema
• Extend JSON with bags and enriched types
{ location: 'Alpine', readings: {{ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } }} }
• With schema
• Or, schemaless
• Or, partial schema
SQL++ Data Model (also JSON++ data model)
{ location: string, readings: {{ { time: timestamp, ozone: any, * } }} }
• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL
From SQL to SQL++
{{ { sid : 2 } }}
Backwards Compatibility with SQL
SELECT DISTINCT r.sid FROM readings AS r WHERE r.temp < 50
Find sensors that recorded a temperature below 50
readings : {{ { sid: 2, temp: 70.1 }, { sid: 2, temp: 49.2 }, { sid: 1, temp: null } }}
sid temp
2 70.1
2 49.2
1 null
sid
2
9/9/2015
7
• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity, complex input/output values
• Achieved with few extensions and mostly by SQL limitation removals
From SQL to SQL++
[ { co: 0.8 }, { co: 0.7 } ]
readings : [ 1.3, 0.7, 0.3, 0.8 ]
From Tuple Variables to Element Variables
SELECT r AS co FROM readings AS r WHERE r < 1.0 ORDER BY r DESC LIMIT 2
Find the highest two sensor readings that are below 1.0
FROM readings AS r
SELECT r AS co
WHERE r < 1.0
BoutFROM = Bin
WHERE = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
BoutWHERE = Bin
ORDERBY = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
[ { co: 0.8 }, { co: 0.7 } ]
readings : [ 1.3, 0.7, 0.3, 0.8 ]
ORDER BY r DESC
LIMIT 2
BoutORDERBY = Bin
LIMIT = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]
BoutLIMIT = Bin
SELECT = [ ⟨ r : 0.8 ⟩ ⟨ r : 0.7 ⟩ ]
Element Variables
9/9/2015
8
[ { co: 0.9 }, { co: 0.8 } ]
sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]
Correlated Subqueries Everywhere
SELECT r AS co FROM sensors AS s, s AS r WHERE r < 1.0 ORDER BY r DESC LIMIT 2
Find the highest two sensor readings that are below 1.0
FROM sensors AS s, s AS r
BoutFROM = Bin
WHERE = {{ ⟨ s : [1.3, 2] , r : 1.3 ⟩, ⟨ s : [1.3, 2] , r : 2 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 1.1 ⟩, ⟨ s : [0.7, 1.4] , r : 0.7 ⟩, ⟨ s : [0.7, 1.4] , r : 1.4 ⟩ }}
Correlated Subqueries Everywhere
WHERE r < 1.0
SELECT r AS co
ORDER BY r DESC
LIMIT 2
sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]
FROM sensors AS s, s AS r
BoutWHERE = Bin
ORDERBY = {{ ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, ⟨ s : [0.7, 1.4] , r : 0.7 ⟩ }}
Correlated Subqueries Everywhere
WHERE r < 1.0
SELECT r AS co
ORDER BY r DESC
LIMIT 2
sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]
9/9/2015
9
sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]
FROM sensors AS s, s AS r
[ { co: 0.9 }, { co: 0.8 } ]
Correlated Subqueries Everywhere
WHERE r < 1.0
SELECT r AS co
ORDER BY r DESC
LIMIT 2
BoutORDERBY = Bin
LIMIT = [ ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 1.4] , r : 0.7 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩ ]
BoutLIMIT = Bin
SELECT = [ ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩ ]
[ { co: 0.9, x: 3, y: 2 }, { co: 0.8, x: 2, y: 3 } ]
sensors : [ [1.3, 2] [0.7, 0.7, 0.9] [0.3, 0.8, 1.1] [0.7, 1.4] ]
Utilization of Input Order
SELECT r AS co, rp AS x, sp AS y FROM sensors AS s AT sp, s AS r AT rp WHERE r < 1.0 ORDER BY r DESC LIMIT 2
Find the highest two sensor readings that are below 1.0
x
y
FROM sensors AS s
{{ { sensor : 1, readings: {{ {co:0.4}, {co:0.2} }} }, { sensor : 2, readings: {{ {co:0.3} }} } }}
ℾ0 = ⟨ sensors : {{ {sensor: 1}, {sensor: 2} }} , logs: {{ {sensor: 1, co: 0.4}, {sensor: 1, co: 0.2}, {sensor: 2, co: 0.3}, }} ⟩
SELECT s.sensor AS sensor, ( SELECT l.co AS co FROM logs AS l WHERE l.sensor = s.sensor ) AS readings
BoutFROM = Bin
SELECT = {{ ⟨ s : {sensor: 1} ⟩, ⟨ s : {sensor: 2} ⟩ }}
Constructing Nested Results – Subqueries Again
9/9/2015
10
{{ {co:0.4}, {co:0.2} }}
ℾ1 = ⟨ s : {sensor : 1}, sensors : {{ {sensor: 1}, {sensor: 2} }} , logs: {{ {sensor: 1, co: 0.4}, {sensor: 1, co: 0.2}, {sensor: 2, co: 0.3}, }} ⟩
BoutFROM = Bin
SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩, ⟨ l : {sensor: 2, co: 0.3} ⟩ }}
Constructing Nested Results
FROM logs AS l
WHERE l.sensor = s.sensor
SELECT l.co AS co
BoutFROM = Bin
SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩ }}
Iterating over Attribute-Value pairs
INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
ℾ = ⟨ readings : { co : [], no2: [0.7], so2: [0.5, 2] } ⟩
FROM readings AS {g:v}, v AS n
SELECT ELEMENT { gas: g, val: v }
{{ ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }}
[ { gas: "co", num: null }, { gas: "no2", num: 0.7 }, { gas: "so2", num: 0.5, }, { gas: "so2", num: 2 } ]
9/9/2015
11
INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
ℾ = ⟨ readings : { co : [], no2: [0.7], so2: [0.5, 2] } ⟩
FROM readings AS {g:v} INNER CORRELATE v AS n
SELECT ELEMENT { gas: g, val: v }
{{ ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }}
[ { gas: "co", num: null }, { gas: "no2", num: 0.7 }, { gas: "so2", num: 0.5, }, { gas: "so2", num: 2 } ]
INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
ℾ = ⟨ readings : { co : [], no2: [0.7], so2: [0.5, 2] } ⟩
FROM readings AS {g:v} OUTER CORRELATE v AS n
SELECT ELEMENT { gas: g, val: v }
{{ ⟨ g : "co", v : [], n : missing ⟩, ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }}
[ { gas: "co", num: null }, { gas: "no2", num: 0.7 }, { gas: "so2", num: 0.5, }, { gas: "so2", num: 2 } ]
• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity, complex values
• Achieved with few extensions and SQL limitation removals • Element Variables
• Correlated Subqueries (in FROM and SELECT clauses)
• Optional utilization of Input Order
• Grouping indendent of Aggregation Functions
• ... and a few more details
• Configurability: making SQL++ a unifying and configurable language
The SQL++ Extensions to SQL
9/9/2015
12
{{ {co:0.4}, {co:0.2} }}
ℾ1 = ⟨ s : {sensor : 1}, sensors : {{ {sensor: 1}, {sensor: 2} }} , logs: {{ {sensor: 1, co: 0.4}, {sensor: 1, co: 0.2}, {sensor: 2, co: 0.3}, }} ⟩
BoutFROM = Bin
SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩, ⟨ l : {sensor: 2, co: 0.3} ⟩ }}
Algebras do not need schemaful inputs
FROM logs AS l
WHERE l.sensor = s.sensor
SELECT l.co AS co
BoutFROM = Bin
SELECT = {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, ⟨ l : {sensor: 1, co: 0.2} ⟩ }}
sensor co
1 0.4
1 0.2
2 0.3
• What is SQL++ ?
• How we use SQL++ in UCSD’s FORWARD ? • SQL++ uses in integration
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop
Outline
SQL++ Query Processor
SQL Database
NewSQL Database
NoSQL Database
SQL-on-Hadoop
Database
Client
Java In-Memory
Objects
SQL++ Virtual Views of Sources
Federated SQL++ Queries SQL++ Results
FORWARD Virtual Database
SQL++ based Virtual Database
SQL Wrapper
NewSQL Wrapper
NoSQL Wrapper
SQL-on-Hadoop Wrapper
Java Objects Wrapper
SQL++ Virtual View
SQL++ Virtual View
SQL++ Virtual View
SQL++ Virtual View
SQL++ Virtual View
Native queries
Native results
9/9/2015
13
SQL++ Query Processor
SQL Database
NewSQL Database
NoSQL Database
SQL-on-Hadoop
Database
Client
Java In-Memory
Objects
SQL++ Queries SQL++ Results
FORWARD Integration Database
SQL++ Virtual and/or Materialized Views
SQL++ Virtual View
SQL++ Virtual View
SQL++ Virtual View
SQL++ Virtual View
SQL++ Virtual View
SQL++ Integrated Views
SQL++ Integrated
View
SQL++ Integrated
View
SQL++ Added Value
View
SQL++ View Definitions
measurements: {{ { sid: 2, temperatures: [70.1, 70.2] }, { sid: 1, temperatures: [71.0] } }}
BI Tool / Report Writer
Couchbase
FORWARD Virtual Database
or its SQL++ equivalent
SQL query SQL result
sid temp 2 70.1
2 70.2
1 71.0
Couchbase query Couchbase results
measurements: {{ {sid: 2, temp: 70.1}, {sid: 2, temp: 70.2}, {sid: 1, temp: 71.0} }}
Use Case 1: Hide the ++ source features behind SQL views
SQL View
measurements:
SELECT m.sid, t AS temp FROM measurements AS m, m.temperatures AS t
measurements: {{ { sid: 2, temp: 70.1 }, { sid: 2, temp: 49.2 }, { sid: 1, temp: null } }}
MongoDB
SQL++ query SQL++ result
MongoDB query MongoDB results
Use Case 2: Semantics-Aware Pushdown
MongoDB Virtual Database
Virtual View (Identical)
MongoDB Wrapper
SELECT DISTINCT m.sid FROM measurements AS m WHERE m.temp < 50
"Sensors that recorded a temperature below 50"
SQL++ semantics for m.temp < 50
... { $match: {temp: {$lt: 50}}}, { $match: {temp: {$not: null}}} ...
MongoDB semantics for temp $lt 50
9/9/2015
14
• How to simulate incompatible semantics/missing features?
• Sources like SQL and MongoDB cannot execute the entirety of SQL++
• How to efficiently push down computation?
Issues automatically handled by the query processor.
Plenty of query rewriting problems, including novel ones on semi-structured operators
Push-Down Challenges: Limited capabilities & semantic variations
• Semantics of "less-than" comparison are different across sources:
<sql , <mongodb , etc.
• Config Parameters to capture these variations
SQL++ captures Semantics Variations
@lt { MongoDB } ( x < y )
@lt { complex : deep, type_mismatch: false, null_lt_null : false, null_lt_value: false, ... } ( x < y )
In NoSQL, NewSQL, SQL-on-Hadoop variation of semantics for:
• Paths
• Equality
• Comparisons
And all the operators that use them:
• Selections
• Grouping
• Ordering
• Set operations
Each of these features has a set of config parameters in SQL++
Semantics Variations
9/9/2015
15
Don’t give a standard Teach how to reach a standard
• Fast evolving space
• Standard would be premature
• Configuration options itemize and explain the space options
• Rationalize discussion
Advise on consistent settings of the configuration options
• Not every configuration option setting is sensible
• Configuration options have to subscribe to common sense
iff x = y then [x] = y
if x = y and y = z then x = z
• which enables rewritings
• Have to be consistent with each other Eg, if true AND missing is missing,
then false OR missing is missing
FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer
• The Incremental View Maintenance functionality: • SQL++ (Materialized) View Definition
• Stream of inserts, deletes, updates on base data
• The Incremental View Maintenance module • Automatically and efficiently updates the materialized view to reflect the
stream of changes
• SQL++ can also enable automatic Incremental View Maintenance! • With attention to replication of data in views
• Opportunities by keys
Eg, Couchbase has a JSON web log showing {{ {user, list of displayed products } }}
and FORWARD produces materialized view {{ {product category, count, products: [{product, count}]} }}
DB Before
DB After
View Before
View After
Stream
IVM Stream
9/9/2015
16
Custom dashboards, interactive pages & apps
• The data models of visualization components (e.g. Google Maps) can be nicely captured with JSON models
• The pages are SQL++ (JSON) views! • Mashups of the components views
• SQL++ feeds and incrementally updates the page views
Use case
• From data to visualization with just SQL++ & markup • Ajax/Javascript visuals with no Ajax/Javascript mess
• How to easily connect to today’s JS libraries
• Custom Ajax visualizations & interfaces for IT personnel
FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer
(part of) the Google Map model
<% unit google.maps.Maps %> { markers: [ { position: { latitude : number, longitude: number } ... } ] } <% end unit %>
FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer
• What is SQL++ ?
• How we use SQL++ in UCSD’s FORWARD ? • SQL++ uses in integration and (live) analytics applications
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop
Outline
9/9/2015
17
SQL-on-Hadoop
PIG
Jaql
CQL
N1QL
AQL
MongoDB driver
SQL-on-Hadoop
SQL & NewSQL
NoSQL
Others
Hive • SQL-like syntax • Hadoop clusters
Pig • Algebraic syntax • Hadoop clusters
Jaql (IBM) • Algebraic syntax • Hadoop cluster(s)
CQL (Cassandra) • Created at Facebook • SQL-like syntax
MongoDB • JSON-like syntax • Most used NoSQL database
JSONiq • Based on XQuery • 28msec
N1QL (Couchbase) • SQL-like syntax • No production release
SQL • SQL standard • Also comprises NewSQL
AQL (AsterixDB) • FLWOR syntax • Research database (UCI)
BigQuery (Google) • Previously "Dremel" • SQL-like syntax
MongoJDBC • SQL-like syntax • MongoDB via JDBC driver
Surveyed Databases
SELECT AVG(temp) AS tavg FROM readings GROUP BY sid
SQL
db.readings.aggregate( {$group: {_id: "$sid", tavg: {$avg:"$temp"}}})
MongoDB
readings -> group by sid = $.sid into { tavg: avg($.temp) };
Jaql
a = LOAD 'readings' AS (sid:int, temp:float); b = GROUP a BY sid; c = FOREACH b GENERATE AVG(temp); DUMP c;
Pig
for $r in collection("readings") group by $r.sid return { tavg: avg($r.temp) }
JSONiq
• SQL++ covers SQL, N1QL and QL research prototypes (e.g., UCI’s ASTERIX)
• Removing the current “Tower of Babel” effect
• Providing formal syntax and semantics
SQL++ Removes Superficial Differences
15 feature matrices (1-11 dimensions each) classifying:
• Data values
• Schemas
• Access and construct nested data
• Missing information
• Equality semantics
• Ordering semantics
• Aggregation
• Joins
• Set operators
• Extensibility
Surveyed features
9/9/2015
18
• What is SQL++ ?
• How we use SQL++ in UCSD’s FORWARD ?
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop • Methodology
• Example 1: data model (data values)
• Example 2: query language (SELECT clause)
• Example 3: semantics (path)
• Example 4: semantics (equality function)
Outline
Methodology
For each feature:
1. A formal definition of the feature in SQL++
2. A SQL++ example
3. A feature matrix that classifies each dimension of the feature
4. A discussion of the results, partial support and unexpected behaviors
All the results are empirically validated
Example: Data values
1. SQL++ example:
{ location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }
1 2 3 4 5 6 7 8 9
10 11 12 13 14
9/9/2015
19
Example: Data values
2. SQL++ BNF for values:
Example: Data values
3. Feature matrix:
Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives
Hive Bag of tuples No Yes No No Partial Yes Yes
Jaql Any Value Yes Yes No No No Yes Yes
Pig Bag of tuples Partial No Partial No Partial Yes Yes
CQL Bag of tuples No Partial No Partial Partial No Yes
JSONiq Any Value Yes Yes No No No Yes Yes
MongoDB Bag of tuples Yes Yes No No No Yes Yes
N1QL Bag of tuples Yes Yes No No No Yes Yes
SQL Bag of tuples No No No No No No Yes
AQL Any Value Yes Yes Yes No No Yes Yes
BigQuery Bag of tuples No No No No No Yes Yes
MongoJDBC Bag of tuples Yes Yes No No No Yes Yes
SQL++ Any Value Yes Yes Yes Partial Yes Yes Yes
4. Discussion of the results:
Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives
Hive Bag of tuples No Yes No No Partial Yes Yes
Jaql Any Value Yes Yes No No No Yes Yes
Pig Bag of tuples Partial No Partial No Partial Yes Yes
CQL Bag of tuples No Partial No Partial Partial No Yes
JSONiq Any Value Yes Yes No No No Yes Yes
MongoDB Bag of tuples Yes Yes No No No Yes Yes
N1QL Bag of tuples Yes Yes No No No Yes Yes
SQL Bag of tuples No No No No No No Yes
AQL Any Value Yes Yes Yes No No Yes Yes
BigQuery Bag of tuples No No No No No Yes Yes
MongoJDBC Bag of tuples Yes Yes No No No Yes Yes
SQL++ Any Value Yes Yes Yes Partial Yes Yes Yes
• Column-by-column comparison • Partial support (65k scalar elements) • Identify clusters (who supports JSON?)
Example: Data values
9/9/2015
20
1. SQL++ example: • Projecting nested collections:
• Projecting non-tuples:
SELECT ELEMENT ozone FROM readings
SELECT TUPLE s.lat, s.long, (SELECT r.ozone FROM readings AS r WHERE r.location = s.location) FROM sensors AS s
"Position and (nested) ozone readings of each sensor?"
"Bag of all the (scalar) ozone readings?"
Example: SELECT clause
Projecting tuples containing nested collections Projecting non-tuples
Hive Partial No
Jaql Yes Yes
Pig Partial No
CQL No No
JSONiq Yes Yes
MongoDB Partial Partial
N1QL Partial Partial
SQL No No
AQL Yes Yes
BigQuery No No
MongoJDBC No No
SQL++ Yes Yes
Example: SELECT clause
3. Feature matrix:
Projecting tuples containing nested collections Projecting non-tuples
Hive Partial No
Jaql Yes Yes
Pig Partial No
CQL No No
JSONiq Yes Yes
MongoDB Partial Partial
N1QL Partial Partial
SQL No No
AQL Yes Yes
BigQuery No No
MongoJDBC No No
SQL++ Yes Yes
• Not well supported features
• 3 languages support them
entirely (same cluster as for
data values)
4. Discussion of the results:
Example: SELECT clause
9/9/2015
21
We use config parameters to encompass and compare various semantics of a feature:
• Minimal number of independent dimensions
• 1 dimension = 1 config parameter
• SQL++ formalism parametrized by the config parameters
• Feature matrix classifies the values of each config parameter
Config Parameters
• Config parameters for tuple navigation:
@tuple_nav { absent : missing, type_mismatch: error, } ( x.y )
Example: Paths
• The feature matrix classifies, for each language, the value of each config parameter:
Missing Type mismatch
Hive Error Error
Jaql Null Error#
Pig Error Error
CQL Error Error
JSONiq Missing# Missing#
MongoDB Missing Missing
N1QL Missing Missing
SQL Error Error
AQL Null Error
BigQuery Error Error
MongoJDBC Missing# Missing#
SQL++ @path @path
Example: Paths
9/9/2015
22
• Config parameters for equality:
Example: Equality Function
@eq { complex : error, type_mismatch : false, null_eq_null : null, null_eq_value : null, missing_eq_missing: missing, missing_eq_value : missing, missing_eq_null : missing } ( x = y )
• The feature matrix classifies, for each language, the value of each config parameter:
Complex Type
mismatch Null = Null Null = Value
Missing =
Missing
Missing =
Value
Missing =
Null
Hive (=, <=>) Err, Err Err, Err Null, True Null, False N/A N/A N/A
Jaql Boolean Null Null Null N/A N/A N/A
Pig Boolean partial Err Null Null N/A N/A N/A
CQL Err Err Err False/Null N/A N/A N/A
JSONiq (=, deep-equal()) Err, Boolean Err, False True, True False, False Missing, True Missing, False Missing, False
MongoDB Boolean False True False True False False
N1QL Boolean False Null Null Missing Missing Missing
SQL N/A Err Null Null N/A N/A N/A
AQL Err Err Null Null N/A N/A N/A
BigQuery Err Err Null Null N/A N/A N/A
MongoJDBC Boolean False True False N/A False False
SQL++ @equal @equal @equal @equal @equal @equal @equal
• No real cluster • Some languages have multiple (incompatible) equality functions • Some edge cases cannot happen due to other limitations (SQL has no complex values)
Example: Equality Function
• As a database user: • Understand the semantics of a (often underspecified) query language / be
aware of the limitation of a database
• As a designer/architect of a database • Produce formal specification of your query language
• Align semantics with SQL's
• As a database researcher • The results might change, but the survey methodology stays
• As a designer/architect of database middleware • Understand what capability variations need to be encapsulated and
simulated
How to use this survey?
9/9/2015
23
• The marketing clusters do not correspond to real capabilities
• Limited capabilities: matrices are sparse and fragmented (more pressure on source-specific rewriters and distributor)
The survey shows:
The Future is Semi-Structured and Declarative
• Scalability
• Flexibility of Semistructured Data
• Power of Declarative:
Automatic Optimization, View Maintenance
• The primary operational and the secondary analytics application out of semistructured, declarative platforms