
ISO/IEC JTC 1 N 11698

ISO/IEC JTC 1 Information technology

Secretariat: ANSI (United States)

Document type: National Body Contribution

Title: US National Body Contribution on Big Data

Status: This document is circulated for review and consideration at the November 2013 JTC 1 Plenary meeting in France.

Date of document: 2013-09-19

Source: US

Expected action: ACT

Action due date: 2013-11-04

Email of secretary: [email protected]

Committee URL: http://isotc.iso.org/livelink/livelink/open/jtc1


US NATIONAL BODY CONTRIBUTION ON BIG DATA

Recognizing that Big Data:

  • Has been identified by SWG Planning as an important future area for JTC 1 focus,
  • Is a topic of consideration within SC 32 as reported to the Plenary, and
  • Continues to be of interest to other JTC 1 Subcommittees including SC 27 and SC 38

the US proposes the establishment of a Study Group for planning and orchestrating Big Data activities across all of JTC 1 with the following terms of reference:

In order to establish JTC 1 as a global leader for Big Data standardization, the Study Group on Big Data shall:

1. Survey the existing ICT landscape for relevant standards, models, studies, use cases, and existing technologies for Big Data,
2. Develop definitions and a high-level, vendor-neutral reference architecture for Big Data,
3. Analyze standards gaps, and propose standardization priorities to serve as a basis for future JTC 1 work in support of Big Data, and
4. Propose mechanisms and new work item assignments for ongoing and future JTC 1 standardization efforts.

The Study Group shall report its results to the 2014 JTC 1 Plenary.

The US offers Wo Chang to serve as Convenor for the JTC 1 Study Group on Big Data.


ISO/IEC JTC 1 N11819 2013-10-11

Secretariat, ISO/IEC JTC 1, American National Standards Institute, 25 West 43rd Street, New York, NY 10036; Telephone: 1 212 642 4932; Facsimile: 1 212 840 2298; Email: [email protected]

Replaces:

ISO/IEC JTC 1 Information Technology

Document Type: officer’s contribution

Document Title: SC 32 Chairman's Presentation to the November 2013 JTC 1 Plenary - An Interim SC 32 Viewpoint of Big Data & Next-Generation Analytics

Document Source: SC 32 Chairman

Project Number:

Document Status: This document is circulated for information and consideration at the November 2013 JTC 1 Plenary meeting in France.

Action ID: ACT

Due Date:

Pages:


An Interim SC 32 Viewpoint of Big Data & Next-Generation Analytics

Jim Melton JTC 1/SC 32 Chair [email protected]

November 2013 JTC 1 Plenary

Perros-Guirec, FRA


Abstract

  • The data management industry is not standing still. There are new capabilities in the SQL standard, new data management technologies, and new applications.
  • Big Data
  • Analytics
  • Support for Big Data
      • SQL Standard
      • Other types of databases
          • NoSQL Databases
          • NewSQL Databases

2 2013 ISO/IEC JTC 1 Plenary


Your Humble Servant

  • Architect at Oracle
  • Data management standardization
  • SQL Standards committees since 1986
      • Editor of all parts of ISO/IEC 9075 and TR 19075 since 1987
      • Major author of proposals for many years
      • Chair of SC 32 since 2011 (Acting Chair in 2011)
  • XQuery Standardization since 1998
      • Editor/Co-Editor: Functions & Operators, XQueryX
      • Chair of W3C WG since 2004 (Co-Chair until 2008)
  • Many other standards activities and interests


Dimensions of Big Data

Characteristics:

  • Quantity, size
  • Complexity
  • Rate of change
  • Varieties
  • Availability
  • Persistence
  • Integrity
  • Location
  • Relevance
  • Etc.

Aspects:

  • Data, per se
  • Metadata*, models*
  • Privacy & Security
  • Storage, reliability
  • Query & Analysis*
  • Transport, interchange*
  • Life cycle
  • Accessibility
  • Integration
  • Etc.

* Areas where SC 32 has expertise


Paradigm Shift in Database Industry

  • Many database users are attempting to escape the restrictions of the current SQL databases and database vendors
      • Distribution
      • Replication
      • High Availability
      • Large data volumes
      • Reduced up-front development costs
      • Minimal up-front licensing costs


Driving Forces

  • Big Data
      • Inexpensive storage of large volumes of data
      • Inexpensive compute power
      • High bandwidth networks
  • Next Generation Analytics
  • Today’s Responses
      • SQL Databases
      • NoSQL Databases
      • NewSQL Databases


Big Data Definition

  • Gartner's 3V definition of big data
      • Volume – terabytes, petabytes, …
      • Velocity – frequent inserts/updates, streaming
      • Variety – textual, geospatial, images, etc.
  • Additional Vs
      • Value – data is useful to someone
      • Veracity – validity of data can be assessed


Diagram from NBD-WG M0055, Big Data Architecture Framework


How big is big?

  • Data Volume
      • Terabytes – 1000**4
      • Petabytes – 1000**5
      • Exabytes – 1000**6
      • Zettabytes – 1000**7
      • Yottabytes – 1000**8
  • Data Distribution
      • Server
      • Cluster
      • Datacenter
      • Continent
      • World
      • Solar System
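The volume scale above is plain arithmetic on powers of 1000 bytes, so it can be sketched in a few lines. The names `UNITS`, `to_bytes`, and `ratio` below are illustrative and do not come from the presentation.

```python
# Decimal (SI) storage units from the slide, expressed as powers of 1000 bytes.
UNITS = {
    "terabyte": 1000**4,
    "petabyte": 1000**5,
    "exabyte": 1000**6,
    "zettabyte": 1000**7,
    "yottabyte": 1000**8,
}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    return amount * UNITS[unit]


def ratio(larger, smaller):
    """How many of the smaller unit fit into one of the larger unit."""
    return UNITS[larger] // UNITS[smaller]
```

For instance, `ratio("yottabyte", "terabyte")` gives the number of terabytes in one yottabyte, since each step up the list multiplies the size by 1000.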


Big Data Examples

  • Big Science – e.g., Large Hadron Collider, Sky Survey
  • Search Engines – e.g., Google, Bing
  • Web page click streams
  • Sensor networks, Internet of Things
  • Medical Research & Healthcare
  • Global Security Agencies


Next Generation Analytics

  • Analytics is moving from:
      • Off-line ⇨ in-line embedded analytics
      • Explaining what happened ⇨ predicting what will happen
  • Operating on:
      • Data at rest – stored someplace
      • Data in motion – streaming
  • Examples:
      • Targeted web site advertising, real-time advertising
      • Search engine results
      • Identifying best time to purchase tickets
      • Identifying cancer factors


Common Analytical Techniques

  • Analytical Functions
      • Logistic Regression
      • Random Forests
      • Naive Bayesian classifiers
  • Clustering
      • K-means clustering
      • Canopy clustering
      • LDA (Latent Dirichlet Allocation) for text analysis
  • Functions that can be naïvely parallelized
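As an illustration of one technique from the list, here is a minimal pure-Python sketch of k-means clustering. It is not taken from the presentation, and the function names are hypothetical. The assignment step treats every point independently, which is one plausible reading of why such functions are considered easy to parallelize.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point.

    Each point is handled independently of the others, so this loop could be
    split across workers without coordination.
    """
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster where it was
        else:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    for _ in range(max_iter):
        new_centroids = update(points, assign(points, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)
```

The caller supplies the initial centroids, which keeps the sketch deterministic; real implementations usually choose them randomly or with a seeding heuristic.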


Big Data Complications

  • Too much data to process sequentially
      • Process in parallel
  • Too much data to fit on one server
      • Distribute to multiple servers
  • Too much data to fit in one computer room
      • Distribute to multiple computer rooms
  • Too much data to move across network
      • Distribute queries to process in parallel on multiple servers in multiple computer rooms


Even More Complications

  • Availability
      • Replicate
      • Not all applications need all of the data all of the time to provide useful responses
  • Network Latency and Bandwidth
      • Another reason to distribute the query
      • Process query on replica closest to query source
      • c (the speed of light): Not just a good idea; it’s the law
  • Database technologies and tools are being constructed to handle the complications
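The latency point can be made concrete with a small calculation: no signal travels faster than light, so distance sets a floor on round-trip time regardless of bandwidth. The sketch below is illustrative only. The 10,000 km distance is a hypothetical example, and the 2/3 velocity factor for optical fibre is an approximate assumption rather than a figure from the presentation.

```python
# Speed of light in vacuum, in km/s (exact by definition of the metre).
C_VACUUM_KM_PER_S = 299_792.458


def min_round_trip_ms(distance_km, velocity_factor=1.0):
    """Lower bound on round-trip time in milliseconds over a given distance.

    velocity_factor scales the signal speed relative to light in vacuum;
    about 2/3 is a common rough figure for optical fibre (an assumption).
    """
    speed_km_per_s = C_VACUUM_KM_PER_S * velocity_factor
    return 2 * distance_km / speed_km_per_s * 1000
```

Under these assumptions, a query sent to a replica 10,000 km away cannot complete a round trip in under about 67 ms in vacuum, or about 100 ms through fibre, before any processing time is counted. That is the argument for processing the query on the closest replica.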


Big Data Logical View

  • Service Layer
      • Analysis & Prediction
  • Platform Layer
      • Data Integration
      • Data Semantic Intellectualization
  • Data Layer
      • Data Identification
      • Data Collection
      • Data Registry
      • Data Repository


Big Data Silo

[Diagram: a single silo. From top to bottom: Service Layer (Analysis & Prediction); Platform Layer (Data Semantic Intellectualization, Data Integration); Data Layer (Data Identification (Data Mining & Metadata Extraction), Data Collection, Data Registry, Data Repository).]


Integrating Big Data – Silos

[Diagram: three separate silos side by side. Each silo has its own Service Layer (Analysis & Prediction), Platform Layer (Data Semantic Intellectualization, Data Integration), and Data Layer (Data Identification (Data Mining & Metadata Extraction), Data Collection, Data Registry, Data Repository).]


Integrated Big Data Silos

[Diagram: three silos, each with its own Platform Layer (Data Semantic Intellectualization, Data Integration) and Data Layer (Data Identification (Data Mining & Metadata Extraction), Data Collection, Data Registry, Data Repository), under one shared Service Layer (Analysis & Prediction).]


Big Data Integrated Silos

[Diagram: three silos, each with its own Platform Layer (Data Semantic Intellectualization, Data Integration) and Data Layer (Data Identification (Data Mining & Metadata Extraction), Data Collection, Data Registry, Data Repository), under one shared Service Layer (Analysis & Prediction), with an added Service Support Layer (Data Quality Management, Data Visualization, Workflow Management).]


Big Data Integrated Silos

[Diagram: three silos, each with its own Platform Layer (Data Semantic Intellectualization, Data Integration) and Data Layer (Data Identification (Data Mining & Metadata Extraction), Data Collection, Data Registry, Data Repository), under one shared Service Layer (Analysis & Prediction), with a Service Support Layer (Data Quality Management, Data Visualization, Workflow Management) and Big Data Management (Data Curation, Security, Privacy).]

Diagram from 32N2386 & 32N2388


Terminology

  • Workflow Management – scheduling queries, reports, etc.
  • Data Quality Management – minimize garbage
  • Data Visualization – displaying the results of querying a trillion data points
  • Data Curation – where does it come from, where does it go, provenance, lifetime
  • Security – define and enforce access policies
  • Privacy – prevent release of personal data


Terminology

  • Data Semantic Intellectualization – Semantic Data Integration based on technologies such as Ontology, Reasoning, and so on


Data Layer

  • SQL Databases (SQL Classic)
  • NoSQL Databases
  • NewSQL Databases
  • Spatial Data
  • Video/Image/Sound
  • Satellite/Radar/Seismic/Sonar (sensors)
  • Streaming Data
  • Etc…


Data Layer Characteristics

  • Volume
  • Storage Structure
      • Row Store
      • Column Store
      • Document Store
      • Key-Value Store
      • Graph
      • Streaming
  • Metadata
      • Can be queried
      • Known beforehand (or not)
  • Distribution
  • Interface
      • SQL
      • JDBC
      • ODBC (SQL/CLI)
      • Custom
  • Transactions
      • ACID Transactions
      • BASE Transactions
      • No Transactions


Challenges

  • Distribute queries across disparate data sources
  • Integrate query results
  • Security
  • Privacy
  • For each data source, need to understand
      • Structure
      • Types of queries needed & supported


Multi-Data Source Queries

  • What is the correlation between mosquito-borne diseases, precipitation, and temperatures?
  • What correlations exist between the genomes of cancer patients and the effectiveness of cancer treatments?
  • Based on recordings of vehicle sounds and previous maintenance histories, what preventative maintenance is needed?


Seriously?

  • Provide names, photos, and travel history for all individuals taller than 200 cm with blond hair below their shoulders and weighing between about 85 kg and 115 kg who flew between New York City and any destination in southeast Asia during any period between 2007 and 2010 when the weather in southern Brazil included rain > 2 cm/hour during the same month when the Prime Minister of Japan was on holiday and any southwest Asian nation experienced a slip-fault earthquake of magnitude 5.5 to 6.5.


Industry & Standards Efforts

  • NoSQL Projects & Products
  • NewSQL Projects & Products
  • Standards – ISO/IEC JTC 1/SC 32
  • NIST Big Data Working Group (NBD-WG)


NoSQL Products/Projects

  • http://www.nosql-database.org/ lists 150 NoSQL Databases.
  • Some examples:
      • Cassandra
      • CouchDB
      • Hadoop & HBase
      • MongoDB
      • StupidDB
      • Etc.


NoSQL – Distributed Storage

  • Distribute across multiple servers potentially in multiple computer rooms
  • Replicate across multiple servers potentially in multiple computer rooms
  • Details depend on products & eco-system
  • Infrastructure to distribute queries
      • Map/Reduce
      • Automatic


NoSQL – Map Reduce

  • Indexing and searching large data volumes
  • Two Phases: Map and Reduce
      • Map
          • Extract sets of Key-Value pairs from underlying data
          • Potentially in Parallel on multiple machines
      • Reduce
          • Merge and sort sets of Key-Value pairs
          • Results may be useful for other searches
  • Techniques differ across products
  • Application developers, underlying software
      • Must understand distribution scheme
      • Today: Mostly application responsibility
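The two phases can be sketched in a single process as a word count. This is an illustration under simplifying assumptions, not the API of Hadoop or any other product, and the function names are hypothetical. In a real deployment the map calls would run on different machines and a framework would perform the grouping step.

```python
from collections import defaultdict


def map_phase(document):
    """Map: extract (key, value) pairs from one piece of underlying data."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key (normally done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: merge the values for each key, with keys in sorted order."""
    return {key: sum(values) for key, values in sorted(groups.items())}


def word_count(documents):
    """Run map over every document, then group and reduce the results."""
    pairs = []
    for document in documents:
        # Each map call depends only on its own document, so these calls
        # could run in parallel on separate machines.
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))
```

Calling `word_count(["big data", "Big analytics"])` counts `big` twice because the map step lowercases each word before emitting it.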


Automated Query Distribution

  • Some products automate query distribution & execution
  • Isolate application from underlying distribution
  • Key requirement for new optimizer technology


NoSQL – Retrieving Data

  • Syntax Varies
      • No set-based or declarative query language
      • Procedural program languages such as Java, C, etc.
  • Application specifies retrieval path
  • No query optimizer
  • Quick answer is important
  • May not be a single “right” answer


NoSQL – Updating Data

  • BASE transactions
  • “Eventually correct”
  • Lazy updates
  • Replicas rarely fully synchronized
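BASE is usually expanded as Basically Available, Soft state, Eventual consistency. The toy model below sketches what lazy updates mean for a reader: a write lands on one replica immediately and reaches the others only when propagation runs. The class and method names are hypothetical and do not model any particular product.

```python
class LazyReplicatedStore:
    """Toy key-value store whose replicas converge only when sync() runs."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # queued updates: (replica_index, key, value)

    def write(self, key, value):
        """Apply the write to replica 0 now; queue it for the others."""
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica=0):
        """Read from one replica; the answer may be stale."""
        return self.replicas[replica].get(key)

    def sync(self):
        """Deliver all queued updates in order, bringing replicas up to date."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()
```

Between `write` and `sync`, a read from replica 1 returns the old value (or nothing). That gap is the window in which replicas are not fully synchronized.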


NewSQL Projects & Products

  • Scalable performance of NoSQL products
      • Distributed storage – sharding
      • Distributed queries
      • In-memory techniques
      • Etc.
  • Support for Online Transaction Processing – ACID transaction guarantees
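Sharding can be sketched as a deterministic mapping from a key to one of several partitions. The example below uses simple hash-modulo placement, which is only one of several possible schemes and is not attributed to any product named in this presentation. `shard_for` and `ShardedStore` are hypothetical names.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard from a stable hash of the key.

    hashlib is used instead of the built-in hash() so that placement is the
    same across processes and runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy key-value store that spreads keys over several in-memory shards."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

One design consequence of modulo placement is that changing the shard count moves most keys, which is why production systems often use other schemes such as consistent hashing or range partitioning.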


What is ISO/IEC JTC 1/SC 32 Doing?

  • WG2 – Metadata
  • WG3 – Database Languages (SQL)
  • WG4 – SQL/MM


WG2 – Metadata

  • ISO/IEC 11179 Metadata Registry
      • Metadata Registry structure and procedures
      • Semantics, data representation, & data descriptions
  • ISO/IEC 19763 Metamodel Framework for Interoperability (MFI)
      • Communicate, execute programs, or transfer data among various functional units
      • Requires little or no knowledge of the unique characteristics of those units


WG3 – Database Languages

  • Next version of the SQL standards (ISO/IEC 9075) publication in late 2015 or early 2016
      • Row Pattern Recognition – already added
      • Bi-Temporal support – already added
  • Other possible additions
      • JavaScript Object Notation (JSON) documents
      • User Defined Aggregation Functions
      • Dynamic Table Functions
      • Polymorphic Table Functions


Row Pattern Recognition

  • Adds MATCH_RECOGNIZE clause to FROM
  • Specifies pattern across a sequence of rows
  • New Syntax:
      • ONE ROW PER MATCH
          • Returns single summary row for each match of the pattern
          • Default
      • ALL ROWS PER MATCH
          • Returns one row for each row of each match


Row Pattern Recognition Example

    SELECT M.Symbol,    /* ticker symbol */
           M.Matchno,   /* sequential match number */
           M.Tradeday,  /* day of trading */
           M.Price,     /* price on day of trading */
           M.Classy,    /* classifier */
           M.Startp,    /* starting price */
           M.Bottomp,   /* bottom price */
           M.Endp,      /* ending price */
           M.Avgp       /* average price */
    FROM Ticker
         MATCH_RECOGNIZE (
           PARTITION BY Symbol
           ORDER BY Tradeday
           MEASURES MATCH_NUMBER () AS Matchno,
                    CLASSIFIER () AS Classy,
                    A.Price AS Startp,
                    FINAL LAST (B.Price) AS Bottomp,
                    FINAL LAST (C.Price) AS Endp,
                    FINAL AVG (U.Price) AS Avgp
           ALL ROWS PER MATCH
           AFTER MATCH SKIP PAST LAST ROW
           PATTERN (A B+ C+)
           SUBSET U = (A, B, C)
           DEFINE /* A defaults to True, matches any row */
                  B AS B.Price < PREV (B.Price),
                  C AS C.Price > PREV (C.Price)
         ) AS M

    /* From SQL/RPR Technical Report, Fred Zemke, February 2013 */
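To make the pattern `(A B+ C+)` concrete, the sketch below emulates it in plain Python over one partition's prices, already ordered by trading day. It illustrates what the query looks for (a starting row, one or more falling rows, then one or more rising rows) and is not how an SQL engine implements MATCH_RECOGNIZE. The function name and return shape are hypothetical. Because a row cannot be both lower and higher than its predecessor, greedy matching needs no backtracking for this particular pattern.

```python
def find_v_shapes(prices):
    """Return (start, bottom, end) index triples for PATTERN (A B+ C+).

    A matches any row, B matches a row priced below the previous row, and
    C matches a row priced above the previous row.
    """
    matches = []
    i, n = 0, len(prices)
    while i < n:
        j = i + 1
        while j < n and prices[j] < prices[j - 1]:  # B+: falling rows
            j += 1
        bottom = j - 1
        k = j
        while k < n and prices[k] > prices[k - 1]:  # C+: rising rows
            k += 1
        end = k - 1
        if bottom > i and end > bottom:  # at least one B row and one C row
            matches.append((i, bottom, end))
            i = end + 1  # AFTER MATCH SKIP PAST LAST ROW
        else:
            i += 1  # no match starting here; try the next row
    return matches
```

For each triple, `prices[start]`, `prices[bottom]`, and `prices[end]` correspond to the `Startp`, `Bottomp`, and `Endp` measures in the query above.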


Other Possible Additions

  • JSON – Just Some Other Notation
  • User-Defined Data Aggregation Functions

        Select Col1, MyFunction(col2), SUM(col3) From Table1 Group by Col1;

  • Dynamic/Polymorphic Table Functions
      • Parameter is one arbitrary table
      • Function result can be another arbitrary table

        Insert into T1 Select * from MyInferenceEngine(ExternalDataStream.T2);

      • Result table “shape” not known until run-time
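The user-defined aggregate idea can be tried today with SQLite, whose Python binding lets an application register its own aggregate through `Connection.create_aggregate`. This shows the concept only and not the syntax under consideration for the SQL standard. The choice of a median for `MyFunction`, the column types, and the helper name `aggregate_by_group` are assumptions made for this sketch.

```python
import sqlite3
import statistics


class Median:
    """Aggregate class for sqlite3: step() sees each row, finalize() returns."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:  # ignore NULLs, as built-in aggregates do
            self.values.append(value)

    def finalize(self):
        return statistics.median(self.values) if self.values else None


def aggregate_by_group(rows):
    """Load rows into an in-memory table and run the slide's example query."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("MyFunction", 1, Median)
    conn.execute("CREATE TABLE Table1 (Col1 TEXT, col2 REAL, col3 REAL)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT Col1, MyFunction(col2), SUM(col3) "
        "FROM Table1 GROUP BY Col1 ORDER BY Col1"
    ).fetchall()
    conn.close()
    return result
```

The user-defined aggregate sits next to the built-in `SUM` in the same `GROUP BY` query, which is the usage pattern the slide's example statement shows.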


WG4 – SQL/MM

  • Support for extended datatypes within SQL databases
  • ISO/IEC 13249-2 Full Text – content-based retrieval.
  • ISO/IEC 13249-5 Still Image – basic functions for image data management.
  • ISO/IEC 13249-3 Spatial – functions to support geo-spatial applications
  • ISO/IEC 13249-6 Data Mining – support for statistical data mining


Big Data Gaps

  • Data source registry
      • Location, contents, and semantics
      • Ability to discover and utilize data
  • Common interface to disparate data sources
  • Better support for
      • Queries against images, video, & sound
      • Streaming data
      • Security & privacy
      • Integration of analytical functions


Summary

  • Big Data has arrived
      • Significant hype but practical applications emerging
      • Hype tends to focus on data store capabilities
          • It’s just another data store
  • SQL Standards development is ongoing
      • Next version in 2015/2016
      • Temporal data
      • Row Pattern Recognition
      • Additional temporal support?
      • Multi-dimensional Data Type?
      • Additional support for big data?


SC 32 Opportunities

  • WG 2 – Metadata for Big Data
      • Many kinds of metadata (structural, semantic, catalog, integration, correlation)
      • Automatic discovery of & reasoning over metadata
  • WG 3 – Big Data Querying & Manipulation
      • Tables, trees, etc. necessary, but wholly inadequate
      • New query paradigms, analysis, transaction models
  • WG 4 – Specialized Big Data Types
      • Support for other forms of data (e.g., seismic)
      • “Helper” functionality


Acknowledgements

  • All errors, misunderstandings, misleading statements, and idiotic comments are mine and mine alone.
  • Keith Hare (JTC 1/SC 32/WG 3 Convenor)
  • JTC 1/SC 32 Study Group on Next-Generation Analytics and Big Data
  • Jörn Bartels, 정성재 (Jung, Sung Jae), Keith Gordon, Krishna Kulkarni, numerous others


Questions?


Big Data Analysis Challenges

A number of challenges in both data management and data analysis require new approaches to support the big data era. These challenges span generation of the data, preparation for analysis, and policy-related challenges in its sharing and use, including the following:

  • Dealing with highly distributed data sources,
  • Tracking data provenance, from data generation through data preparation,
  • Validating data,
  • Coping with sampling biases and heterogeneity,
  • Working with different data formats and structures,
  • Developing algorithms that exploit parallel and distributed architectures,
  • Ensuring data integrity,
  • Ensuring data security,
  • Enabling data discovery and integration,
  • Enabling data sharing,
  • Developing methods for visualizing massive data,
  • Developing scalable and incremental algorithms, and
  • Coping with the need for real-time analysis and decision-making.

National Research Council. 2013. Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press.


References

  • “Big Data”, Viktor Mayer-Schönberger & Kenneth Cukier, Houghton Mifflin Harcourt, New York, NY, 2013.
  • “Big Data Now: 2012 Edition”, http://oreilly.com/data/radarreports/big-data-now-2012.csp
  • NIST Big Data Working Group, http://bigdatawg.nist.gov/home.php


  • “Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison,” Kristóf Kovács, viewed 2013-03-16, http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis?imm_mid=0a2ec6&cmp=em-velocity-newsletters-vlny-cfp-20130307-direct


  • National Research Council. 2013. Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press. http://www.nap.edu/catalog.php?record_id=18374


  • “What Does 'Big Data' Mean?” Michael Stonebraker, Communications of the ACM Blogs
      • Part 1: http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
      • Part 2: http://cacm.acm.org/blogs/blog-cacm/156102-what-does-big-data-mean-part-2/fulltext
      • Part 3: http://cacm.acm.org/blogs/blog-cacm/157589-what-does-big-data-mean-part-3/fulltext
      • Part 4: http://cacm.acm.org/blogs/blog-cacm/162095-what-does-big-data-mean-part-4/fulltext
