ISO/IEC JTC 1 N 11698
ISO/IEC JTC 1 Information technology
Secretariat: ANSI (United States)
Document type: National Body Contribution
Title: US National Body Contribution on Big Data
Status: This document is circulated for review and consideration at the November 2013 JTC 1 Plenary meeting in France.
Date of document: 2013-09-19
Source: US
Expected action: ACT
Action due date: 2013-11-04
Email of secretary: [email protected]
Committee URL: http://isotc.iso.org/livelink/livelink/open/jtc1
US NATIONAL BODY CONTRIBUTION ON BIG DATA
Recognizing that Big Data:
• Has been identified by SWG Planning as an important future area for JTC 1 focus,
• Is a topic of consideration within SC 32 as reported to the Plenary, and
• Continues to be of interest to other JTC 1 Subcommittees including SC 27 and SC 38
the US proposes the establishment of a Study Group for planning and orchestrating Big Data activities across all of JTC 1 with the following terms of reference:
In order to establish JTC 1 as a global leader for Big Data standardization, the Study Group on Big Data shall:
1. Survey the existing ICT landscape for relevant standards / models / studies / use cases / and existing technologies for Big Data,
2. Develop definitions and a high-level, vendor neutral reference architecture for Big Data,
3. Analyze standards gaps, and propose standardization priorities to serve as a basis for future JTC 1 work in support of Big Data, and
4. Propose mechanisms and new work item assignments for ongoing and future JTC 1 standardization efforts.
The Study Group shall report its results to the 2014 JTC 1 Plenary.
The US offers Wo Chang to serve as Convenor for the JTC 1 Study Group on Big Data.
ISO/IEC JTC 1 N11819 2013-10-11
Secretariat, ISO/IEC JTC 1, American National Standards Institute, 25 West 43rd Street, New York, NY 10036; Telephone: 1 212 642 4932; Facsimile: 1 212 840 2298; Email: [email protected]
Replaces:
ISO/IEC JTC 1 Information Technology
Document Type: officer’s contribution
Document Title: SC 32 Chairman's Presentation to the November 2013 JTC 1 Plenary - An Interim SC 32 Viewpoint of Big Data & Next-Generation Analytics
Document Source: SC 32 Chairman
Project Number:
Document Status: This document is circulated for information and consideration at the November 2013 JTC 1 Plenary meeting in France.
Action ID: ACT
Due Date:
Pages:
An Interim SC 32 Viewpoint of Big Data & Next-Generation Analytics
Jim Melton JTC 1/SC 32 Chair [email protected]
November 2013 JTC 1 Plenary
Perros-Guirec, FRA
Abstract
• The data management industry is not standing still. There are new capabilities in the SQL standard, new data management technologies, and new applications.
• Big Data
• Analytics
• Support for Big Data
  • SQL Standard
  • Other types of databases
    • NoSQL Databases
    • New SQL Databases
2 2013 ISO/IEC JTC 1 Plenary
Your Humble Servant
• Architect at Oracle
• Data management standardization
• SQL Standards committees since 1986
  • Editor of all parts of ISO/IEC 9075 and TR 19075 since 1987
  • Major author of proposals for many years
  • Chair of SC 32 since 2011 (Acting Chair in 2011)
• XQuery Standardization since 1998
  • Editor/Co-Editor: Functions & Operators, XQueryX
  • Chair of W3C WG since 2004 (Co-Chair until 2008)
• Many other standards activities and interests
Dimensions of Big Data
Characteristics:
• Quantity, size
• Complexity
• Rate of change
• Varieties
• Availability
• Persistence
• Integrity
• Location
• Relevance
• Etc.
Aspects:
• Data, per se
• Metadata*, models*
• Privacy & Security
• Storage, reliability
• Query & Analysis*
• Transport, interchange*
• Life cycle
• Accessibility
• Integration
• Etc.
* Areas where SC 32 has expertise
Paradigm Shift in Database Industry
• Many database users are attempting to escape the restrictions of the current SQL databases and database vendors
  • Distribution
  • Replication
  • High Availability
  • Large data volumes
  • Reduced up-front development costs
  • Minimal up-front licensing costs
Driving Forces
• Big Data
  • Inexpensive storage of large volumes of data
  • Inexpensive compute power
  • High bandwidth networks
• Next Generation Analytics
• Today’s Responses
  • SQL Databases
  • NoSQL Databases
  • NewSQL Databases
Big Data Definition
• Gartner's 3V definition of big data
  • Volume – terabytes, petabytes, …
  • Velocity – frequent inserts/updates, streaming
  • Variety – textual, geospatial, images, etc.
• Additional Vs
  • Value – data is useful to someone
  • Veracity – validity of data can be assessed
Diagram from NBD-WG M0055, Big Data Architecture Framework
How big is big?
• Data Volume
  • Terabyte – 1000**4 bytes
  • Petabyte – 1000**5 bytes
  • Exabyte – 1000**6 bytes
  • Zettabyte – 1000**7 bytes
  • Yottabyte – 1000**8 bytes
• Data Distribution
  • Server
  • Cluster
  • Datacenter
  • Continent
  • World
  • Solar System
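The unit definitions on this slide can be checked with a few lines of arithmetic. The following is only a small sketch; the names `UNITS`, `to_bytes`, and `terabytes_in` are illustrative and do not come from the presentation.

```python
# Decimal (SI) byte units as powers of 1000, matching the slide's notation.
UNITS = {
    "terabyte": 1000**4,
    "petabyte": 1000**5,
    "exabyte": 1000**6,
    "zettabyte": 1000**7,
    "yottabyte": 1000**8,
}


def to_bytes(amount, unit):
    """Convert an amount expressed in the given unit to a whole number of bytes."""
    return int(amount * UNITS[unit])


def terabytes_in(unit):
    """Return how many terabytes make up one of the given unit."""
    return UNITS[unit] // UNITS["terabyte"]
```

For example, `terabytes_in("petabyte")` evaluates to 1000, so each step up the list multiplies the volume by a thousand.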
Big Data Examples
• Big Science – e.g., Large Hadron Collider, Sky Survey
• Search Engines – e.g., Google, Bing
• Web page click streams
• Sensor networks, Internet of Things
• Medical Research & Healthcare
• Global Security Agencies
Next Generation Analytics
• Analytics is moving from:
  • Off-line ⇨ in-line embedded analytics
  • Explaining what happened ⇨ predicting what will happen
• Operating on:
  • Data at rest – stored someplace
  • Data in motion – streaming
• Examples:
  • Targeted web site advertising, real-time advertising
  • Search engine results
  • Identifying best time to purchase tickets
  • Identifying cancer factors
Common Analytical Techniques
• Analytical Functions
  • Logistic Regression
  • Random Forests
  • Naive Bayesian classifiers
• Clustering
  • K-means clustering
  • Canopy clustering
  • LDA (Latent Dirichlet Allocation) for text analysis
• Functions that can be naïvely parallelized
Big Data Complications
• Too much data to process sequentially
  • Process in parallel
• Too much data to fit on one server
  • Distribute to multiple servers
• Too much data to fit in one computer room
  • Distribute to multiple computer rooms
• Too much data to move across network
  • Distribute queries to process in parallel on multiple servers in multiple computer rooms
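The idea of sending the query to the data, rather than moving the data to the query, can be sketched as follows. This is an illustration only: threads stand in for separate servers, and the names `partition`, `local_query`, and `distributed_average` are hypothetical. Each "server" returns a small partial result (a count and a total), and only those partials cross the network to be combined.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts slices, one per hypothetical server."""
    return [rows[i::n_parts] for i in range(n_parts)]


def local_query(part):
    """Runs where the data lives and returns a small partial result."""
    return len(part), sum(part)


def distributed_average(rows, n_parts=4):
    """Average computed from per-partition partial results."""
    parts = partition(rows, n_parts)
    # Threads are an assumption standing in for servers in several computer rooms.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(local_query, parts))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return total / count
```

Only two numbers per partition are moved, however large each partition is, which is the point of the last bullet above.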
Even More Complications
• Availability
  • Replicate
  • Not all applications need all of the data all of the time to provide useful responses
• Network Latency and Bandwidth
  • Another reason to distribute the query
  • Process query on replica closest to query source
  • c (the speed of light): Not just a good idea; it’s the law
• Database technologies and tools are being constructed to handle the complications
Big Data Logical View
• Service Layer
  • Analysis & Prediction
• Platform Layer
  • Data Integration
  • Data Semantic Intellectualization
• Data Layer
  • Data Identification
  • Data Collection
  • Data Registry
  • Data Repository
Big Data Silo
[Diagram: a single silo comprising a Service Layer (Analysis & Prediction), a Platform Layer (Data Semantic Intellectualization, Data Integration), and a Data Layer (Data Collection, Data Registry, Data Repository, Data Identification (Data Mining & Metadata Extraction)).]
Integrating Big Data – Silos
[Diagram: several silos side by side (…), each with its own Data Layer, Platform Layer, and Service Layer (Analysis & Prediction).]
Integrated Big Data Silos
[Diagram: several silos (…), each with its own Data Layer and Platform Layer, sharing a single Service Layer (Analysis & Prediction).]
Big Data Integrated Silos
[Diagram: several silos (…), each with its own Data Layer and Platform Layer, sharing a single Service Layer (Analysis & Prediction) and a Service Support Layer (Data Quality Management, Data Visualization, Workflow Management).]
Big Data Integrated Silos
[Diagram: several silos (…), each with its own Data Layer and Platform Layer, sharing a single Service Layer (Analysis & Prediction), a Service Support Layer (Data Quality Management, Data Visualization, Workflow Management), and Big Data Management (Data Curation, Security, Privacy, …).]
Diagram from 32N2386 & 32N2388
Terminology
• Workflow Management – scheduling queries, reports, etc.
• Data Quality Management – minimize garbage
• Data Visualization – displaying the results of querying a trillion data points
• Data Curation – where does it come from, where does it go, provenance, lifetime
• Security – define and enforce access policies
• Privacy – prevent release of personal data
Terminology
• Data Semantic Intellectualization – Semantic Data Integration based on technologies such as Ontology, Reasoning, and so on
Data Layer
• SQL Databases (SQL Classic)
• NoSQL Databases
• NewSQL Databases
• Spatial Data
• Video/Image/Sound
• Satellite/Radar/Seismic/Sonar (sensors)
• Streaming Data
• Etc…
Data Layer Characteristics
• Volume
• Storage Structure
  • Row Store
  • Column Store
  • Document Store
  • Key-Value Store
  • Graph
  • Streaming
• Metadata
  • Can be queried
  • Known beforehand (or not)
• Distribution
• Interface
  • SQL
  • JDBC
  • ODBC (SQL/CLI)
  • Custom
• Transactions
  • ACID Transactions
  • BASE Transactions
  • No Transactions
Challenges
• Distribute queries across disparate data sources
• Integrate query results
• Security
• Privacy
• For each data source, need to understand
  • Structure
  • Types of queries needed & supported
Multi-Data Source Queries
• What is the correlation between mosquito-borne diseases, precipitation, and temperatures?
• What correlations exist between the genomes of cancer patients and the effectiveness of cancer treatments?
• Based on recordings of vehicle sounds and previous maintenance histories, what preventative maintenance is needed?
Seriously?
• Provide names, photos, and travel history for all individuals taller than 200 cm with blond hair below their shoulders and weighing between about 85 kg and 115 kg who flew between New York City and any destination in southeast Asia during any period between 2007 and 2010 when the weather in southern Brazil included rain > 2 cm/hour during the same month when the Prime Minister of Japan was on holiday and any southwest Asian nation experienced a slip-fault earthquake of magnitude 5.5 to 6.5.
Industry & Standards Efforts
• NoSQL Projects & Products
• NewSQL Projects & Products
• Standards – ISO/IEC JTC 1/SC 32
• NIST Big Data Working Group (NBD-WG)
NoSQL Products/Projects
• http://www.nosql-database.org/ lists 150 NoSQL Databases. Some examples:
  • Cassandra
  • CouchDB
  • Hadoop & HBase
  • MongoDB
  • StupidDB
  • Etc.
NoSQL – Distributed Storage
• Distribute across multiple servers potentially in multiple computer rooms
• Replicate across multiple servers potentially in multiple computer rooms
• Details depend on products & eco-system
• Infrastructure to distribute queries
  • Map/Reduce
  • Automatic
NoSQL – Map Reduce
• Indexing and searching large data volumes
• Two Phases: Map and Reduce
• Map
  • Extract sets of Key-Value pairs from underlying data
  • Potentially in Parallel on multiple machines
• Reduce
  • Merge and sort sets of Key-Value pairs
  • Results may be useful for other searches
• Techniques differ across products
• Application developers, underlying software
  • Must understand distribution scheme
  • Today: Mostly application responsibility
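The two phases can be sketched in a few lines. This is a single-process illustration using a word count as the assumed workload; the names `map_phase`, `reduce_phase`, and `map_reduce` are hypothetical and do not correspond to any particular product's API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: extract (key, value) pairs from one chunk of underlying data."""
    return [(word.lower(), 1) for word in document.split()]


def reduce_phase(pairs):
    """Reduce: merge pairs that share a key, then sort the result by key."""
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return sorted(merged.items())


def map_reduce(documents):
    """Run the map step per document, then reduce all emitted pairs."""
    pairs = []
    for doc in documents:  # each map call could run on a different machine
        pairs.extend(map_phase(doc))
    return reduce_phase(pairs)
```

Because each `map_phase` call touches only its own document, the map step is the part that can be spread across machines; the reduce step is where the partial results meet.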
Automated Query Distribution
• Some products automate query distribution & execution
• Isolate application from underlying distribution
• Key requirement for new optimizer technology
NoSQL – Retrieving Data
• Syntax Varies
  • No set-based or declarative query language
  • Procedural program languages such as Java, C, etc.
• Application specifies retrieval path
• No query optimizer
• Quick answer is important
• May not be a single “right” answer
NoSQL – Updating Data
• BASE transactions
• “Eventually correct”
• Lazy updates
• Replicas rarely fully synchronized
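A toy model may help make "lazy updates" concrete. In this sketch, which is an assumption-laden illustration rather than how any specific NoSQL product works, writes are applied to a primary copy at once and reach the replicas only when `sync` is called. The class name `LazyReplicatedStore` and its methods are hypothetical.

```python
class LazyReplicatedStore:
    """Toy key-value store: writes hit the primary now and the replicas later."""

    def __init__(self, n_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # updates not yet applied to the replicas

    def write(self, key, value):
        """Apply the write to the primary and queue it for the replicas."""
        self.primary[key] = value
        self.pending.append((key, value))

    def read_replica(self, index, key):
        """Read from one replica; may return stale or missing data."""
        return self.replicas[index].get(key)

    def sync(self):
        """Propagate queued updates, in order, to every replica."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()
```

Between `write` and `sync`, a reader of a replica sees the old state, which is the window in which replicas are not fully synchronized.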
NewSQL Projects & Products
• Scalable performance of NoSQL products
  • Distributed storage – sharding
  • Distributed queries
  • In-memory techniques
  • Etc.
• Support for Online Transaction Processing – ACID transaction guarantees
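Sharding, as mentioned above, assigns each key to one of several storage nodes. The sketch below uses simple hash-modulo placement, which is one common scheme chosen here as an assumption; the presentation does not say which scheme any product uses. `shard_for` and `ShardedStore` are hypothetical names, and in-memory dictionaries stand in for servers.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a string key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


class ShardedStore:
    """Toy store that spreads keys across several in-memory shards."""

    def __init__(self, n_shards=4):
        self.shards = [{} for _ in range(n_shards)]

    def put(self, key, value):
        """Store the value on the shard that owns the key."""
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        """Look the key up on the single shard that can hold it."""
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

A lookup touches exactly one shard, which is what lets capacity grow by adding nodes; the cost of this simple scheme is that changing the shard count moves most keys.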
What is ISO/IEC JTC 1/SC 32 Doing?
• WG2 – Metadata
• WG3 – Database Languages (SQL)
• WG4 – SQL/MM
WG2 – Metadata
• ISO/IEC 11179 Metadata Registry
  • Metadata Registry structure and procedures
  • Semantics, data representation, & data descriptions
• ISO/IEC 19763 Metamodel Framework for Interoperability (MFI)
  • Communicate, execute programs, or transfer data among various functional units
  • Requires little or no knowledge of the unique characteristics of those units
WG3 – Database Languages
• Next version of the SQL standards (ISO/IEC 9075) publication in late 2015 or early 2016
• Row Pattern Recognition – already added
• Bi-Temporal support – already added
• Other possible additions
  • JavaScript Object Notation (JSON) documents
  • User Defined Aggregation Functions
  • Dynamic Table Functions
  • Polymorphic Table Functions
Row Pattern Recognition
• Adds MATCH_RECOGNIZE clause to FROM
• Specifies pattern across a sequence of rows
• New Syntax:
  • ONE ROW PER MATCH
    • Returns single summary row for each match of the pattern
    • Default
  • ALL ROWS PER MATCH
    • Returns one row for each row of each match
Row Pattern Recognition Example
SELECT M.Symbol,   /* ticker symbol */
       M.Matchno,  /* sequential match number */
       M.Tradeday, /* day of trading */
       M.Price,    /* price on day of trading */
       M.Classy,   /* classifier */
       M.Startp,   /* starting price */
       M.Bottomp,  /* bottom price */
       M.Endp,     /* ending price */
       M.Avgp      /* average price */
FROM Ticker
  MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tradeday
    MEASURES MATCH_NUMBER () AS Matchno,
             CLASSIFIER () AS Classy,
             A.Price AS Startp,
             FINAL LAST (B.Price) AS Bottomp,
             FINAL LAST (C.Price) AS Endp,
             FINAL AVG (U.Price) AS Avgp
    ALL ROWS PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B+ C+)
    SUBSET U = (A, B, C)
    DEFINE /* A defaults to True, matches any row */
           B AS B.Price < PREV (B.Price),
           C AS C.Price > PREV (C.Price)
  ) AS M
/* From SQL/RPR Technical Report, Fred Zemke, February 2013 */
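To see what PATTERN (A B+ C+) selects, it may help to emulate it procedurally. The sketch below handles a single partition whose prices are already in Tradeday order and returns index triples for the start, bottom, and end of each V-shaped match. It is an illustration of the pattern's meaning under those assumptions, not an implementation of MATCH_RECOGNIZE; the function name `find_v_shapes` is hypothetical.

```python
def find_v_shapes(prices):
    """Return (start, bottom, end) index triples for PATTERN (A B+ C+).

    A matches any row, B+ is a run of rows each priced below the previous
    row, and C+ is a run of rows each priced above the previous row.
    """
    matches = []
    i, n = 0, len(prices)
    while i < n:
        j = i + 1  # row i plays the role of A
        while j < n and prices[j] < prices[j - 1]:
            j += 1  # extend B+ while the price keeps falling
        bottom = j - 1
        k = j
        while k < n and prices[k] > prices[k - 1]:
            k += 1  # extend C+ while the price keeps rising
        end = k - 1
        if bottom > i and end > bottom:
            matches.append((i, bottom, end))
            i = end + 1  # AFTER MATCH SKIP PAST LAST ROW
        else:
            i += 1  # no match starting here; try the next row
    return matches
```

Since a row cannot satisfy both the B and the C condition, taking the longest falling run and then the longest rising run needs no backtracking in this particular pattern.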
Other Possible Additions
• JSON – Just Some Other Notation
• User-Defined Data Aggregation Functions
    Select Col1, MyFunction(col2), SUM(col3)
    From Table1
    Group by Col1;
• Dynamic/Polymorphic Table Functions
  • Parameter is one arbitrary table
  • Function result can be another arbitrary table
    Insert into T1
    Select * from MyInferenceEngine(ExternalDataStream.T2);
  • Result table “shape” not known until run-time
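The user-defined aggregate query on this slide can be tried today in SQLite, which lets the host program register an aggregate. The sketch below uses Python's standard `sqlite3` module and its `create_aggregate` method; this is SQLite's existing mechanism, not the SQL-standard feature under discussion. The aggregate itself (`Spread`, returning max minus min) and the sample rows are invented for illustration.

```python
import sqlite3


class Spread:
    """Hypothetical aggregate standing in for the slide's MyFunction."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        """Called once per row in the group; NULLs are ignored."""
        if value is None:
            return
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        """Called once per group to produce the aggregate's result."""
        return None if self.low is None else self.high - self.low


def run_example():
    """Run the slide's query shape against a small in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("MyFunction", 1, Spread)
    conn.execute("CREATE TABLE Table1 (Col1 TEXT, Col2 REAL, Col3 INTEGER)")
    conn.executemany(
        "INSERT INTO Table1 VALUES (?, ?, ?)",
        [("a", 2.0, 1), ("a", 8.0, 2), ("b", 5.0, 3)],
    )
    rows = conn.execute(
        "SELECT Col1, MyFunction(Col2), SUM(Col3) "
        "FROM Table1 GROUP BY Col1 ORDER BY Col1"
    ).fetchall()
    conn.close()
    return rows
```

The user-defined aggregate sits next to the built-in SUM in the same GROUP BY query, which is the usage the slide's example shows.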
WG4 – SQL/MM
• Support for extended datatypes within SQL databases
• ISO/IEC 13249-2 Full Text – content-based retrieval.
• ISO/IEC 13249-5 Still Image – basic functions for image data management.
• ISO/IEC 13249-3 Spatial – functions to support geo-spatial applications
• ISO/IEC 13249-6 Data Mining – support for statistical data mining
Big Data Gaps
• Data source registry
  • Location, contents, and semantics
  • Ability to discover and utilize data
• Common interface to disparate data sources
• Better support for
  • Queries against images, video, & sound
  • Streaming data
  • Security & privacy
  • Integration of analytical functions
Summary
• Big Data has arrived
  • Significant hype but practical applications emerging
  • Hype tends to focus on data store capabilities
  • It’s just another data store
• SQL Standards development is ongoing
  • Next version in 2015/2016
    • Temporal data
    • Row Pattern Recognition
  • Additional temporal support?
  • Multi-dimensional Data Type?
  • Additional support for big data?
SC 32 Opportunities
• WG 2 – Metadata for Big Data
  • Many kinds of metadata (structural, semantic, catalog, integration, correlation)
  • Automatic discovery of & reasoning over metadata
• WG 3 – Big Data Querying & Manipulation
  • Tables, trees, etc. necessary, but wholly inadequate
  • New query paradigms, analysis, transaction models
• WG 4 – Specialized Big Data Types
  • Support for other forms of data (e.g., seismic)
  • “Helper” functionality
Acknowledgements
• All errors, misunderstandings, misleading statements, and idiotic comments are mine and mine alone.
• Keith Hare (JTC 1/SC 32/WG 3 Convenor)
• JTC 1/SC 32 Study Group on Next-Generation Analytics and Big Data
• Jörn Bartels, 정성재 (Jung, Sung Jae), Keith Gordon, Krishna Kulkarni, numerous others
Questions?
Big Data Analysis Challenges
A number of challenges in both data management and data analysis require new approaches to support the big data era. These challenges span generation of the data, preparation for analysis, and policy-related challenges in its sharing and use, including the following:
• Dealing with highly distributed data sources,
• Tracking data provenance, from data generation through data preparation,
• Validating data,
• Coping with sampling biases and heterogeneity,
• Working with different data formats and structures,
• Developing algorithms that exploit parallel and distributed architectures,
• Ensuring data integrity,
• Ensuring data security,
• Enabling data discovery and integration,
• Enabling data sharing,
• Developing methods for visualizing massive data,
• Developing scalable and incremental algorithms, and
• Coping with the need for real-time analysis and decision-making.
National Research Council. 2013. Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press.
References
• “Big Data”, Viktor Mayer-Schönberger & Kenneth Cukier, Houghton Mifflin Harcourt, New York, NY, 2013.
• “Big Data Now: 2012 Edition”, http://oreilly.com/data/radarreports/big-data-now-2012.csp
• NIST Big Data Working Group, http://bigdatawg.nist.gov/home.php
• “Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison,” Kristóf Kovács, viewed 2013-03-16, http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis?imm_mid=0a2ec6&cmp=em-velocity-newsletters-vlny-cfp-20130307-direct
• National Research Council. 2013. Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press. http://www.nap.edu/catalog.php?record_id=18374
• “What Does 'Big Data' Mean?”, Michael Stonebraker, Communications of the ACM Blogs
  • Part 1: http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
  • Part 2: http://cacm.acm.org/blogs/blog-cacm/156102-what-does-big-data-mean-part-2/fulltext
  • Part 3: http://cacm.acm.org/blogs/blog-cacm/157589-what-does-big-data-mean-part-3/fulltext
  • Part 4: http://cacm.acm.org/blogs/blog-cacm/162095-what-does-big-data-mean-part-4/fulltext