Big Data Without Big Change
SemTech West 2012
Michael Lang
Revelytix
Discussion Points
Review the RDBMS, ETL, and data warehouse data management paradigms
Compare those paradigms to data virtualization and Big Data
Propose “Bigger Data” in support of radically better analytic capability
In 1970, E.F. Codd of the IBM Research Laboratory in San Jose, California, published a paper in Communications of the ACM:
“A Relational Model of Data for Large Shared Data Banks”
Codd wrote, “The problems treated here are those of data independence – the independence of application programs from the growth in data types and changes in data representation...”
This paper set in motion the architecture for data management systems for the next forty years. These systems are known as
relational database management systems (RDBMS)
The Last Forty Years
Siloed Information Management Systems
– All data in a single shared databank
– Rigid schemas
– Data and metadata are different types of things
– Query processor only knows about its local data expressed in a fixed schema
– Excellent ACID / CRUD capability
The Age of Virtualization
DIMS: Distributed Information Management System
Virtualization
Hardware and operating system virtualization became available in 2004 and brought great value to IT infrastructure
– Cloud-based deployment
– Extreme flexibility
– Efficient use of hardware resources
– Independence from operating systems
Leading to an enormous ROI for large enterprises
EDM
Hardware virtualization did not help with the problems associated with Enterprise Data Management
– Data remains distributed over many silos, even in cloud-based environments
– Meaning of data in independent silos is still obscure
– Schemas are still disparate
Data Virtualization
The advent of RDF, OWL, and SPARQL has created the technical foundation for building a completely virtualized data infrastructure
– All information can be managed in the same data model
– Any domain can be described at the schema level
– SPARQL provides a distributed query and transformation language
– R2RML provides mappings from native schema to RDF schema
– Standards-based data virtualization is here to stay
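As a rough illustration of the mapping idea in the bullets above, the sketch below turns rows of a relational table into RDF-style triples. It does not use R2RML syntax or any real R2RML processor; the `customer` table, the `http://example.org/` namespace, and the `rows_to_triples` helper are hypothetical and chosen only for illustration.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the talk


def rows_to_triples(conn, table, key_column):
    """Yield (subject, predicate, object) triples, one per non-key column value."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # The primary key value becomes part of the subject URI.
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            # Each remaining column becomes a predicate URI.
            yield (subject, f"{BASE}{table}#{column}", value)


# Hypothetical source table standing in for a native relational schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Ada', 'London')")

triples = list(rows_to_triples(conn, "customer", "id"))
for triple in triples:
    print(triple)
```

A real R2RML mapping is itself written in RDF and can express far more (joins, templates, datatypes); the sketch only shows the row-to-triple shape of the transformation.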
Data Virtualization
This paradigm assumes data is completely distributed, and that anyone/anything should be able to find it and use it
– RDF is the data model
– OWL is the schema model
– SPARQL is the query language
– URIs provide unique identifiers
– URLs provide the location
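To make the roles in the list above concrete, here is a minimal sketch of data as triples identified by URIs, queried with a single triple pattern in the spirit of a SPARQL basic graph pattern. It is plain Python, not a SPARQL engine; the `http://example.org/` URIs and the `match` helper are assumptions made for illustration.

```python
EX = "http://example.org/"  # hypothetical namespace

# RDF as the data model: every statement is a (subject, predicate, object) triple.
triples = {
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
    (EX + "acme", EX + "locatedIn", EX + "boston"),
}


def match(pattern, data):
    """Return variable bindings for one triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    results = []
    for triple in data:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Roughly: SELECT ?who WHERE { ?who ex:worksFor ex:acme }
employees = sorted(
    b["?who"] for b in match(("?who", EX + "worksFor", EX + "acme"), triples)
)
print(employees)
```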
Data Abstraction
An RDBMS is an abstraction layer above an OS-based file system
– Makes it vastly simpler to work with local data
Data Virtualization is an abstraction layer above multiple RDBMS and/or other sources of data
– Makes it vastly simpler to work with distributed data
– Distributed Information Management System
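The abstraction-layer idea can be sketched as a small federation function that answers one question over two independent databases. Two in-memory SQLite databases stand in for separate silos; the table names, the `make_source` and `total_by_customer` helpers, and the data are all hypothetical.

```python
import sqlite3


def make_source(ddl, insert, rows):
    """Create an independent in-memory database standing in for one silo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert, rows)
    return conn


crm = make_source(
    "CREATE TABLE customer (id INTEGER, name TEXT)",
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
billing = make_source(
    "CREATE TABLE invoice (customer_id INTEGER, amount REAL)",
    "INSERT INTO invoice VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 75.5)],
)


def total_by_customer(crm_conn, billing_conn):
    """Join and aggregate across both sources; callers never see the silos."""
    names = dict(crm_conn.execute("SELECT id, name FROM customer"))
    totals = {}
    for customer_id, amount in billing_conn.execute(
        "SELECT customer_id, amount FROM invoice"
    ):
        name = names.get(customer_id)
        if name is not None:
            totals[name] = totals.get(name, 0.0) + amount
    return totals


result = total_by_customer(crm, billing)
print(result)
```

The join happens in the virtualization layer rather than in either database, which is also where the performance caveat for virtualized data comes from.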
Caveats
Data virtualization technologies are not as performant as locally managed data
Data virtualization depends on sophisticated transformation of complex and unstructured data
Bigger Data: Hadoop and Virtual Data
DIMS: Distributed Information Management System
NoSQL / Big Data
Another seminal paper (copyright 2003 ACM):
“The Google File System”, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
• These data processing systems are highly distributed but, …
• Each NoSQL database is a “large shared databank”
• Data cannot be combined for analytics across NoSQL databases
• NoSQL is an evolutionary step in data storage; it is not a paradigm shift in information management
Big Data
Hadoop is an excellent technology to use for transforming data of varying structures to formats useful for analytics
Hadoop also excels at handling very large amounts of disparate data
Virtual data needs a place to be materialized
Data Virtualization technologies provide a common structure and access methodology for disparate sets of data
[Diagram labels: RDB; Mappings (R2RML); RDB Schema (Source Ontology); SPARQL (data input); Domain Ontology; Rules (RIF); Inferred Data; SPARQL (data output); Data Validation & Analysis]
Data Virtualization
Hadoop
The RDF-based technology implementing a virtual data infrastructure is useful for Hadoop data transformations using MapReduce
– All of the disparate data sets in a Hadoop cluster can be organized with a common set of semantics provided by an R2RML map and a Domain Ontology
– Data transformations are made using a series of MapReduce jobs
– ETL becomes ELT
ELT
Extract, Load, and Transform is a fundamentally new paradigm facilitating enterprise analytics
– Data can be loaded in its native formats and structures
– Transformation activities take place after the data is loaded into a Hadoop cluster
– Hadoop and MapReduce are excellent technologies for data transformations at scale
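The ELT flow above can be sketched in plain Python: data is kept in its native formats, and the transformation runs afterwards as map, shuffle, and reduce steps. This simulates the MapReduce pattern in a single process and does not use the Hadoop API; the file names, record layouts, and function names are hypothetical.

```python
import json
from collections import defaultdict

# "Loaded" data, kept in native form: one CSV-style source, one JSON-lines source.
raw_store = {
    "sales.csv": ["east,100", "west,250", "east,50"],
    "sales.jsonl": [
        '{"region": "west", "amount": 25}',
        '{"region": "north", "amount": 10}',
    ],
}


def map_record(source, line):
    """Map step: parse one raw line into a (region, amount) pair."""
    if source.endswith(".csv"):
        region, amount = line.split(",")
        yield region, int(amount)
    elif source.endswith(".jsonl"):
        record = json.loads(line)
        yield record["region"], int(record["amount"])


def shuffle(pairs):
    """Shuffle step: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: aggregate the values for one key."""
    return key, sum(values)


mapped = (
    pair
    for source, lines in raw_store.items()
    for line in lines
    for pair in map_record(source, line)
)
totals = dict(reduce_group(key, values) for key, values in shuffle(mapped).items())
print(totals)
```

The format-specific parsing lives entirely in the map step, so adding another native format means adding a branch there rather than changing how data is loaded.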
Need to transform structure
– Relational -> RDF
– HDFS/HBase -> Tuples
– Merge data from multiple sets (federate)
– Basic query processing: join, aggregation, etc
– Execute arbitrary user-defined analytical functions (UDFs)
Revelytix query engines already do these
– Spinner: federation, query processing, Hadoop-to-tuples
– Spyder: relational-to-RDF, query processing
Query Engine = Transformation Engine
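A minimal sketch of two of the transformations listed above: flattening HBase-style rows into tuples, then applying a user-defined function to each tuple. It does not use the HBase client API or reflect how Spinner or Spyder work internally; the row layout, column names, and helper functions are assumptions for illustration.

```python
# Hypothetical HBase-style rows: row key -> {"family:qualifier": value}.
hbase_like_rows = {
    "user1": {"info:name": "Ada", "stats:logins": "12"},
    "user2": {"info:name": "Grace", "stats:logins": "7"},
}


def to_tuples(rows, columns):
    """Flatten each row into a fixed-width tuple: (row_key, col1, col2, ...)."""
    for row_key, cells in sorted(rows.items()):
        yield (row_key,) + tuple(cells.get(column) for column in columns)


def apply_udf(tuples, index, udf):
    """Append the result of a user-defined function to every tuple."""
    for row in tuples:
        yield row + (udf(row[index]),)


def is_frequent(logins):
    """Example UDF: flag users with at least 10 logins."""
    return int(logins) >= 10


tuples = list(to_tuples(hbase_like_rows, ["info:name", "stats:logins"]))
flagged = list(apply_udf(tuples, 2, is_frequent))
print(flagged)
```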
[Diagram: Hadoop/Cloud Infrastructure. Labels: Source Data (HDFS Files, HBase); Extract; Transform; Load, Index; Triples; Relational Database]
The big win is to leave the data in situ and define networked pipelines of transformations to move data through various processing stages.
Transforming Data in Hadoop
Dataflow Pipeline
[Diagram labels: Definition; Design; Execution; Query; local and cloud ‘endpoints’; node labels S1a, S1b, S2–S8; D1–D3; F1; X1, X5, X6a, X6b, X8; T]
Configure execution environments for parts of pipeline
Processing Pipeline
Data Flow
Mix of materialized and virtual data sets… inter-linked by a set of transformations
Distributed Pipelined Processing
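The mix of materialized and virtual data sets can be sketched with two small wrappers: a virtual data set is recomputed on every read, while a materialized one is computed once and stored. This is a single-process illustration only; `virtual`, `materialized`, and the sample transformations are hypothetical names, not part of any product described here.

```python
def virtual(transform, upstream):
    """Virtual data set: recomputed from upstream each time it is read."""
    return lambda: transform(upstream())


def materialized(transform, upstream):
    """Materialized data set: computed on first read, then served from storage."""
    cache = []

    def read():
        if not cache:
            cache.append(list(transform(upstream())))
        return iter(cache[0])

    return read


calls = {"clean": 0, "enrich": 0}  # counts how often each transformation runs


def source():
    return iter([" ada ", "GRACE", " alan"])


def clean(records):
    calls["clean"] += 1
    return (record.strip().lower() for record in records)


def enrich(records):
    calls["enrich"] += 1
    return ((record, len(record)) for record in records)


# Pipeline: source -> clean (materialized) -> enrich (virtual)
cleaned = materialized(clean, source)
enriched = virtual(enrich, cleaned)

first = list(enriched())
second = list(enriched())
print(first, calls)
```

Reading the pipeline twice runs `enrich` twice but `clean` only once, which is the trade-off a pipeline designer makes when choosing which stages to materialize.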
Query Processing in Hadoop
Hadoop and SPARQL
Once the data sets have been transformed to a common set of semantics, SPARQL queries can be executed as a set of distributed MapReduce jobs
We must know the relationships between data sets
The descriptions of the relations need to be available at query time
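One way to picture a SPARQL query running as a MapReduce job is a reduce-side join over two triple patterns. The sketch below simulates that pattern in plain Python for a query of the form `?person worksFor ?org . ?org locatedIn ?city`; it is not how any particular engine plans queries, and the data and function names are hypothetical. The shared variable `?org` is the relationship between the data sets that must be known at query time.

```python
from collections import defaultdict
from itertools import product

triples = [
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "initech"),
    ("acme", "locatedIn", "boston"),
    ("initech", "locatedIn", "austin"),
    ("carol", "worksFor", "globex"),  # no location known, so no result row
]


def map_triple(triple):
    """Map step: emit the join key (?org) with a tagged value for each pattern."""
    subject, predicate, obj = triple
    if predicate == "worksFor":
        yield obj, ("person", subject)
    elif predicate == "locatedIn":
        yield subject, ("city", obj)


def reduce_org(org, tagged_values):
    """Reduce step: combine every person with every city for one organization."""
    people = [value for tag, value in tagged_values if tag == "person"]
    cities = [value for tag, value in tagged_values if tag == "city"]
    for person, city in product(people, cities):
        yield {"person": person, "org": org, "city": city}


# Shuffle step: group mapped values by join key.
groups = defaultdict(list)
for triple in triples:
    for key, value in map_triple(triple):
        groups[key].append(value)

solutions = sorted(
    (row for org, tagged in groups.items() for row in reduce_org(org, tagged)),
    key=lambda row: row["person"],
)
print(solutions)
```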
[Diagram: a Query Client and Query Processor; within the Hadoop/Cloud Infrastructure, multiple Query Processor instances run over the data (HDFS Files, HBase)]
The query processor is shipped to all Hadoop nodes for parallel processing, using the Hadoop MapReduce framework.
Query Execution in the Cloud
Query Processing
[Diagram: two Hadoop/Cloud Infrastructure deployments, each with a Hadoop Adapter over the data (HDFS Files, HBase); labels: Spinner, Spyder]
• Query processing can be done locally, remotely (in the cloud), or as a mix of both
• Many types of transformations can be done
– Basic query processing (SPARQL or SQL)
– Relational to graph (R2RML) transformations
– Federation over multiple sources or data sets
– Hadoop HDFS-to-Tuple and HBase-to-Tuple transformations
• We can plan and optimize across all these for maximum performance
Hadoop and RIF
Once the data sets have been transformed to a common set of semantics, RIF rules can be executed as a set of distributed MapReduce jobs
– Inference
– Classification
– Validation
– Compliance
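As an illustration of rule-based inference, the sketch below applies one rule repeatedly until no new facts appear (forward chaining). The rule is written as a Python function rather than in RIF syntax, and the facts, the `subclass_rule`, and the `forward_chain` helper are hypothetical; a distributed version would run each pass as a MapReduce job.

```python
facts = {
    ("rex", "type", "Dog"),
    ("Dog", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
}


def subclass_rule(known):
    """If ?x type ?c and ?c subClassOf ?d, then infer ?x type ?d."""
    for x, predicate, c in known:
        if predicate != "type":
            continue
        for c2, predicate2, d in known:
            if predicate2 == "subClassOf" and c2 == c:
                yield (x, "type", d)


def forward_chain(initial, rules):
    """Apply all rules until a pass produces no new facts."""
    known = set(initial)
    while True:
        new = {fact for rule in rules for fact in rule(known)} - known
        if not new:
            return known
        known |= new


closure = forward_chain(facts, [subclass_rule])
inferred = closure - facts
print(sorted(inferred))
```

Classification falls out of the same mechanism: after chaining, `rex` is classified as both a `Mammal` and an `Animal` even though only `Dog` was asserted.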
Enable access to large volumes of data
Warehouse-style access
Enable a ‘processing pipeline’ in the cloud
Push processing into Map-Reduce infrastructure
Parallelize query execution
– Extreme scalability
Architectural flexibility
Why Use Hadoop?
Future Directions
Hadoop and Solr
Integration between Hadoop, Data Virtualization, and Solr provides massively scalable faceted search
– The common set of semantics, applied over disparate unstructured data sets, provides a powerful paradigm for searching with facets over massive amounts of data
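Faceted search over records that share a common vocabulary can be sketched as counting values per field and then filtering by the selected facet values. This is plain Python and does not use the Solr API; the documents, field names, and helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical documents already described with a common set of fields.
documents = [
    {"id": 1, "type": "report", "region": "east"},
    {"id": 2, "type": "email", "region": "east"},
    {"id": 3, "type": "report", "region": "west"},
]


def facet_counts(docs, fields):
    """Count how many documents carry each value of each facet field."""
    return {
        field: Counter(doc[field] for doc in docs if field in doc)
        for field in fields
    }


def filter_docs(docs, **selected):
    """Keep only documents matching every selected facet value."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in selected.items())
    ]


facets = facet_counts(documents, ["type", "region"])
east_reports = filter_docs(documents, type="report", region="east")
print(facets, east_reports)
```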
What Are We Offering?
Seamless integration of virtual data and Hadoop
Linkage (relationships) between data sets, yielding…
– Provenance/traceability/lineage
– Metadata management and data visibility/understanding
– Powerful analytics infrastructure
Common data model, enabling…
– Mixing of relational and graph-based data
– Mixing of SQL and SPARQL queries
– Access to all cloud-based data
Optimization across heterogeneous data systems
The Shift is On
Distributed Information Management System
DIMS is available now
Questions
See Revelytix.com for more information
Thank You