Defining Architecture Components of
the Big Data Ecosystem
Yuri Demchenko
SNE Group, University of Amsterdam
2nd BDDAC2014 Symposium, CTS2014 Conference
19-23 May 2014, Minneapolis, USA
Outline
• Big Data and Data Intensive Science as a new technology wave
– The Fourth Paradigm
• Big Data definition: From 6 Vs to 5 parts
• Big Data technology drivers
– Where do the data come from? What are Big Data drivers?
• Big Data: Paradigm change and new challenges
– From Big Data to All-Data – Moving to data centric service models
• Defining Big Data Architecture Framework (BDAF)
– Big Data Infrastructure (BDI) and Big Data Analytics infrastructure/tools
• Summary and Discussion
BDDAC2014 @CTS2014 Big Data Architecture Framework Slide_2
Big Data and Security Research at System and
Network Engineering, University of Amsterdam
• Long-term research and development on infrastructure services and facilities
– High speed optical networking and data intensive applications
– Semantic description of infrastructure and network services
– Collaborative systems, Grid, Clouds and currently Big Data
• Focus on Infrastructure definition and services
– Software Defined Infrastructure based on Cloud/Intercloud technologies
– Dynamically provisioned security infrastructure and services
• NIST Big Data Working Group
– Contribution to Reference Architecture, Big Data Definition and Taxonomy, Big Data
Security
• Research Data Alliance (RDA)
– Interest Group on Education and Skills Development on Data Intensive Science
– Big Data Analytics Interest Group
• Big Data Interest Group at UvA
– Informal but active, meets biweekly/monthly
– Provided input to NIST BD-WG and RDA activities
Technology Definitions and Timeline - Overview
• Service Oriented Architecture (SOA): First proposed in 1996 and revived with
the advent of Web Services in 2001-2002
– Now an industry standard and widely used
– Provided a conceptual basis for Web Services development
• Computer Grids: Initially proposed in 1998 and finally shaped in 2003 with the
Open Grid Services Architecture (OGSA) by the Open Grid Forum (OGF)
– Currently remain in use as a collaborative environment
– Migrating to cloud and inter-cloud platforms
• Cloud Computing: Initially proposed in 2008, now in a productive phase
– Defined new features, capabilities, and operational/usage models, and provided
guidance for new technology development
– Originated from the Service Computing domain and is focused on service management
• Big Data and Data Intensive Science: Yet to be defined
– Involves more components and processes that must be included in the definition
– Can be better defined as an Ecosystem where data are the main driving component
– Need to define the Big Data properties and expected technology capabilities, and provide
guidance/vision for future technology development
Visionaries and Drivers:
Seminal works, High level reports, Activities
The Fourth Paradigm: Data-Intensive Scientific Discovery.
By Jim Gray, Microsoft, 2009. Edited by Tony Hey, et al.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Riding the wave: How Europe can gain from the rising tide of scientific data.
Final report of the High Level Expert Group on Scientific Data. October 2010.
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
AAA Study: Study on AAA Platforms For Scientific data/information Resources in Europe, TERENA, UvA, LIBER, UnivDeb (2011-2012)
Research Data Alliance (RDA)
https://www.rd-alliance.org/
NIST Big Data Working Group (NBD-WG)
http://bigdatawg.nist.gov/home.php
ISO/IEC JTC1 Big Data Study Group (SGBD)
http://jtc1bigdatasg.nist.gov/home.php
The Fourth Paradigm of Scientific Research
1. Theory and logical reasoning
2. Observation or Experiment
– E.g. Newton observed falling apples to develop his theory of mechanics
– But Galileo Galilei made experiments with falling objects from the
leaning tower of Pisa
3. Simulation of theory or model
– Digital simulation can prove theory or model
4. Data-driven Scientific Discovery (aka Data Science)
– More data beats a hypothesized theory
Gartner Technology Hype Cycle (October 2013)
Source http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
[Figure: Gartner Hype Cycle curve with the positions of Big Data and Cloud Computing marked]
Our/SNE Big Data Technology Research Cycle
Source http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
[Figure: the same Hype Cycle curve, with Big Data and Cloud Computing marked, annotated with SNE research stages]
Time labels on the curve: 2011, 2012, Mid-End 2013, Mid 2014, End 2014
Stage annotations on the curve:
• Remote BD technology following
• EU Study AAA for Research Data
• Main research in Cloud/Intercloud
• Active research into Big Data domain definition
• Building community links
• Component technologies mastering
• Education courses development
• Active and productive research
• Teaching on Big Data Tech/Infra
• New style of technology development
• Technology consolidation
Big Data Definitions Overview
• IDC definition of Big Data (conservative and strict approach): “A new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis”
• Gartner definition: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” http://www.gartner.com/it-glossary/big-data/
– Termed a 3-part definition, not a 3V definition
• Big Data: a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.
– From “The Big Data Long Tail” blog post by Jason Bloomberg (Jan 17, 2013). http://www.devx.com/blog/the-big-data-long-tail.html
• “Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
– Edd Dumbill, program chair for the O’Reilly Strata Conference
Improved: 6 (5+1) V’s of Big Data
[Figure: 6 Vs of Big Data]
Generic Big Data Properties
• Volume: Terabytes; Records/Arch; Tables, Files; Distributed
• Variety: Structured; Unstructured; Multi-factor; Probabilistic; Linked; Dynamic
• Velocity: Batch; Real/near-time; Processes; Streams
Acquired Properties (after entering system)
• Value: Correlations; Statistical; Events; Hypothetical
• Veracity: Trustworthiness; Authenticity; Origin, Reputation; Availability; Accountability
• Variability: Changing data; Changing model; Linkage
Figure callouts:
• Commonly accepted 3V’s of Big Data
• Adopted in general by NIST BD-WG
Big Data Definition: From 6V to 5 Parts (1)
(1) Big Data Properties: 5V
– Volume, Variety, Velocity, Value, Veracity
– Additionally: Data Dynamicity (Variability)
(2) New Data Models
– Data Lifecycle and Variability/Evolution
– Data linking, provenance and referral integrity
(3) New Analytics
– Real-time/streaming analytics, interactive and machine learning analytics
(4) New Infrastructure and Tools
– High performance Computing, Storage, Network
– Heterogeneous multi-provider services integration
– New Data Centric (multi-stakeholder) service models
– New Data Centric security models for trusted infrastructure and data processing and storage
(5) Source and Target
– High velocity/speed data capture from a variety of sensors and data sources
– Data delivery to different visualisation and actionable systems and consumers
– Fully digitised input and output, (ubiquitous) sensor networks, full digital control
Big Data Definition: From 6V to 5 Parts (2)
Refining the Gartner definition
“Big data is (1) high-volume, high-velocity and high-variety information assets that demand (3) cost-effective, innovative forms of information processing for (5) enhanced insight and decision making”
• Big Data (Data Intensive) Technologies aim to process (1) high-volume, high-velocity, high-variety data (sets/assets) to extract the intended data value and ensure high veracity of the original data and the obtained information. This demands cost-effective, innovative forms of data and information processing (analytics) for enhanced insight, decision making, and process control. All of these demand (should be supported by) new data models (supporting all data states and stages during the whole data lifecycle) and new infrastructure services and tools that also allow obtaining (and processing) data from a variety of sources (including sensor networks) and delivering data in a variety of forms to different data and information consumers and devices.
(1) Big Data Properties: 5V
(2) New Data Models
(3) New Analytics
(4) New Infrastructure and Tools
(5) Source and Target
Big Data Nature: Origin and Target (consumers)
Big Data Origin
• Science
• Internet, Web
• Industry
• Business
• Living Environment, Cities
• Social media and networks
• Healthcare
• Telecom/Infrastructure
Big Data Target Use
• Scientific discovery
• New technologies
• Manufacturing, processes, transport
• Personal services, campaigns
• Living environment support
• Healthcare support
• Social Networking
[Figure: Data Transformation between origin and target use, labelled Volume, Velocity, Variety & Value, Veracity, Variability]
Big Data technology drivers (1)
• Modern e-Science in search for new knowledge
– Scientific experiments and tools are becoming bigger and heavily based on data processing and mining
– The long tail of science
• Traditional data intensive industry
– Genomic research, drugs development, Healthcare
– High-tech industry, CAD/CAM, weather/climate, etc.
• Customer facing industry and companies
– Advertisement, retail business, service delivery
• Intelligence and security
• Network/infrastructure management
– Network monitoring, Intrusion detection, troubleshooting
The Long Tail of Science (aka “Dark Data”)
• Collectively, “Long Tail” science generates a lot of data
– Estimated at over 1 PB per year, and growing fast as new technology proliferates
• 80-20 rule: 20% of users generate 80% of the data, but not necessarily 80% of the knowledge
Source: Dennis Gannon (Microsoft)
NIST Big Data Workshop, 2012
Big Data technology drivers – Technology Loop
• Technology loop (known as Jevons Paradox)
– Increased efficiency to process current demand will create
new uses and increase demand even more
[Figure: Elastic Demand for Work] A doubling of fuel efficiency more than doubles work demanded, increasing the amount of fuel used. Jevons paradox occurs.
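The technology loop can be illustrated with a small numeric sketch. It assumes a constant-elasticity demand curve for work, which is a modelling assumption and not something stated on the slide; the function names, the `scale` constant, and the elasticity values are hypothetical and chosen only for illustration. Under this assumption, fuel use rises with efficiency whenever the elasticity is above 1, and falls when it is below 1.

```python
def work_demanded(efficiency: float, elasticity: float,
                  fuel_price: float = 1.0, scale: float = 100.0) -> float:
    """Work demanded under an ASSUMED constant-elasticity demand curve.

    The cost of one unit of work falls as efficiency rises, and demand
    responds to that cost with the given price elasticity.
    """
    cost_per_unit_work = fuel_price / efficiency
    return scale * cost_per_unit_work ** (-elasticity)


def fuel_used(efficiency: float, elasticity: float,
              fuel_price: float = 1.0, scale: float = 100.0) -> float:
    """Fuel needed to deliver the demanded work at the given efficiency."""
    return work_demanded(efficiency, elasticity, fuel_price, scale) / efficiency


if __name__ == "__main__":
    # Compare inelastic (0.5) and elastic (1.5) demand when efficiency doubles.
    for elasticity in (0.5, 1.5):
        before = fuel_used(1.0, elasticity)
        after = fuel_used(2.0, elasticity)
        trend = "rises" if after > before else "falls"
        print(f"elasticity={elasticity}: fuel use {trend} "
              f"({before:.1f} -> {after:.1f})")
```

The elastic case corresponds to the situation in the figure: work demanded more than doubles, so the efficiency gain is outweighed by the extra demand.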
Big Data technology drivers (2) – Managing public
campaigns, e.g. election, public relations
• The rise of public opinion data stored in platforms like Twitter, Google, Facebook, etc. provides enough intelligence to influence the campaign development, timing, geography and even the colour of the campaign signs
– Twitter was a major source of data aggregation for the Republican Race in the US
– Multimillion-dollar contract for data management and collection services awarded May 1, 2013 to Liberty Work to build an advanced list of voters
• Article “In Data we trust” by T.Edsall in The New York Times
– Book: In Data We Trust: How Customer Data is Revolutionising Our Economy (Aug 2012)
• A strategy for tomorrow's data world
NIST Big Data Working Group (NBD-WG) and
ISO/IEC JTC1 Study Group on Big Data (SGBD)
• Started June 2013 - http://bigdatawg.nist.gov/home.php
– Weekly calls, open participation, mailing list
• Targeted formal delivery Autumn 2014 of a set of NIST documents
http://bigdatawg.nist.gov/V1_output_docs.php
Volume 1: NIST Big Data Definitions
Volume 2: NIST Big Data Taxonomies
Volume 3: NIST Big Data Use Case & Requirements (co-chair Geoffrey Fox)
Volume 4: NIST Big Data Security and Privacy Requirements
Volume 5: NIST Big Data Architectures White Paper Survey
Volume 6: NIST Big Data Reference Architecture
Volume 7: NIST Big Data Technology Roadmap
• ISO/IEC Study Group on Big Data (SGBD)
http://jtc1bigdatasg.nist.gov/home.php
– Term: December 2013 – September 2014
– Extends NIST BDWG activity and scope
– 2nd meeting hosted in Amsterdam 13-16 May 2014
http://jtc1bigdatasg.nist.gov/programEU.php
2nd ISO/IEC SGBD meeting 13-16 May 2014
Discussions and results
• Two-day workshop 13-14 May 2014
– EU and NL focus, UvA activities
• Refining Big Data technology definition and Big Data Architecture definition
• New items proposed
– Big Data market aspects
– Data ownership (including during data lifecycle/staging and aggregation)
– Opacity (obfuscation) of data linkage during processing
– Data linkage
From Big Data to All-Data – Paradigm Change
• Paradigm-breaking factors
– Data storage and processing
– Security
– Identification and provenance
• Traditional model
– BIG Storage and BIG Computer with a FAT pipe
– Move compute to data vs move data to compute
• New Paradigm
– Continuous data production
– Continuous data processing
– DataBus as a Data container and Protocol
[Figure, traditional model: Big Data, Big Computer and Network, with the question "Move or not to move?"]
[Figure, new paradigm: a Data Bus connecting Distributed Big Data Storage (Data Abstraction), Distributed Compute and Analytics (Infrastructure Abstraction), and Visualisation, Presentation, Action]
DataBus: (1) Data Container; (2) Metadata, State; (3) Data Transfer Protocol
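The three DataBus roles listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not define the DataBus interface, so the `DataContainer` and `DataBus` classes, the state names "raw" and "processed", and the use of an in-memory publish/subscribe mechanism as a stand-in for the data transfer protocol are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class DataContainer:
    """(1) Data container carrying (2) its own metadata and state."""
    payload: Any
    metadata: Dict[str, str] = field(default_factory=dict)
    state: str = "raw"  # hypothetical state name


Handler = Callable[[DataContainer], None]


class DataBus:
    """(3) Stand-in for a data transfer protocol: routes containers by state."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, state: str, handler: Handler) -> None:
        self._handlers.setdefault(state, []).append(handler)

    def publish(self, container: DataContainer) -> None:
        for handler in self._handlers.get(container.state, []):
            handler(container)


def run_demo() -> List[DataContainer]:
    """Continuous production and processing: each raw item is processed on arrival."""
    bus = DataBus()
    delivered: List[DataContainer] = []

    def process(container: DataContainer) -> None:
        # Derive a new container and record which item it came from.
        result = DataContainer(
            payload=sum(container.payload),
            metadata={**container.metadata,
                      "derived_from": container.metadata["id"]},
            state="processed",
        )
        bus.publish(result)

    bus.subscribe("raw", process)                 # compute and analytics
    bus.subscribe("processed", delivered.append)  # presentation / action

    bus.publish(DataContainer(payload=[1, 2, 3], metadata={"id": "sensor-1/0001"}))
    bus.publish(DataContainer(payload=[4, 5], metadata={"id": "sensor-1/0002"}))
    return delivered
```

Because routing depends only on the state carried by the container, producers and consumers do not need to know each other's hosts, which is the point of the data-centric view.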
Moving to Data-Centric Models and Technologies
• Current IT and communication technologies are host based or host centric
– Any communication or processing is bound to the host/computer that runs the software
– Especially in security: all security models are host/client based
• Big Data requires new data-centric models
– Data location, search, access
– Data integrity and identification
– Data lifecycle and variability
– Data centric (declarative) programming models
– Data aware infrastructure to support new data formats and data centric programming models
• Data centric security and access control
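One way to picture data-centric security and identification is to let the policy and the identifier travel with the data object instead of living on a host. The sketch below is an assumption-laden illustration, not a model proposed in the slides: `AccessPolicy`, `DataObject`, `make_data_object`, and `read` are hypothetical names, and using a SHA-256 digest of the payload as the identifier is just one possible choice.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AccessPolicy:
    """Policy attached to the data itself (hypothetical structure)."""
    allowed_roles: FrozenSet[str]
    allowed_purposes: FrozenSet[str]


@dataclass(frozen=True)
class DataObject:
    data_id: str       # content-derived identifier, independent of location
    payload: bytes
    policy: AccessPolicy


def make_data_object(payload: bytes, policy: AccessPolicy) -> DataObject:
    """Identify the object by a digest of its content."""
    return DataObject(hashlib.sha256(payload).hexdigest(), payload, policy)


def read(obj: DataObject, role: str, purpose: str) -> bytes:
    """Check integrity and the attached policy before releasing the payload."""
    if hashlib.sha256(obj.payload).hexdigest() != obj.data_id:
        raise ValueError("payload does not match its identifier")
    if role not in obj.policy.allowed_roles:
        raise PermissionError(f"role {role!r} is not allowed")
    if purpose not in obj.policy.allowed_purposes:
        raise PermissionError(f"purpose {purpose!r} is not allowed")
    return obj.payload
```

Any host that receives such an object can apply the same checks, so the decision does not depend on where the data happens to be stored or processed.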
Defining Big Data Architecture Framework
• Existing attempts address architecture issues in a traditional way: ODCA, TMF, NIST
– http://bigdatawg.nist.gov/_uploadfiles/BD_Vol5_RefArchSurvey_V1Draft_Pre-release.pdf
• Architecture vs Ecosystem
– Big Data undergo a number of transformations during their lifecycle
– Big Data fuel the whole transformation chain
  • Data sources and data consumers, target data usage
– Multi-dimensional relations between
  • Data models and data driven processes
  • Infrastructure components and data centric services
• Architecture vs Architecture Framework
– Separates concerns and factors
  • Control and Management functions, orthogonal factors
– Architecture Framework components are inter-related
Big Data Architecture Framework (BDAF) (1)
(1) Data Models, Structures, Types
– Data formats, non/relational, file systems, etc.
(2) Big Data Management
– Big Data Lifecycle (Management) Model
  • Big Data transformation/staging
– Provenance, Curation, Archiving
(3) Big Data Analytics and Tools
– Big Data Applications
  • Target use, presentation, visualisation
(4) Big Data Infrastructure (BDI)
– Storage, Compute, (High Performance Computing,) Network
– Sensor network, target/actionable devices
– Big Data Operational support
(5) Big Data Security
– Data security at rest and in motion, trusted processing environments
Big Data Architecture Framework (BDAF) – Aggregated – Relations between components (2)
Col: Used By. Row: Requires This.
| | Data Models & Structures | Data Management & Lifecycle | BigData Infrastructure & Operations | BigData Analytics & Applications | BigData Security |
| Data Models & Structures | | + | ++ | + | ++ |
| Data Management & Lifecycle | ++ | | ++ | ++ | ++ |
| BigData Infrastructure & Operations | +++ | +++ | | ++ | +++ |
| BigData Analytics & Applications | ++ | + | ++ | | ++ |
| BigData Security | +++ | +++ | +++ | + | |
Big Data Ecosystem: General BD Infrastructure
Data Transformation, Data Management
[Figure: Big Data Ecosystem layers and components]
• Big Data Source/Origin (sensor, experiment, log data, behavioral data)
• Data transformation stages: Data Source; Data Collection & Registration; Data Filter/Enrich, Classification; Data Analytics, Modeling, Prediction; Data Delivery, Visualisation; Consumer
• Big Data Target/Customer: Actionable/Usable Data; target users, processes, objects, behavior, etc.
• Data Management
• Big Data Analytic/Tools
• Data categories: metadata, (un)structured, (non)identifiable
• Storage (General Purpose); Compute (General Purpose); High Performance Computer Clusters; Storage (Specialised Databases, Archives: analytics DB, in-memory, operational)
• Infrastructure Management/Monitoring; Security Infrastructure; Network Infrastructure (Internal)
• Intercloud multi-provider heterogeneous Infrastructure
• Federated Access and Delivery Infrastructure (FADI)
Big Data Infrastructure and Analytics Tools
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure
• Collaborative Environment (user/groups management)
• Advanced high performance (programmable) network
• Security infrastructure
Big Data Analytics
• High Performance Computer Clusters (HPCC)
• Analytics/processing: Real-time, Interactive, Batch, Streaming
• Big Data Analytics tools and applications
Data Lifecycle/Transformation Model
• Does the data model change along the lifecycle or data evolution?
• Identifying and linking data
– Persistent identifiers
– Data ownership
– Traceability vs Opacity
– Referral integrity
Common Data Model?
• Data Variety and Variability
• Semantic Interoperability
Data (inter)linking?
• PID/OID
• ORCID
• Identification
• Privacy, Opacity
[Figure: data lifecycle stages over Data Storage: Data Source; Data Collection & Registration; Data Filter/Enrich, Classification; Data Analytics, Modeling, Prediction; Data Delivery, Visualisation; Consumer, Data Analytics Application. Stages are labelled Data Model (1), Data Model (1), Data Model (3), Data Model (4). Feedback path: data repurposing, analytics re-factoring, secondary processing]
Evolutional/Hierarchical Data Model
Topics for discussion, research and standardisation
• Common Data Model?
• Data interlinking?
• Fits to Graph data type?
• Metadata
• Referrals
• Control information
• Policy
• Data patterns
[Figure: hierarchy of data states: Raw Data; Classified/Structured Data; Processed Data (for target use); Usable Data; Actionable Data; Papers/Reports; Archival Data. Identifiers shown: ORCID, PID/DOI]
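The idea of linking data states through persistent identifiers can be sketched as follows. This is a minimal illustration, assuming a simple in-memory registry: the `Registry` and `DataRecord` names, the `pid:N` identifier format, and the four stage names (adapted from the data states in the figure) are hypothetical and do not represent an actual PID, DOI, or ORCID service.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional

# Stage names adapted from the data states in the figure (illustrative).
STAGES = ["raw", "classified", "processed", "usable"]


@dataclass(frozen=True)
class DataRecord:
    pid: str                            # persistent identifier (hypothetical format)
    stage: str
    content: str
    derived_from: Optional[str] = None  # PID of the parent record, if any


class Registry:
    """In-memory stand-in for a PID registry that keeps provenance links."""

    def __init__(self) -> None:
        self._records: Dict[str, DataRecord] = {}
        self._counter = itertools.count(1)

    def register(self, stage: str, content: str,
                 parent: Optional[DataRecord] = None) -> DataRecord:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        record = DataRecord(
            pid=f"pid:{next(self._counter)}",
            stage=stage,
            content=content,
            derived_from=parent.pid if parent else None,
        )
        self._records[record.pid] = record
        return record

    def lineage(self, pid: str) -> List[str]:
        """Follow derived_from links back to the original raw data."""
        chain: List[str] = []
        current = self._records.get(pid)
        while current is not None:
            chain.append(current.stage)
            parent_pid = current.derived_from
            current = self._records.get(parent_pid) if parent_pid else None
        return chain
```

With such links in place, a processed or published dataset can be traced back to its raw source, which is one way to approach the traceability and referral integrity questions raised for discussion.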
Summary and topics for discussion
• Researching, learning, and mastering the Big Data domain is itself a Big Data problem
• Cloud Computing as a natural platform for Big Data
– Acceptance of clouds will grow, and so will the demand for specialists
• New generic data-centric models are required
• New distributed data processing and analytics computing models need to be developed/re-factored
• Data Scientist is a new focus of talent search for companies, and a task for universities to develop a new curriculum
Additional topics
• Data Scientist: New profession and need for Education & Training
Data Scientist: New Profession and Opportunities
• There will be a shortage of talent necessary for organizations to take advantage of Big Data.
– By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as
– 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions
SOURCE: US Bureau of Labor Statistics; US Census; Dun & Bradstreet; company interviews; McKinsey analysis
Wsh SGBD, 13-16 May 2014 Big Data and Education 32
McKinsey Institute on Big Data Jobs (2011)
http://www.mckinsey.com/mgi/publications/big_data/index.asp
Strata Survey Skills and Data Scientist Self-ID
Analysing the Analysers. O’Reilly Strata Survey – Harris, Murphy & Vaisman, 2013
• Based on how data scientists think about themselves and their work
• Identified four Data Scientist clusters
Skills and Self-ID Top Factors
[Figure: top factors] ML/BigData; Business; Math/OR; Programming; Statistics
ML – Machine Learning; OR – Operations Research
Slide from the presentation
Demystifying Data Science
(by Natasha Balac, SDSC)
Key to a Great Data Scientist
Technical skills (Coding, Statistics, Math)
+ Commitment
+ Creativity
+ Intuition
+ Presentation Skills
+ Business Savvy
= Great Data Scientist!
• How Long Does It Take For a Beginner to Become a Good Data Scientist?
– 3-5 years according to KDnuggets survey [278 votes total]