Introduction Active Data Discussion Conclusion
Active Data: A Data-Centric Approach to Data Life-Cycle Management
Anthony Simonet (1), Gilles Fedak (1), Matei Ripeanu (2), Samer Al-Kiswany (2)
(1) Inria, ENS Lyon, University of Lyon; (2) University of British Columbia
November 18th, 2013
A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 1/20
Outline
Introduction
Data Life Cycle Management
Use-case
Requirements
Active Data
Active Data: principles & features
Example: Globus Online and iRODS
Discussion
Advantages
Limitations
Conclusion
Related work
Conclusion
Big Data
- Science and industry have become data-intensive
  - Volume of data produced by science and industry grows exponentially
  - How to store this deluge of data?
  - How to extract knowledge and sense?
  - How to make data valuable?
- Some examples
  - CERN’s Large Hadron Collider: 1.5 PB/week
  - Large Synoptic Survey Telescope, Chile: 30 TB/night
  - Billion-edge social network graphs
  - Searching and mining the Web
Data Life Cycle
- Creation/Acquisition
- Transfer
- Replication
- Disposal/Archiving
Definition
The life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.
Data Life Cycle Management
Complicated scenarios
- Execution of workflows
- Complex interactions between software
- Need to quickly react to operational events
Ad-hoc task-centric approaches
- Hard to program, maintain and debug
- No formal specification
- Complicates interactions between systems
Data Life Cycle Use-case
Example: the Advanced Photon Source at Argonne National Lab
- 100 TB of raw data per day
- Raw data are preprocessed and registered in a Globus dataset catalog
- Data are analyzed by various applications
- Results are stored in the dataset catalog and shared
Figure: data path linking the Instrument (Beamline), Local Storage, the Metadata Catalog, a Remote Data Center and an Academic Cluster, with steps labeled Transfer, Extract & Register Metadata, Transfer, Analysis, More analysis, Upload result, Register result metadata.
Use-case
Task-centric vs. data-centric
Task-centric:
- Independent scripts
- Hard to program, maintain, verify
- Coarse granularity
Data-centric:
- Express data dependencies
- Cross data-center coordination
- User-level fault tolerance
- Incremental processing
Requirements
Challenges: a perfect system should...
- Simply represent the life cycle of data distributed across different data centers and systems
- Simplify data life cycle management (DLM) modeling and reasoning
- Hide the complexity resulting from using different infrastructures and systems
- Be easy to integrate with existing systems
Active Data principles
System programmers expose their system’s internal data life cycle with a model based on Petri Nets.
A life cycle model is made of Places and Transitions.
Figure: life cycle model with places Created, Written, Read and Terminated and transitions t1 to t4. A token (•) sits in Created, then in Written on the following steps of the slide. A handler is attached to a transition:
public void handler() {
    computeMD5();
}
Each token has a unique identifier, corresponding to the actual data item’s.
A transition is fired whenever a data state changes.
Clients may plug code into transitions; it is executed whenever the transition is fired.
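The principles above can be sketched in a few lines of Java. This is a minimal illustration, not the prototype's API: the class and method names (`LifeCycleSketch`, `LifeCycle`, `subscribe`, `fire`) are hypothetical, and the arc from Created to Written through t1 is an assumption read off the figure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class LifeCycleSketch {
    /** A transition moves a token from one place to another (hypothetical type). */
    static final class Transition {
        final String from;
        final String to;
        Transition(String from, String to) { this.from = from; this.to = to; }
    }

    /** Minimal life cycle model: one token per data item, handlers per transition. */
    static class LifeCycle {
        private final Map<String, Transition> transitions = new HashMap<>();
        private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();
        private final Map<String, String> tokenPlace = new HashMap<>(); // token id -> place

        void addTransition(String name, String from, String to) {
            transitions.put(name, new Transition(from, to));
        }

        /** Plug client code into a transition. */
        void subscribe(String transitionName, Consumer<String> handler) {
            handlers.computeIfAbsent(transitionName, k -> new ArrayList<>()).add(handler);
        }

        /** Put a new token, identified by the data item's id, in its initial place. */
        void create(String tokenId, String initialPlace) {
            tokenPlace.put(tokenId, initialPlace);
        }

        /** Fire a transition for one token, then run the attached handlers in order. */
        void fire(String transitionName, String tokenId) {
            Transition t = transitions.get(transitionName);
            if (t == null) {
                throw new IllegalArgumentException("unknown transition " + transitionName);
            }
            if (!t.from.equals(tokenPlace.get(tokenId))) {
                throw new IllegalStateException(tokenId + " is not in place " + t.from);
            }
            tokenPlace.put(tokenId, t.to);
            for (Consumer<String> h : handlers.getOrDefault(transitionName, Collections.emptyList())) {
                h.accept(tokenId);
            }
        }

        String placeOf(String tokenId) { return tokenPlace.get(tokenId); }
    }

    /** Builds the model, attaches a handler to t1 and fires it once. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        LifeCycle lc = new LifeCycle();
        lc.addTransition("t1", "Created", "Written"); // assumed arc
        lc.subscribe("t1", id -> log.add("computeMD5(" + id + ")"));
        lc.create("file-42", "Created");
        lc.fire("t1", "file-42");
        log.add(lc.placeOf("file-42"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the token's place in a map keyed by the data item's identifier mirrors the idea that each token stands for exactly one data item.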
Active Data features
The Active Data programming model and runtime environment:
- Lets clients react to life cycle progression
- Exposes distributed data sets transparently
- Can be integrated with existing systems
- Has scalable performance and minimum overhead over existing systems
Implementation
- Prototype implemented in Java (≈ 2,800 LOC)
- Client/Service communication is Publish/Subscribe
- 2 types of subscription:
  - Every transition for a given data item
  - Every data item for a given transition
- Several ways to publish transitions
  - Instrument the code
  - Read the logs
  - Rely on an existing notification system
- The service orders transitions by time of arrival
- Clients run transition handler code locally
- Transition handlers are executed
  - Serially
  - In a blocking way
  - In the order transitions were published
Figure: clients subscribe to the Active Data Service and publish transitions to it; the service notifies subscribed clients.
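The two subscription types and the serial, in-order handler execution described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the prototype: `PubSubSketch`, `Service`, `subscribeToItem` and `subscribeToTransition` are hypothetical names, and handlers run inside `publish` rather than on remote clients as in the real system.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class PubSubSketch {
    /** In-memory stand-in for the service (hypothetical API). */
    static class Service {
        // Handlers receive (transition name, data item id).
        private final Map<String, List<BiConsumer<String, String>>> byItem = new HashMap<>();
        private final Map<String, List<BiConsumer<String, String>>> byTransition = new HashMap<>();

        /** Subscription type 1: every transition for a given data item. */
        void subscribeToItem(String itemId, BiConsumer<String, String> handler) {
            byItem.computeIfAbsent(itemId, k -> new ArrayList<>()).add(handler);
        }

        /** Subscription type 2: every data item for a given transition. */
        void subscribeToTransition(String transition, BiConsumer<String, String> handler) {
            byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
        }

        /**
         * Publishes one transition. The method is synchronized, so transitions are
         * handled one at a time in arrival order, and each handler blocks the next.
         */
        synchronized void publish(String transition, String itemId) {
            for (BiConsumer<String, String> h : byItem.getOrDefault(itemId, Collections.emptyList())) {
                h.accept(transition, itemId);
            }
            for (BiConsumer<String, String> h : byTransition.getOrDefault(transition, Collections.emptyList())) {
                h.accept(transition, itemId);
            }
        }
    }

    /** Registers one subscription of each type and publishes three transitions. */
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        Service service = new Service();
        service.subscribeToItem("data-1", (t, id) -> seen.add("item:" + t + "/" + id));
        service.subscribeToTransition("t2", (t, id) -> seen.add("transition:" + t + "/" + id));
        service.publish("t1", "data-1");
        service.publish("t2", "data-1");
        service.publish("t2", "data-2");
        return seen;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Serial, blocking execution keeps handler side effects in the same order as the published transitions, at the cost of a slow handler delaying the ones behind it.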
Performance evaluation: Throughput
Figure: Average number of transitions per second handled by the Active Data Service (x-axis: number of clients, from 10 to 550; y-axis: transitions per second, from 5,000 to 35,000)
Clients publish 10,000 transitions in a row without pausing.
The prototype scales up to 30,000 transitions per second.
Example: Data Provenance
Definition
The complete history of data life cycle derivations and operations.
- Assess the quality of data
- Keep track of the origin of data over time
- Specialized Provenance Aware Storage Systems
→ What about heterogeneous systems?
Example with Globus Online and iRODS
- Globus Online: file transfer service
- iRODS: data store and metadata catalog
Example: Globus Online and iRODS
Data events coming from Globus Online and iRODS
Figure: two coupled life cycle models.
- Globus Online: places Created, Succeeded, Failed and Terminated; transitions t1 to t4; token Id: {GO: 7b9e02c4-925d-11e2}; handler:
public void handler() {
    iput(...);
}
- iRODS: places Created, Get, Put and Terminated; transitions t5 to t10; token Id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}; handler:
public void handler() {
    annotate();
}
Across the successive steps of the slide, the token (•) appears in the Globus Online place Created, then Succeeded, then in the iRODS place Created, then Put.
Example: Globus Online and iRODS
$ imeta ls -d test/out_test_4628
AVUs defined for dataObj test/out_test_4628:
attribute: GO_FAULTS
value: 0
----
attribute: GO_COMPLETION_TIME
value: 2013-03-21 19:28:41Z
----
attribute: GO_REQUEST_TIME
value: 2013-03-21 19:28:17Z
----
attribute: GO_TASK_ID
value: 7b9e02c4-925d-11e2-97ce-123139404f2e
----
attribute: GO_SOURCE
value: go#ep1/~/test
----
attribute: GO_DESTINATION
value: asimonet#fraise/~/out_test_4628
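The slides do not show the body of the `annotate()` handler. As a hedged sketch of what such a handler might do, the Java below turns a finished transfer into the attribute/value pairs listed above and formats them as `imeta add -d` command lines. `ProvenanceSketch`, `TransferTask`, `toAttributes` and `imetaCommands` are hypothetical names; the commands are only built as strings and never executed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProvenanceSketch {
    /** Hypothetical summary of a finished transfer; fields mirror the listing above. */
    static final class TransferTask {
        final String taskId, source, destination, requestTime, completionTime;
        final int faults;

        TransferTask(String taskId, String source, String destination,
                     String requestTime, String completionTime, int faults) {
            this.taskId = taskId;
            this.source = source;
            this.destination = destination;
            this.requestTime = requestTime;
            this.completionTime = completionTime;
            this.faults = faults;
        }
    }

    /** Maps a transfer to attribute/value pairs, in the order shown in the listing. */
    static Map<String, String> toAttributes(TransferTask t) {
        Map<String, String> avus = new LinkedHashMap<>();
        avus.put("GO_FAULTS", Integer.toString(t.faults));
        avus.put("GO_COMPLETION_TIME", t.completionTime);
        avus.put("GO_REQUEST_TIME", t.requestTime);
        avus.put("GO_TASK_ID", t.taskId);
        avus.put("GO_SOURCE", t.source);
        avus.put("GO_DESTINATION", t.destination);
        return avus;
    }

    /** Formats one "imeta add -d" command line per attribute (strings only). */
    static List<String> imetaCommands(String dataObject, Map<String, String> avus) {
        List<String> commands = new ArrayList<>();
        for (Map.Entry<String, String> e : avus.entrySet()) {
            commands.add("imeta add -d " + dataObject + " " + e.getKey()
                    + " \"" + e.getValue() + "\"");
        }
        return commands;
    }

    /** Values copied from the listing above. */
    static TransferTask exampleTask() {
        return new TransferTask(
                "7b9e02c4-925d-11e2-97ce-123139404f2e",
                "go#ep1/~/test",
                "asimonet#fraise/~/out_test_4628",
                "2013-03-21 19:28:17Z",
                "2013-03-21 19:28:41Z",
                0);
    }

    public static void main(String[] args) {
        Map<String, String> avus = toAttributes(exampleTask());
        imetaCommands("test/out_test_4628", avus).forEach(System.out::println);
    }
}
```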
Advantages
- Simple and graphical way to program DLM operations
- Allows formal verification of some properties of data life cycles
- Easy coordination between systems
- Easy to scale
- Easy to debug
- Easy fault tolerance
- Fine-grained interaction with the data life cycle
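The slides do not say which properties can be verified. One plausible example, sketched below, is checking that every place of a life cycle can eventually reach Terminated. The sketch treats the model as a plain graph of places, a simplification that holds when each data item has a single token, and the arcs in `exampleLifeCycle` are assumptions rather than the arcs of the figure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ReachabilitySketch {
    /** Returns the places from which the target place cannot be reached. */
    static Set<String> placesNotReaching(Map<String, List<String>> successors, String target) {
        // Collect all places and build the reversed graph.
        Map<String, List<String>> predecessors = new HashMap<>();
        Set<String> all = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : successors.entrySet()) {
            all.add(e.getKey());
            for (String to : e.getValue()) {
                all.add(to);
                predecessors.computeIfAbsent(to, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Breadth-first search backwards from the target.
        Set<String> canReach = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        canReach.add(target);
        queue.add(target);
        while (!queue.isEmpty()) {
            String place = queue.remove();
            for (String pred : predecessors.getOrDefault(place, Collections.<String>emptyList())) {
                if (canReach.add(pred)) {
                    queue.add(pred);
                }
            }
        }
        all.removeAll(canReach);
        return all;
    }

    /** Assumed arcs between the places named on the principles slide. */
    static Map<String, List<String>> exampleLifeCycle() {
        Map<String, List<String>> arcs = new HashMap<>();
        arcs.put("Created", Arrays.asList("Written", "Terminated"));
        arcs.put("Written", Arrays.asList("Read", "Terminated"));
        arcs.put("Read", Arrays.asList("Read", "Terminated"));
        return arcs;
    }

    public static void main(String[] args) {
        System.out.println(placesNotReaching(exampleLifeCycle(), "Terminated"));
    }
}
```

An empty result means every data item can always be disposed of; any place in the result marks a state in which data could get stuck.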
Limitations
- Reasoning in terms of life cycle events can be complex
- Lack of a standard
Related work
Data-centric parallel processing
- Programming models:
  - MapReduce and higher-level abstractions: PigLatin, Twister
  - Incremental systems: MapReduce-Online, Percolator, Chimera, Nephele
  - Other models with implicit parallelism: Swift, Dryad, Allpairs
- Storage systems
  - BitDew
  - MosaStore
  - Provenance Aware Storage Systems
  - Active Storage
Conclusion
Active Data is...
- Data-centric & Event-driven
- System-level data integration
What’s next?
- Advanced representation of operations that consume and produce data: represent data derivation
- Data collection abilities
- Distributed implementation of the Publish/Subscribe layer
Thank you!
Questions?
Inria booth #2116