Retrospective Provenance without a Runtime Provenance Recorder
Timothy McPhillips Shawn Bowers Khalid Belhajjame Bertram Ludäscher*
Overview / Exec Summary
• Scientific Workflows: ASAP!
• Scripts are Scientific Workflows, too!
– .. so how can we help workflow script authors?
• YesWorkflow (YW)
– Prospective provenance through YW-annotations:
• script + annotations @begin, @end, @in, @out => workflow model
• YW-Recon: Retrospective provenance without a provenance recorder:
– .. add @URI template-annotations => linking script persisted (meta-)data with workflow model
... => YW Retrospective Provenance Queries
– .. using YW workflow model to query YW-reconstructed runtime provenance!
YesWorkflow Provenance @ TaPP'15 2
Scientific Workflows: ASAP!
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
=> traceable data- and wf-evolution
=> Reproducible Science
Logos shown: Trident Workbench, VisTrails
Once upon a time …
Scientific Workflows …
Cabellos et al. Computer Physics Communications 182, 2011
… are a wonderful thing …
Dr. Norbert Podhorszki (then: UC Davis)
… after simplifying a bit (here: Kepler/COMAD)
Dr. Sven Köhler (then: UC Davis)
I beg your pardon, I never promised you ..
“Thanks to our Graphical UI your scientific workflows will be much easier to develop, understand and maintain!” Hmm… this was supposed to be easier than programming!
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
… HPCBio Workflows @ Illinois
National Petascale Computing Facility
Broad Institute: Recommended workflow for variant analysis
Liudmila Mainzer, Victor Jongeneel HPC Bio @ Illinois
Quickly, say: #!/bin/bash
It’s time to shift control …
• … back from being consumers of someone else’s (= our) tools ..
– “Just click here!”
• ... to tool makers!
– Scientists who author workflows as scripts!
• Go where the wild things (users!) are …
– Yes, develop for “end users” …
– … but don’t forget the tool makers!
• Can we do this together?
Figure: workflow graph of the paleoclimate reconstruction script. Node labels: GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years
?
YesWorkflow: Yes, scripts are workflows, too!
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
Enter: YesWorkflow! (yesworkflow.org)
• YesWorkflow (YW)
– Grass-roots effort
– … meeting the scientists/users where they R!
• R, Matlab, (i)Python, Jupyter, …
– Scripts + simple user annotations
• => Reveal the workflow model/abstraction … that underlies the (script) implementation
• => YW can give us more of ASAP!
– First YW: ASAP (Abstraction)...
– Then YW-recon: ASAP (reconstructing runtime Provenance)
Related Work, other Approaches … to bring workflow/provenance benefits to scripts:
• Runtime Provenance Recorders:
– use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables
• file read/write, function calls, program variables & state, …
– noWorkflow system
• [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]
• exploit Python profiling library to capture runtime provenance
=> helps with "S" and "P"
• OS-level capture of (system) provenance
– Some talks at TaPP !?
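To make the idea of a runtime provenance recorder concrete, here is a minimal sketch that uses Python's built-in `sys.setprofile` hook to log which Python functions run during a call. It only illustrates the general "profiling hook" approach; it is not how noWorkflow is implemented, and `record_calls`, `load` and `total` are hypothetical names invented for this example.

```python
import sys

def record_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions called meanwhile)."""
    calls = []

    def profiler(frame, event, arg):
        # "call" fires when a Python function is entered; C functions
        # report "c_call" and are ignored in this sketch.
        if event == "call":
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always remove the hook again
    return result, calls

# Hypothetical two-step "script" to observe.
def load(n):
    return list(range(n))

def total(n):
    return sum(load(n))

result, calls = record_calls(total, 4)
print(result, calls)
```

A recorder like this has to be active while the script runs, which is exactly the requirement YW-Recon tries to avoid.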
YW (prospective) and YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy :-)
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model <=> Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
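As a rough illustration of what such annotations might look like, the sketch below embeds @begin/@in/@out/@end comments in a small script fragment and pulls them out with a regular expression. The block name, port names and URI templates come from the data collection example in these slides; the exact comment layout, the wiring of ports to the block, and the `extract_annotations` helper are assumptions for illustration, not the YesWorkflow tool's own parser.

```python
import re

# Hypothetical annotated script fragment. The tags are the YW keywords named
# on the slides; the comment layout and port wiring are assumptions.
SCRIPT = '''
# @begin transform_images
# @in raw_image @uri file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
# @out corrected_image @uri file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
def transform_images():
    pass
# @end transform_images
'''

# One tag, one name, and an optional "@uri <template>" on the same line.
TAG = re.compile(r"@(begin|end|in|out)\s+(\S+)(?:\s+@uri\s+(\S+))?", re.IGNORECASE)

def extract_annotations(source):
    """Return (tag, name, uri_or_None) tuples for every annotated comment line."""
    found = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            continue  # only comment lines can carry annotations
        match = TAG.search(stripped)
        if match:
            found.append((match.group(1).lower(), match.group(2), match.group(3)))
    return found

annotations = extract_annotations(SCRIPT)
for tag, name, uri in annotations:
    print(tag, name, uri or "")
```

Because the annotations live in comments, the script itself runs unchanged; only a reader of the comments sees the workflow model.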
Figure: YW graph of the data collection script. Node labels, with URI templates where given:
cassette_id
sample_score_cutoff
sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv
calibration_image file:calibration.img
initialize_run
run_log file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log file:/run/rejected_samples.txt
collect_data_set
sample_id energy frame_number
raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensity pixel_count corrected_image_path
log_average_image_intensity
collection_log file:run/collected_images.csv
YW!
Voila! The Workflow revealed!
Get 3 views for the price of 1!
Process view
Data view
Combined view
Figure: YW workflow graph of the paleoclimate reconstruction script (nodes include GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output).
Paleoclimate Reconstruction (EnviRecon.org)
• … explained using YesWorkflow!
Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
Provenance Lands
Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)
run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
│ ├── image_001.raw
... ...
│ └── image_030.raw
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt
YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
• URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
• … facilitating provenance reconstruction
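One way to picture this linking is to compile a URI template into a regular expression and read the variable bindings off a matching file path. The sketch below does that for two templates from the data collection example, with the `file:` scheme prefix dropped for brevity. The helper names `template_to_regex` and `bindings` are hypothetical, and the assumption that a variable never spans a `/` is mine, not something the slides state.

```python
import re

def template_to_regex(template):
    """Compile a URI template such as 'a/{x}/b_{y}.raw' into a regex.

    Each {variable} becomes a named group; a variable that occurs twice
    must match the same text both times (back-reference).
    """
    parts = re.split(r"\{(\w+)\}", template)  # literals at even, names at odd indices
    seen = set()
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)          # literal text
        elif part in seen:
            pattern += f"(?P={part})"           # repeated variable
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+?)"    # assumption: no '/' inside a value
    return re.compile(pattern)

def bindings(template, path):
    """Return the variable bindings if path fits the template, else None."""
    match = template_to_regex(template).fullmatch(path)
    return match.groupdict() if match else None

RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_IMAGE = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"

print(bindings(RAW_IMAGE, "run/raw/q55/DRT240/e10000/image_001.raw"))
print(bindings(CORRECTED_IMAGE, "data/DRT322/DRT322_11000eV_030.img"))
print(bindings(RAW_IMAGE, "run/run_log.txt"))
```

Each successful match yields one "fact" about the run (which sample, which energy, which frame), recovered purely from the file name.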
Data collection workflow (X-ray diffraction)
Data collection workflow: runtime data
1. YW annotations => YW model
2. Files & Folders left by a run => runtime (meta-)data
Q1: What samples did the script run collect images from?
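Q1 can be read as a pattern match over the file listing: every path that fits the raw_image template contributes one sample_id. The sketch below works on the handful of raw image paths that the directory listing shows explicitly (elided files are left out). The hand-written regex and the `raw_image_facts` helper are illustrative assumptions, not YW-Recon's actual query machinery.

```python
import re

# Paths shown explicitly in the run/ listing; files hidden behind "..." are omitted.
FILES = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e10000/image_037.raw",
    "run/raw/q55/DRT240/e11000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",
]

# Hand-written regex for the raw_image template
# run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
RAW_IMAGE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>[^/]+)/image_(?P<frame_number>[^/]+)\.raw"
)

def raw_image_facts(paths):
    """Return one dict of template bindings per path that fits the template."""
    return [m.groupdict() for m in map(RAW_IMAGE.fullmatch, paths) if m]

facts = raw_image_facts(FILES)
samples = sorted({fact["sample_id"] for fact in facts})
print(samples)  # ['DRT240', 'DRT322']
```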
Q2: What energies were used for image collection from sample DRT322?
Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
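A question like Q3 can be approached by joining the two URI templates on their shared variables: parse the corrected image name to get sample_id, energy and frame_number, then expand the raw_image template with those values. The sketch below is one possible reading; since cassette_id is not encoded in the corrected image name, it is left as a wildcard and resolved against the file listing. The helper `raw_images_for` and the hand-written file-name regex are hypothetical.

```python
import re
from fnmatch import fnmatchcase

RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"

# Hand-written regex for the file-name part of the corrected_image template
# {sample_id}_{energy}eV_{frame_number}.img
CORRECTED_NAME = re.compile(
    r"(?P<sample_id>[^_/]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img"
)

# Raw image paths shown explicitly in the run/ listing.
RAW_FILES = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

def raw_images_for(corrected_name, raw_files):
    """Find raw images whose template variables agree with the corrected image name."""
    match = CORRECTED_NAME.fullmatch(corrected_name)
    if match is None:
        return []
    # cassette_id is unknown here, so expand it to a shell-style wildcard.
    pattern = RAW_TEMPLATE.format(cassette_id="*", **match.groupdict())
    return [path for path in raw_files if fnmatchcase(path, pattern)]

print(raw_images_for("DRT322_11000eV_030.img", RAW_FILES))
# ['run/raw/q55/DRT322/e11000/image_030.raw']
```

Note that `fnmatchcase` lets `*` match across `/`, which is acceptable for this small listing but would need tightening for deeper directory trees.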
Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
Data & Workflow Management
1. Large total data footprint
2. Large number of files
3. Large number of simultaneous but independent non-MPI computations
4. Keeping track of what was done to the data: large amount of metadata
5. Workflow bottlenecks: fans & merges; more fans
Source: L Mainzer, V Jongeneel (IGB & NCSA)
Taking YW for a spin …
• “To document on-the-fly, specifically for a given workflow configuration invoked:
– do not insert annotations into code,
– but rather have code print annotations into a special log during execution,
– then parse that log!” – Liudmila Mainzer
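The suggestion above could look roughly like the following: each step writes YW-style annotation lines into a log as it executes, and a separate pass parses that log back into a small workflow model. This is only a sketch of the idea; the log format, `run_step`, `parse_log`, and the choice of step and port names (borrowed from the data collection example) are assumptions, not an existing YW feature.

```python
import io
import re

def run_step(name, inputs, outputs, log):
    """Pretend to run one workflow step and print YW-style annotations into the log."""
    log.write(f"@begin {name}\n")
    for port in inputs:
        log.write(f"@in {port}\n")
    # ... the real computation for this step would happen here ...
    for port in outputs:
        log.write(f"@out {port}\n")
    log.write(f"@end {name}\n")

LINE = re.compile(r"@(begin|end|in|out)\s+(\S+)")

def parse_log(text):
    """Rebuild {step: {"in": [...], "out": [...]}} from the annotation log."""
    steps, current = {}, None
    for line in text.splitlines():
        match = LINE.fullmatch(line.strip())
        if not match:
            continue  # ignore ordinary log output
        tag, value = match.groups()
        if tag == "begin":
            current = steps.setdefault(value, {"in": [], "out": []})
        elif tag == "end":
            current = None
        elif current is not None:
            current[tag].append(value)
    return steps

log = io.StringIO()  # stands in for the "special log" file
run_step("load_screening_results", ["sample_spreadsheet"],
         ["sample_name", "sample_quality"], log)
model = parse_log(log.getvalue())
print(model)
```

Because the annotations are emitted at run time, the recovered model reflects the workflow configuration that was actually invoked rather than every branch present in the code.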
Source: L Mainzer, V Jongeneel (IGB & NCSA)
Conclusions & the Road ahead …
• YW: Go where the users are!
– … they already capture provenance through metadata!
• Beware your level of provenance abstraction
– Let the user provide a workflow model easily!
• YW-Recon:
– … finishing support for retrospective provenance without using a runtime provenance recorder!
– Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
– Exploit this with annotations: URI-templates
– Extend YW to exploit log files
• YW-GUI:
– Exploring & experimenting with a UI
• Let script author immediately see effect of YW-annotations
• Let script (re-)user explore the code and data provenance
• YW-Query:
– ? Datalog, RPQ, … for querying provenance …