
YesWorkflow: Retrospective Provenance Without a Runtime Provenance Recorder


Timothy McPhillips, Shawn Bowers, Khalid Belhajjame, Bertram Ludäscher*

Overview / Exec Summary
• Scientific Workflows: ASAP!
• Scripts are Scientific Workflows, too!
  – .. so how can we help workflow script authors?
• YesWorkflow (YW)
  – Prospective provenance through YW annotations:
    • script + annotations @begin, @end, @in, @out => workflow model
• YW-Recon: Retrospective provenance without a provenance recorder:
  – .. add @URI template annotations => linking script-persisted (meta-)data with the workflow model
• ... => YW Retrospective Provenance Queries
  – .. using the YW workflow model to query YW-reconstructed runtime provenance!

YesWorkflow  Provenance  @  TaPP'15   2  

Scientific Workflows: ASAP!
• Automation
  – wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
  – wfs should make use of parallel compute resources
  – wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
  – wfs should be easy to (re-)use, evolve, share
• Provenance
  – wfs should capture processing history, data lineage
  => traceable data- and wf-evolution
  => Reproducible Science
[Logos: Trident Workbench, VisTrails]

Once upon a time …

Scientific Workflows …

Cabellos et al., Computer Physics Communications 182, 2011

…  are  a  wonderful  thing  …    


Dr.  Norbert  Podhorszki  (then:  UC  Davis)  

… after simplifying a bit (here: Kepler/COMAD)


Dr.  Sven  Köhler  (then:  UC  Davis)  

I beg your pardon, I never promised you ..

“Thanks to our Graphical UI your scientific workflows will be much easier to develop, understand and maintain!”
Hmm… this was supposed to be easier than programming!


Meanwhile, on a nearby planet …

Interactive Visualization


SKOPE:  Synthesized  Knowledge  Of  Past  Environments  


Bocinsky, Kohler et al. study rain-fed maize of the Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde migrations; late 13th century AD. Uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates joint information in tree rings and a climate signal to identify the “best” tree-ring chronologies for climate reconstruction.

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

… implemented as an R Script …

… HPCBio Workflows @ Illinois
National Petascale Computing Facility
Broad Institute: Recommended workflow for variant analysis
Liudmila Mainzer, Victor Jongeneel, HPCBio @ Illinois
Quickly, say: #!/bin/bash

It’s time to shift control …
• … back from being consumers of someone else’s (= our) tools ..
  – “Just click here!”
• ... to tool makers!
  – Scientists who author workflows as scripts!
• Go where the wild things (users!) are …
  – Yes, develop for “end users” …
  – … but don’t forget the tool makers!
• Can we do this together?

[Figure: workflow graph.
Steps: GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output.
Data: master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years, PRISM_annual_growing_season_precipitation, dendro_series_for_calibration, dendro_series_for_reconstruction, cellwise_unique_selected_linear_models, cellwise_union_selected_linear_models, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif]

?

YesWorkflow: Yes, scripts are workflows, too!
• Script vs Workflows/ASAP:
  – Automation: *****
  – Scaling: **
  – Abstraction: *
  – Provenance: **

Enter: YesWorkflow! (yesworkflow.org)
• YesWorkflow (YW)
  – Grass-roots effort
  – … meeting the scientists/users where they R!
    • R, Matlab, (i)Python, Jupyter, …
  – Scripts + simple user annotations
• => Reveal the workflow model/abstraction … that underlies the (script) implementation
• => YW can give us more of ASAP!
  – First YW: ASAP (Abstraction)...
  – Then YW-recon: ASAP (reconstructing runtime Provenance)


YesWorkflow.org  


Related Work, other Approaches
… to bring workflow/provenance benefits to scripts:
• Runtime Provenance Recorders:
  – use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables
    • file read/write, function calls, program variables & state, …
  – noWorkflow system
    • [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]
    • exploit Python profiling library to capture runtime provenance
  => helps with "S" and "P"
• OS-level capture of (system) provenance
  – Some talks at TaPP !?


YW (prospective) and YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
  – Annotate @BEGIN..@END, @IN, @OUT
  – Visualize, share, be happy :-)
• 2. Run script
  – Files are read and written
  – Folder- & Filenames have metadata
• 3. YW-Recon
  – Use @URI tags that link YW Model <=> Persisted Data
  – Run URI-template queries
    • cf. “ls -R” & RegEx matching
• 4. YW-Query
  – Answer the user’s provenance queries
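The "ls -R & RegEx matching" idea in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not YW-Recon's implementation: the helper names template_to_regex and match_template are hypothetical, the "file:" scheme prefix is dropped from the templates, and each {variable} is assumed to match part of a single path segment.

```python
import re

def template_to_regex(template):
    """Compile a URI template like 'raw/{sample_id}/e{energy}' into a regex."""
    seen = set()
    pattern = ""
    # re.split with a capture group alternates literal text and variable names.
    for i, part in enumerate(re.split(r"\{(\w+)\}", template)):
        if i % 2 == 0:
            pattern += re.escape(part)          # literal text
        elif part in seen:
            pattern += f"(?P={part})"           # repeated variable: same value
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+?)"    # new variable: no '/' allowed
    return re.compile(pattern)

def match_template(template, path):
    """Return the variable bindings if path fits the template, else None."""
    m = template_to_regex(template).fullmatch(path)
    return m.groupdict() if m else None

RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_IMAGE = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"

print(match_template(RAW_IMAGE, "run/raw/q55/DRT240/e10000/image_001.raw"))
print(match_template(CORRECTED_IMAGE, "data/DRT322/DRT322_11000eV_030.img"))
print(match_template(RAW_IMAGE, "run/run_log.txt"))
```

Each matching path yields a dictionary of template variables (cassette_id, sample_id, energy, frame_number), which is the metadata that the folder and file names carry; paths that do not fit the template yield None.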


YW annotations: Model your Workflow!


YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
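To make this concrete, here is a rough sketch of how comment annotations could be turned into a workflow model. It is not the YesWorkflow parser: the comment layout in SCRIPT, the lowercase keyword spelling, and the helper names extract_blocks and dataflow_edges are assumptions for illustration, and only flat (non-nested) blocks are handled.

```python
import re

# Hypothetical annotated script fragment; only the comments matter here.
SCRIPT = """
# @begin collect_data_set
# @in accepted_sample
# @out raw_image @uri file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
raw_image = collect(accepted_sample)
# @end collect_data_set

# @begin transform_images
# @in raw_image
# @out corrected_image @uri file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
corrected_image = transform(raw_image)
# @end transform_images
"""

ANNOTATION = re.compile(r"@(begin|end|in|out|uri)\s+(\S+)", re.IGNORECASE)

def extract_blocks(script):
    """Collect one record per @begin..@end block with its ports and URI templates."""
    blocks, stack = [], []
    for line in script.splitlines():
        last_port = None  # a @uri tag applies to the port named earlier on the line
        for keyword, value in ANNOTATION.findall(line):
            keyword = keyword.lower()
            if keyword == "begin":
                stack.append({"name": value, "in": [], "out": [], "uri": {}})
            elif keyword == "end":
                blocks.append(stack.pop())
            elif keyword in ("in", "out"):
                stack[-1][keyword].append(value)
                last_port = value
            elif keyword == "uri" and last_port is not None:
                stack[-1]["uri"][last_port] = value
    return blocks

def dataflow_edges(blocks):
    """Link a producer to a consumer whenever an @out name equals an @in name."""
    return [(p["name"], d, c["name"])
            for p in blocks for d in p["out"]
            for c in blocks if d in c["in"]]

blocks = extract_blocks(SCRIPT)
edges = dataflow_edges(blocks)
print(edges)
```

The edges are what a process view or data view would be drawn from; the executable lines of the script are never inspected, which is why the same approach carries over to R or Matlab comments.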


[Figure: YesWorkflow graph of the data collection script.
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity.
Data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path.
Data with URI templates:
  sample_spreadsheet   file:cassette_{cassette_id}_spreadsheet.csv
  calibration_image    file:calibration.img
  run_log              file:run/run_log.txt
  rejection_log        file:/run/rejected_samples.txt
  raw_image            file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
  corrected_image      file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
  collection_log       file:run/collected_images.csv]

YW!  

Voila!  The  Workflow  revealed!  


Get  3  views  for  the  price  of  1!  


Process  view  

Data  view  

Combined  view  


Paleoclimate Reconstruction (EnviRecon.org)
• … explained using YesWorkflow!

Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."

Provenance Lands
• Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
• Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)

run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

 

YW-RECON: Prospective & Retrospective Provenance … (almost) for free!


• URI templates link conceptual entities to runtime provenance “left behind” by the script author …
• … facilitating provenance reconstruction


Data collection workflow (X-ray diffraction)


Data collection workflow: runtime data


1. YW annotations => YW model
2. Files & folders left by a run => runtime (meta-)data


Q1:  What  samples  did  the  script  run  collect  images  from?  
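A question like Q1 reduces to collecting the distinct sample_id values bound by the raw_image template. The sketch below assumes a hand-written regex equivalent of that template and a short, hard-coded sample of paths from the run directory (a real run would list them, for example with os.walk); the function name samples_with_images is hypothetical.

```python
import re

# Regex form of the raw_image URI template (variables become named groups).
RAW_IMAGE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>[^/]+)/image_(?P<frame_number>[^/]+)\.raw"
)

# A few of the paths left behind by the run.
PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/collected_images.csv",
    "run/run_log.txt",
]

def samples_with_images(paths):
    """Return the sorted sample_ids for which at least one raw image exists."""
    found = set()
    for path in paths:
        m = RAW_IMAGE.fullmatch(path)
        if m:
            found.add(m.group("sample_id"))
    return sorted(found)

print(samples_with_images(PATHS))  # → ['DRT240', 'DRT322']
```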


Q2: What energies were used for image collection from sample DRT322?
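Q2 adds a condition on one template variable while asking for another. A minimal sketch, assuming the same hand-written regex for the raw_image template and a hypothetical helper select that takes the wanted variable plus keyword constraints:

```python
import re

# Regex form of the raw_image URI template (variables become named groups).
RAW_IMAGE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>[^/]+)/image_(?P<frame_number>[^/]+)\.raw"
)

# A few of the paths left behind by the run.
PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

def select(paths, variable, **where):
    """Distinct values of `variable` over paths whose bindings satisfy `where`."""
    values = set()
    for path in paths:
        m = RAW_IMAGE.fullmatch(path)
        if m and all(m.group(k) == v for k, v in where.items()):
            values.add(m.group(variable))
    return sorted(values)

print(select(PATHS, "energy", sample_id="DRT322"))  # → ['10000', '11000']
```

The values come back as strings because they are read from folder names; converting them to numbers would be a further assumption about the naming scheme.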


Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
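Q3 can be read as a join between two URI templates on the variables they share. The sketch below is one way to express that under stated assumptions: both regexes are hand-written from the raw_image and corrected_image templates, the corrected image is addressed as data/... as in its template, and raw_images_for is a hypothetical helper name.

```python
import re

RAW_IMAGE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>[^/]+)/image_(?P<frame_number>[^/]+)\.raw"
)
# sample_id occurs twice in the template, so the second use is a backreference.
CORRECTED_IMAGE = re.compile(
    r"data/(?P<sample_id>[^/]+)/(?P=sample_id)"
    r"_(?P<energy>[^/_]+)eV_(?P<frame_number>[^/_]+)\.img"
)
SHARED = ("sample_id", "energy", "frame_number")

RAW_PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

def raw_images_for(corrected_path, raw_paths):
    """Raw image paths whose shared template variables equal the corrected image's."""
    target = CORRECTED_IMAGE.fullmatch(corrected_path)
    if target is None:
        return []
    hits = []
    for path in raw_paths:
        m = RAW_IMAGE.fullmatch(path)
        if m and all(m.group(v) == target.group(v) for v in SHARED):
            hits.append(path)
    return hits

print(raw_images_for("data/DRT322/DRT322_11000eV_030.img", RAW_PATHS))
```

The workflow model is what justifies the join: transform_images consumes raw_image and produces corrected_image, so matching bindings of sample_id, energy and frame_number identify the upstream file.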

 

Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?


Querying  Provenance  


Data & Workflow Management
1. Large total data footprint
2. Large number of files
3. Large number of simultaneous but independent non-MPI computations
4. Keeping track of what was done to the data: large amount of metadata
5. Workflow bottlenecks: fans & merges; more fans

Source: L. Mainzer, V. Jongeneel (IGB & NCSA)

Taking YW for a spin …
• “To document on-the-fly, specifically for a given workflow configuration invoked:
  – do not insert annotations into code,
  – but rather have code print annotations into a special log during execution,
  – then parse that log!”
  – Liudmila Mainzer

Conclusions & the Road ahead …
• YW: Go where the users are!
  – … they already capture provenance through metadata!
• Beware your level of provenance abstraction
  – Let the user provide a workflow model easily!
• YW-Recon:
  – … finishing support for retrospective provenance without using a runtime provenance recorder!
  – Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
  – Exploit this through annotations: URI templates
  – Extend YW to exploit log files
• YW-GUI:
  – Exploring & experimenting with a UI
    • Let script author immediately see effect of YW annotations
    • Let script (re-)user explore the code and data provenance
• YW-Query:
  – ? Datalog, RPQ, … for querying provenance …
