SCAPE: Preservation Workflows with Taverna. Clemens Neudecker, Research Department (Afdeling Onderzoek), Koninklijke Bibliotheek. I&O Knowledge Session (Kennissessie), 28 November 2012. SCAPE: SCAlable Preservation Environments.

Page 1: Preservation Workflows with Taverna

Preservation Workflows with Taverna
Clemens Neudecker, Research Department (Afdeling Onderzoek), Koninklijke Bibliotheek
I&O Knowledge Session (Kennissessie), 28 November 2012
SCAPE: SCAlable Preservation Environments

Page 2: Preservation Workflows with Taverna

Page 3: Preservation Workflows with Taverna

Background

• What is a scientific workflow?
  • “The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.”
  • Background: eSciences, in particular Life Sciences
• Two approaches
  • Data driven (what)
  • Control driven (how)

Page 4: Preservation Workflows with Taverna

Background II

• Why use scientific workflows?
  • Automation of repetitive processes
  • Chaining of distinct components (interoperability)
  • “In-silico experimentation”
  • Documented experiment configuration
  • Re-usable by others (encapsulation)

Page 5: Preservation Workflows with Taverna

Background III

• Scientific Workflow Management Systems
  • Taverna (myGrid, UK)
  • Kepler (Kepler, USA)
  • Meandre (SEASR, USA)
  • and there are many more…
• Why Taverna?
  • Good experience in IMPACT, Open source
  • European partner (University of Manchester, UK)
  • Widely used (> 4000 active users)
  • Shields complexity from end-user

Page 6: Preservation Workflows with Taverna

Excursus: IMPACT

• EU FP7 project on OCR, coordinated by the KB
• Prototyping use of scientific workflows in digitization
• Some components being further developed in SCAPE
• See also http://impact.kbresearch.nl/

Page 7: Preservation Workflows with Taverna

…but back to digital preservation

• Example use cases for scientific workflows
  • File format identification/migration/validation
  • Tool evaluation
  • Quality assurance
  • …and many more!

Page 8: Preservation Workflows with Taverna

Enter Taverna

• Web services (SOAP, REST)
• Beanshells (Java scripting, libraries)
• R (statistics)
• Local tools (SH/SSH)
• Excel/CSV
• Plugins

Page 9: Preservation Workflows with Taverna

Components I

Taverna Workbench

Page 10: Preservation Workflows with Taverna

Components II

Taverna Server

Page 11: Preservation Workflows with Taverna

Components III

SCAPE Catalogue

Page 12: Preservation Workflows with Taverna

Components IV

myExperiment

Page 13: Preservation Workflows with Taverna

Examples

Validate JPEG2000 files with Jpylyzer, convert invalid JP2s based on the TIFF masters, and validate the derived JP2s again using Jpylyzer.
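The validate, convert, re-validate loop described on this slide can be sketched in a few lines of Python. This is only an illustration of the control flow, not the actual Taverna workflow: `validate_and_repair`, `is_valid` and `convert` are hypothetical names, and the two callables stand in for whatever wraps Jpylyzer and the JPEG2000 encoder in a real setup.

```python
from typing import Callable, Dict, List


def validate_and_repair(
    jp2_to_tiff: Dict[str, str],
    is_valid: Callable[[str], bool],
    convert: Callable[[str, str], None],
) -> Dict[str, List[str]]:
    """Validate each JP2; re-create invalid ones from the TIFF master; validate again.

    is_valid and convert are injected (hypothetical) callables, e.g. thin
    wrappers around a validator and an encoder.
    """
    report: Dict[str, List[str]] = {"valid": [], "repaired": [], "failed": []}
    for jp2_path, tiff_path in jp2_to_tiff.items():
        if is_valid(jp2_path):
            report["valid"].append(jp2_path)
            continue
        # The JP2 is invalid: derive a new one from its TIFF master.
        convert(tiff_path, jp2_path)
        # Validate the derived JP2 again before accepting it.
        if is_valid(jp2_path):
            report["repaired"].append(jp2_path)
        else:
            report["failed"].append(jp2_path)
    return report
```

Passing the validator and the converter in as parameters keeps the sketch independent of any particular tool and makes the three outcomes (valid, repaired, failed) easy to check with stand-in functions.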

Page 14: Preservation Workflows with Taverna

Examples

Apply Matchbox Book Page Images Duplicate Detection to a list of books from the Google Books Project.

Page 15: Preservation Workflows with Taverna

Examples

Takes a list of ARC files as input and creates a MIME type report per ARC and a summary report over all ARCs using Tika.
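The reporting step of this workflow (one report per ARC plus a summary over all ARCs) amounts to counting MIME types at two levels. A minimal Python sketch, assuming the MIME type of every record has already been identified (the job Tika does in the workflow); `mime_reports` and the `(arc_name, mime_type)` record layout are hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def mime_reports(
    records: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Counter], Counter]:
    """Build a per-ARC MIME type count and a summary count over all ARCs.

    records yields (arc_name, mime_type) pairs; identification itself is
    assumed to have happened upstream.
    """
    per_arc: Dict[str, Counter] = {}
    summary: Counter = Counter()
    for arc_name, mime_type in records:
        per_arc.setdefault(arc_name, Counter())[mime_type] += 1
        summary[mime_type] += 1
    return per_arc, summary
```

Because the summary is just the sum of the per-ARC counts, the same structure maps naturally onto a map step (count per ARC) and a reduce step (merge the counts).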

Page 16: Preservation Workflows with Taverna

Examples

Validating WAV file format using the JHOVE2 web service.

Page 17: Preservation Workflows with Taverna

Scalability

• Taverna workflows on Hadoop
  • Hadoop = open-source Map/Reduce implementation (Apache project, developed largely at Yahoo)
  • Idea: Execute workflows on a Hadoop cluster
  • Mainly responsible: AIT, UMAN
  • Clusters: IMF, ONB, KB, SB
• Some problems:
  • Scheduling: Hadoop (1 big jar) or Taverna (many small jars)?
  • Error handling (long running automated workflows)
  • List handling (cross product vs. dot product)
  • “Small files problem”: Hadoop SequenceFile
• OPF Blog: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
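The list handling problem (cross product vs. dot product) is about how several input lists are combined into individual invocations. A minimal Python sketch of the two strategies follows; the function names are hypothetical, and rejecting lists of unequal length in the dot product is a choice made for this sketch, not a statement about how Taverna behaves.

```python
from itertools import product
from typing import List, Sequence, Tuple


def cross_product(*input_lists: Sequence) -> List[Tuple]:
    """Every element of each list combined with every element of the others."""
    return list(product(*input_lists))


def dot_product(*input_lists: Sequence) -> List[Tuple]:
    """Elements combined position by position (first with first, and so on)."""
    if len({len(lst) for lst in input_lists}) > 1:
        # Sketch-level design choice: fail loudly instead of silently truncating.
        raise ValueError("dot product needs input lists of equal length")
    return list(zip(*input_lists))
```

With two lists of n and m items, the cross product yields n × m invocations while the dot product yields n (with n = m), so picking the wrong strategy on a large collection either multiplies the workload or pairs up the wrong inputs.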

Page 18: Preservation Workflows with Taverna

Examples

Workflow for preparing large document collections for data analysis. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive) are used (ONB).

Processing time for 60,000 books / 24 million pages: 6 h
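To put the reported figures into perspective, the throughput can be derived directly from the three numbers on this slide. The sketch below only does that arithmetic; `throughput` is a hypothetical helper, and it assumes the 6 hours are wall-clock time for the whole collection.

```python
from typing import Dict


def throughput(books: int, pages: int, hours: float) -> Dict[str, float]:
    """Derive simple throughput figures from collection size and run time."""
    return {
        "pages_per_book": pages / books,
        "books_per_hour": books / hours,
        "pages_per_second": pages / (hours * 3600),
    }


# Figures taken from the slide: 60,000 books, 24 million pages, 6 hours.
figures = throughput(books=60_000, pages=24_000_000, hours=6)
print(figures)
```

This works out to 400 pages per book on average, 10,000 books per hour, and roughly 1,100 pages per second across the cluster.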

Page 19: Preservation Workflows with Taverna

Demo(s)

Page 20: Preservation Workflows with Taverna

Want some more?

• SCAPE source code on GitHub: github.com/openplanets/scape
• SCAPE for Developers: SCAPE Developer's Guide
• SCAPE Platform: SCAPE Preservation Execution Platform
• SCAPE workshops, hackathons: check with us! http://www.scape-project.eu/events