SCAPE
Preservation Workflows with Taverna
Clemens Neudecker, Research Department (Afdeling Onderzoek), Koninklijke Bibliotheek
I&O Kennissessie (knowledge session), 28 November 2012
SCAPE: SCAlable Preservation Environments
Background
• What is a scientific workflow?
  • "The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules."
• Background: eSciences, in particular Life Sciences
• Two approaches
  • Data driven (what)
  • Control driven (how)
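The difference between the two approaches can be illustrated with a minimal Python sketch (Python 3.9+ for `graphlib`). The step names (`identify`, `validate`, `report`) are hypothetical and not taken from any Taverna workflow. In the data-driven version only the data dependencies are declared and the execution order is derived from them; the control-driven version spells out the order explicitly.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each receives the results so far and returns its own result.
def identify(results):
    return "image/jp2"

def validate(results):
    return results["identify"] == "image/jp2"

def report(results):
    return f"{results['identify']}: valid={results['validate']}"

STEPS = {"identify": identify, "validate": validate, "report": report}

# Data driven: declare WHAT each step needs; the order follows from the data.
NEEDS = {
    "identify": set(),
    "validate": {"identify"},
    "report": {"identify", "validate"},
}

def run_data_driven():
    results = {}
    for name in TopologicalSorter(NEEDS).static_order():
        results[name] = STEPS[name](results)
    return results

# Control driven: state HOW to run, i.e. the explicit sequence of steps.
def run_control_driven():
    results = {}
    for name in ["identify", "validate", "report"]:
        results[name] = STEPS[name](results)
    return results
```

Both functions produce the same results here; they differ only in where the ordering knowledge lives.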
Background II
• Why use scientific workflows?
  • Automation of repetitive processes
  • Chaining of distinct components (interoperability)
  • "In-silico experimentation"
  • Documented experiment configuration
  • Re-usable by others (encapsulation)
Background III
• Scientific Workflow Management Systems
  • Taverna (myGrid, UK)
  • Kepler (Kepler, USA)
  • Meandre (SEASR, USA)
  • and there are many more…
• Why Taverna?
  • Good experience in IMPACT, open source
  • European partner (University of Manchester, UK)
  • Widely used (> 4000 active users)
  • Shields complexity from end-user
Excursus: IMPACT
• EU FP7 project on OCR, coordinated by the KB
• Prototyping use of scientific workflows in digitization
• Some components being further developed in SCAPE
• See also http://impact.kbresearch.nl/
…but back to digital preservation
• Example use cases for scientific workflows
  • File format identification/migration/validation
  • Tool evaluation
  • Quality assurance
  • …and many more!
Enter Taverna
• Web services (SOAP, REST)
• Beanshells (Java scripting, libraries)
• R (statistics)
• Local tools (SH/SSH)
• Excel/CSV
• Plugins
Examples
Validate JPEG2000 files with Jpylyzer, convert invalid JP2s based on TIFF masters, and validate the derived JP2s again using Jpylyzer.
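The control flow of this workflow can be sketched in Python as follows. This is a minimal sketch, not the Taverna workflow itself: `validate` and `convert` are hypothetical callables standing in for a Jpylyzer call and a TIFF-to-JP2 converter, and the assumption that a JP2 and its TIFF master share a file stem is made only for illustration.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

def revalidate_jp2s(
    jp2_files: Iterable[Path],
    tiff_dir: Path,
    validate: Callable[[Path], bool],       # hypothetical wrapper, e.g. around Jpylyzer
    convert: Callable[[Path, Path], None],  # hypothetical TIFF -> JP2 converter
) -> Dict[str, str]:
    """Validate each JP2; re-create invalid ones from the TIFF master and validate again."""
    status = {}
    for jp2 in jp2_files:
        if validate(jp2):
            status[jp2.name] = "valid"
            continue
        # Assumption: the TIFF master has the same file stem as the JP2.
        master = tiff_dir / (jp2.stem + ".tif")
        convert(master, jp2)
        status[jp2.name] = "migrated" if validate(jp2) else "failed"
    return status
```

Passing the two tools in as callables keeps the sketch independent of any particular command line or library version.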
Examples
Apply Matchbox book page image duplicate detection to a list of books from the Google Books project.
Examples
Takes a list of ARC files as input and creates a MIME type report per ARC and a summary report over all ARCs using Tika.
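The reporting step amounts to counting MIME types per file and then merging the counts. A minimal Python sketch, assuming the MIME types have already been detected (for example by Tika) and are available as a plain list per ARC file; the function name and the file names in the usage example are hypothetical:

```python
from collections import Counter
from typing import Dict, List, Tuple

def mime_reports(detected: Dict[str, List[str]]) -> Tuple[Dict[str, Counter], Counter]:
    """Return one MIME type count per ARC file plus a summary over all ARC files."""
    per_arc = {arc: Counter(types) for arc, types in detected.items()}
    summary = Counter()
    for counts in per_arc.values():
        summary.update(counts)
    return per_arc, summary

per_arc, summary = mime_reports({
    "crawl-1.arc": ["text/html", "text/html", "image/png"],
    "crawl-2.arc": ["text/html", "application/pdf"],
})
print(summary.most_common(1))  # [('text/html', 3)]
```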
Examples
Validating the WAV file format using the JHOVE2 web service.
Scalability
• Taverna workflows on Hadoop
  • Hadoop = open source Map/Reduce implementation (Apache, developed largely at Yahoo)
  • Idea: Execute workflows on a Hadoop cluster
  • Mainly responsible: AIT, UMAN
  • Clusters: IMF, ONB, KB, SB
• Some problems:
  • Scheduling: Hadoop (1 big jar) or Taverna (many small jars)?
  • Error handling (long running automated workflows)
  • List handling (cross product vs. dot product)
  • "Small files problem" → Hadoop SequenceFile
• OPF Blog: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
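The list handling issue concerns how a step with several list inputs is iterated: a cross product runs the step for every combination of elements, a dot product pairs elements by position. In plain Python the two strategies correspond to `itertools.product` and `zip`; the file and format names below are made up for illustration:

```python
from itertools import product

files = ["a.tif", "b.tif"]
formats = ["jp2", "png"]

# Cross product: every file is combined with every format (2 x 2 = 4 runs).
cross = list(product(files, formats))

# Dot product: elements are paired by position (2 runs).
dot = list(zip(files, formats))

print(len(cross), len(dot))  # 4 2
```

On large inputs the choice matters: for two lists of n elements each, a cross product yields n² invocations, a dot product only n.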
Examples
Workflow for preparing large document collections for data analysis. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive) are used (ONB).
Processing time for 60,000 books / 24 million pages: 6 h
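Hadoop Streaming runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output. The following Python sketch shows a mapper and a reducer of that shape; it is not the ONB job, and the input layout (one `book_id<TAB>page_text` line per page) is an assumption made for illustration. It counts pages per book.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'book_id<TAB>1' for every input page line."""
    for line in lines:
        book_id, _, _page_text = line.rstrip("\n").partition("\t")
        if book_id:
            yield f"{book_id}\t1"

def reduce_lines(lines):
    """Reducer: sum the counts per key; expects the lines sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for book_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{book_id}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorted() stands in for Hadoop's sort between map and reduce.
mapped = sorted(map_lines(["b1\tpage one\n", "b2\tpage one\n", "b1\tpage two\n"]))
print(list(reduce_lines(mapped)))  # ['b1\t2', 'b2\t1']
```

In an actual streaming job each function would be fed from `sys.stdin` in its own script, with the framework handling the sort in between.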
Demo(s)
Want some more?
• SCAPE source code on GitHub: github.com/openplanets/scape
• SCAPE for Developers: SCAPE Developer's Guide
• SCAPE Platform: SCAPE Preservation Execution Platform
• SCAPE workshops, hackathons: check with us! http://www.scape-project.eu/events