Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012


DESCRIPTION

Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.



Sven Schlarb, Austrian National Library

Keeping Control: Scalable Preservation Environments for Identification and Characterisation Guimarães, Portugal, 07/12/2012

Large scale preservation workflows with Taverna


What do you mean by "Workflow"?

• Data flow rather than control flow
• (Semi-)automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share


Modularise complex preservation tasks

• Assuming that complex preservation tasks can be separated into processing steps

• Together the steps represent the automated processing pipeline

Example processing steps: Migrate, Characterise, Quality Assurance, Ingest


Experimental workflow development

• Easy to execute a workflow on standard platforms from anywhere

• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity


Taverna

• Workflow language and computational model for creating composite data-intensive processing chains

• Developed since 2004 as a tool for life scientists and bio-informaticians by myGrid, University of Manchester, UK

• Available for Windows/Linux/OSX and as open source (LGPL)


SCUFL/T2FLOW/SCUFL2

• Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)

• SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API

• SCUFL2 will replace the t2flow format (which replaced the SCUFL format)

• Adopts Linked Data technology


Creating workflows using Taverna

• Users interactively build data processing pipelines
• A set of nodes represents the data processing elements
• Nodes are connected by directed edges; the workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows


Processors

• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages); see the Beanshell sketch below
• Remote shell script invocations via ssh (Tool)
• XML splitters, XSLT (interoperability!)
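To illustrate the local script option, here is a minimal sketch of what a Taverna Beanshell script can look like. In a Beanshell service, input ports are bound to variables of the same name and the script assigns the output port variables; the port names fileName and extension are hypothetical, not taken from the slides.

// Taverna Beanshell script (Java syntax), minimal sketch
// Hypothetical input port:  fileName (String)
// Hypothetical output port: extension (String)
int dot = fileName.lastIndexOf('.');
if (dot >= 0) {
    extension = fileName.substring(dot + 1);
} else {
    extension = "";
}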

List handling: implicit iteration over multiple inputs

• A "single value" input port (list depth 0) that receives a list processes the values iteratively (foreach)
• A flat value list has list depth 1; list depth > 1 is used for tree structures
• Multiple input ports with lists are combined as cross product or dot product; for example, lists [a, b] and [1, 2] give (a,1), (a,2), (b,1), (b,2) as cross product and (a,1), (b,2) as dot product


Example: Tika Preservation Component

• Input: "file"
• Processor: Tika web service (SOAP)
• Output: MIME type
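As a hedged illustration of what this characterisation step does, the same MIME type detection can also be performed locally with the Apache Tika Java API instead of the SOAP service shown above; the file path is only an example.

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

public class TikaMimeTypeExample {
    public static void main(String[] args) throws IOException {
        // Detect the MIME type of a local file (example path, not from the slides)
        Tika tika = new Tika();
        String mimeType = tika.detect(new File("sample.jp2"));
        System.out.println(mimeType); // e.g. "image/jp2"
    }
}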


Workflow development and execution

• Local development: Taverna Workbench


Workflow registry

• Web 2.0 style registry: myExperiment


Remote workflow execution

• Web client using the REST API of Taverna Server


Hadoop

• Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)

• Hadoop = MapReduce + HDFS
• HDFS: distributed file system, data stored in 64 MB (default) blocks


Hadoop

• Job tracker (master) manages job execution on task trackers (workers)

• Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)

• Name node manages HDFS, i.e. distribution of data blocks on data nodes


Hadoop job building blocks

A map/reduce application (JAR) consists of three building blocks:

• Job configuration: set or overwrite configuration parameters
• Map method: create intermediate key/value pair output
• Reduce method: aggregate the intermediate key/value pair output from map
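To make these building blocks concrete, here is a minimal, hedged sketch of a MapReduce application in Java. It is not SCAPE code; the class names and the assumed input layout (one "pageId TAB value" line per page, in the style of the Exiftool output shown later) are illustrative only. The job counts pages per book.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PagesPerBookJob {

    // Map method: create intermediate key/value pair output.
    // Assumes one input line per page, e.g. "Z119585409/00000001<TAB>2345".
    public static class PageMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String pageId = line.toString().split("\t")[0]; // "Z119585409/00000001"
            String bookId = pageId.split("/")[0];           // "Z119585409"
            context.write(new Text(bookId), ONE);
        }
    }

    // Reduce method: aggregate the intermediate key/value pair output from map.
    public static class PageCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text bookId, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int pages = 0;
            for (IntWritable one : ones) {
                pages += one.get();
            }
            context.write(bookId, new IntWritable(pages));
        }
    }

    // Job configuration: set or overwrite configuration parameters, wire map and reduce together.
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "pages-per-book");
        job.setJarByClass(PagesPerBookJob.class);
        job.setMapperClass(PageMapper.class);
        job.setReducerClass(PageCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}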


Cluster: large scale execution environment

[Architecture diagram: Apache Tomcat web application, Taverna Server (REST API), Hadoop Jobtracker, file server, cluster nodes]

Example: Characterisation on a large document collection

• Using the "Tool" service, remote ssh execution
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive)
• Available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna


Reading image metadata

Create a text file containing the JPEG2000 input file paths and read the image metadata using Exiftool via the Hadoop Streaming API.

• Jp2PathCreator: a find run over the NAS produces a text file of JP2 paths, e.g.
  /NAS/Z119585409/00000001.jp2
  /NAS/Z119585409/00000002.jp2
  /NAS/Z119585409/00000003.jp2
  ...
• HadoopStreamingExiftoolRead: a Hadoop Streaming job runs Exiftool on each file and writes one record per page, e.g.
  Z119585409/00000001 2345
  Z119585409/00000002 2340
  Z119585409/00000003 2543
  ...
• Intermediate files of about 1.4 GB and 1.2 GB
• Runtime (reading files from NAS): ~5 h + ~38 h = ~43 h for 60,000 books (24 million pages)


SequenceFile creation

Create a text file containing the HTML input file paths and create one sequence file with the complete file content in HDFS.

• HtmlPathCreator: a find run over the NAS produces a text file of HTML paths, e.g.
  /NAS/Z119585409/00000707.html
  /NAS/Z119585409/00000708.html
  /NAS/Z119585409/00000709.html
  ...
• SequenceFileCreator: reads the complete file content from the NAS and writes it into one SequenceFile in HDFS, keyed by page identifier (Z119585409/00000707, Z119585409/00000708, ...)
• Path list about 1.4 GB; file content about 997 GB (uncompressed)
• Runtime (reading files from NAS): ~5 h + ~24 h = ~29 h for 60,000 books (24 million pages)
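A hedged sketch of what such a SequenceFile creation step could look like with the Hadoop Java API follows; it is not the SCAPE SequenceFileCreator itself, and the input list (htmlpaths.txt) and output path are assumed names.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileCreatorSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS output path for the sequence file (assumed name)
        Path output = new Path("/user/scape/htmlfiles.seq");
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, output, Text.class, BytesWritable.class);
        try {
            // One local HTML file path per line, as produced by the path-creator step (assumed file name)
            List<String> htmlPaths = Files.readAllLines(Paths.get("htmlpaths.txt"));
            for (String htmlPath : htmlPaths) {
                byte[] content = Files.readAllBytes(Paths.get(htmlPath.trim()));
                // Key: record identifier, value: complete file content
                writer.append(new Text(htmlPath.trim()), new BytesWritable(content));
            }
        } finally {
            writer.close();
        }
    }
}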


HTML Parsing

Execute a Hadoop MapReduce job that uses the sequence file created before to calculate the average paragraph block width.

• HadoopAvBlockWidthMapReduce reads the SequenceFile and writes a plain text file
• Map: emits one width value per paragraph block and page, e.g.
  Z119585409/00000001 2100
  Z119585409/00000001 2200
  Z119585409/00000001 2300
  Z119585409/00000001 2400
• Reduce: aggregates the block widths into one average per page, e.g.
  Z119585409/00000001 2250
• Runtime: ~6 h for 60,000 books (24 million pages)
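A hedged sketch of how the averaging step of such a job might look in Java; this is not the actual HadoopAvBlockWidthMapReduce implementation, just an illustration that assumes the mapper has already emitted one (pageId, blockWidth) pair per paragraph block.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Averages all block widths emitted for one page identifier (illustrative, not SCAPE code)
public class AvBlockWidthReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pageId, Iterable<IntWritable> blockWidths, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        int count = 0;
        for (IntWritable width : blockWidths) {
            sum += width.get();
            count++;
        }
        if (count > 0) {
            // e.g. Z119585409/00000001 -> 2250
            context.write(pageId, new IntWritable((int) (sum / count)));
        }
    }
}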


Analytic Queries

Create the Hive tables and load the generated data into the Hive database (HiveLoadExifData & HiveLoadHocrData).

CREATE TABLE jp2width (hid STRING, jwidth INT)
CREATE TABLE htmlwidth (hid STRING, hwidth INT)

Each table holds one (page identifier, width) row per page, e.g. Z119585409/00000001 1870 and Z119585409/00000001 2250.

(~6 h; 60,000 books, 24 million pages)

Analytic Queries

HiveSelect joins the two tables on the page identifier:

select jp2width.hid, jwidth, hwidth
from jp2width inner join htmlwidth on (jp2width.hid = htmlwidth.hid)

(~6 h; 60,000 books, 24 million pages)

Run a simple Hive query to test whether the database has been created successfully.


Example: Web Archiving



Hands on – Virtual machine

• Hadoop 0.20.2+923.421, pseudo-distributed configuration
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
• Latest patches


Hands on – VM setup

• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => Browse to folder => select the VBOX file
• VM instance login: user: scape, pw: scape123


Hands on – Demo1

• Using Hadoop for analysing ARC files
• Input located at /example/sampleIN/ (HDFS)
• Execution via the command in SampleHadoopCommand.txt (on the Desktop)
• The result can then be found at /example/sample_OUT/


Hands on – Demo2

• Using Taverna for analysing ARC files
• Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
• Use ADD FILE LOCATION (not "add value"!)
• Input: /home/scape/scanARC/input/ONBSample.txt
• Result: ~/scanARC/outputCSV/fullTIKAReport.csv
• See ~/scanARC/outputGraphics/graphicsTIKA/tika-