2010, Boston
Community-driven computational biology with Debian and Taverna
Steffen Möller, Hajo Krabbenhöft (Lübeck)
Alan Williams, Katy Wolstencroft, Carole Goble (Manchester)
Andreas Tille, Charles Plessy, David Paleino (Debian)
BOSC 2010, Boston
Motivation
● Open Source Bioinformatics continues to grow and improve
    ● steadily increasing number of tools and databases
    ● addressing more and more complex issues
● Bioinformatics has entered the wet-lab routine
    ● strong service units with many diverse projects
    ● single, deeply embedded individuals
● Wanted:
    ● Exchange of bioinformatics recipes, as a database or eventually linked from papers' method sections
    ● Reliable, instantly available, powerful external resources to perform analyses
Dual role of Cloud technologies
● Sharing of physical resources
    ● Computation
    ● Storage
● Sharing of management resources
    ● Reference images
    ● Pre-downloaded, pre-indexed data
        – Amazon public data sets
        – “whatever BOSC 2010 agrees on” for our Eucalyptus playground
How to Co-Maintain Cloud Images
● Cloud images can be maintained just like regular machines
● The installation of many tools by many people
    ● works, you get somewhere, but then you don't want to touch it again
    ● is error-prone because of inter-dependencies of packages (shared files, version incompatibilities)
● The partial update of such co-maintained images
    ● will most likely break something somewhere → modularity
    ● you want to know what has been done to an image without a dependency on external web pages → introspection
How to Co-Maintain Cloud Images
Wanted:
● Mechanism to allow the individual upgrading of software tools and integrity checks
● Sharing of the effort
    – to compile the source code – one wants to install the binaries only whenever possible
    – to describe the packages – should be of little overhead or be already available
This is basically what Linux distributions do.
Dual role of Debian
● Package provider
    ● many tens of thousands of packages are offered
        – directly as a Linux distribution
        – indirectly via descendants Ubuntu or BioLinux
    ● technical excellence
        – coherent builds across many platforms (PowerPC, Intel 32 and 64 bit, AMD, MIPS) and kernels (Linux, HURD, BSD, OpenSolaris)
        – separation of documentation from binaries, GUI from command line, ...
● Community
    ● bug reports
    ● mailing lists, special interest groups, where you may discuss
        – packages that are missing
        – problems that many of us have that are yet unsolved
bioinformatics blend
● subversion and git repositories for packages
● friendly and open community
● keen on close links with upstream
● Series of tasks within Debian Med – not only bioinformatics:
    Biology - Debian Med micro-biology packages
    Biology development - Debian Med packages for development of micro-biology applications
    Content management - Debian Med content management systems
    Medical data - Debian Med suggestions for medical databases
    Dental - Debian Med packages related to dental practice
    Epidemiology - Debian Med epidemiology related packages
    Hospital information systems - Debian Med suggestions for Hospital Information Systems
    Imaging - Cross-platform for visualizing, processing and analysing of bioimages
    Imaging development - Debian Med packages for medical image development
    Laboratory - Debian Med suggestions for medical laboratories
    Pharmacy - Debian Med packages for pharmaceutical research
    Physics - Debian Med packages for medical physicists
    Practice - Debian Med packages for practice management
    Psychology - Debian Med packages for psychology
    Statistics - Debian Med statistics
    Tools - Debian Med several tools
    Typesetting - Debian Med support for typesetting and publishing
How to Co-Maintain a Debian Package
● Technically
    ● Do not touch the original source tree
    ● Create folder “debian” with files
        – “control” - description of package + build deps
        – “changelog” - version of package and what's new
        – “rules” - how to say “make” and “make install”
        – “install” - to split documentation from the rest
      Should not be more difficult than executing “make all” directly; contact me or the list when running into problems.
    ● FTP-upload of package to distribution's server
    ● Sharing of “debian” folder with community with subversion/git/bazaar
● Community-driven security
    ● Web of trust: Creator of package signs with his GPG key prior to upload, GPG key is signed by others
    ● Bug reports may block transition of package to “stable” release
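The four files named above can be illustrated with a small Python sketch that writes a minimal “debian” folder next to an untouched source tree. This is only an illustration under assumptions: the package name “mytool”, the maintainer address, the version “0.1-1” and the “doc/*” path are hypothetical, and a real package typically needs further files (for example “copyright”).

```python
from pathlib import Path


def write_debian_skeleton(source_tree, package="mytool",
                          maintainer="Jane Doe <jane@example.org>"):
    """Write a minimal debian/ folder; the upstream sources stay untouched.

    Package name, maintainer and version are hypothetical placeholders.
    """
    debian = Path(source_tree) / "debian"
    debian.mkdir(parents=True, exist_ok=True)

    # "control": description of the package plus its build dependencies
    (debian / "control").write_text(
        f"Source: {package}\n"
        "Section: science\n"
        "Priority: optional\n"
        f"Maintainer: {maintainer}\n"
        "Build-Depends: debhelper (>= 7)\n"
        "\n"
        f"Package: {package}\n"
        "Architecture: any\n"
        "Depends: ${shlibs:Depends}, ${misc:Depends}\n"
        "Description: hypothetical example tool\n"
        " Longer description of the tool goes here.\n"
    )

    # "changelog": version of the package and what is new
    (debian / "changelog").write_text(
        f"{package} (0.1-1) unstable; urgency=low\n"
        "\n"
        "  * Initial packaging.\n"
        "\n"
        f" -- {maintainer}  Mon, 05 Jul 2010 12:00:00 +0000\n"
    )

    # "rules": a makefile that delegates "make" and "make install" to debhelper
    rules = debian / "rules"
    rules.write_text("#!/usr/bin/make -f\n%:\n\tdh $@\n")
    rules.chmod(0o755)

    # "install": which files go where, e.g. to split off the documentation
    (debian / "install").write_text(f"doc/* usr/share/doc/{package}\n")

    return debian


if __name__ == "__main__":
    import tempfile
    folder = write_debian_skeleton(tempfile.mkdtemp())
    print(sorted(p.name for p in folder.iterdir()))
```

Keeping everything inside “debian” is what makes the sharing via subversion/git/bazaar cheap: only this folder needs to be versioned by the community.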
Something's missing
● We now have the resources.
    ● packages that auto-transform into Cloud images
    ● machines and disk to compute and store in-/output
● We have quite some Bio* community
● Wanted:
    ● Linking of cloud resources with the desktop
    ● Linking of web resources into it
    ● Exchange and reference of
        – inter-package
        – inter-resource
      processes that (have) work(ed for someone) and may be adapted
Dual role of Taverna
● Technology:
    ● Connects files, web services and applications to workflows
    ● Workflows may comprise other workflows
● Community:
    Portal to complete and partial solutions as workflows on myExperiment.org
Taverna integrates command line
● Any command executed in the shell can be integrated
    ● local execution, remote execution with ssh or grid
    ● nicely links clouds, packages and web
● Introduction of UseCases as workflow elements
    ● Database with XML-specification of
        – Inputs, Outputs and their MIME types
        – Command line and tools it needs
    ● Purpose-specific wrappers around binaries or scripts
Krabbenhöft et al., Bioinformatics, 2008
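To make the idea of a UseCase description concrete, here is a minimal Python sketch of a record holding inputs, outputs, their MIME types, the command line and the tools it needs, serialized to XML. The element and attribute names (“usecase”, “input”, “output”, “requires”, “mime”) are assumptions made for illustration and are not the schema used by the Taverna plugin; the wrapped “sort” command is just a stand-in.

```python
import shlex
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class UseCase:
    """Hypothetical description of a wrapped command line tool."""
    name: str
    command: str                                   # template with {input} placeholders
    inputs: dict = field(default_factory=dict)     # input name -> MIME type
    outputs: dict = field(default_factory=dict)    # output name -> MIME type
    requires: list = field(default_factory=list)   # tools the command needs

    def to_xml(self) -> str:
        """Serialize to XML; all tag names here are illustrative only."""
        root = ET.Element("usecase", name=self.name)
        ET.SubElement(root, "command").text = self.command
        for key, mime in self.inputs.items():
            ET.SubElement(root, "input", name=key, mime=mime)
        for key, mime in self.outputs.items():
            ET.SubElement(root, "output", name=key, mime=mime)
        for tool in self.requires:
            ET.SubElement(root, "requires", tool=tool)
        return ET.tostring(root, encoding="unicode")

    def render(self, **files) -> str:
        """Fill the placeholders with shell-quoted file names."""
        quoted = {key: shlex.quote(path) for key, path in files.items()}
        return self.command.format(**quoted)


sort_lines = UseCase(
    name="sort_lines",
    command="sort {lines}",
    inputs={"lines": "text/plain"},
    outputs={"STDOUT": "text/plain"},
    requires=["coreutils"],
)

if __name__ == "__main__":
    print(sort_lines.to_xml())
    print(sort_lines.render(lines="my file.txt"))
```

Declaring MIME types on inputs and outputs is what would let a workflow editor check that two such elements can be connected.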
Shared UseCase management
Example: Clustering many sequences
● Compute times of several hours are generally not acceptable for public web services
● Not a problem with integrated clouds
[Figure: diagram with the labels “Cloud Image Selection”, “apt-get install t-coffee”, “Start instance”, “Local”, “Cloud”, “Inform Taverna about IP number”, “Workflow Execution”, “Results Interpretation”]
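The steps of this example can be sketched as the ssh commands one might issue against a freshly started instance. This is a hedged illustration, not Taverna's actual mechanism: the user name “ubuntu”, the address 192.0.2.10 (a documentation-only IP) and the file name are assumptions, and the sketch only builds the command lists without executing them.

```python
import shlex


def plan_remote_alignment(host, fasta_path, user="ubuntu"):
    """Return the ssh commands for installing and running T-Coffee remotely.

    host is the IP number of the started cloud instance; user and
    fasta_path are hypothetical placeholders.
    """
    target = f"{user}@{host}"
    # Step 1: add the tool to the running image through the package manager
    install = ["ssh", target, "sudo apt-get install -y t-coffee"]
    # Step 2: run the alignment on the instance (the binary is named t_coffee)
    run = ["ssh", target, "t_coffee " + shlex.quote(fasta_path)]
    return [install, run]


if __name__ == "__main__":
    for argv in plan_remote_alignment("192.0.2.10", "seqs.fasta"):
        print(" ".join(shlex.quote(part) for part in argv))
```

Each returned list could be handed to `subprocess.run` once the instance is reachable; long compute times then burden the cloud instance rather than a public web service.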
Remaining challenge: sharing public data
● Could work like the management of software, but
    ● Often large with frequent updates; users differ in their demands for latest versions
    ● Involves post-processing; users differ in their demand to perform such
● Clouds could help, but
    ● one would not want to pay for everything all the time
    ● the installation process would need to be transparent to locally recreate or update or … improve the data
Proposal: getData, a shared Perl script
● The script is a large hash table
    ● extendable by configuration files that may be contributed from various packages, like EMBOSS
● Every entry comprises another hash table with attributes
    – Name – full name of database
    – Source – how to retrieve it
    – Post-download – what to do once it has arrived
    – Recommends – tools suggested to install with the data
● All very simple and extendable
    ● Direct mirroring of effort performed on the command line
    ● The community can co-maintain this script more easily than some cloud instance
    ● More on http://wiki.debian.org/getData
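The nested hash table described here can be mirrored in a few lines of Python. This is a sketch of the data structure only, not the code of the actual Perl script: the database keys, the example.org URL and the shell commands are hypothetical, and the attribute names are lower-cased versions of those listed above.

```python
# Outer table: one entry per database; inner table: its attributes.
# All entries are hypothetical examples.
DATABASES = {
    "exampledb": {
        "name": "Example sequence database (hypothetical)",
        "source": "wget -N http://example.org/exampledb.fasta.gz",
        "post-download": ["gunzip -f exampledb.fasta.gz"],
        "recommends": ["emboss"],
    },
}


def merge_entries(base, contributed):
    """Extend the table with entries contributed by another package.

    Contributed attributes override existing ones; base is left unchanged.
    """
    merged = {key: dict(entry) for key, entry in base.items()}
    for key, entry in contributed.items():
        merged.setdefault(key, {}).update(entry)
    return merged


def plan(db_key, table):
    """List the shell commands for one database: retrieval, then post-processing."""
    entry = table[db_key]
    return [entry["source"], *entry.get("post-download", [])]


if __name__ == "__main__":
    contributed = {
        "otherdb": {
            "name": "Another database (hypothetical)",
            "source": "wget -N http://example.org/otherdb.dat",
        },
    }
    table = merge_entries(DATABASES, contributed)
    for command in plan("exampledb", table):
        print(command)
```

Because each entry records the same commands one would type by hand, a user can recreate or update the data locally instead of depending on a pre-built cloud image.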
Summary
● Debian as community and repository for bioinformatics software
    ● Mailing lists, source code management
    ● FTP servers
● Clouds introduce dynamics into the collaboration
    ● Data flow between packages
    ● Usability
    ● Shared maintenance of public data
● Taverna
    ● Connects web, grid, cloud instances and local machine
    ● Fosters exchange of experiences with various workflows
References and Acknowledgements
[1] Debian-Med http://debian-med.alioth.debian.org
[2] getData http://wiki.debian.org/getData
[3] Eucalyptus http://www.eucalyptus.com
[4] Taverna http://www.taverna.org.uk
[5] Taverna UseCases http://taverna.nordugrid.org
[6] myExperiment http://www.myExperiment.org
The development of the UseCases plugin for Taverna was funded by the “KnowARC” EU project.
Debian/Ubuntu contributes
● Impressive number of packages
    ● Bioinformatics (Bio*, EMBOSS, clustering, ...)
    ● Cheminformatics (autodock, gromacs, ballview, …)
    ● General scientific computing tools and libraries
        – Clustering (Torque, Sun Grid Engine, ...)
        – Eucalyptus Cloud environment
● Automation of database updates and indexing with the “getData” script
Concept: Distro+Workflows+Cloud
● Debian/Ubuntu Linux Distribution
    ● Chem- + Bioinformatics packages
    ● Friendly Community
● Taverna Workflow Suite
    ● Access to services in the web
    ● Access to command line tools via ssh or grids
    ● Exchange of ideas via myExperiment.org
● Eucalyptus or Amazon Clouds
    ● Sharing of databases and indices
    ● Readily available or customized images to instantiate
The Cloud contributes
A platform for individuals to share
    ● Data (“download only once”)
    ● Its management (“update and index only once”)
    ● Experiences (“I show you”)
Physical resources
    ● To be shared in community (“common cluster”)
    ● To be bought on demand (“run at Amazon.com”)
Solutions
    ● Readily usable images – by community or industry
    ● Adaptability to local demands