Sequence analysis and bioinformatics using Debian GNU/Linux

Andreas Tille

Libre Software Meeting

LSM, Nantes 2009


Overview

1 Debian Med
  Debian Pure Blend for medical care and health science
  Why Debian

2 Implementation
  Available packages
  Biological databases

3 Looking beyond
  Alternatives and prospectus

Scope of Debian Med

Free patient management systems for medical practices and hospitals are rare:

GNUmed: Patient record documentation for general practitioners
MedinTux: Practice management system written for the French health care system
Vista: Comprehensive software suite for hospitals (U.S. Department of Veterans Affairs)
Care2x: Web-based hospital management system
Others ...

However, people who hear the name “Debian Med” just assume we provide a practice management system ...
... even if you tell them explicitly that it is not.
So what are the real strengths of Debian Med?

Medical imaging

Debian Med can only include existing software
Fair amount of high-quality Free Software for medical imaging:

Aeskulap, Amide: Medical image viewers
Dcmtk: OFFIS DICOM toolkit
Sofa: Simulation Open Framework Architecture
Fsl: Analysis tools for brain imaging
...

Complete overview on the Debian Med imaging tasks page

Molecular and structural biology, bioinformatics

Most established branch of Debian Med because of good coverage by upstream software

Fostering:
  Development at universities
  Organised funding

Hindering:
  Advertising for proprietary software
  Different preferences of initiators

Selected metapackages of Debian Med

[Figure: Number of dependencies of selected Debian Med metapackages (Microbiology, Imaging, Practice) per year, 2002 to 2009; vertical axis from 0 to 80]

Top 10 posters on debian-med@lists.debian.org

[Figure: Postings per year by the top 10 posters, 2002 to 2009: Andreas Tille, Charles Plessy, David Paleino, Karsten Hilbert, Steffen Moeller, Nelson A. de Oliveira, Mathieu Malaterre, Michael Hanke, Daniel Leidert, Steve M. Robbins]

Differences to commercial distributions

Commercial distributor |                   | Debian
-----------------------+-------------------+------------------
Company                | Structure         | Organisation
Employees              | Persons           | Volunteers
CDs, Service           | Sells             | Nothing
Business plan          | Release           | If 0 RC-bugs
Certified              | Oracle, SAP, etc. | Runs in principle
Beginners              | Preferred by      | Administrators
Rpm                    | Packages          | Deb
Market                 | Customisation     | Do-O-Cracy

Customising Debian

Debian > 20000 packages
Focus on medical subset of those packages
Easy installation and configuration
Automatic installation → cloud computing
Maintaining a general infrastructure for medical users
Propagate the idea of Free Software in medicine
Completely integrated into Debian - no fork

Basic idea: Do not make a separate distribution but make Debian fit for medical care instead

Debian - adaptable for any purpose?

Developed by about 1000 volunteers
Flexible, not bound to commercial interests
Strict rules (policy) glue all things together
Common interest of each individual developer: Get the best operating system for himself.
Developers have children in real life or work in the field of medicine etc.
In contrast to employees of companies, every single Debian developer has the freedom and ability to realize his vision
Every developer is able to influence the development of Debian - he just has to do it.

Do-O-Cracy = "The doer decides"


Programming language support

BioPerl: Collection of Perl tools for computational molecular biology
BioPython: Python library for computational molecular biology
BioRuby: Ruby tools for computational molecular biology
BioJava: Java API to biological data and applications
BioSQUID: Library of C code functions for sequence analysis

Widely used software

BLAST2: Basic Local Alignment Search Tool. Official NCBI version of this famous sequence alignment program. (Note that databases are not included in Debian; they must be retrieved manually.)

EMBOSS: European Molecular Biology Open Software Suite. EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community.

Statistics using GNU R

R-cran-genetics: GNU R package for population genetics. The package provides a library for the statistics environment R that contains classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes.

R-cran-haplo.stats: GNU R package for haplotype analysis. The package provides routines for the GNU R statistics environment for statistical analysis of indirectly measured haplotypes with traits and covariates when linkage phase is ambiguous.

Bioconductor: GNU R tools for the analysis and comprehension of genomic data. Not yet packaged for Debian, but work is in progress to automate packaging of CRAN and Bioconductor packages.

There are some more general R packages recommended by med-statistics.

Phylogenetic analysis

Altree: Perform phylogeny-based analyses
Fastdnaml: Construction of phylogenetic trees of DNA sequences
Njplot: Phylogenetic tree drawing program
Tree-puzzle: Reconstruction of phylogenetic trees by maximum likelihood
Treeviewx: Displays and prints phylogenetic trees
Phylip: Package of programs for inferring phylogenies
Treetool: Interactive tool for displaying phylogenetic trees

Genetics and analysis of RNA sequences

Genetics:

Fastlink: Faster version of pedigree programs of Linkage
Loki: MCMC linkage analysis on general pedigrees
Plink: Whole-genome association analysis toolset
R-cran-qtl: GNU R package for genetic marker linkage analysis

Analysis of RNA sequences:

Infernal: Inference of RNA secondary structural alignments
Rnahybrid: Fast and effective prediction of microRNA/target duplexes

Sequence alignments and related programs

Amap-align: Protein multiple alignment by sequence annealing
Boxshade: Pretty-printing of multiple sequence alignments
Dialign(-tx): Segment-based multiple sequence alignment
Exonerate: Generic tool for pairwise sequence comparison
Gff2aplot: Pair-wise alignment plots for genomic sequences in PostScript
Hmmer: Profile hidden Markov models for protein sequence analysis
Kalign: Global and progressive multiple sequence alignment
Mafft: Multiple alignment program for amino acid or nucleotide sequences
Mummer: Efficient sequence alignment of full genomes
Muscle: Multiple alignment program of protein sequences

Sequence alignments and related programs (cont.)

Poa: Partial Order Alignment for multiple sequence alignment
Probcons: PROBabilistic CONSistency-based multiple sequence alignment
Proda: Multiple alignment of protein sequences
Seaview: Multiplatform interface for sequence alignment and phylogeny
Sibsim4: Align expressed RNA sequences on a DNA template
Sigma-align: Simple greedy multiple alignment of non-coding DNA sequences
Sim4: Tool for aligning cDNA and genomic DNA
T-coffee: Multiple sequence alignment
Wise: Comparison of biopolymers, commonly DNA and protein sequences

Molecular modelling and molecular dynamics

Adun.app: Molecular simulator for GNUstep
Autogrid: Pre-calculate binding of ligands to their receptor
Gamgi: Construct, view and analyse atomic structures
Garlic: Visualisation program for biomolecules
Gdpc: Visualiser of molecular dynamic simulations
Ghemical: GNOME molecular modelling environment
Gromacs: Molecular dynamics simulator, with building and analysis tools
Pymol: Molecular Graphics System
R-other-bio3d: GNU R package for biological structure analysis
Rasmol: Visualise biological macromolecules
Autodocktools: GUI to help set up, launch and analyse AutoDock dockings

High-throughput sequencing

“Next-generation sequencing”
Chip systems to sequence a genome
Reads are very short (40 nucleotides rather than traditionally about 600)
Enormous amount of chromosomal regions covered

Last-align: Genome-scale comparison of biological sequences
Maq: Maps short fixed-length polymorphic DNA sequence reads to reference sequences
Ssake: Genomics application for assembling millions of very short DNA sequences
Velvet: Nucleic acid sequence assembler for very short reads
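
To give an idea of what mapping short reads to a reference sequence involves, here is a minimal Python sketch of exact-match mapping with a k-mer index. It is not the algorithm used by Maq or by any other package listed above; the function names, the toy reference and the reads are made up for illustration, and real mappers additionally deal with mismatches, base qualities and both strands.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return all 0-based positions where the read matches exactly.

    The first k bases of the read serve as a seed that is looked up in
    the index; every candidate position is then verified against the
    full read.
    """
    if len(read) < k:
        return []
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Hypothetical toy data; real reads are around 40 nucleotides long.
reference = "ACGTACGTTAGCCGATAGCTAGCTTACGATCG"
k = 4
index = build_kmer_index(reference, k)

reads = ["TAGCCGAT", "ACGT", "GGGGGGGG"]
mapped = {read: map_read(read, reference, index, k) for read in reads}
for read, positions in mapped.items():
    print(read, positions)
```

The index is built once per reference and reused for every read, which is what makes this approach attractive when millions of reads have to be placed.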


Microbiological packages

More than 80 packages
Overview at the corresponding tasks page of the Debian Med project
Software developed by
  National Center for Biotechnology Information (NCBI)
  Sanger Institute
  The Institute for Genomic Research (TIGR)
  ...

DebTags

udd=# SELECT tag, COUNT(*) FROM debtags
      WHERE tag LIKE '%bio%'
      GROUP BY tag ORDER BY tag;

              tag              | count
-------------------------------+-------
 biology::emboss               |     2
 biology::format:aln           |     9
 biology::format:fasta         |     9
 biology::nuceleic-acids       |    11
 biology::peptidic             |    12
 field::biology                |   174
 field::biology:bioinformatics |    86
 field::biology:molecular      |     8
 field::biology:structural     |    16

(9 rows)
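
The same kind of tag count can be tried without access to UDD. The following is a minimal sketch using Python's sqlite3 module and an in-memory table; the two-column layout (package, tag), the package names and the sample rows are assumptions made up for illustration and do not reflect the real UDD schema or its contents.

```python
import sqlite3

# Hypothetical sample rows; the real debtags table in UDD is much larger
# and its exact schema is not shown in the talk.
rows = [
    ("pkg-a", "field::biology"),
    ("pkg-a", "biology::emboss"),
    ("pkg-b", "field::biology"),
    ("pkg-b", "biology::format:aln"),
    ("pkg-c", "field::arts"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debtags (package TEXT, tag TEXT)")
conn.executemany("INSERT INTO debtags VALUES (?, ?)", rows)

# Same query shape as above: count rows per tag whose name contains 'bio'.
query = """
    SELECT tag, COUNT(*) FROM debtags
    WHERE tag LIKE '%bio%'
    GROUP BY tag ORDER BY tag
"""
counts = conn.execute(query).fetchall()
conn.close()

for tag, count in counts:
    print(f"{tag:30} | {count}")
```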


How to install large databases

Bundling into a Debian package makes no sense
  Size costs bandwidth and mirror space
  Moving target: Stable distribution will be out of date soon
Remote service seems appropriate

Solution also works for astronomy and meteorology

getData
  Obtain data from external source
  Move data to local mirror
  Preparation of configuration file for the particular system that deals with the database
getData is still in a proof-of-concept stage
People are very welcome to join this development (Google Summer of Code project)
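
The three getData steps can be sketched in a few lines of Python. This is not the getData implementation, whose interface is not shown in the talk; the function names, the file names and the configuration format are hypothetical, and a local file stands in for the remote database so that the sketch runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch(url, download_dir):
    """Step 1: obtain the data from an external source."""
    download_dir.mkdir(parents=True, exist_ok=True)
    target = download_dir / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


def move_to_mirror(downloaded, mirror_dir):
    """Step 2: move the downloaded file into the local mirror."""
    mirror_dir.mkdir(parents=True, exist_ok=True)
    destination = mirror_dir / downloaded.name
    return Path(shutil.move(str(downloaded), str(destination)))


def write_config(database, config_path, tool):
    """Step 3: write a configuration file for the program using the data."""
    # Hypothetical key = value format, chosen only for this sketch.
    config_path.write_text(f"# generated for {tool}\ndatabase = {database}\n")
    return config_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-in for the external source: a tiny local FASTA file.
    source = root / "upstream" / "toy.fasta"
    source.parent.mkdir()
    source.write_text(">seq1\nACGT\n")

    downloaded = fetch(source.as_uri(), root / "incoming")
    mirrored = move_to_mirror(downloaded, root / "mirror")
    config = write_config(mirrored, root / "toytool.conf", tool="toytool")

    config_text = config.read_text()
    mirrored_exists = mirrored.exists()
    incoming_left = downloaded.exists()
    print(config_text)
```

Keeping the steps separate means the download can be repeated on a schedule while the configuration is regenerated only when the mirrored file changes.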


Open Database License (ODbL) v1.0

Open Data Commons
License agreement intended to allow users to freely share, modify, and use this Database while maintaining this same freedom for others
Databases can contain a wide variety of types of content (images, audiovisual material, and sounds all in the same database, for example)
ODbL only governs the rights over the Database, and not the contents of the Database individually

Alternatives

BioLinux
  Based on Debian
  Creates a policy-incompatible structure in /usr/local/biolinux
  Some software not yet available in Debian but really sloppy with licenses
  We try to include the missing stuff in Debian to create a policy-compliant, really free system
  Hope BioLinux people will adopt this

FreeBSD ports collection Biology
  Also contains a fair amount of biological software
  Only a few unimportant packages missing in Debian

Prospectus

There are good prerequisites in Debian
Most important tools of molecular biology, structural biology and bioinformatics for use in life sciences are included
Further increase the interest of developers and users and get them involved in the project
Turning Debian into the distribution of choice for people working in the field of medicine because it offers the best support for free medical software

This talk is available at
http://people.debian.org/~tille/talks/

Andreas Tille <tille@debian.org>
