Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique
LF-2003.01
Introduction to Bioinformatics
SIB and EMBnet: Bioinformatics resources for biomedical scientists
The Swiss Institute of Bioinformatics
Founded in March 1998
Collaborative structure: Lausanne - Geneva - Basel
  Groups at ISREC, Ludwig Institute, Unil, HUG, UniGe, recently UniBas and soon EPFL
Several roles: teaching, services, research
Currently: ~130 employees
Projects at SIB
Databases
  SWISS-PROT, PROSITE, EPD, World-2DPAGE, SWISS-MODEL
  TrEST, TrGEN (predicted proteins), tromer (transcriptome)
Software
  Melanie, Deep View, proteomic tools, ESTScan, pftools, Java applets
Services
  Web servers: ExPASy, EMBnet
  Teaching and helpdesk
Research
  Mostly sequence and expression analysis, 3D structure, and proteomics
Teaching
DEA (master's degree) in Bioinformatics: 1 year full time, first diploma common to UniGe and Unil
EMBnet courses: 2 x 1 week per year in Lausanne, to be extended to Basel
Undergraduate courses at the Universities of Geneva, Fribourg and Lausanne
Other courses at CHUV and EPFL
Courses in other countries: Colombia, Cambodia, Peru, …
Research
New algorithms (faster alignments…)
New technology (GRID or cluster computing)
New tools (protein analysis, microarrays, confocal microscopy)
New databases (microarrays, transcriptome, proteome)
Collaborations with lab researchers!
Three levels of services
Simple web access to software and databases
  Easy to use for basic, occasional research with few sequences
  Potentially insecure
Command-line access with a local Unix account
  More powerful (automation) and secure
  Requires an understanding of Unix and frequent practice
Collaboration with SIB
  Access to experts in the field (help desk)
  For projects requiring extensive programming or special hardware resources
SIB’s important sites
Home www.isb-sib.ch
ExPASy - Expert Protein Analysis System www.expasy.org
Hits database and tools hits.isb-sib.ch
EMBnet Switzerland www.ch.embnet.org
Geneva Bioinformatics www.genebio.ch
SIB home
Expert Protein Analysis System
Swiss node http://www.ch.embnet.org
EMBnet organisation
Founded as a European network in 1988, now spread worldwide
  32 country nodes, 8 special nodes
Role
  Training, education (EMBER)
  Software development (EMBOSS, SRS)
  Computing resources (databases, websites, services)
  Helpdesk and technical support
  Publications (EMBnet.news, Briefings in Bioinformatics)
Access: www.embnet.org
  Each node is at “www.xx.embnet.org”, where xx is the country code (e.g., ch for Switzerland)
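The node naming pattern can be written as a small helper. This is only a sketch: the function name `embnet_node_host` is hypothetical, and the host names follow the pattern stated on the slide, so they may no longer resolve today.

```python
def embnet_node_host(country_code: str) -> str:
    """Build an EMBnet node host name from a two-letter country code.

    Follows the "www.xx.embnet.org" pattern quoted on the slide.
    """
    code = country_code.strip().lower()
    # Accept only two ASCII letters, as in "ch" for Switzerland.
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a two-letter country code, got {country_code!r}")
    return f"www.{code}.embnet.org"


print(embnet_node_host("ch"))
```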
EMBnet home
European Molecular Biology Open Software Suite
Free, open source (for most Unix platforms)
GCG successor (compatible with the GCG file format)
More than 200 programs
Easy to install locally
  but no interface, requires local databases
  Unix command-line only
Interfaces
  Jemboss, www2gcg, w2h, wemboss … (with account)
  Pise, EMBOSS-GUI (no account)
Access: www.emboss.org
Other important sites
ExPASy - Expert Protein Analysis System www.expasy.org
EBI - European Bioinformatics Institute www.ebi.ac.uk
NCBI - National Center for Biotechnology Information www.ncbi.nlm.nih.gov
Sanger - The Sanger Institute www.sanger.ac.uk
Bioinformatics: definition
Every application of computer science to biology
  Sequence analysis, image analysis, sample management, population modelling, …
Analysis of data coming from large-scale biological projects
  Genomes, transcriptomes, proteomes, metabolomes, etc.
The new biology
Traditional biology
  Small team working on a specialized topic
  Well-defined experiment to answer precise questions
New « high-throughput » biology
  Large international teams using cutting-edge technology, which defines the project
  Results are given raw to the scientific community without any underlying hypothesis
Examples of « high-throughput »
Complete genome sequencing
Large-scale sampling of the transcriptome (EST)
Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE)
Large-scale sampling of the proteome
Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)
Large-scale 3D structure production (yeast)
Metabolism modelling
Simulations
Biodiversity
Role of bioinformatics
Control and management of the data
Analysis of primary data, e.g.
  Base calling from chromatograms
  Mass spectra analysis
  DNA microarray image analysis
Statistics
Database storage and access
Results analysis in a biological context
First information: a sequence?
Nucleotide
  RNA (or cDNA)
  Genomic (intron-exon)
  Complete or incomplete?
  mRNA with 5’ and 3’ UTR regions
  Entire chromosome
Protein
  Pre/Pro or functional protein?
  Function prediction
  Post-translational modifications?
  Holy Grail: 3D structure?
Genomes in numbers
Sizes:
  virus: 10^3 to 10^5 nt
  bacteria: 10^5 to 10^7 nt
  yeast: 1.35 x 10^7 nt
  mammals: 10^8 to 10^10 nt
  plants: 10^10 to 10^11 nt
Gene number:
  virus: 3 to 100
  bacteria: ~1000
  yeast: ~7000
  mammals: ~30’000
  plants: 30’000-50’000?
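Dividing genome size by gene number gives a rough feel for gene density. The sketch below uses the yeast figures from this slide (1.35 x 10^7 nt, ~7000 genes). For a mammal it assumes the human haploid size of 3 x 10^9 nt with ~30’000 genes; pairing those two numbers is an assumption made here for illustration, and the helper name `nt_per_gene` is hypothetical.

```python
def nt_per_gene(genome_nt: float, gene_count: int) -> float:
    """Average number of nucleotides of genome per gene."""
    return genome_nt / gene_count


# (genome size in nt, approximate gene number)
genomes = {
    "yeast": (1.35e7, 7_000),   # both values from the slide
    "human": (3e9, 30_000),     # assumed pairing: human size, mammalian gene number
}

for name, (size_nt, genes) in genomes.items():
    print(f"{name}: ~{nt_per_gene(size_nt, genes):,.0f} nt per gene")
```

Under these assumptions yeast has roughly 2 kb of genome per gene and human roughly 100 kb, a difference of about fifty-fold.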
Sequencing projects
« Small » genomes (<10^7 nt): bacteria, viruses
  Many already sequenced (industry excluded)
  More than 100 microbial genomes already in the public domain
  More to come! (one new every two weeks…)
« Large » genomes (10^7-10^10 nt): eukaryotes
  15 finished (S. cerevisiae, S. pombe, E. cuniculi, G. theta, C. elegans, D. melanogaster, A. gambiae, P. falciparum, P. yoelii, D. rerio, F. rubripes, A. thaliana, O. sativa (2x), M. musculus, Homo sapiens)
  Many more to come: rat, pig, cow, maize (and other plants), insects, fishes, many pathogenic parasites (Leishmania…)
EST sequencing
  Partial mRNA sequences
  ~15 x 10^6 sequences in the public domain
Human genome
Size: 3 x 10^9 nt for a haploid genome
Highly repetitive sequences 25%, moderately repetitive sequences 25-30%
Size of a gene: from 900 to >2’000’000 bases (introns included)
Proportion of the genome coding for proteins: 5-7%
Number of chromosomes: 22 autosomes, 1 sex chromosome
Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
[Figure: chromosome schematic labelled with centromere, telomere, exons of a gene, regulatory elements, repetitive sequences, locus control region]
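The percentages on this slide translate into absolute sizes as follows. This is a minimal sketch using only the slide's figures (3 x 10^9 nt haploid genome, 5-7% coding, 25% highly and 25-30% moderately repetitive); the helper name `share_nt` is hypothetical, and integer arithmetic is used to avoid rounding noise.

```python
GENOME_NT = 3_000_000_000  # haploid human genome size from the slide


def share_nt(percent: int, total_nt: int = GENOME_NT) -> int:
    """Number of nucleotides corresponding to a whole-number percentage."""
    return total_nt * percent // 100


coding = (share_nt(5), share_nt(7))
# Highly repetitive (25%) plus moderately repetitive (25-30%).
repeats = (share_nt(25 + 25), share_nt(25 + 30))

print(f"protein-coding: {coding[0]:,} to {coding[1]:,} nt")
print(f"repetitive:     {repeats[0]:,} to {repeats[1]:,} nt")
```

With these numbers, protein-coding sequence amounts to 150-210 million nt, while repeats account for at least half of the genome.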
How to sequence the human genome?
« International » consortium approach:
  Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences
  Generate a physical map based on large clones (BAC or PAC)
  Sequence enough large clones to cover the genome
« Commercial » approach (Celera):
  Generate random libraries of fixed-length genomic clones (2 kb and 10 kb)
  Sequence both ends of enough clones to obtain a 10x coverage
  Use computer techniques to reconstitute the chromosomal sequences, check with the public project physical map
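To get a feel for the scale of the random (shotgun) approach, coverage can be estimated as coverage = number of reads x read length / genome size. The sketch below is a back-of-the-envelope calculation, not a description of Celera's actual figures: the 500 nt read length is an assumption, "10x" is read here as sequence coverage, and the uncovered fraction uses the simple Poisson model exp(-coverage).

```python
import math


def reads_needed(genome_nt: int, coverage: float, read_nt: int) -> int:
    """Reads required so that reads x read length / genome size equals the target coverage."""
    return math.ceil(coverage * genome_nt / read_nt)


def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases hit by no read, assuming reads land at random (Poisson)."""
    return math.exp(-coverage)


GENOME_NT = 3_000_000_000  # haploid human genome
COVERAGE = 10              # target from the slide
READ_NT = 500              # assumed average read length (not from the slide)

reads = reads_needed(GENOME_NT, COVERAGE, READ_NT)
clones = math.ceil(reads / 2)  # both ends of each clone are sequenced

print(f"reads needed:  {reads:,}")
print(f"clones needed: {clones:,}")
print(f"expected uncovered fraction: {uncovered_fraction(COVERAGE):.1e}")
```

Under these assumptions the project needs tens of millions of reads, which is why the assembly step depends on computation.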
Sequencing progression
Interpretation of the human draft
Still many gaps and unordered small pieces (except for chr 6, 7, 13, 14, 20, 21, 22, Y)
Even a genomic sequence does not tell you where the genes are encoded
  The genome is far from being « decoded »
One must combine genome and transcriptome to have a better idea
Last freeze: NCBI30, June 24, 2002
The transcriptome
The set of all functional RNAs (tRNA, rRNA, mRNA etc…) that can potentially be transcribed from the genome
The documentation of the localization (cell type) and conditions under which these RNAs are expressed
The documentation of the biological function(s) of each RNA species
Public draft transcriptome
Information about the expression specificity and the function of mRNAs
  « Full » cDNA sequences of known function
  « Full » cDNA sequences, but « anonymous » (e.g. KIAA or DKFZ collections)
EST sequences
  cDNA libraries derived from many different tissues
  Rapid random sequencing of the ends of all clones
ORESTES sequences
Growing set of expression data (microarrays, SAGE etc…)
Increasing evidence for multiple alternative splicing and polyadenylation
Example mapping of ESTs and mRNAs
[Figure: tracks labelled ESTs, mRNAs and computer prediction]
The proteome
Set of proteins present in a particular cell type under particular conditions
Set of proteins potentially expressed from the genome
Information about the specific expression and function of the proteins
Information on the proteome
Separation of a complex mixture of proteins
  2D PAGE (IEF + SDS PAGE)
  Capillary chromatography
Individual characterisation of proteins
  Tryptic peptide signature (MS)
  Sequencing by chemistry or MS/MS
All post-translational modifications (PTMs)!
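The tryptic peptide signature mentioned above relies on trypsin cutting at predictable positions. A minimal in-silico digest is sketched below, assuming the commonly used rule that trypsin cleaves after K or R unless the next residue is P; missed cleavages and modifications are ignored. The function name `tryptic_peptides` and the example sequence are made up for illustration.

```python
def tryptic_peptides(protein: str) -> list:
    """Split a protein sequence after K or R, except when followed by P."""
    seq = protein.upper()
    peptides = []
    start = 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    # Keep the C-terminal peptide if the sequence does not end in K or R.
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


# Invented sequence; the R at position 3 is followed by P, so it is not cut.
print(tryptic_peptides("MKRPLAKGR"))
```

Turning this peptide list into a mass signature would additionally require a table of residue masses, which is left out here.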
Three-dimensional structures
Methods to determine structures
  X-ray crystallography
  NMR
Data format
  Atom coordinates (except H) in a Cartesian space
Databases
  For proteins and nucleic acids (PDB, now maintained by RCSB)
  Independent databases for sugars and small organic molecules
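As an illustration of the coordinate data format, the sketch below reads two ATOM records in the fixed-column PDB layout (x, y and z in columns 31-38, 39-46 and 47-54) and computes the distance between the atoms. The two records and their coordinates are invented for the example, and the helper names `parse_atom` and `distance` are hypothetical.

```python
import math

# Invented ATOM records laid out in the fixed-column PDB format.
RECORDS = [
    "ATOM      1  N   GLY A   1       0.000   0.000   0.000",
    "ATOM      2  CA  GLY A   1       3.000   4.000   0.000",
]


def parse_atom(line: str) -> dict:
    """Extract atom name, residue name and Cartesian coordinates from an ATOM record."""
    return {
        "name": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "xyz": (
            float(line[30:38]),           # x, columns 31-38
            float(line[38:46]),           # y, columns 39-46
            float(line[46:54]),           # z, columns 47-54
        ),
    }


def distance(a: dict, b: dict) -> float:
    """Euclidean distance between two parsed atoms, in the units of the file (angstroms)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a["xyz"], b["xyz"])))


atoms = [parse_atom(line) for line in RECORDS if line.startswith("ATOM")]
print(atoms[0]["name"], atoms[1]["name"], distance(atoms[0], atoms[1]))
```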
Visualisation of the structures
Secondary structure elements
  Alpha helices, beta sheets, other
Software
  Various representations (atoms, bonds, secondary…)
  Wide choice of commercial and free software (e.g., DeepView)
Sequence information, and so what?
How to store and organise?
  Databases (next lecture)
How to access, search, compare?
  Pairwise alignments, dot plots (Tuesday)
  BLAST searches in db (Tuesday)
  EST clustering (Wednesday)
  Multiple alignments (Wednesday)
  Patterns, PSI-BLAST, profiles and HMMs (Thursday)
  Gene prediction (Thursday)
  Protein function prediction (Friday)
  User problems (Friday)
Thank you