Introduction to Bioinformatics

Roderic Guigó i Serra (roderic.guigo@crg.cat)

Bioinformatics, UPF, academic year 2011-2012

Van Leeuwenhoek and microscopy

In 1676 his credibility was questioned when he sent the Royal Society a copy of his first observations of microscopic single celled organisms. Heretofore, the existence of single celled organisms was entirely unknown … The Royal Society arranged to send an English vicar, as well as a team of respected jurists and doctors to Delft, Holland to determine whether it was in fact Van Leeuwenhoek's ability to observe and reason clearly (Wikipedia)

The impact of technology on the practice of biology

gagttttatcgcttccatgacgcagaagttaacactttcggatatttctgatgagtcgaaaaattatcttgataaagcaggaattactactgcttgtttacgaattaaatcgaagtggactgctggcggaaaatgagaaaattcgacctatccttgcgcagctcgagaagctcttactttgcgacctttcgccatcaactaacgattctgtcaaaaactgacgcgttggatgaggagaagtggcttaatatgcttggcacgttcgtcaaggactggtttagatatgagtcacattttgttcatggtagagattctcttgttgacattttaaaagagcgtggattactatctgagtccgatgctgttcaaccactaataggtaagaaatcatgagtcaagttactgaacaatccgtacgtttccagaccgctttggcctctattaagctcattcaggcttctgccgttttggatttaaccgaagatgatttcgattttctgacgagtaacaaagtttggattgctactgaccgctctcgtgctcgtcgctgcgttgaggcttgcgtttatggtacgctggactttgtgggataccctcgctttcctgctcctgttgagtttattgctgccgtcattgcttattatgttcatcccgtcaacattcaaacggcctgtctcatcatggaaggcgctgaatttacggaaaacattattaatggcgtcgagcgtccggttaaagccgctgaattgttcgcgtttaccttgcgtgtacgcgcaggaaacactgacgttcttactgacgcagaagaaaacgtgcgtcaaaaattacgtgcggaaggagtgatgtaatgtctaaaggtaaaaaacgttctggcgctcgccctggtcgtccgcagccgttgcgaggtactaaaggcaagcgtaaaggcgctcgtctttggtatgtaggtggtcaacaattttaattgcaggggcttcggccccttacttgaggataaattatgtctaatattcaaactggcgccgagcgtatgccgcatgacctttcccatcttggcttccttgctggtcagattggtcgtcttattaccatttcaactactccggttatcgctggcgactccttcgagatggacgccgttggcgctctccgtctttctccattgcgtcgtggccttgctattgactctactgtagacatttttactttttatgtccctcatcgtcacgtttatggtgaacagtggattaagttcatgaaggatggtgttaatgccactcctctcccgactgttaacactactggttatattgaccatgccgcttttcttggcacgattaaccctgataccaataaaatccctaagcatttgtttcagggttatttgaatatctataacaactattttaaagcgccgtggatgcctgaccgtaccgaggctaaccctaatgagcttaatcaagatgatgctcgttatggtttccgttgctgccatctcaaaaacatttggactgctccgcttcctcctgagactgagctttctcgccaaatgacgacttctaccacatctattgacattatgggtctgcaagctgcttatgctaatttgcatactgaccaagaacgtgattacttcatgcagcgttaccatgatgttatttcttcatttggaggtaaaacctcttatgacgctgacaaccgtcctttacttgtcatgcgctctaatctctgggcatctggctatgatgttgatggaactgaccaaacgtcgttaggccagttttctggtcgtgttcaacagacctataaacattctgtgccgcgtttctttgttcctgagcatggcactatgtttactcttgcgcttgttcgttttccgcctactgcgactaaagagattcagtaccttaacgctaaaggtgctttgacttataccgatattgctggcgaccctgttttgtatggcaacttgccgccgcgtgaaatttctatgaaggatg
ttttccgttctggtgattcgtctaagaagtttaagattgctgagggtcagtggtatcgttatgcgccttcgtatgtttctcctgcttatcaccttcttgaaggcttcccattcattcaggaaccgccttctggtgatttgcaagaacgcgtacttattcgccaccatgattatgaccagtgtttccagtccgttcagttgttgcagtggaatagtcaggttaaatttaatgtgaccgtttatcgcaatctgccgaccactcgcgattcaatcatgacttcgtgataaaagattgagtgtgaggttataacgccgaagcggtaaaaattttaatttttgccgctgaggggttgaccaagcgaagcgcggtaggttttctgcttaggagtttaatcatgtttcagacttttatttctcgccataattcaaactttttttctgataagctggttctcacttctgttactccagcttcttcggcacctgttttacagacacctaaagctacatcgtcaacgttatattttgatagtttgacggttaatgctggtaatggtggttttcttcattgcattcagatggatacatctgtcaacgccgctaatcaggttgtttctgttggtgctgatattgcttttgatgccgaccctaaattttttgcctgtttggttcgctttgagtcttcttcggttccgactaccctcccgactgcctatgatgtttatcctttgaatggtcgccatgatggtggttattataccgtcaaggactgtgtgactattgacgtccttccccgtacgccgggcaataacgtttatgttggtttcatggtttggtctaactttaccgctactaaatgccgcggattggtttcgctgaatcaggttattaaagagattatttgtctccagccacttaagtgaggtgatttatgtttggtgctattgctggcggtattgcttctgctcttgctggtggcgccatgtctaaattgtttggaggcggtcaaaaagccgcctccggtggcattcaaggtgatgtgcttgctaccgataacaatactgtaggcatgggtgatgctggtattaaatctgccattcaaggctctaatgttcctaaccctgatgaggccgcccctagttttgtttctggtgctatggctaaagctggtaaaggacttcttgaaggtacgttgcaggctggcacttctgccgtttctgataagttgcttgatttggttggacttggtggcaagtctgccgctgataaaggaaaggatactcgtgattatcttgctgctgcatttcctgagcttaatgcttgggagcgtgctggtgctgatgcttcctctgctggtatggttgacgccggatttgagaatcaaaaagagcttactaaaatgcaactggacaatcagaaagagattgccgagatgcaaaatgagactcaaaaagagattgctggcattcagtcggcgacttcacgccagaatacgaaagaccaggtatatgcacaaaatgagatgcttgcttatcaacagaaggagtctactgctcgcgttgcgtctattatggaaaacaccaatcttcccaagcaacagcaggtttccgagattatgcgccaaatgcttactcaagctcaaacggctggtcagtattttaccaatgaccaaatcaaagaaatgactcgcaaggttagtgctgaggttgacttagttcatcagcaaacgcagaatcagcggtatggctcttctcatattggcgctactgcaaaggatatttctaatgtcgtcactgatgctgcttctggtgtggttgatatttttcatggtattgataaagctgttgccgatacttggaacaatttctggaaagacggtaaagctgatggtattggctctaatttgtctaggaaataaccgtcaggattgacaccctcccaattgtatgttttcatgcctccaaatcttggaggcttttttatggttcgttcttattaccc
ttctgaatgtcacgctgattattttgactttgag

1975: DNA sequencing. Sanger; Maxam and Gilbert

Human Genome Project Milestones

2001: the culmination of the project

What’s past is prologue

• Shakespeare, The Tempest

Further Evolution of Large-scale Genome Sequencing

• 2000: Human genome working drafts
• Data unit of approximately 10x coverage of the human genome
  – 10 years and a cost of about $3 billion
• 2008: Major genome centers can sequence the same number of base pairs every 4 days
• 1000 Genomes Project launched
• Worldwide capacity dramatically increasing
• 2009: Every 4 hours ($25,000)
• 2010: Every 14 minutes ($5,000)
• Illumina HiSeq 2000 machine produces 200 gigabases per 8-day run (BGI has ordered 128)

Slide from Paul Flicek. EBI Bioinformatics Advisory Council
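The milestones on this slide imply a steep fall in the time and cost needed to produce one data unit. A minimal sketch of that arithmetic, taking the slide's figures at face value and assuming a 365-day year; the names `MILESTONES`, `speedup`, and `cost_reduction` are illustrative and do not come from the source:

```python
from datetime import timedelta

# Time and cost to produce one data unit (about 10x coverage of a human
# genome), as quoted on the slide. The 2008 entry has no cost on the slide.
# Assumption: "10 years" is taken as 10 * 365 days.
MILESTONES = {
    2000: (timedelta(days=10 * 365), 3_000_000_000),
    2008: (timedelta(days=4), None),
    2009: (timedelta(hours=4), 25_000),
    2010: (timedelta(minutes=14), 5_000),
}


def speedup(year: int, baseline: int = 2000) -> float:
    """How many times faster one data unit is produced than in the baseline year."""
    return MILESTONES[baseline][0] / MILESTONES[year][0]


def cost_reduction(year: int, baseline: int = 2000) -> float:
    """How many times cheaper one data unit is than in the baseline year."""
    cost = MILESTONES[year][1]
    if cost is None:
        raise ValueError(f"no cost quoted for {year}")
    return MILESTONES[baseline][1] / cost


if __name__ == "__main__":
    for year in (2008, 2009, 2010):
        print(f"{year}: about {speedup(year):,.0f}x faster than 2000")
    for year in (2009, 2010):
        print(f"{year}: about {cost_reduction(year):,.0f}x cheaper than 2000")
```

Under these assumptions, the 2010 figure corresponds to roughly a 375,000-fold speedup and a 600,000-fold cost reduction relative to the 2000 working drafts.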

Moore’s Law

Fig. 1. A doubling of sequencing output every 9 months has outpaced and overtaken performance improvements within the disk storage and high-performance computation fields.

S. D. Kahn, Science 2011;331:728-729. Published by AAAS
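The caption's claim rests on compounding: a quantity that doubles every T months grows by a factor of 2^(t/T) after t months. A small sketch comparing the 9-month doubling quoted in the caption with an 18-month doubling, which is an assumed, commonly cited figure for Moore's law and does not come from the source; `growth_factor` and the two constants are illustrative names:

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after `elapsed_months` given a fixed doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)


SEQUENCING_DOUBLING = 9   # months, from the figure caption
COMPUTE_DOUBLING = 18     # months, ASSUMED for illustration (not in the source)

if __name__ == "__main__":
    for years in (1, 3, 5):
        months = 12 * years
        seq = growth_factor(months, SEQUENCING_DOUBLING)
        comp = growth_factor(months, COMPUTE_DOUBLING)
        print(f"{years} y: sequencing x{seq:.1f}, computing x{comp:.1f}, "
              f"gap x{seq / comp:.1f}")
```

With these doubling times, three years of growth gives a 16-fold increase in sequencing output against a 4-fold increase in computing, so the gap itself keeps widening.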

Sequencing challenges

• Sequencing to survey dynamics of ecosystems
• Metagenomes
  – Within individual ecosystems
• Other species genomes
• Reference human genome
• Individual genomes
• Individual metagenomes
• Within-individual genomic diversity
• Sequencing as the read-out of experiments
  – ChIP-seq and nucleosome positioning
• RNA sequencing as a proxy for the cell’s phenotype



Sequencing challenges

• Sequencing to survey dynamics of ecosystems
• Metagenomes
  – Ecosystems (environmental, individual)
• Other species genomes
• Reference human genome
• Individual genomes
• Individual metagenomes
• Within-individual genomic diversity
• RNA sequencing as a proxy for the cell’s phenotype
• Sequencing to interrogate the activity of the genome, the epigenome

The ENCODE Project

Experimental assays used by the ENCODE consortium (figure labels: transcripts, epigenetic modifications, transcription factor binding sites)

In summary

Biology is transitioning (at least partially)

from an “analytic” science, in which the real world is dissected into its elemental components in order to be comprehended,

to a “synthetic” science, in which the challenge is the integration of global information on the living cell, individual, population, or (eco)system.

From analytic to synthetic

Biology, a science in which effort has traditionally been directed towards data acquisition, has in a very short time become a discipline in which data are obtained with almost no human intervention, and the effort is turning towards data analysis.

From data acquisition to data analysis

DNA microarrays


bioinformatics

Medline articles

year         # articles
To 1990      0
1990-1994    15
1995-1999    823
2000-2004    7827
2005-2008    18822

Bioinformatics, Genomics, Systems Biology in Medline

Engineering and biology: increasingly interconnected

• Improved technologies to survey biological systems
  – NGS and the like
• Engineering of biological systems
  – Medicine
  – New and modified biological systems
• Using biology to build non-biological systems
  – DNA computing

Biology has changed and is changing

• Quantitative thinking
• Ability to attack unanticipated problems

Biology requires quantitative thinking

• Statistics
• Mathematics
• Computer science
• …

and programming skills (Unix)

• The ability to interrogate data and to model systems

Bioinformatics

Google search hits for X-informatics (11 June 2007):

bioinformatics      14,100,000
chemoinformatics    226,000
astroinformatics    195
neuroinformatics    364,000
socioinformatics    610
geoinformatics      506,000
meteoinformatics    48
econoinformatics    441
ecoinformatics      160,000

ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGATGTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGACACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTGTTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGCCCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGCGTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTGGGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGAGATCGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAA
AACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACACAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCACGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCATTCTGCCCCTGCCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTTCTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGGGGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCACCTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGTTCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC

Chromosomal matter is “an aperiodic crystal”, made up of a succession of a small number of isomeric elements*, whose specific sequence is responsible for its function.

(*) “the number of atoms in such a structure need not be very large to produce an almost unlimited number of possible arrangements. For illustration, think of the Morse code…”


1943: Schrödinger, “What is Life?”
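The point about a small number of elements yielding an almost unlimited number of arrangements can be checked with a short count. The sketch below assumes a four-letter alphabet (the DNA bases A, C, G, T, which the quotation itself does not name) and, for the Morse code illustration, two signs in groups of at most four; the function names are illustrative:

```python
def arrangements(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of exactly `length` symbols."""
    return alphabet_size ** length


def arrangements_up_to(alphabet_size: int, max_length: int) -> int:
    """Number of distinct sequences of 1 to `max_length` symbols."""
    return sum(alphabet_size ** k for k in range(1, max_length + 1))


if __name__ == "__main__":
    # Morse-like code: dot and dash, groups of no more than four signs.
    print("Morse-like groups:", arrangements_up_to(2, 4))

    # Four-letter alphabet (assumed: A, C, G, T) at increasing lengths.
    for n in (10, 50, 100):
        count = arrangements(4, n)
        print(f"length {n}: {len(str(count))}-digit number of sequences")
```

Two signs in groups of up to four already give 30 distinct specifications, and a four-letter sequence only 100 positions long has a 61-digit number of possible arrangements.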
