47
1 Bioinformatics & Computational Biology Thanks to Mark Gerstein (Yale) & Eric Green (NIH) for many borrowed & modified PPTs Drena Dobb Iowa State Universit

Bioinformatics & Computational Biology

  • Upload
    uzuri

  • View
    58

  • Download
    1

Embed Size (px)

DESCRIPTION

Bioinformatics & Computational Biology. Thanks to Mark Gerstein (Yale) & Eric Green (NIH) for many borrowed & modified PPTs. Drena Dobbs Iowa State University. What is Bioinformatics? (& What is Computational Biology?). Wikipedia : - PowerPoint PPT Presentation

Citation preview

Page 1: Bioinformatics &  Computational Biology

1

Bioinformatics & Computational Biology

Thanks to Mark Gerstein (Yale) & Eric Green (NIH)for many borrowed & modified PPTs

Drena DobbsIowa State University

Page 2: Bioinformatics &  Computational Biology

2

What is Bioinformatics?(& What is Computational Biology?)

Wikipedia: •Bioinformatics & computational biology

involve the use of techniques from mathematics, informatics, statistics, and computer science (& engineering) to solve biological problems

Gerstein: • (Molecular) Bioinformatics is conceptualizing biology in

terms of molecules & applying “informatics” techniques - derived from disciplines such as mathematics, computer science, and statistics - to organize and understand information associated with these molecules, on a large scale

Page 3: Bioinformatics &  Computational Biology

3

What is the Information?Biological Sequences, Structures,

ProcessesCentral Dogma

of Molecular Biology

• DNA sequence -> RNA -> Protein -> Phenotype

• Molecules Sequence, Structure, Function

• Processes Mechanism, Specificity, Regulation

Central Paradigm for Bioinformatics

• Genomic (DNA) Sequence -> mRNAs & other RNA sequences -> Protein sequences -> RNA & Protein Structures -> RNA & Protein Functions -> Phenotype

• Large Amounts of Information Standardized Statistical

idea from D Brutlag, Stanford, graphics from S Strobel)Modified from Mark Gerstein

Page 4: Bioinformatics &  Computational Biology

4

Explosion of "Omes" & "Omics!"Genome, Transcriptome, Proteome

• Genome - the complete collection

of DNA (genes and "non-genes") of

an organism

• Transcriptome - the complete

collection of RNAs (mRNAs &

others) expressed in an organism

• Proteome - the complete

collection of of proteins expressed

in an organism

Page 5: Bioinformatics &  Computational Biology

5

Genome = ConstantTranscriptome & Proteome = Variable

• Genome - the complete collection

of DNA (genes and "non-genes") of

an organism

• Transcriptome - the complete

collection of RNAs (mRNAs &

others) expressed in an organism*• Proteome - the complete

collection of proteins expressed in

an organism*

* Note: Although the

DNA is "identical" in all

cells of an organism, the

sets of RNAs or proteins

expressed in different

cells & tissues of a single

organism vary greatly --

and depend on variables

such as environmental

conditions, age.

developmental stage

disease state, etc.

Page 6: Bioinformatics &  Computational Biology

6

Molecular Biology Information: DNA & RNA

Sequences Functions: • Genetic material• Information transfer (mRNA)• Protein synthesis (tRNA/mRNA)• Catalytic & regulatory activities (some very new!)

Information:• 4 letter alphabet

(DNA nucleotides: AGCT)• ~ 1,000 base pairs in a small gene • ~ 3 X 109 bp in a genome (human)

DNA sequence:

atggcaattaaaattggtatcaatggttttggtcgtatgcacaacaccgtgatgacattgaagttgtaggtattaaatggcttatatgttgaaatatgattcaactcacggtcgaaagatggtaacttagtggttaatggtaaaactatccgGcaaacttaaactggggtgcaatcggtgttgatatcgctttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagtt

RNA sequence has "U" instead of "T"

• Where are the genes?• Which DNA sequences encode mRNA?• Which DNA sequences are "junk"? • Which RNA sequences encode protein?

Modified from Mark Gerstein

Page 7: Bioinformatics &  Computational Biology

7

Molecular Biology Information: Protein

Sequences

• Biocatalysis• Cofactor transport/storage• Mechanical motion/support• Immune protection• Regulation of growth and

differentiation

Information: • 20 letter alphabet (amino acids)

ACDEFGHIKLMNPQRSTVWY (but not BJOUXZ)

• ~ 300 aa in an average protein (in bacteria)

• ~ 3 X 106 known protein sequences

Protein sequences:

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTd8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSd4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTLd3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV

Functions: Most cellular functions are performed or facilitated by proteins

• What is this protein?• Which amino acids are most important -- for folding, activity, interaction with other proteins? • Which sequence variations are harmful (or beneficial)?

Modified from Mark Gerstein

Page 8: Bioinformatics &  Computational Biology

8

Molecular Biology Information:

Macromolecular Structures

DNA/RNA/Protein Structures

• How does a protein (or RNA) sequence fold into an active 3-dimensional structure?

• Can we predict structure from sequence?

• Can we predict function from structure (or perhaps, from sequence alone?)

Modified from Mark Gerstein

Page 9: Bioinformatics &  Computational Biology

9

We don't yet understand the protein folding code - but we try to engineer

proteins anyway!

Modified from Mark Gerstein

Page 10: Bioinformatics &  Computational Biology

10

Molecular Biology Information:

Biological ProcessesFunctional Genomics• How do patterns of gene

expression determine phenotype?

• Which genes and proteins are required for differentiation during during development?

• How do proteins interact in biological networks?

• Which genes and pathways have been most highly conserved during evolution?

Page 11: Bioinformatics &  Computational Biology

11

On a Large Scale?

Whole GenomeSequencing

Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.

-- G A Pekso, Nature 401: 115-116 (1999)

Modified from Mark Gerstein

Page 12: Bioinformatics &  Computational Biology

12

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Automated Sequencing for Genome Projects

Another recent improvement: rapid & high resolution separation of fragments in capillaries instead of gels (E Yeung,Ames Lab, ISU)

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Modified from Eric Green

More recently? Pyro-sequencing 454 sequencing http://www.454.com/$ 1000 genomes?

Page 13: Bioinformatics &  Computational Biology

13

1st Draft Human Genome - "Finished" in 2001

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Modified from Eric Green

Page 14: Bioinformatics &  Computational Biology

14

Human Genome Sequencing

Two approaches:

• Public (government) - International Consortium

(6 countries, NIH-funded in US)• "Hierarchical" cloning & BAC-by-BAC sequencing• Map-based assembly

• Private (industry) - Celera (Craig Venter)• Whole genome random "shotgun" sequencing • Computational assembly (took advantage of public maps & sequences,too)How many genes? ~ 20,000 (Science May 2007)

Craig'sGuess which human genome they sequenced?

Page 15: Bioinformatics &  Computational Biology

15

Public Sequencing - International Consortium

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Modified from Eric Green

Page 16: Bioinformatics &  Computational Biology

16

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Comparison of Sequenced Genome Sizes

Plants? Some have much larger genomes than human!

Modified from Eric Green

Page 17: Bioinformatics &  Computational Biology

17

"Complete" Human Genome Sequence - What next?

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

from Eric Green

Page 18: Bioinformatics &  Computational Biology

18

Next Step after the Sequence?

• Expression Analysis• Structural Genomics• Protein Interactions• Pathway Analysis• Systems Biology

Understanding Gene Function on a Genomic

Scale

Evolutionary Implications of: • Introns & Exons• Intergenic Regions as "Gene Graveyard"

Modified from Mark Gerstein

Page 19: Bioinformatics &  Computational Biology

19

Interpreting the Human Genome Sequence!

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

from Eric Green

Page 20: Bioinformatics &  Computational Biology

20

Comparative Genomics: compare entire genomic sequences

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

from Eric Green

Page 21: Bioinformatics &  Computational Biology

21

Comparing Genomes: Functional Elements

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

from Eric Green

Page 22: Bioinformatics &  Computational Biology

22

Gene Expression Data: the Transcriptome (& Proteome)

MicroArray Data

Yeast Expression Data:

• Levels for all 6,000 genes!

• Experiments to investigate how genes respond to changes in environment or how patterns of expression change in normal vs cancerous tissue

(courtesy of J Hager)Modified from Mark Gerstein

ISU's Biotechnology Facilities include state-of-the-art Microarray & Proteomics instrumentation

Page 23: Bioinformatics &  Computational Biology

23

Other Whole-Genome

Experiments

Systematic Knockouts:

Make "knockout" (null) mutations in every gene - one at a time - and analyze the resulting phenotypes!

For yeast:

6,000 KO mutants!

2-hybrid Experiments:

For each (and every) protein, identify every other protein with which it interacts!

For yeast: 6000 x 6000 / 2 ~ 18M interactions!!

Modified from Mark Gerstein

Page 24: Bioinformatics &  Computational Biology

24

Molecular Biology Information:Integrating Data

•Understanding the function of genomes requires integration of many diverse and complex types of information: Metabolic pathways Regulatory networks Whole organism physiology Evolution, phylogeny Environment, ecology Literature (MEDLINE)

Modified from Mark Gerstein

Page 25: Bioinformatics &  Computational Biology

25

Storing & Analyzing Large-scale Information:

Exponential Growth of Data Matched by Development of Computer Technology

CPU vs Disk & Net• Both the increase in

computer speed and the ability to store large amounts of information on computers have been crucial

• Improved computing resources have been a driving force in Bioinformatics

Modified from Mark Gerstein (Internet picture adaptedfrom D Brutlag, Stanford)

ISU's supercomputer "CyBlue" is among 100 most powerful in the world

Page 26: Bioinformatics &  Computational Biology

26Weber Cartoon

from Mark Gerstein

Page 27: Bioinformatics &  Computational Biology

27

Challenges in Organizing & Understanding High-throughput Data:

Redundancy and Multiplicity• Different sequences can have the

same structure• Organism has many similar genes• Single gene may have multiple

functions• Genes and proteins function in

genetic and regulatory pathways• How do we organize all this

information so that we can make sense of it?

Integrative Genomics: genes >< structures <> functions <> pathways <> expression <>regulatory systems <> ….

Modified from Mark Gerstein

Page 28: Bioinformatics &  Computational Biology

28

"Simple" example? ProteinsMolecular Parts = Conserved

Domains

Modified from Mark Gerstein

Page 29: Bioinformatics &  Computational Biology

29

"Parts List" approach to bike maintenance:

What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -- types of parts (nuts & washers)?

How many roles can these play? How flexible and adaptable are they mechanically?

Where are the parts

located? Modified from Mark Gerstein

Page 30: Bioinformatics &  Computational Biology

30

~2,000 folds

~20,000 genes

~2,000 genes1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 …

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 …

(human)

World of protein structures is also finite,

providing a valuable simplification!

Global Surveys of a Finite Set of Parts from Many Perspectives

Same logic for pathways, functions, sequence families, blocks, motifs....

Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from, ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....

(T. pallidum)

Modified from Mark Gerstein

Page 31: Bioinformatics &  Computational Biology

31

BUT, what actually happens in cells & in whole organisms is much more

complex! providing a challenging complication!!

Exploring the Virtual Cell at ISU

Virtual Cell projects elsewhere...

NCBI's Bookshelf - a great resource!

Page 32: Bioinformatics &  Computational Biology

32

So, having a list of parts is not enough!

BIG QUESTION?

SYSTEMS BIOLOGY

How do parts work together to form a functional system?

What is a system? Macromolecular complex, pathway, network, cell, tissue, organism, ecosystem…

Page 33: Bioinformatics &  Computational Biology

33

Is this Bioinformatics? (#1,with Answers)

• Creating digital libraries Automated bibliographic search and textual comparison Knowledge bases for biological literature

• Motif discovery using Gibb's sampling • Methods for structure determination

Computational X-ray crystallography NMR structure determination

• Distance Geometry

• Metabolic pathway simulation

YES

YES

YES

YES

Modified from Mark Gerstein

Page 34: Bioinformatics &  Computational Biology

34

Is this Bioinformatics? #2

• Gene identification by sequence inspection Prediction of splice sites, promoters, etc.

• DNA methods in forensics • Modeling populations of organisms

Ecological Modeling

• Genomic sequencing methods Assembling contigs Physical and genetic mapping

• Linkage analysis Linking specific genes to various traits

YES

YES

YES

YES

YES

Modified from Mark Gerstein

Page 35: Bioinformatics &  Computational Biology

35

Is this Bioinformatics? #3

• Rational drug design• RNA structure prediction• Protein structure prediction

• Radiological image processing Computational representations for human anatomy

• (e.g., Visible Human)

• Artificial life simulations Artificial immunology Virtual cells

YES

Maybe

Modified from Mark Gerstein

Yes

Page 36: Bioinformatics &  Computational Biology

36

So, this is Bioinformatics

What is it good for?

Page 37: Bioinformatics &  Computational Biology

37

EXAMPLES OF BIOINFORMATICS RESEARCH

A few general ones&

a few personal favorites!

Page 38: Bioinformatics &  Computational Biology

38

Designing New Drugs

•Understanding how proteins bind other molecules•Structural modeling & ligand docking•Designing inhibitors or modulators of key proteins

Figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).Modified from Mark Gerstein

Page 39: Bioinformatics &  Computational Biology

39

Finding homologs of "new" human genes

Modified from Mark Gerstein

Page 40: Bioinformatics &  Computational Biology

40

Finding WHAT? Homologs - "same genes" in different

organisms(actually, orthologs)• Human vs. Mouse vs. Yeast

Much easier to do experiments on yeast to determine function Often, function of an ortholog in at least one organism is known

Best Sequence Similarity Matches to Date Between Positionally ClonedHuman Genes and S. cerevisiae Proteins

Human Disease MIM # Human GenBank BLASTX Yeast GenBank Yeast Gene Gene Acc# for P-value Gene Acc# for Description Human cDNA Yeast cDNA

Hereditary Non-polyposis Colon Cancer 120436 MSH2 U03911 9.2e-261 MSH2 M84170 DNA repair proteinHereditary Non-polyposis Colon Cancer 120436 MLH1 U07418 6.3e-196 MLH1 U07187 DNA repair proteinCystic Fibrosis 219700 CFTR M28668 1.3e-167 YCF1 L35237 Metal resistance proteinWilson Disease 277900 WND U11700 5.9e-161 CCC2 L36317 Probable copper transporterGlycerol Kinase Deficiency 307030 GK L13943 1.8e-129 GUT1 X69049 Glycerol kinaseBloom Syndrome 210900 BLM U39817 2.6e-119 SGS1 U22341 HelicaseAdrenoleukodystrophy, X-linked 300100 ALD Z21876 3.4e-107 PXA1 U17065 Peroxisomal ABC transporterAtaxia Telangiectasia 208900 ATM U26455 2.8e-90 TEL1 U31331 PI3 kinaseAmyotrophic Lateral Sclerosis 105400 SOD1 K00065 2.0e-58 SOD1 J03279 Superoxide dismutaseMyotonic Dystrophy 160900 DM L19268 5.4e-53 YPK1 M21307 Serine/threonine protein kinaseLowe Syndrome 309000 OCRL M88162 1.2e-47 YIL002C Z47047 Putative IPP-5-phosphataseNeurofibromatosis, Type 1 162200 NF1 M89914 2.0e-46 IRA2 M33779 Inhibitory regulator protein

Choroideremia 303100 CHM X78121 2.1e-42 GDI1 S69371 GDP dissociation inhibitorDiastrophic Dysplasia 222600 DTD U14528 7.2e-38 SUL1 X82013 Sulfate permeaseLissencephaly 247200 LIS1 L13385 1.7e-34 MET30 L26505 Methionine metabolismThomsen Disease 160800 CLC1 Z25884 7.9e-31 GEF1 Z23117 Voltage-gated chloride channelWilms Tumor 194070 WT1 X51630 1.1e-20 FZF1 X67787 Sulphite resistance proteinAchondroplasia 100800 FGFR3 M58051 2.0e-18 IPL1 U07163 Serine/threoinine protein kinaseMenkes Syndrome 309400 MNK X69208 2.1e-17 CCC2 L36317 Probable copper transporter

Modified from Mark Gerstein

Page 41: Bioinformatics &  Computational Biology

41

Comparative Genomics Genome/Transcriptome/Proteome/Metab

olome

Databases, statistics•Occurrence of a

specific genes or features in a genome How many kinases in yeast?

•Compare Tissues Which proteins are

expressed in cancer vs normal tissues?• Diagnostic tools• Drug target discovery

Modified from Mark Gerstein

Page 42: Bioinformatics &  Computational Biology

42

Molecular Recognition:Analyzing & Predicting Macromolecular

Interfaces (in DNA, RNA & protein complexes)

Drena Dobbs, GDCBJae-Hyung LeeMichael TerribiliniJeff SanderPete Zaback

Vasant Honavar, Com SFeihong WuCornelia Caragea

Robert Jernigan, BBMBTaner SenAndrzej Kloczkowski

Kai-Ming Ho, Physics

Page 43: Bioinformatics &  Computational Biology

Designing Zinc Finger DNA-binding proteins to recognize specific sites in genomic DNA

Drena Dobbs, GDCBJeff SanderPete Zaback

Dan Voytas, GDCBFenglli Fu

Les Miller, ComSVasant Honavar, ComS

Keith Joung, Harvard

Page 44: Bioinformatics &  Computational Biology

44

Structure & function of human telomerase: Predicting structure & functional sites in a clinically important but "recalcitrant" RNP

www.intl-pag.org/

Cell Biologist:

Biochemist:

Imagined structure:

Lingner et al (1997) Science 276: 561-567.www.chemicon.com

How would a systems biologist study telomerase?

Page 45: Bioinformatics &  Computational Biology

45

Resources for Bioinformatics & Computational Biology

• Wikipedia: Bioinformatics

• NCBI - National Center for Biotechnology Information

• ISCB - International Society for Computational Biology

• JCB - Jena Center for Bioinformatics• UBC - Bioinformatics Links Directory

Page 46: Bioinformatics &  Computational Biology

46

ISU Resources & Experts

ISU Research Centers & Graduate Training Programs:BCB - Bioinformatics & Computational Biology

Baker Center - Bioinformatics & Biological Statistics

CIAG - Center for Integrated Animal GenomicsCILD - Computational Intelligence, Learning &

Discovery

ISU Facilities:Biotech - Instrumentation FacilitiesCIAG - Center for Integrated Animal GenomicsPSI - Plant Sciences Institute

PSI Centers

Page 47: Bioinformatics &  Computational Biology

47

http://www.dnai.org/c/index.html

For fun: DNA Interactive: "Genomes"

A tutorial on genomic sequencing, gene structure,

genes prediction

Howard Hughes Medical Institute (HHMI)Cold Spring Harbor Laboratory (CSHL)