NCSU Summer Institute of Statistical Genetics, Raleigh 2004: Genome Science Session II: Genome Sequencing

Page 1: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

NCSU Summer Institute of Statistical Genetics, Raleigh 2004:

Genome Science

Session II: Genome Sequencing

Page 2: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Genome and EST sequencing

The Summer Institute 2004

•Sequencing Technologies
•Informatics Tools
•Sequencing project approaches
•EST sequencing projects
•Genome sequencing projects
•What we have learned

Page 3: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Some Terms

•Complementary – nucleotide sequences that will form specific hybrids
•Hybridize – duplex formation
•Label – a molecular tag that facilitates detection
•Oligonucleotide – a short single-stranded piece of nucleic acid
•Anneal – to incubate nucleic acid species together under conditions that promote specific hybridization

Page 4: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Why study genomes

•Molecular biology and biochemistry need a point of entry
•Genetics is reliant on phenotype
•Hypothesis driven versus data production - parallels with early Naturalists and modern day physics
•Identify similarities and differences amongst diverse life forms

Page 5: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Data mining vs. Data Dredging

Page 6: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Gene structural features

[Figure: gene structure diagram with labels: sequence read, cDNA, genomic DNA, exon, polyA tail]

•Hybridization of complementary strands
•Specificity of base pairing
•Almost any DNA is clonable
•You can have the same sequence - but different genes

Page 7: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Sequencing Technologies

•Basic principles
•Dideoxy chain termination
•Electrophoretic separation
•Visualization

•Innovations
•Fluorescent tags
•Thermocycling
•Capillary electrophoresis

•Novel methodologies
•Sequencing by hybridization
•Mass spectrometry
•Nanopore sequencing
•Other things of note

Page 8: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Primer extension

5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
3’-AAATCTAGCTAAGCT-5’

5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
AAATCTAGCTAAGCT-5’

•The extended molecule is the reverse complement of the target
•The extended molecule can be tagged for visualization
•Extension occurs via a 3’ hydroxyl group
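The extension step can be sketched in a few lines of Python. This is only an illustration, assuming the primer anneals perfectly and extension runs to the end of the template; the function names `reverse_complement` and `extend_primer` are made up for this example.

```python
# Minimal sketch of primer extension using the template and primer from the slide.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extend_primer(template: str, primer: str) -> str:
    """Extend `primer` (5'->3') along `template` (5'->3') to the template's 5' end."""
    site = reverse_complement(primer)      # stretch of template the primer pairs with
    start = template.find(site)
    if start < 0:
        raise ValueError("primer does not anneal to the template")
    # Bases are added to the primer's 3'-OH, copying the template from the
    # annealing site back toward the template's 5' end.
    copied = template[: start + len(site)]
    return reverse_complement(copied)


template = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
primer = "TCGAATCGATCTAAA"   # the slide's 3'-AAATCTAGCTAAGCT-5', rewritten 5'->3'
product = extend_primer(template, primer)
```

Here `product` begins with the primer and is the reverse complement of the whole template, which is the first bullet above.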

Page 9: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Dideoxy chain termination

Dideoxynucleotides (ddNTPs) terminate extension because they lack a 3’-OH

By mixing ddATP with dATP, a pool of extension products is created in which termination occurs at each available A

The termination products can be separated by size and visualized by labeling either the ddNTP or the primer

5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
ATCGGTCAAATCTAGCTAAGCT-5’

5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
ATCGATCGGTCAAATCTAGCTAAGCT-5’

5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
ATGGCATCGATCGGTCAAATCTAGCTAAGCT-5’
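As a rough sketch, the pool of ddATP termination products for this template can be enumerated in Python. The function name `dd_termination_products` is hypothetical, and the code assumes the primer anneals at the template's 3’ end, as in the slide.

```python
# Sketch: enumerate every extension product that could end in a given ddNTP.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def dd_termination_products(template, primer, dd_base="A"):
    """Return all products (5'->3') that end in `dd_base` beyond the primer."""
    full = reverse_complement(template)    # full-length extension product
    if not full.startswith(primer):
        raise ValueError("primer must anneal at the template's 3' end")
    return [
        full[: i + 1]                      # product terminated at position i
        for i in range(len(primer), len(full))
        if full[i] == dd_base
    ]


template = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
primer = "TCGAATCGATCTAAA"                 # slide primer, written 5'->3'
products = dd_termination_products(template, primer, "A")
lengths = [len(p) for p in products]
```

Each product ends in A, and the distinct lengths in `lengths` are what electrophoresis resolves as separate bands.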

Page 10: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Sanger sequencing

ddNTP/dNTP mixtures are made up for each of the four nucleotides - adenine, cytosine, guanine, thymine

Proportion of dideoxy to deoxy NTP determines the frequency of termination

Products from the four reactions are separated by size and DNA sequence is inferred

[Gel image: lanes labeled T, C, G, A and A, G, C, T]

Invert gel to read the sequence 5’ to 3’
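Reading a four-lane gel amounts to sorting the bands by fragment size and noting which lane each band is in. A minimal sketch follows; the lane data are invented for illustration and `read_gel` is a hypothetical helper name.

```python
# Sketch: infer a sequence from the band positions in four ddNTP lanes.
def read_gel(lanes):
    """`lanes` maps a base to the fragment lengths seen in that base's lane.

    The shortest fragment ends at the first base after the primer, so reading
    bands from shortest to longest gives the new strand 5'->3'.
    """
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical band lengths (primer of 15 nt plus 1 to 7 added bases).
lanes = {
    "G": [16],
    "A": [17, 20, 22],
    "T": [18, 19],
    "C": [21],
}
sequence = read_gel(lanes)
```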

Page 11: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Fluorescent sequencing

Each ddNTP is labeled with a different fluor - now all four products can be run in the same gel lane

Fluorescence is detected using a laser scanner to produce a false color image

Electropherograms (chromatograms) are produced that display peak intensity for each fluor

Can also differentially label the primer to achieve the same end

Page 12: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Cycle sequencing with PCR

Sanger sequencing can require large amounts of template

Polymerase chain reaction exponentially amplifies specific DNAs

Use of ddNTPs allows the combination of amplification and dideoxy terminator sequencing

Page 13: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

High-throughput sequencing

•Dideoxy terminator sequencing is robust and flexible
•Microtiter format
•PCR based cycle sequencing requires less template
•Fluorescent sequencing increased gel capacity 4X
•Supporting robotics upstream of sequencing process
•Computational tools
•Capillary sequencers

Page 14: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Capillary gels

Slab gels make life in the sequencing lab difficult for many reasons:
•Pouring the gel is time consuming and prone to error
•The microtiter plate format (sequencing reactions) has spacing that is different than the gel loading comb - cumbersome
•Assembly and disassembly of the sequencing apparatus is messy and time consuming
•Manual lane tracking is time consuming and prone to error
•Gels never run perfectly - lanes can sometimes run together making lane tracking difficult

Capillary gels help because:
•Each sequencing reaction is run in a separate capillary - there is no lane tracking to worry over
•Matrix for the capillary gel is robotically assembled, injected and QC’d
•Robotic loading of samples is compatible with walk-away capability

Page 15: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Informatics essentials

•Basecallers convert trace data to sequence
•Assemblers form contiguous sequences from small chunks

•Viewers/Editors allow the scientist to interactively work with data

•Databases store sequencing data - from electropherograms to annotation

•Analysis tools compare the sequence against databases of sequences and use algorithms to make educated guesses about the structure and function of a given sequence

Page 16: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Basecalling

•Is the spacing of the peaks what is expected?
•Is there a peak in the electropherogram?
•What fluor is responsible for this peak?
•Since noise ensures the presence of more than one peak, which peak is the correct peak?

•What is the probability that the base that is assigned is the correct base?

•Phred score - Phred 20 (1 error in 100 bases) is a typical quality standard

•TraceTuner - algorithm is similar to Phred but reportedly more accurate with ABI3700 traces, plus accelerated execution

•Others are available
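The Phred scale relates a quality score Q to the estimated error probability P by Q = -10 log10(P), which is why Phred 20 corresponds to 1 error in 100 bases. A small sketch of the conversion, with helper names chosen only for this example:

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the estimated probability of a wrong call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given per-base Phred scores."""
    return sum(phred_to_error_prob(q) for q in qualities)


p20 = phred_to_error_prob(20)        # 1 error in 100 bases
q_for_1_in_1000 = error_prob_to_phred(0.001)
```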

Page 17: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Assembly

• Production of a single contiguous sequence from multiple sequence reads

• The best assembly programs (including Phrap) use probability scores directly from the output of basecallers such as Phred

• Phrap was designed for genome sequencing projects - EST assemblies make different assumptions

• Final assembly products include contigs and singletons

• Accuracy of the contig consensus sequence is based on error models propagated from basecalling software
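To make the terms contig and singleton concrete, here is a toy greedy assembler. It is not how Phrap works: it ignores quality scores, sequencing errors and reverse-complement reads, and simply merges the pair of sequences with the longest exact overlap until none remain. All names and reads are invented for illustration.

```python
# Toy greedy overlap assembly (exact overlaps only, forward strand only).
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=4):
    """Return (contigs, singletons) after repeatedly merging the best overlap."""
    seqs = [(read, 1) for read in reads]       # (sequence, number of reads merged)
    while True:
        best = (0, None, None)
        for i, (a, _) in enumerate(seqs):
            for j, (b, _) in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                             # nothing left to merge
            break
        merged = (seqs[i][0] + seqs[j][0][n:], seqs[i][1] + seqs[j][1])
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]
    contigs = [s for s, count in seqs if count > 1]
    singletons = [s for s, count in seqs if count == 1]
    return contigs, singletons


reads = ["ACGTGATCCC", "GATCCCAGTACC", "AGTACCGTAGC", "TTTTTTGGGG"]
contigs, singletons = greedy_assemble(reads)
```

Three of the reads overlap and collapse into one contig; the fourth shares nothing with the others and is reported as a singleton.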

Page 18: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Viewers/editors

Consed

Break

Page 19: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Storage and analysis of sequence

http://www.ebi.ac.uk/Databases/index.html

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

http://www.ncbi.nlm.nih.gov/

• The amount of sequence information deposited in databases is increasing at a very rapid rate

• Tools to manage sequence data are imperfect and in development
• Development of controlled vocabularies and gene ontologies will facilitate database integration

• Analytical tools and algorithm development are growth industries

Page 20: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Impact of database structure

•Flat file databases are great for speed but are not built for integration
•Lack of controlled vocabularies impedes efficient and reliable searching and inhibits integration

•GenBank uses a controlled index and vocabulary - sort of

•Example of searching for genomic sequence, EST sequence and complete cDNAs

•Relational databases are great for integration but can be slow and changing the schema takes an act of Congress

•Flat file databases with robust re-indexing routines have the advantage of speed and the ability to integrate different data types

Page 21: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Sequencing by hybridization

[Figure: target sequence TGCATGCATGCATGCATG; constituent 6-mers labeled GCATGC (3), CATGCA (9), ATGCAT (2) and TGCATG (12); numbered 12-spot array of known oligos including ACGTAC, CGTACG, TACGTA and GTACGT]

Determine constituent sequences by hybridizing to oligos of known sequence

Assemble sequence fragments into contiguous sequence
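The two steps can be sketched as computing the set of k-mers present in a target (its spectrum) and then chaining k-mers that overlap by k-1 bases. This is a simplified illustration with hypothetical function names; it only reconstructs targets with no repeats of length k-1 or more, and it ignores hybridization errors.

```python
# Sketch: sequencing by hybridization as spectrum + overlap chaining.
def spectrum(seq, k):
    """Set of all distinct k-mers in `seq` (what the array would detect)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct(kmers):
    """Chain k-mers by (k-1)-base overlaps; fail if the path is not unique."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    suffixes = {m[1:] for m in kmers}
    # The first k-mer is the one whose prefix is not the suffix of any k-mer.
    starts = [m for m in kmers if m[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("ambiguous: no unique starting k-mer")
    seq = starts[0]
    remaining = kmers - {seq}
    while remaining:
        nxt = [m for m in remaining if m[:-1] == seq[-(k - 1):]]
        if len(nxt) != 1:
            raise ValueError("ambiguous or broken path")
        seq += nxt[0][-1]
        remaining.remove(nxt[0])
    return seq


target = "ACGTGATCCCAG"                      # made-up, repeat-free example
rebuilt = reconstruct(spectrum(target, 6))
slide_spectrum = spectrum("TGCATGCATGCATGCATG", 6)
```

The repeat-free example is recovered exactly. The slide's target is a tandem repeat, so its spectrum holds only four distinct 6-mers and `reconstruct` raises `ValueError`; repeats are a known weakness of the method.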

Page 22: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Sequencing by mass spectrometry

[Figure: oligos CATGCA, GCATGC, TGCATG and ATGCAT]

Obtain mass spectra from a reference panel of oligos

Fragment the unknown and obtain mass spectra

Deconvolute the data

Page 23: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

454

Page 24: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

US Genomics

Page 25: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Nanopore sequencing

•Two solution filled compartments separated by a membrane with a channel

•Ions flow through the channel in response to an applied voltage

•DNA is negatively charged and will be drawn through the channel

•Channel size allows DNA molecules to be drawn into and through the channel one at a time

•Current is reduced when the channel is occupied by DNA

•Length of current drop is proportional to length of DNA

•Extent of current drop is indicative of physicochemical properties of DNA - thus, one can infer sequence from the trace
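A very simplified picture of how such a trace might be processed: find each run of samples where the current falls below a threshold, then report its duration (related to DNA length) and its mean depth (related to what is in the pore). The trace values, the 0.8 threshold and the function name `blockade_events` are all assumptions made for illustration.

```python
# Sketch: detect current blockades in an idealized, noise-free nanopore trace.
def blockade_events(trace, open_current, threshold=0.8):
    """Return (start_index, duration, mean_drop) for each blockade in `trace`."""
    events, start = [], None
    # A sentinel sample at the open-pore level closes an event at the trace end.
    for i, current in enumerate(list(trace) + [open_current]):
        blocked = current < threshold * open_current
        if blocked and start is None:
            start = i                          # channel has just become occupied
        elif not blocked and start is not None:
            segment = trace[start:i]
            mean_drop = open_current - sum(segment) / len(segment)
            events.append((start, i - start, mean_drop))
            start = None
    return events


trace = [100, 100, 40, 40, 40, 100, 100, 30, 30, 100]   # invented current samples
events = blockade_events(trace, open_current=100)
```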

Page 26: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Sequencing project approaches

•EST projects

•Map-based: assembly based on physical ordering of clones

•Shotgun: assembly based on computational ordering of sequences

•Combination strategies: minimal scaffolding from physical maps, fill in the blanks by shotgun and directed sequencing

Page 27: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

EST sequencing projects

•Only the expressed genome is sequenced, thereby avoiding the “junk”

•Relatively inexpensive and fast - accessible to small laboratories

•May fail to capture many genes because the appropriate biological condition leading to expression is not captured

•May overestimate gene number due to non-overlapping sequences from the same gene

Page 28: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Project is the operative word

Page 29: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Libraries of overlapping clones

•Library clones can be ordered by the presence of restriction sites, known sequences, etc.

•Assembly of contiguous sequences is straightforward because the clones form an ordered array

Page 30: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Map-based sequencing

•Produce large insert libraries in BACs, cosmids, etc. to “cover” the genome multiple times

•Determine a minimal tiling path of clones by restriction mapping, hybridization of end based probes or end sequencing

•Ordered sets of clones are subcloned into pools of small clones

•Smaller clones can be ordered or sequenced by shotgun methods

•Fewer sequencing runs = lower costs
•Obtaining an ordered array of clones can be time consuming
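Once clones have been placed on a physical map, choosing a minimal tiling path is an interval-covering problem. The sketch below assumes each clone is already described by hypothetical map coordinates (name, start, end) and uses the standard greedy rule of always taking the clone that reaches furthest. Real projects would also require a minimum overlap between neighbours, which is omitted here.

```python
# Sketch: greedy minimal tiling path over clones with known map positions.
def minimal_tiling_path(clones, region_start, region_end):
    """Return names of the fewest clones that cover [region_start, region_end)."""
    path, covered = [], region_start
    while covered < region_end:
        # Clones that contain the first uncovered position.
        candidates = [c for c in clones if c[1] <= covered < c[2]]
        if not candidates:
            raise ValueError(f"gap in clone coverage at position {covered}")
        best = max(candidates, key=lambda c: c[2])   # reaches furthest right
        path.append(best[0])
        covered = best[2]
    return path


# Invented clone coordinates (kb) for illustration.
clones = [
    ("A", 0, 100),
    ("B", 50, 180),
    ("C", 90, 150),
    ("D", 170, 300),
    ("E", 140, 260),
]
path = minimal_tiling_path(clones, 0, 300)
```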

Page 31: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Shotgun sequencing

• Produce sequences from random clones irrespective of their physical order along the chromosomes

• Clones can be small insert or large insert because alignment takes into account only the sequence - not properties of the physical clones

• Assemble sequences to produce contigs
• Identify gaps in contiguous sequence and undersequenced areas
• Perform directed sequencing to fill in the gaps
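Identifying gaps and undersequenced areas can be illustrated by counting how many reads cover each position and reporting stretches below a chosen depth. The read coordinates are invented and assumed to be already placed on the assembly; the function names are hypothetical.

```python
# Sketch: per-base coverage and detection of gaps / thin regions.
def coverage(reads, length):
    """Depth at each position, given reads as half-open (start, end) intervals."""
    depth = [0] * length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def low_coverage_regions(depth, min_depth=1):
    """Half-open (start, end) intervals where depth is below `min_depth`."""
    regions, start = [], None
    # A sentinel at `min_depth` closes a region that runs to the end.
    for pos, d in enumerate(depth + [min_depth]):
        if d < min_depth and start is None:
            start = pos
        elif d >= min_depth and start is not None:
            regions.append((start, pos))
            start = None
    return regions


reads = [(0, 10), (5, 15), (20, 30)]          # made-up read placements
depth = coverage(reads, 35)
gaps = low_coverage_regions(depth, min_depth=1)
thin = low_coverage_regions(depth, min_depth=2)
```

Regions in `gaps` have no reads at all and are candidates for directed sequencing; `thin` additionally includes areas covered by only a single read.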

Page 32: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Shotgun sequencing issues

•Assembly is computationally intensive
•Repetitive sequences have to be masked so that they do not confound the preliminary alignment
•First pass alignment based upon non-masked sequences to produce contiguous sequence fragments
•Alignments must account for potential polymorphisms
•Repetitive sequences still need to be aligned - their treatment is however distinct from non-repetitive sequences
•Resolution of conflicts in the assembly is challenging
•When is a genome truly finished?
•The press release is only the beginning of the process

Page 33: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Complementary strategies

• Pure shotgun approaches are likely to leave significant gaps

• Directed sequencing of specific regions is necessary to fill in the gaps

• Pure map-based strategies are cumbersome and time consuming and do not take advantage of efficiencies of scale found in modern industrial sequencing

• A complementary approach combines data from both approaches

• There are adherents to working from the bottom-up and working from the top-down

Page 34: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Genome sequencing projects

Page 35: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

What is a gene

•ESTs and cDNAs identify those parts of the genome that are actually transcribed

•Transcripts have structural features including starts, stops and open reading frames

•Computers can be trained to “sniff” for relevant features in the sequence

•Genefinding algorithms construct probability models based on presence of one or more gene-like features

•Coordination with genetic features gives a comfort level because it is empirical

•Computational methods that rely on similarity to “known” genes in databases can be perilous - a sort of regressive uncertainty
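The simplest structural feature a program can look for is an open reading frame: a start codon followed in the same frame by a stop codon. The scan below is a bare-bones sketch on the forward strand only, not a genefinding algorithm; real genefinders combine many such signals in a probability model. The function name, the example sequence and the `min_codons` cutoff are assumptions for illustration.

```python
# Sketch: scan the three forward reading frames for ATG...stop ORFs.
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) per ORF; `end` is exclusive and includes the stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first start codon in this stretch
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:   # codons before the stop
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


seq = "CCATGGCTGCTTAAGG"                     # made-up example sequence
orfs = find_orfs(seq)
```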

Page 36: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

BLAST

BLAST Example

Example Sequence

Page 37: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

How to make a human

Page 38: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

The Human Genome Project

http://www.nature.com/genomics/human/index.html

http://www.sciencemag.org/content/vol291/issue5507/

Page 39: NCSU Summer Institute of Statistical Genetics, Raleigh 2004:                     Genome Science

Genome information challenges

Link to Ensembl

Link to Entrez Genomes

Link to SachDB

Link to TAIR

Link to KEGG

Link to ExPASy

Link to FlyBase

• Data integration from sequence, mutant analysis, mapping, expression analysis, metabolic profiling, and other data types will be the primary challenge to biological science in the 21st century

• Informatics tools are in their infancy
• The literature is growing at a rate surpassing sequence data
• Importance of statistics cannot be overstated
• Gene annotation is regressive
• Danger of balkanization of data?
• Is natural language processing the holy grail?
