Introduction to Sequencing Technologies
August 22, 2014 Manpreet Katari
Why do we sequence?
• Genome Annotation: A complete genome sequence provides us with the raw data to
construct a "parts list".
• Comparative Genomics: Conserved regions in the genome are more likely to play an
important role in biology of the species.
• Functional Genomics: Sequencing the RNA provides us with an insight into the
transcriptionally active regions of the genome.
• Population Genetics and Genomics: Genetic structure and diversity reveal the history and distribution of
phenotypic traits (e.g. disease susceptibility alleles)
• Genetic Analysis: Map and characterize the molecular basis of allelic variants
Examples of Large Genome Projects
• 1000 Genomes Project (www.1000genomes.org). An effort to sequence the genomes of 1000 people to identify genetic variants that occur in at least 1% of the human population.
• 1001 Arabidopsis thaliana Genomes Project (www.1001genomes.org). Study the genomes and phenotypes of 1001 accessions to explain differences in phenotype caused by adaptation to different conditions.
• Metagenomics – Human Microbiome Project (http://commonfund.nih.gov/hmp/): Sequencing of DNA samples from environments, for example mouth, skin, and digestive system, to identify the different bacterial species present.
Evolution of Sequencing Technology
• 1977: Sanger sequencing (manual)
  – Throughput: tens of thousands of bp per person-year (~400-500 bp/run)
• 1986: Automated Sanger sequencing, Lee Hood (first semi-automated machine)
  – 99.999% accuracy
  – $0.50/kilobase
  – Throughput: thousands of bp per run (multiplexed), ~millions of bp per year (~700-800 bp/run)
• 2000's: Cyclic array sequencing (sequencing by synthesis, sequencing by hybridization)
  – Throughput: millions of bp per run
⇒ A 3’ hydroxyl group is essential for chain elongation
Sanger DNA Sequencing
CHAIN TERMINATOR
32P-labeled autoradiogram
Sanger DNA Sequencing
• Anneal primer to denatured template
• Add dNTPs, polymerase, and ddNTP
• Run the labeled strands (template + product) on a denaturing polyacrylamide gel
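The chain-termination idea on these slides can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions: the function names, the termination probability and the number of molecules are all hypothetical, and the template is borrowed from the capillary electrophoresis figure in this deck. Each growing strand stops when a ddNTP is incorporated, and sorting the products by length recovers the sequence of the new strand.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def synthesize_fragments(template_3to5, n_molecules=2000, p_terminate=0.2, seed=0):
    """Simulate chain-terminated products for a template written 3'->5'.

    The new strand grows 5'->3' by base pairing. At every step a ddNTP is
    incorporated with probability p_terminate (an assumed value), which stops
    elongation because a ddNTP lacks the 3' hydroxyl group.
    """
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_molecules):
        strand = []
        for base in template_3to5:
            strand.append(COMPLEMENT[base])
            if rng.random() < p_terminate:
                fragments.add("".join(strand))  # terminated product
                break
    return fragments

def read_gel(fragments):
    """Sort products by length (shortest migrates farthest) and read each terminal base."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

template = "GACTGAAGCTGTT"  # written 3'->5'
products = synthesize_fragments(template)
print(read_gel(products))   # sequence of the synthesized strand, 5'->3'
```

With enough molecules, every possible length is represented at least once, which is why the ladder of bands can be read base by base.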
Capillary Gel Electrophoresis
3'- G A C T G A A G C T G T T -5'
⇒ Fluorescent dye vs. radioactive label on dNTPs
⇒ Sequencing reaction is performed in a single tube with a mixture of fluorescently labeled ddNTPs
⇒ Reaction is electrophoresed in a single denaturing capillary gel (96 samples run at once using robotics)
⇒ Different wavelengths emitted by fluorescent dyes are automatically detected upon laser excitation
⇒ Computer software automatically reads sequence and assesses quality
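As an illustration of what "assesses quality" can mean in practice, base callers commonly report a Phred-scaled score, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The slides do not name the scale, so treating it as the one in use here is an assumption; the function names below are hypothetical.

```python
import math

def phred_from_error(p_error):
    """Phred-scaled quality for an estimated per-base error probability."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    """Inverse conversion: error probability implied by a Phred score."""
    return 10.0 ** (-q / 10.0)

# 99.999% accuracy corresponds to an error probability of 1e-5.
print(round(phred_from_error(1e-5)))
# One error in 10,000 bases corresponds to an error probability of 1e-4.
print(round(phred_from_error(1e-4)))
```

On this scale, every 10 points means a tenfold lower error probability.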
Laser Detector
Trace file
ssDNA to be sequenced
5'- C G A A G T C A G -3'
Radioactive vs. Fluorescent Sequencing
• Radioactive: 32P-labeled dNTPs, one lane per base, autoradiogram
• Fluorescent: 4 different fluorophores, single capillary gel, laser detector
Automated Sequencing
• Perhaps the most important contribution to large-scale sequencing was the development of automated sequencers.
• The industry is currently transitioning from Sanger sequencers to second-generation cycle sequencers for large-scale applications.
• Automated Sanger sequencers still fill an important niche, but mostly for small-scale applications, such as checking that clones contain the expected sequence.
• Next-gen platforms are currently very expensive, but table-top models will probably be standard laboratory equipment in 5-10 years.
Automated Sequencers
MegaBACE
• Made by Amersham
• 96 capillaries
• Robotic loading from 384-well plate
• Two to four hours per run
• Up to 800 bases per read
ABI 3700
• Made by Applied Biosystems
• Most widely used
• 96 capillaries
• Robotic loading from 384-well plates
• Two to three hours per run
• 600–700 bases per read
ABI traces
Video Demos
Sanger Sequencing - http://youtu.be/nudG0r9zL2M
Workflow of conventional vs. second-generation sequencing
• High-throughput shotgun Sanger sequencing (96 or 384 long reads per run): template amplification, Sanger cycle sequencing, capillary electrophoresis
• Cyclic array shotgun sequencing (millions of short reads per run): template immobilization, (template amplification), sequencing by synthesis or hybridization
Cost of Sequence per megabase
Template Immobilization Strategies
Illumina
Figure from M. Metzker, Nat Rev Genet, Jan. 2010
Video Demos
Illumina - http://youtu.be/womKfikWlxM
PacBio
Figure from M. Metzker, Nat Rev Genet, Jan. 2010
Video Demos
PacBio - http://youtu.be/NHCJ8PtYCFc
J. Craig Venter Celera Genomics
Francis Collins Human Genome Project
Road to Human Genome
Map-based sequencing I
• Human Genome Project adopted a map-based strategy
  – Start with well-defined physical map
  – Produce shortest tiling path for large-insert clones
  – Assemble the sequence for each clone
  – Then assemble the entire sequence, based on the physical map
Map-based sequencing II
Construct clone map and select mapped clones
Generate several thousand sequence reads per clone
Assemble
Physical mapping
• Determination of physical distance between two points on a chromosome
  – Distance in base pairs
  – Example: between a physical marker and a gene
• Need overlapping fragments of DNA
  – Requires vectors that accommodate large inserts
  – Examples: cosmids, YACs, and BACs
BACs and PACs
• Most commonly used vectors for large-scale sequencing
• Good compromise between insert size and ease of use
• Growth and isolation similar to that for plasmids
Contigs
• Contigs are groups of overlapping pieces of chromosomal DNA
  – Make contiguous clones
• For sequencing, one wants to create a “minimum tiling path”
  – Contig of the smallest number of inserts that covers a region of the chromosome
Figure labels: genomic DNA, contig, minimum tiling path
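Choosing a minimum tiling path can be viewed as an interval-cover problem. The sketch below uses the standard greedy strategy (always take the clone that starts inside the covered region and reaches farthest). The clone names and coordinates are made up for illustration, and the sketch ignores the practical requirement that neighbouring clones share enough overlap to be joined confidently.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy interval cover.

    clones: list of (name, start, end) tuples in half-open [start, end) coordinates.
    Returns the smallest list of clones covering [region_start, region_end).
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones that start at or before the covered point,
        # keep the one that extends farthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path

# Hypothetical BAC clones with positions in kb.
clones = [
    ("BAC1", 0, 150), ("BAC2", 50, 200), ("BAC3", 100, 320),
    ("BAC4", 180, 300), ("BAC5", 300, 450), ("BAC6", 310, 500),
]
print([name for name, _, _ in minimum_tiling_path(clones, 0, 500)])
```

Here three of the six clones are enough to span the region, so only those three would need to be sequenced.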
Contigs from overlapping restriction fragments
• Cut inserts with restriction enzyme
• Look for similar pattern of restriction fragments
  – Known as “fingerprinting”
• Line up overlapping fragments
• Continue until a contig is built
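The fingerprinting step can be illustrated with a simple band-matching sketch. Everything here is an assumption made for illustration: the fragment sizes, the size tolerance and the 50% cutoff are invented, and dedicated fingerprint software uses a more careful statistical criterion than this plain fraction of shared bands.

```python
def shared_bands(fp_a, fp_b, tolerance=5):
    """Count fragments in fp_a that match an unused fragment in fp_b.

    Fingerprints are lists of restriction fragment sizes in bp; two bands
    match when their sizes differ by at most `tolerance` bp.
    """
    unused = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        for j, other in enumerate(unused):
            if abs(size - other) <= tolerance:
                shared += 1
                del unused[j]  # each band can be matched only once
                break
    return shared

def likely_overlap(fp_a, fp_b, min_fraction=0.5, tolerance=5):
    """Call two clones overlapping if enough of the smaller fingerprint is shared."""
    shared = shared_bands(fp_a, fp_b, tolerance)
    return shared / min(len(fp_a), len(fp_b)) >= min_fraction

# Hypothetical fingerprints for three clones.
clone_a = [1200, 2500, 3100, 4800, 7000]
clone_b = [2502, 3098, 4801, 5600, 9100]
clone_c = [800, 1500, 6100, 8800]

print(shared_bands(clone_a, clone_b), likely_overlap(clone_a, clone_b))
print(shared_bands(clone_a, clone_c), likely_overlap(clone_a, clone_c))
```

Clones that pass the test are lined up next to each other, and repeating the comparison across the library extends the contig.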
Gel image processing
FPC: fingerprint analysis window
The C. elegans genome project
The first multicellular organism to have its genome fully sequenced (97 million bases)
The sequence was completed in 1998
⇒ The minimum tiling path, or “The Golden Path”
Mapless sequencing
• Alternative solution: fragment entire genome
  – Sequence each fragment
  – Assemble overlapping sequences to form contiguous sequence
• Focus on principles and techniques of mapping and sequencing
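A minimal sketch of "assemble overlapping sequences to form contiguous sequence" is a greedy merge: repeatedly join the two reads with the longest suffix-to-prefix overlap. The reads are invented, the function names are hypothetical, and the sketch ignores sequencing errors, reverse complements and repeats, all of which real assemblers must handle.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            return contigs
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

# Three hypothetical shotgun reads, given in arbitrary order.
reads = ["GTTACGGA", "CGAAGTCA", "GTCAGTTA"]
print(greedy_assemble(reads))
```

The greedy choice is easy to follow but can join the wrong reads when a sequence occurs more than once in the genome.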
Whole-genome shotgun sequencing I
• Developed by Celera
  – Subsidiary of Applied Biosystems, maker of automated sequencers
• No mapping
• Instead, the whole genome is sheared
• Randomly sequenced
Whole-genome shotgun sequencing II
Generate tens of millions of sequence reads
Assemble
Whole-genome shotgun sequencing III
• Major challenge: assembly
  – Repetitive elements are the biggest problem
• Performed on very high-speed computers, using novel software
• Key to assembly is paired reads
  – Sequence both ends of each clone
Milestone: 26 June 2000, White House press conference with Bill Clinton
• HGP: started 1990
  – ~22.1 billion nucleotides of sequence data
  – 7-fold coverage
  – Unfinished (24% completely finished, 50% near-finished)
• Celera: started 1998
  – ~14.5 billion nucleotides of sequence data
  – 4.6-fold coverage
  – Complete assembled genome with >99% coverage
• The first assembled draft of the human genome was published simultaneously in Nature and Science, 15 and 16 February 2001 (Nature published 1 day earlier).
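Fold coverage is total sequenced bases divided by genome size, so the two sets of figures quoted for June 2000 can be checked against each other. The short calculation below uses only the numbers on the slide; the function name is hypothetical.

```python
def implied_genome_size(total_bases, fold_coverage):
    """Genome size implied by coverage = total_bases / genome_size."""
    return total_bases / fold_coverage

hgp = implied_genome_size(22.1e9, 7.0)      # HGP: ~22.1 billion nt at 7-fold
celera = implied_genome_size(14.5e9, 4.6)   # Celera: ~14.5 billion nt at 4.6-fold

print(f"HGP implies    {hgp / 1e9:.2f} Gb")
print(f"Celera implies {celera / 1e9:.2f} Gb")
```

Both projects' figures imply a genome of a little over 3 billion bases, so the two coverage values are mutually consistent.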
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Hierarchical vs shotgun
Hybrid approach
• Combines aspects of both map-based and whole-genome shotgun approaches
  – Map clones
  – Sequence some of the mapped clones
  – Do whole-genome sequencing
  – Combine information from both methods
• Use sequence from mapped clones as scaffold to assemble whole-genome shotgun reads
• Used for sequencing the mouse genome
Genome Assembly
Whole Genome Shotgun Sequencing
Comparison of overlap graph and de Bruijn graph for assembly
Schatz et al. Genome Research 2010, Analysis of large genomes
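To make the de Bruijn graph idea concrete, the sketch below breaks reads into k-mers, uses the (k-1)-mers as nodes, and walks the graph while the path is unambiguous. The reads and function names are hypothetical, and real assemblers add error correction and handle both strands, which this sketch does not.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous_path(graph, start):
    """Extend the sequence while the current node has exactly one successor."""
    sequence = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop rather than loop forever on a cycle
            break
        sequence += nxt[-1]
        visited.add(nxt)
        node = nxt
    return sequence

reads = ["CGAAGTCA", "GTCAGTTA", "GTTACGGA"]  # hypothetical reads

# k = 5: every 4-mer node is unique, so the walk spells the whole sequence.
print(walk_unambiguous_path(de_bruijn_graph(reads, 5), "CGAA"))
# k = 4: the 3-mer AGT occurs twice, creating a branch where the walk stops.
print(walk_unambiguous_path(de_bruijn_graph(reads, 4), "CGA"))
```

The second call shows on a small scale why repeats are the main obstacle in assembly: a repeated node gives the path more than one way to continue.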
Example of Tour Bus error correction.
Zerbino D R , and Birney E Genome Res. 2008;18:821-829
Copyright © 2008, Cold Spring Harbor Laboratory Press
End Reads (Mates)
Primer
Central steps of the assembly
Finishing I
• Process of assembling raw sequence reads into accurate contiguous sequence
  – Required to achieve 1/10,000 accuracy
• Manual process
  – Look at sequence reads at positions where programs can’t tell which base is the correct one
  – Fill gaps
  – Ensure adequate coverage
Figure labels: gap, single stranded
Finishing II
• To fill gaps in sequence, design primers and sequence from the primer
• To ensure adequate coverage, find regions where coverage is insufficient and use specific primers for those areas
Figure labels: gap, primers
Assembly Progression (Macro View)
Each nucleotide sequenced many times
Lander-Waterman Model
Rough-draft and skimming sequence
• Rough-draft sequence refers to an average of 5x coverage
• Skimming is 1–3x coverage
• Obtains 67%–97% of the sequence
• On average, 99% accurate
• Of greatest use when the sequence can be compared to a reference sequence
• For example, chimpanzee genome compared with human genome
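The link between coverage and the fraction of the genome obtained can be approximated with the idealized Poisson estimate associated with the Lander-Waterman model: if reads land at random and average coverage is c, the chance that a given base is covered at least once is about 1 - e^(-c). This is a simplification (it ignores read length, cloning bias and the overlap needed to join reads), so it should be read as a rough guide rather than as the source of the percentages on the slide.

```python
import math

def expected_fraction_covered(coverage):
    """Idealized estimate: P(base covered at least once) = 1 - e^(-coverage)."""
    return 1.0 - math.exp(-coverage)

for c in (1, 3, 5, 7):
    print(f"{c}x coverage: {expected_fraction_covered(c):.1%} of bases expected to be covered")
```

For 1x and 3x the estimate gives roughly 63% and 95%, in the same range as the 67%–97% quoted for skimming, and it shows why returns diminish quickly once coverage passes about 5x.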
Figure labels: DNA, RNA, cDNA, protein, phenotype
[1] Transcription [2] RNA processing (splicing) [3] RNA export [4] RNA surveillance
The Expressed Genome
• Sequencing mRNA:
  – gene discovery
  – define gene structures
  – define differences in mRNA processing: alternative transcription start sites, alternative exons and splice junction usage, alternative polyA site usage
• mRNA profiling:
  – characterize functional differences between developmental stages and tissues
• Small RNA profiling:
  – discover small RNAs, e.g. miRNA, siRNA, piRNA
  – these often have regulatory functions
EST sequencing
• Extract RNA from different developmental stages and tissues
• Make cDNA library
• Select clones at random
• Sequence in from one or both ends
• One-pass sequencing
• The resulting sequence = expressed sequence tag (EST)
Figure labels: muscle mRNA, cDNA libraries, LIMS, robotic stations, DNA sequencers
Partial sequence of a cDNA (5’ to 3’) = EST
ESTs Mapped to Genome
EST sequencing: pros and cons
• Advantages
• Relatively inexpensive
• Certainty that sequence comes from transcribed gene
• Information about tissue and developmental stage
• Long contiguous sequence often spanning introns
• Can provide clear boundary for ends of transcripts
• Disadvantages
• No regulatory information
• Usually <60% of genes found in EST collections (random sampling of transcripts based on abundance)
• Most ESTs are not full-length (higher representation of 3' ends)
Questions that can be addressed with genome-wide expression analysis:
• What genes have similar function?
• What regulatory pathways exist?
• Can we subdivide experiments or genes into meaningful classes?
• Can we correctly classify an unknown experiment or gene into a known class?
• Can we make better treatment decisions for a cancer patient based on his or her gene expression profile?
Microarrays vs Northern blots: from Gene to Genome Science
• Northern blot: limited by number of lanes in gel
• Microarray: a large number of DNA fragments are attached in a systematic way to a solid substrate, so mRNA levels can be measured for thousands of genes (~ every gene in a genome) in parallel
Three types of array manufacture
• On-chip oligonucleotide synthesis
  – Photolithography: Affymetrix (~25-mers)
  – Ink-jet printing: Agilent (~60-mers)
• Spotted microarrays
  – Long dsDNA (typically genomic PCR products)
Affymetrix gene chip
Probe pairs
• Oligos are selected from a region of the gene that has low similarity to other genes.
Perfect match: ATGTTTGACGCTGCGTAGATCCGAG
Mismatch: ATGTTTGACGCTACGTAGATCCGAG
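The two probe sequences above can be compared directly to see how a probe pair is built. The helper below is a hypothetical illustration, not part of any vendor software; it simply reports the 1-based positions at which the two oligos differ.

```python
def mismatch_positions(perfect_match, mismatch):
    """Return 1-based positions where the two probe sequences differ."""
    if len(perfect_match) != len(mismatch):
        raise ValueError("probes must have equal length")
    return [
        i + 1
        for i, (p, m) in enumerate(zip(perfect_match, mismatch))
        if p != m
    ]

pm = "ATGTTTGACGCTGCGTAGATCCGAG"  # perfect match probe from the slide
mm = "ATGTTTGACGCTACGTAGATCCGAG"  # mismatch probe from the slide

print(len(pm), mismatch_positions(pm, mm))
```

The probes are 25-mers that differ only at the central base (position 13), so the mismatch probe serves as a control for non-specific hybridization to the perfect match probe.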
MicroArray
Microarray hybridization
Spotted microarrays
• Competitive hybridization: two labeled cDNA samples (experimental and control) hybridized to the same slide
• Cy3 and Cy5 dye labeling; the dyes fluoresce at different wavelengths
Affymetrix GeneChips
• One labeled RNA population per chip
• Biotin labeling, binds to fluorescently labeled avidin
• Comparison is made between hybridization intensities of the same oligonucleotides on different chips
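For a two-colour spotted array, the competitive hybridization is usually summarized per spot as a log ratio of the two dye intensities. The sketch below assumes the experimental sample is labeled with Cy5 and the control with Cy3 (the slides do not say which is which), uses invented intensities, and adds a small pseudocount so that empty spots do not cause a division by zero. Real analyses also subtract background and normalize between dyes.

```python
import math

def log2_ratio(cy5_intensity, cy3_intensity, pseudocount=1.0):
    """log2(experimental / control) for one spot; positive means up in the experimental sample."""
    return math.log2((cy5_intensity + pseudocount) / (cy3_intensity + pseudocount))

# Hypothetical spot intensities: (gene, Cy5 experimental, Cy3 control)
spots = [("geneA", 799, 199), ("geneB", 99, 399), ("geneC", 499, 499)]

for gene, cy5, cy3 in spots:
    print(f"{gene}: log2 ratio = {log2_ratio(cy5, cy3):+.2f}")
```

On the log scale, a fourfold increase and a fourfold decrease are symmetric around zero, which is why log ratios are preferred over raw ratios.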
mRNA
cDNA
DNA microarray
samples
Microarray Animation
Spotted glass microarray
Comparisons of Gene Expression across samples
Figure: expression of Gene A and Gene B (axis 0 to 45) in two panels: whole body and liver; brain, kidney, liver and lung
Transcriptomics using RNA-seq
RNA-seq provides even more
Candidate new and revised exons
Reproducibility, linearity and sensitivity.
N-regulation of mRNA: Illumina vs ATH1
Figure: scatter plot of Log (KNO3/KCl) from Illumina mRNA sequencing against Log (KNO3/KCl) from Affymetrix ATH1 chips
R2 = 0.85
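An R2 value like the one reported for this platform comparison is the squared Pearson correlation between the per-gene log ratios from the two platforms. The sketch below shows the calculation on invented numbers; the data are placeholders and will not reproduce the value on the slide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log ratios (treatment vs control) for the same five genes.
array_log_ratio = [-2.0, -1.0, 0.0, 1.0, 2.0]
rnaseq_log_ratio = [-2.4, -0.7, 0.2, 0.9, 2.6]

r = pearson_r(array_log_ratio, rnaseq_log_ratio)
print(f"R^2 = {r ** 2:.2f}")
```

A high R2 indicates that genes called up or down on one platform tend to change in the same direction and by a similar amount on the other.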
Comparison of platforms for detecting gene expression
AFFY Gene Chip Illumina
All protein coding genes are represented X
Can detect all the different types of RNA X
Cost (including analyzing data) X
Can determine gene regulation X X
Requires pre-existing knowledge of gene sequence X