Genome Alignment & Assembly
Chandrasekar A.
Sequence Assembly
• In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original order of the sequence
• This is needed as DNA sequencing technology cannot read whole genomes in one stretch, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used.
Genome Assemblers
• Genome assemblers are variants of simpler sequence alignment programs that piece together the vast quantities of fragments generated by automated sequencing instruments.
Tools and Software for Assembly
• TIGR Assembler
• Velvet (de novo)
• Maq (reference)
• Reference assembly and alignment using the BWA tool, with visualization of the alignment using SAM
Anatomy of a WGS Assembly
[Figure: anatomy of a WGS assembly. Labeled elements: chromosome, STS, STS-mapped scaffolds, contig, gap (mean & std. dev. known), read pair (mates), consensus, reads (of several haplotypes), SNPs, external “reads”.]
Order & Orientation
• Assembly without pairs results in contigs whose order and orientation are not known.
• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (mean & std. dev. is known).
[Figure: reads and their consensus (15-30 Kbp) forming a contig; a 2-pair linking contigs into a scaffold.]
Overlap between two sequences
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
overlap (19 bases) overhang (6 bases)
• overlap - region of similarity between the two sequences
• overhang - unaligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in the overlap region
• maximum overhang size
In this example, 18 of the 19 overlapping bases match: % identity = 18/19 = 94.7%
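The 18/19 figure for the example overlap can be checked with a short Python sketch. The slice offsets (14 and 8) were read off the two example sequences, and the threshold defaults in `passes_screen` are hypothetical values chosen for illustration, not the parameters of any particular assembler.

```python
def percent_identity(x: str, y: str) -> float:
    """Percentage of matching positions between two equal-length segments."""
    if len(x) != len(y) or not x:
        raise ValueError("segments must be non-empty and of equal length")
    matches = sum(1 for p, q in zip(x, y) if p == q)
    return 100.0 * matches / len(x)


def passes_screen(overlap_len: int, identity: float, overhang: int,
                  min_overlap: int = 40, min_identity: float = 94.0,
                  max_overhang: int = 10) -> bool:
    """Apply the three screening criteria (threshold values are assumptions)."""
    return (overlap_len >= min_overlap
            and identity >= min_identity
            and overhang <= max_overhang)


# The two sequences from the example (ellipses dropped).
read_a = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
read_b = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"

overlap_a = read_a[14:33]   # 19 bases of read_a inside the overlap
overlap_b = read_b[8:27]    # the 19 bases of read_b aligned to them
overhang_a = read_a[33:]    # unaligned end of read_a

identity = percent_identity(overlap_a, overlap_b)
print(len(overlap_a), len(overhang_a), round(identity, 1))
```

With the default 40-base minimum the 19-base example overlap would be rejected; lowering `min_overlap` to 15 lets it pass, since its identity is above 94% and its overhang is 6 bases.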
Assembly Pipeline
Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II
Overlapper
• Find all overlaps of at least 40bp allowing 6% mismatch.
• An overlap between reads A and B implies either a TRUE overlap (A and B adjoin in the genome) or a REPEAT-INDUCED one.
Unitiger
• Compute all overlap-consistent sub-assemblies: unitigs (uniquely assembled contigs).
Overlap Graph
Edge types:
• Regular dovetail
• Prefix dovetail
• Suffix dovetail
Edges are annotated with deltas of overlaps.
[Figure: pairs of reads A and B illustrating each edge type.]
The Unitig Reduction
1. Remove “transitively inferrable” overlaps.
[Figure: reads A, B and C; the overlap between A and C can be inferred from the overlaps A-B and B-C, so it is removed.]
2. Collapse “unique connector” overlaps.
[Figure: reads A and B joined by a unique connector are collapsed into one node; edge labels 412, 352, 45.]
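The two reduction steps can be sketched on a toy overlap graph in Python. This is a minimal illustration, assuming the graph is a plain dictionary mapping each read to the set of reads it overlaps on its right; the function names are hypothetical, and a real unitiger also tracks orientations and overlap deltas.

```python
def remove_transitive_edges(graph):
    """Drop an edge a->c whenever a->b and b->c already exist."""
    reduced = {node: set(succs) for node, succs in graph.items()}
    for a, succs in graph.items():
        for b in succs:
            for c in graph.get(b, ()):
                if c != b and c in succs:
                    reduced[a].discard(c)
    return reduced


def collapse_unique_connectors(graph):
    """Chain nodes joined by an edge that is the only way out of its source
    and the only way into its target; each chain is one unitig."""
    nodes = set(graph) | {t for succs in graph.values() for t in succs}
    preds = {n: [p for p in graph if n in graph[p]] for n in nodes}
    unitigs = []
    for node in sorted(nodes):
        # Skip nodes that are absorbed into the chain of their predecessor.
        if len(preds[node]) == 1 and len(graph[preds[node][0]]) == 1:
            continue
        chain = [node]
        while True:
            succs = graph.get(chain[-1], set())
            if len(succs) != 1:
                break
            nxt = next(iter(succs))
            if len(preds[nxt]) != 1:
                break
            chain.append(nxt)
        unitigs.append(chain)
    return unitigs


# A overlaps B and C, B overlaps C, C overlaps D.
overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": {"D"}, "D": set()}
reduced = remove_transitive_edges(overlaps)
print(reduced)
print(collapse_unique_connectors(reduced))
```

In the example, the edge A-C disappears in step 1 and the remaining chain A, B, C, D collapses into a single unitig in step 2.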
Identifying Unique DNA Stretches
• Arrival intervals of reads differ between a unique DNA unitig and a repetitive DNA unitig.
• The discriminator statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.
• Scale: -10 and below is definitely repetitive, +10 and above is definitely unique, and values around 0 are “don’t know”.
[Figure: distributions of the statistic for unique and for repetitive unitigs.]
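A log-odds score of this kind can be sketched in Python under one stated assumption: reads start along the genome as a Poisson process. If a unitig of length L is unique, the expected number of read starts is λL with λ = (total reads) / (genome length); if it is a collapsed 2-copy repeat, the expectation is 2λL. Dividing the two Poisson probabilities of seeing k reads gives e^(λL) / 2^k, so the natural-log odds are λL − k·ln 2. The function names and the use of natural logarithms are choices made for this sketch.

```python
import math


def discriminator(unitig_length: int, reads_in_unitig: int,
                  total_reads: int, genome_length: int) -> float:
    """Natural-log odds that a unitig is unique rather than 2-copy DNA,
    assuming Poisson-distributed read starts (an assumption of this sketch)."""
    rate = total_reads / genome_length          # read starts per base
    expected_if_unique = rate * unitig_length
    return expected_if_unique - reads_in_unitig * math.log(2)


def classify(score: float, threshold: float = 10.0) -> str:
    """Map a score onto the three regions of the -10 / 0 / +10 scale."""
    if score >= threshold:
        return "definitely unique"
    if score <= -threshold:
        return "definitely repetitive"
    return "don't know"


# Hypothetical numbers: 8,000 reads over a 1 Mbp genome, 10 kbp unitigs.
for k in (80, 105, 160):
    score = discriminator(10_000, k, 8_000, 1_000_000)
    print(k, round(score, 1), classify(score))
```

A unitig holding about the expected 80 reads scores well above +10, one holding twice as many scores well below -10, and intermediate counts fall in the undecided band.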
Scaffolder
• Scaffold U-unitigs with confirmed pairs (mated reads).
Repeat Rez I, II
• Fill repeat gaps with doubly anchored positive unitigs (unitig > 0).
Assembly gaps
• Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap.
• Physical gap: no information is known about the adjacent contigs or about the DNA spanning the gap.
Assembly paradigms
• Overlap-layout-consensus
  • greedy (TIGR Assembler, phrap, CAP3...)
  • graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)
TIGR Assembler/phrap
Greedy
• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
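The greedy loop above can be sketched in a few lines of Python. This is a simplified illustration, not the TIGR Assembler or phrap implementation: it scores an overlap simply by the length of an exact suffix-prefix match, ignores sequencing errors and reverse complements, and uses a hypothetical `min_len` cutoff.

```python
def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:          # no more merges can be done
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

Because each step commits to the locally best merge, repeats can mislead this strategy, which is one reason the graph-based approaches exist.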
Overlap-layout-consensus
• Main entity: read
• Relationship between reads: overlap
[Figure: reads 1-9 and their overlap graph; the reads are laid out by their overlaps, and the aligned reads ACCTGA, ACCTGA, AGCTGA, ACCAGA are combined into a consensus.]
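The consensus step can be illustrated with a per-column majority vote in Python. The sketch assumes the 24-base string in the figure is four stacked 6-base reads (ACCTGA, ACCTGA, AGCTGA, ACCAGA) that are already aligned; real consensus callers also weigh base qualities and handle gaps.

```python
from collections import Counter


def consensus(aligned_reads):
    """Most common base in each column of equally long, pre-aligned reads."""
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("reads must be non-empty and of equal length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_reads))


layout = ["ACCTGA", "ACCTGA", "AGCTGA", "ACCAGA"]
print(consensus(layout))
```

The two single-base disagreements (G in the third read, A in the fourth) are each outvoted three to one.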
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure: a graph with nodes A-I and the corresponding order of the nodes along the genome.]
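In assembly terms, each node is a read and each edge an overlap, so an ordering of reads along the genome is a path that visits every node once. A brute-force Python sketch follows; it searches for a path rather than a circuit on the assumption that a linear genome need not return to the start, and the function name and edge set are made up for illustration. The search tries every permutation, so it is only usable on tiny graphs.

```python
from itertools import permutations


def hamiltonian_paths(nodes, edges):
    """Yield every ordering of `nodes` whose consecutive pairs are all edges."""
    for order in permutations(nodes):
        if all((u, v) in edges for u, v in zip(order, order[1:])):
            yield order


# Hypothetical directed overlaps between four reads.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")}
print(list(hamiltonian_paths(nodes, edges)))
```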
All pairs alignment
• Needed by the assembler
• Try all pairs: must consider ~n² pairs
• Smarter solution: only n × coverage (e.g. 8) pairs are possible
• Build a table of k-mers contained in sequences (single pass through the genome)
• Generate the pairs from the k-mer table (single pass through the k-mer table)
[Figure: reads A-I, with the reads that share a k-mer grouped together.]
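The k-mer table idea can be sketched in Python as follows. The function name and the choice of k are assumptions for illustration; the sketch ignores reverse complements and does not cap very frequent (repetitive) k-mers, both of which a practical overlapper would need to handle.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return index pairs of reads that share at least one k-mer."""
    table = defaultdict(set)
    # Pass 1: record which reads contain each k-mer.
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    # Pass 2: every k-mer seen in two or more reads proposes pairs to align.
    pairs = set()
    for read_ids in table.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC", "CCCCCCCC"]
print(sorted(candidate_pairs(reads, k=4)))
```

Only the proposed pairs are passed on to the expensive alignment step; the unrelated fourth read is never compared with anything.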
Assessing Assembly Quality
• Number and sizes of contigs
• Assumption: a few large contigs are better than many small contigs.
• This holds in that the former has fewer gaps, but it does not account for the possibility of misassemblies.
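Contig number and size can be summarized with a few statistics, sketched below in Python. N50 is not mentioned in the slides; it is included here as one commonly used size summary (the length of the contig at which the running total of lengths, taken from largest to smallest, first reaches half of the assembly). Like the raw counts, it says nothing about misassemblies. The contig lengths are made-up numbers.

```python
def contig_stats(lengths):
    """Count, total length, largest contig and N50 for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if running * 2 >= total:   # reached half of the assembled bases
            n50 = length
            break
    return {
        "count": len(ordered),
        "total": total,
        "largest": ordered[0] if ordered else 0,
        "n50": n50,
    }


few_large = [100, 80, 50, 30, 20, 10, 10]
many_small = [10] * 30
print(contig_stats(few_large))
print(contig_stats(many_small))
```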
Reference assembly – BWA tool
• BWA - Burrows-Wheeler Aligner
• Aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.
• It implements two algorithms, bwa-short and BWA-SW.
• The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp.
• Both algorithms do gapped alignment.
• Both are usually more accurate and faster on queries with low error rates.
Reference assembly – BWA tool
• Given high-quality reads, it is an order of magnitude faster than MAQ while achieving similar alignment accuracy.
• Platform: Illumina; SOLiD; 454; Sanger
• Features: PET (paired end tags) mapping (short reads only); gapped alignment; mapping quality; counting suboptimal occurrences (short reads only); SAM output
• Advantages: fast
• Limitations: the short read algorithm is slow for long reads and for reads with a high error rate
• Availability: GPL
Reference assembly – SAMtools
• SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
• SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
Reference assembly – SAMtools
The SAM format:
• Is flexible enough to store all the alignment information generated by various alignment programs;
• Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
• Is compact in file size;
• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
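To make the stream-friendly, tab-separated layout concrete, here is a minimal Python sketch that reads SAM text line by line and yields the mapped reads overlapping a locus. It relies only on the mandatory columns of the format (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), on header lines starting with `@`, and on flag bit 0x4 marking an unmapped read. The class name, function names and sample records are made up for illustration, and the linear scan stands in for what an index lookup would do on a sorted, indexed file.

```python
import re
from typing import Iterable, Iterator, NamedTuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapped position
    mapq: int
    cigar: str
    seq: str

    @property
    def end(self) -> int:
        """Last reference position covered (M, D, N, = and X consume reference)."""
        ref_len = sum(int(n) for n, op in CIGAR_OP.findall(self.cigar)
                      if op in "MDN=X")
        return self.pos + ref_len - 1


def parse_sam(lines: Iterable[str]) -> Iterator[Alignment]:
    """Yield one Alignment per record, skipping header and malformed lines."""
    for line in lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 11:
            continue
        yield Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9])


def reads_at(lines: Iterable[str], rname: str, start: int, end: int):
    """Stream the mapped reads that overlap rname:start-end."""
    for aln in parse_sam(lines):
        if aln.flag & 0x4:          # unmapped
            continue
        if aln.rname == rname and aln.pos <= end and aln.end >= start:
            yield aln


sam_text = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t5M1I4M\t*\t0\t0\tACGTAGGCTA\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print([a.qname for a in reads_at(sam_text, "chr1", 105, 110)])
```

Because both functions are generators, the same code works on an open file handle without holding the whole alignment in memory.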
De novo assembly – Velvet
• A de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454.
• Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom.
• It takes in short read sequences, removes errors, then produces high-quality unique contigs.
• It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
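Velvet is built around de Bruijn graphs, the structure behind the Eulerian path paradigm. The Python sketch below shows only the general idea and is not Velvet's actual algorithm: each k-mer in the reads becomes an edge between its (k-1)-base prefix and suffix, and k-mers seen fewer than `min_count` times are dropped as likely sequencing errors. The function name, `min_count` and the reads are assumptions for illustration; Velvet's own error removal works on graph features such as tips and bubbles.

```python
from collections import Counter, defaultdict


def de_bruijn(reads, k, min_count=1):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in a kept k-mer."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    graph = defaultdict(list)
    for kmer, count in sorted(counts.items()):
        if count >= min_count:       # discard rare k-mers as probable errors
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# The last read carries a single-base error (ATGTC instead of ATGGC).
reads = ["ATGGC", "TGGCA", "ATGGC", "TGGCA", "ATGTC"]
print(de_bruijn(reads, k=4, min_count=2))
```

With the cutoff at 2, the erroneous k-mers ATGT and TGTC vanish and the remaining edges spell out a single unambiguous path from ATG to GCA.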
Applications of Genome Assembly
• Generating and interpreting alignment status and reports
• Genome variation calling (finding SNPs, indels)
• Variation annotation and viewing
THANK YOU