Genome Alignment & Assembly

Chandrasekar A.

Sequence Assembly

• In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.

• This is needed as DNA sequencing technology cannot read whole genomes in one stretch, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used.

Genome Assemblers

• Genome assemblers are variants of simpler sequence alignment programs, built to piece together the vast quantities of fragments generated by automated sequencing instruments.

Tools/Software for Assembly

• TIGR Assembler
• Velvet (de novo)
• Maq (reference)
• Reference assembly and alignment using the BWA tool, and visualization of the alignment using SAM

Anatomy of a WGS Assembly

[Figure: a chromosome with STS markers; STS-mapped scaffolds; contigs separated by gaps (mean & std. dev. known) that are spanned by read pairs (mates); the consensus; reads (of several haplotypes) with SNPs; external "reads".]

Order & Orientation

• Assembly without pairs results in contigs whose order and orientation are not known.
• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.

[Figure: reads assembled into a contig with its consensus (15-30 Kbp); a 2-pair, whose mean & std. dev. is known, linking contigs into a scaffold.]
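A minimal sketch of how corroborating pairs can characterize a gap, assuming each mate pair comes from a library with known mean and std. dev. and that its two reads sit in the contigs on either side of the gap. The function name, the `(a_tail, b_head)` input layout and the numbers are hypothetical, chosen only for illustration.

```python
import math

def estimate_gap(pairs, lib_mean, lib_sd):
    """Estimate the gap between two contigs from mate pairs that span it.

    Each pair is (a_tail, b_head): bases from the start of the read in the
    left contig to that contig's end, and bases from the start of the right
    contig to the end of the mate placed in it (hypothetical layout).
    """
    if not pairs:
        raise ValueError("need at least one spanning pair")
    # For one pair: insert length = a_tail + gap + b_head.
    estimates = [lib_mean - a_tail - b_head for a_tail, b_head in pairs]
    gap = sum(estimates) / len(estimates)
    # Averaging n independent pairs shrinks the uncertainty by sqrt(n).
    gap_sd = lib_sd / math.sqrt(len(estimates))
    return gap, gap_sd

gap, gap_sd = estimate_gap([(900, 600), (700, 820), (1000, 480)],
                           lib_mean=2000, lib_sd=200)
print(f"gap ~ {gap:.0f} bp, std. dev. ~ {gap_sd:.0f} bp")
```

The more pairs corroborate the same link, the tighter the estimate of the gap becomes.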

Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases), overhang (6 bases)

• overlap - region of similarity between the two sequences
• overhang - un-aligned ends of the sequences

The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size

% identity = 18/19 = 94.7%
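The three screening criteria can be sketched as below, using the two sequences from the slide. The offsets (14 and 8) were read off the alignment by hand, and the thresholds in `passes_screen` are hypothetical defaults, not values used by any particular assembler.

```python
def overlap_stats(a, a_start, b, b_start, length):
    """Count matching bases in an ungapped overlap of the given length."""
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    return matches, 100.0 * matches / length

def passes_screen(length, identity, overhang,
                  min_overlap=15, min_identity=90.0, max_overhang=10):
    """Apply the three screens; the threshold values are illustrative only."""
    return (length >= min_overlap
            and identity >= min_identity
            and overhang <= max_overhang)

a = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
b = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"
length = 19
matches, identity = overlap_stats(a, 14, b, 8, length)
overhang = len(a) - (14 + length)  # un-aligned end of the first sequence
print(matches, round(identity, 1), overhang)  # 18 94.7 6
print(passes_screen(length, identity, overhang))
```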

Assembly Pipeline

• Trim & Screen
• Overlapper: find all overlaps ≥ 40bp allowing 6% mismatch.
• Unitiger
• Scaffolder
• Repeat Rez I, II

[Figure: an overlap between fragments A and B implies either a TRUE overlap or a REPEAT-INDUCED one.]
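A rough sketch of the overlap criterion (at least 40bp, at most 6% mismatch), assuming ungapped suffix-to-prefix overlaps and a brute-force search. A production overlapper would use seeding and gapped alignment; the function name and the test data are made up for illustration.

```python
import random

def best_overlap(a, b, min_len=40, max_mismatch=0.06):
    """Longest suffix of a matching a prefix of b within the mismatch budget.

    Returns (overlap length, mismatches) or None if nothing qualifies.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch * length:
            return length, mismatches
    return None

# Two reads cut from a random "genome" so that they share 50 bases,
# with one sequencing error planted in the second read.
random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(200))
a = genome[:120]
b = genome[70:190]
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
b = b[:10] + swap[b[10]] + b[11:]
result = best_overlap(a, b)
print(result)
```

With these inputs the search should report the planted 50-base overlap together with its single mismatch.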



Unitiger

Compute all overlap-consistent sub-assemblies: unitigs (uniquely assembled contigs).



OVERLAP GRAPH

Edge Types:
• Regular Dovetail
• Prefix Dovetail
• Suffix Dovetail

Edges are annotated with deltas of overlaps.

[Figure: pairs of fragments A and B illustrating each edge type.]

The Unitig Reduction

1. Remove "Transitively Inferable" Overlaps:

[Figure: overlapping fragments A, B and C before and after the transitively inferable overlap is removed.]
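A simplified sketch of this step, assuming the overlap graph is a directed acyclic graph stored as a dict from each fragment to the set of fragments that extend it. A real unitiger would also check that the overlap lengths are consistent before dropping an edge; that check is omitted here.

```python
def remove_transitive_edges(graph):
    """Drop every edge a->c that is implied by a pair of edges a->b and b->c."""
    reduced = {node: set(succs) for node, succs in graph.items()}
    for a, succs in graph.items():
        for b in succs:
            for c in graph.get(b, ()):
                reduced[a].discard(c)  # a->c can be inferred through b
    return reduced

overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
reduced_overlaps = remove_transitive_edges(overlaps)
print(reduced_overlaps)
```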

The Unitig Reduction

2. Collapse "Unique Connector" Overlaps:

[Figure: fragments A and B joined by a unique connector overlap are collapsed into a single node; edge labels 412, 352, 45.]
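One way to read "unique connector" is an edge A->B where B is the only successor of A and A is the only predecessor of B. Under that assumption, and assuming an acyclic graph stored as a dict of successor sets, collapsing such edges yields chains of fragments. The function name and example graph are hypothetical.

```python
def collapse_unique_connectors(graph):
    """Merge fragments joined by unique connector edges into chains."""
    preds = {node: set() for node in graph}
    for a, succs in graph.items():
        for b in succs:
            preds.setdefault(b, set()).add(a)

    def unique_next(a):
        succs = graph.get(a, set())
        if len(succs) == 1:
            (b,) = succs
            if len(preds[b]) == 1:
                return b
        return None

    def unique_prev(b):
        if len(preds[b]) == 1:
            (a,) = preds[b]
            if len(graph.get(a, set())) == 1:
                return a
        return None

    chains, seen = [], set()
    for node in sorted(preds):
        # A chain starts where no unique connector leads into the node.
        if node in seen or unique_prev(node) is not None:
            continue
        chain = [node]
        seen.add(node)
        nxt = unique_next(node)
        while nxt is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
            nxt = unique_next(nxt)
        chains.append(chain)
    return chains

graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
chains = collapse_unique_connectors(graph)
print(chains)  # [['A', 'B', 'C'], ['D'], ['E']]
```

The branch at C stops the chain: D and E both follow C, so neither overlap is a unique connector.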

Identifying Unique DNA Stretches

• The discriminator statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.
• Statistic at or above +10: definitely unique. At or below -10: definitely repetitive. In between: don't know.

[Figure: arrival intervals of reads in a unique DNA unitig versus a repetitive DNA unitig, with the distributions of the statistic for unique and for repetitive unitigs.]
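A sketch of such a log-odds statistic, assuming read starts arrive as a Poisson process along the genome (an assumption made here; the slide only states that the statistic is a log-odds ratio). With λ expected arrivals in a unique unitig and 2λ in a 2-copy one, ln P(k | unique) - ln P(k | 2-copy) simplifies to λ - k·ln 2. Function names and the example numbers are illustrative.

```python
import math

def discriminator_statistic(unitig_len, n_reads, total_reads, genome_len):
    """Natural-log odds that a unitig is unique rather than 2-copy DNA."""
    # Expected number of read arrivals if the unitig is unique.
    lam = unitig_len * total_reads / genome_len
    return lam - n_reads * math.log(2)

def classify(stat, threshold=10):
    """Three-way call using the +/-10 cut-offs from the slide."""
    if stat >= threshold:
        return "definitely unique"
    if stat <= -threshold:
        return "definitely repetitive"
    return "don't know"

# 20,000 reads over a 1 Mbp genome; a 5 kbp unitig expects 100 arrivals.
for k in (100, 140, 200):
    stat = discriminator_statistic(5_000, k, 20_000, 1_000_000)
    print(k, round(stat, 1), classify(stat))
```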




Scaffold U-unitigs (unitigs classified as unique) with confirmed pairs

[Figure: mated reads linking U-unitigs into a scaffold.]





Fill repeat gaps with doubly anchored positive unitigs

[Figure: a gap in a scaffold filled by a unitig labeled "Unitig>0".]


Assembly gaps

• sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap
• physical gap - no information is known about the adjacent contigs, nor about the DNA spanning the gap

[Figure: sequencing gaps and physical gaps.]

Assembly paradigms

• Overlap-layout-consensus
  • greedy (TIGR Assembler, phrap, CAP3...)
  • graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap

Greedy

• Build a rough map of fragment overlaps

• Pick the largest scoring overlap

• Merge the two fragments

• Repeat until no more merges can be done
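The four steps above can be sketched in a few lines, assuming exact (error-free) suffix-to-prefix overlaps scored by length. This is a toy illustration of the greedy idea, not the scoring used by TIGR Assembler or phrap.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no more merges can be done
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

contigs = greedy_assemble(["ATGCGTAC", "GTACGTTA", "CGTTAGC"])
print(contigs)  # ['ATGCGTACGTTAGC']
```

Because each merge is locally optimal only, repeats can lead a greedy assembler to join fragments that do not belong together.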


Overlap-layout-consensus

• Main entity: read
• Relationship between reads: overlap

[Figure: reads 1-9 and their overlap graph; the layout orders the reads; the consensus is called from aligned reads, with sequence labels ACCTGA, ACCTGA, AGCTGA, ACCAGA.]
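The consensus step can be illustrated with a per-column majority vote. The sketch assumes the sequence string in the figure stands for four aligned 6-base reads (an interpretation of the lost figure, not something the slide states), and it ignores gaps and base qualities.

```python
from collections import Counter

def consensus(aligned_reads):
    """Call the most frequent base in each column of equal-length reads."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("reads must be aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_reads))

called = consensus(["ACCTGA", "ACCTGA", "AGCTGA", "ACCAGA"])
print(called)  # ACCTGA
```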

Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

[Figure: a graph on nodes A-I and a circuit through all of its nodes, laid out along the genome.]

All pairs alignment

• Needed by the assembler
• Try all pairs – must consider ~ n² pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  • Build a table of k-mers contained in sequences (single pass through the genome)
  • Generate the pairs from k-mer table (single pass through k-mer table)

[Figure: sequences A-I sharing a k-mer.]
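A small sketch of the k-mer table idea: index every k-mer once, then emit only the read pairs that share at least one k-mer. The function name, the value of k and the reads are illustrative; a practical implementation would also cap k-mers that occur in very many reads.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return index pairs of reads that share at least one k-mer."""
    table = defaultdict(set)
    for idx, read in enumerate(reads):        # single pass through the reads
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    pairs = set()
    for ids in table.values():                # single pass through the table
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC", "TTTTTTTT"]
pairs = candidate_pairs(reads, k=4)
print(sorted(pairs))  # [(0, 1), (1, 2)]
```

Only the candidate pairs are then passed to the (much more expensive) alignment step, instead of all n² pairs.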

Assessing Assembly Quality

• Number and sizes of contigs
• Assumption: a few large contigs are better than many small contigs.
• True because there are fewer gaps in the former, but this does not account for the possibility of misassemblies.
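Contig number and sizes can be summarized as below. The sketch also reports N50, a widely used size summary that the slide does not name: the contig length at which the largest contigs together cover at least half of the assembled bases. Like the counts themselves, it says nothing about misassemblies.

```python
def contig_stats(lengths):
    """Summarize an assembly by contig count, total size, largest and N50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total": total,
        "largest": ordered[0] if ordered else 0,
        "n50": n50,
    }

few_large = contig_stats([100, 80, 50, 30, 20, 10, 10])
many_small = contig_stats([30] * 10)
print(few_large)
print(many_small)
```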

Reference assembly – BWA tool

• BWA - Burrows-Wheeler Aligner

• Aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.

• It implements two algorithms, bwa-short and BWA-SW.

• The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp.

• Both algorithms do gapped alignment.

• Both are usually more accurate and faster on queries with low error rates.

Reference assembly – BWA tool

• Given high-quality reads, it is an order of magnitude faster than MAQ while achieving similar alignment accuracy.
• Platform: Illumina; SOLiD; 454; Sanger
• Features: PET (paired end tags) mapping (short reads only); gapped alignment; mapping quality; counting suboptimal occurrences (short reads only); SAM output
• Advantages: fast
• Limitations: short read algorithm is slow for long reads and reads with high error rate
• Availability: GPL

Reference assembly – SAMtools

• SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

• SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

Reference assembly – SAMtools

The SAM format:

• Is flexible enough to store all the alignment information generated by various alignment programs;

• Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;

• Is compact in file size;

• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;

• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
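To illustrate how simple the format is to process as a stream, here is a minimal reader for the 11 mandatory tab-separated SAM fields. It is a sketch, not a replacement for SAM Tools: the class and function names are made up, optional tags are kept as raw strings, and the two input lines are a tiny illustrative example.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple  # optional fields, left unparsed

def parse_sam_line(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tuple(fields[11:]))

def iter_mapped(lines):
    """Yield mapped records one at a time, skipping '@' header lines."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        record = parse_sam_line(line)
        if not record.flag & 0x4:  # bit 0x4 set means the read is unmapped
            yield record

sam_lines = [
    "@SQ\tSN:ref\tLN:45\n",
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\n",
]
records = list(iter_mapped(sam_lines))
print(records[0].qname, records[0].rname, records[0].pos, records[0].cigar)
```

Because `iter_mapped` is a generator, the same code works on an open file handle without loading the whole alignment into memory.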

De novo assembly – Velvet

• Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454.

• Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom.

• It takes in short read sequences, removes errors, then produces high-quality unique contigs.

• It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

Applications of Genome assembly

• Generating and interpreting alignment status and reports
• Genome variation calling (finding SNPs, indels)
• Variation annotation and viewing

THANK YOU
