Genome Alignment & Assembly

Page 1: Genome Alignment & Assembly

Genome Alignment & Assembly

Chandrasekar A.

Page 2: Genome Alignment & Assembly

Sequence Assembly

• In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.

• This is needed because DNA sequencing technology cannot read whole genomes in one stretch; instead it reads small pieces of between 20 and 1000 bases, depending on the technology used.

Page 3: Genome Alignment & Assembly

Genome Assemblers

• Genome assemblers are variants of simpler sequence alignment programs that piece together the vast quantities of fragments generated by automated sequencing instruments.

Page 4: Genome Alignment & Assembly

Tools/Software for Assembly

• TIGR Assembler

• Velvet (de novo)

• Maq (reference)

• Reference assembly & alignment using the BWA tool, and visualization of alignments using SAM

Page 5: Genome Alignment & Assembly

Anatomy of a WGS Assembly

[Figure labels: Chromosome; STS; STS-mapped Scaffolds; Contig; Gap (mean & std. dev. known); Read pair (mates); Consensus; Reads (of several haplotypes); SNPs; External “Reads”]

Page 6: Genome Alignment & Assembly

Order & Orientation

• Assembly without pairs results in contigs whose order and orientation are not known.

• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.

[Figure labels: Reads; Consensus (15-30 Kbp); Contig; Scaffold; 2-pair; Mean & Std. Dev. is known]

Page 7: Genome Alignment & Assembly

Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases), overhang (6 bases)

• overlap - region of similarity between the sequences

• overhang - un-aligned ends of the sequences

The assembler screens merges based on:

• length of overlap

• % identity in overlap region

• maximum overhang size

% identity = 18/19 = 94.7%
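
The screening criteria above can be sketched in a few lines of Python. This is only an illustration: the function names and the default thresholds (`min_overlap`, `min_identity`, `max_overhang`) are assumptions made for the example, not values taken from any particular assembler.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length overlap regions agree."""
    if len(a) != len(b):
        raise ValueError("overlap regions must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def passes_screen(overlap_a: str, overlap_b: str, overhang: int,
                  min_overlap: int = 15,        # hypothetical threshold
                  min_identity: float = 90.0,   # hypothetical threshold
                  max_overhang: int = 10) -> bool:  # hypothetical threshold
    """Apply the three screens: overlap length, % identity, overhang size."""
    return (len(overlap_a) >= min_overlap
            and percent_identity(overlap_a, overlap_b) >= min_identity
            and overhang <= max_overhang)


# The 19-base overlap regions of the two sequences on the slide.
region_1 = "GGATGCGCGGACACGTAGC"
region_2 = "GGATGCGCTGACACGTAGC"
print(round(percent_identity(region_1, region_2), 1))  # 18 of 19 positions match
print(passes_screen(region_1, region_2, overhang=6))
```

With the slide's sequences the two regions differ at a single position, which reproduces the 18/19 figure.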

Page 8: Genome Alignment & Assembly

Assembly Pipeline

Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II

Overlapper: find all overlaps of at least 40 bp, allowing 6% mismatch.

[Figure: an overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED overlap.]
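
A rough sketch of the overlap criterion on this slide (40 bp, 6% mismatch) is given below. It is a deliberate simplification and not how a production overlapper works: it compares a suffix of one read against a prefix of another position by position, so it tolerates substitutions but not insertions or deletions, and it ignores reverse complements. The function name and signature are assumptions made for the example.

```python
def suffix_prefix_overlap(a: str, b: str,
                          min_len: int = 40,
                          max_mismatch: float = 0.06) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most `max_mismatch` (fraction) mismatching positions.
    Returns 0 when no overlap of at least `min_len` bases qualifies."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch * length:
            return length
    return 0
```

Trying the longest candidate first means the first hit is the longest acceptable overlap. The scan is quadratic in read length, which is acceptable for a sketch but is exactly the cost that k-mer indexing is meant to avoid.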

Page 9: Genome Alignment & Assembly

Assembly Pipeline

Unitiger: compute all overlap-consistent sub-assemblies, called unitigs (uniquely assembled contigs).

Page 10: Genome Alignment & Assembly

OVERLAP GRAPH

Edge types (each drawn between two reads A and B):

• Regular Dovetail

• Prefix Dovetail

• Suffix Dovetail

E.g.: edges are annotated with deltas of overlaps.

Page 11: Genome Alignment & Assembly

The Unitig Reduction

1. Remove “Transitively Inferrable” Overlaps:

[Figure: reads A, B and C; the overlap from A to C can be inferred from the overlaps A to B and B to C, so it is removed.]
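
A minimal sketch of this reduction step, assuming the overlap graph is stored as a dictionary that maps each read to the set of reads it overlaps on its right. Only overlaps inferrable through a single intermediate read are removed, as in the A, B, C picture; the function name is hypothetical.

```python
def remove_transitive_edges(graph: dict) -> dict:
    """Drop every edge u -> w for which some v exists with u -> v and v -> w.
    `graph` maps a node to the set of its successors; a new dict is returned."""
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, succs in graph.items():
        for v in succs:
            for w in graph.get(v, ()):
                if w != v and w in succs:
                    reduced[u].discard(w)
    return reduced


overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(remove_transitive_edges(overlaps))  # the A -> C edge is dropped
```

The check is made against the original graph rather than the partially reduced one, so the result does not depend on the order in which edges are visited.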

Page 12: Genome Alignment & Assembly

The Unitig Reduction

2. Collapse “Unique Connector” Overlaps:

[Figure: reads A and B joined by a unique connector overlap and collapsed into one node; edge labels 412, 352, 45.]
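
One way to picture the collapse step: an edge from u to v is treated as a unique connector when it is the only edge leaving u and the only edge entering v, and maximal chains of such edges become unitigs. The sketch below assumes an acyclic graph stored as a dictionary of successor sets; the names are illustrative and the overlap deltas are ignored.

```python
def collapse_unique_connectors(graph: dict) -> list:
    """Return chains of nodes linked by unique connector edges.
    `graph` maps a node to the set of its successors (assumed acyclic)."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    preds = {n: set() for n in nodes}
    for u, vs in graph.items():
        for v in vs:
            preds[v].add(u)

    def next_in_chain(u):
        succs = graph.get(u, set())
        if len(succs) == 1:
            (v,) = succs
            if len(preds[v]) == 1:
                return v
        return None

    def prev_in_chain(v):
        if len(preds[v]) == 1:
            (u,) = preds[v]
            if len(graph.get(u, set())) == 1:
                return u
        return None

    unitigs = []
    for n in sorted(nodes):
        if prev_in_chain(n) is not None:
            continue  # n sits inside a chain, not at its start
        chain = [n]
        nxt = next_in_chain(n)
        while nxt is not None:
            chain.append(nxt)
            nxt = next_in_chain(nxt)
        unitigs.append(chain)
    return unitigs
```

For a graph where A leads to B, B leads to C, and C branches to D and E, this yields the chain A, B, C plus the singletons D and E.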

Page 13: Genome Alignment & Assembly

Identifying Unique DNA Stretches

Arrival Intervals

Discriminator Statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.

[Figure: arrival intervals of reads in a unique DNA unitig and in a repetitive DNA unitig; distributions of the statistic for unique and for repetitive unitigs on an axis marked -10, 0, +10, with regions labeled Definitely Repetitive, Don't Know, and Definitely Unique.]
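
The slide does not give a formula, so the following is a sketch under a stated assumption: read start positions arrive as a Poisson process. If a unitig of length ρ is unique, the expected number of reads in it is λ = ρF/G (F reads in total, genome size G); if it is a collapsed 2-copy repeat, the expectation is 2λ. For a unitig containing k reads, the natural-log odds then reduce to λ − k·ln 2. The function names are hypothetical, and using ±10 as cutoffs is read off the slide's axis.

```python
import math


def discriminator_statistic(unitig_length: int, n_reads: int,
                            total_reads: int, genome_size: int) -> float:
    """Log-odds (natural log) of unique versus 2-copy under a Poisson model."""
    expected = unitig_length * total_reads / genome_size  # lambda if unique
    return expected - n_reads * math.log(2)


def classify(stat: float, threshold: float = 10.0) -> str:
    """Three-way call using the -10 / +10 marks shown on the slide."""
    if stat >= threshold:
        return "unique"
    if stat <= -threshold:
        return "repetitive"
    return "unknown"


# Hypothetical numbers: a 10 kb unitig, 1e6 reads, 1e8 bp genome -> 100 reads expected.
for k in (100, 140, 200):
    print(k, classify(discriminator_statistic(10_000, k, 1_000_000, 100_000_000)))
```

A unitig holding about the expected number of reads scores well above zero, while one holding about twice as many scores well below.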

Page 14: Genome Alignment & Assembly

Assembly Pipeline

Scaffolder: scaffold U-unitigs with confirmed pairs (mated reads).

Page 15: Genome Alignment & Assembly

Assembly Pipeline

Repeat Rez I, II: fill repeat gaps with doubly anchored positive unitigs (unitig > 0).

Page 16: Genome Alignment & Assembly

Assembly gaps

sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap

physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

[Figure labels: Sequencing gaps; Physical gaps]

Page 17: Genome Alignment & Assembly

Assembly paradigms

• Overlap-layout-consensus

    • greedy (TIGR Assembler, phrap, CAP3...)

    • graph-based (Celera Assembler, Arachne)

• Eulerian path (especially useful for short read sequencing)
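
To make the Eulerian path paradigm concrete, here is a small sketch that builds a de Bruijn graph from reads (each distinct k-mer becomes an edge between its two (k-1)-mers) and walks every edge once with Hierholzer's algorithm. It assumes error-free reads, a single strand, and that an Eulerian path exists; all function names are illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for succs in graph.values():
        for v in succs:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by consecutive (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GCGTGCA"]
print(spell(eulerian_path(de_bruijn_graph(reads, k=4))))
```

Unlike overlap-layout-consensus, no read-against-read comparison is needed here, which is part of why the approach suits large numbers of short reads.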

Page 18: Genome Alignment & Assembly

TIGR Assembler/phrap

Greedy

• Build a rough map of fragment overlaps

• Pick the largest scoring overlap

• Merge the two fragments

• Repeat until no more merges can be done
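
The greedy steps above can be sketched as follows. This is a toy version, not the TIGR Assembler or phrap algorithm: overlaps are exact suffix-prefix matches scored by length, reverse complements are ignored, and the names and the `min_len` default are assumptions.

```python
def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Longest exact suffix of `a` that equals a prefix of `b` (0 if < min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair with the largest overlap until none remains."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    olen = overlap_length(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # no more merges can be done
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

Because each merge is chosen locally, a repeat can lead the greedy strategy to join fragments that do not belong together.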

Page 19: Genome Alignment & Assembly

Overlap-layout-consensus

Main entity: read

Relationship between reads: overlap

[Figure: overlap, layout and consensus stages for reads 1 to 9; the consensus panel shows the sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA.]

Page 20: Genome Alignment & Assembly

Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

[Figure: a graph with nodes A through I and a path through all of its nodes, labeled Genome.]
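
As an illustration of the Hamiltonian formulation, the sketch below searches for a path that visits every node of a small directed graph exactly once by plain backtracking. It looks for a path rather than a closed circuit, since a linear layout of reads has no closing edge; the circuit variant would additionally require an edge from the last node back to the start. The function name is hypothetical.

```python
def hamiltonian_path(graph: dict, start):
    """Return a path from `start` that visits every node exactly once, or None.
    `graph` maps each node to an iterable of its successors."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}

    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in visited:
                result = extend(path + [nxt], visited | {nxt})
                if result is not None:
                    return result
        return None

    return extend([start], {start})


reads_graph = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}
print(hamiltonian_path(reads_graph, "A"))
```

The search is exponential in the worst case, which is why this formulation is only practical with heuristics on real data.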

Page 21: Genome Alignment & Assembly

All pairs alignment

• Needed by the assembler

• Try all pairs – must consider ~ n² pairs

• Smarter solution: only n × coverage (e.g. 8) pairs are possible

    • Build a table of k-mers contained in sequences (single pass through the genome)

    • Generate the pairs from k-mer table (single pass through k-mer table)

[Figure: reads A through I indexed by a shared k-mer.]
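
The two passes described above might look like this in Python. It is a sketch with assumed names: reads are given as a dictionary from read name to sequence, and any two reads sharing at least one k-mer become a candidate pair for full alignment.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads: dict, k: int) -> set:
    """Return pairs of read names that share at least one k-mer."""
    # Pass 1: table from k-mer to the reads that contain it.
    table = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(name)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for names in table.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


reads = {"r1": "ATGGCGT", "r2": "GCGTGCA", "r3": "TTTTTTT"}
print(candidate_pairs(reads, k=4))  # only r1 and r2 share a 4-mer (GCGT)
```

A k-mer that occurs in a high-copy repeat will list many reads and generate many pairs, so practical implementations usually cap or skip such entries.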

Page 22: Genome Alignment & Assembly

Assessing Assembly Quality

• number and sizes of contigs

• Assumption: a few large contigs are better than many small contigs.

• True because there are fewer gaps in the former, but this does not account for the possibility of misassemblies.
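
A short sketch of how the number and sizes of contigs could be summarized. The slide names no specific metric; N50 is included here as a commonly used size summary (the contig length at which the longest contigs together reach half of the total assembled length) and is an addition for illustration. As the bullets above note, none of these figures detects misassemblies.

```python
def contig_stats(lengths):
    """Summarize contig lengths: count, total, largest and N50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    n50 = 0
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(ordered),
        "total": total,
        "largest": ordered[0] if ordered else 0,
        "n50": n50,
    }


# Two hypothetical assemblies of the same total length.
print(contig_stats([100, 80, 50, 30, 20, 20]))
print(contig_stats([50] * 6))
```

Both example assemblies contain six contigs and 300 bases, but the first has larger contigs and therefore a higher N50.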

Page 23: Genome Alignment & Assembly

Reference assembly – BWA tool

• BWA - Burrows-Wheeler Aligner

• Aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.

• It implements two algorithms, bwa-short and BWA-SW.

• The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp.

• Both algorithms do gapped alignment.

• They are usually more accurate and faster on queries with low error rates.

Page 24: Genome Alignment & Assembly

Reference assembly – BWA tool

• Given high-quality reads, it is an order of magnitude faster than MAQ while achieving similar alignment accuracy.

• Platform: Illumina; SOLiD; 454; Sanger

• Features: PET (paired end tags) mapping (short reads only); gapped alignment; mapping quality; counting suboptimal occurrences (short reads only); SAM output

• Advantages: fast

• Limitations: short read algorithm is slow for long reads and reads with high error rate

• Availability: GPL

Page 25: Genome Alignment & Assembly

Reference assembly – SAMtools

• SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

• SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

Page 26: Genome Alignment & Assembly

Reference assembly – SAMtools

The SAM format:

• Is flexible enough to store all the alignment information generated by various alignment programs;

• Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;

• Is compact in file size;

• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;

• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
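
To illustrate the streaming property, the sketch below reads SAM text one line at a time and yields one record per alignment. It relies only on the format's eleven mandatory tab-separated fields and on header lines starting with "@"; optional tags are ignored, and the class and function names are made up for this example rather than taken from SAMtools.

```python
from typing import Iterable, Iterator, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


def iter_alignments(lines: Iterable[str]) -> Iterator[SamRecord]:
    """Yield records lazily, skipping header lines and blank lines."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)
```

Passing an open file object as `lines` processes the alignments one at a time, so memory use does not grow with file size.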

Page 27: Genome Alignment & Assembly

De novo assembly – Velvet

• Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454.

• Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom.

• Currently takes in short read sequences, removes errors then produces high quality unique contigs.

• It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

Page 28: Genome Alignment & Assembly

Applications of Genome assembly

• Generating and interpreting alignment status and reports

• Genome variation calling (finding SNPs, indels)

• Variation annotation and viewing

Page 29: Genome Alignment & Assembly

THANK YOU