Alignment of large genomic sequences

Alignment of large genomic sequences

Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences.

Possible applications: Detection of regulatory elements Identification of pathogenic microorganisms Gene prediction

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa






















atc------taatagttaaactcccccgtgcttag






caaa--gagtatcacccctgaattgaataa




caaa--gagtatcacc----------cctgaattgaataa


atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg



atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg


Consistency!


atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa

First step in sequence comparison: alignment

S1 S2S3


For genomic sequences:Neither local nor global methods appropriate

S1 S2

S1’ S2’ S3’

S3


Local method finds single best local similarity

S1 S2

S1’ S2’ S3’

S3


Multiple application of local methods possible

S1 S2

S1’ S2’ S3’

S3


S1 S2

S1’ S2’ S3’

S3




S1 S2

S1’ S2’ S3’

S3



S1 S2

S1’ S2’ S3’

S3


Threshold has to be applied to filter alignments: reduced sensitivity!

S1 S2

S1’ S2’ S3’

S3


Alternative approach:

During evolution few large-scale re-arrangements

-> relative order homologies conserved

Search for chain of local homologies


Genomic alignment: chain of homologies

S1 S2

S1’ S2’ S3’

S3



S1 S2

S1’ S2’ S3’

S3



S1 S2

S1’ S2’ S3’

S3



S1 S2

S1’ S2’ S3’

S3


Novel approaches for genomic alignment:

WABA PipMaker MGA TBA Lagan Avid DIALIGN


Gene-regulatory sites identified by mulitple sequence alignment (phylogenetic footprinting)


Objective function for DIALIGN:

Weight score for every possible fragment f based on P-value: P(f) = probability of finding a fragment “like f” by chance

in random sequences with same length as input sequences

w(f) = -log P(f) (“weight score” of f)

”like f” means: at least same # matches (DNA, RNA) or sum of similarity values (proteins)

Objective function for DIALIGN:

Score of alignment: sum of weight scores of fragments – no gap penalty!

Optimization problem for DIALIGN:

Find consistent collection of fragments with maximum total weight score!

Alternative fragment weight scores for genomic sequences:

Calculate fragment scores at nucleotide level and at peptide level.

catcatatcttatcttacgttaactcccccgt

cagtgcgtgatagcccatatccgg





Standard score:Consider length, # matches, compute probability of random occurrence

Translation option:



Translation option:

L S Y V

catcatatc tta tct tac gtt aactcccccgt

cagtgcgtg ata gcc cat atc cgg

I A H I

DNA segments translated to peptide segments; fragment score based on peptide similarity:

Calculate probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values

P-fragment (in both orientations)

L S Y V

catcatatc tta tct tac gtt aactcccccgt

cagtgcgtg ata gcc cat atc cgg

I A H I

N-fragment catcatatc ttatcttacgtt aactcccccgtgct || | | | cagtgcgtg atagcccatatc cg

For each fragment f three probability values calculated; Score of f based on smallest P value.

Alternative fragment weight scores for genomic sequences:

Calculate fragment scores at nucleotide level and at peptide level.

DIALIGN alignment of human and murine genomic sequences

DIALIGN alignment of tomato and Thaliana genomic sequences


Evaluation of signal detection methods: Apply method to data with known signals (correct answer is known!). E.g. experimentally verified genes for gene finding

TP = true positves = # signals correctly predicted (i.e. signal present)

FP = false positives = # signals predicted but wrong (i.e no signal present)

TN = true negative = # no signal predicted, no signal present

FN = false negative = # no signal predicted, signal present!


Sn = Sensitivity

= correctly predicted signals / present signals

= TP / (TP + FN)

Sp = Specificity

= correctly predicted signals / predicted signals = TP / (TP + FP)


Comprehensive evaluation of signal prediction method:

Method assigns score to predictions

Apply threshold parameter

High threshold -> high specificity (Sp), low sensitivity (Sn)

Low threshold -> high sensitivity , low specificity

ROC curve („receiver-operator curve“)

Vary threshold parameter, plot Sn against Sp

Performance of long-range alignment programs for exon discovery (human - mouse comparison)

DIALIGN alignment of tomato and Thaliana genomic sequences

AGenDA:

Alignment-based Gene Detection Algorithm

Bridge small gaps between DIALIGN fragments

-> cluster of fragments

AGenDA:




Search conserved splice sites and start/stop codons at cluster boundaries to Identify candidate exons

AGenDA:




Search conserved splice sites and start/stop codons at cluster boundaries to Identify candidate exons

Recursive algorithm finds biologically consistent chain of potential exons

Identification of candidate exons

Fragments in DIALIGN alignment


Build cluster of fragments


Identify conserved splice sites


Candidate exons bounded by conserved splice sites

Construct gene models using candidate exons

Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending

Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score

)()()(

),()()( SPscfw

Clen

ECdisClenEsc

i

i

Find optimal consistent chain of candidate exons





atg gt ag gt ag tga atg tga



G1 G2


Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time



G1 G2


Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time

DIALIGN fragments

Candidate exons

Gene model

Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)


0%10%20%30%40%50%60%70%80%90%

100%

sensitivity specificity

AGenDAGenScan

AGenDA

GenScan

64 %

12 % 17 %



DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)



Alignment of Hox gene cluster:




DIALIGN able to identify small regulatory elements, but





Entire genes totally mis-aligned





Entire genes totally mis-aligned Reason for mis-alignment: duplications !


The Hox gene cluster:

4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!


The Hox gene cluster:

Complete mis-alignment of entire genes!

Alignment of sequence duplications

S1

S2


S1

S2

Conserved motivs; no similarity outside motifs


S1

S2

Duplication in two sequences


S1

S2



S1

S2



S1

S2

Mis-alignment would have lower score!


S1

S2

Duplication in one sequence


S1

S2



S1

S2


Possible mis-alignment


S1

S2


S3


S1

S2


S3


S1

S2


S3


S1

S2


S3


S1

S2

Consistency problem

S3


S1

S2

More plausible alignment – and higher score:

S3


S1

S2

Consistency problem

S3


S1

S2

Alternative alignment; probably biologically wrong;lower numerical score!

S3

Anchored sequence alignment

Biologically meaningful alignment often not possible by automated approaches.


Biologically meaningful alignment not possible by automated approaches.

Idea: use expert knowledge to guide alignment procedure


Biologically meaningful alignment not possible by automated approaches.

Idea: use expert knowledge to guide alignment procedure

User defines a set anchor points that are to be „respected“ by the alignment procedure


NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT







Use known homology as anchor point


NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN







Anchor point = anchored fragment (gap-free pair of segments)





Anchor point = anchored fragment (gap-free pair of segments)

Remainder of sequences aligned automatically


-------NLF VALYDFVASG DNTLSITKGE klrvlgynhn

iihredkGVI YALWDYEPQN DDELPMKEGD cmt-------

Anchored alignment




GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Anchor points in multiple alignment



IIHREDKGVIYALWDYEPQND DELPMKEGDCMT

GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Anchor points in multiple alignment


-------NLF V-ALYDFVAS GD-------- NTLSITKGEk lrvLGYNhn

iihredkGVI Y-ALWDYEPQ ND-------- DELPMKEGDC MT-------

-------GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS--

Anchored multiple alignment

Algorithmic questions

Goal:

Find optimal alignment (=consistent set of fragments) under costraints given by user-specified anchor points!

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5



NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMTGYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS


1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5



1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences



1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences start positions



1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences start positions length



1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences start positions length score



Requirements:

Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points













Inconsistent anchor points!


atctaat---agttaaactcccccgtgcttag

Cagtgcgtgtattac-taacggttcaatcgcg


Inconsistent anchor points!


Requirements:



Requirements:


Find alignment under constraints given by anchor points!


Use data structures from multiple alignment









Greedy procedure for multiple alignment





Greedy procedure for multiple alignment





Question: which positions are still alignable ?


atctaatagttaaactcccccgtgcttag Si


caaagagtatcacccctgaattgaataa x

For each position x and each sequence Si exist an

upper bound ub(x,i) and a lower bound lb(x,i) for

residues y in Si that are alignable with x





For each position x and each sequence Si exist an

upper bound ub(x,i) and a lower bound lb(x,i) for

residues y in Si that are alignable with x





ub(x,i) and lb(x,i) updated during greedy procedure





Initial values of lb(x,i), ub(x,i)












Anchor points treated like fragments in greedy algorithm:



Sorted according to user-defined scores



Sorted according to user-defined scores Accepted if consistent with previously

accepted anchors



Sorted according to user-defined scores Accepted if consistent with previously

accepted anchors

ub(x,i) and lb(x,i) updated during greedy

procedure



Sorted according to user-defined scores Accepted if consistent with previously accepted

anchors

ub(x,i) and lb(x,i) updated during greedy

procedure

Resulting values of ub(x,i) and lb(x,i) used as initial

values for alignment procedure





Initial values of lb(x,i), ub(x,i)





Initial values of lb(x,i), ub(x,i) calculated using anchor

points


Ranking of anchor points to prioritize anchor points, e.g.

anchor points from verified homologies -- higher priority

automatically created anchor points (using CHAOS, BLAST, … ) -- lower priority

Application: Hox gene cluster


Use gene boundaries as anchor points


Use gene boundaries as anchor points

+ CHAOS / BLAST hits


no anchoring anchoring

Ali. Columns

2 seq 2958 3674

3 seq 668 1091

4 seq 244 195

Score 1166 1007

CPU time 4:22 0:19


Example:

Teleost Hox gene cluster:


Example:


Score of anchored alignment 15 % higher than score of non-anchored alignment !


Example:


Score of anchored alignment 15 % higher than score of non-anchored alignment !

Conclusion: Greedy optimization algorithm does a bad job!

Application: Improvement of Alignment programs

Two possible reasons for mis-alignments:



Wrong objective function: Biologically correct

alignment gets bad numerical score



Wrong objective function: Biologically correct

alignment gets bad numerical score

Bad optimization algorithms: Biologically correct

alignment gets best numerical score, but algorithm

fails to find this alignment



Anchored alignments can help to decide

Application: RNA alignment


aa----CCCC AGC---GUAa gucgcuaucc a

cacucuCCCA AGC---GGAG Aac------- -

ccg----CCA AaagauGGCG Acuuga---- -

non-anchored alignment


aa----CCCC AGC---GUAa gucgcuaucc a

cacucuCCCA AGC---GGAG Aac------- -

ccg----CCA AaagauGGCG Acuuga---- -

structural motif mis-aligned


aaCCCCAGCG UAAGUCGCUA UCca--

--CACUCUCC CAAGCGGAGA AC----

----CCGCCA AAAGAUGGCG ACuuga

3 conserved nucleotides as anchor points

WWW interface at GOBICS(Göttingen Bioinformatics Compute Server)

WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)

Documents

Alignment of large genomic sequences