Finding Regulatory Signals in Genomic Sequences Bioinformatics and Natural Computing Group Dipartimento di Informatica, Sistemistica e Comunicazione Università

Finding Regulatory Signals inGenomic Sequences

Bioinformatics and Natural Computing GroupDipartimento di Informatica, Sistemistica e

ComunicazioneUniversità degli Studi di Milano-Bicocca

Giancarlo Mauri

Weeder

ProFind

04/21/23 2

Gene Expression Data

When and how much a gene is expressed under some given conditions (tissue, external stimuli, disease...)

We can group genes according to their expression profile

We can suspect a “common cause” for their expression

04/21/23 3

Transcription Factors

The expression of a gene starts with transcription from DNA to RNA

Transcription is modulated by dedicated proteins called transcription factors (TFs)

TFs bind to DNA in the regions surrounding the starting site of the gene (mostly upstream), and direct polymerases to the “right spot” to start transcription

Different effects: may enhance or block transcription

04/21/23 4

Transcription Factors

04/21/23 5

TF

Transcription Starts

TFBS

04/21/23 6

TF1


TFBS

AND

TF2

04/21/23 7

TF1


TFBS

NOT

TF2

04/21/23 8

TFs Binding Sites (TFBSs)

Bound by transcription factors

Short degenerate sequences, 5-16 nucleotides long, (gaps possible but rare)

Each TF does not recognize a single fragment but a set of them (similar to each other) called signal or motif

Can be illustrated by profiles and/or consensi (computational models)

Fundamental in regulatory analysis is the identification of potential TFBSs

04/21/23 9

TFs Binding Sites (TFBSs)

Fundamental in regulatory analysis is the identification of potential TFBSs

Bound by transcription factors

Short degenerate sequences, 5-16 nucleotides long, (gaps possible but rare)

Each TF does not recognize a single fragment but a set of them (similar to each other) called signal or motif

Can be illustrated by profiles and/or consensi (computational models)

04/21/23 10

Finding TFBSs

We have a set of related genes: similar expression profile

similar biological function

anything else..

We take their upstream regulatory regions

If they are regulated by the same TF(s), then we should find its (their) binding sites in the sequences

We should find short patterns conserved in the sequences

We could use the detected TFBS to predict the behavior of a gene

04/21/23 11

If sequences were built at random and/or we picked sequences at random, the group should not appear with

the same size/conservation

Finding Novel TFBSs

Over-representation: First, detect groups of similar oligos

Describe each group with a consensus or a

profile (or in some other smart way)

Find the most over-represented groups

04/21/23 12

Finding Novel TFBSs

Most of early research has focused on the first point: how to detect the best groups (unfortunately, there are thousands of candidates) given simple score measures

Recent research has followed the second point: which is the best measure to tell “significant” groups from random similarities?

Is it expected or not, to find a group that is conserved?

Can we take advantage from the wealth of sequence data available?

Weeder : A tool for pattern discovery in

Genomic Sequences

* Università degli Studi di Milano-Bicocca^ Università degli Studi di Milano

Giancarlo Mauri*Giulio Pavesi*Graziano Pesole^

04/21/23 14

References

G. Pavesi, G.Mauri, G.Pesole. An Algorithm for Finding Signals of Unknown Length in Unaligned DNA Sequences. Bioinformatics 17, S207-S214, 2001

G.Pavesi, P.Mereghetti, G.Mauri, G.Pesole. WeederWeb: Discovery of Transcription Factor Binding Sites in a Set of Sequences from Co-Regulated Genes. Nucleic Acids Research Web Server Issue 2004, 32: W199-W203

http://159.149.109.16:8080/weederWeb/

04/21/23 15

Weeder (2001)

Idea: instead of reducing the set of candidate patterns, reduce the set of possible matches for each pattern, trying to save a “significant” number of valid occurrences

Instead of searching exhaustively for patterns that occur in every sequence, we “short-sightedly” look for patterns that occur in a subset of them

The algorithm needs as input only a given error ratio

04/21/23 16

Suffix Trees

A suffix tree is a data structure that exposes the internal structure of a string in a very deep and meaningful way

Suffix tree T for S = s1…sn

rooted directed tree

exactly n leaves numbered 1 to n

internal nodes with at least two sons

edges labeled by non empty substrings of S

labels out of the same node begin with different symbols

the concatenation of the edge labels on the path from

the root to any leaf i exactly spells the suffix of S

starting at position i, i.e., s1…sn

04/21/23 17

Suffix Trees

The same structure can be built also for a set of k sequences To distinguish which sequence a suffix belongs to,

it appends a different marker symbol, not

occurring elsewhere, to each sequence in the set.

It is also possible to annotate each node of the

tree with a k-bit string, where the i-th bit is

set if the word spelled by the path ending at the

node occurs in the i-th sequence.

04/21/23 18

Suffix Trees

C

CC

C

G

G

G

G

G

#

#

#

##

A

A A

A

AA

A

$

$

$

$

Suffix tree for ACCA (end with $) and CCAAG (end with #)

04/21/23 19

Suffix Trees

A generalized suffix tree can be built in O(N) time and takes O(N) space, where N is the overall length of the sequences

Annotating it with the bit strings takes additional O(kN) time

Each pattern occurring in the strings is spelled by a path starting from the root of the tree

The time needed to search for a pattern depends only on the length of the pattern

The structure allows to implement recursively the exhaustive enumeration of all the candidate patterns of a given length

The time complexity is thus reduced to be exponential in the maximum number of mutations allowed (Sagot, 1998)

04/21/23 20

Searching for an Exact Pattern

Given a set of sequences and the annotated suffix tree, every pattern appearing in at least one sequence of the set is spelled by a unique path starting from the root

We match the symbols of pattern p along the unique path in the tree until p is exhausted

•In this case, the bit string on the next node on the path specifies which sequences p appears in

no more matches are possible

04/21/23 21

Searching for a Pattern with Mismatches

We can also search for a pattern p with at most e mismatches in a similar way.

We match p along different paths on the tree at the same time, keeping track of the number of mismatches encountered on each path.

Whenever the number of errors on a path is greater than e, we discard that path.

The sequences p appears in are given by the logical-OR of the bit strings corresponding to the different paths.

04/21/23 22

Searching for (M, e) Patterns

The algorithm starts with the empty pattern from the root of the tree, and recursively expands it

Let us suppose we have found on the tree the endpoints of paths corresponding to the occurrences of a pattern p=p1…pm in the sequences, where all the paths spell words within distance e from p, with m<M

If p occurs in at least q sequences, we try and expand it by one symbol

04/21/23 23


Expanding a pattern by one symbol For each character b {A, C, G, T}, we

match b against the next symbol on each path

If a path ends just before a node V of the

tree, we match b against the first symbol on

each edge leaving V

When we encounter a mismatch, we increase

the previous error along the path by one

If the new error is greater than e, we

discard the path

04/21/23 24


Once all paths have been checked, the surviving ones represent the approximate occurrences of p’=p1…pmb

If p’ occurs in at least q sequences, and is shorter than M, we expand p’ as well. Otherwise, we continue with p and the next character in

04/21/23 25


For example It matches the first symbol on each edge leaving the root against A.

If A is valid, i.e., A occurs in at least q sequences, it is expanded to AA.

If also AA is valid, we move to AAA, and so on. If it is not valid, we proceed to look for occurrences of AC.

In this method, patterns don’t have to occur exactly in the sequences.

04/21/23 26


The main drawback is that every pattern of length e satisfies the input constraints, since every other pattern of length e found in the tree is a valid occurrence for it

Thus, the method works well only for small values of e

04/21/23 27


e

At the beginning of the search, all paths of length e are valid

04/21/23 28


To apply the algorithm also to longer patterns with higher values of e, instead of reducing the set of patterns that have to be searched, we restrict the number of paths that have to be followed for each pattern.

That is, we narrow down the set of valid occurrences－ the WEEDER algorithm.

04/21/23 29

Searching for Approximate Occurrences of Patterns

Problem Definition: Given a set of k sequences on the

alphabet = { A, C, G, T } , we

want to find all (M, e) patterns

(M, e) patterns: patterns of length M

that occur with at most e mismatches in

at least q sequences of the set

04/21/23 30

The Outline of WEEDER

WEEDER fixes an initial error ratio Given a pattern p, a path is valid if the

distance from p to the path is not greater than

|p|• |p| is the length of the pattern

When we expand p by one symbol, the error

threshold is set to (|p|+1)

04/21/23 31

Block Decomposition of a Pattern

Each block size is 1/

Let p = p1…pm. We can see p as composed of m blocks

0 4 8 12 16

12

3

4

= 0.25

04/21/23 32

Valid Occurrences

For every pattern p = p1…pm, valid occurrences are words si+1…si+m occurring in the sequences for which: j {1,…, m} d(p1…pj, si+1…si+j) j d(p1…pj, si+1…si+j) is the number of mismatches

between p1…pj and si+1…si+j

si+1…si+m is a valid occurrence for p if it is a valid occurrence for all its prefixes

{p1, p1p2, …, p1p2…pm-1}

04/21/23 33

An Example for WEEDER

A&

GCTCA&

ATCACGC#CGC#

TC$T

CACGC#

GCT%

C

GC#

T

%C

A&

$

$A

CGC#&

#

T

C

%

$A

&CGC#

GCT%

#

T%

CA&

S1: AATCACGC# S2: AGCTCA&

S3: ATGCT% S4: ACTC$

q = 2, =0.25

CG

04/21/23 34

ACTC ACTCA

A&

GCTCA&

ATCACGC#C GC#

TC$T

CACGC#

GCT%

C

GC#T

%C

A&

$

$A

CGC#&

#

T

C

%

$A

&CGC#

GCT%

C

#

T%

CA&

S1: AATCACGC# S2: AGCTCA& S3: ATGCT% S4: ACTC$

ACTC: error max =1. S1, S2, S4 contain ACTC.

ACTCA: error max =2. S1, S2, S4 contain ACTCA.

G

04/21/23 35

ACTCA ACTCAA

A&

GCTCA&

ATCACGC#CGC#

TC$T

CACGC#

GCT%

C

GC#

T

%C

A&

$

$A

CGC#&

#

T

C

%

$A

&CGC#

GCT%

#

T%

CA&

ACTCAA: error max =2. S1, S2 contain ACTCAA.

ACTCAC, ACTCAG, ACTCAT are also patterns.S1: AATCACGC# S2: AGCTCA& S3: ATGCT% S4: ACTC$

CG

04/21/23 36

Weeder (2001)

Given a pattern P = p1p2....pm, the algorithm can find all the valid occurrences of P (with at most |P| mutations), such that at most i mutations occur in the first i letters of the pattern

But: some occurrences of a pattern can be missed altogether

Are DNA signals always so polite to show up in “blocks-decomposed” form?

The answer is no, but we can use Weeder with a grain of salt

04/21/23 37

Using Weeder

Example: (15,4) pattern occurring in 20 sequences Valid (block decomposed) possible occurrences: 829

Total possible occurrences:1365

Probability if “hitting” a possible occurrence in

a sequence: phit=.61

Probability of finding the pattern in every

sequence: like trying to win the national lottery

If we search for patterns occurring in at least 10

sequences, the probability of “seeing” at least 10

times the pattern is:

Phit(20,10) = .89

04/21/23 38

Using Weeder

Thus, we can use Weeder as a sieve, to filter the set of candidate patterns

All patterns that are found to occur in at least q of the sequences by Weeder can be searched again in the sequences, but this time with no restriction on the position of mismatches

We expect the number of patterns (random patterns other than the real signal) passed to the second phase to be much smaller than the original number (and no longer exponential)

04/21/23 39

Using Weeder

The probability of finding a pattern in a sequence depends on its length and the error ratio

The probability of finding a pattern in a set of sequences (and thus the choice of the quorum q for the first phase) depends on the number of sequences

The same approach can be applied also when the signal does not show up in each sequence

04/21/23 40

Using Weeder

When the signal to be found is expected to be short, the algorithm can be used in “exact” mode

For longer signal, the lower is the quorum q, the higher is the probability of finding the signal

But: also the number of patterns satisfying the input constraints is higher, and the program is slower

Users can choose a suitable trade-off between time and accuracy

04/21/23 41

Theoretical Time Complexity

Naïve approach: O(4men)

Suffix tree approach (Sagot, 1998): O(4emekn)

where n is the input size, m is the pattern

length, and e is the number of mutations

allowed

Weeder:

O((1/)e4ekn)

where e is the number of mutations occurring

in the longest pattern found

04/21/23 42

Weeder Web

Weeder Web is a web interface to the Weeder algorithm, where all the parameters concerning the motifs are automatically set for the discovery of transcription factor binding sites

Although there is no pre-set limit on the length of the input sequences, feasible results can be obtained by submitting sequences of "typical" length for regulatory/promoter regions (i.e. from 500 to 5000 bps)

A priori, there's no limit on the number of sequences you can input. Also, for the moment we do not consider correlations among different motifs (i.e. cis-regulatory modules)

04/21/23 43

Weeder Web

All the statistical measures (background oligo frequencies, expected occurrences and so on) used to score/rank motifs and to post-process the output have been derived from the analysis of promoter/enhancer and 5'UTR regions only (taken from different organisms)

If you submit something else (i.e. 3'UTRs, coding regions, noncoding RNAs, and so on) the statistical evaluation probably will not be consistent with your data, and thus produce unreliable results

http://www.pesolelab.it/Tool/ind.php

04/21/23 44

Post-Processing

Real motifs have different degrees of variation in different positions

Some admit “any” nucleotide

Some are (almost) perfectly conserved

We should find “redundant” motifs among the highest-scoring ones

Pieces of a long motif should appear also in shorter results

04/21/23 45

Post-Processing

Look for “redundant” (either in length or in conservation) motifs in the reports of each run

Collect the instances of each one and build a frequency matrix

Scan the sequences looking for matches

Report the best matches (with no constraint on the substitutions allowed)

04/21/23 46

Assessment results

Combined over all data sets

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

ANN-SpecAlignACEConsensus

GLAMIITD

Improbizer

MEMEMEME3

MITRAMOE

MotifSampler

Multiple-family-analysis

QuickScoreSeSiMCMC

Weeder

YMF

nSn

nPPV

nPC

nCC

sSn

sPPV

sASP

Tompa et.al.,Nat Biotechnol. 2005 Jan;23(1):137-44.

04/21/23 47

Assessment results

sSn categorized by species

0

0.1

0.2

0.3

0.4

0.5

0.6

ANN-SpecAlignACEConsensus

GLAMIITD

Improbizer

MEME MEME3 MITRAMOE

MotifSampler

Multiple-family-analysis

QuickScoreSeSiMCMCWeeder

YMF

Combined sSn

w holeset

f ly

human

mouse

yeast

ProFind : A GA Approach to the Definition

of Regulatory Signals inGenomic Sequences

Characterization of CAP and TATA-box through probability matrices with a

genetic algorithm

Bioinformatics and Natural Computing GroupDipartimento di Informatica, Sistemistica e

ComunicazioneUniversità degli studi di Milano-Bicocca

Giancarlo Mauri, Roberto Mosca and Giulio Pavesi

04/21/23 49

TATA Box and CAP Binding Sites

A large number of genes present two characteristic signals: TATA-box

• 25-35 bp upstream of the TSS• When discovered it was given a TATA consensus• Bound by the TATA Binding Protein (TBP) part of a large complex of some 50 different proteins including TFIID and TFIIB

CAP (also called Initiator or Inr)• Straddles the TSS• Experimental evidence that it is bound by TFIID too• Previous characterization by Bucher [1990] with a CA[Py] consensus

Very strong positional preference for the two signals with respect to the TSS

04/21/23 50

.........0...........

...CGTGCCATTTGTTGT...

...TCCTACAGTGCAGCA...

...TCACATATTATTGTC...

...GAAAGCAACAACTAA...

...TAAATCGTCAGTGTA...

...CCGACCAGAGTGAAA...

...GGGTTTGGTTTGATA...

...GCGTGCAGTTGTGAA...

...GTCGCCATATACACA...

...GTGGCCGTATGCGCT...

.........0...........

...CGTGCCATTTGTTGT...

...TCCTACAGTGCAGCA...

...TCACATATTATTGTC...

...GAAAGCAACAACTAA...

...TAAATCGTCAGTGTA...

...CCGACCAGAGTGAAA...

...GGGTTTGGTTTGATA...

...GCGTGCAGTTGTGAA...

...GTCGCCATATACACA...

...GTGGCCGTATGCGCT...

Describing Binding Sites

Frequency Matrix nij

-2 -1 0 1 2

A 2 0 7 1 3

C 4 8 0 0 2

G 2 0 3 4 0

T 2 2 0 5 5

-2 -1 0 1 2

A 0.2 0 0.7 0.1 0.3

C 0.4 0.8 0 0 0.2

G 0.2 0 0.3 0.4 0

T 0.2 0.2 0 0.5 0.5

04/21/23 51

Frequency Matrix

Given a signal of length m, a matrix P of dimension 4m is built up, taking Pi,j=P[nucleotide i at position j]

We suppose the existence of two different models The signal The background

Signal occurence

1

1

1,

>

∏

∏

=

=m

jS

m

jjS

j

j

b

PmSSSSS K321= { }TG,C,A,],bg in the nucleotide[P ∈= iibi

0log1

, >⎟⎟

⎠

⎞

⎜⎜

⎝

⎛∑

=

m

j S

jS

j

j

b

P

Weights storedin the matrix jiw ,

04/21/23 52

... TTTTGTTTTTTTATTTCCTGTATTTTT ...

... TCCAGCCCGAACAAAATCGATCAAAAT ...

... ATCCCTCTGGCCATTGGCAATCGATCC ...

... AGAAACAAAACGGCTTGTAAAACAAAC ...

... GTGCAGTGAGTCAGTGTGTTGTGTGCC ...

... GAGCGTAAGCAAGAGAGAGAGGTGAAG ...

... AGGTGAAGCCAGGGGCGGAGGCGCAAG ...

... AGAAAAGAGAGAGTGAAAGCATAGTCC ...

... AGTTTTCATATTGTTACCGTTTGAGTT ...

... TTCACCAGCCACTTTCAGTCGGTTTAT ...

... GCATAACGAATCACTCTGATCGCTGTC ...

... GGTCCAGCGACCACTCGCAGTTCTACA ...

... GATCGGCGTGCCATTTGTTGTTGAATC ...

... CCGCTCTCCTCCAGTGCAGCAGCAGCA ...

... TCCAAGTCACCGATTATTGTCTCAGTG ...

... GAACTGGAAACCAACAACTAACGGAGC ...

... TCAGTCTAAATTTACCCTGTAAAATTC ...

... GTAGTTCCGACCAGAGTGAAACTGAAC ...

... CTTTATGGGTTTAGTTTGATAGGAGTC ...

... TCACTGGCGTTGTTAGAGTTGTGAATG ...





















The Problem

Dataset { }nS SSSSD ,,,, 321 K=

-2 -1 0 1 2

A 0.2 0 0.7 0.1 0.2

C 0.3 0.7 0 0 0.2

G 0.2 0 0.2 0.4 0

T 0.2 0.2 0 0.4 0.5

04/21/23 53

The Score Function

The score function is made up of two terms:

A positive term

A negative term









Positivescore

Negativescore

Negativescore

Total score = positive score – negative score

04/21/23 54

The Score Function

0

0,5

1

1,5

2

2,5

-50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40

Position

Mean score

DCPD

0

0,5

1

1,5

2

2,5

-50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40

Position

Mean score

DCPD

04/21/23 55





















The Genetic Algorithm

1

1

1

0

1

0

0

0

0

0

1

1

1

1

0

1

0

1

1

0

-2 -1 0 1 2

A 0.2 0 0.7 0.1 0.2

C 0.3 0.7 0 0 0.2

G 0.2 0 0.2 0.4 0

T 0.2 0.2 0 0.4 0.5

04/21/23 56

The Genetic Algorithm

One-point crossover

Mutation operator flips one bit with a probability pm

Given a genome the frequency matrix is calculated from the sequences selected by the genome itself

The matrix is scored according to the previously defined score function

A local optimization procedure is applied to the resulting best individual

04/21/23 57

Results

Three different datasets:

Eukaryotic Promoter Database(EPD, www.epd.isb-sib.ch, rel. 74)

• Vertebrates: 2199 seqs, from -50 to 50• Homo Sapiens: 1796 seqs (included in the previous one)

Drosophila Core Promoters Database(DCPD, www-biology.ucsd.edui/labs/kadonaga/DCPD.htm)

• 205 seqs, from -47 to 44

04/21/23 58

Results - CAP

Start position: -2

Length: 6

Population: 500

Generations: 20000

Crossover Probability: 0.9

Mutation Probability: 0.04

Start position: -2

Length: 6

Population: 500

Generations: 20000



Bucher, P.: Weight matrix descriptions of four eukaryotic RNApolymerase II promoter elements derived from 502 unrelatedpromoter sequences. J. Mol. Biol. 212 (1990) 563–78

04/21/23 59

Results - CAP

CA, CG, TA and TG dinucleotides starting at position -1 in for the EPD dataset

CA, TA dinucleotides starting at position -1 for the DCPD dataset (consistently with [Kadonaga,2000]

Kutach, A., Kadonaga, J.: The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol. Cell Biol. 20 (2000) 4754–64

04/21/23 60

Results - CAP

DCPD

0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

-50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40

Position

Mean score

EPD Vertebrates

0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

-50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40

Position

Mean score

EPD Homo Sapiens

0

0,5

1

1,5

2

2,5

-50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40

Position

Mean score

DCPD

04/21/23 61

Results – TATA Box

Start position: -30

Length: 8

Population: 500

Generations: 20000



Start position: -30

Length: 8

Population: 500

Generations: 20000



0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

0,45

-47-43-39-35-31-27-23-19-15-11-7 -3 1 5 9 13 17 21 25 29 33

Position

Mean score

EPD Vertebrates

Consistent with the finding that the TBP recognizes the minor grove of DNA, where

protein-DNA interactions are typically influenced by A/T-

content, but not by the specific nucleotide

sequence.

Kim, J.L., Nikolov, D.B., Burley, S.I.: Co-crystal structureof TBP Recognizing the minor groove of a TATA element.Nature 365 (1993) 520–527

Lo, K., Smale, S.T.: Generality of a functional initiatorconsensus sequence. Gene 182 (1996) 13–22

04/21/23 62

The signals found have proved to be consistent with those described experimentally, as in the case of fruit fly signals

Differently from previous approaches to the same problem: it does not make any prior assumption about the signals structure it takes advantage of the specific localization of the signals

considered

We do not need to know in advance which sequences contain the signal, this is taken care of by the algorithm

Further improvements include: Development of models taking into account correlation between

adjacent positions Application of the method to new datasets with a better

characterization of the TSS to investigate positional correlation

between the TATA-box and the TSS and the presence of alternative

TSSs

SeQuAl

Large Scale Multiple Alignment of Genomic Sequences

04/21/23 64

Summary

Multiple Sequence Alignment

The Methods So Far

SeQuAl project

Producing a Threaded Blockset Alignment

Results & possible improvements

04/21/23 65

Multiple Sequence Alignment (1)

The multiple sequence alignment problem is the process of taking three or more input sequences

and forcing them to have the same length by inserting a universal gap symbol, in order to maximize their similarity as measured by a

score function.

04/21/23 66

Multiple Sequence Alignment (2)

04/21/23 67

Prohibitively time-consuming for sequencesof length exceeding 10 kb

Dynamic Programming Vs Heuristics

Dynamic Programming Techniques O(NxM) for two sequences of length N and M Exponential for multiple sequences O(nxM2) for n sequences of length approximately M using

some heuristics (ClustalW)

Need for more heuristics

to speed up the alignment

04/21/23 68

The Methods So Far

Pairwise sequence alignment tools MUMmer, GLASS, WABA, BLASTZ

Multiple sequence alignment tools MAVID, MLAGAN, MGA, Mauve

Anchoring techniques

Seed ungapped alignments and extensions

Filling gaps using dynamic programming

04/21/23 69

SeQuAl project

Goals: Ability to find conservation even in the

presence of mutations on amino acids

Ability to find conservation in subsets of

the original set

Anchoring technique based on text indexing structures

Hashing functions on nucleotides and amino acids

04/21/23 70

S1

S2

S3

S4

SeQuAl project

Four steps alignment: Anchors search Anchors chaining Anchors refinement Gaps filling

New!

New!

S1

S2

S3

S4

Anchors Discovery (1)

Discovery of MUMs and MEMs using Suffix Trees Longer than a minimum threshold Shorter than a maximum length for speeding up the time

complexity Statistically significant

Searched on the nucleotide sequence the amino acid sequence on the sequence of classes of aminoacids

Fragments generation

New!

New!

04/21/23 72

Anchors Discovery (2)

S: ACGTCA$R: TGTCTA$

(S,6)(R,6)

A

$C

GTCA

$

C

TA

$

$

A

GT

C

A$

G

T

C

T

A

$

A

$

T

A

GTCTA$

C

A $

TA$

(S,1)(R,4)

(S,5)

(S,2)

(R,2)

(S,4)

(R,3)(S,3)

$

(R,5)

(R,1)

{S,R}

{S,R}{S,R}

{S,R}

{S,R}

{S,R}

Fragments Chaining

Ordering relation between fragments

S1

S2

S3

S4

A B C D E< < < <

stop

A

BD

EC

start

Graph induced by the ordering relation

Shortest Path on a DAG

04/21/23 74

Partial Anchors (1)

Ordering relation?

A B C< <

S1

S2

S3

S4

S4

A<

A

B

C

Order the partial fragments based on their “barycenter”

Use a shifting barrier to impose a total ordering between partial fragments

04/21/23 75

Partial Anchors (2)

04/21/23 76

Producing a Threaded Blockset Alignment

Block is a rectangular array of symbols such that removing dashes from any row produces a run of one or more consecutive positions in one of the original sequences

Blockset: a set of such blocks

A “ref-blockset" consists of a blockset in which every block has a designated row, all of which come from the same original sequence, called the reference for that ref-blockset.

If a blockset is threaded by each of the original sequences, we call it a threaded blockset.

S1

S2

S3

S4

Local Alignmentwith ClustalW

04/21/23 77

Results (1)

Results (2)

SeQuAl Vs MGA

74,20%

80,70%

91,12%

94,04%

88,90%

92,60%

83,04%

93,48%

73,63%

96,06%

50%

55%

60%

65%

70%

75%

80%

85%

90%

95%

100%

Adenovirus 5 E. coli 3 C. pneumoniae 3 S. aureus 3 S. aureus 4

Dataset

MGA SeQuAl

SeQuAl Vs MGA

74,20%

80,70%

91,12%

94,04%

88,90%

92,60%

83,04%

93,48%

73,63%

96,06%

50%

55%

60%

65%

70%

75%

80%

85%

90%

95%

100%

Adenovirus 5 E. coli 3 C. pneumoniae 3 S. aureus 3 S. aureus 4

Dataset

MGA SeQuAl

04/21/23 79

Approximate anchors in non-coding areas of the genome Developing a hashing function for regulatory

regions

Important evolutionary events Modeling rearrangements and inversions

Memory-saving implementation of suffix trees

Custom visualization tool and classification of aligned regions based on the nature of anchors

04/21/23 80

Conclusions

Exhaustive methods: Suitable for short patterns and a once-for-

all analysis of data (e.g. whole genomes)

Sequence driven methods: Give faster but less accurate answers

Limited data sizes

It is possible to choose a trade-off between

time and accuracy

04/21/23 81

Block Decomposition

Where (3-4-4-4) comes from:0.26*1 = 10.26*2 = 10.26*3 = 10.26*4 = 20.26*5 = 2 …..0.26*15 = 4

3 4 4 4

04/21/23 82

Number of Patterns

829*34

3 4 4 4

0 0 0 41 0 0 30 1 0 30 0 1 30 0 3 1

After iteration

1+12+48+288+216+192+72 = 829

829*34 for all

0 0 2 2

4434813⎛⎞⎛⎞∗∗=⎜⎟⎜⎟⎝⎠⎝⎠

0 1 1 20 1 2 10 2 1 11 0 1 21 0 2 11 1 0 21 1 1 10 2 0 2

–Probability Probability pphithit = 829/1365 = 0.61 = 829/1365 = 0.61

–Probability of finding pattern in all 20 sequences is (0.61)Probability of finding pattern in all 20 sequences is (0.61)20 20 ≈≈ 0 0

414⎛⎞=⎜⎟⎝⎠43123⎛⎞∗=⎜⎟⎝⎠

04/21/23 83

It works…

Use the algorithm as a sift, i.e. we run it to find all patterns that occur in at least half of the sequences.

Search the patterns reported by the algorithm and use the suffix tree to check which ones actually appear in all the sequences.

04/21/23 84

The next…

The probability that a pattern of length 15 occurs with up to four mutations of a give position of a random sequence:

The probability is seen by WEEDER is

The probability that a pattern occurs in a position in a form valid for WEEDER is

(15)401531(15,4)44iiipi−=⎛⎞⎛⎞⎛⎞=⎜⎟⎜⎟⎜⎟⎝⎠⎝⎠⎝⎠∑

*0.63hitp=

**(15,4)(15,4)hitppp=∗

04/21/23 85

Finally…

The pattern occurs at least once in a sequence of length n is:

The probability that the pattern occurs at least half of the sequence of set and is found by WEEDER is:

The expected number of patterns passed by WEEDER to the second phase is :

*(14)(15,4)(15,4)1(1)seqnpp−=−−

20(20)(15,4)(15,4)(15,4)1020(20,10)()(1)seqseqiseqiipppi−=⎛⎞=−⎜⎟⎝⎠∑

15(15,4)4(20,10)seqp∗

04/21/23 86

How good it works…..

A set of 20 sequences of length 600 raises less than 10 patterns.

A set of 20 sequences of length 1000 raises to about 300 patterns.

If we look for patterns that occur in at least 9 sequences: 50 for sequences of length 600

3500 for sequence of length 1000

04/21/23 87

Performance Evaluation (cont’d)

We could partition the set of sequences in two subsets of ten sequences each and the probability that the pattern will pop up among the ones found by the algorithm in either subset is:

where pmiss = 1 - phit

10(10)26101((1))0.98iimissmissippi−=⎛⎞−−=⎜⎟⎝⎠∑

04/21/23 88


A pattern is expanded whenever at least one of the two counters is greater than q.

q is a minimum number of sequences for each subset

04/21/23 89


This approach can be pushed even further.

Partition the set of sequences as long as the parameters are such that only a few patterns satisfy them.

And it works well on long signals, where the phit value is lower and random patterns are unlikely to be picked up by the sifting phase.

04/21/23 90

Thus…

The final exact search has to be performed on a limited number of patterns.

When the signal length and the number of mutations are known in advance, we can determine the best parameter setting and search strategy for WEEDER.

04/21/23 91

If the length of the pattern is not known in advance…

There’re some additional problems to be faced…

If we choose :

The probability of hitting a (16, 4) pattern is 0.54.

But the chances of finding a (15, 4) have dropped to 0.45. (since it’s decomposed as (4-4-4-3))

=>The probabilities of finding a pattern depends on the block decomposition induced by .

0.25=

04/21/23 92

Moreover…

Setting a lower threshold value q would increase the number of candidates.

For example:

Run a algorithm with on 20 sequences of length 400 hoping to find a (16, 4) pattern.

=>The constraints are satisfied by hundreds of patterns of length 12.

04/21/23 93

What is the solution?

One possible solution could be to investigate only the longest patterns found.

=>But a significant signal could be hidden also among the shorter ones.

04/21/23 94

The solution we adopt…

Expand all patterns that appear in at least q sequences, but report only those that occur in q(m), which is set according to the pattern length.

In the previous example:

A pattern of length 12 with 3 mutations can be found with probability of about 90% by setting q=11.

=>Thus, if a pattern of length 12 appears in at least q(m)=11 sequences, we can pass it to the second phase.

Since q(m) can be set according to the number of sequences given as input, and the parameters q and .

04/21/23 95

If the nucleotide composition of the sequences is not uniform…

For example, 1:1:1:2 (T occurs twice as often as the others)

In order to avoid spending too much time in the final phase, we can set the threshold q(m) according to the pattern probability.

04/21/23 96

The points just discussed

At the end, the algorithm might report more than one pattern satisfying the constraints.

=>We need to introduce significance

measures to sort the output.

04/21/23 97

Sorting the Output

The algorithm may output more than one pattern under: The length of the signal to be found is

unknown

The sequences contain more than one signal

04/21/23 98

Example

A successful (15,4) pattern can be expanded by one symbol and becomes a (16,5) pattern.

All its occurrences are also valid occurrences of the longer one.

04/21/23 99

A Method for Sorting Outputs

A pattern P is given

is the best instance of P in the sequence i

is the total number of sequences P occurs in

is 1 if P appears in sequence i, otherwise 0

means the distance between P and

1||2(,)(,)kPPiiSPNdPPIPi==⋅−∑

iP

(,)idPP

PN

(,)IPi

iP

04/21/23 100

Relative Entropy

When the nucleotide composition of the sequences is biased, we use the background probabilities to define new match premiums and penalties.

,,1{,,,}logmrjPrjjrACGTrPEPb=∈=∑∑

04/21/23 101

Relative Entropy (cont’d)

is the frequency with which residue r occurs in position j in the occurrences of P

is the frequency of r in the sequences

,,1{,,,}logmrjPrjjrACGTrPEPb=∈=∑∑

,rjP

rb

04/21/23 102

Relative Entropy (cont’d)

It is suitable for sorting patterns under following situations: Patterns that appear the same number of

times, for example once in every sequence of

the set.

Sequences containing multiple signals.

04/21/23 103

Another Measure of Significance

We can define statistical measures of significance, that compare the actual number of occurrences of a pattern with the expected value.

(,)(,)(,)(1)PPePPePeNNZNπππ−=−

04/21/23 104

Measure of Significance (cont’d)

is the probability that P occurs in a sequence of the set with at most e mutations.

N is the overall length of the sequences.

This value can be computed explicitly (Tompa,1999)

(,)(,)(,)(1)PPePPePeNNZNπππ−=−

(,)Peπ

04/21/23 105

Software Implementation

Machine: Pentium II class computer with 64MB RAM

OS: Linux

Program Codes: written in ANSI C, about 2500 lines long

Testing Data: challenge problem as described before, varying the length from 100 to 1000 nucleotides

04/21/23 106

Experiment Results -- Time

Construction of the suffix tree: always within one second

To find the (15,4) signal with 89% success probability, and run the algorithm with = 0.26 and q = 10. for sequence lengths up to 400 nucleotides. it took less than one minute (including the final exact search).

Length 500: execution time grows significantly

Length 500 and 600: average time 125 and 200 seconds

q = 11: execution time drops to 100 and 120

04/21/23 107

Experiment Results – Time (cont’d)

Increasing the number of sequences did not influence the execution time very much.

Example For every sequence length, the algorithm took just a few more

seconds when run over 30 or 40 sequences with q set to 16 and 21,

respectively.

Sequence length set to 800, the program took on the average 320 and

450 seconds to complete the execution with q set to 11 and 10

respectively. When length is 1000 long, it took about 15 minutes.

Thus the WEEDER algorithm is suited to work on a large set of relatively short (up to 600 nucleotide) sequences than a small set of very long sequences.

Documents

Finding Regulatory Signals in Genomic Sequences Bioinformatics and Natural Computing Group Dipartimento di Informatica, Sistemistica e Comunicazione Università