Issues in the Discovery and Use of Motif Patterns
Alberto Apostolico, University of Padova and Purdue University
A. Apostolico - AofA04
General Form of Pattern Discovery
• Find and exploit a priori unknown patterns, or associations thereof, in a database
• With some prior domain-specific knowledge
• Without any prior domain-specific knowledge
• Tenet: a pattern or association (rule) that occurs more frequently than one would expect is potentially informative and thus interesting
frequent = interesting
Motifs
A motif is a recurring pattern with some solid and some ``don't care'' characters or ``gaps''
Typical PROBLEM
Input: textstring
Output: repeated motifs
[Figure: the textstring TAAGAGGTAGATAGT shown several times, with labels marking solid characters and ``don't care'' characters]
Motif discovery is hampered by the fact that a sequence typically contains exponentially many candidate motifs
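A rough illustration of why the candidates blow up, not taken from the slides: starting from a single window of length m and turning any subset of its m-2 inner characters into don't cares already yields 2^(m-2) distinct patterns. The helper names `candidate_motifs` and `occurrences`, the `.` don't-care symbol, and the 0-based positions are assumptions made for this sketch; the example string is the textstring from the figure with its spacing removed.

```python
from itertools import product

DONT_CARE = "."  # assumed symbol for a don't-care position


def occurrences(motif, text):
    """0-based start positions where motif matches text ('.' matches anything)."""
    return [
        start
        for start in range(len(text) - len(motif) + 1)
        if all(m == DONT_CARE or m == text[start + i] for i, m in enumerate(motif))
    ]


def candidate_motifs(window):
    """All patterns obtained from window by masking inner characters.

    The first and last characters stay solid, so a window of length m
    produces 2 ** (m - 2) patterns.
    """
    if len(window) < 2:
        return {window}
    inner = window[1:-1]
    patterns = set()
    for mask in product((False, True), repeat=len(inner)):
        middle = "".join(DONT_CARE if gap else ch for ch, gap in zip(inner, mask))
        patterns.add(window[0] + middle + window[-1])
    return patterns


text = "TAAGAGGTAGATAGT"
window = text[:8]
candidates = candidate_motifs(window)
repeated = {p for p in candidates if len(occurrences(p, text)) >= 2}
print(len(candidates), "candidates from one window of length", len(window))
print(len(repeated), "of them occur at least twice")
```

Every longer window doubles the count again, which is why an exhaustive enumeration is not a practical discovery method.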
Motifs
A motif is a recurring pattern with some solid and some ``don't care'' characters or ``gaps'', together with its list of occurrences
Self-correlation Motifs
[Figure: the string BAADADDBADABADB aligned against similar strings, with labels marking solid characters and ``don't care'' characters]
Controlling Motif Growth: Redundant Motifs (Parida)
A motif is
• maximal in composition if specifying more solid characters implies an alteration to its occurrence list
• maximal in length if making the motif longer implies an alteration to the cardinality or displacement of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant
Maximal, Redundant, Irredundant Motifs (examples)
Let s = abcdabcd
m_1 = ab with L_1 = { 1, 5 }
m_2 = bc with L_2 = { 2, 6 }
m_3 = cd with L_3 = { 3, 7 }
m_4 = abc with L_4 = { 1, 5 }
m_5 = bcd with L_5 = { 2, 6 }
m_6 = abcd with L_6 = { 1, 5 }
Notice that L_1 = L_4 = L_6 and L_2 = L_5.
Denoting by L + i the list of j+i such that j is in L, L_5 = L_6 + 1 and L_3 = L_6 + 2
Motif m_6 is maximal as |m_6| > |m_1| , |m_4| and |m_5| > |m_2|. Motifs m_1, m_2, m_3, m_4 and m_5 are non-maximal motifs.
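The occurrence lists and shift relations in this example can be checked mechanically. The sketch below is only a verification aid; the function names `occurrence_list` and `shift` are assumptions, and positions are 1-based to match the slide.

```python
def occurrence_list(text, pattern):
    """1-based start positions of pattern in text (overlaps allowed)."""
    return [
        i + 1
        for i in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, i)
    ]


def shift(positions, i):
    """The list L + i: every position of L moved right by i."""
    return [j + i for j in positions]


s = "abcdabcd"
motifs = ["ab", "bc", "cd", "abc", "bcd", "abcd"]
lists = {m: occurrence_list(s, m) for m in motifs}

for m in motifs:
    print(m, lists[m])

# Relations stated on the slide.
print(lists["ab"] == lists["abc"] == lists["abcd"])   # L_1 = L_4 = L_6
print(lists["bc"] == lists["bcd"])                    # L_2 = L_5
print(lists["bcd"] == shift(lists["abcd"], 1))        # L_5 = L_6 + 1
print(lists["cd"] == shift(lists["abcd"], 2))         # L_3 = L_6 + 2
```

Since every shorter motif has a list that is a shift of L_6, each one can be recovered from m_6 alone.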
Maximal, Redundant, Irredundant Motifs (examples, cont.)
Let s = aaXbaYdZZZaaVbaWcXXXXaaYbdXc
m_1 = aa.b with L_1 = { 1, 11, 22 }
m_2 = aa.ba with L_2 = { 1, 11 }
m_3 = aa.b..c with L_3 = { 11, 22 }
m_1 = aa . b is redundant, since 1) m_1 is a sub-motif of m_2 and of m_3 and 2) L_1 is the union of L_2 and L_3.
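The redundancy argument can be replayed in a few lines. This is a sketch under stated assumptions: `.` stands for a don't care, positions are 1-based, `is_sub_motif` is a simplified hypothetical check that only handles motifs aligned at their first character, and m_3 is written with two don't cares between b and c (aa.b..c), which is the spelling that yields L_3 = {11, 22} on this string.

```python
DONT_CARE = "."  # assumed symbol for a don't-care position


def occurrence_list(motif, text):
    """1-based positions where motif matches text ('.' matches anything)."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        if all(m == DONT_CARE or m == text[start + i] for i, m in enumerate(motif)):
            hits.append(start + 1)
    return hits


def is_sub_motif(small, big):
    """Simplified test: small, aligned at the start of big, only relaxes
    solid characters of big or stops earlier."""
    return len(small) <= len(big) and all(
        a == DONT_CARE or a == b for a, b in zip(small, big)
    )


s = "aaXbaYdZZZaaVbaWcXXXXaaYbdXc"
m1, m2, m3 = "aa.b", "aa.ba", "aa.b..c"
L1, L2, L3 = (occurrence_list(m, s) for m in (m1, m2, m3))

# m1 is redundant: it is a sub-motif of m2 and m3, and its list is their union.
redundant = (
    is_sub_motif(m1, m2)
    and is_sub_motif(m1, m3)
    and set(L1) == set(L2) | set(L3)
)
print(L1, L2, L3, redundant)
```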
Controlling Motif Growth: HOW MANY Irredundant Motifs
Recall that a motif is
• maximal in composition if specifying more solid characters implies an alteration to its occurrence list
• maximal in length if making it longer implies an alteration to the cardinality of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant
A motif that occurs at least k times in the textstring is a k-motif
Theorem
In any textstring x the number of irredundant 2-motifs is O(|x|)
(PROBLEM: How to find irredundant motifs as fast as possible)
Suffix Consensus, Suffix Meet
[Figure: the string s = suf1 aligned against its suffix suf4]
The consensus of suf1 and suf4 is not a motif
The meet of suf1 and suf4 is a maximal motif
Theorem
Every irredundant 2-motif of x is the meet of two suffixes of x
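A minimal sketch of the two operations, assuming the usual definitions from the Apostolico and Parida paper listed in the references: the consensus of two strings keeps a character where they agree and puts a don't care where they differ, and the meet is the consensus with leading and trailing don't cares removed. The function names, the `.` symbol, and the example string are assumptions; the string is not the one drawn on the slide.

```python
DONT_CARE = "."  # assumed symbol for a don't-care position


def suffix(s, i):
    """The suffix of s starting at 1-based position i."""
    return s[i - 1:]


def consensus(u, v):
    """Character-wise agreement of u and v over their common length."""
    return "".join(a if a == b else DONT_CARE for a, b in zip(u, v))


def meet(u, v):
    """Consensus with leading and trailing don't cares stripped (may be empty)."""
    return consensus(u, v).strip(DONT_CARE)


s = "abxabyab"  # hypothetical example string

# The consensus may start with a don't care, so it is not a motif,
# while the meet always starts and ends with a solid character.
print(consensus(suffix(s, 3), suffix(s, 6)), meet(suffix(s, 3), suffix(s, 6)))
print(meet(suffix(s, 1), suffix(s, 4)))

# By the theorem, candidates for irredundant 2-motifs can be drawn from the
# meets of all pairs of suffixes: quadratically many instead of exponentially many.
n = len(s)
candidates = {
    meet(suffix(s, i), suffix(s, j))
    for i in range(1, n + 1)
    for j in range(i + 1, n + 1)
}
candidates.discard("")
print(sorted(candidates))
```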
Data Compression by Textual Substitution
1 Detect Repeated Patterns
2 Set up Dictionary
3 Use Pointers to Dictionary to Encode Replicas
• Most schemes are NP-complete (Storer, 78)
• Few exceptions (LZ is linear)
LZW
LZW PARADIGM: build a dictionary trie as you scan the input
ROUTINE
• Find the next phrase as the longest matching entry in the trie
• Add to the trie the unit symbol extension of this phrase
Magics:
• It works, no need to send the trie
• Coding & decoding are symmetric
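The routine above can be sketched as follows. This is a textbook-style LZW, not code from the talk: a Python dict of phrases stands in for the dictionary trie, the function names are assumptions, and every character of the input is assumed to belong to `alphabet`.

```python
def lzw_compress(text, alphabet):
    """Encode text as a list of dictionary indices."""
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    phrase = ""
    for ch in text:
        if phrase + ch in table:
            phrase += ch                      # keep extending the longest match
        else:
            codes.append(table[phrase])       # emit the longest matching entry
            table[phrase + ch] = len(table)   # add its unit symbol extension
            phrase = ch
    if phrase:
        codes.append(table[phrase])
    return codes


def lzw_decompress(codes, alphabet):
    """Rebuild the text, growing the same dictionary the encoder built."""
    table = {i: ch for i, ch in enumerate(alphabet)}
    if not codes:
        return ""
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the entry being created in this very step.
            entry = prev + prev[0]
        out.append(entry)
        table[len(table)] = prev + entry[0]
        prev = entry
    return "".join(out)


codes = lzw_compress("abababab", "ab")
print(codes)
print(lzw_decompress(codes, "ab"))
```

The decoder never receives the dictionary; it reconstructs each entry one step behind the encoder, which is the symmetry the slide refers to.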
Fast and Lossy is Hard
``All universal lossy coding schemes found to date lack the relative simplicity that imbues Lempel-Ziv codes and arithmetic codes with economic viability. Perhaps as a consequence of the fact that approximate matches abound whereas exact matches are unique, it is inherently much faster to look for an exact match than it is to search from a plethora of approximate matches looking for the best, or even nearly the best, among them. The right way to trade off search effort in a poorly understood environment against the degree to which the product of the search possesses desired criteria has long been a human enigma. This suggests it is unlikely that the ``holy grail'' of implementable universal lossy source coding will be discovered soon.''
T. Berger and J.D. Gibson, ``Lossy Source Coding,'' IEEE Trans. on Inform. Theory, vol. 44, no. 6, pp. 2693--2723, 1998.
Why Fast and Lossy is Hard
Routine: Find longest prefix of incoming string matching past occurrence within some distortion
PROBLEMS
• Defining the gaps
• Encoding where the gaps are
• Finding the longest match
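To make the cost of the routine concrete, here is a brute-force sketch of the longest-match step. It assumes that distortion means a bounded number of mismatched characters (Hamming distance), which is only one possible reading; the function name and parameters are hypothetical.

```python
def longest_approx_match(past, incoming, max_mismatches):
    """Longest prefix of incoming that matches some substring of past
    with at most max_mismatches mismatched characters.

    Returns (length, start), where start is the 0-based position in past,
    or (0, -1) when nothing matches. Runs in O(len(past) * len(incoming)).
    """
    best_len, best_start = 0, -1
    for start in range(len(past)):
        mismatches = 0
        length = 0
        for a, b in zip(past[start:], incoming):
            if a != b:
                if mismatches == max_mismatches:
                    break                      # distortion budget exhausted
                mismatches += 1
            length += 1
        if length > best_len:
            best_len, best_start = length, start
    return best_len, best_start


past = "abcdefgh"
incoming = "abXdeYgZ"
for d in range(3):
    print(d, longest_approx_match(past, incoming, d))
```

With exact matching a single walk down the dictionary trie finds the longest match. Once mismatches are allowed, many starting points stay viable for longer, and the encoder must also record where the mismatches fall, which are the gap-related problems listed above.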
Why Fast and Lossy is Hard: LZW Recap
LZW PARADIGM: build a dictionary trie as you scan the input
ROUTINE
• Find the next phrase as the longest matching entry in the trie
• Add to the trie the unit symbol extension of this phrase
Motif Disambiguation
• By Guessing
DESCRIPTION OF FARMER OAK -- AN INCIDENT When Farmer Oak smile., the corners .fhis mouth spread till the. were within an unimportant distance .f his ears, his eye. were reduced to chinks, and ...erging wrinkle—red round them, extending upon... countenance li.e the rays in a rudimentary sketch of the rising sun. HisChristian name was Gabriel, and on working days he was a young man of soundjudgment,easy motions, proper dress, and ...eral good character. On Sundays,he was a man of misty views rather given to postponing, and .ampered by his bestclotes and umbrella : upon ... whole, one who felt himself to occupy morally that... middle space of Laodicean neutrality which ... between the Communion people ofthe parish and the drunken section, -- that ... he went to church, but yawnedprivately by the t.ime the cong.egation reached the Nicene creed,- and thoughtof what there would be for dinner when he meant to be listening to the sermon.
DESCRIPTION OF FARMER OAK -- AN INCIDENT When Farmer Oak smiled, the corners of his mouth spread till they were within an unimportant distance of his ears, his eyes were reduced to chinks, and diverging wrinkles appeared round them, extending upon his countenance like the rays in a rudimentary sketch of the rising sun. His Christian name was Gabriel, and on working days he was a young man of sound judgment, easy motions, proper dress, and general good character. On Sundays he was a man of misty views, rather given to postponing, and hampered by his best clothes and umbrella: upon the whole, one who felt himself to occupy morally that vast middle space of Laodicean neutrality which lay between the Communion people of the parish and the drunken section, -- that is, he went to church, but yawned privately by the time the congregation reached the Nicene creed,- and thought of what there would be for dinner when he meant to be listening to the sermon.
Giving up on ``longest match''
1 – Expected length with exact matches
2 – Expected length with distortion d
[Formula not recoverable from the slide]
…but LZ works because phrases are DISTINCT
• A most crowded parse
• Achieves maximum number of phrases in a parse
• #phrases < n / log n
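The role of distinct phrases can be observed on a small parse. The sketch below assumes LZ78-style incremental parsing, which may not be the exact variant the slide has in mind: each new phrase is the shortest prefix of the remaining input not seen as a phrase before, so all phrases are distinct except possibly the last. The function name is hypothetical.

```python
def lz78_phrases(text):
    """Incremental parse of text into phrases.

    Every phrase is new when it is cut; only a leftover at the very end
    of the input may repeat an earlier phrase.
    """
    seen = set()
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        phrases.append(current)  # trailing phrase, possibly a repeat
    return phrases


# A run of one symbol parses into phrases of lengths 1, 2, 3, ...
print(lz78_phrases("a" * 15))
print(lz78_phrases("abababab"))
```

Because the phrases must all differ, they are forced to grow longer as the parse proceeds, which is what keeps their number small relative to n.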
LZWA is not necessarily better than LZW
x = aaaaaaaaaaaa………a
But compare vocabularies under alphabet compression
Conclusions
- Self-correlation Motifs give versatile compression schemata for a variety of inputs
- ``Plier la machine'' approach bridges lossless and lossy
- Linear time lossy variant with reasonable performance
- Deeper analysis, broad experimentation with fine-tuned variants, and several extensions are needed; some are under way
Main References
• A. Apostolico, ``Pattern Discovery and the Algorithmics of Surprise'', Proceedings of the NATO ASI on Artificial Intelligence and Heuristic Methods for Bioinformatics (P. Frasconi and R. Shamir, eds.), IOS Press, 111--127 (2003).
• A. Apostolico and L. Parida, ``Incremental Paradigms of Motif Discovery'', Journal of Computational Biology 11:1, 15--25 (2004).
• A. Apostolico, M. Comin and L. Parida, ``Motifs in Ziv-Lempel-Welch Clef'', Proceedings of IEEE DCC Data Compression Conference, pp. 72--81 (2004).
• A. Apostolico, ``Fast Gapped Variants for Lempel-Ziv-Welch Compression'', in preparation.