
Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery


Page 1: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Monotony and Surprise
Algorithmic and Combinatorial Foundations of Pattern Discovery

Alberto Apostolico
University of Padova and Georgia Inst. of Tech.


Alberto Apostolico - Erice05

http://www.cc.gatech.edu/~axa/papers

A) Specialized Material

A. Apostolico and G. Bejerano, ``Optimal Amnesic Probabilistic Automata, or How to Learn and Classify Proteins in Linear Time and Space'', RECOMB 2000 and Journal of Computational Biology, 7(3/4):381--393, 2000.

A. Apostolico, M.E. Bock, S. Lonardi and X. Xu, ``Efficient Detection of Unusual Words'', Proceedings of RECOMB 2002 and Journal of Computational Biology, 7(1/2):71--94, 2000.

A. Apostolico, F. Gong and S. Lonardi, ``Verbumculus and the Detection of Unusual Words'', Journal of Computer Science and Technology, 19(1) (Special Issue on Bioinformatics):22--41, 2004.

A. Apostolico and L. Parida, ``Incremental Paradigms of Motif Discovery'', Journal of Computational Biology, 11(1):15--25, 2004.

A. Apostolico, M.E. Bock and S. Lonardi, ``Monotony of Surprise and Large Scale Quest for Unusual Words'', Journal of Computational Biology, 10(3-4):283--311, 2003.

A. Apostolico and C. Pizzi, ``Monotone Scoring of Patterns with Mismatches'', Proceedings of the 4th Workshop on Algorithms in Bioinformatics, Bergen, Norway, Springer Verlag LNCS 3240, 87--98, 2004.


B) Introductory Material

A. Apostolico and M. Crochemore, ``String Pattern Matching for a Deluge Survival Kit'', Handbook of Massive Data Sets, J. Abello et al., eds., Kluwer Acad. Publishers, to appear.

A. Apostolico, ``General Pattern Matching'', Handbook of Algorithms and Theory of Computation, M.J. Atallah, ed., CRC Press, Ch. 13, pp. 1--22, 1999.

A. Apostolico, ``Of Maps Bigger than the Empire'', Keynote, SPIRE2001, IEEE Press, 2001.

A. Apostolico, ``Pattern Discovery and the Algorithmics of Surprise'', Artificial Intelligence and Heuristic Methods for Bioinformatics (P. Frasconi and R. Shamir, eds.), IOS Press, 111--127, 2003.

A. Apostolico, ``Pattern Discovery in the Crib of Procrustes'', Imagination and Rigor, Essays on Eduardo R. Caianiello's Scientific Heritage Ten Years after his Death (S. Termini, ed.), Springer-Verlag, to appear 2005.


Acknowledgements

Gill Bejerano Dept. of Computer Science - The Hebrew University

Mary Ellen Bock Dept. of Statistics - Purdue University

Matteo Comin Univ. of Padova

Jianhua Dong Dept. of Industrial Technology, Purdue University

S. Lonardi Dept. of Comp. Science and Eng. - UC Riverside

Fu Lu Celera

FangCheng Gong Celera

Laxmi Parida IBM

Cinzia Pizzi Univ. of Padova

Xuyan Xu CapitalOne


A hemoglobin molecule consists of four polypeptide chains: two α-globin chains (shown in green and blue) and two β-globin chains (shown in yellow and orange). Each globin chain contains a heme (shown in red).

Hemoglobin is the protein that carries oxygen from the lungs to the tissues and carries carbon dioxide from the tissues back to the lungs. In order to function most efficiently, hemoglobin needs to bind oxygen tightly in the oxygen-rich atmosphere of the lungs and be able to release oxygen rapidly in the relatively oxygen-poor environment of the tissues.

Form = Function


Bioinformatics the Road Ahead

``. . . more than any other single factor, the sheer volume of data poses the most serious challenge -- many problems that are ordinarily quite manageable become seemingly insurmountable when scaled up to these extents. For these reasons, it is evident that imaginative new applications of technologies designed for dealing with problems of scale will be required. For example, it may be imagined that data mining techniques will have to supplant manual search, intelligent data base integration will be needed in place of hyperlink browsing, scientific visualization will replace conventional interface to the data, and knowledge-based systems will have to supervise high-throughput annotation of the [sequence] data''

[D.B. Searls, ``Grand Challenges in Computational Biology'', in Salzberg, Searls, Kasif, eds., Elsevier 1998]


• At a joint EU - US panel meeting on large scientific data bases held in Annapolis in 1999, I was invited, along with the physicists and the earth observation scientists, to represent the needs of computational biology. Faithful to my duty, I said time and again that the kind of data available to biology was a tiny fraction of what is produced in earth observation and high energy physics. Just as the others were dismissing me, saying that we did not need money, I said: don't worry, we will make up for it with the data we will generate.

Biology is a natural science: it dissects and multiplies.
Formal sciences synthesize and cluster.
There is no telling where these two will go together.


Which Information Anyway

- Greek ``eidos'' is form, appearance, or, in Latin, species
- information is the modern, quantified version of what the Greeks called ``eidos''
- it is a measure of the amount of structure

The three dimensions of information:

- syntactic (formal medium without meaning)
- semantic (dualism of subject and object invented by modern philosophy)
- pragmatic (attempt to describe the understanding of meaning as a natural process)


Linnaeus’ Taxonomy (partial)

King Phillip Came Over For Green Soup
(Kingdom, Phylum, Class, Order, Family, Genus, Species)

Biologists group organisms by morphology to represent similarities and propose relationships.


The “Chinese” Taxonomy

Attributed by a Dr. Franz Kuhn to the Chinese encyclopedia entitled Celestial Emporium of Benevolent Knowledge.

``Animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from afar''

J.L. Borges, ``The Analytical Language of John Wilkins'', from Otras Inquisiciones (Other Inquisitions 1937-1952, London: Souvenir Press, 1973)


Summary

• Form and Information
• To Classify and Generate
• Of Free Lunches, Ugly Ducklings, and Little Green Men
• Privileging Syntactic Information
• Avoidable and Unavoidable Regularities
• Periods, Palindromes, Squares, etc.
• Theories Bigger than Life
• Motifs, Profiles and Weight Matrices
• The Emperor’s New Map


Defining ``Class''
From Watanabe's ``Pattern Recognition as Information Compression'', in Frontiers in Pattern Recognition

Class can be defined by
1 intension (= list of properties or predicates), or
2 extension (= list of names of individual members)

Class can also be defined by
3 paradigm (= show a few members and, optionally, a few non-members). This is what the brain does well (and what pattern recognition does poorly)

Finally, Class can be defined by
4 clustering (= we are not even given paradigms but rather sets of objects, and are asked to isolate subsets with strong coherence)


Class by Intension
From Watanabe's ``Pattern Recognition as Information Compression'', in Frontiers in Pattern Recognition

Types of class intension:

Vectorial approach (statistical pattern recognition), which divides into two:
- In the conventional zone, a class is characterized by a predicate of the type: belongs to such and such volume of the n-dimensional representation space
- In the subspace method, a class is characterized by a predicate of the type: belongs to such and such subspace of the n-dimensional representation space

Structural or grammatical approach: a class is characterized by a predicate of the type: consists of such and such elementary components which are arranged together in such and such ways

Note: structural and vector descriptions are not uncorrelated; on the contrary. For example, multiple sequence alignment can be considered as a search for the dimensions along which paradigms of noisy vectors exhibit the same value.


Statistical Classification

A Class is formed by Objects with many Predicates in common.

Theorem of the Ugly Duckling (S. Watanabe): as long as all of the predicates characterizing the objects to be classified are given the same importance or ``weight'', then a swan will be found to be just as similar to a duck as to another swan.

Classification as experienced on an empirical basis is only possible to the extent that the various predicates characterizing objects are given non-uniform weights.


Statistical Classification

Theorem of the Ugly Duckling (S. Watanabe): as long as all of the predicates characterizing the objects to be classified are given the same importance or ``weight'', then a swan will be found to be just as similar to a duck as to another swan.

Cannot measure similarity by # of shared features: a member with only the left eye is more similar to one with no eye than to one with only the right eye.

Must measure similarity by # of shared predicates. But this number is irrespective of the number of objects and same for all pairs.


Statistical Classification

Cannot measure similarity by # of shared features: a member with only the left eye is more similar to one with no eye than to one with only the right eye.

Must measure similarity by # of shared predicates. But this number is irrespective of the number of objects and same for all pairs.

Total # of predicates:

\[ \sum_{r=0}^{n} \binom{n}{r} = 2^{n} \]

Total # of predicates shared by ANY two patterns:

\[ \sum_{r=2}^{d} \binom{d-2}{r-2} = 2^{d-2} \]

Theorem of the Ugly Duckling (S. Watanabe): as long as all of the predicates characterizing the objects to be classified are given the same importance or ``weight'', then a swan will be found to be just as similar to a duck as to another swan.
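
The counting behind the theorem can be checked by brute force. The sketch below is only an illustration under a stated assumption: a predicate is identified with the subset of the d objects that satisfy it (encoded as a d-bit mask), so that there are 2^d predicates in all. The function name is hypothetical and not taken from the slides.

```python
from itertools import combinations

def shared_predicate_counts(d):
    """For d objects, count the predicates shared by each pair of objects.

    Assumption: a predicate is modeled as the set of objects satisfying it,
    encoded as a d-bit mask, so there are 2**d predicates in total.
    """
    counts = {}
    for i, j in combinations(range(d), 2):
        pair = (1 << i) | (1 << j)
        # A predicate is shared by objects i and j when both bits are set.
        counts[(i, j)] = sum(
            1 for mask in range(1 << d) if (mask & pair) == pair
        )
    return counts

counts = shared_predicate_counts(5)
# Every pair shares the same number of predicates, namely 2**(d-2).
print(set(counts.values()))
```

With uniform weights every pair of objects ends up with the same count, which is the sense in which the swan is as similar to the duck as to another swan.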


Inferring Grammars

Grammatical inference problem:

Input: a finite set of symbol strings from some language L and possibly a finite set of strings from the complement of L

Output: a grammar for the language

``Precisely the same problem arises in trying to choose a model or theory to explain a collection of sample data. This is one of the most important information processing problems known and it is surprising that there has been so little work on its formalization.''

(Biermann-Feldman, 1972)


Regular, Anomalous, Entropy, Negentropy

• Shannon: information is entropy
• Brillouin: info is negentropy, entropy is chaos
• Key to the paradox: actual versus potential information
• How can we express gain in information? (difference between two distributions?)
• This measure is global and can be either positive or negative
• A better measure (Alfred Renyi - always positive)
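
The formulas on this slide did not survive extraction. As a hedged illustration, the sketch below assumes that the first measure is a plain difference of Shannon entropies, which can take either sign, and that the always-positive measure is the information gain D(p || q) = sum p_i log(p_i / q_i), commonly known as the Kullback-Leibler divergence. Both identifications are assumptions, as are the function names and the two sample distributions.

```python
import math

def entropy(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def information_gain(p, q):
    """D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

# The entropy difference changes sign with the order of the arguments.
print(entropy(uniform) - entropy(skewed))
print(entropy(skewed) - entropy(uniform))

# The information gain is non-negative in both directions.
print(information_gain(skewed, uniform), information_gain(uniform, skewed))
```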


Random, Regular, Compressible

Measuring structure in finite objects presupposes the ability to measure randomness in such objects.

Defining randomness has been an elusive goal for statisticians since the turn of the last century.

Kolmogorov's definition of information (note resemblance to molecular evolution): information (alternatively, conditional information) is the length of the recorded sequence of zeroes and ones that constitute a shortest program by which a universal machine produces one string from scratch (alt., from another string).


Random, Regular, Compressible

Kolmogorov's definition of information (note resemblance to molecular evolution): information (alternatively, conditional information) is the length of the recorded sequence of zeroes and ones that constitute a shortest program by which a universal machine produces one string from scratch (alt., from another string).

The programs of length less than k are at most 0, 1, 00, 01, 10, 11, ..., 11...1 (k-1 `1's).
The number of strings with a program of length less than k is at most 1 + 2 + 4 + ... + 2^(k-1) = 2^k - 1 < 2^k.

Bad News: there is hardly such a notion as that of a finite random sequence, and yet most very long strings are complex. Any given short sequence seems to exhibit some kind of regularity; however, in the limit, a great many sequences of sufficiently large length are seen to be incompressible and hence to appear as random.

It appears thus that we attribute and measure structure in finite objects only to the extent that we privilege (i.e., assign a high weight to) certain regularities and neglect others (is this the structural classification counterpart to the theorem of the ugly duckling?)
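
A small experiment can make the counting argument and the notion of incompressibility concrete. This is only a sketch: Kolmogorov complexity is not computable, so the length of a `zlib`-compressed string is used here as a crude stand-in (an upper bound up to a constant), and the sample strings and the helper name are assumptions made for illustration.

```python
import os
import zlib

def programs_shorter_than(k):
    """Number of binary strings of length 0..k-1: 1 + 2 + ... + 2**(k-1)."""
    return sum(2 ** length for length in range(k))

k = 10
# Fewer than 2**k descriptions exist, so not all 2**k strings of length k
# can have a description shorter than k bits.
print(programs_shorter_than(k), 2 ** k)

regular = b"ab" * 5000            # highly periodic, 10000 bytes
random_like = os.urandom(10000)   # 10000 bytes from the OS entropy source

# The periodic string shrinks dramatically; the random one does not.
print(len(zlib.compress(regular)), len(zlib.compress(random_like)))
```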


Privileging Syntactic Regularities in Strings

Syntactic regularities in strings are pervasive notions in Computer Science and its applications. In Molecular Biology, regularities are variously implicated in diverse facets of biological function and structure.

Typical string regularities:

- cadences
- periods
- squares or tandem repeats
- repetitions
- palindromes
- episodes
- motifs
- other exact variants and approximate versions thereof

There are avoidable and unavoidable regularities!


Unavoidable Regularities

If N is partitioned into k classes, one of the classes contains arbitrarily long arithmetic progressions (Baudet-Artin-van der Waerden, 1926-27)


Avoidable Regularities

Periods, Borders

Periodicities are pervasive notions of string algorithmics, e.g., KMP string searching.

abaabaababaabaababaabaabaabaab

A string can have many periods: abacabacaba has periods abac, abacabac and abacabacab (lengths 4, 8 and 10).

The smallest one is THE period of the string.
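
The link between periods, borders and KMP can be sketched as follows: a string of length n has period p exactly when it has a border of length n - p, so all periods can be read off the KMP failure function. The function names are illustrative choices, not taken from the slides.

```python
def border_array(s):
    """border[i] = length of the longest proper border of s[:i+1]
    (the KMP failure function)."""
    border = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]
        if s[i] == s[k]:
            k += 1
        border[i] = k
    return border

def all_periods(s):
    """All periods p (1 <= p <= len(s)) of s, smallest first."""
    periods = []
    if not s:
        return periods
    border = border_array(s)
    b = border[-1]
    while b > 0:
        # A border of length b corresponds to the period len(s) - b.
        periods.append(len(s) - b)
        b = border[b - 1]
    periods.append(len(s))  # the trivial period
    return periods

print(all_periods("abacabacaba"))  # [4, 8, 10, 11]
```

The first entry of the returned list is THE period of the string.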


Periods cannot coexist too long

A string can have many periods: abacabacaba has periods abac, abacabac and abacabacab.

Periodicity Lemma (Lyndon-Schützenberger, 62)

If w has two periods of length p and q and |w| is at least p+q, then w has period gcd(p,q).

Proof: assume wlog p > q, write n = |w|, and take any position i of w. Since n is at least p+q, either
1) i-q is not smaller than 1, or
2) i+p is not larger than n.

Case 1: w[i] = w[i-q] = w[i-q+p]
Case 2: w[i] = w[i+p] = w[i+p-q]

So p-q is a period; now repeat on q and p-q.
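
The lemma can be sanity-checked exhaustively on short strings. The sketch below tests every string over a two-letter alphabet up to a chosen length; the helper names, the alphabet and the length bound are arbitrary choices for illustration, and a finite check of this kind is of course no substitute for the proof.

```python
from itertools import product
from math import gcd

def has_period(w, p):
    """True if w[i] == w[i + p] for every valid position i."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))

def check_periodicity_lemma(max_len, alphabet="ab"):
    """Return all (w, p, q) with periods p, q, len(w) >= p + q,
    for which gcd(p, q) is NOT a period of w."""
    bad = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            periods = [p for p in range(1, n + 1) if has_period(w, p)]
            for p in periods:
                for q in periods:
                    if p + q <= n and not has_period(w, gcd(p, q)):
                        bad.append((w, p, q))
    return bad

print(check_periodicity_lemma(10))  # no counterexamples: []
```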


Avoidable Regularities

Periods and periodicities are pervasive notions of string algorithmics, e.g., KMP string searching. A string can have many periods; the smallest one is THE period of the string.

Palindromes: w = w^R

Once we know how to compute optimally ALL periods of a string we can also compute all initial palindromes.

Proof: run the algorithm on w*w^R: abab ... * ... baba

(In fact, all palindromes of a string can be computed in serial linear time: Manacher, 76)
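
The reduction from initial palindromes to borders can be sketched directly: a prefix of w of length l is a palindrome exactly when w*w^R has a border of length l, where * is a separator that does not occur in w. The code below is a minimal illustration with hypothetical function names; it uses the KMP failure function to enumerate borders.

```python
def border_lengths(s):
    """Lengths of all proper borders of s, longest first."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    lengths = []
    b = fail[-1] if s else 0
    while b > 0:
        lengths.append(b)
        b = fail[b - 1]
    return lengths

def initial_palindromes(w, sep="*"):
    """Lengths of all palindromic prefixes of w (sep must not occur in w)."""
    if sep in w:
        raise ValueError("separator must not occur in the input string")
    # Borders of w + sep + reverse(w) are exactly the palindromic prefixes.
    return sorted(border_lengths(w + sep + w[::-1]))

print(initial_palindromes("abacabad"))  # [1, 3, 7]
```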


Squares or Tandem Repeats
or: why does the genetic code need more than 2 characters

Square: a string in the form ww with w a primitive string
Primitive string: a string that cannot be rewritten in the form v^k with k > 1
Square free string: a string that contains no square

Longest square free string on two symbols: 010?

Thue (1906): on an alphabet of at least 3 symbols we can write indefinitely long square free strings.

Square free morphism:
rew(a) -> abcab
rew(b) -> acabcb
rew(c) -> acbcacb

There are about n^2 ways of choosing indices i and j, thus n^2 squares?

Istrail’s morphism (square free on ``a''):
rew(a) -> abc
rew(b) -> ac
rew(c) -> b
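
Both claims on this slide lend themselves to a quick brute-force check. The sketch below enumerates square free binary strings and iterates the first morphism listed above starting from ``a''; the function names and the number of iterations are illustrative assumptions. Testing for any factor xx is equivalent to testing for squares with primitive root, since every xx contains a primitively rooted square.

```python
from itertools import product

def has_square(s):
    """True if s contains a non-empty factor of the form xx (brute force)."""
    n = len(s)
    for half in range(1, n // 2 + 1):
        for i in range(n - 2 * half + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                return True
    return False

def square_free_words(alphabet, n):
    """All square free strings of length n over the given alphabet."""
    words = ("".join(t) for t in product(alphabet, repeat=n))
    return [w for w in words if not has_square(w)]

# On two symbols nothing of length 4 survives.
print(square_free_words("01", 3), square_free_words("01", 4))

# The square free morphism from the slide.
THUE = {"a": "abcab", "b": "acabcb", "c": "acbcacb"}

def rewrite(s, morphism=THUE):
    return "".join(morphism[c] for c in s)

word = "a"
for _ in range(3):
    word = rewrite(word)
print(len(word), has_square(word))
```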


Detecting Squares

How many squares? There can be cn log n squares in a string (Crochemore, 81)

Example: Fibonacci words

F_0 = a
F_1 = b
F_i = F_{i-1} F_{i-2}

a b ba bab babba babbabab babbababbabba ...

Optimal n log n algorithms since early 80's (Main-Lorentz, AA-Preparata, Rabin, Crochemore)
Recent (Kosaraju, Gusfield)
Parallel (AA, Crochemore-Rytter, AA-Breslauer)
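
The Fibonacci words above are easy to generate, and their squares can be counted by brute force for small indices. This is a cubic-time sketch for illustration only, not one of the n log n algorithms cited on the slide; the function names are hypothetical, and an occurrence is counted once per (position, root length) with a primitive root, following the definition of square used earlier.

```python
def fibonacci_word(i):
    """F_0 = a, F_1 = b, F_i = F_{i-1} F_{i-2}."""
    prev, cur = "a", "b"
    if i == 0:
        return prev
    for _ in range(i - 1):
        prev, cur = cur, cur + prev
    return cur

def is_primitive(x):
    """x is primitive iff its first re-occurrence inside xx is at len(x)."""
    return (x + x).find(x, 1) == len(x)

def count_square_occurrences(s):
    """Number of occurrences of factors ww with w primitive (brute force)."""
    n = len(s)
    total = 0
    for half in range(1, n // 2 + 1):
        for i in range(n - 2 * half + 1):
            root = s[i:i + half]
            if root == s[i + half:i + 2 * half] and is_primitive(root):
                total += 1
    return total

print([fibonacci_word(i) for i in range(7)])
for i in range(5, 12):
    w = fibonacci_word(i)
    print(i, len(w), count_square_occurrences(w))
```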


Tandem Repeats, Repeated Episodes
(Myers ‘87, Kannan-Myers ‘92, Landau-Schmidt ‘93, Benson ’98, Ap.-Federico ’98, Myers-Sagot ’99, Ap.-Atallah ’99)

Input: textstring
Output: repeated episode (within constraints)
(worst-case quadratic or nk with max k errors)

Constraints shown in the figure: max 12 pos, max 30 pos


Pattern Discovery in WAKA

alluded to: Kokin-shu #315

(Minamoto-no-Muneyuki)

ya-ma-sa-to-ha

fu-yu-so-sa-hi-shi-sa

ma-sa-ri-ke-ru

hi-to-me-mo-ku-sa-mo

ka-re-nu-to-o-mo-he-ha

allusive variation shugyoku-shu #3528

(Jien)

ya-to-sa-hi-te

hi-to-me-mo-ku-sa-mo

ka-re-nu-re-ha

so-te-ni-so-no-ko-ru

a-ki-no-shi-ra-tsu-yu

alluded to: Kokin-shu #315

A hamlet in the mountains is all the drearier in winter.

I feel that there is no one to see

and no green around

allusive variation shugyoku-shu #3528

My home has been deserted

Now in autumn, there is no one to see

And no green around

There is a pearl of dew left on my sleeve


Discovering instances of poetic allusion from anthologies of classical Japanese poems

Masayuki Takeda, Tomoko Fukuda, Ichiro Nanri, Mayumi Yamasaki, Koichi Tamari
Theoretical Computer Science, Volume 292, Issue 2

ABSTRACT

Waka is a form of traditional Japanese poetry with a 1300-year history. In this paper, we attempt to semi-automatically discover instances of poetic allusion, or more generally, to find similar poems in anthologies of Waka poems. One reasonable approach would be to arrange all possible pairs of poems in two anthologies in decreasing order of similarity values, and to scrutinize high-ranked pairs by human effort. The means of defining similarity between Waka poems plays a key role in this approach. In this paper, we generalize existing (dis)similarity measures into a uniform framework, called string resemblance systems, and using this framework, we develop new similarity measures suitable for finding similar poems. Using the measures, we report successful results in finding instances of poetic allusion between two anthologies Kokin-Shu and Shin-Kokin-Shu. Most interestingly, we have found an instance of poetic allusion that has never before been pointed out in the long history of Waka research.


Cheating by SchoolteachersCheating by Schoolteachers(the longest substring common to (the longest substring common to kk of of nn strings) strings)

112a4a342cb214d000112a4a342cb214d0001acd24a3a12dadbcb4a00000000000000d4a2341cacbddad3142a2344a2ac23421c00adb4b3cbd4a2341cacbddad3142a2344a2ac23421c00adb4b3cb1b2a34d4ac42d23b141b2a34d4ac42d23b141acd24a3a12dadbcb4a21341412134141dba23dad1abbac1db1dba23dad1abbac1db11acd24a3a12dadbcb4a21db20021db200dbbbd21d3aac11da42dadcc000adcd21c4b4421dd000dbbbd21d3aac11da42dadcc000adcd21c4b4421dd000121a4a2dcc2cadc11a121a4a2dcc2cadc11a1acd24a3a12dadbcb4a11da01111da0111421acbbdba23dad121421acbbdba23dad121acd24a3a12dadbcb4aa000214a000214cacb1dadbc42dd1122cacb1dadbc42dd11221acd24a3a12dadbcb4acacb1dacacb1dadbbbd21d3aac11da421dadcc000adcd21c4b4421dd00dbbbd21d3aac11da421dadcc000adcd21c4b4421dd002baaab3dad2aadca222baaab3dad2aadca221acd24a3a12dadbcb4a23421c023421c01baaab3dcacb1dadbc42ac2cc31012dadbcb4ad400001baaab3dcacb1dadbc42ac2cc31012dadbcb4ad40000

From: S. D. Levitt and S. J. Dubner, Freakonomics, Morrow, 2005
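The problem named in the slide title can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it hashes all substrings of a candidate length and binary-searches on that length. The function name `longest_common_to_k` and the toy answer strings are made up for the example.

```python
from collections import defaultdict

def longest_common_to_k(strings, k):
    """Return one longest substring present in at least k of the strings ('' if none)."""
    def witness(length):
        owners = defaultdict(set)  # substring -> indices of the strings containing it
        for idx, s in enumerate(strings):
            for i in range(len(s) - length + 1):
                owners[s[i:i + length]].add(idx)
        return next((sub for sub, who in owners.items() if len(who) >= k), None)

    # If a substring of length L is shared by k strings, so is each of its prefixes,
    # so feasibility is monotone in L and binary search on the length is valid.
    lo, hi, best = 1, max((len(s) for s in strings), default=0), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = witness(mid)
        if found is not None:
            best, lo = found, mid + 1
        else:
            hi = mid - 1
    return best

sheets = ["xxabcdyy", "zzabcdq", "pabcdww", "nothing"]  # hypothetical answer strings
print(longest_common_to_k(sheets, 3))
```

The sketch trades efficiency for brevity; a generalized suffix tree is the usual route to a faster solution on long inputs.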

Page 34: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Summary

• Form and Information
• To Classify and Generate
• Of Free Lunches, Ugly Ducklings, and Little Green Men
• Privileging Syntactic Information
• Avoidable and Unavoidable Regularities
• Periods, Palindromes, Squares, etc.
• Theories Bigger than Life
• Motifs, Profiles and Weight Matrices

Page 35: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

General Form of Pattern Discovery

• Find and exploit a priori unknown patterns, or associations thereof, in a database

• With some prior domain-specific knowledge
• Without any domain-specific prior knowledge

• Tenet: a pattern or association (rule) that occurs more frequently than one would expect is potentially informative and thus interesting

frequent = interesting

Page 36: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Data Compression by Textual Substitution

1 Detect Repeated Patterns
2 Set up Dictionary
3 Use Pointers to Dictionary to Encode Replicas

Redundancy (repetitiveness) is sought in order to remove it

Page 37: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Consumer Prediction (Data Mining)
Intrusion Detection (Security)
Protein Classification (Bio-Informatics)

• Infer consistent behavior from protocol of past record
• Use to predict future behavior or detect malicious practices

1) Collect a set of behavioral sequences (normal profile) into a repository or dictionary
2) Define measure(s) of sequence similarity
3) Compare any new sequence to the dictionary, using similarity to past behavior as a basis for classification as normal or anomalous

Anomaly is sought as a carrier of information
Similarity or predictability equals fitness to the model
Learning from positive & negative samples
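The three steps above can be sketched as follows. The similarity measure is an assumption chosen for brevity (the fraction of length-w windows of the new sequence that already appear in the dictionary), and the names `build_dictionary`, `similarity`, `classify`, and the 0.8 threshold are all hypothetical.

```python
def build_dictionary(normal_sequences, w):
    """Step 1: collect every length-w window seen in the normal profile."""
    return {seq[i:i + w] for seq in normal_sequences for i in range(len(seq) - w + 1)}

def similarity(seq, dictionary, w):
    """Step 2: fraction of the sequence's windows that occur in the dictionary."""
    windows = [seq[i:i + w] for i in range(len(seq) - w + 1)]
    if not windows:
        return 0.0
    return sum(win in dictionary for win in windows) / len(windows)

def classify(seq, dictionary, w, threshold=0.8):
    """Step 3: label a new sequence by its similarity to past behavior."""
    return "normal" if similarity(seq, dictionary, w) >= threshold else "anomalous"

profile = build_dictionary(["abcabcabc", "bcabca"], w=3)
print(classify("cabcab", profile, w=3))   # every window was seen before
print(classify("abxcab", profile, w=3))   # most windows are new
```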

Page 38: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery


Of Exactitude in Science

...In that Empire, the craft of Cartography attained such Perfection that the Map of a Single province covered the space of an entire City, and the Map of the Empire itself an entire Province. In the course of Time, these Extensive maps were found somehow wanting, and so the College of Cartographers evolved a Map of the Empire that was of the same Scale as the Empire and that coincided with it point for point. Less attentive to the Study of Cartography, succeeding Generations came to judge a map of such Magnitude cumbersome, and, not without Irreverence, they abandoned it to the Rigours of Sun and Rain. In the western Deserts, tattered Fragments of the Map are still to be found, Sheltering an occasional Beast or beggar; in the whole Nation, no other relic is left of the Discipline of Geography.

From Travels of Praiseworthy Men (1658) by J. A. Suarez Miranda

The piece was written by Jorge Luis Borges and Adolfo Bioy Casares. English translation quoted from J. L. Borges, A Universal History of Infamy, Penguin Books, London, 1975.

Page 39: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Detection and Analysis of Gene Regulatory Regions

(Jacques van Helden, http://copan.cifn.unam.mx/Computational_Biology/yeast-tools)

`` Starting from the simple knowledge that a set of genes share some regulatory behavior, one can suppose that some elements are shared by their upstream region, and one would like to detect such elements.

We implemented a simple and fast method to extract such elements, based on a detection of over-represented oligonucleotides.

J. Mol. Biol. (1998) 281, 827-842. ‘’

Page 40: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

A table of mono-mers contains only 4 lines

http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/

seq observed_freq occ

a 0.2879006655447 301075

c 0.2120993344553 221805

g 0.2120993344553 221805

t 0.2879006655447 301075

Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies
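A table in the same three-column layout can be produced by a short counting routine. This is only a sketch: the function name `kmer_table` is made up, and it counts overlapping k-mers on a single strand, whereas the identical frequencies for a/t and c/g above suggest (an inference, not something the slide states) that the original tables pool both strands.

```python
from collections import Counter

def kmer_table(seq, k):
    """Map each k-mer in seq to (observed frequency, number of occurrences)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: (occ / total, occ) for word, occ in sorted(counts.items())}

for word, (freq, occ) in kmer_table("acgtacgt", 2).items():
    print(word, freq, occ)

# Over the DNA alphabet a full table of k-mers has 4**k rows.
print([4 ** k for k in range(1, 9)])
```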

Page 41: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

A table of 2-mers contains 16 lines

•;seq observed_freq occ

•aa 0.0996514874362 103508

•ac 0.0516799845961 53680

•ag 0.0522951766631 54319

•at 0.0840396649658 87292

•ca 0.0630865504958 65528

•cc 0.0474795417349 49317

•cg 0.0490959853663 50996

•ct 0.0522951766631 54319

•ga 0.0559112351978 58075

•gc 0.0573659381920 59586

•gg 0.0474795417349 49317

•gt 0.0516799845961 53680

•ta 0.0692904592279 71972

•tc 0.0559112351978 58075

•tg 0.0630865504958 65528

•tt 0.0996514874362 103508

http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/

Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies

Page 42: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

With increasing k, a table of k-mers rapidly grows out of proportion

How many k-mers in total, for all k?

http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/ RSA-tools - menu.htm

•;seq observed_freq occ
•aaa 0.0374140033303 38601
•aac 0.0180639045638 18637
•aag 0.0163987337723 16919
•aat 0.0276555984825 28533
•aca 0.0162746698251 16791
•acc 0.0114206678905 11783
•acg 0.0118180602214 12193
•act 0.0121282200894 12513
•aga 0.0141936909606 14644
•agc 0.0127214008370 13125
•agg 0.0133359050756 13759
•agt 0.0121282200894 12513
•ata 0.0221735228152 22877
•atc 0.0172361654160 17783
•atg 0.0170946549762 17637
•att 0.0276555984825 28533
•caa 0.0181356290333 18711
•cac 0.0117036887701 12075
•cag 0.0161176513919 16629
•cat 0.0170946549762 17637
•cca 0.0113256814309 11685
•ccc 0.0103942325773 10724
•ccg 0.0122910540202 12681
•cct 0.0133359050756 13759
•cga 0.0102721071292 10598
•cgc 0.0147384092288 15206
•cgg 0.0122910540202 12681
•cgt 0.0118180602214 12193
•cta 0.0088434332371 9124
•ctc 0.0108817651198 11227
•ctg 0.0161176513919 16629
•ctt 0.0163987337723 16919
•gaa 0.0180416118233 18614
•gac 0.0096450026461 9951
•gag 0.0108817651198 11227
•gat 0.0172361654160 17783
•gca 0.0166342614221 17162
•gcc 0.0133436590723 13767
•gcg 0.0147384092288 15206
•gct 0.0127214008370 13125
•gga 0.0123763479839 12769
•ggc 0.0133436590723 13767
•ggg 0.0103942325773 10724
•ggt 0.0114206678905 11783
•gta 0.0123288547541 12720
•gtc 0.0096450026461 9951
•gtg 0.0117036887701 12075
•gtt 0.0180639045638 18637
•taa 0.0259671657010 26791
•tac 0.0123288547541 12720
•tag 0.0088434332371 9124
•tat 0.0221735228152 22877
•tca 0.0190302464026 19634
•tcc 0.0123763479839 12769
•tcg 0.0102721071292 10598
•tct 0.0141936909606 14644
•tga 0.0190302464026 19634
•tgc 0.0166342614221 17162
•tgg 0.0113256814309 11685
•tgt 0.0162746698251 16791
•tta 0.0259671657010 26791
•ttc 0.0180416118233 18614
•ttg 0.0181356290333 18711
•ttt 0.0374140033303 38601

Page 43: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

•;seq observed_freq occ
•aaaa 0.0149249217020 15297
•aaac 0.0061848126213 6339
•aaag 0.0062443288810 6400
•aaat 0.0099860478277 10235
•aaca 0.0059106475564 6058
•aacc 0.0039183163728 4016
•aacg 0.0044237167416 4534
•aact 0.0037729405911 3867
•aaga 0.0044959167943 4608
•aagc 0.0039875893963 4087
•aagg 0.0042851706946 4392
•aagt 0.0036529323954 3744
•aata 0.0082503195340 8456
•aatc 0.0052569443767 5388
•aatg 0.0058403988565 5986
•aatt 0.0083478871728 8556
•acaa 0.0055476959402 5686
•acac 0.0026733533022 2740
•acag 0.0038831920229 3980
•acat 0.0041339408545 4237
•acca 0.0031475320266 3226
•accc 0.0024606558497 2522
•accg 0.0027182344160 2786
•acct 0.0030333778892 3109
•acga 0.0026128613661 2678
•acgc 0.0039446596353 4043
•acgg 0.0025670045759 2631
•acgt 0.0026694505966 2736
•acta 0.0023845530914 2444
•actc 0.0027279911799 2796
•actg 0.0033348618930 3418
•actt 0.0036529323954 3744
•agaa 0.0047310548037 4849
•agac 0.0023933341789 2453
•agag 0.0031094806475 3187
•agat 0.0039202677256 4018
•agca 0.0040002731894 4100
•agcc 0.0028704399325 2942
•agcg 0.0036841540398 3776
•agct 0.0021835637556 2238
•agga 0.0036792756578 3771
•aggc 0.0037583054452 3852
•aggg 0.0028811723727 2953
•aggt 0.0030333778892 3109
•agta 0.0030324022128 3108
•agtc 0.0022216151347 2277
•agtg 0.0031114320002 3189
•agtt 0.0037729405911 3867
•ataa 0.0092581932425 9489
•atac 0.0033397402749 3423
•atag 0.0030704535920 3147
•atat 0.0065097128584 6672
•atca 0.0061487125950 6302
•atcc 0.0040002731894 4100
•atcg 0.0031348482335 3213
•atct 0.0039202677256 4018
•atga 0.0052588957295 5390
•atgc 0.0045973871386 4712
•atgg 0.0031836320529 3263
•atgt 0.0041339408545 4237
•atta 0.0072736674700 7455
•attc 0.0049993658103 5124
•attg 0.0053652444557 5499
•attt 0.0099860478277 10235
•caaa 0.0064882479779 6650
•caac 0.0038051379119 3900
•caag 0.0024791937010 2541
•caat 0.0053652444557 5499
•caca 0.0034704809109 3557
•cacc 0.0029543481018 3028
•cacg 0.0021738069917 2228
•cact 0.0031114320002 3189
•caga 0.0043485896598 4457
•cagc 0.0036441513079 3735
•cagg 0.0048276467661 4948
•cagt 0.0033348618930 3418
•cata 0.0038753866118 3972
•catc 0.0046481223108 4764
•catg 0.0027670182354 2836
•catt 0.0058403988565 5986
•ccaa 0.0022635692194 2320
•ccac 0.0023425990068 2401
•ccag 0.0035407296108 3629
•ccat 0.0031836320529 3263
•ccca 0.0019981852419 2048
•cccc 0.0026196911009 2685
•cccg 0.0028801966964 2952
•ccct 0.0028811723727 2953
•ccga 0.0022460070444 2302
•ccgc 0.0035026782317 3590
•ccgg 0.0040002731894 4100
•ccgt 0.0025670045759 2631
•ccta 0.0015962065702 1636
•cctc 0.0026206667772 2686
•cctg 0.0048276467661 4948
•cctt 0.0042851706946 4392
•cgaa 0.0031855834057 3265
•cgac 0.0022645448957 2321
•cgag 0.0016557228299 1697
•cgat 0.0031348482335 3213
•cgca 0.0042002868489 4305
•cgcc 0.0039397812534 4038
•cgcg 0.0029309318685 3004
•cgct 0.0036841540398 3776
•cgga 0.0030977725308 3175
•cggc 0.0036060999288 3696
•cggg 0.0028801966964 2952
•cggt 0.0027182344160 2786
•cgta 0.0027201857688 2788
•cgtc 0.0025152937274 2578
•cgtg 0.0021738069917 2228
•cgtt 0.0044237167416 4534
•ctaa 0.0030499643878 3126
•ctac 0.0022577151610 2314
•ctag 0.0004624706077 474
•ctat 0.0030704535920 3147
•ctca 0.0031904617876 3270
•ctcc 0.0029133696935 2986
•ctcg 0.0016557228299 1697
•ctct 0.0031094806475 3187
•ctga 0.0050071712214 5132
•ctgc 0.0036968378328 3789
•ctgg 0.0035407296108 3629
•ctgt 0.0038831920229 3980
•ctta 0.0044656708263 4577
•cttc 0.0032207077557 3301
•cttg 0.0024791937010 2541
•cttt 0.0062443288810 6400

•gaaa 0.0069331564107 7106

•gaac 0.0028587318158 2930

•gaag 0.0032207077557 3301

•gaat 0.0049993658103 5124

•gaca 0.0031611914960 3240

•gacc 0.0017367039700 1780

•gacg 0.0025152937274 2578

•gact 0.0022216151347 2277

•gaga 0.0032968105139 3379

•gagc 0.0022313718986 2287

•gagg 0.0026206667772 2686

•gagt 0.0027279911799 2796

•gata 0.0051232767116 5251

•gatc 0.0022206394583 2276

•gatg 0.0046481223108 4764

•gatt 0.0052569443767 5388

•gcaa 0.0055896500249 5729

•gcac 0.0027260398271 2794

•gcag 0.0036968378328 3789

•gcat 0.0045973871386 4712

•gcca 0.0035651215205 3654

•gccc 0.0024147990594 2475

•gccg 0.0036060999288 3696

•gcct 0.0037583054452 3852

•gcga 0.0033680348902 3452

•gcgc 0.0039378299006 4036

•gcgg 0.0035026782317 3590

•gcgt 0.0039446596353 4043

•gcta 0.0028333642298 2904

•gctc 0.0022313718986 2287

•gctg 0.0036441513079 3735

•gctt 0.0039875893963 4087


•ggaa 0.0039446596353 4043

•ggac 0.0014869308148 1524

•ggag 0.0029133696935 2986

•ggat 0.0040002731894 4100

•ggca 0.0042198003766 4325

•ggcc 0.0023182070971 2376

•ggcg 0.0039397812534 4038

•ggct 0.0028704399325 2942

•ggga 0.0029016615769 2974

•gggc 0.0024147990594 2475

•gggg 0.0026196911009 2685

•gggt 0.0024606558497 2522

•ggta 0.0028021425853 2872

•ggtc 0.0017367039700 1780

•ggtg 0.0029543481018 3028

•ggtt 0.0039183163728 4016

•gtaa 0.0050003414867 5125

•gtac 0.0017210931478 1764

•gtag 0.0022577151610 2314

•gtat 0.0033397402749 3423

•gtca 0.0034831647039 3570

•gtcc 0.0014869308148 1524

•gtcg 0.0022645448957 2321

•gtct 0.0023933341789 2453

•gtga 0.0040032002186 4103

•gtgc 0.0027260398271 2794

•gtgg 0.0023425990068 2401

•gtgt 0.0026733533022 2740

•gtta 0.0051515713268 5280

•gttc 0.0028587318158 2930

•gttg 0.0038051379119 3900

•gttt 0.0061848126213 6339

•taaa 0.0090659849941 9292

•taac 0.0051515713268 5280

•taag 0.0044656708263 4577

•taat 0.0072736674700 7455

•taca 0.0037748919438 3869

•tacc 0.0028021425853 2872

•tacg 0.0027201857688 2788

•tact 0.0030324022128 3108

•taga 0.0020537987960 2105

•tagc 0.0028333642298 2904

•tagg 0.0015962065702 1636

•tagt 0.0023845530914 2444

•tata 0.0049661928132 5090

•tatc 0.0051232767116 5251

•tatg 0.0038753866118 3972

•tatt 0.0082503195340 8456

•tcaa 0.0047778872704 4897

•tcac 0.0040032002186 4103

•tcag 0.0050071712214 5132

•tcat 0.0052588957295 5390

•tcca 0.0026372532758 2703

•tccc 0.0029016615769 2974

•tccg 0.0030977725308 3175

•tcct 0.0036792756578 3771

•tcga 0.0020567258252 2108

•tcgc 0.0033680348902 3452

•tcgg 0.0022460070444 2302

•tcgt 0.0026128613661 2678

•tcta 0.0020537987960 2105

•tctc 0.0032968105139 3379

•tctg 0.0043485896598 4457

•tctt 0.0044959167943 4608

•tgaa 0.0062082288547 6363
•tgac 0.0034831647039 3570
•tgag 0.0031904617876 3270
•tgat 0.0061487125950 6302
•tgca 0.0042441922863 4350
•tgcc 0.0042198003766 4325
•tgcg 0.0042002868489 4305
•tgct 0.0040002731894 4100
•tgga 0.0026372532758 2703
•tggc 0.0035651215205 3654

A table of k-mers grows rapidly out of proportion, or out of sight

How many k-mers in total, for all k?

• http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/ • RSA-tools - menu.htm

•tggg 0.0019981852419 2048

•tggt 0.0031475320266 3226

•tgta 0.0037748919438 3869

•tgtc 0.0031611914960 3240

•tgtg 0.0034704809109 3557

•tgtt 0.0059106475564 6058

•ttaa 0.0086815684974 8898

•ttac 0.0050003414867 5125

•ttag 0.0030499643878 3126

•ttat 0.0092581932425 9489

•ttca 0.0062082288547 6363

•ttcc 0.0039446596353 4043

•ttcg 0.0031855834057 3265

•ttct 0.0047310548037 4849

•ttga 0.0047778872704 4897

•ttgc 0.0055896500249 5729

•ttgg 0.0022635692194 2320

•ttgt 0.0055476959402 5686

•ttta 0.0090659849941 9292

•tttc 0.0069331564107 7106

•tttg 0.0064882479779 6650

•tttt 0.0149249217020 15297

Page 44: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

How many distinct substrings in a string of n symbols?

A: no more than n(n+1)/2, i.e. about (n x n)/2 (n ways to choose the beginning i, then n-i+1 ways to choose the end j)

[Figure: a string with positions 1 to n, with a substring starting at i and ending at j]
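The counting argument can be checked by brute force on short strings. The helper below is an illustrative sketch (its name is made up); it enumerates every (beginning, end) pair, so there are n(n+1)/2 pairs, and collapses repeated substrings with a set.

```python
def distinct_substrings(s):
    """Set of all distinct non-empty substrings, one candidate per (start, end) pair."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

for s in ["abc", "aaa", "abab"]:
    n = len(s)
    print(s, len(distinct_substrings(s)), "out of at most", n * (n + 1) // 2)
```

The bound is attained when all symbols differ and is far from tight for repetitive strings.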

Page 45: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

How many surprising substrings in a string of n symbols?

• Agree on a model for the source: e.g., the source emits symbols independently with identical distribution
• Agree on some measure of surprise, e.g., departure from expected number of occurrences exceeds a certain threshold
• For a given observed string of n symbols, how many substrings may turn out to be surprising?

A: possibly, all (n x n)/2 of them!
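Under the i.i.d. model named above, the expected number of occurrences of a word is the number of positions where it could start times the product of its symbol probabilities. The sketch below applies the threshold test from the slide; the function names, the uniform probabilities, and the threshold value are assumptions made for illustration.

```python
from math import prod

def expected_occurrences(word, n, probs):
    """Expected count of word in an i.i.d. text of length n."""
    return (n - len(word) + 1) * prod(probs[c] for c in word)

def observed_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(text.startswith(word, i) for i in range(len(text) - len(word) + 1))

def is_surprising(text, word, probs, threshold):
    """Flag the word when |observed - expected| exceeds the threshold."""
    expected = expected_occurrences(word, len(text), probs)
    return abs(observed_occurrences(text, word) - expected) > threshold

uniform = {c: 0.25 for c in "acgt"}   # assumed source distribution
text = "acgt" * 4
print(expected_occurrences("ac", len(text), uniform), observed_occurrences(text, "ac"))
print(is_surprising(text, "ac", uniform, threshold=2.0))
```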

Page 46: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Source Modeling by Probabilistic Finite State Automata

[Figure: three panels labeled Order-2 Markov Chain, Probabilistic Suffix Automaton, and Prob Suffix Tree; states carry binary contexts such as 0, 1, 00, 10, 11, transitions carry probabilities 0.25, 0.5 and 0.75, and tree nodes carry next-symbol distributions such as (0.75, 0.25), (0.25, 0.75) and (0.5, 0.5)]

Page 47: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Finding surprising substrings with mismatches

Input: a sequence or set of sequences, integers m and k
Output: all substrings of length m that occur unusually often, up to k mismatches, as a replica of the same pattern

How many patterns should one try?

• NOTE: the pattern might never occur exactly in the input
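The note above is easy to reproduce: a pattern can collect several approximate occurrences without ever appearing verbatim, which is why, in principle, every string of length m over the alphabet is a candidate. The sketch below uses Hamming distance as the mismatch count; the function names and the toy text are illustrative only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def approx_occurrences(text, pattern, k):
    """Start positions (0-based) where pattern matches text with at most k mismatches."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if hamming(text[i:i + m], pattern) <= k]

text = "acgtaggt"
print(approx_occurrences(text, "atgt", k=1))  # two approximate occurrences
print("atgt" in text)                          # yet no exact occurrence
print(4 ** 4)                                  # candidate patterns of length 4 over DNA
```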

Approximate Patterns

Page 48: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

From the Special Issue for the 50th Shannon Anniversary of IEEE Trans. IT

``Perhaps as a consequence of the fact that approximate matches abound whereas exact matches are unique, it is inherently much faster to look for an exact match than it is to search from a plethora of approximate matches looking for the best, or even nearly the best, among them. The right way to trade off search effort in a poorly understood environment against the degree to which the product of the search possesses desired criteria has long been a human enigma.''

T. Berger and J. D. Gibson, ``Lossy Source Coding,'' IEEE Trans. on Inform. Theory, vol. 44, No. 6, pp. 2693--2723, 1998.

Page 49: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Syntactic Motif: a recurring pattern with some solid characters and some characters that are a subset of the alphabet, or a ``don't care'' or ``gap''

PROBLEM
Input: textstring
Output: repeated motifs

T A G A G G T A G A T A G T

Motifs may be rigid or extensible (sometimes also called flexible)

[Figure: the textstring shown repeatedly, with labels marking solid characters and ``don't care'' characters]
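For rigid motifs, checking where a given motif occurs is a direct scan. The sketch below runs on the textstring from the slide; writing the don't care as `.` and the two sample motifs are choices made for this example, since the motifs drawn in the original figure did not survive extraction.

```python
def motif_occurrences(text, motif, wildcard="."):
    """0-based positions where a rigid motif matches; the wildcard matches any symbol."""
    m = len(motif)
    return [i for i in range(len(text) - m + 1)
            if all(p == wildcard or p == c for p, c in zip(motif, text[i:i + m]))]

text = "TAGAGGTAGATAGT"
print(motif_occurrences(text, "TAG."))  # solid TAG followed by one don't care
print(motif_occurrences(text, "G.T"))   # a don't care between two solid characters
```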

Page 50: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

From Syntax to Stat: Extracting a Profile Matrix & Consensus

(From Hertz-Stormo 99)

A A T T G A
A G G T C C
A G G A T G
A G G C G T

Alignment Matrix
A  4 1 0 1 0 1
C  0 0 0 1 1 1
G  0 3 3 0 2 1
T  0 0 1 2 1 1

A G G T G ? (Consensus - by majority rule)

n_{i,j} = times letter i is observed at jth position in alignment

N = number of sequences = 4

NOTE: While each sequence is a ``realization'' of the consensus, the consensus itself might not be any of the sequences
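The alignment matrix and the majority-rule consensus can be recomputed from the four sequences on the slide. This is a small sketch with made-up function names; a tied column is reported as `?`, following the slide.

```python
def alignment_matrix(seqs, alphabet="ACGT"):
    """n[i][j] = number of sequences showing letter i at position j."""
    length = len(seqs[0])
    return {a: [sum(s[j] == a for s in seqs) for j in range(length)] for a in alphabet}

def consensus(matrix):
    """Majority letter per column, '?' when the maximum is not unique."""
    length = len(next(iter(matrix.values())))
    letters = []
    for j in range(length):
        top = max(row[j] for row in matrix.values())
        winners = [a for a, row in matrix.items() if row[j] == top]
        letters.append(winners[0] if len(winners) == 1 else "?")
    return "".join(letters)

seqs = ["AATTGA", "AGGTCC", "AGGATG", "AGGCGT"]
matrix = alignment_matrix(seqs)
for letter, row in matrix.items():
    print(letter, row)
print(consensus(matrix))
```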

Page 51: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

From Syntax to Stat, continued: Computing Weight Matrix

A A T T G A
A G G T C C
A G G A T G
A G G C G T

Alignment Matrix
A  4 1 0 1 0 1
C  0 0 0 1 1 1
G  0 3 3 0 2 1
T  0 0 1 2 1 1

A G G T G ? (Consensus - by majority rule)

Compute ln[ ((n_{i,j} + p_i) / (N + 1)) / p_i ] ~ ln(f_{i,j} / p_i)

n_{i,j} = times letter i is observed at jth position in alignment

N = number of sequences = 4

p_i = a priori probability (.25 in example)

f_{i,j} = frequency of letter i at position j

This is like taking the ratio of the empirical frequencies (compensated by p_i to avoid infinity or zero) to the hypothetical probabilities of a flat distribution. It is a popular measure among statisticians of how much the observed distribution deviates from chance.
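Applying the formula to the counts of the alignment matrix is a one-line computation per entry. The sketch below hard-codes the counts from the slide and assumes p_i = 0.25 for every letter, as in the example; the function name `weight` is made up.

```python
from math import log

def weight(n_ij, N, p_i):
    """ln[ ((n_ij + p_i) / (N + 1)) / p_i ], the smoothed log-odds weight."""
    return log(((n_ij + p_i) / (N + 1)) / p_i)

counts = {
    "A": [4, 1, 0, 1, 0, 1],
    "C": [0, 0, 0, 1, 1, 1],
    "G": [0, 3, 3, 0, 2, 1],
    "T": [0, 0, 1, 2, 1, 1],
}
N, p = 4, 0.25
weights = {a: [round(weight(n, N, p), 2) for n in row] for a, row in counts.items()}
for letter, row in weights.items():
    print(letter, row)
```

Only five distinct counts occur (0 to 4), so the whole matrix is built from five distinct weights; a count of 1 gives weight 0 because (1 + 0.25) / 5 equals the prior 0.25.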

Page 52: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

From Syntax to Stat, continued: Weighing a Test Sequence

A A T T G A
A G G T C C
A G G A T G
A G G C G T

   Alignment Matrix    Weight Matrix, ln(f_{i,j} / p_i)
A  4 1 0 1 0 1          1.2   0    -1.6   0    -1.6   0
C  0 0 0 1 1 1         -1.6  -1.6  -1.6   0     0     0
G  0 3 3 0 2 1         -1.6   .96   .96  -1.6   .59   0
T  0 0 1 2 1 1         -1.6  -1.6   0     .59   0     0

A G G T G ? (consensus)
A G G T G C (test sequence)

ln[ ((n_{i,j} + p_i) / (N + 1)) / p_i ] ~ ln(f_{i,j} / p_i)

Page 53: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

From Syntax to Stat, continued: Weighing a Test Sequence

A A T T G A
A G G T C C
A G G A T G
A G G C G T

   Alignment Matrix    Weight Matrix, ln(f_{i,j} / p_i)
A  4 1 0 1 0 1          1.2   0    -1.6   0    -1.6   0
C  0 0 0 1 1 1         -1.6  -1.6  -1.6   0     0     0
G  0 3 3 0 2 1         -1.6   .96   .96  -1.6   .59   0
T  0 0 1 2 1 1         -1.6  -1.6   0     .59   0     0

A G G T G C (test sequence, score = 4.3)

ln[ ((n_{i,j} + p_i) / (N + 1)) / p_i ] ~ ln(f_{i,j} / p_i)
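The score of a test sequence is the sum of the weights picked out by its letters, one per position. The sketch below recomputes it from the counts on the slide, again assuming p_i = 0.25; the function names are illustrative.

```python
from math import log

COUNTS = {
    "A": [4, 1, 0, 1, 0, 1],
    "C": [0, 0, 0, 1, 1, 1],
    "G": [0, 3, 3, 0, 2, 1],
    "T": [0, 0, 1, 2, 1, 1],
}
N, P = 4, 0.25  # number of aligned sequences, a priori letter probability

def weight(n_ij):
    """Smoothed log-odds weight for a count n_ij."""
    return log(((n_ij + P) / (N + 1)) / P)

def score(sequence):
    """Sum of the weights of the letters observed at each position."""
    return sum(weight(COUNTS[letter][j]) for j, letter in enumerate(sequence))

print(round(score("AGGTGC"), 1))  # the test sequence from the slide
print(round(score("CCCAAA"), 1))  # a sequence unlike the alignment scores far lower
```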

Page 54: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

From Stat to Syntax: extracting a “full consensus” from sample
(daf-19 binding sites in C. elegans - Peter Swoboda)

GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC

[Figure labels: positions -150 to -1; genes osm-1, osm-6, daf-19, che-2, F02D8.3]

GTT__CATGGT_AC
GTT_CCATGG_AAC
G_T_CCATGG_AAC
GTT_CCAT_GTAAC
GTT_CCATGGTAAC

Now the model also describes GATCCCATCGGAAC, which did not belong to the data

Consensus at all costs generates monsters

Model: G_T__CAT_G__AC

Page 55: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Episodes and extensible motifs
Mannila et al., 95; Das et al., 97

Max 10 pos

Input: textstring and pattern string
Output: episode realization

(quadratic worst-case)
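A simple way to read the problem: an episode realization is a window of the text in which the pattern occurs as a subsequence, with the window no longer than a fixed span. The sketch below scans greedily from every possible start, which gives the quadratic worst case mentioned on the slide. The function name, the reading of "Max 10 pos" as a span limit, and the toy text are assumptions.

```python
def episode_realizations(text, pattern, max_span=10):
    """Windows (start, end), 0-based inclusive, where pattern occurs as a
    subsequence beginning at start and spanning at most max_span positions."""
    if not pattern:
        return []
    hits = []
    for start, ch in enumerate(text):
        if ch != pattern[0]:
            continue
        matched, end = 1, start
        # Greedy scan: taking the earliest match of each symbol yields the earliest end.
        for pos in range(start + 1, min(len(text), start + max_span)):
            if matched == len(pattern):
                break
            if text[pos] == pattern[matched]:
                matched, end = matched + 1, pos
        if matched == len(pattern):
            hits.append((start, end))
    return hits

print(episode_realizations("axbxxcabc", "abc", max_span=10))
print(episode_realizations("axbxxcabc", "abc", max_span=4))
```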

Page 56: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Extensible Motifs

Definition: Extensible Motifs are patterns which allow variable-length don’t cares

e.g., Prosite F…..G-(2,4)G.H

• Note that the length of these patterns is variable
• High expressive power
• Huge pattern space

Page 57: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

An Example from Prosite

Entry name: HIPIP

Accession number: PS00596

Description: High potential iron-sulfur proteins signature.

Pattern: C-(6,9)[LIVM]…G[YW]C..[FYW]

PDB 1PIJ, PDB 1HLQ

Page 58: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Extensible Motifs (Implications of Variable-Gaps)

s = axbcaxxbcaxxxbc
m = a-[1-3]bc at pos 1, 5 and 10

Main Issues

1) a location list corresponds to multiple patterns
   E.g. axbcpdaycbqd (at positions 1 and 7)
   m1 = a-[1-2]b-[1-2]d
   m2 = a-[1-2]c-[1-2]d
2) multiple occurrences at a location
   E.g. axbbxc (at position 1)
   m = a-[1-2]b-[1-2]c
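Both issues can be verified mechanically on the slide's examples. The sketch below parses the `-[lo-hi]` gap notation used above and, for each start position, lists every combination of gap lengths that realizes the motif; the function names are made up and the brute-force enumeration is only meant for short examples.

```python
import re
from itertools import product

def parse(motif):
    """Split 'a-[1-3]bc' into solid blocks ['a', 'bc'] and gap ranges [(1, 3)]."""
    gaps = [(int(lo), int(hi)) for lo, hi in re.findall(r"-\[(\d+)-(\d+)\]", motif)]
    blocks = re.split(r"-\[\d+-\d+\]", motif)
    return blocks, gaps

def realizations(text, motif):
    """Map each 1-based start position to the gap-length tuples that match there."""
    blocks, gaps = parse(motif)
    found = {}
    for start in range(len(text)):
        for lengths in product(*(range(lo, hi + 1) for lo, hi in gaps)):
            pos, ok = start, True
            for idx, block in enumerate(blocks):
                if text[pos:pos + len(block)] != block:
                    ok = False
                    break
                pos += len(block)
                if idx < len(lengths):
                    pos += lengths[idx]  # skip the don't cares of this gap
            if ok:
                found.setdefault(start + 1, []).append(lengths)
    return found

print(realizations("axbcaxxbcaxxxbc", "a-[1-3]bc"))
print(realizations("axbcpdaycbqd", "a-[1-2]b-[1-2]d"))
print(realizations("axbcpdaycbqd", "a-[1-2]c-[1-2]d"))
print(realizations("axbbxc", "a-[1-2]b-[1-2]c"))
```

The second and third calls return the same location list for two different motifs, and the last call returns two gap combinations at a single position.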

Page 59: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Summary

• Form and Information
• To Classify and Generate
• Of Free Lunches, Ugly Ducklings, and Little Green Men
• Privileging Syntactic Information
• Avoidable and Unavoidable Regularities
• Periods, Palindromes, Squares, etc.
• Theories Bigger than Life
• Motifs, Profiles and Weight Matrices
• The Emperor’s New Map

Page 60: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Detection and Analysis of Gene Regulatory Regions

(Jacques van Helden, http://copan.cifn.unam.mx/Computational_Biology/yeast-tools)

`` Starting from the simple knowledge that a set of genes share some regulatory behavior, one can suppose that some elements are shared by their upstream region, and one would like to detect such elements.

We implemented a simple and fast method to extract such elements, based on a detection of over-represented oligonucleotides.

J. Mol. Biol. (1998) 281, 827-842. ‘’

Page 61: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Over-represented sequences in the 800 bps upstream segments of two families of co-regulated genes in the yeast: superposition of circled words yields known motifs

TCACGTGAAAACTGTGG

TCCGCGGA

Page 62: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Question: how many of the 8-mers in a sequence 10^6 bases long could be surprisingly over-represented?

How many k-mers in total, for all k?

Page 63: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies

Name                      Last modified        Size
1nt_non-coding_Esche..>   24-Dec-2001 06:56    1k
2nt_non-coding_Esche..>   24-Dec-2001 06:56    1k
3nt_non-coding_Esche..>   24-Dec-2001 06:56    2k
4nt_non-coding_Esche..>   24-Dec-2001 06:56    7k
5nt_non-coding_Esche..>   24-Dec-2001 06:56    26k
6nt_non-coding_Esche..>   24-Dec-2001 06:56    108k
7nt_non-coding_Esche..>   24-Dec-2001 06:56    434k
8nt_non-coding_Esche..>   24-Dec-2001 06:57    1.7M
dyads_3nt_sp0-20_non..>   24-Dec-2001 07:11    2.9M

http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/


Page 65: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Theories bigger than Life: Assume we wanted to build a statistical table counting occurrences of all surprising substrings in a genome

Q: How many distinct substrings in a string of n symbols?

A: no more than n(n+1)/2, i.e. about (n x n)/2 (n ways to choose the beginning i, then n-i+1 ways to choose the end j)

[Figure: a string with positions 1 to n, with a substring starting at i and ending at j]

Page 66: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Theories bigger than Life: How many surprising substrings in a string of n symbols?

• Agree on a model for the source: e.g., the source emits symbols independently with identical distribution
• Agree on some measure of surprise, e.g., departure from expected number of occurrences exceeds a certain threshold
• For a given observed string of n symbols, how many substrings may turn out to be surprising?

A: possibly, all (n x n)/2 of them!

Page 67: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Z-scores as measures of surprise

Page 68: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Three easy conditions on surprise

• 1) always:

• 2) for absent words: (note asymmetry of surprise)

• 3) for over-represented words: (longer word = bigger surprise)

From 1-3 together

Page 69: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Monotony of Surprise

A score such that:

will be called monotone

Page 70: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Main point

For many monotone scores

where

``surprising''

Page 71: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

DAWGs: the set of words reaching a node is a burst of consecutive suffixes of the same word

• Each state corresponds to a set of strings: the set of all strings that have occurrences ending at precisely the same positions in x

• The sequence of labels on each distinct path from source to sink spells a suffix of x

• |x| < Q < 2|x| - 1 and |x| - 1 < E < 3|x| - 3, where Q is the number of states and E the number of edges

[Figure: DAWG of an example string over the alphabet {A, T}]
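The linear size of the DAWG can be observed directly. The sketch below uses the standard online suffix-automaton construction, which is one common way to build this structure and is not claimed to be the construction used in the talk. The names `build_dawg` and `class_sizes` are assumptions. Each non-initial state stands for one class of words sharing the same end positions, and `length[v] - length[link[v]]` is the number of consecutive suffixes in that class.

```python
def build_dawg(x):
    """Build the DAWG (suffix automaton) of x.

    Returns (trans, link, length): outgoing transitions per state, the suffix
    link of each state, and the length of the longest word reaching it.
    """
    trans, link, length = [{}], [-1], [0]
    last = 0
    for ch in x:
        cur = len(trans)
        trans.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q: the clone keeps the shorter words of the class.
                clone = len(trans)
                trans.append(dict(trans[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans, link, length


def class_sizes(link, length):
    """Number of words (consecutive suffixes of the longest one) in each class."""
    return [length[v] - length[link[v]] for v in range(1, len(length))]


if __name__ == "__main__":
    x = "abaababaabaababaababa"
    trans, link, length = build_dawg(x)
    print("states:", len(trans), "edges:", sum(len(t) for t in trans))
    print("distinct substrings:", sum(class_sizes(link, length)))
```

For |x| of at least 3 the number of states stays within 2|x| - 1 and the number of edges within 3|x| - 4, while the class sizes add up to the full, possibly quadratic, number of distinct substrings.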

Page 72: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

DAWGs

With monotone scores, it suffices to publish scores only at the longest word in each one of the O(n) equivalence classes

(Often, however, we still need to compute all O(n^2) scores)

Page 73: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

The Size of Tables for Substring Statistics

Page 74: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Substring Statistics with Suffix Trees

A partial view (all suffixes starting with ``a'') of the weighted suffix tree for the string x = abaababaabaababaababa: the weight of each internal node reports the number of (possibly overlapping) occurrences in x of the substring having locus at that node.

• 1 Counts do not change along an arc
• 2 If aw ends at a node so does w (suffix links)

``The Myriad Virtues of Suffix Trees'', A. Apostolico, in Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, eds., Springer, 1985
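The first property can be seen without building the tree. The brute-force sketch below counts possibly overlapping occurrences in the example string; the helper name `count_occurrences` is an assumption. As long as a word is always followed by the same symbol, extending it does not change the count, which is what it means for counts to stay constant along an arc.

```python
def count_occurrences(x, w):
    """Count the (possibly overlapping) occurrences of w in x."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x.startswith(w, i))


if __name__ == "__main__":
    x = "abaababaabaababaababa"
    for w in ("aa", "aab", "aaba", "aabab", "aabaa"):
        print(w, count_occurrences(x, w))
```

Here `aa`, `aab` and `aaba` share one count, and the count splits only where the continuations differ (`aabab` versus `aabaa`), that is, at a branching node.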

Page 75: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Detecting Squares with Suffix Trees

There is a square iff there is a node with two consecutive leaves in its subtree too close for comfort: 14 - 12 = 2 < 3 = |aba|

(A. Apostolico & F. P. Preparata, 1983)
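The ``too close for comfort'' criterion can be illustrated without a suffix tree. In the brute-force sketch below (function names are assumptions, positions are 0-based), two occurrences of a word w at distance d no larger than |w| force the 2d symbols starting at the first occurrence to form a square.

```python
def occurrences(x, w):
    """Sorted 0-based starting positions of w in x."""
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def squares_from_close_occurrences(x, w):
    """List (position, period, square) for consecutive occurrences of w
    whose distance d satisfies d <= len(w)."""
    occ = occurrences(x, w)
    found = []
    for i, j in zip(occ, occ[1:]):
        d = j - i
        if d <= len(w):
            found.append((i, d, x[i:i + 2 * d]))
    return found


if __name__ == "__main__":
    x = "abaababaabaababaababa"
    for pos, d, sq in squares_from_close_occurrences(x, "aba"):
        print(pos + 1, d, sq)  # printed with 1-based positions, as on the slide
```

The pair of occurrences at 1-based positions 12 and 14 from the slide yields the square `abab` of period 2.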

Page 76: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Combining Saturation and Monotony of Scores over ST Arcs yields Surprising Solid Words in Linear Time and Space

Verbumculus (AA, Bock, Gong, Lonardi, Xu, JCB2000, JCB2003, Recomb 2003, ..)

Based on Suffix tree and iid

Partitions the O(n^2) substrings into O(n) ``equivalence classes of monotone score'', then computes expected frequencies, variances and scores for the most surprising word in each class in time O(n) overall.

For any word v without a score, there is a scored extension vy which is at least equally surprising.

Page 77: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Z-scores and measures of surprise

Page 78: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Main point

For any measure of surprise

where

and conditions 1-3 are satisfied:

``surprising''

Page 79: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Exercise: i.i.d. variables

We are interested in the expected number of occurrences of y in X, and the corresponding variance.
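For the expectation part of this exercise, linearity gives a closed form: in a text of n i.i.d. symbols, a word y of length m has n - m + 1 possible starting positions, each matching with probability equal to the product of the symbol probabilities of y. The sketch below computes only this expectation; the variance, which depends on how y overlaps with itself, is not attempted. Function names and the choice of estimating probabilities from the text itself are assumptions.

```python
from collections import Counter
from math import prod


def symbol_probabilities(x):
    """Estimate symbol probabilities by their relative frequencies in x."""
    counts = Counter(x)
    return {c: k / len(x) for c, k in counts.items()}


def expected_occurrences(n, y, p):
    """Expected number of occurrences of y in n i.i.d. symbols drawn from p."""
    m = len(y)
    if m > n:
        return 0.0
    return (n - m + 1) * prod(p.get(c, 0.0) for c in y)


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "acgt"}
    print(expected_occurrences(100, "ac", uniform))
```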

Page 80: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Over- and Under-represented words: Z-Scores

!@#&!!$!!

Page 81: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Under the Hood: Periods and Variance

Page 82: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

;seq  observed_freq    occ
aaaa  0.0149249217020  15297
aaac  0.0061848126213  6339
aaag  0.0062443288810  6400
aaat  0.0099860478277  10235
aaca  0.0059106475564  6058
aacc  0.0039183163728  4016
aacg  0.0044237167416  4534
aact  0.0037729405911  3867
aaga  0.0044959167943  4608
aagc  0.0039875893963  4087
aagg  0.0042851706946  4392
aagt  0.0036529323954  3744
aata  0.0082503195340  8456
aatc  0.0052569443767  5388
aatg  0.0058403988565  5986
aatt  0.0083478871728  8556
acaa  0.0055476959402  5686
acac  0.0026733533022  2740
acag  0.0038831920229  3980
acat  0.0041339408545  4237
acca  0.0031475320266  3226
accc  0.0024606558497  2522
accg  0.0027182344160  2786
acct  0.0030333778892  3109
acga  0.0026128613661  2678
acgc  0.0039446596353  4043
acgg  0.0025670045759  2631
acgt  0.0026694505966  2736
acta  0.0023845530914  2444
actc  0.0027279911799  2796
actg  0.0033348618930  3418
actt  0.0036529323954  3744
agaa  0.0047310548037  4849
agac  0.0023933341789  2453
agag  0.0031094806475  3187
agat  0.0039202677256  4018
agca  0.0040002731894  4100
agcc  0.0028704399325  2942
agcg  0.0036841540398  3776
agct  0.0021835637556  2238
agga  0.0036792756578  3771
aggc  0.0037583054452  3852
aggg  0.0028811723727  2953
aggt  0.0030333778892  3109
agta  0.0030324022128  3108
agtc  0.0022216151347  2277
agtg  0.0031114320002  3189
agtt  0.0037729405911  3867
ataa  0.0092581932425  9489
atac  0.0033397402749  3423
atag  0.0030704535920  3147
atat  0.0065097128584  6672
atca  0.0061487125950  6302
atcc  0.0040002731894  4100
atcg  0.0031348482335  3213
atct  0.0039202677256  4018
atga  0.0052588957295  5390
atgc  0.0045973871386  4712
atgg  0.0031836320529  3263
atgt  0.0041339408545  4237
atta  0.0072736674700  7455
attc  0.0049993658103  5124
attg  0.0053652444557  5499
attt  0.0099860478277  10235
caaa  0.0064882479779  6650
caac  0.0038051379119  3900
caag  0.0024791937010  2541
caat  0.0053652444557  5499
caca  0.0034704809109  3557
cacc  0.0029543481018  3028
cacg  0.0021738069917  2228
cact  0.0031114320002  3189
caga  0.0043485896598  4457
cagc  0.0036441513079  3735
cagg  0.0048276467661  4948
cagt  0.0033348618930  3418
cata  0.0038753866118  3972
catc  0.0046481223108  4764
catg  0.0027670182354  2836
catt  0.0058403988565  5986
ccaa  0.0022635692194  2320
ccac  0.0023425990068  2401
ccag  0.0035407296108  3629
ccat  0.0031836320529  3263
ccca  0.0019981852419  2048
cccc  0.0026196911009  2685
cccg  0.0028801966964  2952
ccct  0.0028811723727  2953
ccga  0.0022460070444  2302
ccgc  0.0035026782317  3590
ccgg  0.0040002731894  4100
ccgt  0.0025670045759  2631
ccta  0.0015962065702  1636
cctc  0.0026206667772  2686
cctg  0.0048276467661  4948
cctt  0.0042851706946  4392
cgaa  0.0031855834057  3265
cgac  0.0022645448957  2321
cgag  0.0016557228299  1697
cgat  0.0031348482335  3213
cgca  0.0042002868489  4305
cgcc  0.0039397812534  4038
cgcg  0.0029309318685  3004
cgct  0.0036841540398  3776
cgga  0.0030977725308  3175
cggc  0.0036060999288  3696
cggg  0.0028801966964  2952
cggt  0.0027182344160  2786
cgta  0.0027201857688  2788
cgtc  0.0025152937274  2578
cgtg  0.0021738069917  2228
cgtt  0.0044237167416  4534
ctaa  0.0030499643878  3126
ctac  0.0022577151610  2314
ctag  0.0004624706077  474
ctat  0.0030704535920  3147
ctca  0.0031904617876  3270
ctcc  0.0029133696935  2986
ctcg  0.0016557228299  1697
ctct  0.0031094806475  3187
ctga  0.0050071712214  5132
ctgc  0.0036968378328  3789
ctgg  0.0035407296108  3629
ctgt  0.0038831920229  3980
ctta  0.0044656708263  4577
cttc  0.0032207077557  3301
cttg  0.0024791937010  2541
cttt  0.0062443288810  6400
gaaa  0.0069331564107  7106
gaac  0.0028587318158  2930
gaag  0.0032207077557  3301
gaat  0.0049993658103  5124
gaca  0.0031611914960  3240
gacc  0.0017367039700  1780
gacg  0.0025152937274  2578
gact  0.0022216151347  2277
gaga  0.0032968105139  3379
gagc  0.0022313718986  2287
gagg  0.0026206667772  2686
gagt  0.0027279911799  2796
gata  0.0051232767116  5251
gatc  0.0022206394583  2276
gatg  0.0046481223108  4764
gatt  0.0052569443767  5388
gcaa  0.0055896500249  5729
gcac  0.0027260398271  2794
gcag  0.0036968378328  3789
gcat  0.0045973871386  4712
gcca  0.0035651215205  3654
gccc  0.0024147990594  2475
gccg  0.0036060999288  3696
gcct  0.0037583054452  3852
gcga  0.0033680348902  3452
gcgc  0.0039378299006  4036
gcgg  0.0035026782317  3590
gcgt  0.0039446596353  4043
gcta  0.0028333642298  2904
gctc  0.0022313718986  2287
gctg  0.0036441513079  3735
gctt  0.0039875893963  4087
ggaa  0.0039446596353  4043
ggac  0.0014869308148  1524
ggag  0.0029133696935  2986
ggat  0.0040002731894  4100
ggca  0.0042198003766  4325
ggcc  0.0023182070971  2376
ggcg  0.0039397812534  4038
ggct  0.0028704399325  2942
ggga  0.0029016615769  2974
gggc  0.0024147990594  2475
gggg  0.0026196911009  2685
gggt  0.0024606558497  2522
ggta  0.0028021425853  2872
ggtc  0.0017367039700  1780
ggtg  0.0029543481018  3028
ggtt  0.0039183163728  4016
gtaa  0.0050003414867  5125
gtac  0.0017210931478  1764
gtag  0.0022577151610  2314
gtat  0.0033397402749  3423
gtca  0.0034831647039  3570
gtcc  0.0014869308148  1524
gtcg  0.0022645448957  2321
gtct  0.0023933341789  2453
gtga  0.0040032002186  4103
gtgc  0.0027260398271  2794
gtgg  0.0023425990068  2401
gtgt  0.0026733533022  2740
gtta  0.0051515713268  5280
gttc  0.0028587318158  2930
gttg  0.0038051379119  3900
gttt  0.0061848126213  6339
taaa  0.0090659849941  9292
taac  0.0051515713268  5280
taag  0.0044656708263  4577
taat  0.0072736674700  7455
taca  0.0037748919438  3869
tacc  0.0028021425853  2872
tacg  0.0027201857688  2788
tact  0.0030324022128  3108
taga  0.0020537987960  2105
tagc  0.0028333642298  2904
tagg  0.0015962065702  1636
tagt  0.0023845530914  2444
tata  0.0049661928132  5090
tatc  0.0051232767116  5251
tatg  0.0038753866118  3972
tatt  0.0082503195340  8456
tcaa  0.0047778872704  4897
tcac  0.0040032002186  4103
tcag  0.0050071712214  5132
tcat  0.0052588957295  5390
tcca  0.0026372532758  2703
tccc  0.0029016615769  2974
tccg  0.0030977725308  3175
tcct  0.0036792756578  3771
tcga  0.0020567258252  2108
tcgc  0.0033680348902  3452
tcgg  0.0022460070444  2302
tcgt  0.0026128613661  2678
tcta  0.0020537987960  2105
tctc  0.0032968105139  3379
tctg  0.0043485896598  4457
tctt  0.0044959167943  4608
tgaa  0.0062082288547  6363
tgac  0.0034831647039  3570
tgag  0.0031904617876  3270
tgat  0.0061487125950  6302
tgca  0.0042441922863  4350
tgcc  0.0042198003766  4325
tgcg  0.0042002868489  4305
tgct  0.0040002731894  4100
tgga  0.0026372532758  2703
tggc  0.0035651215205  3654
tggg  0.0019981852419  2048
tggt  0.0031475320266  3226
tgta  0.0037748919438  3869
tgtc  0.0031611914960  3240
tgtg  0.0034704809109  3557
tgtt  0.0059106475564  6058
ttaa  0.0086815684974  8898
ttac  0.0050003414867  5125
ttag  0.0030499643878  3126
ttat  0.0092581932425  9489
ttca  0.0062082288547  6363
ttcc  0.0039446596353  4043
ttcg  0.0031855834057  3265
ttct  0.0047310548037  4849
ttga  0.0047778872704  4897
ttgc  0.0055896500249  5729
ttgg  0.0022635692194  2320
ttgt  0.0055476959402  5686
ttta  0.0090659849941  9292
tttc  0.0069331564107  7106
tttg  0.0064882479779  6650
tttt  0.0149249217020  15297

A table of k-mers grows rapidly out of proportions or out of sight

How many k-mers in total, for all k?

• http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
• RSA-tools - menu.htm
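A table of this shape is easy to produce for one value of k, and the question of how many k-mers there are over all k can then be answered by brute force on a toy input. The sketch below is an illustration only: the function names are assumptions, and the frequency is taken to be the count divided by the number of k-mer positions, which may differ from the normalization used by RSA-tools.

```python
from collections import Counter


def kmer_table(x, k):
    """Rows (k-mer, observed frequency, occurrences), sorted by k-mer."""
    total = len(x) - k + 1
    counts = Counter(x[i:i + k] for i in range(total))
    return [(w, counts[w] / total, counts[w]) for w in sorted(counts)]


def distinct_kmers_per_k(x):
    """Number of distinct k-mers observed in x, for every k from 1 to len(x)."""
    n = len(x)
    return {k: len({x[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)}


if __name__ == "__main__":
    x = "acgtacgt"
    for row in kmer_table(x, 4):
        print(*row)
    sizes = distinct_kmers_per_k(x)
    print(sizes, "total rows over all k:", sum(sizes.values()))
```

Summing the per-k table sizes gives the number of distinct substrings of the input, which is the quantity that can grow quadratically with its length.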

Page 83: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Verbumculus + Dot on first 512 bps of Yeast Mitochondrial DNA

Page 84: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Counting occurrences of gagga in HSV1

Page 85: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Alternate Counting

Page 86: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Counting occurrences of ccgct in HSV1

Page 87: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Alberto Apostolico – Cinzia Pizzi – Giorgio Satta

Dyads Detection in Biology

Part of Speech Tagging in NLP

Although preliminary findings were reported more than a year ago, the latest results appear…

Dyads are the composition of two solid components separated by a variable gap

IN JJ NNS VBD VBN RBR IN DT NN IN , DT JJS NNS VBP...

Automatic Tagging
• Set of correctly classified examples
• Infer rules
• Classify new texts

Drawback: ambiguity
• Limited size context centered on a word can fail to give a unique tag assignment

Goal: efficient counting of subword co-occurrences within distance d, with no interleaving occurrences of one or the other

ACCGTAAG

Possible Solution: Barriers

[Figure: TEXT + CORRECT CLASSIFICATION = Rules (TAGGING); barriers B1 and B2 resolve an ambiguous NN/JJ tag to NN or JJ (DISAMBIGUATION)]

Page 88: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Notation

• X is a string of n symbols over an alphabet
• d is a fixed non-negative integer
• y and z are subwords of X
• Tandem Index I(y,z) is the number of times that z has a closest occurrence within a distance d from a corresponding closest occurrence of y to its left
• Relaxed Tandem Index Î(y,z): all the occurrences of z within distance d are counted

Goal: efficient counting of subword co-occurrences within distance d, with no interleaving occurrences of one or the other

Key Observation

In principle there are O(n^2) substrings in x, and thus O(n^4) distinct pairs of substrings; however, it suffices to consider a family containing only O(n^2) pairs. Then, for any neglected pair (y’,z’) there is a pair (y,z) in the family such that: (i) y’ and z’ are prefixes of y and z respectively, and (ii) the tandem index of (y’,z’) equals the tandem index of (y,z).

Result: O(n^2) algorithm for building a tandem index table (previous results: O(n^3) [Arimura et al., Wang et al.], in the case of two words, from a generalized version of the problem)
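The two indices can be made concrete with a brute-force sketch for a single pair (y, z). This is not the quadratic table-building algorithm mentioned above, and it fixes details the slide leaves open: distance is measured between starting positions, z must start strictly to the right of y, and a pair counts for the tandem index only when no other occurrence of y or z starts in between. All names are assumptions.

```python
def occurrences(x, w):
    """Sorted 0-based starting positions of w in x."""
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def relaxed_tandem_index(x, y, z, d):
    """Count all pairs (i, j): y starts at i, z starts at j, 0 < j - i <= d."""
    ys, zs = occurrences(x, y), occurrences(x, z)
    return sum(1 for i in ys for j in zs if 0 < j - i <= d)


def tandem_index(x, y, z, d):
    """Count closest pairs only: z is the first z after y, within distance d,
    and no other occurrence of y starts between the two."""
    ys, zs = occurrences(x, y), occurrences(x, z)
    count = 0
    for i in ys:
        later = [j for j in zs if j > i]
        if not later:
            continue
        j = later[0]  # zs is sorted, so this is the closest z to the right
        if j - i <= d and not any(i < k < j for k in ys):
            count += 1
    return count


if __name__ == "__main__":
    print(tandem_index("aabab", "a", "b", 3), relaxed_tandem_index("aabab", "a", "b", 3))
```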

Page 89: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Towards a theory of saturated motifs: here a motif is a recurring pattern with some solid and some ``don't care'' characters together with its set of occurrences

PROBLEM
Input: textstring
Output: repeated motifs

[Figure: occurrences of a motif in a textstring beginning T A G A G G T A G A T A, with solid characters and ``don't care'' characters marked]

Is motif discovery still beset by the circumstance that typically there are exponentially many candidate motifs in a sequence?

Page 90: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Controlling Motif Growth: Irredundant Motifs (L. Parida)

A motif is
• maximal in composition if specifying more solid characters implies an alteration to its occurrence list
• maximal in length if making the motif longer implies an alteration to the cardinality or displacement of its occurrence list

A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant

Page 91: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Maximal, Redundant, Irredundant Motifs (examples, cont.)

Let s = aaXtaYgZZZaaVtaWcXXXXaaYtgXc

m_1 = aa . t with L_1 = {1, 11, 22}
m_2 = aa . ta with L_2 = {1, 11}
m_3 = aa . t . . c with L_3 = {11, 22}

m_1 = aa . t is redundant, since 1) m_1 is a sub-motif of m_2 and of m_3 and 2) L_1 is the union of L_2 and L_3.
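The occurrence lists in this example can be recomputed mechanically. In the sketch below a motif is written without spaces and `.` stands for a don't care; the function name `motif_occurrences` is an assumption, and positions are 1-based as on the slide. The third motif is written with two don't cares, `aa.t..c`, because that is the pattern that actually occurs at positions 11 and 22 of s.

```python
def motif_occurrences(s, motif, dont_care="."):
    """1-based positions where the motif matches s; dont_care matches any symbol."""
    m = len(motif)
    return [
        i + 1
        for i in range(len(s) - m + 1)
        if all(c == dont_care or c == s[i + k] for k, c in enumerate(motif))
    ]


if __name__ == "__main__":
    s = "aaXtaYgZZZaaVtaWcXXXXaaYtgXc"
    l1 = motif_occurrences(s, "aa.t")
    l2 = motif_occurrences(s, "aa.ta")
    l3 = motif_occurrences(s, "aa.t..c")
    print(l1, l2, l3, sorted(set(l2) | set(l3)) == l1)
```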

Page 92: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Controlling Motif Growth: HOW MANY Irredundant Motifs

Recall that a motif is
• maximal in composition if specifying more solid characters implies an alteration to its occurrence list
• maximal in length if making it longer implies an alteration to the cardinality of its occurrence list

A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant

A motif that occurs at least k times in the textstring is a k-motif

Theorem: In any textstring x the number of irredundant 2-motifs is O(|x|)

(PROBLEM: How to find irredundant motifs as fast as possible)

Page 93: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Suffix Consensus, Suffix Meet

[Figure: s = suf1 aligned against suf4, for a string s over the symbols a, b, c]

The consensus of suf1 and suf4 is not a motif

The meet of suf1 and suf4 is a maximal motif

Theorem: Every irredundant 2-motif of x is the meet of two suffixes of x
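A small sketch may help separate the two notions. It assumes the usual definitions, which the slide does not spell out: the consensus of two strings keeps a symbol where they agree and puts a don't care elsewhere, over the length of the shorter string, and the meet is the consensus with leading and trailing don't cares removed. The example string and the function names are hypothetical.

```python
def consensus(u, v, dont_care="."):
    """Position-wise agreement of u and v over the length of the shorter one."""
    return "".join(a if a == b else dont_care for a, b in zip(u, v))


def meet(u, v, dont_care="."):
    """Consensus with leading and trailing don't cares stripped."""
    return consensus(u, v, dont_care).strip(dont_care)


if __name__ == "__main__":
    s = "abaacab"            # hypothetical example, not the string on the slide
    suf1, suf4 = s, s[3:]    # suffixes starting at 1-based positions 1 and 4
    print(consensus(suf1, suf4), meet(suf1, suf4))
```

In this toy case the consensus ends with a don't care, so it is not a motif as such, while the meet begins and ends with solid characters.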

Page 94: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Finding surprising substrings with mismatches

Input: a sequence or set of sequences, integers m and k
Output: all substrings of length m that occur unusually often as a replica of the same pattern with up to k mismatches

How many patterns should one try?

• NOTE: the pattern might never occur exactly in the input

[Figure: Approximate Patterns]
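Counting replicas of a pattern with up to k mismatches can be sketched by brute force, sliding a window over the text and comparing Hamming distances. The function names are assumptions, and this quadratic scan is only meant to make the problem statement concrete, including the remark that the pattern itself may never occur exactly.

```python
def hamming(u, v):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def approximate_count(x, w, k):
    """Number of windows of x within Hamming distance k of the pattern w."""
    m = len(w)
    return sum(1 for i in range(len(x) - m + 1) if hamming(x[i:i + m], w) <= k)


if __name__ == "__main__":
    x = "acgtacct"
    print(approximate_count(x, "acga", 0), approximate_count(x, "acga", 1))
```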

Page 95: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

Problem Statement

Given a source text X and an error threshold k, extract substrings of X that occur unusually often in X within k substitutions or mismatches.

Measure of Surprise: compare counts with expectations

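[Formulas: scores of surprise for a word w, built from its observed count F_w, its expected count E_w and a normalizing term N_w]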

Page 96: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

SubProblem: Compute Expected Frequencies under I.I.D. Distribution

Two results for expected frequencies

O(nk) preprocessing of text, then report expected frequency for any substring in O(k^2)

Report expected frequency of all substrings of a given length in O(nk)
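For a single word the expectation itself is easy to write down, even though this is not the preprocessing scheme with the bounds quoted above. Under an i.i.d. source, position i of a random window matches the i-th symbol of w with probability p(w_i), so the chance of at most k mismatches follows from a simple dynamic program over the number of mismatches seen so far, in O(|w| k) time per word. The function names and the dictionary `p` of symbol probabilities are assumptions.

```python
def prob_within_k(w, p, k):
    """Probability that an i.i.d. window of length len(w) differs from w
    in at most k positions, given symbol probabilities p."""
    dist = [1.0] + [0.0] * k  # dist[j]: probability of exactly j mismatches so far
    for c in w:
        q = p.get(c, 0.0)     # probability that this position matches
        new = [0.0] * (k + 1)
        for j in range(k + 1):
            new[j] += dist[j] * q
            if j + 1 <= k:
                new[j + 1] += dist[j] * (1.0 - q)
        dist = new
    return sum(dist)


def expected_approximate_occurrences(n, w, p, k):
    """Expected number of windows within k mismatches of w in n i.i.d. symbols."""
    return max(n - len(w) + 1, 0) * prob_within_k(w, p, k)


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "acgt"}
    for k in range(3):
        print(k, prob_within_k("ac", uniform, k))
```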

Page 97: Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery

JACM 50, 1, January 2003, pp. 25-26
Special Issue: Problems for the Next 50 Years

page 1 paper 1 problem 1

``Shannon and Weaver performed an inestimable service by giving us a definition of information and a metric for information as communicated from place to place.

We have no theory however that gives us a metric for the information embodied in structure...

...this is the most fundamental gap in the theoretical underpinning of information and computer science.

A young information theory scholar willing to spend years on a deeply fundamental problem need look no further.''

Frederick P. Brooks, Jr.
The Great Challenges for Half Century Old Computer Science