Suffix Structures and Circular Pattern Problems

Graduate Theses, Dissertations, and Problem Reports

2011

Suffix Structures and Circular Pattern Problems

Jie Lin West Virginia University

Follow this and additional works at: https://researchrepository.wvu.edu/etd

Recommended Citation: Lin, Jie, "Suffix Structures and Circular Pattern Problems" (2011). Graduate Theses, Dissertations, and Problem Reports. 3402. https://researchrepository.wvu.edu/etd/3402

This Dissertation is protected by copyright and/or related rights. It has been brought to you by The Research Repository @ WVU with permission from the rights-holder(s). You are free to use this Dissertation in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you must obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/or on the work itself. This Dissertation has been accepted for inclusion in WVU Graduate Theses, Dissertations, and Problem Reports collection by an authorized administrator of The Research Repository @ WVU. For more information, please contact [email protected].


Suffix Structures and Circular Pattern Problems

Jie Lin

Dissertation submitted to the College of Engineering and Mineral Resources

at West Virginia University in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy in

Computer and Information Sciences

Dr. Donald Adjeroh, Ph.D., Chair

Dr. Elaine M. Eschen, Ph.D.

Dr. Arun Ross, Ph.D.

Dr. James Harner, Ph.D.

Dr. Cun-Quan Zhang, Ph.D.

Lane Department of Computer Science and Electrical Engineering

Morgantown, West Virginia, 2011

Keywords: Suffix Array, Suffix Tree, Pattern Matching, Text Mining, Probabilistic Suffix Trees, Probabilistic Suffix Arrays, Markov Models, Space Efficiency, Circular Patterns, Multidomain Proteins, Circular Pattern Discovery

Copyright © 2011 Jie Lin

ABSTRACT

The suffix tree is a data structure used to represent all the suffixes in a string. However, a major problem with the suffix tree is its practical space requirement. In this dissertation, we propose an efficient data structure – the virtual suffix tree (VST) – which requires less space than other recently proposed data structures for suffix trees and suffix arrays. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form, where n is the length of the sequence.

Markov models are very popular for modeling complex sequences. In this dissertation, we present the probabilistic suffix array (PSA), a space-efficient alternative to the probabilistic suffix tree (PST) used to represent Markov models. The PSA provides all the capabilities of the PST, such as learning and prediction, and maintains the same linear time construction (linearity with respect to sequence length). The PSA, however, has a significantly smaller memory requirement than the PST, both at the construction stage and at the time of usage.

Using the proposed suffix data structures, we study the circular pattern matching (CPM) problem. We provide a linear time, linear space algorithm to solve the exact circular pattern matching problem. We then present four algorithms to address the approximate circular pattern matching (ACPM) problem. Our bidirectional ACPM algorithm provides the best time complexity when compared with other algorithms proposed in the literature. Further, we define the circular pattern discovery (CPD) problem and present algorithms to solve this problem. Using the proposed circular pattern matching algorithms, we perform experiments on computational analysis and function prediction for multidomain proteins.

Acknowledgement

I would like to thank my advisor, Dr. Don Adjeroh, for his guidance, advice, and continued encouragement. It has been a pleasure to work under his supervision. Without him, this dissertation could not have come about.

I would also like to thank my other committee members: Dr. Elaine Eschen, Dr. Arun Ross, Dr. James Harner, and Dr. Cun-Quan Zhang for their help during my studies.

And finally, I thank my family members for their constant support, encouragement, and help.

The work reported in this thesis was partly supported by a DOE CAREER award (No: DE-FG02-02ER25541), an NSF ITR award (No: 0312484), and a WV-EPSCoR RCG grant.


Contents

1 Introduction
    1.1 Overview
    1.2 The Problem
        1.2.1 Suffix Tree
        1.2.2 Markov Models and Probabilistic Suffix Tree
        1.2.3 Circular Pattern Matching and Circular Pattern Discovery
    1.3 Contribution
    1.4 Organization

2 Related Work
    2.1 Suffix Tree and Suffix Array
        2.1.1 Basic Notations and Definitions
        2.1.2 Suffix Tree
        2.1.3 Suffix Links
        2.1.4 Implementation and Problems with the Suffix Tree
        2.1.5 Suffix Array
    2.2 Space-Efficient Suffix Trees
        2.2.1 ESA
        2.2.2 LST
    2.3 Markov Models and Probabilistic Suffix Tree
        2.3.1 Variable Length Markov Models
        2.3.2 Probabilistic Suffix Tree
        2.3.3 Computing TF and DF via Suffix Arrays
    2.4 Circular Pattern Matching
        2.4.1 String Pattern Matching
        2.4.2 Circular Pattern Matching Problems
        2.4.3 Exact Circular Pattern Matching (ECPM)
        2.4.4 Approximate Circular Pattern Matching (ACPM)
        2.4.5 ACPM Problem in Protein Sequences
    2.5 Pattern Discovery Problem

3 The Virtual Suffix Tree
    3.1 Introduction
    3.2 Basic Data Structure
        3.2.1 Example VST
        3.2.2 Properties of the Virtual Suffix Tree
        3.2.3 Pattern Matching on VST
    3.3 Improved Virtual Suffix Tree
        3.3.1 Adjusting Edge Lengths
        3.3.2 Construction Algorithm
        3.3.3 Further Space Reduction
        3.3.4 Complexity Analysis
    3.4 Computing Suffix Links
    3.5 From SA to VST
    3.6 Summary

4 The Probabilistic Suffix Array
    4.1 Introduction
    4.2 Probabilistic Suffix Tree
    4.3 Proposed Data Structure
        4.3.1 Internal Node Attributes
        4.3.2 Measurement Attributes
        4.3.3 Example PSA
        4.3.4 Interval Array and Document Frequency in Linear Time
    4.4 Constructing the PSA
        4.4.1 Building the Interval Tree
        4.4.2 Building the Suffix Link
        4.4.3 Sorting the PSA Structure
        4.4.4 Computing Conditional Probabilities Using the PSA
        4.4.5 Prediction with VLMM via the PSA
    4.5 Space Analysis
        4.5.1 Storage Space
        4.5.2 Construction Space
    4.6 Experiments
        4.6.1 Predicting Protein Families
        4.6.2 Space Consideration
        4.6.3 Computational Time Requirement
        4.6.4 PSA in Phylogenetic Tree Construction
    4.7 Summary

5 Circular Pattern Matching
    5.1 Introduction
    5.2 Exact Circular Pattern Matching Problem
        5.2.1 Linear Time ECPM Algorithm
        5.2.2 Comparison of ECPM Algorithms
    5.3 Approximate Circular Pattern Matching Problem
        5.3.1 Greedy ACPM Algorithm
        5.3.2 ACPM with LIS
        5.3.3 ACPM with q-grams and Suffix Array
        5.3.4 Improved Algorithm: ACPM with Bidirectional Edit Distance
        5.3.5 Comparison with Other ACPM Algorithms
    5.4 Experiments
        5.4.1 Data Set
        5.4.2 CPM Experimental Design
        5.4.3 Multidomain Protein Networks using Circular Patterns
    5.5 Summary

6 Circular Pattern Discovery
    6.1 Introduction
    6.2 The Circular Pattern Discovery Problem
    6.3 The ECPD Algorithm
    6.4 The ACPD Algorithm
        6.4.1 ACPD using Maes’ Algorithm
        6.4.2 Proposed ACPD Algorithm
        6.4.3 Comparison
    6.5 Experiments
    6.6 Summary

7 Conclusion and Future Work
    7.1 Conclusion
    7.2 Future Work
        7.2.1 Circular Pattern Discovery
        7.2.2 Network Analysis for Circular Multidomain Proteins
        7.2.3 From PSA to PFA
        7.2.4 Approximate Pattern Matching Using PSA
        7.2.5 Prediction with PSA using Inexact Matching
    7.3 Publications from the Dissertation

List of Tables

2.1 Interval Array

3.1 VST node attributes for the example sequence T = missississippi$ used in Figure 3.1.

3.2 Node attributes in the improved VST for the example sequence, T = missississippi$.

3.3 Branching factor and maximum space requirement for various sample files.

3.4 Storage requirement for the VST, including suffix links

3.5 Detailed attributes for nodes in the TA data structure using the sample sequence, T = missississippi$.

3.6 Node mapping table from TA to VST

4.1 Attributes of PSA

4.2 Example PSA internal nodes, using the PSA of the sequence T = accactact$

4.3 Example PSA leaf nodes, using the PSA of the sequence T = accactact$

4.4 Performance of the PSA in modeling and prediction of protein families. Families correspond to the first 51 protein families with 12 or more members in the Pfam database, ordered alphabetically based on their abbreviated names in Pfam. For comparison, we have included the results obtained using the PST [17] on the same data set. (TP stands for true positive, while MD stands for missed detection). ∗∗The family apple was not in the dataset used in [17].

4.5 Summary performance in protein family classification using the PSA and PST

4.6 Summary data on the first 51 families in Pfam, as described in Table 4.4.

4.7 Construction memory needed for the PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4.

4.8 Construction time comparison for PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time is time needed per family (in seconds). Speedup is computed as the ratio with respect to PSA time.

4.9 Prediction time comparison between PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time (in seconds) is prediction time per family – i.e. total time needed to predict all members in the family against all the other families.

5.1 Comparison of ECPM algorithms

5.2 Comparison with other proposed ACPM Algorithms

5.3 Top 15 highest degree proteins with GO function

5.4 The predicted protein functions using union for In-edge and Out-edge

5.5 The predicted protein functions using intersection for In-edge and Out-edge

5.6 Performance in Protein Function Prediction using the Top-500 Proteins

5.7 Network statistics for multidomain protein networks

5.8 Top 25 proteins with the highest node degree differences between protein networks using the circular and non-circular patterns.

5.9 The longest path in the Protein network

5.10 The longest path in the Family network

6.1 The number of distinct patterns with pattern length

6.2 Sample discovered circular patterns with length five.

6.3 Sample discovered circular patterns with length thirteen.

List of Figures

1.1 Summary of work in this dissertation.

2.1 Algorithm for computing document frequency [131]

2.2 Edit Graph of T and PP [87].

2.3 Maes’ Algorithm

3.1 Suffix tree and virtual suffix tree for the string T = missississippi$.

3.2 Example VST (solid nodes) showing left SA index (lSA) and right SA index (rSA) for sample nodes.

3.3 Edge-length adjustment procedure.

3.4 Improved VST for the string T = missississippi$

3.5 Suffix links on the VST for the sample string T = missississippi$.

3.6 Constructing VST from the suffix array.

4.1 State diagram and transition matrix for a first order Markov model for an example sequence T = accactact$.

4.2 Example suffix tree and probabilistic suffix tree for the string T = accactact$.

4.3 Top-k classification rate for sample protein families using the PSA.

4.4 Memory consumption factor (MC Factor) needed to construct the PSA and PST data structures for the first 51 protein families in Pfam.

4.5 Phylogenetic tree for 20 species constructed using the predicted probabilities obtained using the PSA.

5.1 Suffix tree for the string T = missississippi$ with some suffix links.

5.2 The number of hypotheses with q-gram

5.3 Dynamic Programming in q-gram matching

5.4 Three cases in computing the circular edit distance in the ACPM algorithm using the bidirectional edit distance. The numbered double-headers show the symbol positions involved in each case.

5.5 The time cost of the CPM algorithms

5.6 Degree distributions in the network of multidomain proteins constructed based on the circular patterns they contain.

5.7 Number of directly connected pairs in Top-K highest degree proteins

5.8 The Protein network (using both CPs and non-CPs)

5.9 The Protein network using only non-circular patterns

5.10 The Protein network using only circular patterns

5.11 The Family network (using both CPs and non-CPs)

5.12 The Family network using only non-circular patterns

5.13 The Family network using only circular patterns

6.1 Variation of the number of distinct patterns (including non-circular and circular patterns) with pattern length.

6.2 Variation of number of circular patterns with pattern length.

6.3 Variation of maximum number of occurrences with pattern length.

Chapter 1

Introduction

1.1 Overview

The suffix tree is an important data structure used to represent sequences (for example, text, DNA sequences, and video). However, its space requirement is huge for most practical applications. Most methods for the suffix tree (ST) and suffix array (SA) have focused on theoretical time and space complexity. Markov models are popular for modeling complex sequences whose sources are unknown, or whose underlying statistical characteristics are not well understood. A major problem with Markov models, however, is their space complexity, which grows exponentially with the order of the model. The probabilistic suffix tree (PST) was proposed to address the space problem of the Markov model. This reduced the space complexity theoretically. However, in practice, the space requirement for the PST is still relatively large, and often impractical for most real-life problems. Circular permutations and circular pattern matching are interesting problems in computer science and biology. Several algorithms and methods exist to solve these problems, but in practice they have huge time and space costs.

In the first part of this dissertation, we develop efficient suffix data structures for analysis of huge sequences. In the second part, we use these data structures to study the problem of circular permutations and their applications in computational biology. Based on our circular permutation work, we define and study the circular pattern discovery problem.


First, we propose a space-efficient data structure called the virtual suffix tree (VST). The VST supports the same functions as the suffix tree, but with a much smaller practical space requirement. Its average space requirement is significantly smaller than that of other data structures for suffix trees and suffix arrays. Second, we propose an efficient data structure, the probabilistic suffix array (PSA), to represent the Markov model. The PSA takes much less space than the probabilistic suffix tree, which is implemented on a regular suffix tree. We report experiments on biological applications. Lastly, using suffix trees and suffix arrays, we propose algorithms to solve the circular pattern matching problem and use these to study circular permutations and function prediction for multidomain proteins. We also introduce the circular pattern discovery problem and present algorithms to solve it.

1.2 The Problem

1.2.1 Suffix Tree

The suffix tree is an important data structure used to represent the set of all suffixes of a string. The suffix tree is efficient in both time and space, and has been used in a variety of applications, such as pattern matching, sequence alignment, identification of repetitions in genome-scale biological sequences, and in data compression. Various algorithms have been developed for efficient construction of suffix trees [38, 89, 122, 128]. However, one major problem with the suffix tree is its practical space requirement. The suffix array is a related data structure, which was originally introduced in [84] as a space-efficient alternative to the suffix tree. The suffix array simply provides a listing of all the suffixes of a given string in lexicographic order. The suffix array can be used in most (though not all) situations where a suffix tree is used.
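
As a concrete illustration of the suffix array described above, the following is a minimal sketch that sorts suffix start positions directly. The function name `build_suffix_array` is hypothetical, and this naive sort is for illustration only; it is not one of the linear-time construction algorithms cited in this chapter.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`,
    ordered by the lexicographic order of the suffixes.

    Naive illustration: comparing full suffixes makes this far slower
    than the O(n) or O(n log n) algorithms discussed in the text.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


# Example string used throughout the dissertation; '$' is the terminator.
text = "missississippi$"
suffix_array = build_suffix_array(text)
for rank, start in enumerate(suffix_array):
    print(rank, start, text[start:])
```

Each printed row gives the rank of a suffix, its start position in the string, and the suffix itself, which makes the "listing of all suffixes in lexicographic order" explicit.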

Although the theoretical space complexity is linear for both data structures, for a given string T of length n the suffix array typically requires about three to five times less space than the suffix tree. The construction time for both data structures is also O(n) on average. For suffix arrays, construction algorithms that run in O(n log n) worst-case time are relatively easy to develop, but O(n) worst-case algorithms are much harder to come by. Recent suffix sorting algorithms with worst-case linear time complexity have been reported in [57, 63, 65]. Gusfield [46] provides a comprehensive treatment of suffix trees and their applications. Puglisi et al. [104] provide a recent survey on suffix arrays. An extensive discussion on the connection between the Burrows-Wheeler Transform [23] and suffix trees and suffix arrays is provided in [3].

For small alphabet sizes, the suffix tree and the suffix array have about the same complexity in pattern matching. For pattern matching, the suffix array requires O(m log n) time to locate one occurrence of a pattern P of length m in T. However, with additional data structures, such as the lcp array, this time can be reduced to O(m + log n). With the suffix tree, the same search can be performed in O(m) time. The problem, however, is for sequences with large alphabets, where |Σ| → n. Here the alphabet size |Σ| is no longer negligible. Using the array representation of nodes in the suffix tree will require O(n|Σ|) space for the suffix tree, and O(m) time for pattern matching. For linear space, the linked list or binary search tree can be used, but the search time becomes O(m|Σ|) or O(m log |Σ|), respectively.
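
The O(m log n) search on a plain suffix array can be sketched as a binary search over the sorted suffixes, where each of the O(log n) steps compares at most m symbols. The helper names below are hypothetical, and the suffix array is built naively only to keep the example self-contained.

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive construction, used here only to obtain a suffix array.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text: str, sa: list[int], pattern: str) -> int:
    """Return one start position of `pattern` in `text`, or -1 if absent.

    Binary search for the first suffix whose length-m prefix is >= pattern.
    Each comparison inspects at most m symbols, giving O(m log n) overall.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        return sa[lo]
    return -1


text = "missississippi$"
sa = build_suffix_array(text)
print(find_occurrence(text, sa, "ssip"))  # a start position of "ssip"
print(find_occurrence(text, sa, "xyz"))   # -1: pattern does not occur
```

The O(m + log n) bound mentioned above needs the additional lcp information to avoid re-comparing symbols at each step; that refinement is not shown here.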

The Challenge. A key problem then is to develop space-efficient data structures that can support pattern matching using the same time complexity as suffix trees, but at a practical space requirement that approaches that of the suffix array. Such a data structure should also support the complete functionality of the suffix tree, such as support for suffix links, as may be required in certain applications. Two recent data structures that have attempted to address this problem are the ESA – enhanced suffix array [2], and the LST – linearized suffix tree [61, 62]. Both methods are based on the notion of lcp-intervals [60], constructed using the suffix array and the lcp array.
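
Since lcp-intervals are derived from the suffix array together with the lcp array, a short sketch of computing the lcp array may help. The version below follows the well-known linear-time scheme of Kasai et al.; it is offered as an assumption about one reasonable way to obtain the array, not as the construction used by the ESA, the LST, or this dissertation. Here `lcp[r]` is taken to be the length of the longest common prefix of the suffixes at ranks r-1 and r, with `lcp[0] = 0`.

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive construction, used here only to obtain a suffix array.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text: str, sa: list[int]) -> list[int]:
    """Compute lcp[r] = longest common prefix length of the suffixes
    at ranks r-1 and r in the suffix array (lcp[0] = 0)."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 symbols
    return lcp


text = "banana$"
sa = build_suffix_array(text)
print(sa)
print(build_lcp_array(text, sa))
```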

1.2.2 Markov Models and Probabilistic Suffix Tree

Markov models are very popular for modeling complex sequences whose sources are unknown, or whose underlying statistical characteristics are not well understood. This is especially the case when the sequences exhibit some memory. For a short-term memory of length, say, L, this means that the conditional distribution of the next symbol given the last L symbols does not change significantly if we condition on more than the last L symbols. Thus, such sequences are often modeled using Markov models of order L, or using Hidden Markov Models (HMM) [37]. The models provide efficient mechanisms to compute the required conditional probabilities, and also for generating sequences from the models. The problem is that the size of Markov models increases exponentially with increasing memory length L. Thus, they are practical only for low-order models with short memory lengths. This leads to the second challenge: such low-order Markov models often provide a poor approximation of the true sequence being modeled. It is known that learning with Hidden Markov Models is computationally very challenging. Hardness results on the learnability of HMM are discussed in [1], while similar results on inferencing using HMM are reported in [42].
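
To make the order-L model and its exponential size concrete, the following minimal sketch estimates conditional probabilities from context counts. The function names are hypothetical, and the plain maximum-likelihood estimate (no smoothing) is an assumption made for brevity. A full table for order L over alphabet Σ has up to |Σ|^L contexts, which is the growth described above.

```python
from collections import Counter, defaultdict


def train_markov(seq: str, order: int) -> dict:
    """Count, for every length-`order` context, which symbol follows it."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def cond_prob(counts: dict, context: str, symbol: str) -> float:
    """Maximum-likelihood estimate of P(symbol | context); 0.0 if unseen."""
    followers = counts.get(context)
    if not followers:
        return 0.0
    return followers[symbol] / sum(followers.values())


seq = "accactact"  # example sequence from the dissertation, without '$'
alphabet_size = len(set(seq))
for order in (1, 2, 3):
    model = train_markov(seq, order)
    # Upper bound on the number of contexts grows as |alphabet| ** order.
    print(order, len(model), alphabet_size ** order)

first_order = train_markov(seq, 1)
print(cond_prob(first_order, "c", "t"))  # P(t | c)
```

Only the contexts that actually occur are stored here, which hints at why suffix-based representations can be far smaller than the full table.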

Probabilistic suffix models such as probabilistic suffix trees (PST) have been proposed by Ron et al. [108] to address some of the key problems with Markov models. They showed the equivalence between PSTs and a subclass of probabilistic finite automata (PFAs) called probabilistic suffix automata (PSF/PSA): for a given PST, there is an algorithm to construct a PSF/PSA whose size is the same as that of the PST within a constant factor. Further, the distribution generated by a PST is guaranteed to be within a small distance from that generated using the PSF/PSA, as measured by the Kullback-Leibler divergence.

Probabilistic suffix trees and probabilistic suffix automata are related to context-based models, which are extensively used in sequence prediction and data compression. Typical examples of such context models include context tree weighting (CTW) [129], prediction by partial matching (PPM) [28], and the Lempel-Ziv decomposition [133, 134]. See also [106, 107]. The use of these context models as a surrogate for variable length Markov models, with applications in sequence prediction, is reviewed in [16]. Other applications have been found in modeling DNA sequences [109], protein sequence classification [17, 71], and in modeling API call sequences for malicious codes in Windows XP [88].

The Challenge. The probabilistic suffix models, however, require $O(Ln^2)$ time and space to construct. The algorithm to construct the PFA/PSA from the PST also runs in $O(Ln^2)$ time. Later, Apostolico and Bejerano [9] showed how the PST can be constructed in O(n) time, independent of the order L, using the traditional suffix links used in constructing suffix trees, and the notion of reverse suffix links. They did not consider the problem of constructing the PSF/PSA from the PST. Their use of the suffix tree also implies that the space requirements for constructing the PST will be very high, although they are still linear in the sequence length.


1.2.3 Circular Pattern Matching and Circular Pattern Discovery

Computing similarity (or dissimilarity) between two strings is an important problem in general sequence analysis [44, 46, 51, 81, 115], pattern recognition [113, 120] and biology [40, 47, 55]. The circular edit distance is an extension of the traditional edit distance which seeks to determine similarity between strings under a circular shift. A circular shift is a mapping $f : \Sigma^* \to \Sigma^*$, $f^t(c_1 \ldots c_r) = c_{t+1} \ldots c_r c_1 \ldots c_{t-1} c_t$, where $0 \le t \le r-1$ and $r$ is the length of the string $c_1 c_2 \ldots c_r$. The circular edit distance between two strings $s_1$ and $s_2$ is defined as $ED_c(s_1, s_2) = \min\{ED[f^i(s_1), f^j(s_2)] \mid 0 \le i \le |s_1| - 1,\ 0 \le j \le |s_2| - 1\}$, where $|s_1|$ is the length of string $s_1$, $|s_2|$ is the length of string $s_2$, and $ED[A, B]$ is the standard edit distance between A and B. Thus, the dissimilarity between two strings under a circular shift is a function of the circular distance between them. Computational methods have also been proposed to study circular patterns in biological problems [123, 124, 126, 127].
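
The definition above can be evaluated directly, if inefficiently, by trying every pair of circular shifts. The sketch below does exactly that with a standard dynamic-programming edit distance (unit costs for insertion, deletion, and substitution are an assumption); the function names are hypothetical, and this brute force is meant only to pin down the definition, not to represent any of the algorithms studied in this dissertation.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard edit distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def circular_shifts(s: str) -> list[str]:
    """All shifts f^t(s) = s[t:] + s[:t] for 0 <= t <= len(s) - 1."""
    if not s:
        return [s]
    return [s[t:] + s[:t] for t in range(len(s))]


def circular_edit_distance(s1: str, s2: str) -> int:
    """Minimum edit distance over all pairs of circular shifts."""
    return min(edit_distance(x, y)
               for x in circular_shifts(s1)
               for y in circular_shifts(s2))


print(edit_distance("abcd", "cdab"))           # positive: the strings differ
print(circular_edit_distance("abcd", "cdab"))  # 0: one is a shift of the other
```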

In biology, circular proteins and circular permutations in proteins are of increasing interest, especially given their role in the structure, function, folding, and stability of proteins [40, 47, 55]. In a circular (or cyclic) protein, the traditional N- and C-termini are joined, resulting in a protein sequence with no termini [123]. The cyclotides are a typical example of a naturally occurring family of cyclic proteins in the Plant Kingdom. Cyclotides are known to play a major role and provide important functions in terms of plant defense against insects and other pathogens [35]. Their cyclic structure is known to be an important factor in their unusual stability [35]. Other common examples of cyclic proteins are the bacteriocins, small antimicrobial peptides with 30-70 residues produced by bacteria [33], cyclosporins found in fungi [66], and the primate rhesus θ-defensin-1 [119], which has antibacterial properties in the immune system of macaque monkeys.

The Challenge. There are several algorithms for calculating the circular distance between text T and circular pattern P, but they did not consider the problem of finding circular pattern P and its circular shifts inside substrings of text T. In our work, we define variants of the CPM problem and propose algorithms to solve them.
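
To illustrate what finding a circular pattern and its shifts inside a text means in the exact case, here is a naive baseline sketch. It collects every circular shift of P and checks each length-m window of T against that set. The function name is hypothetical, and this is only a reference formulation of the problem, not the linear-time algorithm developed later in the dissertation.

```python
def circular_matches(text: str, pattern: str) -> list[tuple[int, str]]:
    """Return (position, matched shift) for every window of `text`
    that equals some circular shift of `pattern`."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    doubled = pattern + pattern
    # Every circular shift of the pattern is a length-m substring of P + P.
    shifts = {doubled[t:t + m] for t in range(m)}
    return [(i, text[i:i + m])
            for i in range(len(text) - m + 1)
            if text[i:i + m] in shifts]


for position, shift in circular_matches("missississippi$", "sis"):
    print(position, shift)
```

Reporting the matched shift alongside the position shows which rotation of the pattern occurred at each location in the text.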

Pattern discovery is a fundamental analysis method used to identify possibly hidden relations within or between sequences. Pattern discovery problems are well studied in many applications. In biology, motif discovery is often performed as a kind of pattern discovery application. To our knowledge, there is no existing work that explicitly studies the problem of pattern discovery involving circular patterns. In this work, we introduce the Circular Pattern Discovery problem and propose algorithms for its solution.

1.3 Contribution

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [2] and the linearized suffix tree [62]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.

We present the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov chains. The PSA essentially encodes information in a Markov model by providing a space-efficient representation of the probabilistic suffix tree (PST). Our PSA provides the same functionality as the PST, but at a significantly reduced space requirement. Given a sequence of length n, construction and learning in the PSA is done in O(n) time and space, independent of the Markov order. Prediction using the PSA is performed in O(m log(n/|Σ|)) time, where m is the pattern length and Σ is the symbol alphabet. The specific memory requirement is 33n bytes in the worst case, and 26n bytes on average, including space for the suffix array and the input sequence. This can be compared with the 41n bytes needed using the PST.

We propose an exact circular pattern matching (ECPM) algorithm that runs in linear time and linear space. We also propose algorithms to solve the approximate circular pattern matching (ACPM) problem. In our work, we solve a harder version of the ACPM problem than that considered in previous work on ACPM [44, 81, 123, 124, 126, 127]. We present an experiment on finding circular relations in multi-domain proteins. Our experiments show that methods based on circular permutations can produce very good results in predicting protein functions.

We propose two algorithms for the Circular Pattern Discovery (CPD) problem. The first algorithm uses suffix trees and suffix links to solve the exact circular pattern discovery problem in O((m^2/2)N) time. The second algorithm uses suffix arrays to solve the more challenging approximate circular pattern discovery (ACPD) problem in O(k(m^2/2)N^2) time in the worst case, and O(k(m^2/2)N) on average. By exploiting the nature of the ACPD problem, the complexity can be reduced to O((m^2/2)N^2) in the worst case, and O((m^2/2)N) on average.

Aspects of the work from this dissertation are reported in the following papers: [74–79].

1.4 Organization

The dissertation is organized as follows. In Chapter 2, we introduce related work, including basic notations and definitions. Chapter 3 presents the Virtual Suffix Tree (VST), including its construction algorithm, searching algorithm, and support for suffix links. We also analyze its time and space complexity and compare it with related algorithms. Chapter 4 introduces the Probabilistic Suffix Array (PSA), including a detailed explanation and implementation. The practical space requirement of the data structure is also examined. In this chapter, we also present experiments with the PSA using protein sequences and miDNA sequences. Chapter 5 discusses the circular pattern matching (CPM) problem. Algorithms are then proposed for the exact and inexact variants of the problem using suffix data structures. We implement the algorithms and show examples of using them to search for circular patterns in multidomain proteins. In Chapter 6, we define the circular pattern discovery (CPD) problem and present algorithms to solve it. Chapter 7 draws some conclusions and describes possible directions for future work. Figure 1.1 provides a summary of the work reported in this dissertation.


[Figure 1.1: tree diagram of the dissertation work. PSA: data structure, algorithm, experiments (protein family classification, phylogenetic tree construction). VST: data structure, algorithm. CPM & CPD: CPM with algorithms ECPM and ACPM and experiments (protein function prediction, circular pattern network); CPD with algorithms ECPD and ACPD and an experiment (pattern discovery in multi-domain protein database).]

Figure 1.1. Summary of work in this dissertation.

Chapter 2

Related Work

In this chapter, we first introduce the suffix data structures, namely suffix trees and suffix arrays, which form the basis for the majority of the work proposed in this dissertation. The suffix data structures are important in computer science and bioinformatics. In Section 2, we introduce the enhanced suffix array (ESA), the Linearized Suffix Tree (LST), and other space-efficient suffix trees. The ESA and the LST use the suffix array and LCP array to simulate the suffix tree. They are related to our new structure, the virtual suffix tree (VST).

In Section 3, we describe variable length Markov models (VLMM) and the probabilistic suffix tree (PST), which is used to implement VLMM. They are related to a new data structure, the probabilistic suffix array (PSA), proposed in this work. We also present previous work on calculating the term frequency (TF) and document frequency (DF), which are used in our proposed PSA representation of VLMM.

In Section 4, we discuss the circular pattern matching (CPM) problem. We first review the general string pattern matching problem. We then define two CPM problems and introduce previous work. We will provide our solutions to the different CPM problems and experimental results in Chapter 5. In Section 5, we discuss the related work on the pattern discovery problem.


2.1 Suffix Tree and Suffix Array

2.1.1 Basic Notations and Definitions

Let T = T[1..n] be the input string of length n, over an alphabet Σ. Let T = αβγ, for some strings α, β, and γ (α and γ could be empty). The string β is called a substring of T, α is called a prefix of T, while γ is called a suffix of T. The prefix α is called a proper prefix of T if α ≠ T. Similarly, the suffix γ is called a proper suffix of T if γ ≠ T. We will also use t_i = T[i] to denote the i-th symbol in T; both notations are used interchangeably. We use T_i = T[i..n] = t_i t_{i+1} ... t_n to denote the i-th suffix of T. For simplicity in constructing suffix trees, we ensure that no suffix of the string is a proper prefix of another suffix by appending a special symbol $ to T, such that $ ∉ Σ and $ < σ, ∀σ ∈ Σ. We let P = P[1..m] be the pattern string that needs to be found in T.

In our work, the alphabet size |Σ| is not fixed. It may be small, for example |Σ| = 4 for DNA sequences, or it may be large, for example |Σ| ≈ 106 for multidomain protein sequences.

2.1.2 Suffix Tree

Given a string T, its suffix tree (ST) is a rooted tree with n leaves, where the i-th leaf node corresponds to the i-th suffix T_i of T. Except for the root node and the leaf nodes, every node must have at least two child nodes. Each edge in the suffix tree represents a substring of T, and no two edges out of a node start with the same character. For a given edge, the edge label is simply the substring in T corresponding to the edge. We use l_i to denote the i-th leaf node. Then, l_i corresponds to T_i, the i-th suffix of T. When the edges from each node are sorted alphabetically, l_i will correspond to T_{SA[i]}, the i-th suffix of T in lexicographic order, where SA denotes the suffix array.

For edge (u,v) between nodes u and v in ST, the edge label (denoted label(u,v)) is a non-empty substring of T. The edge length is simply the length of the edge label. The edge label is usually represented compactly using two pointers to the beginning and end of its corresponding substring in T. For a given node u in the suffix tree, its path label, L(u), is defined as the label of the path from the root node to u. Since each edge represents a substring in T, L(u) is essentially the string formed by the concatenation of the labels of the edges traversed in going from the root node to the given node u. The string depth of node u (also called its string length or path length) is simply |L(u)|, the number of characters in L(u). The node depth (also called node level) of node u is the number of nodes encountered in following the path from the root to u. The root is assumed to be a node at depth 0.

Given the string T = T[1..n], of length n, but with the end-of-string symbol appended to give a sequence T$ with length n+1, the suffix tree of the resulting string T$ will have the following properties:

1. Exactly n+1 leaf nodes.

2. At most n internal (or branching) nodes (the root node is considered an internal node).

3. Every distinct substring of T is encoded exactly once in the suffix tree. Each distinct substring is spelled out exactly once by traveling from the root node to some node u, such that L(u) is the required substring. Note that the node u may be an implicit node, i.e. ending at a position between two (explicit) nodes.

4. No two edges out of a given node in the suffix tree start with the same symbol.

5. Every internal node has at least two outgoing edges.

Properties (1), (2), (4), and (5) imply that a suffix tree will have at most 2n+1 total nodes, and at most 2n edges.

2.1.3 Suffix Links

Some suffix tree construction algorithms make use of suffix links. The notion of suffix links is based on a well-known fact about suffix trees [89, 128], namely, if there is an internal node u in ST such that its path label L(u) = aα for some single character a ∈ Σ and a (possibly empty) string α ∈ Σ*, then there is a node v in ST such that L(v) = α. A pointer from node u to node v is called a suffix link. If α is an empty string, then the pointer goes from u to the root node. Suffix links are important in certain applications, such as in computing matching statistics needed in approximate pattern matching, regular expression matching, or in certain types of traversal of the suffix tree.


2.1.4 Implementation and Problems with the Suffix Tree

A predominant factor in the space cost for suffix trees is the number of interior nodes in the tree, which depends on the tree topology. Thus, a major consideration is how the outgoing edges from a node in the suffix tree are represented. The three major representations used for outgoing edges are arrays, linked lists, and binary search trees. While the array is simple to implement, it could require a large amount of memory for large alphabets. However, independent of the specific method adopted, a simple implementation of the suffix tree, for example using Ukkonen's algorithm [122], can require as much as 33n bytes of storage with suffix links, or 25n bytes without suffix links [3].

2.1.5 Suffix Array

The suffix array (SA) is another data structure, closely related to the suffix tree. The suffix array simply provides a lexicographically ordered list of all the suffixes of a string. If SA[i] = j, it means that the i-th smallest suffix of T is T_j, the suffix starting at position j in T. A related structure, the LCP array, contains the lengths of the longest common prefixes between adjacent positions in the suffix array. Combining the suffix array with the LCP information provides a powerful data structure for pattern matching. With this combination, decisions on the occurrence (or otherwise) of a pattern P of length m in the string T of length n can be made in O(m + log n) time. Given the new worst-case linear-time direct SA construction algorithms, and the small memory footprint of suffix arrays, it is becoming more attractive to construct the suffix tree from the suffix array. A linear-time algorithm for constructing ST from SA is presented in [3].
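As an illustration of the suffix array and LCP array described above, the following is a minimal sketch in Python. It uses a naive sort-based construction rather than a linear-time algorithm, 0-based positions rather than the 1-based positions of the text, and a plain binary search that costs O(m log n) rather than the O(m + log n) bound obtained with LCP information. The function names are hypothetical.

```python
def suffix_array(text):
    """Naive construction: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda p: text[p:])


def lcp_array(text, sa):
    """lcp[k] = length of the longest common prefix of the suffixes at
    sa[k-1] and sa[k]; lcp[0] is set to 0 by convention."""
    def common(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def sa_range(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffix-array slots whose
    suffixes start with `pattern` (empty range if there is no occurrence)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first slot whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first slot whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "accactact$"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
lo, hi = sa_range(text, sa, "act")
occurrences = sorted(sa[lo:hi])         # start positions of "act" in text
```

The width of the returned range, hi - lo, is the number of occurrences of the pattern, which is the quantity used later as the term frequency of a substring.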

2.2 Space-Efficient Suffix Trees

The problem of the practical space needed by suffix trees has been recognized, and methods have been proposed to provide space-efficient data structures [8, 45, 82, 83, 94, 110, 111]. Andersson et al. [8] presented a level-compressed suffix tree in O(n) bytes. Munro et al. [94] proposed space-efficient suffix structures using O(n log n) bits of space with O(m|Σ|) searching time. The structures include a suffix array and other auxiliary structures to represent the suffix tree. However, these structures did not include the suffix link, which is an important part of the suffix tree. Grossi et al. [45] proposed compressed suffix structures, an indexing structure in O(n) bits with O(m|Σ|) searching time. The Compact Suffix Array [82] uses at most 9n bytes of space (or (13 + 1/8)n bytes when the LCP array is included), but the construction time is O(n log n) and the searching time is O(m log log n + (log n)^2 log log n + n_occ log n log log n), where n_occ is the number of occurrences.

While the suffix array reduces the problem of space, the suffix tree still provides a simpler approach to certain analysis problems, such as computing matching statistics [27]. Thus, methods have been proposed to improve the suffix array with extra information to provide the full functionality of the suffix tree. The enhanced suffix array [2] and the linearized suffix tree [61, 62] are two example data structures recently proposed for full-text indexing with the functionalities of both the suffix tree and suffix array.

2.2.1 ESA

The enhanced suffix array (ESA) [2] is composed of the suffix array and extra data structures, namely, the lcp array and a child table that contains branching information between parent and child nodes in the suffix tree. The key idea used in the ESA is the concept of lcp-intervals (originally used in [60]). Given the suffix array SA, an interval [i..j] in SA, 1 ≤ i < j ≤ n+1, is called an lcp-interval with lcp-value l if the following conditions hold:

1. lcp[i] < l;

2. lcp[k]≥ l ∀ k s.t. i+1≤ k ≤ j;

3. lcp[k] = l, for some k, s.t. i+1≤ k ≤ j;

4. lcp[ j +1] < l.
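The four conditions can be checked directly. The sketch below is only an illustration: it uses 0-based slots, assumes lcp[k] compares the suffixes in slots k-1 and k, and treats an lcp value outside the array as -1 (a boundary convention chosen here, not taken from [2]). The hard-coded array is the LCP array of the sample string accactact$ under those conventions.

```python
def is_lcp_interval(lcp, i, j, l):
    """Check conditions 1-4 for the interval [i..j] and lcp-value l."""
    n = len(lcp)
    if not (0 <= i < j < n):
        return False
    left = lcp[i] if i > 0 else -1            # condition 1: lcp[i] < l
    right = lcp[j + 1] if j + 1 < n else -1   # condition 4: lcp[j+1] < l
    inner = lcp[i + 1:j + 1]                  # lcp[k] for i+1 <= k <= j
    # min(inner) == l covers conditions 2 and 3 together:
    # every inner value is >= l, and at least one equals l.
    return left < l and right < l and min(inner) == l


# LCP array of "accactact$" (0-based slots, lcp[0] = 0).
lcp = [0, 0, 2, 3, 0, 1, 1, 2, 0, 1]
```

Under these conventions, is_lcp_interval(lcp, 1, 3, 2) holds (the three suffixes starting with "ac"), while is_lcp_interval(lcp, 1, 2, 2) fails condition 4 because the next suffix also shares the prefix.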

Thus, rather than the traditional suffix tree, the ESA constructs an lcp-interval tree. Nodes in a suffix tree are now replaced with lcp-intervals, such that the parent-child relationships in a traditional suffix tree are now captured by equivalent parent-child relationships between lcp-intervals. The root node corresponds to the interval [1..n] in the suffix array, essentially the entire suffix array. In the ESA, the basic structure used to represent the child nodes of a given parent node is a linked list. Essentially, the child table is composed of three arrays, namely up, down, and nextIndex. The up and down arrays store information about the edges in the tree, while the array nextIndex records information about the linked list used to represent the sibling relationship between nodes with the same parent. These three arrays would ordinarily require 3n elements to store. Interestingly, only n of these elements are required, and hence the child table requires only n integers to store. The ESA assumes that |Σ| is small relative to n. Thus, for large alphabets, pattern matching could take longer on the ESA. For instance, with the binary tree representation of the nodes in the suffix tree, pattern matching will take O(m log |Σ|) time for a pattern of length m; doing the same on the enhanced suffix array (which uses a linked list representation of the nodes) will require O(m|Σ|) time, a significant difference for large alphabets.

2.2.2 LST

The linearized suffix tree (LST) [61, 62] is an improvement on the ESA. It uses the same up and down arrays as the ESA, but replaces the nextIndex array with two other arrays: lchild and rchild. The two arrays store information about siblings at a given node in the interval tree, such that the intervals can be represented by a complete binary tree. Specifically, let the interval [i..j] denote the lcp-interval for a given node in the lcp-interval tree (equivalently, a node in the traditional suffix tree). Then, the two new arrays are defined as follows [61]: lchild[i] records the first index of the left child node of the longest interval starting at i in the complete binary tree; rchild[i] records the corresponding value for the right child. Like the ESA, the nature of the four arrays in the new child table used by the LST makes it possible to represent the relevant information in the table using only n integers rather than the 4n integers that ordinarily would be required. However, unlike the ESA, the LST uses a complete binary tree as the basic structure to represent information about sibling nodes at a given node. This important difference makes it possible for the LST to support pattern matches in O(m log |Σ|) time, the same time bound as for suffix trees.


2.3 Markov Models and Probabilistic Suffix Tree

2.3.1 Variable Length Markov Models

A Markov model is a sequence of stochastic events {X_n, n = 0, 1, 2, 3, ...} with state space S that satisfies the Markov property:

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i)

A Markov model (or Markov chain) of order L, where L is finite, is a sequence of events satisfying:

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_{n−L} = i_{n−L})

Thus, in an order-L Markov chain, the current state is dependent on the past L states. Fixed length Markov models (FLMM) represent a probabilistic finite state machine which can be used to model arbitrarily complex sequential data. Such models aim at learning P(σ|C), the conditional probability distribution of a symbol σ given its context C, where σ ∈ Σ and C ∈ Σ^L; L is the order or memory length of the model, and is fixed. The FLMM of order L is represented as a |Σ|^L × |Σ| matrix. The space requirement is thus in O(|Σ|^{L+1}).
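To make the FLMM concrete, the following minimal Python sketch estimates P(σ|C) for a fixed order L by relative counts. It is an illustration only: it stores a sparse dictionary holding just the contexts that actually occur, rather than the full |Σ|^L × |Σ| matrix, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def train_flmm(seq, order):
    """Estimate P(symbol | context) for every length-`order` context in seq,
    as the relative frequency of the symbol following that context."""
    context_counts = Counter()
    next_counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        ctx = seq[i - order:i]
        context_counts[ctx] += 1
        next_counts[ctx][seq[i]] += 1
    return {
        ctx: {sym: c / context_counts[ctx] for sym, c in nxt.items()}
        for ctx, nxt in next_counts.items()
    }


order1 = train_flmm("accactact", 1)   # e.g. order1["c"]["t"] is P(t | c)
order2 = train_flmm("accactact", 2)   # e.g. order2["ac"]["t"] is P(t | ac)
```

For the sample sequence, the context c is followed once by c, once by a, and twice by t, so the order-1 estimate of P(t|c) is 0.5.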

Variable length Markov Models (VLMM) differ from FLMM in an important way. Variable length Markov models attempt to learn the conditional distribution of a symbol whereby the context length or model order could be varying, depending on the data being modeled. Essentially, for VLMM, C ∈ ⋃_{i=1}^{L} Σ^i. This property of varying memory length implies that with the VLMM, Markov dependencies of varying order in the training data, both large and small, could be captured with ease. This flexibility of the VLMM, however, comes at a huge cost in terms of space requirement. To represent a VLMM of order L, we have to store L matrices, one for each order from 1 to L. The total space requirement will be in O(|Σ|^2 + |Σ|^3 + ... + |Σ|^{L+1}), or O(|Σ|^{L+2}). The space is huge, even when L is small. Thus, the space requirement for Markov models is exponential in L, whether we consider fixed length or variable length models.
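A short calculation shows how quickly these matrix sizes grow. The sketch below simply counts matrix entries under the representation described above (one |Σ|^i × |Σ| matrix per order i); the function names and the chosen parameter values are illustrative assumptions.

```python
def flmm_entries(sigma, L):
    """Entries in the single |Σ|^L x |Σ| matrix of an order-L FLMM."""
    return sigma ** (L + 1)


def vlmm_entries(sigma, L):
    """Entries in the L matrices of a VLMM, one per order i = 1..L."""
    return sum(sigma ** (i + 1) for i in range(1, L + 1))


dna_fixed = flmm_entries(4, 10)       # DNA alphabet, order 10
dna_variable = vlmm_entries(4, 10)
protein_fixed = flmm_entries(20, 3)   # 20 amino acids, order 3
protein_variable = vlmm_entries(20, 3)
```

Even for the four-letter DNA alphabet, an order-10 model already needs more than four million entries in the fixed length case and more than five million in the variable length case.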

An important observation that could point to a potential reduction in the space requirement for Markov models is that for a given sequence of length n, there are n(n+1)/2 possible substrings in the sequence. Thus, there are at most n(n+1)/2 states that can be represented in the Markov model for the sequence, for any given order. For the length-n sequence, the maximum order of a Markov model will be n−1. Thus, with knowledge of the sequence, we can have a limit on the possible number of states in its Markov model.

2.3.2 Probabilistic Suffix Tree

The probabilistic suffix tree (PST) is a probabilistic suffix model which is based on the traditional suffix tree. Like the suffix tree, the PST represents all the n(n+1)/2 substrings from the root to the leaf nodes. The PST models variable length Markov models, which means that the string depth is not fixed for every node. For an FLMM, its corresponding PST can be obtained from the PST of the VLMM by constraining each leaf node to be of the same string depth. The transition probability of a symbol on a given path is computed as the relative frequency of the symbol in the observed data, given the preceding substring on the path. The length of the substring used to determine such conditional probabilities is simply given by the memory length or order of the model.

Example PST and ST for a sample sequence T = accactact$ are given in Chapter 4 (see Figure 4.2).

The original algorithm [109] required O(Ln^2) time to construct and prune the PST from a suffix tree. The improved algorithm [88] used balanced red-black trees [31] to construct the PST in O(Ln log n) time. Apostolico and Bejerano [9] presented an O(n) time algorithm using suffix links and reverse suffix links, independent of L.

2.3.3 Computing TF and DF via Suffix Arrays

In [9, 109], the conditional probabilities required for the PST were computed as relative counts, using the notion of empirical probabilities, based on symbol frequencies from the observed data. To compute the empirical probabilities, we use the notions of term frequency and document frequency as used in information retrieval. The term frequency, TF, is simply the number of times a given term (or substring in our case) occurred in a given text. Here, the text could contain many documents. Therefore, the document frequency for a given term is the number of times the term occurred in the document. Thus, the term frequency is easily obtained from the document frequency as a simple sum. When the text contains only one document, the TF and DF will be the same.

There are n(n+1)/2 substrings in a sequence of size n. Using a naïve algorithm, we will need to compute TFs for all the n(n+1)/2 terms. However, with the suffix tree for this sequence, we have n leaf nodes and at most n internal nodes. These 2n nodes represent the n(n+1)/2 substrings in the sequence. Therefore, multiple substrings at different positions in the sequence will be represented by the same node. These substrings represented by the same node must have the same frequency count. Since the node labels are unique in the suffix tree, this means that the multiple substrings in the same node are essentially the same substring, repeated multiple times in the sequence. Thus, while there are O(n^2) possible substrings in T, there are at most O(n) unique substrings. Hence, we need to compute the TF for only the 2n unique substrings.

Yamamoto and Church [131] presented a data structure to represent nodes in the suffix tree using a suffix array. They called this structure the interval array. Using the interval array, their algorithm calculated all the required TFs in O(n) time, but needed O(n log n) time to compute the DFs. Our data structure is based on the interval array, and we will show how we can improve the algorithm for computing DF to O(n) time.

Interval Array Structure

From the suffix tree, we know we can cluster the potential n(n+1)/2 substrings into at most 2n "groups". In [131], the interval array was proposed as a data structure to represent these groups. Table 2.1 shows the interval array for sample sequence T = accactact$. From the table, we can describe some important properties of the interval array as follows.

For all substrings, we have the following:

1. There are at most 2n groups in the document of size n. Substrings in the same group have the same statistics (for example, TF and DF) and the same derivative measurements from these statistics.

2. An lcp-delimited interval < i, j > is constructed using the LCP array, where < i, j > is an interval on the suffix array. An lcp-delimited interval < i, j > must meet the condition:

max(LCP[i], LCP[j+1]) < Length_group(i, j) ≤ min(LCP[i+1], LCP[i+2], ..., LCP[j]),

where Length_group(i, j) is the LCP of the suffixes that belong to the same group with the < i, j > interval.

The lcp-delimited intervals have the following properties:

1. Each lcp-delimited interval represents one unique group of substrings.

2. The maximum length of an lcp-delimited interval < i, j > is given by:

min(LCP[i+1],LCP[i+2], ...,LCP[ j]).

3. A non-trivial lcp-delimited interval is one with a start position that is less than the end position. That is, the lcp-delimited interval < i, j > is non-trivial if i < j. Otherwise, if i ≥ j, the interval is said to be trivial. There are n trivial lcp-delimited intervals with TF = 1. There are at most n−1 non-trivial lcp-delimited intervals with TF > 1.

4. The lcp-delimited intervals for a document can form a nested structure of intervals, but no two lcp-delimited intervals can overlap. Thus, the lcp-delimited intervals can be represented in a tree-like structure.

5. Let α and β be two substrings in the same lcp-delimited interval < i, j >. Then, the following two conditions hold:

• TF(α) = TF(β) = j − i + 1

• DF(α) = DF(β)

Table 2.1. Interval Array

Interval    LCP    Term Frequency
< 2,3 >     3      2
< 1,3 >     2      3
< 6,7 >     2      2
< 4,7 >     1      4
< 8,9 >     1      2

Below we consider algorithms for computing TF and DF.

Determining the lcp-delimited intervals

Computing the term frequency (TF) is almost analogous to determining the lcp-delimited intervals. Given the lcp-delimited interval < i, j >, the TF algorithm in [131] uses the relation TF(< i, j >) = j − i + 1 to determine the term frequency for the interval. The time complexity of the algorithm is O(n), using O(n) space.

We observe that there are at most n neighboring LCP pairs (i.e. LCP[i] and LCP[i+1]) in a suffix array of size n. Thus, there are at most n increasing orders in such neighboring pairs. Similarly, there are at most n decreasing orders between the neighboring pairs. A decreasing order between neighboring pairs LCP[i] and LCP[i+1] implies that there exists at least one lcp-delimited interval < k, i >, where k ≤ i. Using the foregoing and the fact that we have at most 2n lcp-delimited intervals in a document of size n, Yamamoto and Church [131] computed the term frequency. Given that the lcp-delimited intervals could be nested, a stack can be used to hold the information on the intervals.

Assume we have m decreasing neighboring pairs in an LCP array of size n. Let the intervals between each pair be n_1, n_2, ..., n_m. Hence, there are (n_1 + 1) + (n_2 + 1) + ... + (n_i + 1) + ... + (n_m + 1) = m + Σ_{i=1}^{m} n_i ≤ m + n ≤ 2n lcp-delimited intervals. When we find the i-th decreasing neighboring pair, we will output the n_i intervals. Thus, determining the lcp-delimited intervals can be done in O(n) time. This linear time complexity can be compared with the work reported in [2], where a similar structure (the LCP-interval tree) was constructed in O(n log |Σ|) time.
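The stack-based idea can be sketched as follows. This is an illustrative Python version, not the code of [131]: it reports only the non-trivial intervals, uses 0-based suffix array slots, builds the suffix array and LCP array naively, and assumes lcp[k] compares the suffixes in slots k-1 and k. An interval is emitted whenever the LCP value decreases, which corresponds to the decreasing neighboring pairs discussed above.

```python
def suffix_and_lcp(text):
    """Naive suffix array and LCP array (lcp[0] = 0)."""
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        while lcp[k] < min(len(a), len(b)) and a[lcp[k]] == b[lcp[k]]:
            lcp[k] += 1
    return sa, lcp


def lcp_delimited_intervals(lcp):
    """Return (i, j, length, tf) for each non-trivial interval < i, j >,
    where length is the shared prefix length and tf = j - i + 1."""
    n = len(lcp)
    stack = [(0, 0)]                    # (lcp value, leftmost slot); bottom never pops
    found = []
    for k in range(1, n + 1):
        cur = lcp[k] if k < n else 0    # a final 0 flushes the open intervals
        left = k - 1
        while stack[-1][0] > cur:       # LCP decreased: close intervals ending at k-1
            length, start = stack.pop()
            found.append((start, k - 1, length, k - start))
            left = start                # a shorter prefix may extend back this far
        if stack[-1][0] < cur:          # LCP increased: open a new nested interval
            stack.append((cur, left))
    return found


sa, lcp = suffix_and_lcp("accactact$")
intervals = lcp_delimited_intervals(lcp)
```

For the sample sequence, the intervals come out in the same order and with the same LCP and TF values as the rows of Table 2.1. Each slot is pushed and popped at most once, so the scan is linear in the size of the LCP array.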

Computing the Document Frequency (DF)

Determining the document frequency (DF) is more difficult than computing the term frequency (TF). In [131], an algorithm was given to calculate DF in O(n log n) time and O(n) space. The algorithm (reproduced below in Figure 2.1) uses the procedure described above to get the lcp-delimited intervals. It uses a new array to map the first symbol of the substring to a document id based on the order of the suffix array index. Using a simple algorithm, we could search in each interval to determine the document frequency. This will, however, be too expensive, leading to O(n^2) time. Given that lcp-delimited intervals could be nested, we only need to check the extra range beyond the range already calculated. So the core of the algorithm is to check whether the document id of the current position has been previously computed. In Chapter 4, we modify this algorithm for an improved time complexity.

Figure 2.1. Algorithm for computing document frequency [131]
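For contrast with the algorithm of Figure 2.1, the simple per-interval search mentioned above can be sketched in a few lines of Python. This is the quadratic baseline, not the algorithm of [131]. It assumes that DF counts the distinct documents touched by an interval, which is how the document id check described above is read here, and it uses hypothetical end markers "#" and "$" to separate two toy documents.

```python
def naive_tf_df(sa, doc_of, i, j):
    """TF and DF for the suffix-array interval < i, j > (0-based, inclusive).
    TF is the interval width; DF is found by scanning every slot."""
    tf = j - i + 1
    df = len({doc_of[sa[k]] for k in range(i, j + 1)})
    return tf, df


docs = ["abab", "ab"]
terminators = ["#", "$"]     # hypothetical end markers, assumed not in the alphabet
text = "".join(d + t for d, t in zip(docs, terminators))          # "abab#ab$"
# doc_of[p] = id of the document that text position p belongs to.
doc_of = [d for d, doc in enumerate(docs) for _ in range(len(doc) + 1)]
sa = sorted(range(len(text)), key=lambda p: text[p:])

tf_ab, df_ab = naive_tf_df(sa, doc_of, 2, 4)   # slots of suffixes starting with "ab"
```

Because nested intervals rescan the same slots, the total work over all intervals can reach O(n^2), which is the cost that the nested-range check avoids.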


2.4 Circular Pattern Matching

2.4.1 String Pattern Matching

The pattern matching problem is to find the occurrences of a given pattern in a given text string. This is an old problem, which has been approached from different fronts, motivated by both its practical significance and its algorithmic importance. Matches between strings are determined based on the string edit distance. Given two strings T: t_1...t_n and P: p_1...p_m, over an alphabet Σ, the edit distance is the minimum number of edit operations needed to transform one string into the other. There are three basic edit operations: insertion of a symbol, deletion of a symbol, and substitution of a symbol with another symbol. We assume that the cost of each edit operation is unity. If two characters are identical, the cost of the match operation is zero. The substitution operation can also be considered as a mismatch operation. The exact string matching problem is to look for all occurrences of a pattern matching a substring of the text with zero edit operations. Various algorithms have been proposed for both exact and approximate pattern matching [3, 11, 20, 34, 46, 48, 64, 117].
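The unit-cost edit distance defined above can be computed with the standard dynamic programming recurrence. The following Python sketch is a textbook illustration rather than code from this work; it keeps only two rows of the table at a time.

```python
def edit_distance(a, b):
    """Minimum number of unit-cost insertions, deletions, and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                       # deleting the first i symbols of a
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,            # delete ca
                cur[j - 1] + 1,         # insert cb
                prev[j - 1] + (ca != cb),   # match (cost 0) or substitute (cost 1)
            ))
        prev = cur
    return prev[-1]
```

For example, edit_distance("accactact", "actact") is 3, since the shorter string can be obtained by deleting three symbols from the longer one.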

Grossi and Vitter [45] pointed out that a full-text indexing system is expected to support three basic types of queries. An existential query returns a binary value (true or false) indicating whether a pattern P occurs in the text T. A counting query returns n_occ (0 ≤ n_occ ≤ n), the number of occurrences of P in T. An enumerative query returns n_occ numbers, each indicating the starting position in T of an occurrence of P [3].

Exact String Matching

Three well-known efficient string matching algorithms with linear time complexity are the Knuth-Morris-Pratt (KMP) algorithm [64], the Karp-Rabin algorithm [59], and the Boyer-Moore (BM) algorithm [20]. Like KMP, the BM algorithm matches the pattern and the text by skipping characters that are not likely to result in exact matching with the pattern. Unlike the other methods, it compares the strings from right to left of the pattern. These algorithms need O(m) preprocessing for the pattern and search in O(n) time, or sometimes even sublinear time in practice. The total time will be O(n+m).
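As a concrete instance of O(m) preprocessing followed by an O(n) scan, the following is a minimal textbook sketch of KMP in Python. It is not taken from [64], and it reports 0-based start positions.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the O(m) preprocessing step)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the 0-based start positions of all occurrences of pattern."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    hits, k = [], 0                     # k = number of pattern symbols matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]             # shift the pattern without re-reading text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - len(pattern) + 1)
            k = fail[k - 1]             # continue, allowing overlapping matches
    return hits
```

The text pointer never moves backwards, which is what gives the O(n) search bound.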


A different approach to pattern matching based on bitwise operations was introduced by R. Baeza-Yates and G. Gonnet [13]. Here, the pattern is represented by a binary mask. Bit-wise SHIFT and AND operations, which are considered constant time, are used to find the patterns. Under this framework, SHIFT and AND correspond to the pattern movement and matching, respectively. The algorithm is effective for small patterns, when the pattern length is less than a computer word (say, 64 characters), which is usual for the text searching problem.
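The SHIFT and AND steps can be sketched as follows. This Python version is an illustration of the general technique rather than the code of [13]. Python integers are unbounded, so the word-length limit does not apply here, although the constant-time argument only holds while the pattern fits in a machine word.

```python
def shift_and_search(text, pattern):
    """Return the 0-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)               # bit m-1 set means the whole pattern matched
    state, hits = 0, []
    for j, ch in enumerate(text):
        # SHIFT advances every partial match by one symbol (and starts a new one);
        # AND keeps only the partial matches consistent with the current symbol.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

Bit i of the state is set exactly when the first i+1 symbols of the pattern match the text ending at the current position.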

When multiple patterns need to be searched, alternative algorithms are used, such as the Aho-Corasick algorithm [5], which uses a keyword tree for the set of patterns. In addition to multiple pattern matching, the suffix tree algorithm [46] is efficient when the pattern will be searched multiple times. There are linear time suffix tree construction algorithms available [89, 122, 128]. The search time for each pattern will be O(m). The total time for searching s patterns will thus be O(n + sm). This can be compared with the O(m + sn) that would be needed by algorithms such as KMP and BM.

Approximate String Matching

Algorithms for the approximate pattern matching problem can be grouped into three major categories: methods based on dynamic programming [114]; methods based on bit-wise operations [130]; and methods based on the longest common subsequence (LCS) [68]. The edit distance can be computed using dynamic programming. The path that leads to the minimal distance can easily be identified by adding trace-back pointers. The time complexity for computing the edit distance is generally in O(mn). When the number of allowed errors k is known, the edit distance can be computed in O(kn), for example using Ukkonen's deterministic finite state automaton (DFA) approach [121]. Typical approximate matching algorithms based on bit-wise operations are AGREP [130] and NRGREP [95]. These are obtained by extending the bit-wise exact matching algorithms. Another variation of pattern matching with errors is the k-mismatch problem, where only substitution operations are considered during the edit distance computation. In [68], the k-mismatch problem is solved using the suffix tree so that the LCS query can be answered in constant time. To find all the k-mismatch patterns of P in text T, we perform k-mismatch checks for every alignment of P, each time starting at a different character position in T.
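The alignment-by-alignment check for the k-mismatch problem can be illustrated with a direct Python sketch. It compares symbols one by one and therefore costs O(mn) in the worst case; it does not use the suffix tree queries of [68]. The function name is hypothetical.

```python
def k_mismatch_search(text, pattern, k):
    """Return the 0-based alignments where pattern matches text with at most
    k substitutions (no insertions or deletions)."""
    m, hits = len(pattern), []
    for j in range(len(text) - m + 1):  # one check per alignment of the pattern
        mismatches = 0
        for i in range(m):
            if text[j + i] != pattern[i]:
                mismatches += 1
                if mismatches > k:      # give up on this alignment early
                    break
        if mismatches <= k:
            hits.append(j)
    return hits
```

With k = 0 this reduces to exact matching; allowing one substitution on the sample text additionally admits the alignment at position 0, where "acc" differs from "act" in a single symbol.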


Dynamic programming is the most popular technique in approximate string matching (ASM), with applications ranging from the classical search for the longest common subsequence (LCS) [31] to multiple string alignment [25, 125]. It breaks a complex problem into subproblems, solves the subproblems first, and then integrates their solutions to obtain the optimal answer. The search time is O(mn) and the space requirement is O(m). This can be improved to O(kn/√|Σ|) [26] on average. The Smith-Waterman algorithm, one of the most famous algorithms in this category, was first proposed by Temple F. Smith and Michael S. Waterman in 1981 [116] for local alignment. The Needleman-Wunsch algorithm [96] is a general global alignment technique used in bioinformatics. Both algorithms are based on dynamic programming: the Smith-Waterman algorithm [116] is guaranteed to find the optimal local alignment, and the Needleman-Wunsch algorithm [96] the optimal global alignment.
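The O(mn) time and linear space bounds quoted above can be illustrated with the textbook two-row dynamic program for the edit distance. This is a generic sketch with unit costs for insertion, deletion and substitution; the function name `edit_distance` is chosen here for illustration and is not taken from any of the cited works.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string along the row to bound the space
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

Only the previous and current rows are kept, so the space is proportional to the shorter string, while every cell of the conceptual m × n table is still visited once.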

2.4.2 Circular Pattern Matching Problems

A circular shift is a mapping f^t : Σ* → Σ*, f^t(c_1...c_r) = c_{t+1}...c_r c_1...c_t, where 0 ≤ t ≤ r−1 and r is the length of the string c_1c_2...c_r. Thus, f^0(c_1...c_r) corresponds to the original string. Let [s] be the set of circular shifts of string s; then [s] = { f^i(s) | 0 ≤ i ≤ |s|−1 }. Given two circular strings s_1 and s_2, the circular edit distance between s_1 and s_2, ED_c(s_1, s_2), is the minimum number of edit operations needed to transform one member of [s_1] into one member of [s_2]. It is defined as ED_c(s_1, s_2) = min{ ED[f^i(s_1), f^j(s_2)] | 0 ≤ i ≤ |s_1|−1 and 0 ≤ j ≤ |s_2|−1 }, where |s_1| and |s_2| are the lengths of s_1 and s_2, and ED[A, B] is the standard edit distance. We consider two major problems related to circular pattern matching (CPM), defined as follows.
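To make these definitions concrete, the following sketch evaluates the circular shift, the set of circular shifts, and the circular edit distance directly from their definitions. It is a brute-force illustration only, and all function names are hypothetical; it simply tries every pair of shifts, which is far slower than the algorithms reviewed later.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard unit-cost edit distance ED[A, B]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def circular_shift(s: str, t: int) -> str:
    """f^t(s): move the first t symbols of s to the end."""
    return s[t:] + s[:t]


def circular_shifts(s: str) -> set[str]:
    """[s], the set of all circular shifts of s ({s} itself for the empty string)."""
    return {circular_shift(s, t) for t in range(len(s))} or {s}


def circular_edit_distance(s1: str, s2: str) -> int:
    """Minimum of ED over every pair of shifts, exactly as in the definition."""
    return min(edit_distance(x, y)
               for x in circular_shifts(s1)
               for y in circular_shifts(s2))
```

For example, `abcd` and `cdab` are at standard edit distance 4 but at circular edit distance 0, since `cdab` is itself a circular shift of `abcd`.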

Problem 1: Exact circular pattern matching (ECPM). Given a circular pattern P = P[1...m] and a text T = T[1...n], return all exact occurrences of the circular string [P], that is, of P and its circular shifts, inside text T. [P] matches text T at position j ∈ [1...n−m+1] ⇔ f^t(P) = T[j...j+m−1] for some t, 0 ≤ t ≤ m−1.
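A direct, unindexed reading of Problem 1 can be sketched as below. The function name `exact_circular_matches` is hypothetical, positions are reported 0-based rather than 1-based as in the problem statement, and the approach (comparing every window of T against the set of shifts of P) is only a baseline, not one of the indexed methods discussed in Section 2.4.3.

```python
def exact_circular_matches(text: str, pattern: str) -> list[int]:
    """Return 0-based positions j where text[j:j+m] equals some circular shift of pattern."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    # [P]: every circular shift f^t(P), 0 <= t <= m-1, stored in a hash set.
    shifts = {pattern[t:] + pattern[:t] for t in range(m)}
    return [j for j in range(len(text) - m + 1) if text[j:j + m] in shifts]
```

With text `xxbcaxxabc` and pattern `abc`, the windows at positions 2 (`bca`) and 7 (`abc`) are reported.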

Problem 2: Approximate circular pattern matching (ACPM1). Given text T = T[1...n], circular pattern P = P[1...m] and maximum error k, return “Matching” when the circular edit distance between text T and circular pattern P is less than or equal to k. Thus, the result will be the matching pair g = {(T, P) | ED_c(P, T) ≤ k, s.t. −k ≤ n−m ≤ k}.


This problem uses an existential query to look for a circular match between text T and circular pattern P, where −k ≤ n−m ≤ k. It compares two whole sequences and asks whether their circular edit distance is at most k.

We also consider a harder variation of the ACPM problem, with the relaxed constraint −k ≤ n−m. This variation is to find circular permutations of P inside T. It compares circular pattern P against substrings of T, looking for a circular edit distance of at most k. We define this variation more formally as Problem 3 below:

Problem 3: Approximate circular pattern matching problem 2 (ACPM2)

Given text T = T[1...n], circular pattern P = P[1...m] and maximum error k, return all positions where the circular string [P] matches text T with at most k errors. [P] is said to be a k-approximate match with text T at position j ∈ [1...n−m−k+1] if ED_c(P, T[j...j+m]) ≤ k, where 0 ≤ t ≤ m−1, −k ≤ n−m.

Compared with the ECPM problem, the ACPM2 problem looks for all approximate matches within the given maximum error k. The ACPM1 problem looks for an approximate match between two whole strings, text T and circular pattern P, whereas the ACPM2 problem looks for all approximate matches between every substring of text T and the circular pattern P and its circular shifts. The ACPM2 problem is the hardest of these three problems.
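As a rough baseline for the ACPM2 setting, one can run a standard approximate-matching dynamic program (free start in the text) once for each circular shift of P. The sketch below is an assumption-laden illustration: the function name `approx_circular_match_ends` is hypothetical, and it reports 0-based end positions of approximate occurrences rather than the start positions used in the problem statement, which is a simplification of this sketch. Its running time is O(m^2 n).

```python
def approx_circular_match_ends(text: str, pattern: str, k: int) -> list[int]:
    """0-based end positions in text where some circular shift of pattern
    matches a substring of text with edit distance at most k."""
    m = len(pattern)
    ends = set()
    for t in range(m):
        shift = pattern[t:] + pattern[:t]
        # col[i]: least edit distance between shift[:i] and any substring of
        # text that ends at the current text position (free starting point).
        col = list(range(m + 1))
        for j, c in enumerate(text):
            new = [0]  # the empty prefix matches the empty substring anywhere
            for i in range(1, m + 1):
                cost = 0 if shift[i - 1] == c else 1
                new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + cost))
            col = new
            if col[m] <= k:
                ends.add(j)
    return sorted(ends)
```

With k = 0 the sketch degenerates to exact circular matching: for text `xxbcaxx` and pattern `abc`, the only reported end position is 4, where the shift `bca` ends.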

An extension of the CPM problem is the All-Against-All variant.

Problem 4: All-Against-All CPM problem

Given SeqDB, the database of sequences, the All-Against-All CPM problem is to compare each sequence in the database with every other sequence in the database for possible circular matches.

Similar to the standard CPM problem, we can also consider the ECPM and ACPM versions for the All-Against-All CPM problem.


2.4.3 Exact Circular Pattern Matching (ECPM)

Given a pattern P and a text T, the exact circular pattern matching problem is to find the positions in T that match a circular permutation of P. The exact circular pattern matching problem was first studied by Booth [19] in 1980, who proposed an algorithm to detect the lexicographically smallest conjugate of a word. Improved methods were proposed in [10, 36]. However, the focus of these algorithms was on the canonical rotation(s) of a word; ECPM was a particular case of that problem. The swap pattern matching problem [7], which looks for an exact match of a swapped pattern P in text T, is related to the ECPM problem. We can see that the ECPM problem is a particular case of the swap pattern matching problem. Gusfield [46] discussed the ECPM problem as an end-of-chapter exercise but did not provide an explicit solution. Shiloach [115] provided an algorithm for the ECPM problem. However, both address only the online version of the ECPM problem, which does not use indexing to preprocess the text T.

Our work is perhaps more closely related to that of Iliopoulos et al. [51], who proposed two algorithms that solve the ECPM problem using indexing. These two algorithms are based on suffix structures, and their time complexities are O(m log log n + n_occ) and O(m log n + n_occ) respectively, where n_occ is the number of occurrences.

The first algorithm builds a new data structure, CPI-I, to index circular patterns. Two steps are used to find the circular pattern. First, the algorithm indexes the text T in two suffix trees, ST_T and ST_T̄, where T̄ is the reverse of T. It also maintains two linked lists, LL(R) for ST_T and a corresponding list for ST_T̄, which hold all the leaf nodes from left to right in the respective trees. Secondly, the algorithm searches for the two parts Q_1 and Q_2 of each permutation of pattern P in ST_T and ST_T̄ respectively. ST_T returns the occurrences of Q_1 and ST_T̄ returns the occurrences of Q_2. The algorithm finds the intersection of these two sets by using the two linked lists.

The construction time and space complexity for this algorithm is O(n log^{1+ε} n), where 0 < ε < 1 and n is the length of text T. The query time complexity is O(m log log n + n_occ).

The second algorithm uses another new structure, CPI-II, to address the ECPM problem. The data structure CPI-II is built from the suffix array SA, the inverse suffix array SA^{−1}, and arrays Pre and Suf, where Pre is an array for the prefixes of pattern P and Suf is an array for the suffixes of pattern P. There are three steps involved. First, compute the interval of each prefix of pattern P into array Pre and the interval of each suffix of pattern P into array Suf. There are m prefixes and m suffixes in P. The interval for each prefix or suffix is calculated from that of the previous prefix or suffix, so each takes O(log n) time using the suffix array SA and inverse suffix array SA^{−1}. Secondly, for each circular permutation of the pattern, which can be constructed as P[m−i]P[i] where 1 ≤ i ≤ m, the algorithm finds the intervals of P[m−i] and P[i] in Suf[m−i] and Pre[i] respectively. Thirdly, output the intersection of the intervals of P[m−i] and P[i].

The space complexity is O(n) bytes when this algorithm is implemented using the suffix array. The time complexity for answering a query is O(m log n + n_occ). When implemented using the compressed suffix array [3, 45], its space complexity will be O(n log n) bits, but the time complexity for queries increases to O(m log^2 n + n_occ).

2.4.4 Approximate Circular Pattern Matching (ACPM)

The ACPM problem is to find k-approximate matches between circular pattern [P] and text T. The naive method for the ACPM problem is to calculate the edit distance between T and each of the circular strings f^t(P), where m is the length of the pattern and 0 ≤ t ≤ m−1. Thus the dynamic programming procedure will be run m times. The time complexity of the naive algorithm to compute ED([P], T) is O(m^2 n).

Maes [81] published a “divide and conquer” algorithm to compute ED([P], T) in O(mn log m) time. Up to now, this is the best theoretical result for computing the edit distance between a circular pattern and a text. Given the significance of Maes' algorithm, we present the details below.

In theory, Maes' algorithm [81] is the best algorithm. It uses “divide and conquer” to calculate the edit distance over a dynamic programming table. The algorithm constructs an edit graph between text T and string PP (Figure 2.2 [87]), where PP is the concatenation of pattern P with itself. In this edit graph, let path_{(x_1,y_1)−(x_2,y_2)} be a path from vertex (x_1, y_1) to vertex (x_2, y_2). For each vertex (x, y) on this path, we have x_1 ≤ x ≤ x_2 and y_1 ≤ y ≤ y_2. The edit distance between T and f^i(P) can be computed in the subgraph by following a path from vertex (0, i) to vertex (n, i+m−1), where 0 ≤ i ≤ m−1. Let Path_i be an optimal edit path between f^i(P) and T in the edit subgraph, that is, a path of minimum cost from vertex (0, i) to vertex (n, i+m−1). Maes' algorithm is based on an important observation on the edit graph: if Path_i and Path_j are each an optimal edit path, then Path_i and Path_j cannot cross each other, where 0 ≤ i < j ≤ m−1. We can see that when Path_i and Path_j have a crossing point, say at (x, y), the edit distance of path_{(0,i)−(x,y)} is less than the edit distance of path_{(0,j)−(x,y)} whenever i < j. Thus Path_j is no longer an optimal path. The set of paths {Path_t | i < t ≤ j} then contains no optimal path, because each Path_t has to cross Path_i at the point (x, y).

Figure 2.2. Edit Graph of T and PP [87].

Based on the above observations, the algorithm calculates Path_i and Path_j first, where i < j. If the two paths do not cross each other, there is an optimal edit path between Path_i and Path_j. Next, the algorithm calculates the path Path_t in between Path_i and Path_j, where i < t < j. In this case, the time for calculating Path_t is O((j−i) × n). When Path_i and Path_j cross each other, we do not need to calculate the path set {Path_t | i < t ≤ j} anymore, because there is no optimal path between them.

Following this idea, the algorithm calculates all optimal paths starting from Path_0 and Path_m, where Path_m is the optimal path from (0, m) to (n, 2m−1), and Path_0 and Path_m are parallel paths. Figure 2.3 illustrates this algorithm. Step (1) of Figure 2.3 shows this, with a time cost of O(mn). In the second step (step (2) of Figure 2.3), the algorithm computes the optimal path Path_{m/2} between Path_0 and Path_m with time complexity O(mn). In the third step (step (3) of Figure 2.3), two optimal paths, Path_{m/4} and Path_{3m/4}, are computed. The time cost for computing each path is O(mn/2), hence the time complexity of this step is O(mn) too. The fourth step (step (4) of Figure 2.3) calculates four paths, Path_{m/8}, Path_{3m/8}, Path_{5m/8} and Path_{7m/8}. The time cost of computing each path is O(mn/4), so the time complexity of this step is O(mn) too. Continuing in this way, there are O(log m) steps and the time cost of each step is O(mn). Even when there are some crossing points, the time for calculating each step is still O(mn). Thus the total time complexity is O(mn log m). After obtaining all optimal paths, the minimum edit distance between text T and circular pattern P can be computed.
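The step-by-step costs above can be summarized in one line. Writing ℓ for the step number after the initial computation of Path_0 and Path_m (so ℓ = 1 computes one path, ℓ = 2 computes two paths, and so on), and assuming as in the text that each of the 2^{ℓ−1} paths at step ℓ costs O(mn/2^{ℓ−1}), the total is

$$
T(m,n) \;=\; O(mn) \;+\; \sum_{\ell=1}^{\lceil \log_2 m \rceil} 2^{\ell-1} \cdot O\!\left(\frac{mn}{2^{\ell-1}}\right)
\;=\; O(mn) \;+\; \lceil \log_2 m \rceil \cdot O(mn)
\;=\; O(mn \log m).
$$

The index ℓ and the ceiling on the number of steps are notation introduced here for the summary; the per-step costs are those stated in the description of Figure 2.3.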

Figure 2.3. Maes’ Algorithm

Gregor et al. [44] gave an O(m^2 n) algorithm; however, this is a data-dependent algorithm, and in practice it may achieve O(mn) time complexity on average. Oncina [98] presented an algorithm with the same time complexity as that of Gregor et al. [44]. Marzal et al. [87] provided a branch and bound algorithm based on Maes' algorithm [81]. Its worst case time complexity is the same as that of Maes' algorithm, but it is more efficient on average.

The above methods all produce complete results in the calculation of the circular edit distance. Some studies [22, 90–92] also present suboptimal algorithms with reduced time complexity that run in O(mn), but with the possibility of missing some results. Bunke and Buhler [22] presented a suboptimal algorithm whose time complexity is O(mn). Mollineda [90–92] published two algorithms based on the Bunke algorithm [22] and showed experimentally that the suboptimal solutions are almost as good as their optimal counterparts.


We note that all the above methods on the ACPM problem have only considered the ACPM1 variant. To our knowledge, there has been no published work addressing the more challenging ACPM2 problem.

2.4.5 ACPM Problem in Protein Sequences

A number of studies have been reported on algorithms for detecting circular permutations in protein sequences [40, 47, 55]. The first method [54] used the dot matrix and human visualization to identify circular relationships between protein sequence pairs. The work in [6] used a dictionary method to find short fragments common to the protein sequence pairs and used human visualization to report the best local matches.

Needleman and Wunsch [96] proposed a method for global alignment between two protein sequences. The global alignment algorithm measures the number of edit operations (insertion, deletion, and substitution) needed to transform one sequence into another. Uliel et al. [123, 124] introduced a method to detect circular permutations in protein sequences using global alignment [96]. They gave an O(m^3) time algorithm to find the complete set of matching circular permutations. They also proposed a greedy algorithm with O(m^2) time complexity, which, however, could miss some valid circular permutations in the text T. Weiner et al. [126, 127] proposed another greedy method that runs in O(m^2) time. They focused on circular multidomain proteins, where the alphabet symbols are now the protein domain blocks rather than the traditional protein symbols. Thus, |Σ| could be quite large, of the order of 20^q, where q is the length of the domain blocks. This is the first application of the CPM problem in studying multidomain proteins. However, they did not consider the problems posed by the expanded alphabet.

The algorithm of Uliel et al. [123, 124] used a simple method that calculated the edit distance [72] between the text and one of the circular permutations using the Needleman and Wunsch algorithm [96]. This was repeated m times to cover all the circular permutations, with O(m^3) time complexity and O(m^2) space complexity for each sequence used as a pattern P against the other protein sequences. The greedy algorithm of Uliel et al. [123, 124] modified the local alignment algorithm to find the best local alignment in a 2m × n matrix. This algorithm is similar to the Smith-Waterman local alignment algorithm [116]. It is not guaranteed to find all circular matches, and thus may miss some valid matches. The algorithm of Weiner et al. [126, 127] is also a greedy algorithm and thus could miss some valid matches. They concatenated the text T with itself as TT and the pattern P with itself as PP, and constructed a 2n × 2m matrix using the Needleman and Wunsch algorithm [96]. In the verification phase, a circular match must satisfy certain conditions defined on the 2n × 2m matrix [126, 127].

More fundamentally, both groups [123, 124, 126, 127] that have studied CPM in protein sequences have focused on comparing one whole sequence with another whole sequence. In their experiments, they had to group the protein sequences by length, and used the dissimilarity in lengths for initial pruning. These methods ignored the fact that a shorter circular protein sequence could be part of the functional region of a much larger multidomain protein. This, however, could be a key consideration in function prediction for multidomain proteins. Further, as with the more theoretical algorithms for the ACPM problem, the methods for protein sequences [123, 124, 126, 127] considered only the ACPM1 problem.

2.5 Pattern Discovery Problem

Pattern discovery is a well studied problem in computational biology and data mining, and various methods have been proposed. The basic method is to identify short sequences that tend to be over-represented within a given set of sequences. Mining sequential patterns was studied in [30, 132]. Motif discovery methods in bioinformatics are surveyed in [112]. Algorithms for the discovery of proximity patterns were proposed in [12]. Proximity pattern discovery is closely related to the more recent notion of a “complex motif”, which is defined as a composite motif whose individual components are constrained to be within a specified separation distance. Perhaps a more closely related work is the method of pattern discovery using mutable permutation patterns [49]. However, although permu-patterns offer a lot of flexibility in the match, ignoring the order of the patterns still does not handle the problem of possible cyclic relations between patterns. Most efforts in pattern discovery have been invested in studying the statistical significance of the patterns (see, e.g., [85, 86]) and, in the case of biological applications, the biological relevance of the discovered patterns [112]. There has not been much attention on the pattern matching problem involved, which forms the basis of pattern discovery.

Chapter 3

The Virtual Suffix Tree

3.1 Introduction

Our proposed data structure is most closely related to the ESA and LST. The virtual suffix tree can be constructed in the same time and space bounds as the suffix tree. It also supports basic search operations in the same time and space bounds as the suffix tree. However, the VST requires much less space in practice than the suffix tree. Its space requirement (12.05n bytes using the compact form) is generally smaller than that of the ESA and LST, the other closely related data structures (each requires 20n bytes). Other related data structures that have been proposed include the suffix cactus [56], suffix vectors [93, 103], compact suffix trees [82], lazy suffix trees [41], level-compressed suffix trees [8], compressed suffix trees [94], and compressed suffix arrays [45]. See also [3].

Main results.¹ We introduce a new data structure, the virtual suffix tree (VST), an efficient data structure for suffix trees and suffix arrays. The VST stores neither the lcp array nor the lcp-intervals, but rather exploits the inherent nature of the suffix tree topology. We state our main results in the form of two theorems about the VST.

¹ Part of the work reported in this chapter has been published in the following papers: [78, 79].


Theorem 3.1: Given a string T = T[1..n], with symbols from an alphabet Σ, and the virtual suffix tree for T, we can count the number of occurrences of a pattern P = P[1..m] in T in O(m log |Σ|) time, and locate all the η_occ occurrences of P in T in O(m log |Σ| + η_occ) time.

Theorem 3.2: Given a string T = T[1..n], with symbols from an alphabet Σ, the virtual suffix tree, including the suffix links, can be constructed in O(n) time and O(n) space, independent of Σ.

Essentially, the VST provides the same functionality as the suffix tree, but at a much smaller space requirement. It has the same linear time construction for large |Σ|, requires O(n) space to store, and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. To provide the complete functionality of the suffix tree, we describe a simple linear time algorithm that computes the suffix links based on the VST. We present two algorithms for VST construction. The first algorithm builds the VST from the suffix tree, which in turn is generated from the suffix array. The second algorithm eliminates the need for the suffix tree construction step, and thus builds the VST directly from the suffix array. Although the space needed for the VST is linear (as in suffix tree implementations using linked lists or binary trees), the practical space requirement is much smaller than that of a suffix tree. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the ESA [2] and the LST [62]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form. This can be compared with the 20n bytes needed by the LST or the ESA.

Organization. In Section 2, we introduce the basic data structure and discuss the properties of the VST. Section 3 presents an improved data structure, along with algorithms for its construction. A complexity analysis of the construction and use of the VST is also presented in this section. Section 4 shows how the suffix links can be constructed on the VST. In Section 5, we eliminate the need to construct the suffix tree, and show how the VST can be constructed directly from the suffix array. We summarize in Section 6.


3.2 Basic Data Structure

Starting from the suffix array, we construct an efficient data structure to simulate the suffix tree (ST). We call this structure a virtual suffix tree (VST). The VST stores information about the basic topology of the suffix tree, the suffix array, and the suffix links. Thus, the VST is represented as a set of arrays that maintains information on the internal nodes of the suffix tree. The leaf nodes are not stored directly. However, whenever needed, information about any leaf node can be obtained via the suffix array. Unlike the ESA and LST, the VST neither uses the lcp-interval tree nor stores the lcp array. We call the data structure a virtual suffix tree in the sense that it provides all the functionalities of the suffix tree using the same space and time complexity as a suffix tree, but without storing the actual suffix tree. Later, we show that the VST leads to a more compact representation of suffix trees and suffix arrays. (We mention that [60] also used the term “virtual suffix tree”, but for a limited form of the enhanced suffix array.)

Below, we present the basic VST. This structure will require 14 bytes for each node in the VST and supports pattern matching in O(m log |Σ|) time, for an m-length pattern. In the next section, we present an improved data structure that reduces the space cost by eliminating the need to store edge lengths, while still maintaining O(m log |Σ|) time for pattern matching. We also describe a more compact structure for the VST that uses only 10 bytes for each internal node of the VST, and 5 bytes for each leaf node. Pattern matching on this compact representation will, however, be in O(m|Σ|) time.

Each node in the VST corresponds to a distinct internal node in the suffix tree. In its basic form, each node in the VST is characterized by five attributes. For a given node in the VST (say node u), with a corresponding internal node in ST (say node u_ST), the five attributes are defined as follows.

• sa_index: the index in the suffix array (SA index) of the leftmost leaf node under the internal node u_ST of the suffix tree.

• fchild: the node ID of the first child node of u_ST that is also an internal node. (Scanning is done left to right; edges at a node are also sorted left to right in ascending lexicographic order.) If node u is a leaf node in the VST, the value will be negative, and its absolute value will point to the first child node of the next internal node in the VST.

• elength: the edge length of the edge (v, u) in the VST, or equivalently (v_ST, u_ST) in the suffix tree, where v is the parent node of u and v_ST is the parent node of u_ST.

• nfleaf: the number of child leaf nodes before the first child of u_ST that is also an internal node.

• nnleaf: the number of sibling leaf nodes after u_ST, the current internal node of the suffix tree, but before the next sibling internal node.

In terms of storage, the sa_index, fchild and elength each require one integer (4 bytes), while nfleaf and nnleaf each require one byte of storage (assuming |Σ| ≤ 256).
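The 14-byte figure follows directly from this layout: 3 × 4 + 2 × 1 = 14. As a hedged illustration, the five attributes could be packed as below. The class name `VSTNode`, the little-endian unpadded format string, and the choice to encode a node ID such as N5 as the signed integer 5 are assumptions of this sketch, not details given in the text.

```python
import struct
from dataclasses import dataclass

# Three signed 4-byte integers followed by two unsigned bytes; "<" disables padding.
NODE_FORMAT = "<iiiBB"


@dataclass
class VSTNode:
    sa_index: int  # SA index of the leftmost leaf under the node
    fchild: int    # ID of the first internal child; negative for a VST leaf
    elength: int   # length of the edge from the parent
    nfleaf: int    # leaf children before the first internal child
    nnleaf: int    # sibling leaves after this node, before the next internal sibling

    def pack(self) -> bytes:
        return struct.pack(NODE_FORMAT, self.sa_index, self.fchild,
                           self.elength, self.nfleaf, self.nnleaf)

    @classmethod
    def unpack(cls, raw: bytes) -> "VSTNode":
        return cls(*struct.unpack(NODE_FORMAT, raw))


NODE_SIZE = struct.calcsize(NODE_FORMAT)  # 14 bytes per node
```

Using one-byte counters for nfleaf and nnleaf is what ties the layout to the assumption |Σ| ≤ 256; a larger alphabet would need wider fields.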

3.2.1 Example VST

We use an example sequence to explain the above definitions. The suffix tree and VST for the string missississippi$ are shown in Figure 3.1. Note that the string missississippi$ is made intentionally different from mississippi$, to capture some of the cases involved in a VST. Only the internal nodes (dark nodes) are explicitly stored in the VST. The leaf nodes (empty circles) are not stored. The order of storage is based on the node-depths, from top to bottom. Table 3.1 shows the corresponding values of the VST node attributes for each VST node in the example.

3.2.2 Properties of the Virtual Suffix Tree

We can trace the properties of the VST based on the standard properties of a suffix tree.

1. The VST only stores the internal nodes of the suffix tree. No leaf nodes in the ST are represented in the VST. Information about the leaf nodes can be obtained from the SA when needed. The space requirement of the VST therefore depends on the topology of the suffix tree, or more specifically, on the number of internal nodes.

2. The number of leaf nodes in a suffix tree is n. The number of internal nodes in the suffix tree (and hence the number of nodes in the VST) is at most n.

3. The VST stores only the SA index of the leftmost leaf nodes and information about the child nodes.

4. For a given node in the VST, the number of child nodes will be no larger than |Σ|. Thus, the time needed to match a symbol is at most O(log |Σ|).

5. The nodes in the VST are ordered based on the internal nodes of the suffix tree using the hierarchy sequential access method (HSAM). The child nodes of any given node will be stored sequentially. The child nodes of two nearby nodes will therefore be stored in nearby locations. This is an important property for addressing problems involving locality of reference.

We introduce further definitions needed in the description below. For a given node u in the VST, we use the term prior node to denote the node that appears before the current node u in the HSAM ordering. Similarly, next node denotes the node that appears after the current node u in this ordering. We use lsa_index (left sa_index) to denote the SA index of the leftmost leaf node that is a descendant of u. Similarly, rsa_index (right sa_index) denotes the SA index of the rightmost leaf node that has u as its ancestor. Figure 3.2 shows an example.

It is simple to determine the lsa_index and the leftmost child node of any given node. The properties and the organization of the VST lead to the following lemma about the VST:

Table 3.1. VST node attributes for the example sequence T = missississippi$ used in Figure 3.1.

node      root  N1   N2   N3   N4   N5   N6   N7   N8   N9
sa_index  0     1    7    9    3    9    12   4    10   13
fchild    N1    N4   −N5  N5   N7   N8   N9
elength   0     1    1    1    3    1    2    3    3    3
nfleaf    1     2    2    0    1    1    1    2    2    2
nnleaf    0     1    0    0    0    0    0    0    0    0


Lemma 1: For a given node in the VST, its rightmost child node and the right sa_index can each be determined in constant time.

Proof: Let u be the current node in the VST, with parent node v. Let w be the next node in the HSAM ordering. By property 5, if w is an internal node in the VST, the rightmost child (rchild) node of u will be the prior node to the leftmost child node of w. If w is a leaf node in the VST, then w.fchild will point to the next node after u's rightmost child node. Then the time to determine the rightmost child node will be O(1).

For the right sa_index, if u has a next sibling node, say w, the right sa_index of u will be the left sa_index of this sibling node w, minus the nnleaf of u, minus one. If u does not have a next sibling node, the right sa_index of u will be the right sa_index of node v (u's parent) minus the nnleaf of u. That is,

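$$
u.\mathrm{rsa\_index} =
\begin{cases}
w.\mathrm{sa\_index} - u.\mathrm{nnleaf} - 1, & \text{if } u \text{ has a next sibling } w \\
v.\mathrm{rsa\_index} - u.\mathrm{nnleaf}, & \text{otherwise}
\end{cases}
\tag{3.1}
$$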

Thus the time required to determine the right sa_index is O(1). □
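Eqn (3.1) translates into a two-branch function. The sketch below is illustrative only: the function name `rsa_index` and its keyword arguments are hypothetical, and the caller is assumed to already know whether u has a next sibling.

```python
from typing import Optional


def rsa_index(u_nnleaf: int,
              next_sibling_sa_index: Optional[int] = None,
              parent_rsa_index: Optional[int] = None) -> int:
    """Right sa_index of node u following Eqn (3.1)."""
    if next_sibling_sa_index is not None:
        # u has a next sibling w: step back over w's leftmost leaf and
        # over the nnleaf sibling leaves that sit between u and w.
        return next_sibling_sa_index - u_nnleaf - 1
    if parent_rsa_index is None:
        raise ValueError("need the parent's rsa_index when u has no next sibling")
    # u is the last internal child: start from the parent's right boundary.
    return parent_rsa_index - u_nnleaf
```

With the values of Table 3.1, node N5 (nnleaf 0, next sibling sa_index 12) gets right boundary 11, and node N1 (nnleaf 1, next sibling sa_index 7) gets right boundary 5. Both branches use a constant number of arithmetic operations, which is the content of the lemma.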

3.2.3 Pattern Matching on VST

Lemma 1 provides an indication of how pattern matching can be performed on the VST. For pattern matching using the suffix tree, an important issue is how to quickly locate all the child nodes for a given internal node. In the VST, each node points to its leftmost leaf node using the sa_index. During pattern matching, at any given node in the VST, we will need to determine four parameters, namely the leftmost child node (lchild), the rightmost child node (rchild), the left sa_index (lsa_index) and the right sa_index (rsa_index). These parameters define the boundaries of the search at the given node. To search in a leaf node of the VST, we will need only the left sa_index and right sa_index of the node. When we search in an internal node, we will need all four parameters to match a pattern. Lemma 1 shows that for any given node, we can determine each of these parameters in constant time. The following examples illustrate the two cases involved in computing the rsa_index, and how pattern matching can be performed on the VST.

Example: Determining the right boundary from a next sibling node. Consider node N5 in Figure 3.2. The left sa_index of N5 is 9 and the right sa_index is 11, since N5.sa_index = 9 and the next node N6 has N6.sa_index = 12, and hence the right sa_index of N5 is 12 − 1 = 11. The leftmost child node is the fchild of the current node, thus the leftmost child of N5 is N8. The next node after the rightmost child node is N6.fchild = N9. Then the rightmost child node is the node just before N9, namely N8, since the child nodes of sibling nodes are stored side by side.

Example: Determining the right boundary from the right boundary of the parent node. Consider node N1 in Figure 3.2. The left sa_index of N1 is N1.sa_index = 1. The right sa_index of N1 is N2.sa_index − N1.nnleaf − 1 = 7 − 1 − 1 = 5. The leftmost child node of N1 is N1.fchild = N4. The next node after N1 is N2. Since N2.fchild = −N5 is negative, N2 must be a leaf node in the VST. We therefore know that the next node after the rightmost child node of N1 will be N5. Finally, the rightmost child node of N1 is the node just before N5, namely N4.

We summarize the foregoing discussion as the first main result of this work:

Theorem 3.1: Given a string T = T[1..n] of length n, with symbols from an alphabet Σ, and the virtual suffix tree for T, we can count the number of occurrences of a pattern P = P[1..m] in T in O(m log |Σ|) time, and locate all the η_occ occurrences of P in T in O(m log |Σ| + η_occ) time.

Proof: The theorem is a consequence of Lemma 1. First consider the cost of one single symbol-by-symbol comparison at a node in the VST. The number of child nodes at any internal node can be no larger than |Σ|, and we can find the boundaries of the search in constant time. Since the edges are ordered lexically at each internal node, and given the HSAM ordering, matching a single symbol can be done in O(log |Σ|) time steps using binary search. To find the first match, we need to consider the m symbols in the pattern. We perform the above symbol-by-symbol comparisons at most m times to decide whether there is a match or not. After a match is found, we can again use binary search (using lsa_index and rsa_index as bounds) to determine all the η_occ occurrences of the pattern. Reporting each occurrence can be done in constant time, or an additional η_occ time for all the occurrences. □
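The role of the lsa_index and rsa_index bounds can be illustrated on a plain suffix array. The sketch below is a stand-in, not the VST search itself: it builds the suffix array by naive sorting and uses binary search over whole suffixes, which costs O(m log n) per query rather than the O(m log |Σ|) of the theorem. The function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array: start positions of suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Half-open SA interval [left, right) of the suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


text = "missississippi$"
sa = build_suffix_array(text)
left, right = sa_interval(text, sa, "ssi")
occurrences = sorted(sa[left:right])  # 0-based text positions of the pattern
```

For the example string, the pattern `ssi` occupies SA positions 12 to 14, so counting is `right - left` and the occurrences are read off `sa[left:right]` in time proportional to their number.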


3.3 Improved Virtual Suffix Tree

The basic data structure introduced above stores the length of each edge in the VST. We can improve the structure to reduce the space requirement by avoiding the need to store information about the edge lengths directly. The improved data structure has only four attributes rather than five. The attributes sa_index and elength in the basic structure are now combined into one attribute called the adjusted SA index (asa_index). This requires a key modification to the suffix tree, leading to an important distinction between the suffix tree and the virtual suffix tree.

3.3.1 Adjusting Edge Lengths

A well-known property of the suffix tree is that no two edges out of a node in the tree can start with the same symbol. For efficient representation of the VST, this characteristic of the ST is modified such that, for a given node, every edge that leads to an internal node in the VST has an equal length. This modification is done as follows: Start from the root node and progress towards the leaf nodes in the VST. For a given internal node, say u, adjust the edge label from u to each of its children such that all edges that lead to an internal node will have the same edge length. The main criterion is that, for two sibling internal nodes, their edge labels differ only in the last symbol. If for some edge, say (u, w), the original edge length (or edge label) is longer than the new length, prepend the extraneous part of the old label(u, w) to each outgoing edge from w. The edge lengths for edges that lead to leaf nodes are left unchanged. Then repeat the adjustment at each child node of u. Figure 3.3 shows an example of this procedure. Observe that this adjustment only affects the edge lengths, and does not change the general topology of the suffix tree.

The above adjustment procedure leads to an important property of the VST:

Property: In the improved VST, all internal sibling nodes occur at the same node-depth, and same string-depth, and the edge labels for the edges from the parent to each sibling differ only in the last symbol. This means that, in the VST, two branches from the same node can start with the same symbol, but their edge labels will differ.

This property provides an important difference between the suffix tree and the VST. The suffix tree mandates that no two edges from the same node have the same starting symbol. Further, the suffix tree only guarantees that the node-depth of two sibling nodes is the same, but not their string depth. This property of equal-length sibling edge labels is the key to more efficient representation of the VST, without explicit edge labels. Figure 3.4 shows an example of the modified suffix tree with equal-length edges for sibling nodes that are also internal nodes, and the corresponding improved virtual suffix tree. Table 3.2 shows the corresponding values of the attributes for each node in the improved VST. What remains is how we compute asa_index, the adjusted SA index. This is done by combining the original sa_index with elength.

Lemma 2: Given a node in the VST, say u, and its parent node (say v), we can compute the adjusted SA index in constant time. Further, when required, the edge length can be determined in constant time.

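Proof: Computing the adjusted edge length (new_elength) and the adjusted SA index (asa_index) can be done using the following relations:

$$
u.\mathit{asa\_index} =
\begin{cases}
u.\mathit{new\_elength} + u.\mathit{sa\_index} & \text{if } u = v.\mathit{fchild} \text{ and } u.\mathit{new\_elength} \neq 1 \\
u.\mathit{sa\_index} & \text{otherwise}
\end{cases}
\tag{3.2}
$$

$$
u.\mathit{sa\_index} = u.\mathit{fchild}.\mathit{sa\_index} - u.\mathit{nfleaf}
\tag{3.3}
$$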

At the time of VST construction, we calculate asa_index from bottom to top. For leaf nodes in the VST, we already know the sa_index and new_elength, so we can calculate the asa_index from Eqn (3.2). When the node u is an internal node in the VST, we first obtain u.sa_index from Eqn (3.3), since we know u.nfleaf and u.fchild.sa_index. Then we determine u.asa_index from Eqn (3.2). The new edge length is not stored explicitly in the VST nodes, but can be computed in constant time whenever needed (for instance, during pattern matching) by simply changing the subjects in Eqns (3.2) and (3.3). This is possible since at this time we already know u.asa_index for each node in the VST. □
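Eqns (3.2) and (3.3) can be checked against the values listed in Table 3.2. The sketch below is illustrative only: the function names and the flat integer arguments are assumptions, not the node layout used in the VST itself.

```python
def sa_index_of_internal(fchild_sa_index, nfleaf):
    """Eqn (3.3): sa_index of an internal VST node from its first child."""
    return fchild_sa_index - nfleaf


def asa_index(sa_index, new_elength, is_first_child):
    """Eqn (3.2): fold the edge length into sa_index for a first child."""
    if is_first_child and new_elength != 1:
        return new_elength + sa_index
    return sa_index


# Values taken from Table 3.2 for T = missississippi$.
# N7 is a VST leaf and the first child of N4; N4 is the first child of N1.
n7_sa = 4
n4_sa = sa_index_of_internal(n7_sa, nfleaf=1)   # N4.nfleaf = 1
n1_sa = sa_index_of_internal(n4_sa, nfleaf=2)   # N1.nfleaf = 2

n7_asa = asa_index(n7_sa, new_elength=3, is_first_child=True)
n4_asa = asa_index(n4_sa, new_elength=1, is_first_child=True)
n9_asa = asa_index(13, new_elength=2, is_first_child=True)
```

For N7 and N9 the results agree with the entries 4+3=7 and 13+2=15 in the asa_index row of the table, while N4 keeps its sa_index because its adjusted edge length is 1.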

Thus, while we store only the asa_index, our calculations will still use the original sa_index. However, this can be derived from the asa_index in constant time. In fact, we can observe that in practice, we need to compute the asa_index for only the leftmost child node at each node-level, while keeping the original sa_index for all other nodes. To determine the new_elength for these other nodes, we simply make a constant time access to their leftmost (sibling) node (at the same node-level), and then use this to compute the length. For searching with the VST, we will calculate the length of the common string at each level. If the length is greater than 0, then we know there is a common string in the edge labels for the child nodes and only the last character is different. Thus, we do not need to store the edge lengths explicitly, leading to a reduction of one integer per node over the basic VST.

Table 3.2. Node attributes in the improved VST for the example sequence, T = missississippi$.

NodeName: root N1 N2 N3 N4 N5 N6 N7 N8 N9
sa_index: 0 1 7 9 3 9 12 4 10 13
fchild: N1 N4 -N5 N5 N7 N8 N9
new_elength: 0 1 1 1 1 1 1 3 1 2
nfleaf: 1 2 2 0 1 1 1 2 2 2
nnleaf: 0 1 0 0 0 0 0 0 0 0
asa_index: 0 1 7 9 3 9 12 4+3=7 10 13+2=15

We have included new_elength so that one can compare with elength in Table 3.1. However, in practice this will not be stored in the VST.

3.3.2 Construction Algorithm

Construction of the VST makes use of an array Q which records the internal nodes of the suffix tree. This array maps the internal nodes of the suffix tree to nodes in the VST. Thus, elements in the array are in the same ordering as the corresponding nodes in the VST.

Given an input string T, the first step is to construct the suffix array for T. This can be done in worst case linear time and linear space using any of the existing algorithms [57,63,65]. Using the SA, we construct the suffix tree as described in [3]. While the suffix tree can be constructed directly in linear time, working from the SA to the ST will require less space for the construction. The suffix tree is then preprocessed in linear time to adjust the edges from a given parent node that lead to internal child nodes to equal-length edges. Using the adjusted suffix tree, the algorithm will process the internal nodes in the suffix tree in a top-down manner to determine the attributes (fchild, nfleaf and nnleaf) for the corresponding nodes in the VST. Next, we process the VST from the VST leaf nodes to the root, using the Q array to update the asa_index at each node. The adjusted asa_index field includes information on the sa_index and edge length.

The steps for constructing the VST for a given input string are summarized in Algorithm 3.1.

3.3.3 Further Space Reduction

We can further reduce the space needed by the VST, at the cost of an increased time for pattern matching. In the pattern matching phase, if the algorithm is to compare symbols one-by-one, rather than using binary search on the branches from a given node in the VST, we will only need to compute the lsa_index and rsa_index of the node.

Consider an arbitrary node (say node u) in the VST. The number of children from u or the number of u's leaf nodes cannot be larger than |Σ|. Thus, the sa_index of any child node of u will lie between node u's lsa_index and rsa_index. Then comparing one symbol from the pattern against the first symbol on each edge from u to its children will require at most O(|Σ|) time steps. The left child node and the right child node will not need to be used again. Thus, the attributes fchild and nfleaf in the leaf nodes of the VST are no longer required. We make the asa_index negative for the leaf nodes. Thus, during pattern matching, this serves as a flag for the VST leaf nodes. This compact structure will reduce the space requirement at each leaf node of the VST by 5 bytes. Time for pattern matching, however, will increase to O(|Σ|) for each symbol in the pattern P, or O(m|Σ|) overall, where m = |P|.

3.3.4 Complexity Analysis

Time and space complexity.

The time cost for lines 1-3 in the construction algorithm CONSTRUCT-VST (Algorithm 3.1) is O(n)+O(n)+O(n) = O(n). Lines 5-17 in the algorithm perform a one time traversal of the nodes in the suffix tree. The respective values of pTop and pBottom range from 1 to 2n. Thus the cost for the traversals is O(n). Lines 18-27 in the algorithm run at most pBottom times. The time for lines 18-27 in the algorithm is thus O(n), since each iteration of the loop requires constant time. Therefore, for the regular VST, the overall construction time is O(n). The time for pattern matching is in O(m log |Σ|). For the compact structure, the construction time is the same as the regular structure, but the VST is no longer stored linearly. Here we use an array to store the relation between the Q array and the compact VST. The searching time is now O(m|Σ|).

The space requirement clearly depends on the number of nodes in the VST, which is at most n for a sequence of length n. Each node requires a fixed amount of memory to store, leading to an O(n) space requirement.

Number of nodes and practical space requirement.

The actual space needed for the VST depends on the topology of the suffix tree. This topology can be captured by the number of internal nodes in the suffix tree, or alternatively, by the quantity R_IL, the ratio between the number of internal nodes and the number of leaf nodes. We call R_IL the density or branching factor for the suffix tree. We conducted an experiment to evaluate the effect of this branching factor on the storage requirement of the VST. The suffix tree was constructed and the branching factors computed for a set of files taken from [104]. For each file, we used the first 2^24 symbols as the text, and computed the branching factor. Table 3.3 shows the results. The maximum ratio of 0.76 was observed for the file Jdk13c. On average, however, the maximum ratio was around 0.63. The worst case occurs for a sequence with |Σ| = 1 (that is, T = a^n), leading to a branching factor of 1. The table shows that, for a given sequence, the branching factor depends on a complex relationship between n, |Σ|, and the mean LCP.

The space requirement for the VST, for both the compact and regular structures, depends directly on the branching factor. The last two columns in Table 3.3 show the maximum space requirement for each file.
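For small inputs, the branching factor can be estimated directly from the suffix array and lcp array, without building the suffix tree. The following is a rough sketch under stated assumptions: the function names are hypothetical, the suffix array and lcp array are built naively (quadratic or worse, suitable only for short strings), and the internal nodes are counted with a standard stack scan over the lcp values rather than with the procedure used for the experiments above.

```python
def suffix_array(text):
    """Naive suffix array; assumes text ends with a unique smallest symbol."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[j] = longest common prefix of the suffixes at SA[j-1] and SA[j]."""
    lcp = [0] * len(sa)
    for j in range(1, len(sa)):
        a, b, k = sa[j - 1], sa[j], 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        lcp[j] = k
    return lcp


def count_internal_nodes(lcp):
    """Count suffix tree internal nodes (root included) from the lcp array."""
    stack = [0]          # string depths of the currently open internal nodes
    closed = 0
    for value in lcp[1:]:
        while stack[-1] > value:
            stack.pop()  # this internal node has no further suffixes below it
            closed += 1
        if stack[-1] < value:
            stack.append(value)
    return closed + len(stack)


def branching_factor(text):
    """Ratio of internal nodes to leaf nodes (one leaf per suffix)."""
    sa = suffix_array(text)
    return count_internal_nodes(lcp_array(text, sa)) / len(sa)


ratio = branching_factor("missississippi$")
```

On the running example this counts 10 internal nodes, matching the nodes root and N1 to N9 of the VST, against 15 leaves.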

The foregoing discussion leads to the following lemma on VST construction:


Lemma 3: Given a string T = T[1..n], with symbols from an alphabet Σ, the virtual suffix tree (without suffix links) can be constructed in O(n) time, and O(n) space, independent of Σ.

3.4 Computing Suffix Links

Constructing the suffix tree from the suffix array as described in [3] does not include the suffix link. There are also a number of other suffix tree construction algorithms that build the suffix tree without the suffix link. See, for example, Farach et al. [38], and Cole and Hariharan [29]. The suffix link, however, is a significant component of the suffix tree, and is important in certain applications, such as approximate pattern matching using matching statistics, and other forms of traversal on the suffix tree. Thus, a data structure to support the complete functionality of the suffix tree requires an inclusion of the suffix link. Recent efficient data structures for suffix trees have thus provided mechanisms for constructing the suffix link. The ESA [2] provided suffix links using complicated RMQ preprocessing [18]. The LST [62] also supported suffix links using the lcp-interval tree and intervals defined on the inverse suffix array. A recent work by Maaß [80] focused exclusively on suffix link construction from suffix arrays, or from suffix trees that do not have such links.

The virtual suffix tree provides a natural mechanism for constructing suffix links. The key idea is that suffix links in the VST can be computed bottom-up, from the nodes with the highest node-depth (leaf nodes) in the VST to those with the least (the root). This is based on the following two observations about suffix links.

1. Consider a leaf node u_ST in the suffix tree corresponding to suffix T_i in the original sequence. The suffix link from u_ST will point to the leaf node corresponding to the suffix T_{i+1} (that is, the suffix that starts at the next position in the sequence).

2. The suffix link from a node u in the VST will point to some node w with a smaller string-depth in the VST, such that |L(u)| = |L(w)| + 1 (or equivalently |L(u_ST)| = |L(w_ST)| + 1).

The following lemma establishes how we can build suffix links on the VST.


Lemma 4: Given the VST for a string T = T[1..n] of length n, the suffix links can be constructed in O(n) time using an additional O(n) space.

Proof: Let u and w be two arbitrary nodes in the VST. Let v be the parent node of u. Let u.slink be the node to which the suffix link from node u points. We consider two cases:

Case A: u is a leaf node in the VST. Then, using the above observations, the suffix link from node u will point to node w in the VST (that is, u.slink = w) such that SA[w.sa_index] = SA[u.sa_index] + 1. Clearly, |L(w)| = |L(u)| − 1, where L(x) is the path label of node x. Note that this path label is not explicitly stored in the VST, but for each node, the length can be computed in constant time. This computation can be performed in constant time by maintaining two arrays and observing that n − |L(w)| = n − |L(u)| + 1. One array is the inverse suffix array (ISA) for the given string, defined as follows: ISA[i] = j if SA[j] = i, (i, j = 1, 2, ..., n). The second is an array M that maps the SA values to the corresponding parent nodes in the VST, defined as follows: M[i] = u, if u_ST in ST is the parent node of the leaf node corresponding to the suffix T_{SA[i]}. Clearly, both arrays can be computed in linear time, and require linear space.

Case B: u is not a leaf node in the VST. This is a simpler case. When u is an internal node in the VST, the suffix link from u will point to some node w, such that w is an ancestor of node u.fchild.slink, and |label(u, u.fchild)| = |label(w, u.fchild.slink)|. The O(n) time result then follows by using the skip/count trick [46], by observing that a VST has at most n nodes, a node depth of at most n, and that each upward traversal on the suffix link decreases the node depth by at least 1. □
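The inverse suffix array lookup used in Case A can be sketched in a few lines. This is a minimal illustration only: it uses 0-based indices (the text uses 1-based indices), the function names are assumptions, and it stops at the suffix-array level, leaving out the array M that maps SA positions to VST nodes.

```python
def inverse_suffix_array(sa):
    """ISA[i] = j whenever SA[j] = i (0-based), built in linear time."""
    isa = [0] * len(sa)
    for j, i in enumerate(sa):
        isa[i] = j
    return isa


def next_suffix_position(sa, isa, j):
    """SA position of the suffix that starts one symbol after SA[j].

    This is the leaf reached by the suffix link in observation 1.
    Returns None for the last suffix, which has no successor.
    """
    start = sa[j] + 1
    return isa[start] if start < len(sa) else None


# Usage on the running example (naive suffix array for illustration).
text = "missississippi$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
isa = inverse_suffix_array(sa)
target = next_suffix_position(sa, isa, 5)   # SA[5] = 1, so we look up suffix 2
```

Composing this lookup with M, as in line 30 of Algorithm 3.2, would then give the VST node that owns the target leaf.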

Although the above description is from the viewpoint of a VST already constructed, the suffix links can be constructed as the VST is being built, by some modification of the VST construction algorithm. Algorithm CONSTRUCT-VST-WITH-SUFFIXLINKS (Algorithm 3.2) shows a modification of algorithm CONSTRUCT-VST (Algorithm 3.1) to incorporate sections to compute the suffix links. The suffix link construction algorithm is based on the Q array used during the VST construction. We observe that the additional work required to construct the suffix links on the VST is independent of the alphabet size.

Figure 3.5 shows the result of the suffix link algorithm when applied to the VST of our example string T = missississippi$. Essentially, given the VST, the suffix link is constructed right to left, node-depth by node-depth, starting with the rightmost node at the deepest node-depth, and moving up the VST until we reach the root. Thus, the order of suffix link construction in the example will be SL1, SL2, ..., SL9.

Algorithm 3.2 shows that the additional work required to compute all the suffix links is linear in the length of the string. After construction, the suffix link on the VST will require one additional integer per internal node in the VST. This can be compared with the 2 integers per node required to store the suffix link using the ESA, or LST. In a typical VST, where the maximum branching factor is usually less than 0.7, the suffix link will require a maximum extra space of 0.7n × 4 = 2.8n bytes. Table 3.4 shows the space required for the VST (including the suffix array and suffix links) for both the compact structure and the regular VST, at varying values of the branching factor.
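The space figures quoted in this chapter appear to follow a simple linear model in the branching factor: 4n bytes for the suffix array plus a fixed number of bytes per VST node. The per-node constants in the sketch below are an assumption, inferred by fitting the entries of Tables 3.3 and 3.4 and the worst-case figures in the summary; they are not stated explicitly in the text, and the function name is hypothetical.

```python
SA_BYTES_PER_SYMBOL = 4.0

# Bytes per VST node, keyed by (layout, with_suffix_links).
# Assumed values, inferred from the reported tables rather than given directly.
BYTES_PER_NODE = {
    ("compact", False): 7.5,
    ("compact", True): 11.5,
    ("regular", False): 10.0,
    ("regular", True): 14.0,
}


def vst_bytes_per_symbol(ratio, layout="compact", suffix_links=True):
    """Estimated bytes per input symbol for a given branching factor."""
    return SA_BYTES_PER_SYMBOL + BYTES_PER_NODE[(layout, suffix_links)] * ratio


worst_compact = vst_bytes_per_symbol(1.0, "compact", True)
typical_compact = vst_bytes_per_symbol(0.7, "compact", True)
typical_regular = vst_bytes_per_symbol(0.7, "regular", True)
```

Under this reading, the 4 extra bytes per node for suffix links correspond to the one additional integer per internal node mentioned above, and a ratio of 0.7 gives the 2.8n bytes of extra space.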

We summarize the above discussion in the following theorem which captures the second main result of the work:

Theorem 3.2: Given a string T = T[1..n], with symbols from an alphabet Σ, the virtual suffix tree, including the suffix links, can be constructed in O(n) time and O(n) space, independent of Σ.

Proof. The theorem follows directly from Lemma 3 and Lemma 4. □

3.5 From SA to VST

So far, we have constructed the VST by first building the suffix tree from the suffix array, and then converting the suffix tree to a VST. The major problem with this approach is the relatively large memory requirement for suffix tree construction (for instance, compared to its storage). In this section, we eliminate this problem by constructing the VST directly from the suffix array, without a need to first construct the suffix tree.

The VST mainly encodes the structural information in a suffix tree, while avoiding the need to store some information that could be computed from the encoded structure. Thus, the key to going from SA to VST directly is to observe how the SA encodes the structural information in a suffix tree. The observation is that, given a sequence, the branching information in its suffix tree can be determined by making use of the corresponding suffix array and lcp array of the sequence. The edge labels, and hence edge lengths, can be determined by analyzing the differences between adjacent lcp values. In a sense, it was this same observation that was used in constructing the suffix tree from the suffix array in [3], which was exploited in Algorithm 3.1.
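How the lcp array exposes the branching structure can be seen with the standard stack-based enumeration of lcp-intervals, each of which corresponds to one internal node of the suffix tree together with its left and right SA bounds. The sketch below is this generic textbook scan, not Algorithm 3.3; the function name is an assumption, indices are 0-based, and the lcp array for the running example is written out by hand.

```python
def lcp_intervals(lcp):
    """Enumerate (depth, left, right) for every internal suffix tree node.

    lcp[j] is the longest common prefix of the suffixes at SA[j-1] and SA[j],
    with lcp[0] = 0. left and right are inclusive SA positions.
    """
    n = len(lcp)
    intervals = []
    stack = [(0, 0)]                        # (string depth, left bound); root first
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1       # sentinel closes every open interval
        left = i - 1
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()       # interval ends at SA position i - 1
            intervals.append((depth, left, i - 1))
        if cur >= 0 and stack[-1][0] < cur:
            stack.append((cur, left))       # a deeper node opens at position left
    return intervals


# lcp array of T = missississippi$ (suffixes sorted with $ as smallest symbol).
example_lcp = [0, 0, 1, 1, 4, 7, 0, 0, 1, 0, 2, 5, 1, 3, 6]
nodes = lcp_intervals(example_lcp)
bounds = [(left, right) for _, left, right in nodes]
```

Apart from the root, the (left, right) pairs produced here coincide with the (sa_index, rsa_index) pairs recorded for the TA nodes in Table 3.5.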

Like the suffix tree, the VST has two types of nodes, leaf nodes and non-leaf nodes. The VST encodes only the non-leaf nodes in the suffix tree. Each leaf node in the VST corresponds to an internal node in the suffix tree whose child nodes are all leaf nodes in the suffix tree. The suffix tree leaf nodes in turn point to positions in the suffix array. Thus, to determine the VST nodes and their respective attributes from the suffix array, we consider whether the node is a VST leaf node, or a non-leaf node. We call the former Type 0 nodes, and the latter Type 1 nodes. The problem then is to determine how the VST node attributes are derived from the SA and lcp for each type of node. We take a two step approach. First, we scan the SA and lcp from left to right, and use a temporary data structure to record pertinent information about each node in the VST. The temporary data structure (denoted TA) will be an array of structures (similar to a VST node structure), but a TA node will contain more information than a VST node. Each node in the VST has a corresponding entry in TA. In the second stage, we construct a mapping function (denoted MAP) that provides a one-to-one map from the elements in TA to the VST nodes. At this stage, some attributes in TA are renamed, and non-required fields in the TA structure are removed to give the required virtual suffix tree.

The first stage makes use of two structures: the TA structure and a stack data structure. The stack structure has two elements, the stack value (an integer) and the stack type (one bit). The stack type shows the VST node types described above. Type 0 indicates an unmerged leaf node, while Type 1 indicates an unmerged internal node. The TA structure contains the same attributes as a VST node (sa_index, fchild, nfleaf, and nnleaf), in addition to two pointers, namely, next, which points to the next sibling node of the current node, and rsa_index, the rsa_index of the current node.

The algorithm scans the suffix array (SA) and lcp array (LCP) from left to right (assuming the suffixes are sorted left to right in ascending order), and determines whether to create a new node based on the lcp values. We start the procedure to create a new node when LCP[Stack.top.index] is larger than the lcp of the current index. We exit from the procedure when LCP[Stack.top.index] is less than the lcp of the current index. When entering the node creation procedure, we use curNode to denote the current index, and curLCP to denote the lcp of the current index. Whenever we exit the procedure, we run a special exiting function for housekeeping, which could also create a new node. We make use of the following definitions:

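$$
\mathit{Stack.top.index} =
\begin{cases}
\mathit{Stack.top.value} & \text{if } \mathit{Stack.top.type} = 0 \\
\mathit{TA}[\mathit{Stack.top.value}].\mathit{rsa\_index} & \text{if } \mathit{Stack.top.type} = 1
\end{cases}
\tag{3.4}
$$

$$
\mathit{curNode.sa\_index} =
\begin{cases}
\mathit{curNode.value} & \text{if } \mathit{curNode.type} = 0 \\
\mathit{TA}[\mathit{curNode.value}].\mathit{sa\_index} & \text{if } \mathit{curNode.type} = 1
\end{cases}
\tag{3.5}
$$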

Essentially, a node is created by merging an existing internal node with another internal node, or with a leaf node, or by merging two leaf nodes to form an internal node. The algorithm makes use of several auxiliary routines, depending on the type of node. There are two cases, corresponding to the two node types:

1. CASE 1: VST LEAF NODES.

Here, Stack.top.type = 0. We consider two sub-cases.

Case 1A: LCP[Stack.top.index] > curLCP.

• Case 1A1: LCP[Stack.top.index] ≠ LCP[curNode.sa_index]. We merge Stack.top.index and curNode into a new element of TA, say T[k]. Let Stack.top.index be the leftmost leaf node and curNode be the rightmost child. The required update is performed using the merge1A1() routine, described as follows. If curNode is an index for SA (curNode.type = 0), then T[k] has two leaf nodes; update T[k] as follows: T[k].sa_index = Stack.top.index, T[k].nfleaf = 2 (since there are 2 leaf nodes), T[k].rsa_index = curNode.value. If curNode is an element of TA, then T[k] has one leaf node and one child node; update T[k] as follows: T[k].sa_index = Stack.top.index, T[k].fchild = curNode.value, T[k].nfleaf = 1 (since there is one leaf node), T[k].rsa_index = TA[curNode.value].rsa_index.

• Case 1A2: LCP[Stack.top.index] = LCP[curNode.sa_index]. In this case, curNode must be an element of TA. We update the node as follows: TA[curNode.value].sa_index = Stack.top.index, TA[curNode.value].nfleaf = TA[curNode.value].nfleaf + 1, and pop the stack.

Case 1B: LCP[Stack.top.index] = curLCP.

Again, in this case, curNode must be an element of TA. We simply update the number of leaves, namely, numleaf = numleaf + 1, and pop the stack.

2. CASE 2: VST INTERNAL NODES.

Here, Stack.top.type = 1. We also consider two sub-cases.

Case 2A: LCP[Stack.top.index] > curLCP.

We merge TA[Stack.top.value] and curNode into a new element of TA, say T[k]. The first child node will be TA[Stack.top.value], and the next (sibling) node will be curNode. The required update is performed using the merge2A() routine, described as follows. If curNode is an index for SA, then T[k] has one leaf node. Update T[k] as follows: T[k].sa_index = TA[Stack.top.value].sa_index, T[k].fchild = TA[Stack.top.value], T[k].nfleaf = 0, T[k].rsa_index = curNode.value, TA[Stack.top.value].nnleaf = 1 (this leaf is curNode). If curNode is an element of TA, then T[k] has two child nodes. If LCP[Stack.top.index] = LCP[Stack.(top-1).index], then pop the stack. (These two must be elements of TA.) Then, update T[k] as follows: T[k].sa_index = TA[Stack.top.value].sa_index, T[k].fchild = TA[Stack.top.value], T[k].nfleaf = 0, T[k].rsa_index = TA[curNode.value].rsa_index, TA[Stack.top.value].next = curNode.value.

Case 2B: LCP[Stack.top.index] = curLCP.

The update here is performed using the merge2B() routine, described as follows. If curNode is an index for SA, then update as follows: TA[Stack.top.value].nnleaf = TA[Stack.top.value].nnleaf + numleaf + 1. If curNode is an element of TA, then update as follows: TA[Stack.top.value].nnleaf = TA[Stack.top.value].nnleaf + numleaf, TA[Stack.top.value].next = curNode.value.

The special exit housekeeping procedure is performed using exitFunction(). The procedure is described as follows. If numleaf = 0, then push curNode into the stack. Otherwise (so we must have numleaf ≠ 0), let T[k] be the node resulting from merging curNode with the leaf which has the same LCP value as curLCP. Note that curNode is an element of TA. Update T[k] as follows: T[k].sa_index = TA[curNode.value].sa_index − numleaf, T[k].nfleaf = numleaf, T[k].fchild = curNode.value, T[k].rsa_index = TA[curNode.value].rsa_index. Push T[k] into the stack (equivalently, push (k) and set the stack type to 1), and increment k by 1. If curLCP ≤ the lcp of the next index, then push curNode into the stack.

Algorithm 3.3 shows the steps for constructing the TA structure, given the SA and LCP array. Figure 3.6 shows the VST nodes (nodes in TA) created using the algorithm on our running example string T = missississippi$. Table 3.5 shows the attributes of each node in the TA structure. Notice that some nodes, such as TA[2] and TA[4], were updated at later steps of the algorithm, after their initial creation.

Algorithm 3.4 shows how the TA node labels are mapped to VST node labels. The algorithm computes elength for the TA nodes, in order to compute asa_index, the adjusted sa_index, which is used in the VST to avoid storing the edge lengths. Table 3.6 shows the result of the mapping for the TA structure shown in Fig. 3.6 and Table 3.5. The algorithm uses a simple auxiliary function computeChildren_elengths() to determine the edge lengths.


Building the suffix links on the VST structure above can be done as was done earlier. The two algorithms still maintain the linear time construction for the VST.

3.6 Summary

In this work, we have presented the virtual suffix tree (VST), an efficient data structure for suffix trees and suffix arrays. The searching performance is the same as the suffix tree, that is, O(m log |Σ|) for a pattern of length m, with symbol alphabet Σ. We also showed how suffix links can be constructed on the VST in linear time, independent of the alphabet size. The VST does not store the edge lengths explicitly. This is achieved by modifying a key property of the suffix tree: the requirement that no two edges from a given node in the suffix tree can start with the same symbol. This key modification leads to a major distinction between the VST and the suffix tree, and results in extra space saving. However, whenever needed, the length for any arbitrary edge in the VST can be obtained in constant time using a simple computation. A further space reduction leads to a more compact representation of the VST, but at the expense of an increased search time, from O(m log |Σ|) to O(m|Σ|).

The space requirement depends on the topology of the suffix tree, in particular, the branching factor. For the compact structure, the worst case space requirement (including the suffix array) is 11.5n bytes without suffix links, and 15.5n bytes with suffix links, where n is the length of the string. However, in practice, the branching factor is typically less than 0.7. For the compact structure, this gives less than 9.25n bytes on average without the suffix links, or 12.05n bytes with suffix links.

In this work, we started from efficient storage of the suffix tree and suffix array after they have been constructed. Thus, we constructed the VST from the suffix tree, which in turn was constructed from the suffix array. To reduce the space requirement at construction time, we introduced another algorithm that constructs the VST directly from the suffix array. An interesting question is whether one can construct compressed versions of the VST, in a way that is analogous to compressed suffix trees and compressed suffix arrays. This could lead to further space saving for the VST.


Figure 3.1. Suffix tree and virtual suffix tree for the string T = missississippi$. (a) Suffix tree; (b) virtual suffix tree. The number at each leaf node indicates the position in SA. The number at each internal node indicates the node ID in the VST.

Figure 3.2. Example VST (solid nodes) showing left SA index (lSA) and right SA index (rSA) for sample nodes.


Figure 3.3. Edge-length adjustment procedure. (a) Original tree; (b) improved tree after adjusting the edge lengths.

Figure 3.4. Improved VST for the string T = missississippi$. (a) Modified suffix tree; (b) improved virtual suffix tree.


Algorithm 3.1: VST Construction Algorithm

CONSTRUCT-VST(T, n)
 1  SA ← COMPUTE-SUFFIXARRAY(T, n)
 2  ST ← SUFFIXTREE-FROM-SUFFIXARRAY(SA)
 3  ST ← ADJUST-EDGELENGTHS(ST)
 4  Initialize VST[], Q[], pTop = 0, pBottom = 0, curNode = root, Q[pTop] = root
 5  while (pBottom >= pTop)
 6      for (each childnode in curNode) do
 7          if (childnode is internal node in ST) then
 8              pBottom ← pBottom + 1; Q[pBottom] ← childNode
 9              if childnode is first internal node then
10                  VST[pTop].fchild ← pBottom
11              end if
12          else
13              Update VST[pTop].nfleaf and VST[pBottom].nnleaf
14          end if
15      end for
16      pTop ← pTop + 1; curNode ← Q[pTop]
17  end while
18  for (pb ← pBottom down to 0) do
19      if (VST[pb] is leaf node) then
20          VST[pb].asa_index ← Q[pb].fchild
21      else if (Q[pb].elength = 1) then
22          VST[pb].asa_index ← VST[pb].fchild.asa_index + VST[pb].nfleaf - Q[pb].elength
23      else
24          VST[pb].asa_index ← VST[pb].fchild.asa_index + VST[pb].nfleaf - Q[pb].elength + Q[pb].elength
25      end if
26      end if
27  end for


Table 3.3. Branching factor and maximum space requirement for various sample files.

File     |Σ|  Max Ratio  Compact  Regular  Description
Bible    63   0.61       8.60n    10.13n   King James bible
Chr22    5    0.73       9.50n    11.33n   Human chromosome 22
E.coli   4    0.65       8.89n    10.52n   Escherichia coli genome
Etext    146  0.54       8.02n    9.36n    Texts from Gutenberg project
Howto    197  0.55       8.13n    9.51n    Linux Howto files
Jdk13c   113  0.76       9.69n    11.59n   JDK 1.3 documentation
Rctail   93   0.66       8.95n    10.60n   Reuters news in XML format
Rfc      120  0.64       8.77n    10.36n   Concatenated IETF RFC files
Sprot    94   0.61       8.54n    10.05n
World    94   0.54       8.06n    9.41n    CIA world fact book
Average       0.63       8.71n    10.29n

Algorithm 3.2: VST construction with suffix links

CONSTRUCT-VST-WITH-SUFFIXLINKS(T, n)
 4  Initialize VST[], Q[], ISA[], M[], pTop ← 0, pBottom ← 0, curNode ← root, Q[pTop] ← root
    ...
18  for (pb ← pBottom down to 0) do
19      if (VST[pb] is leaf node) then
20          Update array M to map SA index and node VST[pb]
    ...
26      end if
27  end for
28  for (pb ← pBottom down to 0) do
29      if (VST[pb] is leaf node) then
30          VST[pb].slink ← M[ISA[VST[pb].sa_index + 1]]
31      else
32          Find ancestor w of VST[pb].fchild.slink s.t. |label(w, VST[pb].fchild.slink)| = |label(VST[pb], VST[pb].fchild)|
33          Set VST[pb].slink ← w
34      end if
35  end for


Figure 3.5. Suffix links on the VST for the sample string T = missississippi$. (a) Suffix links on the VST, but showing the leaf nodes of the suffix tree; (b) suffix links on the VST (no ST leaf nodes). The suffix links are labeled SL1, SL2, ..., SL9, indicating the order in which they were constructed.

Table 3.4. Storage requirement for the VST, including suffix links.

               Ratio  Compact  Regular
Worst Case     1      15.50n   18.00n
Average Case   0.75   12.63n   14.50n
               0.7    12.05n   13.80n
               0.65   11.48n   13.10n
               0.6    10.90n   12.40n

Table 3.5. Detailed attributes for nodes in the TA data structure using the sample sequence, T = missississippi$.

            TA[0]  TA[1]  TA[2]  TA[3]  TA[4]  TA[5]  TA[6]  TA[7]  TA[8]  TA[9]
label       P0     P1     P2     P3     P4     P5     P6     P7     P8     P9
sa_index    4      3      1      0      7      10     9      13     12     9
fchild      null   TA[0]  TA[1]  TA[2]  null   null   TA[5]  null   TA[7]  TA[6]
next        null   null   TA[4]  null   TA[9]  null   TA[8]  null   null   null
rsa_index   5      5      5      5      8      11     11     14     14     14
nfleaf      2      1      2      1      2      2      1      2      1      0
nnleaf      0      0      1      0      0      0      0      0      0      0


Figure 3.6. Constructing VST from the suffix array. Nodes are labeled based on their labels in the temporary array, TA. The mapping of the TA node labels to the corresponding VST node labels is shown in Table 3.6.

Table 3.6. Node mapping table from TA to VST.

TA nodes    P0  P1  P2  P3    P4  P5  P6  P7  P8  P9
VST nodes   N7  N4  N1  root  N2  N8  N5  N9  N6  N3


Algorithm 3.3: Constructing VST From Suffix Array

CONSTRUCT-VST-FROM-SA(LCP[], SA[])
    Stack ← buildStack(); TA[n]; k ← 0; st.value = 0; st.type = 0; push(st)
    for (i ← 1 to n) do
        if (LCP[i] ≥ LCP[Stack.top.index]) then
            st.value = i; st.type = 0; push(st)
        else
            curLCP ← LCP[i]; curNode.value ← i; curNode.type ← 0; numleaf ← 0
            do while (Stack is not empty & curLCP ≤ LCP[Stack.top.index])
                if (Stack.top.type = 0 & LCP[Stack.top.index] > curLCP) then
                    if (LCP[Stack.top.index] ≠ LCP[curNode.sa_index]) then
                        TA[k] ← merge1A1(Stack.top.index, curNode)
                        curNode.value ← k; curNode.type ← 1; k ← k + 1
                    else
                        TA[curNode.value].sa_index ← Stack.top.index;
                        TA[curNode.value].nfleaf ← TA[curNode.value].nfleaf + 1
                    end if
                else if (Stack.top.type = 0 & LCP[Stack.top.index] = curLCP) then
                    numleaf ← numleaf + 1
                else if (Stack.top.type = 1 & LCP[Stack.top.index] > curLCP) then
                    TA[k] ← merge2A(TA[Stack.top.value], curNode)
                    curNode.value ← k; curNode.type ← 1; k ← k + 1
                else
                    merge2B(TA[Stack.top.value], curNode)
                    pop(Stack); break
                end if
                pop(Stack)
            end while
            exitFunction()
        end if
    end for
    root ← the value of last element of Stack
    return MAP-TA-TO-VST(TA[], k, root)


Algorithm 3.4: Mapping TA nodes to VST nodes

MAP-TA-TO-VST(TA[], k, root)
    MAP[0..k-1]; W[0..k-1]; pTop ← -1; curNode ← root
    TA[curNode].fchild.elength ← 1
    for (j ← 0 to k-1) do
        MAP[j] ← curNode
        computeChildren_elengths(curNode)
        if (curNode.next ≠ NULL) then
            curNode ← curNode.next
        else
            pTop ← pTop + 1
            do while (TA[MAP[pTop]].fchild = NULL)
                pTop ← pTop + 1
            end while
            curNode ← TA[MAP[pTop]].fchild
            if (TA[curNode].elength > 1) then
                TA[curNode].sa_index ← TA[curNode].sa_index + TA[curNode].elength
            end if
            TA[MAP[pTop]].fchild ← j + 1
        end if
    end for
    for (j ← 0 to k-1) do
        W[MAP[j]] ← j
    end for
    for (j ← 0 to k-1) do
        if (MAP[j] ≥ 0) then
            index ← j
            SWAP1 ← TA[MAP[index]]
            do while (MAP[index] ≥ 0)
                SWAP2 ← TA[index]
                TA[index] ← SWAP1
                SWAP1 ← SWAP2
                MAP[index] ← -1
                index ← W[index]
            end while
        end if
    end for
    Remove next, rsa_index, elength from TA
    return TA as VST

Chapter 4

The Probabilistic Suffix Array

4.1 Introduction

It has earlier been shown [109] that the probabilistic suffix tree (PST) is equivalent to the probabilistic suffix automaton, which is a type of probabilistic finite automaton (PFA). The PFA on the other hand can be viewed as a variable length Markov model (VLMM), whose memory length constraint is determined by the observed data. We present the probabilistic suffix array (PSA), a new data structure for representing information in variable length Markov chains. The PSA essentially encodes information in a VLMM by providing a space-efficient representation of the probabilistic suffix tree (PST). Our PSA provides the same functionality as the PST, but at a reduced space requirement. The equivalence between the PST and a class of PFAs implies that our PSA is also equivalent to this class of PFAs.

Main Results. We present algorithms to construct the PSA and for sequence prediction based on the constructed PSA. Our algorithms are based on the notion of empirical probabilities, modeled using the information retrieval notions of term frequency (TF) and document frequency (DF). We present a linear time algorithm for computing the document frequency. We state our main results in the form of two theorems about the PSA.


Theorem 4.1: Given a sequence T = T[1...n], with symbols from an alphabet Σ, and the memory constraint L on the variable length Markov model, the probabilistic suffix array (PSA) for T can be constructed in O(n) time, and O(n) space, independent of the Markov order, L.

Theorem 4.2: Given a sequence T = T[1...n], with symbols from an alphabet Σ, where Σ = σ1σ2...σ|Σ|, and the probabilistic suffix array (PSA) for T, we can decide whether a pattern P = P[1..m] is generated by the same variable length Markov chain that generated T in O(m log(n/|Σ|)) time.

In previous work by Ron et al. [109] and Apostolico and Bejerano [9], the probabilistic finite automaton (PFA) was represented using the probabilistic suffix tree (PST). Here, we present a space-efficient data structure to simulate such finite state machines when represented as a PST. Since a variable length Markov model is a finite state machine, and can be represented as a PST, our proposed structure can be used to capture the information in a variable length Markov model. We call our data structure the probabilistic suffix array (PSA), since it is built on suffix arrays rather than the suffix tree data structure. The PSA uses an array of nodes to capture the branching structure in the suffix tree, and other auxiliary arrays to maintain information needed for learning from the observed data. Learning in the PSA is performed by computing conditional probabilities at each node in the PSA using empirical probabilities computed via the TF and DF.

Organization. In the next section, we briefly describe the PST using an example sequence. In Section 4.3, we present our PSA data structure. We also give an example for the PSA to explain how it works. The construction algorithm is presented in Section 4.4. We analyze its practical space requirement in Section 4.5. Section 4.6 presents experimental results on protein family classification and phylogenetic tree construction. The last section provides a summary and conclusion.

4.2 Probabilistic Suffix Tree

Given a sequence T with n observations, the Markov model needed to represent T will require space that is exponential in L, the order of the Markov model. In practice, the transition matrix for the Markov model to represent such a sequence will be sparse as the order L increases. Suffix tree data structures represent all substrings of a sequence in O(n) internal nodes and leaf nodes. Ron et al. [109] presented a space-efficient data structure called the probabilistic suffix tree (PST) to represent the order L transition matrix. The PST encodes only the non-zero transition probabilities. Therefore, the PST is a suffix tree which contains the transition probabilities for the Markov model with any order L, where 0 < L ≤ n. However, in practice, the space requirement for the suffix tree is still a problem.

Consider the sample sequence T = accactact$. Its first order transition matrix and the corresponding state diagram are shown in Figure 4.1. Figure 4.2 shows an example suffix tree and the corresponding PST for this example sequence. The PST is shown for the case of order L = 3. In this PST, we label each symbol with its transition probability. The transition probability for a given symbol is calculated using the conditioning context.

[Figure 4.1 not reproduced: panel (a) State Diagram over the symbols a, c, t, g, with edge probabilities 0.25, 1, 0.25, 0.5 and 1; panel (b) Transition Matrix.]

Figure 4.1. State diagram and transition matrix for a first order Markov model for an example sequence T = accactact$.
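As a small illustration of the first order model in Figure 4.1, the following Python sketch estimates transition probabilities from adjacent symbol pairs. It assumes that the terminal symbol $ is dropped before counting; the function name first_order_transitions is hypothetical and not part of the original work.

```python
from collections import Counter, defaultdict

def first_order_transitions(seq):
    """Estimate P(next | prev) from adjacent symbol pairs in seq."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    return {
        prev: {sym: c / sum(row.values()) for sym, c in row.items()}
        for prev, row in counts.items()
    }

# Example sequence from the text, without the terminal symbol '$'.
matrix = first_order_transitions("accactact")
```

Under this assumption the sketch yields the values 0.25, 0.25, 0.5, 1 and 1, which are the edge labels recoverable from the state diagram.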

4.3 Proposed Data Structure

We propose the probabilistic suffix array (PSA) as a way to simulate the probabilistic suffix tree. Each node in the PSA has a corresponding node in the PST. The basic PSA structure has three types of attributes. The first category of attributes are the foundation attributes, which consist of the original text and its suffix array. Construction of the suffix array is in O(n) time and space, using any of the various linear-time linear-space algorithms. See for example [4, 57, 97, 104]. The second category of attributes are the internal node attributes. These are derived from the interval array, which is determined following [131]. These are used to represent the internal nodes in the suffix tree, including the suffix links. The suffix link is the link from an internal node to its suffix node. The third type of attributes are measurement attributes. These record measurement information, such as term frequency, document frequency, and conditional probabilities, which are needed to compute probabilities in the Markov model. Table 4.1 shows the three categories of attributes used in the PSA. We use the term length of PSA to refer to the number of nodes in the PSA. The term length emphasizes the fact that our PSA nodes are stored as arrays. In this work, we use M to denote the PSA length.

[Figure 4.2 not reproduced: panel (a) suffix tree; panel (b) probabilistic suffix tree.]

Figure 4.2. Example suffix tree and probabilistic suffix tree for the string T = accactact$. (a) suffix tree; (b) probabilistic suffix tree. The trees are shown without the suffix links. The numbers at each node of the PST are based on the count of the number of times the symbols are observed after observing the sequence corresponding to the node label. Essentially, at a given node, say u, these encode the conditional probabilities P(σ|C) of observing the symbol σ following the sequence L(u).

4.3.1 Internal Node Attributes

The internal node attributes are derived from the interval array. The pair <Start, End> represents the interval position of this node in the suffix array. Length denotes the length of the longest common prefix between the substrings represented by the node. This essentially corresponds to the length of the path from the root to the current node. The PSA internal node attributes are used to simulate the internal nodes in a suffix tree. The attribute Suffix Link is a regular suffix link from the current internal node to its suffix node. We use this link to continue the searching process when a mismatch occurs at the time of prediction. The internal node attributes, including the suffix link, are constructed using Algorithm BUILDPSA.

4.3.2 Measurement Attributes

These attributes are dependent on the application. For example, for applications in document feature selection, or in protein sequence classification, we only compute and store TF and DF, and the conditional probabilities. In other applications, such as document clustering, we may need to compute document frequency for the classes, rather than just the document frequency. The attributes are also dependent on the methods used in calculating the probabilities in the Markov model. Thus, we focus on the method of computing the conditional probabilities, and the probability of a node in the VLMM. For a given node with node label, say (S1...Sn), its probability, P_VLMM, is given by:

$$P_{VLMM} = P(S_1 \ldots S_n) = P(S_n \mid S_1 \ldots S_{n-1}) \cdots P(S_2 \mid S_1)\, P(S_1) \qquad (4.1)$$

If S = S1...St−1St (where 1 ≤ t ≤ n) does not occur in the training data, we find the longest suffix of S which occurred in the training data. Assume that Sk...St−1St (where 1 ≤ k ≤ t) is this longest suffix of S; then P(St|S1...St−1) = P(St|Sk...St−1). Thus,

$$P(S_t \mid S_1 \ldots S_{t-1}) = P(S_t \mid S_k \ldots S_{t-1}) = \begin{cases} \dfrac{TF_{S_t}}{n} & : k = t \\[2ex] \dfrac{TF_{S_k \ldots S_t}}{TF_{S_k \ldots S_{t-1}}} & : k < t \end{cases} \qquad (4.2)$$

Table 4.1. Attributes of PSA

Type                       Attribute                              Space
Fundamental Attributes     Original Text (text)                   char
                           Suffix Array (SA)                      integer
Internal Node Attributes   Start                                  integer
                           End                                    integer
                           Length                                 integer
                           Suffix Link                            integer
Measurement Attributes     Term Frequency (TF)                    integer
                           Document Frequency (DF)                integer
                           cProbability P(St|Sk...St−1) (CP)      float

Here, TF_u is the term frequency of the node with node label u. We make two observations. First, if the terminal symbol St of a path Sk...St−1St is a first symbol in an edge of the suffix tree, then the conditional probability P(St|Sk...St−1) can be computed as the frequency of the current node divided by the frequency of the parent node. Secondly, if the terminal symbol St of a path Sk...St−1St is a non-first symbol in an edge of the suffix tree, then the conditional probability P(St|Sk...St−1) is 1. We call this a trivial conditional probability. Thus, we need to store the conditional probabilities for only the first symbols in each edge. When we determine that the terminal symbol of a path is a non-first symbol in an edge, we simply return 1 for the conditional probability of the symbol.
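Equation (4.2) can be checked on the example sequence with a brute-force sketch. The code below is not the PSA itself: it counts occurrences directly in the text, and the names term_frequency and conditional_probability are hypothetical. It assumes that the terminal symbol $ is excluded, so that n = 9 for T = accactact.

```python
from fractions import Fraction

def term_frequency(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(text.startswith(s, i) for i in range(len(text) - len(s) + 1))

def conditional_probability(text, context, symbol):
    """P(symbol | context) following Equation (4.2), with back-off."""
    # Shorten the context from the left until context+symbol is observed.
    while context and term_frequency(text, context + symbol) == 0:
        context = context[1:]
    if not context:
        # Case k = t: only the symbol itself remains.
        return Fraction(term_frequency(text, symbol), len(text))
    # Case k < t: ratio of the two term frequencies.
    return Fraction(term_frequency(text, context + symbol),
                    term_frequency(text, context))

text = "accactact"
p_t_given_ac = conditional_probability(text, "ac", "t")
p_c_given_a = conditional_probability(text, "a", "c")
```

Here P(c|a) evaluates to 1 because every occurrence of a is followed by c, which corresponds to the trivial conditional probability described above.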

4.3.3 Example PSA

Table 4.2 and Table 4.3 show the nature of the PSA nodes and the order-3 conditional probabilities in a PSA, using the sequence T = accactact$ used in Figures 4.1 and 4.2. The entries in Table 4.2 are directly calculated from the interval array. We notice that P(c|a) is 1. This indicates that the substring "ac" is represented on one edge and that the terminal symbol "c" is the second symbol on this edge. This is an example of a trivial conditional probability. P(c|ta) is another example whose probability is 1.

Entries in Table 4.3 are computed from the PSA leaf nodes. We show only the non-trivial conditional probabilities on the leaf nodes. These conditional probabilities are calculated based on the first observation described in Section 4.3.2. The numerator is 1 since this is a leaf node, which must have a frequency of 1. The denominator is the frequency of the substring corresponding to the node label of the parent of the current leaf node. This is easily obtained as (End − Start + 1) using the elements in the interval array. The PSA nodes can be compared with nodes in the example PST shown in Figure 4.2. The transition matrix in Figure 4.1(b) can be constructed from the PSA.


PSA node   Interval   Length   Suffix   Probability   Conditional
index                          Link     Expression    Probability
1          <2,3>      3        2        P(t|ac)       2/3
2          <1,3>      2        3        P(a)          1/3
2          <1,3>      2        3        P(c|a)        1
3          <6,7>      2        4        P(t|c)        1/2
4          <4,7>      1        -1       P(c)          4/9
5          <8,9>      1        -1       P(t)          2/9

Table 4.2. Example PSA internal nodes, using the PSA of the sequence T = accactact$

SA index   Probability   Conditional
           Expression    Probability
1          P(c|ac)       1/3
3          P(a|act)      1
4          P(a|c)        1/4
5          P(c|c)        1/4
7          P(a|ct)       1/2
9          P(a|t)        1

Table 4.3. Example PSA leaf nodes, using the PSA of the sequence T = accactact$

4.3.4 Interval Array and Document Frequency in Linear Time

The interval nodes represent the basic structure of the PSA. Since other attributes are based on the structure of these interval nodes, we need to compute the interval nodes first. The term frequency (TF) and document frequency (DF) are computed as by-products as we build the interval nodes. In this section, we give a new algorithm to compute DF in linear time. Compared with the original algorithm [131], which runs in O(n log n) time, our algorithm is more efficient and applicable to large document collections. The interval nodes are stored as an array using the interval array data structure. To build the interval array, we use Yamamoto and Church's TF algorithm [131]. In our structure, the non-trivial lcp-delimited intervals represent the internal nodes of the suffix tree. Therefore, we modify the algorithm to output only the non-trivial lcp-delimited intervals. After this procedure, we obtain the interval array, which includes the interval attributes <Start, End> and Length for each node. Length is simply the LCP of the interval represented by <Start, End>. The attribute TF is also computed at this stage.
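The relation between the suffix array, the LCP array and the non-trivial lcp-delimited intervals can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the Yamamoto and Church routine: the suffix array is built naively, ranks are 0-based with the suffix $ at rank 0, and all function names are hypothetical.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; for illustration only.
    return sorted(range(len(text)), key=lambda p: text[p:])

def lcp_array(text, sa):
    def lcp(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    # lcp_values[r] compares the suffixes at ranks r-1 and r.
    return [0] + [lcp(sa[r - 1], sa[r]) for r in range(1, len(sa))]

def nontrivial_intervals(lcp_values):
    """Return (start, end, length) for every lcp-interval with length > 0."""
    n = len(lcp_values)
    out, stack = [], [(0, 0)]               # stack of (lcp value, left boundary)
    for i in range(1, n + 1):
        cur = lcp_values[i] if i < n else 0  # final sentinel closes open intervals
        left = i - 1
        while stack[-1][0] > cur:
            length, start = stack.pop()
            out.append((start, i - 1, length))
            left = start
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return sorted(out)

text = "accactact$"
sa = suffix_array(text)
intervals = nontrivial_intervals(lcp_array(text, sa))
tf = [end - start + 1 for start, end, _ in intervals]
```

For the example sequence the resulting intervals coincide with the <Start, End> and Length entries of Table 4.2, and the term frequency of each node is End − Start + 1.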


For applications that involve multiple sequences or documents, for instance, in text clustering or in protein sequence classification, we may require the document frequency. In this work, the input sequence in such applications will be a concatenation of all the sequences, with a special end of document symbol ($) delimiting each individual sequence.

The original algorithm proposed in [131] for computing document frequency runs in O(n log n) time. Here, we modify the algorithm to improve its running time to O(n). The original algorithm is reproduced in Figure 2.1 for easy reference.

In lines 4-6, the algorithm searches for the largest x. The worst case for this search will be O(log n). We add a new array docsp that maps the document id to a stack. The length of this array is the number of documents, Z. When we calculate a new interval <Start, End>, the algorithm checks whether the element has been observed previously. The algorithm searches docsp with the document id of the new element. If the document id is found, the document frequency (DF) is changed. The process performs a simple look-up using docsp in constant time, and hence the modified algorithm runs in O(n) time. Algorithm COMPUTEDOCUMENTFREQUENCY implements the proposed modifications.

Algorithm 4.1: Computing Document Frequency in Linear Time

COMPUTEDOCUMENTFREQUENCY

(3)   doc ← getdocnum(s[j]), docsp[doc] ← sp
(4)   if doclink[doc] ≠ -1, do
(5)       if docsp[doc] > sp or stack[docsp[doc]].i > doclink[doc] do
(5.5)         stack_df[docsp[doc]] ← stack_df[docsp[doc]]-1
(6)   doclink[doc] ← j, docsp[doc] ← sp
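To make the difference between TF and DF concrete, the following Python sketch computes both for a pattern by brute force. It is a reference computation that can be used to check results on small inputs, not the linear-time algorithm above; the function name tf_df and the two sample documents are hypothetical.

```python
def tf_df(docs, pattern):
    """Return (term frequency, document frequency) of pattern in docs."""
    # Concatenate the documents, each terminated by the delimiter '$'.
    text = "".join(d + "$" for d in docs)
    # doc_of_pos[p] is the id of the document that contains position p.
    doc_of_pos = [i for i, d in enumerate(docs) for _ in range(len(d) + 1)]
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    # Ranks whose suffix starts with pattern form one contiguous interval.
    ranks = [r for r, p in enumerate(sa) if text.startswith(pattern, p)]
    if not ranks:
        return 0, 0
    start, end = ranks[0], ranks[-1]
    tf = end - start + 1
    df = len({doc_of_pos[sa[r]] for r in range(start, end + 1)})
    return tf, df

docs = ["acca", "ctact"]
```

With these two documents, the pattern ct occurs twice but only in the second document, so its TF is 2 while its DF is 1.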

4.4 Constructing the PSA

Having described the building blocks for the probabilistic suffix array (PSA), we are now ready to describe how we put them together to construct the PSA. Algorithm BUILDPSA (Algorithm 4.2) uses five major steps or procedures in constructing the PSA data structure for a given input sequence. The first step is the construction of the suffix array from the original sequence. We use standard linear-time linear-space algorithms for this step. Using Yamamoto and Church's PRINT_LDIS_STACK() function [131], we construct the interval array. We then simulate the tree-like structure from the interval array in the third step. The third step maps each position in the input sequence to its interval. In the fourth step, the routine BUILDSUFFIXLINK starts by first constructing the inverse suffix array W. From W, the algorithm determines a link that allows the interval array to point to the next position. Thus, the suffix link is easy to construct.

The final procedure constructs a ranked list of the elements in the interval array (the nodes in the PSA) in non-decreasing order. This process uses counting sort in O(M) time and 2M integers of space, where M is the number of non-trivial lcp-delimited intervals. Since M ≤ n, this will take O(n) time.

Algorithm 4.2: Building the Probabilistic Suffix Array

BUILDPSA(Text)
1 SA ← BUILDSA(Text)
2 IntervalArray ← PRINT_LDIS_STACK(SA)
3 <posInterval,parent> ← BUILDINTERVALTREE(SA,IntervalArray)
4 PSA ← BUILDSUFFIXLINK(posInterval,parent)
5 PSA ← SORTPSA(PSA)

4.4.1 Building the Interval Tree

Algorithm BUILDINTERVALTREE (Algorithm 4.3) uses the interval pairs <Start, End> to construct a tree-like structure that encodes the parent-child relationships between the intervals. The main idea is to set the interval ID into a position between positions Start and End, such that the position has not been previously set. The problem is that the pairs <Start, End> can overlap (nesting). A naïve algorithm for this task will require O(n^2) time.

We use a stack to store the free position which is the current pair <Start, End>. If the position has not been earlier set to some interval ID, the algorithm sets the interval ID one by one from Start to End. If the position has earlier been set to some interval ID, the algorithm will set the interval ID in the last part of the pair <Start, End> which has not earlier been set to an interval ID. This process guarantees that the position will be set to one interval ID, and that the position is included in only one <Start, End> pair.

Some positions which could not be set to an interval ID in the last pass will be set to an interval ID in the next loop. The following step pops positions from the stack and sets interval IDs into these positions. This process guarantees that every position will be set to one interval ID, even the positions that belong to the next higher level interval, or to a lower level interval.

After determining the interval IDs, the algorithm computes the parent array. The parent array stores the link from one interval to the higher interval position which shares the same value for Start. The variable pN is a stack. This stack stores the intervals which have been computed, but whose parent's interval ID has not yet been set. Line 16 pushes the interval onto the stack.

When the current interval includes the interval at the top of the stack pN, it means the current interval is the parent of the interval at the top of pN. Thus, we set the value for parent of the interval at the top of pN to the current interval. The loop in lines 13 to 15 repeats this step to find all the child intervals of the current interval.

4.4.2 Building the Suffix Link

Algorithm BUILDSUFFIXLINK (Algorithm 4.4) uses the suffix array (SA) to compute the inverse suffix array W. To compute the suffix link, the algorithm sets the position of the parent in W to be the current position. Thus we obtain the suffix link for each leaf node. To compute the suffix link of an internal node, we simply use the parent array to find the suffix link of the internal node.
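The role of the inverse suffix array in locating suffix links can be illustrated with a simple reference computation. The Python sketch below is quadratic in the number of intervals and is not Algorithm 4.4; it assumes 0-based ranks, a text that ends in a unique terminal symbol, and intervals given as (start, end, length) triples. The function name suffix_links is hypothetical.

```python
def suffix_links(sa, intervals):
    """For each interval, return the index of its suffix-link target, or -1 for the root."""
    rank = [0] * len(sa)                 # inverse suffix array (W in the text)
    for r, p in enumerate(sa):
        rank[p] = r
    links = []
    for start, end, length in intervals:
        if length == 1:
            links.append(-1)             # dropping the only symbol leads to the root
            continue
        # Ranks of the first and last suffix after removing the first symbol.
        lo, hi = rank[sa[start] + 1], rank[sa[end] + 1]
        # The target is the interval of length - 1 that encloses both ranks.
        target = next(i for i, (s, e, l) in enumerate(intervals)
                      if l == length - 1 and s <= lo and hi <= e)
        links.append(target)
    return links

sa = [9, 0, 6, 3, 2, 1, 7, 4, 8, 5]      # suffix array of accactact$
intervals = [(1, 3, 2), (2, 3, 3), (4, 7, 1), (6, 7, 2), (8, 9, 1)]
links = suffix_links(sa, intervals)
```

For the example sequence, the node for "ac" links to the node for "c", "act" links to "ct", and "ct" links to "t", while nodes of length 1 link to the root.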

4.4.3 Sorting the PSA Structure

After obtaining the suffix link, we re-sort the PSA structure to make it suitable for efficient searching during the VLMM prediction stage. The prediction procedure performs frequent searches using the interval array (see Section 4.4.5). After sorting using the <Start, Length> attributes of the PSA, searching will be done in O(log M) time, where M is the PSA length, the number of the interval pairs in the PSA.

Algorithm 4.3: Building the Interval Tree

BUILDINTERVALTREE(SA[],Start[],End[],Length[])
1  posInterval[] ← -1; pt ← 0
2  for (i ← 1 to M) do
3      if posInterval[i]=-1 do //first time to observe Start[i]
4          for (j ← Start[i] to End[i]) do
5              if posInterval[j]=-1 do posInterval[j] ← i else break end if
6          end for
7          for (j ← End[i] down to Start[i]) do
8              if posInterval[j]=-1 do posInterval[j] ← i else break end if
9          end for
10     end if
11     pop position pos which is between Start[i] and End[i] in Stack
12     posInterval[pos] ← i; push the position between Start[i] and End[i-1]
13     while pt > 0 and Start[i] ≤ Start[pN[pt-1]]
14         parent[pN[pt-1]] ← i; pt ← pt-1
15     end while
16     pN[pt] ← i; pt ← pt+1
17 end for
18 return < posInterval[], parent[] >

Algorithm 4.4: Building Suffix Link

BUILDSUFFIXLINK(SA[],Start[],Length[], posInterval[], parent[])
1  for (i ← 1 to n) do W[SA[i]-1] ← i end for
2  for (i ← 1 to n) do W[i] ← posInterval[W[i]] end for
3  for (i ← 1 to M) do
4      if Length[i] ≠ 1 and sLink[i] = NULL do
5          sLink[i] ← W[SA[Start[i]]]
6          k ← parent[i]; pi ← i
7          do while (k ≠ root and sLink[k] = NULL)
8              sLink[k] ← parent[sLink[pi]]
9              do while (Length[sLink[k]] ≠ Length[k]-1)
10                 sLink[k] ← parent[sLink[k]]
11             end while
12             pi ← k; k ← parent[k]
13         end while
14     end if
15 end for

The algorithm SORTPSA (Algorithm 4.5) uses counting sort to sort the PSA structure. In the interval array, the attribute Length is already in decreasing order for intervals that share the same Start. Therefore, we only need to sort the structure based on the order of the Start attribute. The time complexity of this algorithm is linear with respect to the number of nodes (number of intervals). The additional space is at most 2n integers; see the analysis in Section 4.5.

Algorithm 4.5: Sorting the PSA Structure

SORTPSA(Start[],Length[],sLink[])
1  Count[] ← 0; Order[] ← 0; wOrder[] ← 0; NewStart[] ← 0; NewEnd[] ← 0
   NewLength[] ← 0; NewsLink[] ← 0; sum ← 0; NewPattern[] ← 0
2  for (i ← 1 to M) do Count[Start[i]] ← Count[Start[i]]+1 end for
3  for (i ← 1 to n) do
4      if Count[i] ≠ 0 do sum ← sum+Count[i]; Count[i] ← sum-Count[i]; end if
5  end for
6  for (i ← 1 to M) do
7      Order[Count[Start[i]]] ← i; Count[Start[i]] ← Count[Start[i]] + 1
8  end for
9  for (i ← 1 to M) do wOrder[Order[i]] ← i end for
10 for (i ← 1 to M) do sLink[i] ← wOrder[sLink[i]] end for
11 for (i ← 1 to M) do
12     NewStart[i] ← Start[Order[i]]; NewEnd[i] ← End[Order[i]]; NewLength[i] ← Length[Order[i]]
13     NewsLink[i] ← sLink[Order[i]]; NewPattern[i] ← pattern[Order[i]]
14 end for
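The counting sort idea behind SORTPSA can be sketched in Python as follows. This is a simplified illustration with 0-based indices that sorts node indices by Start and remaps the suffix links to the new positions; it does not reproduce every array of Algorithm 4.5, and the function name sort_nodes_by_start is hypothetical.

```python
def sort_nodes_by_start(starts, links, n):
    """Stable counting sort of node indices by Start, remapping suffix links."""
    offset = [0] * (n + 1)
    for s in starts:
        offset[s + 1] += 1
    for v in range(n):
        offset[v + 1] += offset[v]        # offset[v] = number of starts below v
    order = [0] * len(starts)
    for i, s in enumerate(starts):
        order[offset[s]] = i              # place node i at its sorted position
        offset[s] += 1
    new_index = [0] * len(starts)         # plays the role of wOrder
    for new, old in enumerate(order):
        new_index[old] = new
    # Links of -1 (root) are kept; all others point to the new positions.
    new_links = [-1 if links[old] < 0 else new_index[links[old]] for old in order]
    return order, new_links

# Nodes in the order of Table 4.2, with 0-based suffix links.
starts = [2, 1, 6, 4, 8]
links = [2, 3, 4, -1, -1]
order, new_links = sort_nodes_by_start(starts, links, n=10)
```

Because the sort is stable, nodes that share the same Start keep their relative order, which is what allows the Length ordering within a Start group to be preserved.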

4.4.4 Computing Conditional Probabilities Using the PSA

Algorithm COMPUTEPROBABILITY (Algorithm 4.6) is a simple routine which is based on the PSA structure. It computes the conditional probability at each internal node by using Equation (4.2). The algorithm uses the temporary array parent which was generated by Algorithm BUILDINTERVALTREE (Algorithm 4.3) and used in Algorithm BUILDSUFFIXLINK (Algorithm 4.4). The array parent contains a record of the parent for each given interval node. The algorithm is linear in time and is in place. After computing the conditional probabilities, the space used by the array pattern can be released, since the array is no longer needed.


We summarize the above results in Theorem 4.1, our first main result on the PSA:

Theorem 4.1: Given a sequence T = T[1...n], with symbols from an alphabet Σ, and the memory constraint L on the variable length Markov model, the probabilistic suffix array (PSA) for T can be constructed in O(n) time, and O(n) space, independent of the Markov order, L.

Proof: Algorithms BUILDSA and PRINT_LDIS_STACK each run in O(n) time, as in [104] and [131] respectively. Algorithm BUILDINTERVALTREE computes two arrays, posInterval and parent. The array posInterval stores the relationship between leaf nodes and interval nodes, while parent represents the parent nodes for the interval nodes. There are n leaf nodes and M interval nodes. Lines 2-17 in Algorithm BUILDINTERVALTREE visit each interval node. Lines 4-6 establish the relationships between the current interval node and the leaf nodes before the first child interval node of the current interval node. Lines 7-9 calculate the relationships between the current interval node and the leaf nodes after the last child interval node of the current interval node. Line 11 computes the relationships between the current interval node and the leaf nodes which are not included in other interval nodes. Each leaf node will be computed one time in Algorithm BUILDINTERVALTREE. Lines 13-15 compute the child nodes of the current interval node. The time cost is O(M) over the whole algorithm, even with the loop in Line 2. Algorithm BUILDSUFFIXLINK calculates the suffix link via the inverse suffix array W. It calculates suffix links starting at the lowest level interval nodes, which do not include other interval nodes. Then it calculates the suffix link for the parent node of the current interval node following the parent array. So the time is O(M) for computing the suffix links for all interval nodes. There are 6 loops in Algorithm SORTPSA. Each loop runs either n times or M times. So the time complexity is O(n). Therefore, overall, the probabilistic suffix array (PSA) for a sequence T of length n can be constructed in O(n) time.

Linear space requirement follows from the space analysis in Section 4.5. □

4.4.5 Prediction with VLMM via the PSA

An important procedure in Markov models is to compute the probability that a given test pattern is generated by the model. For the variable length Markov model (VLMM), we denote this probability as the P_VLMM of the input pattern. Algorithm VLMM-PREDICTION (Algorithm 4.7) calculates the P_VLMM of a test sequence. This algorithm uses Equation (4.1) to compute P(S1), P(S2|S1), ... by searching for the sub-patterns S1, S1S2, ... When there is a mismatch while searching with the sub-pattern Sk...St, the algorithm jumps to the node pointed to by the suffix link attribute of the current node. Thus, matching proceeds from the node representing the suffix Sk+1...St, after accounting for the prefix of this suffix, which has already been matched in the previous step.

Algorithm 4.6: Calculating Conditional Probability

COMPUTEPROBABILITY(PSA,M,n)
1 for (i ← 1 to M) do
2     if PSA.Length[i]=1 do
3         PSA.cProbability[i] ← PSA.TF[i]/n
4     else
5         PSA.cProbability[i] ← PSA.TF[i]/PSA.TF[PSA.parent[i]]
6     end if
7 end for

The algorithm scans positions in the pattern from left to right. While matching the sub-pattern, it uses the function LeftmostMatchedPosition, which uses standard suffix array search algorithms [3, 84] to find the leftmost position (leftPosn) that matched the sub-pattern. The function LeftmostMatchedPosition searches the pattern Pattern[s...i]#, where # is a symbol that never occurs in the alphabet. That is, # ∉ Σ, # < σ for all σ ∈ Σ, and $ < #. The search will result in a mismatch. Since Pattern[s...i] has already matched up to position i−1 using the SA, we can easily determine the leftmost position of Pattern[s...i] in the suffix array. We use leftPosn to denote this leftmost position in the algorithm. The function SearchPSA uses the determined leftmost position to search in the PSA. This function also uses standard suffix array search algorithms [3, 84]. The function determines the index of the PSA node such that PSA.index.Start = leftPosn and PSA.index.Length is the minimum value that is larger than i−s, the length of the matching prefix of the pattern. Since the PSA is already sorted by Start and Length, the search for the index will be done in O(log M) time, where M is the length of the PSA (i.e. the number of nodes in the PSA).
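The binary search for the leftmost matching position can be illustrated with Python's bisect module. The sketch below is a simplified stand-in for LeftmostMatchedPosition, not the routine used in this work: it materialises all suffixes, searches the whole array rather than a range [L, R], and assumes that the character U+10FFFF is larger than every symbol of the text. The function name sa_interval is hypothetical.

```python
from bisect import bisect_left, bisect_right

def sa_interval(text, pattern):
    """Half-open range [lo, hi) of suffix-array ranks whose suffix starts with pattern."""
    # Materialising all suffixes costs quadratic space; a real index would
    # compare against text[SA[r]:] lazily instead.
    suffixes = sorted(text[p:] for p in range(len(text)))
    lo = bisect_left(suffixes, pattern)
    # Appending the largest code point gives an upper bound for all suffixes
    # that start with pattern (assumption: it never occurs in the text).
    hi = bisect_right(suffixes, pattern + "\U0010ffff")
    return lo, hi

lo, hi = sa_interval("accactact$", "ac")
```

Here lo plays the role of the leftmost matched position, and hi − lo is the term frequency of the pattern; an empty range (lo equal to hi) signals a mismatch.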

When a mismatch occurs, the algorithm uses the previously computed index to determine the suffix link. Thus the search will be redirected to the new branch following the suffix link. The algorithm now uses the longest suffix of the sub-pattern that has so far matched as the new sub-pattern, and re-starts matching from the symbol that mismatched. Determining this point where matching should re-start is a constant time operation.

The foregoing implies that, given a sequence T = T[1...n], with symbols from an alphabet Σ, and the PSA for T, we can decide whether a pattern P = P[1...m] is generated by the same variable length Markov chain that generated T in O(m log(n/|Σ|)) time.

We can use the predicted probability above to perform protein sequence classification. Suppose we have F protein families and we have computed the PSA for each family. Let PSA_k be the model constructed using the k-th protein family. Further, let P_VLMM(P, PSA_k) be the probability that protein sequence P is generated by PSA_k, as returned by Algorithm VLMM-PREDICTION. Then, we classify P to protein family f, where f is given by:

$$f = \arg\max_{k = 1 \ldots F} \{ P_{VLMM}(P, PSA_k) \}. \qquad (4.3)$$
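The decision rule in Equation (4.3) can be sketched in Python with a naive model in place of the PSA. The code below estimates probabilities by direct counting in a training string, works in log space to avoid underflow (which leaves the argmax unchanged), and treats max_order as the maximum context length; these choices, the names log_p_vlmm and classify, and the toy families are assumptions made for illustration only.

```python
import math

def log_p_vlmm(train, pattern, max_order=3):
    """Log-probability of pattern under a naive VLMM estimated from train."""
    def tf(s):
        return sum(train.startswith(s, i) for i in range(len(train) - len(s) + 1))
    total = 0.0
    for t, symbol in enumerate(pattern):
        context = pattern[max(0, t - max_order):t]
        # Back off to the longest suffix of the context seen with this symbol.
        while context and tf(context + symbol) == 0:
            context = context[1:]
        num = tf(context + symbol)
        den = tf(context) if context else len(train)
        if num == 0:
            return -math.inf              # symbol never seen in the training data
        total += math.log(num / den)
    return total

def classify(pattern, families):
    """Return the family whose model maximises the (log) P_VLMM of pattern."""
    return max(families, key=lambda name: log_p_vlmm(families[name], pattern))

families = {"A": "accactact", "B": "ggtggtggt"}   # hypothetical training data
label = classify("actac", families)
```

In practice each family would be represented by its own PSA, and the per-symbol probabilities would come from Algorithm VLMM-PREDICTION rather than from repeated scans of the training text.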

We summarize the foregoing in the following theorem, the second major contribution of this work on the PSA:

Theorem 4.2: Given a sequence T = T[1...n], with symbols from an alphabet Σ, where Σ = σ1σ2...σ|Σ|, and the probabilistic suffix array (PSA) for T, we can decide whether a pattern P = P[1..m] is generated by the same variable length Markov chain that generated T in O(m log(n/|Σ|)) time.

Proof: We observe that whenever a mismatch occurs, the start position of the sub-pattern is increased by one. The number of mismatches is at most m. Similarly, the number of matching positions will be at most m. Thus, the loop in Lines 2-17 will run at most 2m times. Line 12 calls the function LeftmostMatchedPosition, which uses standard suffix array search algorithms [3, 84] that run in O(log n) time per call. The function SearchPSA in Line 13 searches the PSA data structure with the parameters determined in the previous step in O(log M) time, where M is the length of the PSA. Thus, the total running time will be in O(m log n). We improve this time to O(m log(n/|Σ|)) using an extra |Σ| space to record the starting position of each symbol in the suffix array. □

Algorithm 4.7: Prediction with VLMM via the PSA

VLMM-PREDICTION(Pattern,PSA)
1  s ← 1, Prob ← 1, L ← 1, R ← n, index ← 0, m ← ‖Pattern‖
2  for (i ← 1 to m) do
3      Search Pattern[s...i] in PSA.SA with parameter L,R
4      if mismatch do
5          index ← PSA.index.sLink
6          L ← PSA.index.Start, R ← PSA.index.End
7          s ← s+1
8          if i ≤ s−1 do
9              i ← i+1
10         end if
11     else
12         position ← LeftmostMatchedPosition(Pattern[s...i],PSA.SA,L,R)
13         index ← SearchPSA(PSA,position, i−s+1)
14         Prob ← Prob × PSA.index.Conditional Probability
15         L ← PSA.index.Start, R ← PSA.index.End
16     end if
17 end for
18 return Prob

4.5 Space Analysis

In this section, we analyze the space requirement for the PSA structure. We consider the space required during its construction (work space), and for its storage and use.

4.5.1 Storage Space

The basic PSA structure has four types of attributes. The measurement attributes are dependent on the application, and may not be needed every time the PSA is used. We also observe that, though the TF is needed for computing the empirical probabilities required for the later stage of determining the conditional probabilities, we do not need to store the TFs directly. They can be obtained easily using the <Start, End> pairs stored at each node. Thus, we focus on the other attributes. From Table 4.1, we see that the worst case space for the PSA structure is 6n integers plus n characters, or 25n bytes, assuming n ≤ 2^32 and |Σ| = 256. On average, we could save at least n integers by storing the <Start, End> pair as <Start, (End − Start)>, and by using the fact that the maximum value in the Length array will be the maximum LCP value for the sequence, which is known to be in O(log_|Σ| n) [58].

Further, with M = PSA length (number of PSA nodes or intervals), the space for the internal node attributes and suffix link attributes will be 4M integers. In this case, M is a sub-linear function of n. Thus, the ratio γ = M/n, where 0 ≤ γ ≤ 1, is an important measure of the node branching structure of the PSA, and hence the complexity of the original sequence. With larger γ, we need more practical space to store the PSA. At γ < 0.7, which is our observation for the example sequences tested (see Table 3.3), this will result in an average space requirement of (5n + 12nγ), or 13.4n bytes for the PSA.
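The arithmetic behind the two storage figures can be written out as a short Python sketch. The decomposition used here is our reading of the text, not a statement from it: 4-byte integers, one byte per character, n suffix array integers, and 3M node integers after the saving described above (so that 12nγ = 3 × 4 × M bytes). The function names are hypothetical.

```python
def psa_worst_case_bytes(n, int_bytes=4):
    """Worst case from the text: 6n integers plus n characters."""
    return 6 * int_bytes * n + n

def psa_average_bytes(n, gamma, int_bytes=4):
    """Average case (5n + 12n*gamma): text, suffix array, and 3M node integers."""
    m = gamma * n                          # M = gamma * n PSA nodes
    return n + int_bytes * n + 3 * int_bytes * m

per_symbol_worst = psa_worst_case_bytes(1)
per_symbol_avg = psa_average_bytes(1, 0.7)
```

Evaluating both functions per input symbol reproduces the 25n and 13.4n byte figures quoted above.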

4.5.2 Construction Space

There are five steps to build the PSA and compute the VLMM probabilities. We list the work space in each step and analyze the worst case space over all the construction steps.

1. Building Interval Array and Computing TF, DF

The space requirement will depend on the following:

(a) Input: Text of size n, suffix array (n integers) and LCP Array (n integers).

(b) Work space (in integers): Stack (2n), DF (n), doclink (Z) and docsp (Z), where Z is the number of documents.

(c) Output: interval array with < Start,End > and Length, n integers each.

The total space will be (7n+2Z) integers plus n characters for the text.

Memory reuse: The interval <Start, End> and Stack can share the same space. Stack is the space which stores the yet-to-be-computed non-trivial lcp-delimited intervals. In the extreme case, the length of Stack is at most n, since there are at most n non-trivial lcp-delimited intervals. So when a non-trivial lcp-delimited interval has been processed, it will be stored in the interval <Start, End>. The length of the interval <Start, End> is at most n. Thus, the interval <Start, End> records the calculated non-trivial lcp-delimited intervals. However, the sum of the lengths of the interval <Start, End> and Stack is at most n. Thus, the interval <Start, End> and Stack can share the same space. The maximum space required will then be (5n+2Z) integers plus n characters.

2. Building the Interval Tree


(a) Input: Text of size n, suffix array, interval array < Start,End > and Length.

(b) Work space and output (in integers): posInterval (n), parent (n), pN (n) and Stack(n).

The total space needed for this step is 8n integers plus n characters.

Memory reuse: The arrays Stack and pN can share the same n integer space, since pN is a stack that indicates the nodes that have been processed, while Stack is the space that indicates those not yet processed. These are mutually exclusive, with a combined length n. The maximum space will be 7n integers plus n characters.

3. Calculating Suffix Links


(a) Input: Text of size n; one n-integer array each for SA, Start, Length, posInterval, and parent.

(b) Work space and output (in integers): inverse suffix array W (n) and suffix link(n).

The total space for this step is 7n integers plus n characters.

Memory reuse: The suffix link array and posInterval share the same n-integer space. In the algorithm BUILDSUFFIXLINK, the array posInterval is never used again after line 7, so its space can be reused for the suffix links. The maximum space is thus 6n integers plus n characters.


4. Sort PSA


(a) Input: Text of size n; one n-integer array each for Start, Length, and SuffixLink.

(b) Work space and output: 8 counting arrays (n integers each), namely: Count, Order, wOrder, NewStart, NewEnd, NewLength, NewsLink, NewPattern.

Memory reuse: At most two additional arrays are needed at any given time. Thus, the maximum work space is 5n integers plus n characters.

5. Computing the Conditional Probabilities

This is an in place algorithm, requiring no extra space.

From the above analysis, the maximum space required during PSA construction is 7n integers plus n characters, where we have assumed that n ≫ Z. We can use an argument similar to that made for the storage space to save n integers, and apply the γ ratio to get an average-case construction space requirement of 5.15n integers, or 20.6n bytes.
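As a quick check of the arithmetic above, the following Python sketch collects the per-step peak array counts after memory reuse (in units of n, dropping the 2Z term under the n ≫ Z assumption) and converts the quoted average of 5.15n integers to bytes. The 4-byte integer size is an assumption implied by the 5.15n-integer and 20.6n-byte figures, and all names are illustrative only.

```python
# Peak number of n-integer arrays per construction step after memory reuse,
# as stated in the analysis above (the 2Z term is dropped, assuming n >> Z).
PEAK_INT_ARRAYS = {
    "interval array, TF, DF": 5,
    "interval tree": 7,
    "suffix links": 6,
    "sort PSA": 5,
}

BYTES_PER_INT = 4  # assumption: 32-bit integers


def worst_case_int_arrays(peaks=PEAK_INT_ARRAYS):
    """Return the largest per-step peak, in units of n integers."""
    return max(peaks.values())


def ints_to_bytes(int_arrays, bytes_per_int=BYTES_PER_INT):
    """Convert a count of n-integer arrays to bytes per input symbol."""
    return int_arrays * bytes_per_int


print(worst_case_int_arrays())        # peak over all construction steps
print(round(ints_to_bytes(5.15), 2))  # average-case bytes per input symbol
```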

The above PSA space requirement can be compared with the space requirement of the PST. The use of reverse suffix links [9] implies that the PST will require at least 37n bytes on average (assuming Ukkonen’s suffix tree construction algorithm), without counting other auxiliary structures needed for the PST construction.

4.6 Experiments

We performed experiments on protein sequences to test the proposed data structure. The experiments were performed on a DELL PC with 4 × 2.67 GHz CPUs and 8 GB of memory, running the Ubuntu 10.10 Linux operating system. All programs were compiled using gcc.

4.6.1 Predicting Protein Families

Our major objective was to develop a time- and space-efficient alternative to the PST. However, to place our results in the correct context, we must first verify that the proposed PSA produces an equivalent performance in protein sequence modeling and prediction, when compared with the original PST. Thus, to be able to compare the PSA results with previous approaches using the PST [17, 71], we used the same protein sequence dataset that was used in [17] and in [71]. We downloaded the Pfam database [15, 39], release 1.0. The database contains 175 families originally derived from SWISSPROT 33. We use family members from the unaligned sequences to generate the PSA, one PSA for each family. To test the performance in modeling the protein sequences and in predicting the family for unknown sequences, we used leave-one-out cross validation. For each sequence in the SWISSPROT 33 database, we compute P_VLMM using the PSA structure for each family. We then assign the protein sequence to the family with the maximum probability. When the maximum probability is obtained using the model of the correct family (whose PSA is generated without the test sequence), we say we have a correct classification (true positive); otherwise, there is a classification error. This simple approach avoids the difficult problem of setting thresholds for correct classification.

Table 4.4 shows the classification performance using the PSA. For comparison, we have also included the results obtained using the PST on the same dataset, as reported in [17]. Table 4.5 shows the summary classification performance. As expected, both the PST and the PSA produce comparable results with respect to modeling and classification of protein families. The PST had an average true positive rate of 90.8%, while the PSA had 90.2%. We note the significant differences in the family sizes for the PST and PSA results. Although we used the same general Pfam dataset, there have been various additions to the Pfam database since the original publication of the PST results. On average, the size of the families in our dataset was 128.21, while the size used for the PST was 97.82. The current dataset used for the PSA was more than 1500 sequences larger than that used for the PST (6539 versus 4891). As can be seen from Table 4.4, most of the cases where the PST performed significantly better than the PSA could be due to this difference in family sizes (see for example, ank, C2, efhand).

One advantage of performing classification using Eqn. (4.3) is that, when there is an error in the classification, we can consider the family with the next highest predicted probability. For instance, we can consider the top-k classification rate, which shows the probability of finding the correct protein family within the first k families, ordered by the probabilities they generate for the test sequence. Fig. 4.3 shows the classification performance of the PSA on some sample families, using the top-k classification rate. We can observe how the classification performance rapidly approaches 100% after the first few k values.
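The top-k rate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each test sequence is assumed to come with a dictionary mapping family names to a predicted score (for example a log-probability, larger meaning more likely), and the function names `rank_families` and `top_k_rate` are hypothetical, not taken from the dissertation. With k = 1 this reduces to the maximum-probability classification rule.

```python
def rank_families(scores):
    """Return family names ordered by decreasing predicted score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_rate(all_scores, true_families, k):
    """Fraction of test sequences whose true family is among the k best-scoring families.

    all_scores    -- list of dicts, one per test sequence: family name -> score
    true_families -- list of the correct family name for each test sequence
    """
    hits = sum(
        1
        for scores, truth in zip(all_scores, true_families)
        if truth in rank_families(scores)[:k]
    )
    return hits / len(true_families)


# Toy usage with made-up scores for three families.
toy_scores = [
    {"a": -1.0, "b": -2.0, "c": -3.0},
    {"a": -5.0, "b": -1.0, "c": -2.0},
]
toy_truth = ["a", "c"]
print(top_k_rate(toy_scores, toy_truth, 1), top_k_rate(toy_scores, toy_truth, 2))
```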

[Figure 4.3 plot. x-axis: Top k (0 to 10); y-axis: True Positive Percentage (0 to 100); series: 7tm_1, adh_short, cytochrome_b_C, efhand.]

Figure 4.3. Top-k classification rate for sample protein families using the PSA.

4.6.2 Space Consideration

A major problem with suffix trees is their practical memory space requirement. Although they have the same theoretical linear space requirement as suffix arrays, in practice, suffix trees consume much more space [3, 46]. This was our primary motivation for developing the PSA. Table 4.6 shows the summary data on the protein families in Pfam used in our experiments. Table 4.7 compares the memory space required to construct the PST [17] and SPST [71] with that required for the PSA.

We have included results for PST-20, PST-FULL, and SPST [71]. PST-FULL corresponds to the complete PST with no pruning, i.e. with the full string depth for each leaf node. PST-20 corresponds to the order-20 PST, i.e. a PST with a maximum string depth of 20 symbols. This was the variant used in [17]. The SPST proposed in [71] also involved some pruning of the suffixes. First, we can observe the nature of the protein sequences (Table 4.6). The maximum branching factor (γ = M/N) observed was 0.75, while the minimum was 0.46. The average was 0.62. As a key performance measure, we used the memory consumption factor (MC Factor), defined as the ratio of the required memory to the total sequence length (N) of the family. We compared the MC Factor for PSA, PST, and SPST. The table shows that the PSA ratio was steady at about 33.16N bytes. The maximum memory required by the PSA for any of the families was 53.46N bytes. This can be compared with the (mean and maximum) memory needed for PST-20 (41.67N, 222.12N), PST-FULL (167.47N, 1216N) and SPST (67.17N, 111.53N). Fig. 4.4 shows more detailed information on the memory consumption needed to construct the data structures, using the Pfam protein families. Perhaps more significantly, while the PSA memory is relatively constant, independent of the sequence or family, we can observe the huge fluctuation in the memory needed for the PST and SPST, as captured by the range and standard deviation of the memory consumption factor.

[Figure 4.4 plot. x-axis: File Size (K) (0 to 140); y-axis: MC Factor (0 to 1200); series: PSA, PST-20, PST-FULL, SPST.]

Figure 4.4. Memory consumption factor (MC Factor) needed to construct the PSA and PST data structures for the first 51 protein families in Pfam.

4.6.3 Computational Time Requirement

Table 4.8 shows the summary of the time required for constructing the PSA and the PST data structures. Table 4.9 shows the corresponding summary of the time needed for prediction using the models. The tables show that for prediction, on average, the PSA is about 2.5 times faster than PST-20. The PSA was much faster at the construction stage: it was about 3 times faster to build than PST-20, and about 250 times faster than PST-FULL. The speedup could be related to the fact that the PSA requires a much smaller construction space, and thus less time is spent on moving data in memory.

4.6.4 PSA in Phylogenetic Tree Construction

Ron et al [109] and Bejerano and Yona [17] have enumerated various applications of the PST. As indicated earlier, the PSA is simply a more efficient alternative to the PST. Thus, the PSA can be used anywhere the PST is used. To study this versatility of the PSA/PST further, we considered a new application of the PSA, specifically its use in the problem of phylogenetic tree construction using mtRNA or mtDNA sequences. We used the mtDNA sequences from 20 species, namely, human (Homo sapiens, V00662), common chimpanzee (Pan troglodytes, D38116), pygmy chimpanzee (Pan paniscus, D38113), gorilla (Gorilla gorilla, D38114), orangutan (Pongo pygmaeus, D38115), gibbon (Hylobates lar, X99256), baboon (Papio hamadryas, Y18001), horse (Equus caballus, X79547), white rhinoceros (Ceratotherium simum, Y07726), harbor seal (Phoca vitulina, X63726), gray seal (Halichoerus grypus, X72004), cat (Felis catus, U20753), fin whale (Balaenoptera physalus, X61145), blue whale (Balaenoptera musculus, X72204), cow (Bos taurus, V00654), rat (Rattus norvegicus, X14848), mouse (Mus musculus, V00711), opossum (Didelphis virginiana, Z29573), wallaroo (Macropus robustus, Y10524) and platypus (Ornithorhynchus anatinus, X83427). This is the same dataset previously used in constructing phylogenetic trees by Otu et al [99] and Li et al [73]. This is a challenging dataset, and there has been some debate on the position of some of the species [24, 105].

To construct the phylogenetic tree, we use the PSA to compute a dissimilarity measure between every pair of species in the dataset. First we construct the PSA for the mtDNA sequence of each species. Then, for a given species, we compute P_VLMM, the probability that the given species is generated by the PSA constructed from each of the other species. After getting the probabilities, we use the quantity λ(A,B) = −log P_VLMM(A, PSA_B) as the dissimilarity measure between the sequences from two species A and B, where P_VLMM(A, PSA_B) is the predicted probability that sequence A is generated by the model represented by the PSA of sequence B. We then use the measurements λ(A,B) for all pairs of species to construct the phylogenetic tree.
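The pairwise computation of λ(A,B) can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the callable `log_prob(a, b)` is a hypothetical stand-in for log P_VLMM(A, PSA_B), and working directly with log-probabilities is an assumption made here because the probability of a long sequence would underflow in floating point. Note that λ(A,B) and λ(B,A) are stored separately, since the definition is not symmetric.

```python
import math


def dissimilarity_matrix(names, log_prob):
    """Return {(a, b): lambda(a, b)} for all ordered pairs a != b.

    log_prob(a, b) is assumed to return log P_VLMM(a, PSA_b), i.e. the
    log-probability that sequence a is generated by the model built from b.
    """
    lam = {}
    for a in names:
        for b in names:
            if a != b:
                lam[(a, b)] = -log_prob(a, b)
    return lam


# Toy usage: a fake model that assigns probability 0.5 to every pair.
species = ["human", "gorilla", "cow"]
toy = dissimilarity_matrix(species, lambda a, b: math.log(0.5))
print(len(toy), toy[("human", "cow")])
```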

Fig. 4.5 shows the phylogenetic tree constructed using the PSA. The results are generally in agreement with earlier work on this dataset (see [73, 99] for example). The only major difference is in the placement of cow, which is supposed to be closer to blue whale and fin whale than to mouse and rat. This is a very encouraging result, especially given that the PSA approach is completely alignment free. We believe a more detailed study of the use of the PSA in phylogenetic tree construction (for instance, deriving other features or measures of similarity) could further improve the results.

[Figure 4.5 tree. Leaf labels, in plotted order: horse, w rhino, cat, g seal, h seal, baboon, gorilla, human, com chimp, pig chim, gibbon, orangutan, b whale, f whale, cow, mouse, rat, platypus, opossum, wallaroo.]

Figure 4.5. Phylogenetic tree for 20 species constructed using the predicted probabilities obtained using the PSA.

4.7 Summary

We have presented the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov models. The PSA provides the same functionality as the probabilistic suffix tree (PST), but at a significantly reduced time and space requirement. Given a sequence of length N, construction and learning in the PSA is done in O(N) time and space, independent of the Markov order. Prediction using the PSA is performed in O(m log(N/|Σ|)) time, where m is the pattern length and Σ is the symbol alphabet. The specific memory requirement for PSA construction is 33N bytes in the worst case, and 26N bytes on average, including space for the suffix array and the input sequence. This can be compared with the 41N bytes needed using the PST.

We have shown experiments in computational biology. The first experiment compares the PSA with the PST [17] and SPST [71] on the same data set. The PSA is more space-efficient than PST-FULL and SPST, and its space is close to that of PST-20, which only stores paths up to depth L = 20. The construction time of the PSA is significantly lower than that of the PST and SPST, and the prediction times of the three methods (PSA, PST and SPST) are similar. The other experiment is phylogenetic tree construction using the PSA. This experiment shows a very encouraging result, especially given that the PSA approach is completely alignment free.


Table 4.4. Performance of the PSA in modeling and prediction of protein families. Families correspond to the first 51 protein families with 12 or more members in the Pfam database, ordered alphabetically based on their abbreviated names in Pfam. For comparison, we have included the results obtained using the PST [17] on the same data set. (TP stands for true positive, while MD stands for missed detection). **The family apple was not in the dataset used in [17].

                 ------- PSA -------   ----- PST [17] -----
Family           Size  # MD TP rate    Size  # MD TP rate
7tm_1             530    26   0.951     515    36   0.930
7tm_2              36     2   0.944      36     2   0.944
7tm_3              12     1   0.917      12     2   0.833
AAA                79     9   0.886      66     8   0.879
ABC_tran          330    46   0.861     269    44   0.836
actin             160    22   0.863     142     4   0.972
adh_short         186    48   0.742     180    20   0.889
adh_zinc          129    22   0.829     129     6   0.953
aldedh             69    11   0.841      69     9   0.870
alpha-amylase     114     0   1.000     114    14   0.877
aminotran          63    14   0.778      63     7   0.889
ank               305    60   0.803      83    10   0.880
apple**            16     1   0.938       -     -       -
arf                43     2   0.953      43     4   0.907
asp                72     5   0.931      72    12   0.833
ATP-synt_A         79     1   0.987      79     6   0.924
ATP-synt_ab       183     1   0.995     180     6   0.967
ATP-synt_C         62     1   0.984      62     5   0.919
beta-lactamase     51     2   0.961      51     7   0.863
bZIP               95    14   0.853      95    10   0.895
C2                101    21   0.792      78     6   0.923
cadherin          168    12   0.929      31     4   0.871
cellulase          40     3   0.925      40     6   0.850
cNMP_binding       69     4   0.942      42     3   0.929
COesterase         62     3   0.952      61     5   0.918
connexin           40     3   0.925      40     1   0.975
copper-bind        61     1   0.984      61     3   0.951
COX1               80     4   0.950      80    13   0.838
COX2              114    10   0.912     109     2   0.982
cpn10              58     1   0.983      57     4   0.930
cpn60              84     1   0.988      84     5   0.940
crystall          103     6   0.942      53     1   0.981
cyclin             80    19   0.763      80     9   0.888
Cys-protease       95     1   0.989      91    11   0.879
cystatin           88    11   0.875      53     4   0.925
Cys_knot           61     6   0.902      61     4   0.934
cytochrome_b_C    133     4   0.970     130    27   0.792
cytochrome_b_N    170     4   0.976     170     3   0.982
cytochrome_c      175    10   0.943     175    11   0.937
DAG_PE-bind       108     5   0.954      68     7   0.897
DNA_methylase      57     2   0.965      48     8   0.833
DNA_pol            51    12   0.765      46     9   0.804
dsrm               22    14   0.364      14     2   0.857
E1-E2_ATPase      117     2   0.983     102     7   0.931
efhand            739    96   0.870     320    25   0.922
EGF               676    33   0.951     169    18   0.893
enolase            41     2   0.951      40     0   1.000
fer2               88    15   0.830      88     5   0.943
fer4              156    34   0.782     152    18   0.882
fer4_NifH          49     2   0.959      49     2   0.959
FGF                39     2   0.949      39     1   0.974


Table 4.5. Summary performance in protein family classification using the PSA and PST.

         ----------- PSA -----------    ----------- PST -----------
            Size     # MD   TP rate        Size     # MD   TP rate
Mean     128.216   12.373     0.902      97.820    8.720     0.908
Std      145.798   17.711     0.103      84.778    8.690     0.050
Min       12.000    0.000     0.364      12.000    0.000     0.792
Max      739.000   96.000     1.000     515.000   44.000     1.000
Total       6539      631         -        4891      436         -

Table 4.6. Summary data on the first 51 families in Pfam, as described in Table 4.4.

        Protein Family   Length        # of Internal   γ = M/N   Min LCP   Max LCP
        Size             (N)           nodes (M)
Mean    128.216          22173.451     252.353         0.616     190.608   14.659
Std     145.798          23359.451     136.862         0.079     139.531   9.198
Min     12.000           1360.000      68.000          0.460     26.000    3.987
Max     739.000          140744.000    753.000         0.750     719.000   49.912

Table 4.7. Construction memory needed for the PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4.

        PSA    MC Factor   PST-20   MC Factor   PST-FULL   MC Factor   SPST   MC Factor
Mean    616    33.16       425      41.67       1577       167.47      1068   67.17
Std     655    8.23        116      42.68       125        211.22      541    23.77
Min     71     16.74       263      4.33        1455       12.77       79     18.53
Max     4523   53.46       1023     222.12      2235       1216        2547   111.53

Table 4.8. Construction time comparison for PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time is time needed per family (in seconds). Speedup is computed as the ratio with respect to PSA time.

        PSA     PST-20   Speedup   PST-FULL   Speedup    SPST     Speedup
Mean    0.092   0.244    3.150     12.648     250.967    1.825    1.740
Std     0.106   0.231    1.297     9.984      238.806    2.701    2.617
Min     0.005   0.016    0.854     3.780      20.917     0.047    0.036
Max     0.545   1.376    6.895     56.540     1076.000   17.121   16.658


Table 4.9. Prediction time comparison between PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time (in seconds) is prediction time per family, i.e. total time needed to predict all members in the family against all the other families.

        PSA     PST-20   Speedup   PST-FULL   Speedup   SPST     Speedup
Mean    87.71   98.83    2.52      113.25     2.79      207.03   3.4
Std     93.89   66.4     3.28      74.47      3.46      130.93   2.39
Min     5.00    59.23    0.25      21.47      0.32      33.50    1.03
Max     576     528      20.13     579.3      18.86     600      17.6

Chapter 5

Circular Pattern Matching

5.1 Introduction

The CPM problem was first introduced in 1980 [19]. Since then, variations of the circular pattern matching problem have been studied. The first variant [46, 115] is the exact circular pattern matching (ECPM) problem. This problem is to find all occurrences of a circular pattern P in a text T without any error. The second variant [44, 81] is the approximate circular pattern matching problem. This problem allows some error between the circular pattern P and the text T. According to the definitions in Chapter 2 on related work, most approaches have focused on the existential query for the ACPM1 problem.

The ACPM2 problem is: given a text T, a circular pattern P and a maximum error k, return all positions where the circular string [P] matches the text T with at most k errors. [P] is said to be a k-approximate match with text T at position j ∈ [1...n−m−k+1] if ED_c(P, T[j...j+m]) ≤ k, where 0 ≤ t ≤ m−1, −k ≤ n−m. This is clearly more difficult than the ECPM problem or the ACPM1 problem for existential and counting queries.

Main Results. Our main goal is to solve the ECPM and ACPM2 problems for existential queries, which were introduced under related work. We then define and solve the ECPD and ACPD problems to find “interesting” circular patterns, as defined using specified constraints.


In this chapter, we present a new algorithm to solve the ECPM problem. This algorithm runs in linear time and space. To date, this is the best ECPM algorithm with respect to time and space complexity. We also present four algorithms to solve the ACPM2 problem. Three of the algorithms report complete results and one is a greedy (suboptimal) algorithm reporting an incomplete result. We compare our algorithms with other algorithms in the literature, such as those reported by Maes [81], Gregor [44] and Uliel [123], which were introduced in related work. On average, our ACPM2 algorithm provides the best result for the ACPM2 problem with respect to time and space complexity.

The following three theorems represent our main contributions on the CPM problem.

Theorem 5.1: Given a text T = T[1..n] and a circular pattern P = P[1..m], with symbols from an alphabet Σ, the ECPM algorithm can solve the ECPM problem in O(m log |Σ|) worst case time after constructing the suffix tree and suffix links in O(n) time and space complexity.

Theorem 5.2: Given a text T = T[1...n] and a circular pattern P = P[1...m], with symbols from an alphabet Σ, Algorithm ACPM2 solves the ACPM2 problem in O(km^2 n) time, and O(n) space.

Theorem 5.3: Given a sequence database SeqDB, with Z sequences and N total symbols (|Σ| could be O(N)), Algorithm ACPM2 solves the all-against-all ACPM problem in O(k m_a N) time on average, and O(k m_m N^2) time in the worst case, using O(N) worst case space, where m_a = N/Z and m_m is the length of the longest sequence in SeqDB.

Organization. In the next section, we present a linear-time linear-space algorithm for the ECPM problem. Algorithms for the ACPM2 problem are presented and analyzed in Section 3. In Section 4, we show experiments on analyzing circular permutations in multidomain proteins using our algorithms. Based on the results, we perform protein function prediction for multidomain proteins. In Section 5, we summarize our work on circular pattern matching problems.


5.2 Exact Circular Pattern Matching Problem

In this section, we present an algorithm to solve the ECPM problem in linear time and linear space. Our method is index-based and can be built on the space-efficient virtual suffix tree (VST) proposed earlier in Chapter 3. The key to our method is the suffix link provided by the suffix tree and the VST. To our knowledge, this is the first time that the suffix link has been exploited to solve the circular pattern matching problem.

Notation. Before describing the algorithm, we assume there is a sequence database SeqDB with Z sequences. The total number of symbols in SeqDB is N. Let SeqDB[i] be the i-th sequence in SeqDB, where 0 < i ≤ Z. Let m_i be the length of SeqDB[i]. The average number of symbols per sequence in SeqDB is m_a = N/Z. Let k be the allowed error in a match. In this database SeqDB, the alphabet size is |Σ|.

5.2.1 Linear Time ECPM Algorithm

The index-based approach was first introduced by Iliopoulos and Rahman [51]. However, their methods were quite complex and difficult to implement. The best time complexity of Iliopoulos and Rahman’s algorithms is O(N log N). Our algorithm is relatively simple, with lower time and space complexity. First we give an example of ECPM and use this to explain our algorithm. Then we present a formal algorithm for the ECPM problem with input pattern P and ST, the suffix tree constructed from T. We build on this algorithm to solve the all-against-all version of the ECPM problem, given a database of sequences.

Examples. In Chapter 3, we showed the suffix tree for the string T = missississippi$. We use this example again here. Figure 5.1 shows the suffix tree for T with a few suffix links. Let the circular pattern be P = iss, so [P] = {iss, ssi, sis}, where f^0(P) = iss, f^1(P) = ssi and f^2(P) = sis. f^(i+1)(P) is obtained by removing the first symbol from f^i(P) and appending it at the end, where 0 ≤ i < m−1. So in our algorithm, once we have matched f^i(P), we use the suffix link to find the first (m−1) symbols of f^(i+1)(P). This operation is done in constant time. We then only need to compare the last symbol of f^(i+1)(P) with the corresponding position in the suffix tree; if the symbols match, we report a match for f^(i+1)(P).
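To make the rotation notation concrete, here is a small brute-force reference in Python. It is not the suffix-tree algorithm of this section, only a direct check of the definition that can serve as a baseline; the function names are illustrative, and positions are reported 1-based to match the text.

```python
def rotations(p):
    """Return [f^0(p), f^1(p), ..., f^(m-1)(p)], the circular shifts of p."""
    return [p[i:] + p[:i] for i in range(len(p))]


def ecpm_naive(text, pattern):
    """Return the 1-based positions in text where some rotation of pattern occurs."""
    m = len(pattern)
    rots = set(rotations(pattern))
    return [j + 1 for j in range(len(text) - m + 1) if text[j:j + m] in rots]


T = "missississippi$"
print(rotations("iss"))      # the three circular shifts of iss
print(ecpm_naive(T, "iss"))  # positions of all circular matches of iss
print(ecpm_naive(T, "ism"))  # positions of all circular matches of ism
```

On T = missississippi$, the matches for P = iss fall at positions 2 through 9, and the only match for P = ism is at position 1.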


In our algorithm, we use the suffix link to match the circular pattern in an incremental manner. For example, we search for f^0(P) = iss first. In this case, we find iss on the edge between node N1 and node N4, thus there is a match. We then match f^1(P) = ssi by following the suffix link from node N4 to node N6. Thus we get the next matching edge at internal node N6 by using the “skip/count” method [46]. This operation takes constant time. To look for the matches to f^2(P) = sis, we use the suffix link from node N6 to node N5. The prefix si has been matched up to node N5, so we only check the outgoing edges from node N5 to its children, and select the one whose first symbol matches symbol s. We find that the edge between node N5 and N8 matches symbol s. So all circular matches of [P] have been found, with positions given by the leaf nodes of N4, N6 and N5 respectively. From position 2 to position 9 in T, there are eight substrings that match the circular pattern P = iss.

We give another, more complex example: searching for f^0(P) = ism. First, we find the prefix is on the edge between node N1 and node N4, but there is a mismatch at the last symbol m. To search for f^1(P) = smi, we follow the suffix link from node N4 to node N6. The previous iteration matched only two symbols, so this iteration starts from the second symbol of f^1(P) = smi, which is m. However, the length of the path from the root to N6 is 3, which is larger than 1, so this search starts from node N3, which is the parent node of N6. Then we check the second symbol, but it is still a mismatch. Thus we know that f^1(P) = smi does not occur in T. Now to search for f^2(P) = mis, we again follow the suffix link from node N3 to the root, and continue the match from the root. Thus, we find a match of f^2(P) = mis on the path to leaf node 6. Therefore, the algorithm reports one occurrence at leaf node 6 of the suffix tree, which corresponds to position 1 in T.

ECPM Algorithm Description

Algorithm ECPM (Algorithm 5.1) shows the pseudo code for our exact circular pattern matching algorithm. In this algorithm, the input ST is the suffix tree for the text T, where T denotes the given text (string, or sequence) to be searched. For a given pattern P of length m, the algorithm derives a new pattern PP by repeating P and removing the last character of the second copy. That is, PP = P[1...m] ◦ P[1...m−1]. The new pattern PP therefore has a length of 2m−1.
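The reason PP suffices is that its m windows of length m are exactly the m circular shifts of P. A short Python sketch of this observation follows; the helper names are illustrative and not part of Algorithm 5.1.

```python
def doubled_pattern(p):
    """Return PP = P[1..m] followed by P[1..m-1], a string of length 2m-1."""
    return p + p[:-1]


def windows(pp, m):
    """Return all length-m substrings of pp, in left-to-right order."""
    return [pp[i:i + m] for i in range(len(pp) - m + 1)]


P = "iss"
PP = doubled_pattern(P)
print(PP, len(PP))          # the doubled pattern and its length 2m-1
print(windows(PP, len(P)))  # one window per circular shift of P
```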


In the suffix tree ST, the algorithm searches for PP starting from the root of ST, from the leftmost branch to the rightmost. The iteration variable i indicates the starting position in PP.

The variable len indicates the position in the node label. When len is equal to 1, a new child node is created for the current node. Line 4 in the algorithm represents the process of finding the right child for the current node. This operation requires O(log |Σ|) time for comparison. If len is larger than 1, the matching operation takes place in the current node.

The variable top is the pointer to the current circular pattern. If the length of the matched pattern PP[top...i] is m, a circular pattern occurs. Since the number of possible circular patterns is m, pointer top will never be larger than m. If pointer top is larger than i, the symbol pattern[top] never occurred in the text. Thus, this set of circular patterns cannot be in the text. The pointer top is increased by one in the following two cases.

The first case is when a mismatch occurs in the current node. In this situation, the current node is replaced by the node pointed to by its suffix link. At the same time, the string depth for the current node decreases by one and pointer top increases by one. After this, the next iteration of the string matching process starts. The other case is when the path length from the root to the current node is m. This indicates that one of the circular patterns occurs in the text. Before matching the next position of PP, pointer top increases by one to keep the length of the pattern equal to m.
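The two cases above (shorten the match through a suffix link on a mismatch, and drop the leading symbol once m symbols are matched) can be illustrated with a runnable sketch. Note the substitution: the sketch below uses a suffix automaton of T in place of the suffix tree with suffix links, because it fits in a few dozen lines; it is not Algorithm 5.1. It also answers only the existential question of which rotations f^i(P) occur in T, and does not report text positions. All names are illustrative.

```python
def build_suffix_automaton(text):
    """Return (transitions, suffix_link, max_len) for the suffix automaton of text."""
    trans, link, max_len = [{}], [-1], [0]
    last = 0
    for ch in text:
        cur = len(trans)
        trans.append({})
        link.append(-1)
        max_len.append(max_len[last] + 1)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if max_len[p] + 1 == max_len[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter maximum length.
                clone = len(trans)
                trans.append(dict(trans[q]))
                link.append(link[q])
                max_len.append(max_len[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans, link, max_len


def matching_rotations(text, pattern):
    """Return the offsets i (0-based) such that the rotation f^i(pattern) occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    trans, link, max_len = build_suffix_automaton(text)
    pp = pattern + pattern[:-1]  # PP of length 2m-1
    state, matched, found = 0, 0, []
    for i, ch in enumerate(pp):
        # Case 1: mismatch, so shorten the current match via suffix links.
        while state != 0 and ch not in trans[state]:
            state = link[state]
            matched = max_len[state]
        if ch in trans[state]:
            state = trans[state][ch]
            matched += 1
        else:
            matched = 0  # ch does not occur in text at all
        # Case 2: keep the matched window at most m symbols long.
        while link[state] != -1 and max_len[link[state]] >= m:
            state = link[state]
            matched = max_len[state]
        matched = min(matched, m)
        if matched == m:
            found.append(i - m + 1)
    return found


print(matching_rotations("missississippi$", "iss"))
print(matching_rotations("missississippi$", "ism"))
```

For P = ism the sketch follows the same course as the worked example: the match of is fails on m, the search falls back to the root, and only the rotation mis is found.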

Based on the ECPM algorithm, we develop an algorithm to solve the all-against-all problem. That is, we compute ECPM(SeqDB[i], SeqDB[j]), ∀i, j, i ≠ j. Algorithm ALL-VS-ALL-ECPM (Algorithm 5.2) takes each sequence in the database in turn as a pattern and searches for circular matches using the ECPM algorithm above. It first constructs the suffix tree ST for the entire database, and then uses each sequence as a pattern to search in ST.

Algorithm Analysis

In the ECPM algorithm (Algorithm 5.1), when the same position in PP is compared again (Line 18), pointer top increases by one. Since top remains less than or equal to m, Line 18 runs at most m times.

The “for” loop from Line 3 to Line 23 runs at most 2m−1 times. Inside this loop, the cost of Line 4 is at most O(log |Σ|). The other lines inside this “for” loop have a constant running time. Adding the at most m repeated comparisons at Line 18 to the 2m−1 loop iterations gives at most 3m−1 steps, so the time cost of this algorithm is at most log |Σ| × (3m−1) = O(m log |Σ|). Given the relation between the ST and VST as described in Chapter 3, the proposed algorithm can easily be modified to use the VST.

The space cost of an implementation using the suffix tree is 32n + 2m−1, where m ≤ n. The space cost of an implementation using the VST is 13.8n + 2m−1.

We summarize the above in Theorem 5.1.

Theorem 5.1: Given a text T = T[1..n] and a circular pattern P = P[1..m], with symbols from an alphabet Σ, the ECPM algorithm can solve the ECPM problem in O(m log |Σ|) worst case time after constructing the suffix tree and suffix links in O(n) time and space complexity.

Algorithm ALL-VS-ALL-ECPM (Algorithm 5.2) builds the ST in O(N) time. The time cost of line 4 is O(m log |Σ|) for a pattern of length m, and the pattern lengths sum to N, so the total cost of the loop from line 2 to line 5 is O(N log |Σ|). The worst case time complexity of this algorithm is therefore O(N log |Σ|). The space complexity of this algorithm is O(N).

5.2.2 Comparison of ECPM algorithms

Table 5.1 shows the comparison of our ECPM algorithm with Iliopoulos and Rahman’s algorithms [51], CPI-I and CPI-II. The table shows that the ECPM algorithm is the best algorithm with respect to time complexity, even when |Σ| → O(N). In terms of space, the ECPM algorithm has the same space complexity as the CPI-II algorithm implemented using the suffix array. The CPI-II algorithm implemented using the compressed suffix array has the best space complexity among these algorithms, but its time complexity is not as good. We note that, in practice, CPI-I needs two suffix trees and various auxiliary arrays, and hence will require significantly more space.


5.3 Approximate Circular Pattern Matching Problem

In this section, we present our algorithms for the ACPM problem. We start with a simple greedy algorithm and then consider a suffix-array based q-gram algorithm for the ACPM problem. First, we introduce a basic LIS algorithm, APM-VIA-LIS (Algorithm 5.3), to find an approximate match of a pattern P in text T. This algorithm does not itself handle circular pattern matching: to use it for the ACPM problem, we have to match all circular shifts f^t(P) against the text T. Next, we propose our algorithms for the ACPM problem and analyze their complexity. The LIS method for pattern matching is used in these algorithms.

The LIS method utilizes the LIS algorithm [46, 50] to calculate the longest common subsequence (LCS) [31, 46] between two sequences. A verification step then checks whether the edit distance between these two sequences is at most k. When we calculate the LIS and LCS, each matched symbol occurs in the LCS, and we can obtain the positions at which the matched symbols occur in the two sequences. We use these positions to check the number of edit operations between two matched symbols. Thus the algorithm reports the edit distance between these two sequences. The time complexity of this algorithm is O((mn/|Σ|) log m). When |Σ| is close to O(m), as is the case for multidomain proteins, the time complexity becomes O(n log m).
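One standard way to obtain an LCS from an LIS computation, in the style of Hunt and Szymanski, is sketched below in Python. It is offered as an illustration of the general technique and is not claimed to be Algorithm 5.3; it returns only the LCS length and omits the edit-distance verification step. For each symbol of the first sequence it lists the matching positions in the second sequence in decreasing order, and the longest strictly increasing subsequence of that list of positions has the length of the LCS. The number of matching pairs is roughly mn/|Σ| for a uniform alphabet, and each is handled with one binary search.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_via_lis(a, b):
    """Return the length of the longest common subsequence of a and b."""
    # positions[ch] lists the indices of ch in b, in increasing order.
    positions = defaultdict(list)
    for j, ch in enumerate(b):
        positions[ch].append(j)

    # tails[r] is the smallest index in b at which a common subsequence
    # of length r + 1 can end, given the prefix of a processed so far.
    tails = []
    for ch in a:
        # Visit matches in decreasing order so that one symbol of a
        # contributes at most one element to an increasing subsequence.
        for j in reversed(positions.get(ch, [])):
            r = bisect_left(tails, j)
            if r == len(tails):
                tails.append(j)
            else:
                tails[r] = j
    return len(tails)


print(lcs_length_via_lis("ABCBDAB", "BDCABA"))
```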

5.3.1 Greedy ACPM Algorithm

Algorithm ACPM-GREEDY (Algorithm 5.4) compares any two sequences, with one as text and the other as circular pattern, in two main steps. The first step is the generation of the LCS. The second step verifies whether the LCS generated in step 1 represents a part of a valid subsequence. These two steps were presented in Algorithm APM-VIA-LIS (Algorithm 5.3).

Table 5.1. Comparison of ECPM algorithms

Algorithm                               Time Complexity                    Space Complexity
ECPM algorithm                          O(N log |Σ|)                       O(N) bytes
CPI-I [51]                              O(N log^(1+ε) N + N log log N)     O(N log^(1+ε) N) bytes
CPI-II [51]                             O(N log N)                         O(N) bytes
CPI-II (Compressed Suffix Array) [51]   O(N log N log N)                   O(N log N) bits


First, Algorithm ACPM-GREEDY (Algorithm 5.4) chooses two sequences, one as text and the other as circular pattern. Given the text T and circular pattern P, the algorithm proceeds in two steps. The first step creates a new pattern PP by concatenating P with itself. The second step calculates the LCS between PP and T and returns the LCS string lcs. This procedure is performed in line 5. This step also verifies the approximate pattern match with parameter k.

This method is greedy (suboptimal): it finds only one occurrence of the pattern, and may fail to detect all the circular patterns present in the text. If there is more than one LCS in T, this method may miss some matches.

Time Complexity Analysis

For the time complexity analysis, we need to consider three cases.

1. For the case of using one sequence as pattern P and the other sequence as text T, the time complexity of getting the LCS (line 5) is O((mn/|Σ|) log m). When |Σ| is close to O(m), as in multidomain proteins, the time complexity becomes O(n log m).

2. For the case of searching for one sequence against a group of sequences (loop from line 2 to line 6), the time complexity is O(∑_{i=1}^{Z} n_i log m) = O(N log m), where N = ∑_{i=1}^{Z} n_i is the total length of all sequences used, Z is the number of sequences, and n_i is the length of the i-th sequence in SeqDB.

3. For the case of searching for a CP among a group of sequences (loop from line 1 to line 7), the time complexity is O(ZN log m), where m is the length of the longest sequence. The final time complexity is O(N^2 log m), since Z = O(N). In our experiment with multidomain proteins, N ≈ 6Z.

5.3.2 ACPM with LIS

Here we present a second algorithm (Algorithm 5.5) for the ACPM problem. This method also uses the LIS algorithm [46, 50] to calculate the LCS [31, 46]. However, the way in which pattern P and text T are constructed is different. More importantly, unlike the greedy algorithm described earlier, this algorithm can detect all the circular patterns. The pseudocode is listed in Algorithm ACPM-LIS. One sequence is used to construct the circular pattern P. Another sequence is used as text T. All possible circular shifts of the pattern are enumerated. The substring subT is extracted from sequence T by a sliding window of size m + k. Finally, each enumerated circular shift of the pattern is searched against the sliding window separately. Assuming the sequence to be searched with the sliding window has length n, there are max{1, n − (m + k) + 1} = O(n) windows to be constructed.

During the search, if a common subsequence of length m − k is found, then a circular pattern occurs in text T. This method reveals all the circular patterns in each sequence. Thus it finds the optimal solution with respect to completeness of the results.
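The enumeration described above can be sketched in Python as follows. This is an illustrative brute-force version under stated assumptions, not Algorithm 5.5 itself: the LCS is computed with the textbook quadratic dynamic program instead of the LIS-based method, indices are 0-based, and the names `lcs_length` and `acpm_candidates` are hypothetical.

```python
def lcs_length(a, b):
    """Textbook dynamic program for the LCS length of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def acpm_candidates(p, t, k):
    """Return (window_start, shift) pairs whose LCS length is at least m - k."""
    m, n = len(p), len(t)
    width = m + k  # sliding window size
    hits = []
    for start in range(max(1, n - width + 1)):
        window = t[start:start + width]
        for shift in range(m):
            rotation = p[shift:] + p[:shift]  # one circular shift of the pattern
            if lcs_length(rotation, window) >= m - k:
                hits.append((start, shift))
    return hits
```

With p = "abcd", t = "xxcdabyy" and k = 0, the only reported pair is window start 2 with shift 2, which corresponds to the rotation "cdab".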


For one sliding window, the time complexity of finding one circular pattern is $O(\frac{m(m+k)}{|\Sigma|} \log m)$ (line 7 to line 9). There are $O(n)$ sliding windows and $O(m)$ circular patterns for one query pattern and one text (line 5 to line 10). Therefore, the time complexity of detecting a circular pattern between one pattern and one text is $O(\frac{m(m+k)}{|\Sigma|} \log m \times mn)$. Examining these terms, we find that k is at most $O(m)$ and $|\Sigma| = O(m)$. Thus, the time complexity can be simplified to $O(m \log m \times mn) = O(m^2 n \log m)$.

For each pattern, the algorithm compares it with the other sequences (line 2 to line 11). The time complexity of comparing each pattern with all other sequences is $O(m^2 \log m \times \sum_{i=1}^{Z} n_i) = O(m^2 N \log m)$, where N is the total length of all sequences and $n_i$ is the length of the i-th sequence in SeqDB.

After considering all patterns (line 1 to line 12), the time complexity becomes $O(\sum_{i=1}^{Z} m_i^2 N \log m)$, where m is the length of the longest sequence. In fact, $\sum_{i=1}^{Z} m_i^2 \le (\sum_{i=1}^{Z} m_i)^2 = N^2$. The final time complexity is therefore $O(N^3 \log m)$. This is the worst case complexity.


5.3.3 ACPM with q-grams and Suffix Array

The q-gram approach [3] is a two-phase method to reveal all approximate patterns. The first phase is the Hypothesis Phase, which determines all potential matching positions using only q-gram substrings of P and T. In the second phase, the Verification Phase, the algorithm verifies each potential matching position to report the correct matches. First we introduce an ACPM algorithm with q-grams, then we present a hybrid algorithm that combines ECPM and ACPM with q-grams. The latter algorithm is more time efficient in practice, but its theoretical time complexity is the same as that of the ACPM algorithm with q-grams.

Figure 5.2 shows the number of hypotheses for different q values in the ProDom database of multidomain protein sequences [32]. Here we used $N = 10^6$. We notice that as q increases, the number of hypotheses decreases quickly. So when q is not very small, e.g. q ≥ 3, the number of hypotheses typically reduces to $O(N)$.

Algorithm Description

Algorithm ACPM-QGRAM (Algorithm 5.6) shows the process. Lines 1-7 are the preprocessing stage. This stage constructs a long concatenated sequence, seq, from all the sequences encountered so far in SeqDB. It also builds an auxiliary array pos, which maintains the relationship between positions in seq and in SeqDB. Line 8 constructs the suffix array for the concatenated sequence.

Lines 9-24 are a loop that generates all of the hypotheses for the q-gram method using the LCP array. Lines 11-13 determine candidate matching positions that have the same q-gram prefix. Line 14 considers each pair of candidate positions obtained with the current q-gram for verification.

Lines 15-22 are the verification algorithm. We use the LIS algorithm to verify the approximate patterns. Constructing the circular pattern is the same as in the previous algorithm: we enumerate the m circular patterns from a sequence one by one. We construct subT from the second sequence T as follows. Assume the q-gram occurs at position y, and let subT be a substring of T of length (m + k) that includes T[y...y+q−1]. The candidate texts are therefore T[y−m−k+q...y+q−1], T[y−m−k+q+1...y+q], ..., T[y...y+m+k−1]. There are (m + k − q + 1) such substrings, one for each start position from y−m−k+q to y.


The time complexity of using LIS to verify a pattern against the substrings of the text that include one matched q-gram is $O(m \log m \times (m + k - q))$. Since $k = O(m)$ and $q = O(m)$, the time complexity is $O(m^2 \log m)$. Each pair in the same group involves $O(m)$ circular shifts of the pattern, thus the time complexity for verifying each pair is $O(m^2 \log m \times m) = O(m^3 \log m)$. There are r groups, group i has $n_i$ elements, and there are $\sum_{i=1}^{r} n_i^2$ pairs. The total complexity is $O(m^3 \log m \times \sum_{i=1}^{r} n_i^2)$. The worst case occurs when r = 1, with time complexity $O(N^2 m^3 \log m)$. For the average case, let $m_a = \frac{N}{Z}$ be the average length of the sequences. Then the time complexity is $O(N m_a^3 \log m_a)$.

Hybrid Algorithm

We can combine the ECPM algorithm (ECPM) and the ACPM algorithm with q-grams (ACPM-QGRAM) for a possible improvement in practical running time. The ECPM algorithm reports the exact circular pattern matches, which should not be computed again when we search for approximate matches. First, the hybrid algorithm uses the ECPM algorithm to find exact circular pattern matches and stores them. Next, the hybrid algorithm uses the q-gram method to generate hypotheses. In the verification phase, the algorithm checks whether each hypothesis has already occurred within the ECPM results. If it has, the verification stage is skipped.

Thus this approach reduces the practical running time, given the reduced number of verifications. Checking each hypothesis takes $O(\log N)$ time. The time complexity is $O(N^2 (m^3 \log m + \log N))$. However, this algorithm needs $O(N^2)$ space to maintain the pair-wise results from the ECPM algorithm.


5.3.4 Improved Algorithm: ACPM with Bidirectional Edit Distance

In this subsection, we propose an algorithm to solve the all-against-all ACPM2 problem. The algorithm uses a two-stage paradigm of hypothesis generation followed by hypothesis verification. After generating the hypotheses using the q-gram filtration method, the algorithm verifies each hypothesis in $O(km)$ time, where k is the maximum error allowed and m is the length of the pattern. This algorithm follows the same general paradigm as the previous algorithm (Section 5.3.3); however, there are significant differences in both the hypothesis generation and the hypothesis verification stages. In practice, this algorithm uses the suffix tree rather than the suffix array.

Filteration via q-grams

The q-gram approach [53] is a filtration method based on the fact that for any two strings that are approximate matches, there must be some exact matching sub-region between them. The problem is how to determine such sub-regions and their length(s). Lemma 5.1 states this fact and shows how to choose the value q, the minimum length of the matching regions.

Lemma 5.1 [14] : Given a text T, a pattern P of length m, and an integer k (0 ≤ k < m), for a k-approximate match of P to occur in T, there must exist at least one q-length block of symbols in P that forms an exact match to some q-length substring in T, where $q = \lfloor \frac{m}{k+1} \rfloor$.
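The filter implied by Lemma 5.1 can be sketched in Python. The sketch uses the usual pigeonhole reading of the lemma, which is an assumption here: P is cut into k + 1 consecutive non-overlapping blocks of length q, and at most k edit operations cannot touch all of them. The function name `passes_qgram_filter` is hypothetical, and the substring test stands in for an index lookup.

```python
def passes_qgram_filter(p, t, k):
    """Necessary condition for a k-approximate match of p somewhere in t."""
    m = len(p)
    if not 0 <= k < m:
        raise ValueError("the lemma assumes 0 <= k < m")
    q = m // (k + 1)  # block length from Lemma 5.1
    # k + 1 consecutive, non-overlapping q-length blocks of the pattern.
    blocks = [p[b * q:(b + 1) * q] for b in range(k + 1)]
    # With at most k edits, at least one block must survive unchanged.
    return any(block in t for block in blocks)
```

A text that fails this test cannot contain a k-approximate match, while a text that passes it still has to be verified.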

Approximate pattern matching based on q-grams uses two phases. The first phase is the hypothesis phase, which identifies all potential matches using q-gram filtering operations. Based on partial exact matching, the algorithm can find $O(N^2)$ potential matches. The second phase is the verification phase. Here the hypothesized potential matches from the first phase are verified to determine whether they are true k-approximate matches. Our ACPM2 algorithm is based on q-gram filtration. First we generate hypotheses for potential approximate circular pattern matches by q-gram filter operations. Then we verify each hypothesis to find the true circular patterns.


ACPM Hypothesis Generation

Algorithm 5.7 presents our hypothesis generation phase using suffix trees. First, it builds the generalized suffix tree ST for the sequence database. Then, the algorithm treats each sequence as a pattern. For each pattern, q is calculated using $q = \lfloor \frac{m}{k+1} \rfloor$, where m is the length of the current pattern. There exist $O(m)$ overlapping q-grams in a given m-length pattern. We search for each overlapping q-gram of the pattern from left to right in the suffix tree ST. Each exact match is a hypothesis with the current q-gram. The use of the suffix tree implies that for each distinct q-gram in P, all the occurrences in T are found as the leaf nodes below the same parent node in the suffix tree. Finding this parent node requires only one q-gram match in $O(q \log |\Sigma|)$ time. Similar to the ECPM algorithm, after searching for the first q-gram, say $q_i = P[i...i+q-1]$, in ST, the next q-gram $q_{i+1} = P[i+1...i+q]$ can be matched incrementally from $q_i$ by using suffix links in constant time (using the function Search-in-ST). Thus, counting all the matching q-grams or locating all the parent nodes for each unique matching q-gram can be done in $O(m \log |\Sigma|)$ time, independent of n or N, where n is the length of the current sequence (the text), and N is the total length of all the sequences in the database (the number of leaf nodes in the generalized suffix tree).

Given pattern P = P[1..m] and text T = T[1..n], there are $O(m)$ q-grams in P and $O(n)$ q-grams in T. Each q-gram of P can produce at most (n − q + 1) exact matches in T, so the number of hypotheses is $O(m(n - q + 1)) = O(mn)$. We notice the added difficulty introduced by the ACPM2 problem. Unlike the ACPM1 problem, where only one occurrence of P in T is required, here we have to verify each of the potential $O(mn)$ hypotheses in order to identify all the occurrences. In the sequence database seqDB, the total length of all sequences is $N = \sum_{i=1}^{Z} m_i$. Each q-gram can potentially produce $O(N)$ exact matches. Thus, the number of hypotheses is in $O(N^2)$. In the worst case, there exist $O(|\Sigma|^q)$ unique q-grams. The number of hypotheses is then $O(|\Sigma|^q \times (\frac{N}{|\Sigma|^q})^2) = O(\frac{N^2}{|\Sigma|^q})$ on average. When q increases, the number of hypotheses decreases exponentially. It will be $O(N)$ when $|\Sigma|$ is $O(N)$.
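To make the notion of a hypothesis concrete, the following Python sketch enumerates the (i, j) pairs for one pattern and one text. It deliberately swaps the generalized suffix tree for a plain dictionary of q-grams, so it does not reproduce the time bounds stated above; the function name `generate_hypotheses` and the 0-based indexing are assumptions made for illustration.

```python
from collections import defaultdict


def generate_hypotheses(p, t, k):
    """Return (i, j) pairs such that t[i:i+q] == p[j:j+q], with q = m // (k + 1)."""
    m = len(p)
    if not 0 <= k < m:
        raise ValueError("the filter assumes 0 <= k < m")
    q = m // (k + 1)

    # Index every q-gram of the text by its start position
    # (a dictionary stands in for the suffix tree here).
    index = defaultdict(list)
    for i in range(len(t) - q + 1):
        index[t[i:i + q]].append(i)

    hypotheses = []
    for j in range(m - q + 1):  # overlapping q-grams of the pattern, left to right
        for i in index.get(p[j:j + q], []):
            hypotheses.append((i, j))
    return hypotheses
```

For p = "abab", t = "xabay" and k = 1 (so q = 2), the sketch yields the hypotheses (1, 0), (2, 1) and (1, 2).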


ACPM Verification Algorithm Description

Our verification algorithm makes use of a novel bidirectional edit distance that uses both the direct sequence and its reverse. Before we introduce our ACPM verification algorithm, we first describe some important characteristics of edit distances.

Lemma 5.2 : Suppose there is an exact matching q-gram common to both the text T = T[1...n] and the pattern P = P[1...m]. Let (i, j) be the positions of occurrence of the q-gram in T and P respectively. If this q-gram is part of a true approximate match between P and T, then the edit distance between T and P is given by $ED(P,T) = ED(P[1...j-1], T[1...i-1]) + ED(P[j+q...m], T[i+q...n])$, where 1 ≤ j ≤ m, 1 ≤ i ≤ n, q ≥ 1.

Proof: There exists one q-length block in exact match between the pattern and the text (Figure 5.3). Since this q-gram is involved in a true approximate match between P and T, it must be involved in the edit distance computation between P and T. Thus the optimal edit path must contain the subpath between these regions of the pattern and the text. P[j...j+q−1] only compares with T[i...i+q−1], so the edit cost of the subpath is zero. We do not need to compare P[1...j−1] with T[i...n], nor do we need to compare P[j...m] with T[1...i−1], because the optimal edit paths do not cross these areas. Figure 5.3 illustrates this using a q-gram exact match (the solid line). Three optimal paths (the dashed lines) pass through point (i, j), but no optimal path can pass through point (i, L) or point (H, j), where L > j and H > i. P[1...j−1] only compares with T[1...i−1], and P[j+q...m] only compares with T[i+q...n]. Hence the edit distance between pattern P and text T is the sum of three components: $ED(P[1...j-1], T[1...i-1])$, $ED(P[j...j+q-1], T[i...i+q-1])$ and $ED(P[j+q...m], T[i+q...n])$. Since $ED(P[j...j+q-1], T[i...i+q-1]) = 0$, the Lemma holds. □

Lemma 5.3 : For unit cost edit operations, $ED(P,T) = ED(P^R, T^R)$, where $P^R$ and $T^R$ are the reversed versions of P and T respectively.

Proof: This Lemma has been proved in the proof of Lemma 2 in [118]. □
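Lemmas 5.2 and 5.3 can be checked numerically on a small example. The Python sketch below uses the textbook unit-cost dynamic program; the strings and the name `edit_distance` are illustrative assumptions, with the shared 3-gram "XYZ" chosen so that it lies on an optimal alignment.

```python
def edit_distance(a, b):
    """Unit-cost edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # match or substitution
        prev = cur
    return prev[-1]


p, t, q = "abcXYZdef", "abdXYZdeef", 3
j = p.index("XYZ")  # 0-based start of the q-gram in the pattern
i = t.index("XYZ")  # 0-based start of the q-gram in the text

whole = edit_distance(p, t)
# Lemma 5.2: distance before the q-gram plus distance after it.
split = edit_distance(p[:j], t[:i]) + edit_distance(p[j + q:], t[i + q:])
# Lemma 5.3: reversing both strings leaves the distance unchanged.
reversed_whole = edit_distance(p[::-1], t[::-1])
```

In this example the three quantities `whole`, `split` and `reversed_whole` coincide, with one substitution before the q-gram and one insertion after it.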

We need to verify each hypothesis generated by Algorithm 5.7. Assume $Q_T$ and $Q_P$ are the q-grams from the text T and the pattern P respectively. Let $Q_T = Q_P = Q$, so there is an exact matching q-gram between the text and the pattern. We present an $O(km)$ verification algorithm to determine whether the q-gram Q is part of a true k-approximate circular match between P and T.

We now describe the idea of bidirectional edit distance, based on which we verify a given hypothesis. Suppose $Q_P$ is a substring of pattern P which starts at position j and $Q_T$ is a substring of text T which starts at position i, where 0 ≤ j ≤ m and 0 ≤ i ≤ n. That is, $Q_P = P[j...j+q-1]$ and $Q_T = T[i...i+q-1]$. Our goal is to compute the circular edit distance between pattern P and the substring of T denoted $subT_i$, where $subT_i = T[i+q-1-m-k...i+m+k]$. First, we construct two strings from the pattern: $P_1 = P[j+q...m] \circ P[1...j-1]$ and $P_2 = P_1^R$. We also construct two strings from the text: $T_1 = T[i+q...i+m+k]$ and $T_2 = T[i-1-m-k+q...i-1]^R$. We compute the edit distance between $P_1$ and $T_1$ using Ukkonen's algorithm [121]. This can be done in $O(km)$ time and returns an array $ED_1$ which contains the minimum edit distance of each row. The value of $ED_1[h]$ indicates the minimum distance between $P_1[1...h]$ and $T_1[1...r]$, where h − k ≤ r ≤ h + k. Similarly, we use the same algorithm to calculate the minimum edit distance array $ED_2$ between $P_2$ and $T_2$. $ED_2[h]$ is the minimum distance between $P_2[1...h]$ and $T_2[1...r]$, where h − k ≤ r ≤ h + k.
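The $O(km)$ bound comes from restricting the dynamic program to a diagonal band of width 2k + 1. The following Python sketch shows that idea in a simplified form. It is a plain banded cut-off computation in the spirit of Ukkonen's algorithm [121], not a transcription of it, and it returns only the final distance (or None) instead of the array of row minima used above. The name `banded_edit_distance` is hypothetical.

```python
def banded_edit_distance(p, t, k):
    """Return ED(p, t) if it is at most k, otherwise None.

    Only cells with |i - j| <= k are evaluated, i.e. O(k * len(p)) cells.
    """
    m, n = len(p), len(t)
    if abs(m - n) > k:
        return None  # the length difference alone already exceeds k
    inf = k + 1      # any value above k is treated as "too large"
    # Row 0 of the table, restricted to the band.
    prev = {j: j for j in range(min(n, k) + 1)}
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if j == 0:
                cost = i
            else:
                cost = min(prev.get(j, inf) + 1,       # deletion from p
                           cur.get(j - 1, inf) + 1,    # insertion into p
                           prev.get(j - 1, inf) + (p[i - 1] != t[j - 1]))
            cur[j] = min(cost, inf)  # cap so out-of-band values stay "too large"
        prev = cur
    result = prev.get(n, inf)
    return result if result <= k else None
```

Cells outside the band can be ignored because any alignment that reaches them already needs more than k insertions or deletions.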

Figure 5.4 shows the comparisons made by the algorithm. Here we have $P_1 = P[j+q...m] \circ P[1...j-1]$ and $P_2 = P_1^R = (P[j+q...m] \circ P[1...j-1])^R = P[1...j-1]^R \circ P[j+q...m]^R$. $P_2$ is matched against $T_2$ in the reverse direction, while $P_1$ is matched against $T_1$ in the regular (forward) direction.

We construct an array ED from $ED_1$ and $ED_2$ as follows.

$$
ED[h] =
\begin{cases}
ED_2[m-q-h] & : 1 \le h \le m-q \\
0 & : m-q+1 \le h \le m \\
ED_1[h-m] & : m+1 \le h \le 2m-q
\end{cases}
\qquad (5.1)
$$

Lemma 5.4 : For a given hypothesis occurring at positions i and j in T and P respectively, $ED(f^{h+j+q-2}(P), subT_i) = ED[h] + ED[h+m-1]$, where 1 ≤ h ≤ m.

Proof : According to Figure 5.4, we construct a new string PP as $P_2^R \circ P[j...j+q-1] \circ P_1$, where $P_2^R = P_1$. Then $PP = P[j+q...m] \circ P[1...j-1] \circ P[j...j+q-1] \circ P[j+q...m] \circ P[1...j-1]$. There are three cases in representing $f^h(P)$. These cases are indicated as numbers 1, 2, and 3 respectively on the double-headed arrows in Figure 5.4.

CASE 1 : h = 1, where $f^{h+j+q-2}(P) = PP[1...m]$. PP[1...m] is constructed from two parts. One is $P_2^R$ and the other is P[j...j+q−1]. The edit distance in this case is ED[1] + ED[m] by Lemma 5.2, where ED[1] is $ED_2[m-q]$ and ED[m] = 0.

CASE 2 : 2 ≤ h ≤ m − q, where $f^{h+j+q-2}(P) = PP[h...h+m-1]$. PP[h...h+m−1] is constructed from three parts. The first is part of $P_2^R$, the second is P[j...j+q−1], and the last is part of $P_1$. From Lemma 5.2, we know the edit distance is ED[h] + ED[h+m−1].

CASE 3 : h = j. This case is similar to the first case, so the edit distance is ED[h]+ED[h+m−1].

In these three cases, we only compute 1 + (m − q − 1) + 1 = m − q + 1 circular shifts of the pattern. We do not calculate the circular shifts $f^r(P)$, where j + 1 ≤ r ≤ j + q − 1. There exists an exact match involving P[j...j+q−1] (the matching q-gram), and $f^r(P)$ does not contain this substring when j + 1 ≤ r ≤ j + q − 1. Therefore, it is not necessary to calculate $f^r(P)$ when j + 1 ≤ r ≤ j + q − 1. □

Lemma 5.5 : For a given hypothesis occurring at positions i and j in T and P respectively, $ED_c(P, subT_i) = \min_{0 \le h \le m-1} \{ED[h] + ED[h+m-1]\}$, where $subT_i = T[i+q-1-m-k...i+m+k]$.

Proof : There are m circular shifts of the pattern P. From Lemma 5.4, we calculate all of the minimum edit distances between the possible circular shifts $f^h(P)$ and the text $subT_i$ in this hypothesis, where h ∈ [0, j] ∪ [j + q, m]. The circular edit distance of P against $subT_i$ is the minimum edit distance of all possible circular shifts against the text $subT_i$ in the current hypothesis. The lemma holds. □
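As a reference point for Lemma 5.5, the circular edit distance can also be computed by brute force, taking the minimum over all m circular shifts. The Python sketch below does exactly that with one full dynamic program per shift. It is far slower than the $O(km)$ verification described here and is meant only as a cross-check on small inputs; both function names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def circular_edit_distance(p, t):
    """Minimum edit distance between any circular shift of p and t."""
    if not p:
        return len(t)  # the empty pattern has a single (empty) shift
    return min(edit_distance(p[s:] + p[:s], t) for s in range(len(p)))
```

For instance, "abcd" and "cdab" are at circular edit distance 0 because "cdab" is itself a circular shift of "abcd".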

Lemma 5.6 : For a given hypothesis, the time complexity of the verification algorithm is $O(km)$.

Proof : Algorithm 5.8 presents the verification process following the above method. Clearly, reversing the strings can be done in linear time, and computing edit distances on the reversed strings does not change the time required. Similarly, $ED_c()$ in Lemma 5.5 can be computed in $O(km)$ time by maintaining an $O(m)$ array to record intermediate results during the computation of the standard edit distance using dynamic programming. Line 4 calls the k-approximate pattern matching algorithm in both directions, in $O(km)$ time using Ukkonen's algorithm [121]. Line 5 implements equation (5.1). Lines 6 to 9 run in $O(m)$ time to check the match by Lemma 5.4. Thus the time complexity is $O(km)$. □

Theorem 5.2: Given a text T = T[1...n] and a circular pattern P = P[1...m], with symbols from an alphabet Σ, Algorithm ACPM2 solves the ACPM2 problem in $O(km^2 n)$ time and $O(n)$ space.

Proof: Suffix tree construction (including suffix links) can be done in linear time and linear space. After constructing the suffix tree, the hypothesis generation phase is performed in $O(m \log |\Sigma|)$ time, independent of n or N, since at this stage we only need a count of the number of hypotheses and the parent nodes of the leaf nodes in the ST that correspond to the start positions of the matching q-grams in the text or database. Given pattern P and text T, there exist $O(mn)$ possible hypotheses. From Lemma 5.6, the time complexity of the verification algorithm is $O(km)$ for each hypothesis. This means that the algorithm solves the ACPM2 problem in $O(km \times mn) = O(km^2 n)$ time. □

Theorem 5.3: Given a sequence database SeqDB, with Z sequences and N total symbols ($|\Sigma|$ could be $O(N)$), Algorithm ACPM2 solves the all-against-all ACPM problem in $O(k m_a N)$ time on average and $O(k m_m N^2)$ time in the worst case, using $O(N)$ worst case space, where $m_a = \frac{N}{Z}$ and $m_m$ is the length of the longest sequence in SeqDB.

Proof: The result follows essentially from Theorem 5.2. For a sequence database seqDB of length N, the hypothesis generation phase is performed in $O(N \log |\Sigma|)$ time. The number of hypotheses is $O(N^2)$, so the time complexity is $O(k m_m N^2)$ in the worst case, where $m_m$ is the length of the longest sequence. On average, the number of hypotheses is $O(\frac{N^2}{|\Sigma|^q})$, so the time complexity is $O(\frac{k m_a N^2}{|\Sigma|^q})$, where $m_a$ is the average length of the sequences in seqDB. When q increases or $|\Sigma| \to N$, the time complexity becomes $O(k m_a N)$. The space requirement is $O(N)$ to maintain the suffix tree data structure. □


5.3.5 Comparison with Other ACPM Algorithms

In Table 5.2, we compare our ACPM-QGRAM algorithm and ACPM-BIDIRECTIONAL algorithm with the other related algorithms which were introduced in Chapter 2, namely Maes' algorithm [81], Gregor and Thomason's algorithm [44] and Uliel et al.'s algorithm [123]. Weiner et al.'s algorithm [126, 127] is a greedy algorithm which may miss some important circular relations. We also compare this algorithm with the other algorithms.

Our goal is to develop an algorithm to solve the ACPM problems and to apply it to the study of circular permutations in multidomain proteins. We make minor changes to Maes' algorithm [81], Gregor and Thomason's algorithm [44], Uliel et al.'s algorithm [123] and Weiner et al.'s algorithm [126, 127] to adapt them to the ACPM problems. Because these algorithms all focus on computing the circular edit distance, we extend them to match the pattern against all the substrings of T. This takes (n − m) steps, and hence the total time complexity increases by a factor of n − m = $O(n)$.

The time complexity of our ACPM-QGRAM algorithm is $O(m_a^3 N^2)$ in the worst case. On average, the time complexity is $O(m_a^3 N^2 / |\Sigma|^q)$, where $m_a$ is the average length of the sequences, N is the total length of the sequences, and $q = \lfloor \frac{m}{k+1} \rfloor$. When q increases, $O(N^2 / |\Sigma|^q)$ will be reduced to $O(N)$, since $|\Sigma|^q \le O(N)$.

The time complexity of our ACPM-BIDIRECTIONAL algorithm is $O(k m_m N^2)$ in the worst case. On average, the time complexity is $O(k m_a N^2 / |\Sigma|^q)$, where $m_a$ is the average length of the sequences, N is the total length of the sequences, and $q = \lfloor \frac{m}{k+1} \rfloor$. When q increases, $O(N^2 / |\Sigma|^q)$ will be reduced to $O(N)$, since $|\Sigma|^q \le O(N)$.

Table 5.2 shows the worst case time complexity of these algorithms. The last row shows the average case for the most challenging problem of all-against-all approximate circular pattern matching. Considering ACPM-QGRAM, when m is large (m = O(N)), our q-gram algorithm will be worse. In this case, Maes' algorithm [81] will be the best algorithm in the worst case. But when $\frac{m}{k+1}$ increases and m is not very large, the ACPM-QGRAM algorithm will run in $O(m^3 N)$, where $m = \frac{N}{Z}$. This can be treated as a constant ($\frac{N}{Z} \approx 6$ for the case of multidomain proteins). Therefore, under such conditions, the ACPM-QGRAM algorithm is a linear time algorithm on average. The proposed ACPM-QGRAM algorithm is better than the other four related algorithms which were introduced in related work. Comparing the ACPM-BIDIRECTIONAL algorithm with Maes' algorithm [81], which is the best available algorithm for the ACPM1 problem, we can see that applying Maes' algorithm to the all-against-all ACPM problem requires time in $\Theta(N^2 m_a \log m_a)$ on average, and $O(N^3 \log m_m)$ in the worst case. These are still worse than our proposed algorithm, which runs in $O(k m_a N)$ time on average and $O(k m_m N^2)$ in the worst case. Landau et al.'s algorithm [67] runs in $O(kmn)$ to solve the ACPM2 problem (for one-against-one). In the all-against-all ACPM2 case, the time complexity will be $\Theta(kN^2)$. This algorithm is worse than our algorithm, which runs in $O(k m_a N)$ time on average.

Table 5.2. Comparison with other proposed ACPM Algorithms

                                       Maes [81]                               Gregor et al. [44]             Uliel et al. [123]             Weiner et al.^1 [126, 127]     ACPM-QGRAM                        ACPM-BIDIRECTIONAL
One circular pattern against           O(m^2 log m)                            O(m^3)                         O(m^3)                         O(m^2)                         O(m^2 log m / |Σ|) = O(m log m)   O(km)
  one text window
One-against-One                        O(m^2 n log m)                          O(m^3 n)                       O(m^3 n)                       O(m^2 n)                       O(m^3 log m)                      O(kmn)
One-against-All                        O(m^2 N log m)                          O(m^3 N)                       O(m^3 N)                       O(m^2 N)                       O(m^2 N log m)                    O(kmN)
All-against-All                        O(∑_{1}^{Z} m^2 N log m)                O(∑_{1}^{Z} m^3 N)             O(∑_{1}^{Z} m^3 N)             O(∑_{1}^{Z} m^2 N)             O(m^3 N^2 log m)                  O(kmN^2)
                                         = O(N^3 log m)                          = O(N^4)                       = O(N^4)                       = O(N^3)
Average Case All-against-All           O(N^2 m_a log m_a)                      O(N^2 m_a)                     O(N^2 m_a^2)                   O(m_a N^2)                     O(m_a^3 N log(m_a))               O(kmN)
  (m = m_a = N/Z)

5.4 Experiments

As discussed in Chapter 1, circular permutations and cyclic patterns have been used in various studies in biology. We performed some experiments using the results of the proposed algorithms to study circular permutations in molecular biology. In our experiments, we apply our algorithms to multidomain proteins to look for potential circular permutation relationships between them. We also use the results of the proposed algorithm to predict potential functions for uncharacterized or unknown proteins. In these experiments, the alphabet size is the number of domains, which is a large number, close to $10^6$.

^1 The algorithm could produce incomplete results.


5.4.1 Data Set

Protein Domain Database (ProDom)

A protein domain is a section of the protein sequence whose structure can evolve, function, and exist independently of the rest of the protein chain [32]. Most proteins consist of several domains. The same protein domain may occur in related proteins. ProDom is a database of known protein domains. The ProDom web site (http://ProDom.prabi.fr) provides a tool to search for a protein domain in the protein database. The results are the proteins which contain a given protein domain. Each domain is represented as a unique symbol, thus a multidomain protein is viewed as a sequence of such symbols. The length of the domain representation is generally much smaller than that of the original protein sequence, but the size of the alphabet increases drastically.

Gene Ontology Database (GO)

The Gene Ontology (GO) project (http://www.geneontology.org/) provides a description of genes and protein products in different databases, including the known functions of the genes. Currently the GO Consortium includes many databases such as GeneDB (http://www.genedb.org/), UniProtKB-Gene Ontology Annotation @ EBI (UniProtKB-GOA) (http://www.ebi.ac.uk/GOA/) and FlyBase (http://flybase.bio.indiana.edu/). More details on the GO Consortium are available at http://www.geneontology.org/GO.consortiumlist.shtml.

The ProDom database provides the Accession Number for the parent protein of each domain. The Accession Number is also provided for UniProtKB-GOA. This establishes a connection between entities in ProDom and their corresponding entities in the GO database. In our experiments, we used this relation to obtain the GO terms used to describe the protein function. Based on this, we can predict functions for multidomain proteins using the CPM results obtained from the proteins in ProDom.


5.4.2 CPM Experimental Design

We implemented the exact circular pattern matching algorithm and three approximate circular pattern matching algorithms and applied them to detect circular patterns in the ProDom database. We downloaded data from the ProDom web site (http://ProDom.prabi.fr) on March 12, 2009 (ProDom version 2006.1 as released on November 6th, 2008). There were 1,997,497 proteins in this database. We removed proteins with fewer than three domains, and also removed redundant proteins with more than 90% similarity to some other protein. The result is a reduced database with 973,686 proteins. This means that ECPM and ACPM will only apply to proteins that contain an entire copy of another protein.

Results: Speed and Completeness

We ran the four algorithms on the reduced database and used the results to analyze the relationships between multidomain proteins. The ACPM-QGRAM algorithm was executed with two different parameters, namely q = 1 and q = 2. When q = 1, the result is complete. When q = 2, the result is suboptimal (incomplete). We use the complete results as a benchmark to compare with the results from the algorithms.

The exact algorithm is the fastest algorithm. It needed only six minutes to build the suffix tree and perform searches for all circular patterns. The ACPM-LIS algorithm is the slowest algorithm. The ACPM-GREEDY algorithm is faster than the other ACPM algorithms, but the result has low accuracy (around 50%). Figure 5.5 shows the practical time required by these three algorithms, where q-gram has two instances, q = 1 and q = 2.

A comparison of the outputs of the algorithms provides some insight into their overall performance. There are 29,625,738 relations in the complete result. The ECPM algorithm can be viewed as a greedy algorithm when the objective is approximate matching. The number of relations found using ECPM was 28,096,046, which is close to the complete result. ACPM-GREEDY identified only 15,075,729 relations. The ACPM-QGRAM algorithm with parameter q = 2 found 29,345,380 relations. When we run the ECPM algorithm on ProDom, we get almost 95% of the complete relations. When we run the ACPM-QGRAM algorithm with q = 2, we get more than 99% of the complete relations.

We also ran the hybrid algorithm, in which the ECPM algorithm was applied first, followed by the ACPM-QGRAM algorithm with parameter q = 1. We obtained the complete result, and the running time was reduced from 41 hours to 14 hours.

Analysis of Results

Based on the results, we built a relationship network among the multidomain proteins. This is a directed graph. The proteins are represented as the vertices, while the relations are represented by the edges. The In-edges and Out-edges are defined as follows. If a protein sequence P1 is a circular pattern in protein sequence P2, then there is an Out-edge from P1 to P2. Conversely, there is an In-edge from P2 to P1. Figure 5.6 shows the degree distribution of the network. Panel (a) of Figure 5.6 is the degree distribution of all vertices and panel (b) shows the degree distribution of the Top-100 highest degree nodes. Panels (c) and (d) are log-log plots of panels (a) and (b) respectively.

Each protein sequence is not only used as a pattern to search against the other protein sequences, but also used as a text to be searched using the other protein sequences in the database. 424,888 protein sequences were found to be a pattern in some other protein sequence. 799,044 protein sequences contain at least one other protein sequence as a circular pattern. 374,279 protein sequences have both out-edges and in-edges. 50,609 protein sequences only have out-edges, while 424,765 protein sequences only have in-edges. The average degree of this graph was 23 with an average out-degree of 46 and an average in-degree of 24.5.
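The edge bookkeeping behind these counts can be sketched in Python. The sketch assumes that the matching results are available as (pattern, text) pairs, one per relation, so that each pair contributes an out-edge to the pattern and an in-edge to the text. The function name `degree_summary` and the toy protein identifiers are hypothetical.

```python
from collections import Counter


def degree_summary(relations):
    """Count vertices with both edge kinds, only out-edges, or only in-edges."""
    out_degree = Counter()
    in_degree = Counter()
    for pattern, text in relations:
        out_degree[pattern] += 1  # the pattern occurs circularly in the text
        in_degree[text] += 1      # the text contains the pattern
    has_out = set(out_degree)
    has_in = set(in_degree)
    return {
        "both": len(has_out & has_in),
        "out_only": len(has_out - has_in),
        "in_only": len(has_in - has_out),
    }
```

In this bookkeeping, the number of sequences that are a pattern in some other sequence is the sum of the "both" and "out_only" counts, which matches 374,279 + 50,609 = 424,888 above.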

Figure 5.7(a) shows the number of directly connected pairs in the Top-K highest degree proteins, where K is 10, 20, ..., 1000. Let the Top-K highest degree proteins be the vertices of a subgraph; the number of directly connected pairs is then the number of edges. We define a ratio $\rho_K$ as follows:

$$
\rho_K = \frac{\#\ \text{of total edges}}{\#\ \text{of edges in Top-}K\ \text{complete subgraph}} = \frac{\#\ \text{of observed edges}}{\frac{1}{2} \times K \times (K-1)}
$$

Figure 5.7(b) shows the ratio $\rho_K$ in the Top-K proteins. When K is less than 460, the ratio $\rho_K$ stays stable at around 0.5. When K is larger than 460, the ratio $\rho_K$ starts to decrease. Thus in this graph, the Top-460 highest degree proteins are more densely related to each other.
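The ratio can be computed with a few lines of Python. The sketch assumes that a directly connected pair is counted once regardless of edge direction, which is one reading of the definition and not necessarily the one used to produce Figure 5.7. The function name `rho_k` is hypothetical.

```python
def rho_k(top_k, edges):
    """Observed connected pairs among the top-K vertices over K * (K - 1) / 2."""
    members = set(top_k)
    k = len(members)
    if k < 2:
        raise ValueError("at least two vertices are needed")
    # Count each unordered pair once, ignoring direction and self-loops.
    pairs = {frozenset((u, v)) for u, v in edges
             if u != v and u in members and v in members}
    return len(pairs) / (0.5 * k * (k - 1))
```

With four vertices and two connected pairs among them, the ratio is 2 / 6, and a complete subgraph gives a ratio of 1.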


Table 5.3. Top 15 highest degree proteins with GO function

Rank  Count  AC Number  GO Description
1     23353  Q7VMZ1     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
2     23344  Q9CPC5     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
3     23338  Q3EG14     Protein not found in GO
4     20508  Q33HH1     Protein not found in GO
5     20446  Q47AY9     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
6     20446  Q4UQ62     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
7     20446  Q8P4K7     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
8     20446  Q8PG73     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
9     20415  Q426Q5     Protein not found in GO
10    20398  Q3BNR9     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
11    20393  Q73PA3     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
12    20273  Q50XK7     Protein not found in GO
13    20246  Q66C16     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity
14    20244  Q5NU40     No function in GO
15    20244  O32748     nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity

Protein Function Prediction

Table 5.3 shows the protein function for the Top-15 highest degree proteins. We notice that 10 of the 15 proteins have exactly the same functions. There are four proteins (ranks 3, 4, 9 and 12) that do not have entries in the GO database. Protein Q5NU40 (rank 14) has a record in the GO database, but no function is assigned to it there. With high probability, we can say that the four proteins with no known function are likely to have the same function as the other 10 proteins. We plan to verify these functions by searching the biology literature in the future.

We use the z-score as a measure of significance of the relationship between two proteins. For a given random variable x, the z-score is defined as $z = \frac{x - \mu_x}{\sigma_x}$, where $\mu_x$ is the mean and $\sigma_x$ is the standard deviation.

To predict the function of a protein, say PA, we compute the z-scores for the number of occurrences of each function among the proteins in the respective In-edge and Out-edge sets of PA. We then assign to PA each function whose z-score is above a threshold.
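The thresholding step can be illustrated with a minimal sketch. It assumes that x is the number of neighbouring proteins annotated with a function, and that the mean and standard deviation are taken over the counts of all functions seen among those neighbours; the text does not spell out this choice. The name `predict_functions` and its arguments are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def predict_functions(neighbor_functions, threshold):
    """Return the functions whose occurrence count has a z-score >= threshold.

    neighbor_functions: one collection of GO terms per neighbouring protein
    (for example, the proteins in the In-edge or Out-edge set).
    Assumption: mean and standard deviation are computed over the counts
    of all functions observed among the neighbours.
    """
    # Count each function at most once per neighbouring protein.
    counts = Counter(f for funcs in neighbor_functions for f in set(funcs))
    if not counts:
        return set()
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All functions are equally frequent, so none stands out.
        return set()
    return {f for f, c in counts.items() if (c - mu) / sigma >= threshold}
```

Under the same assumption, the union and intersection variants would combine the sets returned for the In-edge and Out-edge neighbours with `|` and `&` respectively.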

Table 5.4 shows the prediction results on 9 multidomain protein sequences using the union of the functions of the In-edge and Out-edge proteins at different thresholds on the z-scores.

Table 5.5 shows the equivalent results using the intersection.

We also conducted an experiment to predict the protein functions of the Top-500 highest degree proteins; 156 of these proteins were not found in the GO database. Table 5.6 shows the prediction performance in terms of precision, recall, and the F-measure, where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The recall is calculated as $\frac{TP}{TP+FN}$, the precision as $\frac{TP}{TP+FP}$, and the F-measure as $\frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}}$.
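As a quick check of these definitions, the following short sketch computes the three measures from the raw counts. The function name `prediction_scores` is illustrative only, and returning 0.0 for an empty denominator is an assumption of this sketch.

```python
def prediction_scores(tp, fp, fn):
    """Return (recall, precision, f_measure) from TP, FP and FN counts."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```

For the union method at z ≥ 3 (TP = 1349, FP = 186, FN = 317), this gives values that round to a recall of 0.81, a precision of 0.88, and an F-measure of 0.84.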

From the F-measures in Table 5.6, we notice that the union method at z ≥ 3 has the highest F-measure, 0.84. This indicates that the union method at z ≥ 3 provides the best result of all the combinations.

5.4.3 Multidomain Protein Networks using Circular Patterns

Introduction

Philipp et al. [100] introduced a tool to discover potential relationships between proteins using a protein domain network, which was based on protein domain interaction networks. They built a web resource to explore the Protein Domain Interaction MAp (DIMA). In this network, the nodes are the protein domains and the edges are the interactions between two protein domains. In our work, network formation is based primarily on cyclic relationships between multidomain proteins.

Dataset

In this experiment, we further studied the use of our proposed ECPM and ACPM algorithms on the problem of analyzing multidomain protein sequences. Based on the patterns found by our algorithms, we constructed multidomain protein networks by connecting different multidomain proteins that are associated by some matching circular or non-circular patterns. We use the Pfam database [15, 39] to identify the families for the multidomain proteins in ProDom. (We introduced the Pfam database earlier in Chapter


Table 5.4. The predicted protein functions using union for In-edge and Out-edge

Protein AC Number   Function   Predicted Function (z≥3)   Predicted Function (z≥2)   Predicted Function (z≥1)

Q7VMZ1 GO:0000166 GO:0000166 GO:0000166 GO:0000166
  GO:0005524 GO:0005524 GO:0005524 GO:0005215
  GO:0016887 GO:0016887 GO:0016887 GO:0005524
  GO:0017111 GO:0017111 GO:0017111 GO:0016787
  GO:0042626 GO:0016887
  GO:0017111
  GO:0042626
O32184 GO:0003824 GO:0003824 GO:0003824 GO:0003824
  GO:0005488 GO:0005488 GO:0005488 GO:0004316
  GO:0016491 GO:0016491 GO:0016491 GO:0005488
  GO:0016491
Q2Y7W6 GO:0000156 GO:0000155 GO:0000155 GO:0000155
  GO:0004871 GO:0004871 GO:0004871
Q33CH5 GO:0003723 GO:0003723 GO:0003723 GO:0003723
  GO:0003968 GO:0003968 GO:0003968 GO:0003968
Q30U32 GO:0000156 GO:0000156 GO:0000155 GO:0000155
  GO:0004871 GO:0000156 GO:0000156
  GO:0004871 GO:0004871
O93828 GO:0004585 GO:0004585 GO:0004585 GO:0004585
  GO:0016597 GO:0016597 GO:0016597 GO:0016597
  GO:0016740 GO:0016740 GO:0016740 GO:0016740
  GO:0016743 GO:0016743 GO:0016743 GO:0016743
Q30SN9 GO:0003824 GO:0003824
  GO:0004252 GO:0004252
  GO:0005515
Q2YTY7 GO:0003723 GO:0003723
  GO:0009982 GO:0009982
O78911 GO:0008137 GO:0008137 GO:0008137 GO:0008137
  GO:0016491 GO:0016491 GO:0016491 GO:0016491


Table 5.5. The predicted protein functions using intersection for In-edge and Out-edge

Protein AC Number   Function   Predicted Function (z≥3)   Predicted Function (z≥2)   Predicted Function (z≥1)

Q7VMZ1 GO:0000166 GO:0000166 GO:0000166 GO:0000166
  GO:0005524 GO:0005524 GO:0005524 GO:0005524
  GO:0016887 GO:0016887 GO:0016887 GO:0016887
  GO:0017111 GO:0017111 GO:0017111 GO:0017111
O32184 GO:0003824
  GO:0005488
  GO:0016491
Q2Y7W6 GO:0000156 GO:0004871 GO:0004871 GO:0000155
  GO:0004871
Q33CH5 GO:0003723 GO:0003723 GO:0003723
  GO:0003968 GO:0003968 GO:0003968
Q30U32 GO:0000156 GO:0004871 GO:0000155
  GO:0004871
O93828 GO:0004585 GO:0016597
  GO:0016597 GO:0016740
  GO:0016740 GO:0016743
  GO:0016743
Q30SN9 GO:0003824
  GO:0004252
Q2YTY7 GO:0003723
  GO:0009982
O78911 GO:0008137 GO:0008137 GO:0008137 GO:0008137
  GO:0016491 GO:0016491 GO:0016491 GO:0016491

Table 5.6. Performance in Protein Function Prediction using the Top-500 Proteins

Method        Parameter  TP    FP    FN    Recall  Precision  F-measure
Union         z≥3        1349  186   317   0.81    0.88       0.84
Union         z≥2        1353  1302  313   0.81    0.51       0.63
Union         z≥1        1464  1950  202   0.88    0.43       0.58
Intersection  z≥3        162   522   1504  0.1     0.24       0.14
Intersection  z≥2        1269  3891  397   0.76    0.25       0.37
Intersection  z≥1        1345  6857  321   0.81    0.16       0.27


4). There are 40807 protein sequences in the ProDom database that are also in the Pfam database, and there are 12104 families in the Pfam database. In our experiment, proteins in ProDom that do not have corresponding families in Pfam were not included in the analysis. As in Chapter 4, for function prediction based on the circular pattern networks, we use the protein functions as maintained in the GO database.

To ensure diversity and reduce redundancy in the protein families, each family is represented by only two member proteins. We select two proteins from each family in the Pfam database to construct a new network: for each family, we select the proteins with the maximum and minimum number of circular pattern relationships. These often correspond to the longest and shortest protein sequences in the family. Some proteins belong to multiple families and are therefore chosen several times; we keep only one copy of each in our network. This network presents non-redundant relationships across all protein families. The resulting data set contains 3,659 proteins from 4,725 families.
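The selection rule described above can be sketched as follows. This is only an illustration: the function name, the dictionary layout of the inputs, and the tie-breaking behaviour (the first member encountered wins) are assumptions rather than details taken from the experiment.

```python
def select_representatives(family_members, cp_relationships):
    """Pick, per family, the proteins with the most and fewest CP relationships.

    family_members: dict mapping a family name to a list of protein IDs.
    cp_relationships: dict mapping a protein ID to its number of circular
    pattern relationships (missing proteins are treated as 0).
    Returns a set, so a protein chosen for several families is kept once.
    """
    chosen = set()
    for members in family_members.values():
        if not members:
            continue
        degree = lambda protein: cp_relationships.get(protein, 0)
        chosen.add(max(members, key=degree))
        chosen.add(min(members, key=degree))
    return chosen
```

Returning a set mirrors the step of keeping only one copy of a protein that is selected for several families.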

Network Formation

We construct two types of network. The first is based on the circular permutation relationships between proteins; we call this the “Protein” network. The second is based on the families: any circular relationship between two proteins is treated as a circular permutation relationship between the two families to which the two proteins belong. We call this the “Family” network. Further, we construct three networks for each of the two types. The first uses only the non-circular patterns (non-CPs) found between the proteins, that is, it is built from non-circular matching relationships. The second is the circular pattern network, built using only the circular matching relationships (excluding non-circular matches) between the multidomain proteins. The last is the combined network, built from all matching relationships (both circular and non-circular). Thus, we construct six networks from our data. Table 5.7 shows the statistics of these six networks. The networks are shown in Figures 5.8-5.13.
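A rough sketch of how the six edge sets might be derived is given below. It rests on several assumptions that the text does not state explicitly: edges are undirected, an edge that appears in both match types is counted as non-circular, and family-level self-loops are dropped. All names are hypothetical.

```python
def split_networks(circular_edges, non_circular_edges):
    """Return the non-circular, circular-only and combined edge sets."""
    # Assumption: an edge found by both kinds of match counts as non-circular.
    only_circular = set(circular_edges) - set(non_circular_edges)
    return {
        "non-circular": set(non_circular_edges),
        "circular": only_circular,
        "combined": set(non_circular_edges) | only_circular,
    }


def lift_to_families(protein_edges, families_of):
    """Map each protein-protein edge to edges between the proteins' families."""
    family_edges = set()
    for edge in protein_edges:
        a, b = tuple(edge)
        for fa in families_of.get(a, ()):
            for fb in families_of.get(b, ()):
                if fa != fb:  # assumption: self-loops are not kept
                    family_edges.add(frozenset((fa, fb)))
    return family_edges
```

The three Protein networks would then come from `split_networks` applied to the protein-level edges, and the three Family networks from the same function applied to the lifted edge sets.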


Table 5.7. Network statistics for multidomain protein networks

Parameters                   Protein       Protein   Protein   Family        Family    Family
                             non-circular  circular  combined  non-circular  circular  combined
                             network       network   network   network       network   network
Clustering Coefficient       0.157         0.102     0.174     0.181         0.084     0.249
Connected Components         756           407       786       600           369       608
Network Diameter             7             9         9         13            15        11
Shortest paths               23515         17104     25593     103683        58164     123380
Characteristic Path Length   2.079         2.099     2.083     3.525         3.814     3.294
Average number of neighbors  2.612         2.496     2.723     5.325         4.073     6.844
Number of nodes              3458          2299      3659      4416          3140      4725
Number of edges              4517          2869      7386      13031         6741      19772

Significance of Circular Pattern Networks

The networks in Figures 5.4.3 and 5.11 use two colors for the network nodes. The red nodes are the nodes in the non-circular pattern networks. The green nodes are the nodes found only in the circular pattern networks. We notice that there are very few green nodes in the networks. There are also two colors for the edges. The blue edges are edges that occur in the non-circular networks. The pink edges are edges that occur only in the circular networks. From these two figures, we see that the pink edges show the more significant differences between the non-circular pattern network and the circular pattern network.

For example, in the combined Protein network (with CPs and non-CPs), we can observe pink edges between some major clusters. This shows an important relationship between these clusters that is exposed only by the circular permutations. For function prediction work based on this network, one can expect that these edges will provide more information about the clusters, potentially leading to improved prediction results.

In the family networks, there are some interesting differences between the combined network and the non-circular network. The part labeled “G” in the combined network (Figure 5.4.3) is connected to the main component, but in the non-circular pattern network, part G is a large component that is disconnected from the main component. This indicates that there exist some important edges that occur only in the circular network. Such relationships cannot be found using direct pattern matching, and require methods for circular pattern matching, as proposed in this chapter.


Table 5.8 shows the Top 25 proteins that exhibited the highest difference (in terms of node degree) between the protein circular-pattern network and the non-circular pattern network. The significance of the CP network is perhaps clearer in this table, which shows the quantitative differences between selected nodes in the two networks. We can observe that, in some cases, more than half of the associations to some nodes in the Protein networks are due mainly to the circular pattern relationships. We can also see that some small networks are formed exclusively from circular patterns, with no associations using direct patterns (i.e., non-CPs).

Table 5.9 shows the proteins found on the longest path in the Protein network. Given the nature of the network, this means that these multidomain proteins must share some common circular patterns (and perhaps some non-circular patterns). Table 5.10 shows the corresponding longest chain of families in the Family network.

5.5 Summary

In this chapter, we presented four ACPM algorithms to solve the ACPM2 problem and one algorithm to solve the ECPM problem. The ECPM matching algorithm is not only the best algorithm in theory but also the fastest in practice. The ACPM-BIDIRECTIONAL algorithm is the best of our ACPM algorithms. Compared with other algorithms in the literature, the ACPM-BIDIRECTIONAL algorithm also has the best average time complexity.

Based on these results, we analyzed circular permutations in multidomain proteins and used them to perform protein function prediction. Our results show a precision of 0.88 and a recall of 0.81 at z ≥ 3, using the union of the functions of the In-edge and Out-edge proteins.


Table 5.8. Top 25 proteins with the highest node degree differences between protein networks using the circular and non-circular patterns.

Protein  Degree in         Degree in         Degree in             Degree in circular network /
         circular network  combined network  non-circular network  Degree in combined network (%)
P03304   266               704               438                   37.78%
P19525   221               592               371                   37.33%
P35409   199               567               368                   35.10%
Q00962   175               470               295                   37.23%
P17546   175               464               289                   37.72%
P42684   122               390               268                   31.28%
P43699   147               402               255                   36.57%
P48633   174               403               229                   43.18%
P29476   123               348               225                   35.34%
Q04610   122               320               198                   38.13%
Q99323   113               306               193                   36.93%
P22009   95                276               181                   34.42%
Q06889   94                260               166                   36.15%
Q03351   431               588               157                   73.30%
P10272   82                238               156                   34.45%
P03336   83                225               142                   36.89%
P16112   101               240               139                   42.08%
P38939   77                216               139                   35.65%
P41381   81                203               122                   39.90%
P19560   59                175               116                   33.71%
P30963   68                176               108                   38.64%
P26762   64                168               104                   38.10%
P27742   72                169               97                    42.60%
P19559   62                150               88                    41.33%
P08049   65                150               85                    43.33%


Table 5.9. The longest path in the Protein network

Order  Protein  Family
1      P25930   Pfam-B 8089
1      P25930   Pfam-B 8090
2      P34540   Pfam-B 2386
3      P43141   Pfam-B 2823
4      Q03351   ig
4      Q03351   Pfam-B 11753
4      Q03351   Pfam-B 3706
4      Q03351   Pfam-B 3707
5      P29276   Pfam-B 10807
6      P13677   Pfam-B 135
6      P13677   Pfam-B 735
7      P11461   Pfam-B 8910
8      P03967   ras
9      P31133   Pfam-B 3874
10     P46870   Pfam-B 210
10     P46870   Pfam-B 229
10     P46870   Pfam-B 461
10     P46870   Pfam-B 547
10     P46870   Pfam-B 548
10     P46870   Pfam-B 616
11     P13068   Pfam-B 5383
12     P42686   Pfam-B 7008
12     P42686   Pfam-B 7009
13     P44768   Pfam-B 10433
14     P32745   Pfam-B 4919
14     P32745   Pfam-B 4920
15     P23678   Pfam-B 7252
15     P23678   PH
16     P25892   Pfam-B 3816
17     P31134   Pfam-B 1832
18     P35409   7tm 1
18     P35409   Pfam-B 3952
18     P35409   Pfam-B 725
19     P21838   Pfam-B 8095
20     P07199   Pfam-B 6062
21     P09803   Pfam-B 4482
21     P09803   Pfam-B 6146
22     P24710   Pfam-B 1620
22     P24710   Pfam-B 1621


Table 5.10. The longest path in the Family network

Order  Family
1      Pfam-B 6544
2      Pfam-B 11160
3      ins
4      vwc
5      Pfam-B 5115
6      Pfam-B 1677
7      Pfam-B 4686
8      Pfam-B 7481
9      Pfam-B 1063
10     Pfam-B 2682
11     Pfam-B 3197
12     Pfam-B 8974
13     Pfam-B 604
14     Pfam-B 818
15     adh zinc
16     Pfam-B 11159
17     Pfam-B 842
18     Pfam-B 2068

Figure 5.1. Suffix tree for the string T = missississippi$ with some suffix links. (This figure is the same as Figure 3.1, with some suffix links added.)


Algorithm 5.1: ECPM Algorithm

ECPM(ST, Pattern)
    PP ← Pattern ◦ Pattern[1...m-1]
    Current ← ST.root, top ← 1, len ← 1
    for (i ← 1 to 2m-1) do
        if (len = 1) and Current.child.label[1] = PP[i] then
            Current ← Current.child
        end if
        if Current.child.label[1] = PP[i] then
            len ← len + 1
            if len > label.length then
                len ← 1
            end if
            if i - top = m-1 then
                output "(matched, Current)"
                Current ← Current.SuffixLink, top ← top + 1
            end if
        else
            Current ← Current.SuffixLink
            top ← top + 1, i ← i - 1
        end if
        if i < top or top > m then
            break
        end if
    end for
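The role of the doubled pattern PP in Algorithm 5.1 can be seen in a much simpler setting. The sketch below is not the suffix-tree algorithm: it swaps the tree traversal for a plain dictionary lookup over every window of the text, and only illustrates that each rotation of the pattern is a length-m substring of PP. The function name and the 0-based positions are choices made for this illustration.

```python
def circular_occurrences(text, pattern):
    """Find exact occurrences of any rotation of pattern in text.

    Returns a list of (text_position, rotation_offset) pairs, 0-based.
    Naive illustration only; it does not use a suffix tree or suffix links.
    """
    m = len(pattern)
    if m == 0:
        return []
    doubled = pattern + pattern[:-1]  # PP = Pattern ◦ Pattern[1...m-1]
    rotations = {}
    for h in range(m):
        # Every rotation of the pattern is a length-m substring of PP.
        rotations.setdefault(doubled[h:h + m], h)
    hits = []
    for pos in range(len(text) - m + 1):
        window = text[pos:pos + m]
        if window in rotations:
            hits.append((pos, rotations[window]))
    return hits
```

For instance, `circular_occurrences("xxcabyy", "abc")` reports the rotation `cab` at text position 2.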

Algorithm 5.2: All vs. All ECPM Algorithm

ALL-VS-ALL-ECPM(SeqDB, Z)
    ST ← Get Suffix Tree(SeqDB)
    for (i ← 1 to Z) do
        Pattern ← SeqDB[i]
        ECPM(ST, Pattern)
    end for


Algorithm 5.3: Pattern Matching Using LIS

APM-VIA-LIS(T, P, k)
    Build the mapping table mapTable, which stores the positions in P of each symbol in decreasing order
    seq ← NULL
    for (i ← 1 to n) do
        seq ← seq ◦ mapTable[T[i]]
    end for
    Generate LIS from seq
    Calculate LCS between T and P from LIS
    if verify(LCS, k) is true then
        return matched
    else
        return mismatched
    end if
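The reduction from LCS to LIS used in Algorithm 5.3 can be sketched in a few lines. The sketch returns only the LCS length and leaves out the verify step, whose definition is not given here; the function name is hypothetical and positions are 0-based.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_via_lis(text, pattern):
    """Length of the longest common subsequence of text and pattern,
    computed as a longest strictly increasing subsequence."""
    # mapTable: positions of each symbol in the pattern, in decreasing order.
    map_table = defaultdict(list)
    for idx, symbol in enumerate(pattern):
        map_table[symbol].append(idx)
    for positions in map_table.values():
        positions.reverse()
    # seq: concatenation of mapTable[T[i]] over the text.
    seq = []
    for symbol in text:
        seq.extend(map_table.get(symbol, ()))
    # Patience-style LIS; tails[j] is the smallest tail of an increasing
    # subsequence of length j + 1 seen so far.
    tails = []
    for value in seq:
        slot = bisect_left(tails, value)
        if slot == len(tails):
            tails.append(value)
        else:
            tails[slot] = value
    return len(tails)
```

Listing the positions in decreasing order ensures that a strictly increasing subsequence uses at most one pattern position per text symbol, which is what makes the LIS length equal to the LCS length.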

Algorithm 5.4: ACPM2 with Greedy Algorithm

ACPM-GREEDY(SeqDB, Z, k)
    for (i ← 1 to Z) do
        for (j ← 1 to Z) do
            P ← SeqDB[i], m ← |P|, PP ← P[1...m] ◦ P[1...m−1]
            T ← SeqDB[j], n ← |T|
            APM-via-LIS(T, PP, k)
        end for
    end for


Algorithm 5.5: ACPM2 Algorithm with LIS

ACPM-LIS(SeqDB, Z, k)
    for (i ← 1 to Z) do
        for (j ← i+1 to Z) do
            P ← SeqDB[i], m ← |P|
            T ← SeqDB[j], n ← |T|
            for (v ← 1 to n−m+k) do
                subT ← T[v...v+m+k]
                for (h ← 1 to m) do
                    APM-via-LIS(subT, f^h(P), k)
                end for
            end for
        end for
    end for

Figure 5.2. The number of hypotheses with q-gram


Algorithm 5.6: ACPM2 Algorithm with q-gram and Suffix Array

ACPM-QGRAM(SeqDB, N, Z, q, k)
    seq ← NULL, pos ← NULL, s ← 1
    for (i ← 1 to Z) do
        seq ← seq ◦ SeqDB[i]
        for (j ← 1 to m_i) do
            pos[s] ← i, s ← s + 1
        end for
    end for
    <SA, lcp> ← BuildSA(seq)
    for (i ← 1 to N) do
        Candidates ← {}
        do while (lcp[i] ≥ q)
            Candidates ← Candidates ∪ {i}, i ← i+1
        end do
        for each Pair {x, y} ∈ Candidates do
            P ← SeqDB[pos[SA[x]]], m ← |P|
            T ← SeqDB[pos[SA[y]]], n ← |T|
            for (j ← min(1, y−m−k+q) to y+m+k−q) do
                subT ← T[j...j+m+k−1]
                for (h ← 1 to m) do
                    APM-via-LIS(subT, f^h(P), k)
                end for
            end for
        end for
    end for
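The candidate-generation step of Algorithm 5.6 can be illustrated with a small sketch that groups suffixes sharing their first q symbols. It uses a naive sort-based suffix array instead of a linear-time construction, reports only which pairs of sequences share a q-gram, and skips q-grams that straddle a boundary between concatenated sequences; that last filter is an assumption of this sketch. All names are hypothetical and indices are 0-based.

```python
def qgram_candidate_pairs(seqs, q):
    """Return sorted (i, j) index pairs of sequences that share a q-gram."""
    text = "".join(seqs)
    # owner[p] plays the role of pos[] in Algorithm 5.6.
    owner = [i for i, s in enumerate(seqs) for _ in s]
    n = len(text)
    suffix_array = sorted(range(n), key=lambda p: text[p:])  # naive construction

    def lcp(a, b):
        # Length of the longest common prefix of the suffixes at a and b.
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        return length

    pairs = set()
    group = []  # owners of the suffixes in the current run with lcp >= q
    for rank, start in enumerate(suffix_array):
        if rank > 0 and lcp(suffix_array[rank - 1], start) < q:
            group = []  # the run of suffixes sharing q symbols has ended
        # Keep only q-grams that lie completely inside one sequence.
        if start + q <= n and owner[start] == owner[start + q - 1]:
            for other in group:
                if other != owner[start]:
                    pairs.add((min(other, owner[start]), max(other, owner[start])))
            group.append(owner[start])
    return sorted(pairs)
```

Each reported pair would then be passed to the verification loop of Algorithm 5.6.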

Figure 5.3. Dynamic Programming in q-gram matching


Figure 5.4. Three cases in computing the circular edit distance in the ACPM algorithm using the bidirectional edit distance. The numbered double-headed arrows show the symbol positions involved in each case.

Algorithm 5.7: q-gram ACPM Hypothesis Generation

ACPM-BIDIRECTIONAL(SeqDB[], k, N)
    ST ← Build-Suffix-Tree(SeqDB)
    for i = 1 to N
        P ← SeqDB[i]
        m ← |P|
        q ← ⌊m/(k+1)⌋
        start ← 1
        {<SeqDB ID, position>} ← Search-in-ST(ST, P[1...q])
        for each pair <SeqDB ID, position>
            T ← SeqDB[SeqDB ID]
            BIDIRECTIONALED(P, 1, T, position, k, q)
        end for
        start ← start + 1
        do while (start ≤ m-p)
            {<SeqDB ID, position>} ← Search-in-SuffixLink(ST, P[start...start+q-1]) via suffix link
            for each pair <SeqDB ID, position>
                T ← SeqDB[SeqDB ID]
                BIDIRECTIONALED(P, 1, T, position, k, q)
            end for
            start ← start + 1
        end do
    end for


Algorithm 5.8: ACPM Hypothesis Verification

BIDIRECTIONALED(P, posP, T, posT, k, q)
    m ← |P|
    P1 ← P[posP+q...m] ◦ P[1...posP-1]
    P2 ← P1^R
    ED1 ← DP(P1, T[posT+q...posT+q+m+k-1], k); ED2 ← DP(P2, T[posT-m+q...posT-1]^R, k)
    ED ← ComputeED(ED1, ED2, m, q)
    for h = 1 to m-q+1
        if (ED[h] + ED[h+m-1] ≤ k) then do
            return match
        end if
    end for
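For small inputs, the result of a verification routine such as Algorithm 5.8 can be cross-checked against a brute-force baseline that tries every rotation of the pattern. The sketch below is that baseline, not the bidirectional computation: it runs a standard edit-distance dynamic program once per rotation. The function names are illustrative.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # match or substitution
            ))
        previous = current
    return previous[-1]


def circular_edit_distance(pattern, text):
    """Minimum edit distance between text and any rotation of pattern."""
    m = len(pattern)
    if m == 0:
        return len(text)
    return min(edit_distance(pattern[h:] + pattern[:h], text) for h in range(m))
```

A pair would count as a match when `circular_edit_distance(P, T) <= k`; this costs roughly m times one dynamic program, which is the overhead the bidirectional approach is designed to avoid.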

Figure 5.5. The time cost of the CPM algorithms. Both panels plot running time against the number of proteins; the first panel shows Time (Minutes) and the second shows log(Time (Minutes)). Legend: Greedy ACPM, ACPM LIS, ACPM q-gram (q=2), ACPM q-gram (q=1).


(a) Degree distribution (b) Degree distribution in Top-100 degree nodes (c) Log degree distribution (d) Log degree distribution in Top-100 degree nodes

Figure 5.6. Degree distributions in the network of multidomain proteins constructed based on the circular patterns they contain.


(a) (b)

Figure 5.7. Number of directly connected pairs in Top-K highest degree proteins


Figure 5.8. The Protein network (using both CPs and non-CPs)

Red nodes are nodes found in the non-CP network. Green nodes are nodes found only in the CP network. Blue edges denote edges found in the non-CP network. Pink edges are edges found only in the CP network.



P 2 9 4 7 6

P 3 8 0 3 9

P 1 2 2 1 7 P 1 2 2 1 9

P 4 6 3 1 7

P 4 3 8 5 0

Q 0 3 4 3 2

P 4 6 7 0 1

Q 0 8 4 6 9

P 3 1 6 5 2

P 2 0 7 1 7

P 3 7 0 9 0

P 1 5 0 0 9

P 2 0 2 8 1

P 0 6 9 5 9

P 4 5 6 4 6 P 2 9 0 9 3 P 2 6 6 7 8

P 4 1 1 1 8

P 3 0 5 1 8

P 0 7 2 5 9

P 1 9 9 7 6 P 0 9 2 3 3

P 1 1 6 5 7

P 1 8 5 2 0

P 1 6 0 5 3P 3 6 5 0 1

P 2 5 6 9 1

P 0 8 5 5 1 P 0 4 2 6 4

P 0 8 7 7 7

Q 0 4 6 9 5

P 0 8 7 7 9

P 3 1 6 4 1

P 3 1 6 4 6P 3 1 6 4 3

P 4 8 0 2 9 P 3 1 6 5 0

Q 0 5 0 0 1Q 0 1 9 5 9

P 3 1 6 4 5

P 2 1 0 3 6

P 2 9 4 7 5

P 2 3 9 7 7

P 3 1 6 6 2

P 3 6 6 4 9P 0 6 0 1 2P 4 9 3 3 1P 0 0 4 1 5P 2 2 4 6 5P 1 6 5 4 9P 0 1 2 2 9 P 3 6 9 2 4 Q 0 7 4 1 1 P 0 9 2 2 3P 0 9 4 5 1

P 4 6 4 5 5Q 0 3 0 3 0

Q 0 6 9 2 7

P 2 6 5 0 3

P 3 3 9 0 9

P 0 6 4 0 7

Q 0 1 6 4 7

P 0 9 3 2 3P 1 4 1 5 1

P 0 2 8 2 8

P 3 2 1 9 8 P 4 9 2 4 2

P 1 4 8 5 0

P 0 5 0 2 0

P 0 7 1 2 4

P 1 7 5 1 8

P 4 6 5 3 7P 4 1 5 8 6

P 3 1 9 5 6

Q 0 9 4 3 9

P 0 8 9 5 5

P 4 8 1 1 3 P 2 1 9 7 7 P 9 8 1 3 1 P 4 6 2 0 8 P 4 4 8 6 2

P 4 3 9 7 9

P 3 5 2 3 6P 3 1 5 2 2

P 1 8 6 4 2

Q 0 2 9 3 4P 1 8 2 4 9

P 4 9 5 3 0

P 3 0 4 3 8P 4 1 4 4 1

P 4 4 5 7 8

P 0 7 1 6 6

Q 0 9 7 6 5

P 3 8 7 5 6

P 3 1 2 5 1

P 0 1 3 3 8

P 3 0 5 3 0

P 4 6 7 0 2 P 4 0 9 5 4

P 4 3 8 5 1

Q 0 1 9 6 9

P 1 1 4 7 2

P 4 9 4 8 1

P 1 0 5 2 0

P 3 6 2 1 3

P 3 6 8 3 6

P 1 7 7 9 9

P 2 9 5 7 7P 4 2 4 4 9

P 2 6 6 7 7P 2 1 3 1 4

Q 0 5 1 4 6

P 4 3 7 0 8P 2 4 0 1 4

P 1 7 9 5 5

P 0 7 8 3 8P 0 6 1 0 7 P 3 2 1 9 1

P 2 3 7 7 6 Q 0 7 8 6 1P 0 6 4 5 7

P 3 2 6 0 3Q 0 0 0 1 3

P 4 5 4 3 8

P 4 3 4 3 3P 2 5 5 1 0

P 1 8 2 8 5P 0 3 0 8 9

Q 0 2 8 6 9

P 4 2 1 7 6

P 1 1 0 8 0P 1 9 3 1 8

P 0 8 0 1 7

P 4 5 6 0 4

P 2 6 2 0 7

P 4 0 5 9 9

P 2 2 0 7 1

Q 0 8 8 9 0

P 2 2 3 0 4 P 0 0 4 6 1P 4 2 0 4 2

P 2 6 6 7 0

P 1 2 1 5 5

P 8 0 5 1 7

P 3 1 5 6 2

P 3 9 8 4 2

P 3 4 4 1 0

P 2 8 2 3 5 P 4 2 0 3 4

P 1 8 8 6 0P 1 5 0 4 3

P 3 1 0 0 7

P 3 8 0 6 0

P 4 9 6 9 7

P 4 5 0 4 8

P 8 0 4 4 8

P 0 9 8 9 1

P 3 3 5 6 7

P 2 1 0 3 7

P 1 2 7 9 2P 1 7 5 0 2

P 2 5 5 1 2

P 3 4 0 2 8

P 3 3 7 6 8

P 4 5 0 0 3

P 4 9 5 2 2

P 1 8 7 7 6

P 4 3 3 3 6

P 2 9 9 2 1

P 3 5 8 8 8P 4 6 3 8 8

P 0 3 0 0 4

P 0 3 0 7 0

P 3 1 0 5 3

P 0 8 4 0 7

P 0 4 0 0 8

P 0 3 4 2 5

P 1 1 2 2 3

P 0 7 9 4 6

P 4 2 0 5 5

P 2 4 8 5 1

P 1 3 7 2 0

P 4 7 4 2 7Q 0 4 6 1 9

P 4 9 4 2 6P 1 8 1 5 8P 3 2 1 3 1

P 0 7 9 8 5P 3 6 7 4 0

P 1 0 2 5 3

P 2 4 2 6 4

P 0 5 9 7 6

P 1 1 6 2 4

P 4 0 4 6 7

P 3 4 7 3 6

P 4 0 1 4 2

P 0 6 4 7 6P 0 8 5 1 0

P 2 2 4 6 0

P 3 5 4 9 9

P 0 8 1 0 4

P 2 5 1 2 2

P 0 4 7 7 5

P 2 2 4 6 2

P 1 5 3 9 0

P 2 8 3 6 5

P 0 7 3 9 5

P 1 1 8 3 1

P 1 9 8 2 8

P 1 1 4 5 4

P 1 5 2 1 5 P 4 2 1 9 9

P 0 7 6 4 5

P 1 1 1 8 1

P 3 7 0 5 1P 2 8 5 7 0 P 2 8 5 7 1

P 3 1 6 4 7Q 0 1 2 0 5P 2 8 5 7 3

P 2 2 0 0 1 P 4 9 3 8 0P 2 2 1 8 9

P 3 9 5 2 4

Q 0 5 0 3 7 P 1 2 8 3 0

P 2 7 7 4 3

Q 0 9 8 9 1P 1 7 9 7 0 Q 0 8 4 3 5

P 3 0 7 1 4P 2 3 9 8 9

P 4 8 6 3 3

P 0 5 8 0 4

P 1 7 3 2 6

P 2 7 7 4 2

Q 0 8 7 8 8

P 2 6 2 5 7

P 2 5 4 6 4P 4 0 8 7 2

P 0 6 8 6 4

P 2 6 0 4 6

P 4 8 5 4 7

P 3 7 1 1 6

P 2 2 7 0 0

P 1 0 3 7 8

P 0 0 7 2 2

P 4 1 8 1 1

P 1 3 5 8 7

P 2 3 6 3 4

P 2 2 0 3 7

P 1 1 7 1 8P 0 5 0 3 0P 1 3 5 8 6

Q 0 8 4 3 6

P 3 2 4 8 1

P 4 6 1 9 8

P 2 4 8 8 0

P 0 8 4 2 4

P 1 5 5 5 9

P 3 1 9 7 1P 0 5 5 1 0

P 2 6 8 4 9P 3 4 8 5 4

P 4 1 2 9 8

P 0 0 7 9 7

P 1 0 8 7 0

P 0 0 5 3 8P 3 2 6 3 9

P 2 2 0 8 2

P 0 4 8 0 0

P 4 1 0 5 9

P 4 8 4 1 9

P 3 1 3 7 3

P 4 1 3 4 3

P 3 2 2 6 4

P 1 4 5 5 0

P 3 2 2 9 6

P 4 3 8 9 0

P 3 8 6 9 0

P 2 3 9 8 8

P 2 0 9 5 1

Q 0 4 5 7 5

P 3 1 6 2 3

P 1 2 8 9 4P 2 2 5 9 1

P 1 7 7 7 9

P 4 9 3 3 7

P 0 9 5 4 4

P 4 9 3 4 0

P 3 1 2 8 5

P 1 7 9 6 5P 0 0 4 5 5

P 2 1 8 9 0

P 3 6 5 9 2

P 2 1 5 5 2

Q 0 3 4 6 8P 4 0 8 2 5

P 3 9 7 4 8

P 2 5 4 0 4

P 1 5 6 8 4

P 1 0 9 3 3

P 1 7 1 2 5

P 1 6 1 7 6

P 2 3 3 5 9

P 1 6 0 4 7

P 4 3 0 2 7

P 3 8 5 5 2

P 3 4 8 2 0

P 0 7 2 0 0

P 1 3 4 9 7

P 3 0 8 8 4 P 1 8 0 7 5

P 0 7 3 3 7

P 4 4 0 2 0

P 3 6 3 0 7

P 0 7 1 3 2

Q 0 0 7 3 2

P 3 3 8 1 7

P 2 6 8 0 8

P 2 3 4 2 6Q 0 1 0 0 2

P 2 6 7 6 4

P 0 3 5 7 9P 2 1 3 7 6

P 3 6 3 5 1

Q 0 4 5 1 9

Q 1 0 0 5 7

P 1 2 8 7 0P 1 5 1 8 3

P 1 0 9 5 0P 1 4 7 4 9

P 1 4 6 7 7

P 4 8 6 0 1

P 1 7 2 4 7 P 4 9 0 0 3P 2 5 6 2 1

P 3 3 9 4 5

P 3 7 8 9 8P 2 0 7 2 2

Q 0 6 4 4 3

P 1 5 5 4 1

P 3 9 8 7 5

P 2 1 2 7 4

P 4 0 7 5 0

P 3 4 8 2 1

P 3 6 1 7 8

P 2 4 6 5 1P 4 8 8 2 5P 3 3 3 6 3

P 4 9 0 8 6

P 1 4 1 2 1P 3 4 0 5 7

P 0 9 0 0 1

P 2 1 4 6 1

P 4 5 0 4 5Q 0 3 1 5 7

P 4 4 3 6 6P 2 9 3 8 5 P 2 8 0 3 7

P 4 9 4 0 4

P 9 8 0 5 6

P 2 7 9 7 3P 3 3 4 2 9

P 0 7 5 7 2P 0 5 3 4 2P 3 3 9 0 5 P 4 4 8 5 4

P 2 7 4 2 2

P 1 4 6 2 5

P 0 2 6 3 6P 0 0 3 8 2

Q 0 7 8 0 1

Q 0 1 9 9 2 P 2 3 6 5 8

P 2 3 9 6 5 P 1 9 2 1 7 P 3 6 7 8 8

P 4 6 2 3 6

P 0 9 1 5 2

P 3 7 5 8 9

P 2 8 3 2 9P 0 7 6 6 8Q 0 5 8 8 5P 2 2 9 3 6

P 1 2 9 0 5P 0 6 0 2 4

P 0 7 1 4 7 P 2 6 5 7 0P 0 3 3 6 2 P 2 8 0 6 2

P 0 6 1 2 5

P 1 7 9 8 9 P 0 6 6 7 0P 0 6 2 0 2 P 2 1 2 4 9P 4 4 6 0 4

P 1 7 4 9 0Q 0 0 6 8 9Q 0 6 0 3 1P 1 6 6 6 5Q 0 6 9 0 8

P 0 0 2 5 9

P 4 6 3 1 2 P 1 6 4 6 6 P 1 5 3 2 0 P 4 7 2 6 7 P 1 3 6 3 5 P 4 0 3 1 9 P 2 6 0 0 7P 4 1 5 6 3

P 4 7 8 7 7

P 3 4 4 2 5P 3 3 6 9 6P 2 4 0 8 1

P 2 4 9 0 6 P 0 3 0 7 4

Q 0 0 5 1 8

P 1 3 7 0 5 Q 0 5 7 4 9 Q 0 0 2 8 6

P 2 7 0 9 2 P 4 4 3 3 0

P 0 6 7 9 8Q 0 0 0 5 6

P 2 4 5 0 1 P 1 5 7 5 0

P 2 7 6 6 8 P 4 3 7 7 5 P 2 9 7 3 3P 4 7 0 6 4P 0 9 8 3 5 P 1 2 1 5 9

P 2 3 1 0 0 P 1 7 9 2 1 P 2 7 0 0 1 P 3 3 5 6 2 P 4 9 1 1 6 P 4 9 1 1 7 P 4 4 9 8 4 P 3 2 0 7 0P 4 0 3 4 0P 1 7 7 4 3

P 4 6 0 2 5 P 4 6 0 2 6 P 3 9 6 5 6 P 3 3 7 6 7 P 2 4 5 2 9 P 0 4 1 7 7 P 0 9 6 0 7 P 4 1 5 1 0

P 4 8 0 3 2 Q 0 0 3 3 5 P 1 5 9 1 0 P 0 8 8 1 9 P 4 9 2 5 3 P 4 0 2 7 5P 2 5 7 8 5

P 1 6 5 6 9P 4 2 9 0 5P 4 2 9 0 9P 3 7 0 8 1

P 2 5 9 8 6P 3 6 7 2 1Q 0 5 1 1 1P 4 2 1 9 1Q 0 0 9 9 1P 0 7 3 0 5

P 3 4 5 4 6 P 3 2 8 4 2 P 2 6 4 6 9 P 4 5 0 4 2 P 1 7 5 2 0 P 1 4 0 4 1P 1 1 6 2 3 P 4 1 1 1 5

Q 0 7 5 7 4Q 0 7 5 7 1P 1 3 2 8 8Q 0 1 0 1 5P 4 2 8 7 1P 2 9 0 6 2 P 4 6 8 1 0 P 4 4 3 9 6P 4 9 0 5 7

P 2 7 5 9 4 P 3 9 1 4 8 P 2 0 1 0 3 P 4 3 8 3 8 P 1 0 6 4 1 P 1 5 3 6 8 P 1 9 0 9 7

P 1 6 0 2 7 P 4 8 5 5 6 P 3 9 8 2 2 P 4 1 3 9 0 P 0 4 0 4 6 P 2 6 3 6 7 P 4 7 2 3 8 P 4 8 5 0 6

Q 0 7 2 1 1 P 3 5 0 5 2 P 3 5 0 5 3 P 2 2 4 4 9 P 2 6 0 2 2 P 4 9 2 5 9 P 4 9 2 6 0P 0 8 7 2 2P 4 9 6 9 8

P 3 5 3 3 4P 2 2 2 8 4P 2 6 9 8 4Q 0 1 2 6 3P 1 9 7 8 2 Q 1 0 0 9 4 P 3 5 3 5 4P 2 2 2 8 5P 2 6 4 2 0P 0 4 3 9 4Q 0 1 5 8 1P 2 3 2 2 8Q 0 9 7 6 8 P 3 2 7 4 7

P 4 3 4 6 8P 2 8 6 6 8P 1 5 4 0 7P 2 9 1 7 6P 1 8 6 2 5

P 0 4 8 9 0

P 3 5 5 3 8

P 2 1 8 1 0

P 1 8 4 6 6

P 2 3 9 5 0

P 2 4 4 3 0

P 4 9 4 1 6

P 0 7 1 0 2

P 3 9 7 7 3

P 4 7 8 5 3

P 1 4 9 2 4

P 4 2 7 9 7

P 1 7 4 3 1

P 3 1 4 3 1

P 3 6 3 3 6P 3 6 3 3 7

Q 0 7 0 2 0

P 3 6 0 3 1

P 4 0 8 9 0

P 3 4 1 1 8

Q 0 1 1 3 0

P 1 9 4 1 6P 3 2 2 8 2

P 3 9 7 6 6

P 3 6 3 1 9

P 2 7 9 8 0

P 2 6 6 5 4

P 3 6 3 0 0

P 1 5 4 7 7

P 1 9 9 0 9

P 0 7 6 1 7

P 3 0 6 7 3P 3 6 3 2 0

P 4 4 3 5 6P 1 5 6 2 9

P 4 0 5 8 4

P 4 0 4 3 8P 1 1 7 6 8

P 1 6 7 1 7

P 4 0 9 5 9P 4 8 9 6 6P 3 0 3 0 5P 1 7 8 1 7 P 0 4 0 7 0

P 3 0 3 5 2

P 2 1 0 3 3

P 1 5 5 2 0

P 4 1 0 0 6

P 2 5 1 9 2

P 1 2 1 4 6P 0 8 2 4 7 P 2 1 0 8 0

P 3 3 8 6 2P 0 7 8 2 5Q 0 9 6 8 2 P 2 0 9 8 5

P 2 3 6 3 8 P 4 6 7 8 6 P 3 3 8 3 2P 0 6 4 9 0

P 0 2 6 3 5P 1 0 1 7 0P 4 7 4 5 6P 1 0 1 7 1

P 4 7 5 5 1P 4 2 9 6 3

P 2 5 6 0 5Q 0 2 1 4 0

P 3 7 1 1 8 P 1 1 9 7 6P 3 8 6 0 8P 0 1 7 3 1 P 2 2 5 4 9P 2 7 5 1 2 P 3 0 4 3 3 P 2 7 2 5 7

P 1 3 0 5 3 P 1 9 1 1 3 P 4 4 5 2 4

P 0 4 8 4 4P 2 3 8 0 0

P 0 0 8 9 4

P 2 6 0 1 0

P 3 9 7 6 8

P 4 5 1 1 2 P 0 4 2 5 6P 4 9 3 1 2P 1 2 6 8 7P 4 8 5 3 5P 1 1 1 5 7P 3 1 3 5 0P 2 5 2 3 5P 2 5 9 1 6

P 2 4 5 5 5 P 2 0 1 1 6

P 3 5 3 1 2P 4 0 9 1 5

P 4 6 6 5 5

P 2 5 4 6 8

P 3 3 8 6 1P 3 2 2 0 6

P 0 5 0 3 4P 1 6 4 5 3

P 1 8 1 3 3 P 1 4 0 7 8 P 1 8 1 3 9 P 3 3 3 2 9P 4 0 1 2 6P 0 6 7 2 5

P 4 8 7 5 5P 3 4 8 9 4P 2 9 9 3 0

P 2 1 8 9 3

P 1 2 8 4 3 P 3 0 1 9 5P 2 9 9 3 1P 4 8 7 4 7

P 1 8 7 5 4P 3 8 7 5 4P 4 7 6 5 7Q 0 8 6 8 4P 1 1 0 9 5 P 1 1 8 3 5

P 0 6 5 8 6

Q 0 2 3 2 6

P 1 9 6 4 2

P 2 9 1 9 1P 3 9 9 6 5

P 4 2 7 5 6 P 0 9 9 0 1 P 1 8 0 7 6 P 2 7 6 8 8Q 0 1 9 5 3 P 2 1 4 1 3P 1 4 5 5 3 P 3 9 5 9 9 P 2 1 7 7 5 P 1 7 9 5 9P 0 7 8 4 2P 3 3 2 9 1P 1 1 6 8 0P 4 6 5 8 6 P 4 4 4 6 2

P 2 8 0 6 3 P 1 0 4 8 6 P 1 7 2 2 4 Q 9 9 1 3 2 P 1 2 8 0 6 P 1 1 4 6 5 P 1 8 1 4 8 P 0 2 7 8 1

P 2 5 9 4 2P 4 5 0 3 5P 3 7 1 1 7P 4 1 0 8 5P 0 7 2 0 6P 2 7 9 1 9P 0 4 9 9 6P 2 0 0 9 3 P 2 9 9 6 1P 3 5 7 2 1

P 4 1 5 8 5P 2 3 7 0 0

P 1 0 2 1 2P 2 3 5 9 9P 2 3 6 0 3

P 2 5 4 4 5P 2 5 4 4 6

P 3 5 5 2 0

P 2 3 2 7 9

P 1 0 6 1 5

P 1 4 2 6 3

Q 0 2 1 8 4 Q 0 2 1 8 5 P 3 3 2 4 0 P 3 9 3 9 6 P 1 3 1 5 6 Q 0 3 0 3 1 P 1 3 6 4 9

P 2 1 0 4 6Q 0 1 4 7 8

P 2 8 0 7 0P 3 2 2 1 6

P 3 3 8 0 0P 0 9 5 6 0

P 0 4 0 4 9 P 1 6 9 1 3 P 1 0 4 9 8

P 1 6 5 5 1

P 3 5 1 5 5 P 3 3 8 6 3 P 4 0 1 1 2

P 2 0 2 3 2 P 4 9 3 7 3 P 3 7 9 7 1 P 3 3 6 5 8 P 0 6 1 6 3 P 0 6 1 6 2 P 2 8 7 0 2 P 2 8 7 0 4P 1 9 9 4 0P 3 7 9 7 0

P 2 2 6 7 5Q 0 9 1 4 5Q 0 7 7 4 4P 0 7 6 0 6P 4 2 3 2 6P 2 4 2 2 0 Q 0 3 8 4 5P 4 3 0 6 1

P 3 5 4 9 4P 0 8 3 2 0P 0 7 7 3 8P 0 6 1 5 5

P 0 8 4 5 6Q 0 9 7 5 0

P 0 0 4 5 2P 2 5 4 5 1 P 0 5 1 3 7P 3 7 4 2 6 P 2 6 7 1 1

P 3 6 6 7 4P 3 6 6 7 3P 2 1 8 0 3P 1 6 0 9 2P 3 0 6 1 3P 4 1 0 0 3P 2 9 9 8 7P 2 9 9 8 9

Q 0 6 7 9 3P 1 0 2 3 1P 2 2 7 0 9P 1 3 5 1 3 P 1 3 9 8 0 P 0 8 3 1 3P 2 7 6 7 9Q 0 1 0 5 6

P 3 4 3 3 5

P 0 0 5 4 9

P 0 7 1 1 7

Q 0 1 0 8 5

P 4 3 7 5 1

P 2 9 7 3 2 P 3 2 9 0 8

P 3 1 4 8 3

P 1 0 5 0 3 P 4 9 0 4 1 P 2 8 5 2 7 P 4 3 8 0 4

P 2 5 9 7 6 P 2 5 9 7 7P 1 3 2 8 6

P 2 3 4 0 2P 4 2 6 3 2

P 3 1 3 2 5P 2 4 4 2 5

P 4 2 6 9 7

P 3 7 6 7 6

P 0 0 5 8 8

P 1 8 2 7 8

P 3 0 6 7 2

P 1 4 0 4 2 P 2 4 6 5 2

P 3 8 4 9 3 Q 0 5 8 8 0

P 0 0 5 8 9

P 2 3 3 8 6 P 4 6 0 1 8

P 4 1 5 6 2

P 4 3 4 4 9 P 2 5 2 5 0

P 1 1 1 7 2 P 1 3 2 1 5P 1 4 0 1 7

P 0 8 7 0 8Q 0 5 0 9 4

P 4 3 8 9 3P 2 3 2 2 9

P 4 8 8 0 5P 4 3 7 9 4P 1 4 6 5 5 P 3 6 3 8 6P 4 3 2 1 9 P 0 9 6 8 1 P 1 4 6 6 2 P 4 0 2 7 9 P 2 5 8 5 7 P 0 9 3 1 7 P 3 1 8 3 8 P 3 0 3 4 7

P 2 6 6 3 9P 4 7 1 9 1P 1 6 5 1 8P 1 3 8 3 7P 0 3 7 1 0 P 4 0 1 1 1Q 0 5 2 5 9 P 0 8 3 8 6P 1 7 0 5 3

P 0 4 1 4 4 P 1 8 1 1 5P 0 4 6 2 6P 1 3 6 0 8P 4 1 7 1 3 Q 0 5 7 9 3P 4 2 6 6 2 P 1 8 5 6 9P 4 2 6 7 4 Q 0 2 9 1 7

P 0 3 5 1 9P 1 6 7 2 9P 0 3 2 2 6P 4 3 2 6 2P 4 4 0 4 2P 3 5 7 7 7P 3 3 8 5 7P 2 9 8 1 6P 3 6 2 2 7 P 2 0 9 9 1 P 1 9 3 4 8P 3 3 8 3 9

P 4 0 7 9 1Q 0 0 4 9 5

Q 0 4 7 4 7

Q 0 4 7 7 7

P 2 6 2 6 2

P 2 7 0 9 0P 4 5 2 7 4

P 2 0 3 9 6

P 2 6 8 4 6

P 0 4 5 4 0P 1 5 5 8 1

P 2 9 9 1 3P 0 8 7 3 9

P 3 4 8 5 2P 4 0 0 1 0 P 1 5 5 8 2

P 2 0 4 5 9

P 0 6 2 6 2

P 3 3 5 1 0

P 2 4 5 8 8

Q 0 1 2 0 7

P 8 0 3 1 3

P 4 6 2 3 8

P 4 7 6 7 2

P 2 0 0 2 8P 2 2 1 3 8

P 0 3 0 4 2P 1 4 1 1 0

P 4 0 5 2 7

P 3 2 6 6 0

P 1 3 5 4 7

P 0 6 5 6 5

P 4 5 0 7 5P 4 3 7 6 3

Q 0 1 8 3 6

P 4 8 4 0 9

P 0 7 0 0 4

P 1 4 0 4 3

P 4 5 6 3 8

P 2 1 3 4 7

P 1 7 6 5 8

P 1 3 8 0 6

Q 0 2 3 4 3P 2 2 0 0 2

P 2 2 3 1 6

P 3 7 0 8 8 P 3 3 6 1 3P 4 8 9 1 5P 2 4 8 8 4

P 3 4 8 5 5

P 2 9 8 0 1

P 0 5 9 8 2

P 1 5 5 6 4

P 1 4 6 1 4 P 4 5 9 7 5

P 3 6 6 4 2

P 0 3 1 0 5

Q 0 9 6 8 3

P 2 7 2 3 5

P 3 6 7 6 3

P 3 8 1 4 4

P 4 0 1 1 8P 8 0 1 5 1

P 2 9 0 7 2

P 2 3 2 6 5P 4 8 9 7 4P 4 6 9 6 0P 2 1 4 6 3

P 4 7 9 0 1

P 3 9 1 4 3P 4 3 6 5 7

P 4 8 4 8 3

Q 0 8 2 0 9

P 0 7 6 2 1

P 2 6 0 4 5

P 4 8 4 5 2

P 3 1 3 8 9P 4 8 4 5 9

Q 0 6 1 8 0

P 1 4 7 4 7

P 2 9 0 7 4P 4 3 3 7 8

P 2 8 8 2 8

P 1 6 4 7 3

P 4 6 0 2 3

P 2 4 9 3 9

P 0 5 0 7 8P 0 6 7 2 4P 4 2 8 4 2

Q 0 3 5 6 0

P 4 7 7 9 9

P 2 1 4 5 0

P 2 7 5 8 0

P 2 7 3 5 2

Q 0 2 9 4 2P 4 2 3 4 7

P 4 3 2 5 3

P 2 3 3 6 2

P 1 0 7 2 0

P 4 7 8 0 3

P 4 3 1 4 2P 1 4 1 2 6

P 2 1 0 8 4

P 2 1 0 3 9

P 3 5 3 5 0

P 4 3 5 0 5

P 4 7 3 1 9

P 3 7 2 8 5

Q 0 7 8 6 6

P 3 7 6 5 0

P 1 5 7 0 5

P 3 1 9 4 8

P 4 4 0 9 2

P 4 1 8 4 2

P 3 8 8 2 5

P 0 4 0 0 1

P 3 2 3 1 1

P 2 2 3 3 2

P 2 3 1 6 3

P 3 5 3 8 3P 3 2 2 5 0

P 4 9 6 5 0P 3 4 9 7 9

P 3 2 9 4 0

P 1 1 6 1 3

P 2 8 0 8 8

P 2 0 6 3 8

P 4 3 1 1 5

P 2 5 1 0 6

P 3 4 9 8 0

P 3 2 2 4 0

P 3 0 5 5 7

P 1 3 9 4 5

P 3 6 1 7 6

P 1 9 3 9 8

P 1 5 4 0 9

P 3 0 5 4 6

P 3 5 4 0 9

P 3 8 8 6 7

P 2 0 9 8 9

P 2 5 4 7 3

P 3 7 9 7 2

Q 0 1 7 1 7P 3 5 4 0 8

P 3 2 5 1 2

P 2 3 8 0 1

P 2 2 2 9 7

P 1 2 8 0 5

P 3 0 3 7 2

P 4 8 7 4 8

P 4 7 8 0 4

P 1 0 4 7 5

P 0 4 9 5 5

P 0 6 5 6 6

P 1 9 5 5 9

P 2 6 8 0 6

P 0 3 3 4 5

P 0 3 3 3 6

P 2 9 7 1 9

P 0 4 9 5 6

P 2 8 6 2 1

P 1 0 3 9 4

Q 0 2 0 9 9

P 4 8 2 7 9

P 3 4 4 5 5 P 0 4 0 1 3

P 3 6 7 5 3

P 3 6 7 4 9

P 3 6 7 5 4

Q 0 7 8 7 5

P 3 3 2 9 2

P 3 5 0 5 6

P 4 4 5 2 0

P 4 9 5 3 4

P 4 9 5 3 3

P 4 8 2 7 1

P 4 9 5 1 8

P 3 4 0 5 5

P 4 6 7 3 7P 3 2 4 8 2

Q 0 3 5 6 9

P 3 0 5 5 9

Q 0 5 3 9 4

P 3 2 3 0 6

P 1 0 9 0 9

P 4 1 2 3 1

P 2 2 3 2 2

P 4 1 1 4 3

P 4 7 7 5 1P 3 0 5 4 9

P 3 4 9 7 5P 2 8 3 3 6

P 2 8 6 8 0

P 3 2 2 3 6

P 2 0 3 4 6

P 0 8 1 7 3

P 3 5 3 7 2

P 3 0 0 9 8

P 3 3 5 3 3

P 4 1 1 4 4

P 4 7 7 4 8

P 3 5 3 7 7

Q 0 4 5 7 3

P 0 5 3 6 3P 3 5 3 7 1

P 4 6 0 9 0

P 3 5 3 4 3

P 1 7 1 2 4

P 3 4 8 1 3

P 2 2 3 2 1

P 4 9 5 7 8

P 0 8 1 7 2

P 0 8 9 1 2

P 4 7 8 9 8

P 2 5 1 1 5

P 1 8 9 0 1

P 1 1 2 2 9

P 2 1 9 1 7

P 3 2 2 1 1

P 1 8 8 2 5

P 2 9 2 7 6

P 2 0 3 0 9P 4 2 2 8 9

P 3 0 6 8 0

P 2 8 6 4 6

P 3 0 8 7 4

P 3 4 9 6 9

P 2 5 9 6 2

P 3 0 8 7 2

P 0 1 4 5 2

P 3 0 9 3 6

P 3 2 7 4 5

P 2 5 9 3 0

P 3 5 4 0 7

P 3 0 9 3 8

P 3 1 3 9 1

P 3 0 9 3 7

P 3 1 1 3 3

P 0 7 7 0 0

P 4 4 7 6 8

P 3 1 1 3 4

P 0 4 2 7 4

P 1 0 6 0 8

P 4 2 2 9 0

P 4 3 1 4 1

P 0 9 4 9 9

P 0 3 5 6 6

P 1 3 7 9 2

P 0 8 1 9 9

P 0 8 6 8 9

P 4 3 8 2 3

P 0 6 8 6 8

P 1 8 2 5 4

P 4 9 6 0 8

P 3 5 4 4 7

P 3 5 7 7 8P 3 7 2 3 1

P 3 9 0 6 1

P 1 9 7 6 5

P 0 7 7 0 3

P 3 9 8 9 8

P 4 6 9 2 5

P 1 2 0 8 1

Q 0 3 1 8 1

P 3 5 4 4 6

Q 0 0 9 9 3

P 4 6 4 5 6

P 3 8 9 7 2

P 4 3 8 4 7

P 3 7 1 9 8

P 2 0 0 7 3

P 2 4 8 1 4

P 3 4 9 4 9

P 3 8 0 9 2

P 2 2 4 5 8Q 0 1 0 2 0P 3 7 8 9 6P 1 5 1 5 5Q 0 4 9 1 6P 0 9 7 7 6P 1 5 4 0 3

P 4 4 8 3 6

P 0 9 9 5 0

P 1 9 7 6 3

P 1 3 3 8 3

P 3 1 1 7 1

P 1 1 8 3 4

P 2 6 9 7 5

P 2 5 3 8 4P 3 7 0 3 2

P 2 9 3 5 3

P 1 5 0 0 1

P 0 8 7 7 8

P 4 0 8 0 1

P 3 3 0 6 8

P 4 7 8 4 6

P 3 2 2 5 7

P 0 4 3 8 6

P 1 1 6 2 1

Q 0 6 6 6 0 Q 0 0 9 6 6 Q 0 3 2 4 4 P 1 8 2 9 4 P 3 7 5 7 3 P 3 4 9 1 3P 2 1 8 7 6

P 2 9 9 5 1 P 1 3 6 0 5 P 2 7 2 1 4 P 4 8 8 9 2 P 1 1 7 0 1P 3 2 0 8 6

P 0 9 8 3 2P 4 4 7 9 5 P 3 4 6 5 0

P 3 8 2 8 5 P 3 7 1 2 7

P 2 3 2 4 8

P 3 2 2 1 5

Q 0 8 6 4 2

P 3 1 3 0 1

P 4 6 5 3 8

P 0 2 6 6 2

P 1 2 2 5 6

Q 0 5 2 3 7

P 1 8 6 2 6

P 2 0 8 2 8

P 2 9 0 2 9 P 4 2 8 8 3

P 3 3 6 5 5

P 3 3 7 5 2

P 3 9 5 2 9

P 4 5 5 8 4

P 2 0 5 3 3

P 4 1 9 7 8

Q 0 9 9 2 3

Q 1 0 1 1 3Q 0 3 4 6 0

P 2 0 8 7 4

P 4 8 0 4 4P 2 9 3 3 6

P 2 0 1 4 8

P 0 4 2 8 2

P 3 8 0 2 5

P 1 4 8 5 3Q 0 6 8 2 8

P 1 7 3 6 5P 0 6 4 8 5P 0 6 5 3 1Q 0 5 7 4 0 P 2 1 0 4 5P 1 3 2 9 2

P 2 0 4 0 9

P 2 0 5 3 4 P 2 6 6 3 7 P 4 7 9 1 3 P 8 0 0 5 9 P 4 5 6 1 0 P 0 5 3 2 8 Q 0 8 0 9 9 P 2 1 0 4 4

P 3 7 5 5 1

P 1 4 9 0 7

P 1 5 5 6 7 P 2 5 7 6 5P 1 6 1 6 7

P 3 3 8 2 7 P 1 0 8 5 7Q 0 9 8 2 8 P 4 1 5 6 9

P 2 9 4 5 3P 3 6 6 1 9

P 4 5 0 7 7P 2 6 1 8 5

P 0 7 0 5 6P 0 9 7 5 8P 1 6 4 2 2P 0 4 1 4 3P 4 5 8 3 7P 2 3 5 4 9 P 1 5 6 9 8P 1 0 4 8 1P 0 7 9 8 4

P 2 5 3 8 3 P 2 8 3 2 4 P 4 1 1 5 8 P 3 3 8 7 9 P 3 4 5 3 1P 2 9 8 5 9P 4 3 0 0 2

P 1 5 3 0 9P 0 9 8 8 9P 1 4 3 8 1Q 0 0 2 6 9P 2 6 6 4 7P 3 0 7 0 6P 3 4 7 5 4 P 1 2 8 5 2P 2 0 6 4 6P 2 9 1 3 0

P 2 7 6 9 9

P 4 8 5 6 2P 4 6 0 6 1

P 1 8 6 0 9P 0 3 5 3 8

P 2 9 9 1 5

Q 0 0 2 7 4

P 4 3 3 0 9P 4 0 0 1 2P 4 3 1 1 9P 1 3 6 0 9P 2 1 5 5 6 P 1 0 1 2 4 P 1 0 6 8 6P 0 8 4 8 7P 2 5 1 0 5P 3 2 7 3 6 P 4 3 2 5 2

P 4 3 3 0 7 P 1 9 7 6 1 P 1 4 1 4 7 P 4 5 6 0 8 Q 0 9 0 3 7 Q 0 6 2 2 2P 3 5 8 5 6 Q 0 2 4 4 5P 1 6 9 6 7 P 2 1 1 5 1

P 4 6 0 6 0P 2 4 9 1 8

Q 0 3 0 6 1P 1 0 8 6 6P 8 0 2 9 9 P 1 2 5 5 9

P 4 2 0 5 9

P 1 3 3 9 3

P 1 5 6 3 9

Q 0 1 0 6 0

P 3 0 5 9 7

P 2 7 6 3 3P 1 4 6 9 8

P 0 7 8 7 1

P 4 9 1 0 8

P 2 7 5 0 7

P 2 9 4 6 5

P 4 3 8 5 2

P 0 4 0 0 9

P 2 4 2 7 1

P 2 3 8 4 9

P 2 3 0 5 2

P 1 9 2 4 4

P 4 0 3 0 7

P 2 8 9 6 8

P 2 2 5 5 7

P 1 7 7 9 3 P 1 3 3 7 2 P 2 4 4 4 3 P 0 8 1 9 5 P 1 0 8 5 2 P 4 3 3 6 0 P 4 3 3 6 6P 1 9 2 7 5 P 4 2 3 9 5

P 3 5 9 7 2

P 3 5 1 6 9

P 3 1 3 9 3

P 0 4 8 5 1

P 2 3 3 8 3

P 3 2 6 0 0

P 1 3 7 9 7

P 1 6 9 1 6P 0 6 8 0 4

P 0 4 7 7 6

P 4 5 2 9 4

P 0 4 3 4 7

Q 1 0 1 0 6 P 3 4 2 4 7

P 3 0 6 2 5P 3 1 3 1 9

P 3 7 5 2 8

P 2 2 1 4 1 P 4 4 2 8 6 P 3 2 5 0 6 P 1 5 1 5 1 Q 0 6 6 0 0 P 4 0 2 4 2P 0 1 3 7 4P 0 7 2 2 4P 3 3 6 1 0

P 2 1 0 6 9P 2 1 0 6 6P 3 3 8 5 4P 2 4 7 6 6P 2 4 7 6 1P 0 9 5 0 6 P 3 3 8 5 5

P 3 5 5 6 5P 3 8 7 5 5

P 4 4 8 4 3

P 3 0 2 3 6

P 3 1 8 1 2P 2 3 0 0 3P 4 6 8 0 1

P 1 1 1 3 1P 1 4 6 0 5

P 4 9 3 8 9

P 3 0 5 2 8 P 2 0 1 7 4

P 3 7 7 6 5

P 0 8 6 8 0

P 2 2 6 4 8 P 4 4 0 7 4

P 2 4 5 9 8 P 1 5 5 0 9 P 1 1 0 5 2

P 3 4 0 8 8

P 0 4 2 2 0P 0 1 8 6 0

P 0 7 5 6 7

P 4 9 6 4 3

P 2 0 4 0 6

P 4 7 4 1 8

P 0 5 3 5 6

P 1 7 7 9 7P 0 9 7 7 9

Q 0 8 1 0 0

P 0 7 1 6 9

P 0 6 4 7 5

P 3 8 1 7 4

P 2 9 1 7 5

P 1 5 3 1 9

P 3 0 7 6 4

P 1 3 4 9 6

P 1 9 2 5 4

P 3 1 3 9 6

P 4 8 9 3 7

Q 0 0 7 6 1

P 3 7 5 1 2

P 0 5 3 6 9 P 2 3 1 7 6

P 0 4 3 2 9

P 0 3 6 0 2

P 2 8 5 8 3

Q 0 8 1 0 3

P 2 4 1 9 3

Q 0 0 2 5 9

P 3 3 7 2 5P 4 5 2 7 7

P 0 9 1 4 4

P 2 1 0 9 7

P 1 1 3 4 9

P 0 5 9 4 5

P 2 6 1 7 7

P 3 3 8 5 9

P 3 5 8 9 0

P 3 6 4 9 2

P 2 6 2 3 7

P 0 5 3 5 9

P 4 5 1 7 5

P 0 9 7 8 3P 2 8 9 1 7

Q 0 7 7 6 2

P 3 3 8 5 1

P 1 0 2 1 1P 2 8 9 1 2

P 2 8 9 1 6

P 0 3 1 7 3

P 1 3 3 7 4

P 2 8 5 4 8

P 2 9 5 9 0

P 4 9 1 3 7

P 3 9 7 4 5P 4 7 8 1 2

P 3 1 3 1 4

P 2 0 2 6 5

Q 0 0 6 8 0

P 4 3 3 4 5

P 2 0 2 6 7

P 3 2 4 9 0

P 4 3 1 2 0

Q 0 1 8 6 0

Q 0 4 9 9 6

P 1 3 5 9 4

P 2 1 9 5 2

P 0 3 2 7 4

P 1 0 1 8 1

P 3 3 2 1 6

Q 0 9 4 3 5

P 4 2 0 8 6

P 0 7 3 1 3

P 1 5 2 4 2

Q 1 0 0 5 6

Q 0 9 4 9 9

P 1 1 7 3 0

P 1 3 5 9 2

P 0 7 5 2 4

P 3 4 8 9 1

P 2 7 0 3 7

Q 0 3 0 4 3

P 3 1 1 6 9

P 2 3 4 3 5

P 4 1 2 7 9

Q 0 6 5 4 8

P 4 8 5 1 0 P 2 7 7 0 4P 4 5 9 8 4

P 1 2 5 7 5

P 3 5 4 1 6

P 1 6 4 7 7P 3 2 4 9 2

P 0 5 6 6 1

P 1 4 1 0 5

Q 9 9 3 2 3

P 1 8 4 3 1

P 0 9 3 8 6

P 4 9 1 8 5

P 1 8 2 6 5P 3 6 0 0 5

Q 0 4 8 9 9

P 3 2 3 5 8P 4 7 8 1 1

P 3 5 4 1 7

P 3 5 5 7 9

P 0 8 7 9 9 P 3 5 4 1 5

P 3 1 3 6 7

P 3 1 3 6 0

P 0 9 6 2 9

P 2 7 6 0 9

P 3 7 8 0 6P 1 0 1 8 0

P 4 9 3 3 5

P 0 9 0 1 5

P 3 1 8 9 9

P 1 0 6 2 8

P 3 4 6 9 4P 2 1 0 0 1

P 3 9 9 8 4P 3 1 2 4 9

P 4 0 5 9 2

Q 0 1 2 2 6

P 0 3 4 3 4

P 3 6 1 9 7

Q 0 8 7 2 7P 2 0 2 6 3P 2 0 7 1 9

P 4 9 6 4 0

P 4 8 2 4 1

P 3 7 9 3 5

P 1 7 2 7 8

P 2 2 3 1 7

P 1 0 2 6 7P 3 1 3 6 4

P 2 9 8 2 5

Q 0 0 4 6 6

P 1 6 1 4 3P 3 7 9 3 8

P 4 0 4 2 6

P 4 2 5 8 7

P 1 4 8 5 8

P 3 7 2 7 5

Q 0 1 6 3 0

P 3 1 3 6 2

P 4 2 5 7 1

P 3 1 2 5 8

P 0 2 8 3 2

P 1 3 5 9 0

P 4 6 6 0 8P 1 8 2 6 4

P 4 3 6 9 9Q 0 5 4 6 6

P 1 4 6 5 2

P 4 6 3 2 0

P 4 0 7 6 4

P 2 2 0 0 9P 4 5 5 7 7

P 0 5 8 2 4

P 1 7 9 1 9

Q 0 3 9 7 4

P 2 3 8 1 2

P 0 9 0 7 9

P 4 3 6 9 8P 0 7 5 4 8

P 0 9 0 2 6

P 4 6 6 0 4

P 4 0 3 1 7

P 0 9 6 3 2P 2 3 4 5 9P 0 2 8 3 6

P 2 4 0 6 1

P 2 1 0 0 0

P 4 6 7 3 5

P 1 2 8 4 5

P 1 9 5 2 4

P 4 2 5 2 2

P 1 0 5 6 9

P 3 4 0 9 2

P 3 5 7 4 8 Q 0 2 4 4 0P 1 9 7 0 6

P 3 6 0 0 6

P 0 1 1 9 3

P 0 1 0 2 9

P 0 5 9 9 7

P 4 4 4 1 0

P 9 8 1 3 7

P 1 4 2 0 5

P 0 2 7 0 3

P 2 8 3 6 6

P 3 8 3 8 0P 4 7 8 4 7

Q 0 6 4 6 1

P 3 8 0 4 1P 4 9 6 4 9

P 3 9 9 4 0

P 2 4 5 0 7

P 0 6 6 3 4

P 4 2 6 9 4

P 4 6 9 3 4

P 0 5 1 2 9

P 4 1 8 2 3

P 3 4 6 8 9

P 2 0 7 9 3

P 3 2 8 9 2

P 2 5 8 0 8

P 3 9 5 4 6

P 1 5 4 2 4

P 3 9 6 8 7

P 2 1 6 9 3

P 8 0 2 0 6

P 0 2 4 8 2

P 8 0 2 0 5

P 2 2 7 3 5P 1 6 9 8 9

P 3 2 2 4 3

P 4 1 3 8 1P 2 0 4 4 8

P 3 2 2 4 2

Q 0 4 9 1 2

P 1 6 2 5 8

P 1 3 2 3 8

P 2 2 0 5 9

P 4 1 2 4 1

P 0 7 1 9 9

P 2 7 4 1 4

P 3 5 8 4 5

P 4 4 5 2 6

P 2 6 8 0 2

Q 0 9 7 2 7

P 2 4 7 8 2P 3 3 9 0 6

P 4 2 3 0 5

P 2 7 6 5 8

P 1 0 6 4 3 P 1 5 9 2 5

P 3 3 4 8 4

P 1 2 0 8 0

P 2 9 4 0 0

P 3 1 8 9 4

P 3 5 2 4 7

P 3 5 2 4 8

P 0 6 6 8 1

P 2 6 6 4 5

P 2 1 7 5 8Q 0 1 1 4 9

P 0 7 8 7 6

P 1 4 2 8 2P 2 0 7 0 1

P 1 4 9 9 6

P 2 1 1 8 0

P 4 4 9 5 3

P 1 5 9 8 9

P 2 5 5 2 4

P 0 2 4 5 8

P 3 4 3 4 0

P 2 8 4 8 1

P 0 5 5 5 5

P 1 8 5 4 6P 2 0 9 0 8

P 2 1 7 5 7

P 2 4 0 6 3P 4 2 8 9 0

P 3 5 2 4 6

P 3 1 6 9 5

P 3 2 3 2 8

P 2 3 2 9 8

P 0 4 5 8 4

Q 0 5 6 5 5

P 4 8 6 1 5

P 2 3 2 1 9 P 2 6 9 9 3

P 1 3 1 8 5

P 3 5 7 6 1

Q 0 9 8 9 8

P 4 7 8 0 9 P 1 7 7 8 9

P 0 5 1 2 6

Q 0 6 2 2 6

P 4 9 3 3 9

P 1 0 6 6 5

Q 0 3 3 5 1

P 1 9 5 2 5

P 3 9 9 6 8

P 2 1 8 6 0

Q 0 9 5 3 7

P 1 3 2 4 4

P 1 3 1 8 6

P 4 6 5 3 0

P 4 3 5 3 5

P 3 8 3 6 1

P 2 4 7 2 3

Q 0 1 7 0 5

P 0 5 8 1 3P 2 9 5 9 8

P 4 2 6 8 6

P 3 6 3 1 4

P 0 3 6 4 1

P 4 0 5 5 0

P 4 7 7 3 5

P 0 9 5 9 9P 2 5 6 9 3

Q 1 0 0 7 1P 2 7 9 6 6

Q 0 3 3 6 4

P 0 9 2 1 5

P 3 4 3 6 9

P 8 0 4 1 2

P 0 6 2 4 5

P 3 3 4 9 7

P 1 8 6 5 3

P 2 4 5 8 3

P 0 0 5 1 6

P 2 3 3 3 9

P 0 4 1 8 5

Q 0 0 3 4 2

Q 0 1 8 8 7

P 0 9 9 8 9

P 0 6 6 2 5P 3 7 3 0 5

P 2 3 4 4 3

Q 0 4 9 7 6P 3 8 4 3 8

Q 0 2 1 5 6

P 1 6 6 7 1

P 2 5 3 8 9

P 0 3 6 4 3

P 4 5 8 9 4

Q 0 5 9 9 9

P 4 8 7 4 9

P 2 6 7 7 9

P 0 7 6 0 2

P 3 5 5 1 3

P 3 3 5 3 0

P 3 4 2 4 4

P 0 8 4 1 3

P 2 3 4 5 8

P 3 6 9 7 8P 3 2 6 1 2

P 4 7 7 0 9

P 3 6 0 9 5

P 3 3 4 0 2

P 0 6 8 4 5

P 3 4 7 1 1

Q 0 6 8 4 6

P 0 3 9 6 7

P 1 8 2 9 3

P 3 8 4 3 2

P 1 3 6 7 7

P 2 5 3 2 1 P 4 0 0 6 1

P 3 9 0 0 0

P 1 3 5 9 6P 1 3 5 9 5

P 0 4 4 0 9

P 4 2 2 8 2P 4 7 9 8 7

P 0 6 7 8 2

Q 0 7 0 0 5P 4 4 6 0 2

P 4 0 4 2 4

P 2 2 9 8 5

P 2 1 1 4 6

P 2 9 5 9 7

P 1 8 6 1 2

P 2 9 3 1 7

P 2 4 5 7 1

Q 0 2 2 1 6

P 3 4 1 5 2

Q 0 5 5 1 3 P 3 4 9 4 7

P 3 8 9 7 0

P 2 1 7 0 9

P 2 8 6 9 3

Q 0 2 7 6 3

P 2 0 8 0 6

P 3 4 8 9 2

P 0 5 6 2 2

P 0 8 4 1 4

P 3 5 5 9 0

Q 0 6 8 0 6

P 2 6 6 1 9

Q 0 6 8 0 5

P 0 9 6 1 9

P 1 6 1 4 4

P 1 8 7 6 0P 4 6 3 7 8

P 1 1 2 7 6

P 1 3 3 6 8

Q 0 9 0 1 4P 2 9 3 6 6

P 1 8 1 6 8 P 3 4 0 2 4

P 4 5 7 2 3

P 0 9 3 3 3

P 0 6 5 4 6

P 4 0 9 9 6

Q 0 1 4 8 1

P 2 7 4 4 6

Q 0 9 7 4 6

Q 0 1 2 2 5P 3 5 8 2 7

P 2 7 4 0 0

P 0 8 6 3 0P 4 6 3 6 0

P 1 5 4 9 8

P 4 8 0 2 5

P 0 0 5 1 9

P 0 8 6 7 4

P 0 0 7 1 9

P 0 8 6 7 3

P 0 0 7 1 7

P 1 2 9 8 0P 2 7 7 9 2

P 0 7 3 0 7

P 2 4 7 2 1

P 1 3 8 1 4

P 2 0 4 8 9P 1 2 3 1 9

P 0 7 8 9 7

P 1 1 4 4 4

P 2 1 2 8 8

P 0 2 8 9 3

P 1 6 1 1 2

P 3 4 9 2 7

P 0 2 8 7 4Q 0 7 2 5 2P 2 7 1 1 3

P 3 2 7 3 9

P 0 8 3 1 8

P 0 8 6 9 1

P 3 9 4 1 4

P 4 6 5 5 6

P 0 2 6 9 4

P 1 1 7 9 8

P 0 0 2 6 2

P 2 6 8 1 8P 3 4 9 2 5

P 2 0 6 1 3Q 0 6 0 6 0

Q 0 8 9 4 2

P 0 6 2 4 4

P 0 7 9 3 1

P 1 6 3 8 5

P 3 2 3 6 1

P 0 3 2 1 9Q 0 0 7 0 1

P 4 2 9 0 6

P 2 7 5 1 4P 4 2 9 8 3

P 3 5 9 0 0

P 1 5 8 0 0

P 2 5 9 5 2

P 1 5 7 1 0

P 0 2 5 3 8

P 1 2 8 3 9

P 3 7 2 7 4

P 0 3 1 7 2

P 2 9 3 5 2

P 3 7 0 0 5 P 3 5 2 3 3

P 2 8 8 2 7

P 3 4 4 4 2

P 0 5 9 9 0

P 1 7 4 7 4

P 2 3 4 6 9

P 3 3 8 1 1

P 3 2 7 9 0

P 4 9 4 4 6

P 3 5 8 2 1P 1 8 0 3 1

P 4 9 4 4 5

P 2 6 0 4 0P 1 9 9 2 4

P 1 7 7 0 6

P 1 0 5 8 6

Q 0 5 9 0 9

P 1 8 0 5 2

P 2 3 4 7 0

P 2 3 4 6 7

P 4 8 6 1 1

P 4 2 8 6 6

P 2 9 3 6 7

P 1 0 0 3 9

P 2 4 8 2 1

P 0 4 9 3 7

P 4 9 6 8 4P 3 0 9 8 9

P 2 0 6 0 7

Q 0 6 8 0 7

Q 0 3 6 9 6

Q 0 9 7 3 4

P 2 5 1 8 4

P 3 3 0 0 5

P 2 3 3 5 2

P 3 8 0 4 7

P 3 5 3 3 1

P 2 1 7 9 5

P 0 2 6 7 9 P 0 2 6 8 0

P 1 2 7 9 9

P 0 8 0 2 5

P 0 4 5 0 1

P 4 6 1 1 0

P 3 6 8 4 4

Q 0 8 2 6 4

P 0 8 1 0 3

P 4 3 4 0 4

P 4 2 6 8 4

P 9 8 0 8 3P 4 6 1 0 8

P 1 6 9 1 1

P 2 4 1 3 3

P 0 0 5 3 0

P 2 7 8 7 0

P 3 5 9 9 1

P 2 0 9 9 9

P 1 5 7 9 0P 0 8 6 3 1

P 4 0 2 3 0

P 1 3 1 8 7

P 3 7 1 7 3

P 1 3 4 0 6

P 4 3 4 0 3

P 2 3 2 9 2

P 3 9 9 6 2

P 0 4 2 7 8

P 3 1 1 3 5

P 1 5 2 1 6

P 3 6 1 3 5

P 1 6 2 3 0Q 0 1 4 0 6

P 4 2 6 9 0P 1 5 0 5 4

P 2 3 3 2 7

P 4 5 5 3 9

P 4 2 6 8 1

P 9 8 0 6 4

Q 0 2 2 0 7

P 0 8 1 5 9

Q 1 0 0 5 9

P 4 2 2 2 8P 4 2 8 2 5

P 1 7 6 9 2

P 2 7 0 5 8P 1 6 7 3 2 P 0 6 5 4 7

P 4 8 2 0 7

P 4 8 2 0 8P 2 6 4 9 5

P 3 5 2 3 2

Q 0 9 8 0 3

P 4 3 6 4 4

P 3 5 0 8 5

P 3 5 1 9 1

Q 0 5 5 2 8

P 3 1 6 1 7

P 0 6 8 4 7

P 3 8 0 8 5P 4 0 5 6 4 Q 0 0 3 3 8

P 0 5 1 7 4

P 4 4 9 5 5

P 2 7 0 0 0

P 4 5 0 5 9

P 4 5 1 6 1

P 2 8 8 6 8P 3 8 5 1 3

P 4 9 0 8 3

P 1 6 9 5 4P 0 9 2 3 1

P 2 9 1 2 0

P 4 2 0 8 7

P 0 5 4 0 7P 2 5 3 5 3P 4 2 7 3 1

P 0 3 8 0 3

Q 0 9 1 7 5 P 0 0 0 0 8

P 4 0 9 0 2

P 4 9 0 8 2

Q 0 6 3 1 7

P 3 0 5 3 8

P 1 0 3 4 2

P 1 3 1 3 4

Q 0 8 0 4 7

P 2 5 3 0 6

Q 0 7 0 0 9P 0 7 2 8 4

P 0 8 0 4 9

P 0 8 1 4 4

P 3 8 5 3 6P 1 0 5 2 9

P 1 9 2 6 9

P 1 6 2 4 6

P 2 8 9 6 9

P 3 8 7 0 5

P 4 0 7 6 3

Q 0 2 5 7 2

Q 0 9 9 2 8

P 2 3 6 2 2P 2 1 9 4 4

P 2 7 2 5 9P 2 5 6 8 5

P 1 4 9 8 0

P 2 8 8 2 1 P 2 8 3 0 5

P 2 0 9 6 5

P 1 3 5 8 3

P 2 2 9 9 7

P 3 6 0 5 1

P 3 1 9 0 9

P 1 1 9 0 8

P 3 3 0 4 6

Q 0 6 5 4 7

P 0 9 9 4 3

P 3 9 1 3 7

P 0 6 3 8 5P 0 7 1 3 3

P 0 6 3 8 7

P 3 6 3 1 7

P 3 6 2 1 2

P 1 7 0 7 9

P 4 9 5 5 7

P 4 1 1 2 8

P 3 4 7 5 0

P 3 1 4 3 4

P 4 7 1 5 4

P 4 2 7 8 9

P 4 1 1 0 9

P 1 9 3 3 2

P 4 1 9 3 1

P 3 3 6 4 1

P 3 2 4 8 5

P 3 4 3 0 5

Q 0 9 6 8 7

P 1 6 5 1 9

P 3 3 1 1 7

P 3 0 6 7 9

P 2 4 2 2 8P 0 9 5 7 0

P 4 6 1 5 3 P 4 2 8 9 1

P 0 6 1 3 4P 2 3 7 7 3

P 3 7 0 6 2

P 3 9 4 7 3

P 3 8 5 0 8

P 3 4 2 8 6 P 3 8 9 3 8

P 4 2 4 3 4

P 4 5 7 5 7 P 0 7 1 4 2 P 2 6 5 1 1

P 4 4 8 0 1P 0 7 1 4 3P 4 5 7 7 8P 1 8 8 7 1 P 3 0 1 8 3

P 0 6 3 8 0

P 3 9 0 9 7P 1 9 3 2 8

P 1 4 0 1 0

P 0 8 9 1 3

P 1 0 7 5 8

P 2 2 7 3 1

P 0 1 0 2 6

P 1 0 8 5 6

P 4 5 3 6 4

P 4 3 6 2 9

P 3 9 9 9 7P 3 9 8 4 4

P 1 1 9 4 0P 0 8 0 1 8

P 2 0 5 0 6

P 0 4 9 6 8

P 1 2 6 8 0P 0 8 5 3 9

P 4 0 5 9 1

P 4 1 3 9 4P 1 3 6 2 7P 3 1 7 8 0P 1 5 2 0 6Q 0 3 1 8 8P 0 7 8 2 6

P 4 5 8 5 6P 4 5 6 0 9P 4 5 8 8 9P 4 8 5 1 1 P 0 6 6 8 4

P 2 3 3 7 2 P 1 2 6 2 7 P 1 5 3 3 3 P 2 9 0 3 7 P 4 3 1 5 7 P 2 5 7 8 1 P 1 6 3 8 1P 4 4 9 7 1P 3 6 9 5 8

P 1 3 5 0 7P 1 1 4 1 0P 0 0 7 8 0

P 3 7 9 8 6

P 0 7 2 9 6 Q 0 6 9 8 7

P 2 6 7 1 9

P 4 5 1 0 0

P 1 2 0 2 3

P 0 3 2 8 4

P 3 3 9 8 5 P 2 7 9 2 8 P 2 8 7 7 2 Q 0 0 5 1 1

P 0 7 2 4 4

P 4 3 9 0 9

P 3 8 5 3 0

P 0 4 3 9 1

P 2 1 2 0 3Q 0 5 8 6 6

P 3 8 5 3 2 P 3 2 6 7 5

Q 0 0 6 1 3P 2 7 5 4 2

P 1 4 5 6 0 P 4 0 8 7 9P 1 3 0 1 5

P 4 6 7 4 0P 3 3 9 8 4

P 4 5 3 8 0

P 2 7 1 2 0 P 4 9 1 5 5 P 1 3 0 4 5 Q 0 6 5 6 2 P 3 8 9 7 1 P 2 7 6 9 3 P 4 6 0 7 3P 4 8 8 4 8P 4 0 8 5 1

P 0 7 8 9 6Q 0 8 4 2 6Q 0 1 8 3 3P 0 1 0 2 3P 1 4 2 0 6 P 1 4 0 4 6

P 1 4 2 5 1

P 1 5 3 3 2P 3 3 8 6 4P 1 6 8 5 5 Q 0 9 0 5 6P 2 1 0 4 9

P 4 9 3 6 9P 1 5 0 8 2P 4 9 6 4 2 P 1 6 7 0 0

P 3 3 8 6 5

P 0 0 6 0 6P 3 7 4 4 7

P 4 5 0 1 8

P 3 3 2 8 2

P 2 2 1 4 6 P 3 7 4 4 6

P 2 8 8 2 5

P 3 3 0 0 7 P 2 0 6 1 8

P 4 3 6 2 6

P 4 9 4 5 4

P 3 0 2 7 6

P 0 8 6 6 4

P 4 1 7 7 2

P 3 0 3 4 1

P 3 0 5 4 5P 1 7 3 4 9

P 3 4 1 7 9

P 1 5 8 2 3

P 3 7 0 6 1

P 0 8 5 6 8 P 4 1 1 5 1P 1 8 1 8 6 P 0 6 9 6 0

Q 0 8 2 7 6 P 4 1 1 5 2P 0 0 4 8 0

P 4 7 2 0 0 P 4 7 4 2 4P 1 1 0 6 6

P 3 8 0 8 6P 4 2 0 7 2

P 2 1 8 7 9

P 4 6 6 8 1

P 3 1 0 0 2

P 4 5 0 8 5

P 0 9 7 9 3

P 3 0 8 5 1

P 3 9 5 6 7

P 0 4 0 9 0

P 4 7 9 3 2

P 1 9 8 3 8

Q 0 0 6 5 3

Q 0 4 8 6 1

P 3 2 6 7 2

P 3 2 1 5 4

P 2 6 3 8 1P 3 8 6 9 7

P 0 4 8 0 8

P 2 3 7 3 9

P 0 0 6 8 9

P 0 7 7 6 8

P 2 6 3 8 2

P 3 8 6 2 8

P 4 4 9 0 0

P 2 9 0 4 1P 0 7 0 7 5

P 4 3 7 5 3P 3 9 4 0 9

P 0 9 0 5 7

P 3 9 4 7 4

P 2 7 1 2 1

P 4 8 8 2 8P 2 0 7 2 4

P 1 7 5 4 2

P 3 7 3 8 8

P 1 9 6 6 9

P 0 3 0 7 2

P 4 4 9 4 6

P 4 3 6 9 3

P 2 2 0 9 1 P 4 4 3 8 7

P 4 0 4 2 9

P 0 5 7 4 8

P 1 5 7 7 2

P 0 8 0 5 5

Q 0 8 3 6 9 P 4 3 6 9 4

P 4 8 9 8 3

P 1 5 9 7 6

P 1 8 6 4 0

P 3 7 0 2 0

P 0 8 0 8 1

P 2 6 2 3 1

Figure 5.9. The Protein network using only non-circular patterns




P 2 0 4 5 9

Q 0 9 7 1 5P 4 7 0 2 5

P 1 4 7 8 7

P 0 4 2 8 1

P 1 7 0 1 0

P 1 7 9 7 0

P 4 0 1 1 8P 2 2 0 0 1

P 2 5 1 2 2

P 8 0 1 5 1

Q 0 5 0 3 7

P 1 4 6 6 6

P 2 9 0 7 2

P 1 5 5 6 4

P 4 8 9 1 5

P 0 6 2 6 2 P 3 1 9 7 1

P 3 4 8 5 4

P 2 6 8 4 6

P 2 4 8 8 0

P 1 3 9 5 9

P 2 9 3 9 3

P 0 7 9 4 9

P 1 6 6 4 9

P 2 5 3 8 7

P 4 1 8 1 0

P 4 3 0 8 8

P 1 0 4 7 5

P 0 7 9 8 7P 0 4 9 5 6

P 1 4 0 9 0

P 2 3 6 6 5

P 1 9 4 2 4

P 2 5 4 7 2

P 1 6 2 1 6

P 3 9 5 2 4

P 2 2 1 8 9

P 0 9 9 7 5

P 4 8 4 3 2

Q 0 9 8 9 1

P 9 8 0 5 6

P 2 5 9 8 0

P 1 0 9 5 5

P 0 5 0 3 0

P 2 3 6 3 4

P 0 6 8 4 7P 0 5 1 7 4

P 2 8 5 7 0

P 2 3 9 7 7

P 3 1 6 5 2

P 1 3 5 4 7

P 4 0 5 2 7

P 1 3 5 8 6

P 2 2 7 0 0

P 3 1 6 6 2

P 0 8 0 1 8

P 2 4 2 2 8

P 4 4 7 2 1

P 2 6 9 2 7

P 4 2 1 8 7

P 0 9 3 2 3P 4 0 1 9 3

P 1 6 9 5 4

P 1 1 0 4 7

P 2 1 2 8 8

P 1 3 4 7 3P 0 8 6 7 3P 2 4 7 2 1

P 3 6 4 9 9P 0 8 6 7 4

P 0 0 7 1 7

P 3 7 0 8 9P 1 9 3 3 2P 4 1 5 8 6

P 0 7 2 0 4

P 3 7 8 8 9P 3 3 7 5 1

P 0 0 7 4 2P 3 7 0 9 1P 0 8 1 4 4

Q 1 0 0 5 9P 2 1 9 7 7

P 0 7 8 6 1

P 3 8 9 3 9

P 0 6 5 4 7P 2 6 5 0 3

P 1 4 5 4 3

P 9 8 0 9 5

P 2 7 6 7 6

P 4 2 1 9 9

P 4 1 1 0 9

P 1 0 4 9 3

P 4 6 5 1 9

Q 0 1 2 7 9

P 2 1 9 4 1

P 3 9 1 3 7 P 1 2 2 1 9

P 3 1 5 6 2 P 3 2 2 1 5

P 3 4 4 1 0

Q 0 0 7 6 1P 2 0 7 1 7Q 0 7 0 0 9

P 3 3 0 4 6

P 1 7 6 9 2

P 0 8 0 4 9

Q 0 1 6 4 7

Q 0 6 9 2 7P 2 0 0 0 0

P 0 0 0 0 8P 4 7 7 2 8

P 1 0 5 2 9P 3 8 5 3 6

P 4 5 0 4 5 P 1 3 2 2 6

P 3 9 7 4 8

P 4 0 8 2 5

P 4 1 3 4 3

P 0 0 4 5 5

P 2 5 6 2 1

P 3 9 8 7 5 P 3 6 5 9 2

P 0 0 9 5 6

P 4 6 3 2 9

P 2 8 0 3 7

P 2 5 5 0 2

P 1 3 0 3 6P 2 1 8 9 0

P 1 0 9 3 3

Q 0 0 4 2 0Q 0 6 5 4 7

P 4 5 0 4 8

P 4 2 0 8 7

P 4 0 9 0 2 P 1 2 2 1 7 Q 0 8 6 4 2 P 3 3 6 4 1

P 1 1 6 5 7

Q 0 8 7 8 8P 0 3 0 0 4

P 1 1 4 5 4P 3 4 0 2 8

P 3 5 2 3 6

P 3 0 3 1 8

P 3 3 7 6 8P 1 1 7 0 5

P 4 6 3 1 7

P 2 6 0 4 6

P 3 6 9 5 9P 2 2 5 9 1P 3 2 1 9 8

P 0 9 4 9 8

P 1 8 6 2 6

P 4 6 5 3 7P 1 7 7 7 9 P 3 1 6 2 3

P 1 2 8 9 4

Q 0 4 5 7 5

P 3 1 0 0 2

P 2 1 8 7 9

P 1 7 9 6 5

P 3 1 3 0 1

P 2 5 9 9 5P 0 7 2 5 9

P 0 2 6 6 2

Q 0 5 2 3 7

P 0 7 1 2 4

P 4 0 7 5 7

P 4 6 5 3 8

P 2 2 7 3 1

P 4 3 6 2 9

P 2 3 1 3 2

Q 0 2 4 7 3

P 1 8 2 9 2

P 1 0 2 5 3P 4 6 6 0 5

Q 0 3 0 3 0 P 1 4 1 5 1

P 0 5 3 4 2

P 9 8 1 3 1P 4 8 6 0 1

P 9 8 0 7 3

P 3 5 0 3 7

Q 0 9 6 9 0

P 4 8 8 2 5

P 2 1 8 3 8

P 1 3 0 6 8

P 3 5 9 5 6

Q 0 8 2 8 9

P 2 3 7 5 9P 1 6 0 2 5

P 4 2 4 8 6

P 2 0 5 0 4

P 4 4 4 2 3

P 1 3 1 2 1

P 4 8 1 2 0

P 3 7 8 7 1

P 2 2 9 9 7 P 4 9 4 5 2 P 1 1 7 1 1

P 2 8 3 0 5 P 4 2 7 3 1

P 3 1 4 7 5P 1 5 2 0 6

P 2 8 8 2 1 Q 1 0 0 5 7 Q 0 1 4 1 5 P 1 6 3 0 4

P 1 1 9 4 0

P 2 0 8 5 2 P 4 6 3 5 8

P 2 0 9 6 5

P 4 9 4 5 4 P 8 0 0 5 9

P 1 3 0 4 5P 3 6 2 1 3P 4 3 9 0 9P 0 0 5 7 1P 4 5 0 7 7P 4 0 8 0 1P 1 0 6 1 2P 3 0 2 7 6

P 3 4 8 9 4Q 0 8 4 8 0P 3 2 2 5 7

P 3 0 4 3 8

P 2 1 2 0 3

P 0 0 4 1 5Q 0 0 7 6 3

Q 0 6 5 6 2P 3 6 6 1 9

P 2 6 2 5 7 P 2 1 9 9 9

P 1 7 9 2 3

P 4 3 3 3 4

P 2 3 8 1 5

Q 0 2 8 6 9P 2 8 2 7 2

P 4 6 9 7 3

P 2 7 7 8 3

P 0 3 4 2 5

P 3 6 8 3 7

P 2 5 4 6 8P 0 4 1 7 6P 0 8 6 5 1

P 1 5 6 4 5

P 4 1 4 4 1 P 0 6 1 3 4

P 3 6 8 3 6

P 1 8 8 7 1

P 4 6 0 5 9

P 1 5 8 2 3

P 2 8 5 9 4

P 1 2 8 8 0

P 2 2 9 4 5P 0 0 8 6 4

P 9 8 1 3 6 P 4 8 7 4 7 P 0 3 9 5 8

P 3 1 7 4 3

P 0 6 6 6 5 P 0 8 9 1 3

P 1 3 7 9 4 P 3 7 7 2 6

P 4 2 8 9 1P 3 1 7 0 5

P 4 7 8 4 6P 2 0 7 2 4P 2 6 6 8 3P 0 9 0 5 7P 0 5 7 9 4

P 4 3 4 7 1

Q 0 7 8 0 1

P 0 9 9 5 0

P 1 6 7 3 2P 3 4 9 4 9

P 1 4 6 3 9P 4 3 6 5 2P 0 0 9 4 6P 1 8 1 8 4P 2 7 1 2 1P 2 7 1 2 0

P 1 6 1 2 6

Q 0 5 7 6 3

P 0 7 8 6 0 P 2 9 9 5 1 P 1 6 3 9 7 Q 0 3 1 5 6

Q 0 5 1 1 3P 3 4 2 0 3Q 0 4 6 1 9P 4 6 9 7 6

P 3 3 9 0 9P 4 9 6 4 2

P 1 3 2 8 0

P 0 2 8 2 8

P 4 6 2 0 8P 3 3 9 0 5P 1 4 6 2 5

P 4 0 3 9 1

P 1 9 2 6 9

P 4 6 4 5 5

P 3 1 0 5 3

P 2 7 4 2 2

P 0 9 7 9 3

P 3 7 0 7 5

P 4 7 3 1 5

Q 0 9 6 8 7

P 3 2 1 3 8

P 4 8 8 4 1 P 1 2 6 8 4

P 1 6 3 9 3

P 0 4 0 3 5 P 3 2 4 4 0

P 4 7 3 5 9

P 3 9 9 9 7 P 2 6 3 7 9

P 1 7 5 9 9

Q 0 2 3 9 4

P 4 3 9 2 2

P 1 0 7 2 3

P 2 3 7 3 9

P 0 0 6 8 9

Q 0 2 0 7 8

P 4 2 0 7 2

P 4 4 9 4 6 P 0 0 5 3 3

P 4 8 3 7 2

P 4 6 1 5 3

P 1 2 2 5 6

P 4 2 2 0 9

P 1 5 9 7 6

P 1 4 0 1 0

P 4 9 3 4 0 Q 0 9 1 7 3 P 0 9 7 4 5 P 3 5 2 2 1 P 4 4 8 6 2 P 2 4 3 5 8

P 4 3 6 9 3

P 2 8 6 6 1

P 1 3 2 1 3 P 2 6 2 3 1

P 4 0 5 9 9 P 4 3 8 5 1 P 1 7 5 0 2P 4 9 6 9 7

P 3 3 7 2 5P 3 1 8 1 3

P 2 6 1 7 7

P 0 6 1 0 7P 0 0 4 6 1P 4 2 0 4 2P 1 5 5 6 7 P 1 6 3 1 6P 4 2 0 3 4 P 3 3 0 6 8

P 3 5 4 9 4 Q 0 2 9 3 4

P 4 5 4 3 8

P 0 2 9 3 6

Q 0 6 4 5 8

P 4 9 5 9 1P 0 8 2 0 1P 4 3 4 3 3

P 3 5 6 3 3

P 3 6 6 0 9

Q 0 0 4 9 5

P 0 7 5 7 2P 3 6 6 0 8

P 2 1 4 5 7P 4 5 8 8 9

P 2 5 7 8 1

P 4 6 0 6 5P 1 6 7 0 0

P 1 2 0 4 0 P 2 2 1 0 2

P 2 1 8 7 2P 0 7 2 4 4

P 0 0 9 6 7

P 4 6 6 9 1 P 4 4 5 2 0 P 2 3 5 9 6 P 0 8 0 1 7 Q 0 2 4 3 1 P 4 9 4 6 5 Q 0 3 5 8 6 Q 0 0 0 1 3P 3 1 0 0 7P 2 4 1 9 3

P 4 3 6 4 4

P 4 8 2 0 8

P 4 3 6 3 4

P 4 2 9 4 3

P 0 3 1 0 5

P 4 8 5 6 1

P 2 5 3 0 6

P 3 8 5 1 3

P 2 1 3 7 6

P 0 6 2 8 0

P 3 0 8 7 8

P 2 0 0 2 8Q 0 2 5 8 1

P 3 3 8 1 1

P 2 9 4 7 5

P 1 5 6 8 4

P 2 0 3 9 6

Q 0 4 9 6 0 P 4 2 8 2 5

P 4 3 7 3 5

P 3 5 0 8 5

P 2 5 6 8 5

P 4 0 5 6 4

P 1 9 8 3 8

P 0 1 3 4 7

P 0 4 0 9 0

Q 0 4 8 6 1

P 2 3 4 2 6

P 2 7 9 7 3

P 1 4 6 7 7

Q 0 4 6 9 5 P 1 2 8 3 9

P 2 2 1 3 8

P 1 7 4 7 4

P 3 7 8 9 7P 1 5 5 4 1

P 2 1 9 4 4

P 3 1 6 1 7

P 0 2 9 5 9

P 2 6 7 6 2

Q 0 0 3 3 8P 1 0 0 4 7

P 1 6 4 9 7Q 0 6 8 3 1

P 3 7 8 9 4

P 3 8 0 8 5P 0 7 6 6 6

P 1 1 4 4 4

P 3 1 2 5 1

Q 0 2 0 5 3

P 2 2 3 1 4

P 4 9 3 0 7

P 2 0 9 7 3

P 3 1 6 4 1P 0 7 8 9 7P 3 1 6 4 5

P 2 2 5 1 5

P 2 8 5 7 3

Q 0 8 4 3 5

P 4 1 8 1 1

Q 0 5 0 6 6

Q 0 1 9 5 9P 1 3 5 8 7

P 4 8 0 2 9

P 2 8 5 7 1

P 3 6 1 0 1

P 4 1 2 2 6

P 1 4 0 0 2

P 3 3 3 6 3

Q 0 9 7 6 5

P 3 8 8 2 0

P 4 4 0 2 0P 2 2 5 0 6

P 4 0 4 0 6

P 4 5 2 1 9

P 2 5 7 8 5

P 4 9 3 8 0

P 4 0 6 4 5

P 1 0 8 9 6

P 4 3 3 4 5

P 3 1 3 1 4

P 3 4 6 9 4

P 3 1 8 9 9

Q 0 0 4 6 6

P 1 8 2 6 4

P 0 9 0 2 6

P 2 1 9 5 2

P 4 8 0 3 2P 1 9 4 5 0

P 4 9 5 1 8

Q 0 7 8 6 6

P 2 8 9 7 0P 4 8 2 7 9

P 0 8 7 3 9

P 3 4 5 4 0

P 4 5 9 6 2

P 2 2 3 1 7

P 2 8 0 2 5

P 2 8 3 6 6

Q 0 6 4 6 1

P 4 9 6 4 9P 4 7 8 4 7

P 2 3 4 5 9

P 3 7 8 0 6

P 3 7 9 3 5

P 3 5 0 5 6

P 2 7 6 0 9

P 4 2 4 6 0P 3 1 7 7 7

P 4 8 2 7 5

P 3 7 2 8 5

P 3 7 6 5 0

P 1 4 9 2 2

P 1 5 7 0 5

P 3 1 9 4 8

P 4 4 4 1 0

Q 0 1 1 4 9

P 0 8 0 2 5

P 2 1 7 9 5

Q 0 3 6 9 6

P 3 1 8 9 4

P 2 7 0 3 4

P 0 5 2 2 2

P 2 0 0 2 1

P 0 9 8 0 3

P 3 1 3 6 7

P 2 0 2 6 5

P 2 0 2 6 3

P 0 5 8 2 4

P 0 9 6 2 9

P 1 6 1 4 3

P 3 3 2 1 6

P 2 4 3 4 2

P 2 3 4 6 3

P 1 0 6 2 8

P 4 3 1 2 0

Q 0 8 7 2 7

P 4 9 6 4 0

Q 0 7 9 7 0

P 2 0 6 9 3

Q 0 6 0 6 0

Q 0 2 2 1 6

P 1 9 5 2 5

P 3 8 0 4 7

P 2 5 1 8 4

P 3 4 2 1 6

P 3 3 0 0 5

P 2 3 3 5 2P 2 4 8 2 1

Q 0 9 7 3 4

Q 0 6 8 0 7

P 3 5 3 3 1

P 0 0 8 9 3

P 4 5 8 5 3

P 0 4 1 8 5

P 1 8 1 6 8

P 2 9 3 6 7

P 1 8 7 6 0

Q 0 1 7 0 5

P 1 6 1 4 4

P 0 3 4 3 4

P 4 2 5 8 7

P 4 8 2 4 1

P 2 3 8 1 2P 4 0 5 9 2

P 4 0 7 6 4

P 3 7 2 7 5

P 0 7 5 4 8

P 0 9 6 3 2

P 4 6 6 0 4

P 0 2 8 3 6P 0 2 8 3 2

P 1 4 8 5 8

P 2 0 7 1 9

P 0 9 0 1 5

Q 0 5 4 6 6

P 2 0 2 6 7

Q 0 1 8 6 0

P 0 9 0 7 9

P 1 4 6 5 2P 3 1 2 4 9

P 3 1 3 6 2

Q 0 4 9 9 6

Q 0 3 9 7 4

P 4 9 3 3 5

Q 0 1 6 3 0

P 2 9 8 2 5

P 4 3 6 9 8Q 0 1 2 2 6

P 1 7 9 1 9

P 2 8 7 3 9

P 4 6 8 7 0

P 4 6 8 7 1

P 4 6 8 7 2

P 1 3 6 1 5

P 2 9 5 9 8

P 3 8 7 4 8

P 1 5 2 1 6

P 1 5 3 1 0

P 3 2 3 8 0

Q 0 3 3 9 6P 1 7 7 7 1

P 4 9 1 4 0

P 1 3 3 6 8

P 1 2 7 5 7P 2 4 7 1 0

P 4 8 9 9 8Q 0 2 9 2 6

P 3 5 4 1 6

P 3 5 4 1 8

P 1 0 5 6 9

P 3 5 4 1 7

P 3 8 0 4 1

Q 0 8 0 9 1

P 0 7 1 9 9

P 4 2 3 3 1

P 1 1 4 4 9

P 1 3 5 9 5

P 1 3 5 9 2

P 4 8 9 8 8

P 1 3 5 9 0P 3 5 2 4 8P 0 4 5 0 1

P 4 6 1 1 0

P 0 2 6 8 0P 0 2 6 7 9

P 3 6 8 4 4

P 3 5 7 4 8P 0 8 7 9 9

P 1 6 4 7 7

P 1 9 5 2 4

P 1 4 1 0 5

P 3 2 4 9 2

P 3 5 4 1 5

P 1 9 7 0 6

P 4 6 7 3 5

P 3 6 0 0 6

P 4 2 5 2 2

Q 0 2 4 4 0

P 3 5 5 7 9

P 0 5 6 6 1

P 1 2 8 4 5

Q 9 9 3 2 3

P 3 4 0 9 2

P 4 0 3 1 7

P 3 5 2 4 7

P 1 8 5 4 6

P 0 6 6 8 1

P 1 5 9 2 5

P 1 0 6 4 3P 2 0 9 0 8

P 2 1 1 8 0

P 0 1 0 2 9

P 0 7 8 7 6

P 1 4 2 0 5

P 0 2 7 0 3

P 1 3 5 9 4

P 0 1 1 9 3

P 2 9 4 0 0

P 1 3 5 9 6

P 1 0 0 3 9

P 2 5 8 9 2

P 1 1 2 7 6

P 1 2 0 8 0

P 1 6 6 7 1Q 0 3 3 6 4

P 3 8 3 6 1

P 3 7 1 7 3

P 1 6 3 8 5

P 3 8 4 3 8

P 2 8 6 9 3

P 2 9 3 1 7

Q 0 9 4 3 5

P 1 3 1 8 5

P 4 0 4 2 4

P 2 3 4 3 5

P 4 5 8 9 4

P 0 7 3 1 3

P 4 9 1 3 7

P 2 8 5 4 8

P 4 8 7 4 9

P 0 6 2 4 4

P 3 8 4 3 2

P 4 7 8 1 1

P 4 7 8 1 2

P 3 4 8 9 1P 4 2 2 8 2

P 2 5 3 2 1

P 4 5 9 8 4

Q 0 3 0 4 3

P 3 2 3 2 8

P 4 8 8 1 0

P 2 0 8 0 6

P 3 5 5 9 0

Q 0 6 8 0 6

P 4 6 5 3 0

P 2 5 6 9 3

P 1 0 6 6 5

P 2 2 9 8 5

P 2 1 7 0 9

P 3 3 4 0 2

Q 0 9 0 2 2

P 4 9 1 8 5 P 1 2 5 7 5

P 3 4 9 6 9

P 4 2 5 6 6

Q 0 6 8 5 1 P 0 1 0 0 8

P 4 1 2 3 1

P 1 0 6 0 8P 4 2 2 9 0

P 4 9 6 5 0P 0 7 7 0 0

P 3 2 2 5 0

P 2 1 1 7 8

P 0 2 4 5 8

P 3 3 4 8 4

P 1 4 9 9 6

P 2 0 7 0 1

P 1 5 9 8 9P 2 7 6 5 8

P 2 7 4 7 9

P 2 7 8 1 8

P 0 5 9 9 7

P 2 8 4 8 1

P 2 6 6 4 5

P 2 1 7 5 7

P 2 1 7 5 8

P 4 2 8 9 0

P 3 5 2 4 6

P 3 4 3 4 0

P 0 4 9 3 7

P 2 5 5 2 4P 1 4 2 8 2

P 0 5 5 5 5

P 2 1 8 5 0P 0 7 0 0 4P 3 4 5 7 6

P 1 5 1 7 2P 2 3 9 9 9P 1 4 0 4 3

P 3 4 4 2 9P 1 8 5 4 0 P 2 3 9 8 8

P 1 3 9 0 3

P 2 1 3 4 7

Q 0 1 8 3 6

P 4 5 0 7 5

P 0 0 5 3 8

P 4 2 5 3 0

P 3 2 2 9 6

P 2 0 0 6 7

P 4 1 1 3 4P 3 2 7 7 6

P 2 4 5 8 8

P 4 7 6 7 2

P 3 5 2 0 8

P 1 0 0 8 5

P 4 7 9 2 8

P 1 0 0 3 0

P 1 1 1 6 1

Q 0 0 8 9 9P 0 3 0 0 1P 1 4 5 4 7

P 0 6 5 6 6 P 2 2 4 6 2P 4 6 1 9 8

P 0 8 3 6 5

P 3 3 2 9 9

P 4 6 4 7 1 Q 0 1 2 0 7Q 0 1 8 4 2

P 2 0 7 2 2P 4 5 9 7 3

Q 0 1 3 6 5P 3 7 6 9 3 P 0 6 5 6 5 Q 0 8 3 4 1

P 0 8 5 1 0

Q 0 5 1 5 9P 0 4 9 5 5 P 2 2 4 6 0P 2 9 3 8 7

P 2 8 1 5 9

P 1 0 3 9 4P 1 9 5 5 9

P 1 3 0 2 5

P 1 7 2 4 7

P 4 6 0 7 0P 2 5 2 4 7P 2 9 1 4 9

P 2 9 9 9 0

P 2 6 8 0 6

P 2 6 3 7 4

P 4 7 1 6 4P 3 3 7 4 9P 1 5 2 6 9

P 3 4 8 2 0

P 0 2 7 2 4

Q 0 1 9 8 1P 0 1 1 4 3 P 4 7 8 7 2

P 9 8 0 9 2P 2 3 0 2 5

P 0 1 1 4 2

P 2 3 8 1 1

P 0 1 2 8 3

P 3 5 4 9 9 P 3 2 5 6 2P 3 8 5 5 2

P 0 3 3 1 9

P 1 8 9 1 7

P 1 5 3 9 0

P 0 6 4 7 6

P 3 6 1 3 0

P 2 9 7 1 9P 0 8 1 5 1

P 3 9 8 0 6

P 3 3 7 4 8

P 1 7 0 9 7

P 4 8 7 5 6

Q 0 1 0 1 4

P 2 9 5 9 0P 1 8 4 3 1

P 2 5 3 8 9

Q 0 3 3 5 1

P 1 0 1 8 0

P 4 6 3 2 0

P 2 1 0 0 1

P 1 0 1 8 1

P 4 5 5 7 7

P 3 6 1 3 5

P 0 9 9 8 9

P 3 2 3 6 1

P 0 6 7 8 2

Q 0 2 0 1 1

P 2 4 0 6 1

Q 0 6 8 0 5

P 4 1 2 4 1

P 3 1 1 3 5

P 2 4 1 3 3

P 3 1 6 9 5

P 3 3 1 7 6

Q 0 2 7 6 3

P 1 8 5 6 0

P 1 6 1 5 7

P 3 1 3 6 0

P 4 6 8 6 7

P 4 6 8 6 4

P 4 6 6 0 8

P 3 7 9 3 8

P 3 8 8 2 5

P 1 0 2 6 7

P 3 1 2 5 8

P 4 6 6 6 7

P 4 2 5 7 1

P 2 1 0 0 0

P 3 1 3 6 4

P 4 3 6 9 9

P 1 7 2 7 8

P 2 2 0 0 9

P 0 3 2 7 4

P 3 6 1 9 7

P 2 3 2 1 9

P 2 7 7 0 4

P 4 2 0 8 6

P 3 4 1 5 2

P 3 3 5 3 0

P 2 6 7 7 9

P 2 3 3 2 7

P 4 7 9 8 7

P 9 8 0 8 3

P 4 5 7 2 3

P 4 6 3 6 0

P 4 8 0 2 5

P 2 7 8 7 0

P 2 7 4 0 0

P 0 0 5 3 0

P 4 3 4 0 3

P 4 3 4 0 4

P 0 4 2 7 8P 1 6 2 3 0

P 3 5 8 2 7P 2 0 9 9 9

Q 0 1 2 2 5P 1 5 7 9 0

P 1 5 4 9 8

P 0 8 6 3 0 P 4 2 6 8 4

P 2 7 4 4 6

P 1 3 1 3 5

Q 0 6 8 4 6

P 0 8 1 0 3

P 3 4 0 2 4

P 0 0 5 1 9

P 3 2 7 9 0

P 4 5 5 3 9

P 0 5 1 2 6

P 2 4 5 0 3

P 3 0 3 3 6

P 1 3 4 0 6

P 0 8 6 3 1

P 1 8 6 5 3

Q 1 0 0 5 6

P 2 1 8 6 0

P 4 2 6 8 6Q 0 1 4 0 6

P 1 5 0 5 4P 4 2 6 9 0P 4 2 6 8 1

P 2 1 6 9 3

P 3 9 5 4 6

P 4 6 1 0 8

Q 0 4 9 1 2

P 2 0 7 9 3

P 2 9 0 7 4

P 4 3 3 7 8

P 2 8 8 2 8

P 3 5 2 3 3

P 1 0 9 0 9

P 4 3 1 4 2

P 0 7 6 2 1

P 2 1 0 3 9

P 3 5 3 4 3

P 3 5 3 7 7

P 3 0 5 5 7

P 1 3 9 4 5

P 0 6 2 4 5

P 0 9 2 1 5

P 0 9 3 8 6

P 0 7 9 3 1

P 4 7 7 3 5

P 3 2 4 9 0

P 2 3 4 5 8

Q 1 0 0 7 1

P 3 6 3 1 4

P 1 3 2 4 4

Q 0 9 8 9 8

P 4 7 8 0 3

P 3 5 4 0 9

P 3 0 9 3 7

Q 0 2 9 4 2

Q 0 2 1 5 6 P 2 4 5 8 3

P 3 8 9 7 0

P 2 1 1 4 6 P 3 6 0 0 5

P 4 1 2 7 9Q 0 8 9 4 2

Q 0 5 9 9 9

Q 0 0 3 4 2

P 4 9 3 3 9

P 1 1 7 3 0

P 0 6 8 4 5

P 0 7 5 2 4

P 0 7 6 0 2

P 3 2 6 1 2

Q 0 9 4 9 9

P 2 3 4 4 3

P 2 4 5 7 1

P 1 8 2 9 3

P 1 8 2 6 5P 3 4 2 4 4

P 2 6 9 9 3

P 3 5 5 1 3

P 1 5 2 4 2P 0 3 6 4 3

P 0 8 4 1 4

Q 0 5 6 5 5

P 4 0 5 5 0

P 1 7 7 8 9

P 3 4 9 2 5

P 3 5 7 6 1

Q 0 6 5 4 8

Q 0 9 5 3 7

P 0 8 4 1 3

P 2 9 5 9 7

P 3 9 9 6 8

Q 0 5 5 1 3

P 2 3 3 3 9

P 0 3 6 4 1

P 0 6 6 2 5 P 3 9 0 0 0

P 4 7 8 0 9

Q 0 4 8 9 9

P 3 4 8 9 2

P 2 0 6 1 3

P 2 6 8 1 8

P 3 4 9 4 7

P 4 0 2 3 0

Q 0 1 8 8 7

P 3 4 3 6 9

P 1 3 6 7 7

P 2 7 0 3 7

Q 0 6 2 2 6

P 0 9 5 9 9

P 4 8 6 1 5

P 3 6 9 7 8

P 3 3 4 9 7

P 3 2 3 5 8

P 2 3 2 9 8

P 2 4 7 2 3

P 1 3 1 8 6

P 2 6 6 1 9

P 0 2 6 9 4

P 4 3 5 3 5

P 2 3 2 9 2

P 3 9 9 6 2

P 0 5 6 2 2

P 1 3 1 8 7

P 0 9 6 1 9

P 1 8 6 1 2

P 0 0 5 1 6

P 0 4 4 0 9

P 3 6 0 9 5P 3 7 3 0 5

P 0 4 5 8 4

P 1 1 7 9 8

P 2 7 9 6 6

P 0 6 4 8 5 P 1 7 0 5 3 P 2 2 0 7 1 P 2 1 0 9 7 P 3 7 8 9 6 P 1 3 3 7 2 Q 0 9 1 6 3 Q 0 1 4 8 1 P 4 4 4 8 3P 1 3 8 3 7 P 4 5 6 0 9

P 3 4 7 3 2 P 2 9 0 1 6P 4 2 8 8 2

Q 0 5 8 6 6

P 1 8 7 5 9 P 4 3 3 0 4

P 1 4 2 0 6

P 3 2 1 9 1 P 1 4 7 7 6P 1 6 4 2 1P 4 1 6 8 8P 3 1 7 8 3P 1 5 8 1 2

P 2 1 3 7 5P 3 6 0 3 3P 0 4 7 7 6Q 0 9 6 7 1Q 0 7 0 7 5P 4 4 6 2 4P 1 2 6 8 8P 2 0 6 0 8 P 0 4 3 4 7Q 0 9 6 7 0P 1 8 9 6 1

P 3 8 6 0 8 P 2 7 2 5 7 P 4 3 4 5 2 P 4 3 0 1 0Q 0 6 9 0 8

P 1 6 9 1 6 P 0 9 9 7 6 P 0 3 5 3 9P 4 9 5 3 0P 1 6 9 1 7 P 4 5 0 8 5P 3 7 0 0 2 P 4 6 4 9 1 P 3 0 8 5 1P 4 4 0 7 4

P 1 1 0 2 4 P 2 9 1 2 0 P 1 6 5 1 9 P 2 9 9 1 5 P 2 4 9 1 8 P 2 4 5 0 1P 1 1 0 5 2 P 1 4 6 0 5 P 3 0 5 2 8 P 4 7 2 5 2 P 3 3 8 0 3 P 2 0 9 3 8 P 2 5 1 8 9 P 4 6 5 4 8P 1 1 5 8 9P 4 0 7 9 6

P 2 2 6 4 8P 1 8 2 4 4P 2 4 6 5 2 P 4 3 3 6 0 P 4 3 3 6 6

Q 0 0 6 8 9Q 0 6 0 3 1P 0 0 9 5 9P 0 0 9 5 8Q 0 5 6 8 5P 4 1 4 3 9P 1 6 3 2 3

P 1 0 4 6 3P 1 5 5 5 3P 1 2 7 7 7P 2 3 9 6 5P 4 2 1 2 6P 0 9 1 5 2P 4 2 1 7 5P 1 0 4 1 5

P 4 0 9 1 5 Q 0 1 2 6 3 P 2 8 2 7 3 Q 0 5 8 8 0 P 2 2 2 8 4 P 2 2 2 8 5 P 0 8 3 7 3 P 0 8 1 9 9P 2 4 4 2 5 P 4 4 6 0 5

P 0 7 8 8 8 P 0 5 3 5 7 P 0 9 7 8 1 P 0 3 8 2 8 P 3 1 0 6 4 P 1 1 8 3 5

P 4 1 3 9 0P 4 2 5 0 2P 1 7 1 1 5P 4 6 2 3 6P 2 3 1 0 7P 2 6 7 1 9 P 0 4 0 4 6P 1 6 0 2 7 P 2 3 6 5 8 P 0 4 3 9 4P 3 5 6 1 3 P 1 7 7 9 0 P 4 3 3 1 5

P 4 9 5 9 8 P 2 9 3 5 3P 0 8 6 8 9

P 1 0 3 5 4 P 2 7 4 0 5 Q 0 5 8 1 5 P 2 5 2 5 0 Q 0 5 0 9 4 P 4 3 8 8 5P 1 1 6 8 0 P 1 8 2 7 8P 3 0 6 1 3 P 4 1 0 8 3 P 2 2 7 5 9 P 2 6 3 3 9P 3 5 1 1 3

P 0 2 8 0 8P 4 2 2 3 1P 0 9 9 1 6P 1 6 0 4 3P 4 7 9 0 7 P 3 1 3 9 6 P 3 5 3 9 8 P 2 1 3 2 8 P 3 6 4 3 8 P 2 8 5 8 3P 4 1 1 5 0

P 4 9 4 6 6P 1 3 3 8 3 P 2 1 5 7 7

P 1 1 5 1 5P 1 2 0 4 7 P 1 3 0 8 8 P 1 6 8 5 0 Q 0 0 7 0 9P 1 2 0 4 6P 4 1 1 5 8P 2 8 3 2 4P 3 7 0 5 1P 4 6 7 0 1P 2 6 0 1 0

P 3 9 4 0 9 P 0 4 2 7 6 P 0 9 9 5 8 P 2 3 3 7 7 P 2 7 3 0 3 P 4 4 9 2 8 P 0 6 8 3 9 P 2 0 8 2 5 P 1 4 5 5 3P 3 3 0 8 7P 3 7 2 7 4

P 3 3 9 8 5 Q 0 3 3 5 0P 4 0 5 9 1

P 0 5 4 0 7P 2 2 8 8 1 Q 0 5 8 9 5

P 0 9 5 7 0

P 2 9 6 9 6P 1 8 8 6 9P 4 8 5 7 2Q 0 0 4 9 6P 4 6 0 8 1 P 2 0 0 2 4P 3 9 5 9 9 P 0 4 9 5 8

P 3 7 0 6 2 P 4 5 8 5 6 P 3 8 5 3 2

Q 0 8 4 2 6

P 0 6 6 8 4 P 3 4 6 0 1

P 1 6 1 6 7P 3 9 0 6 1Q 0 0 6 1 3P 4 1 1 5 1P 3 7 4 5 5P 1 8 3 1 0P 0 7 8 9 6

P 1 0 2 9 0 P 2 7 8 9 8 P 4 8 8 3 9 Q 0 9 9 2 8 P 3 0 4 1 4 P 2 7 9 1 9 P 0 7 2 0 6 P 2 5 1 7 2 Q 1 0 1 3 4P 2 1 1 5 1 P 2 5 9 1 6

P 3 3 0 0 7 P 4 5 3 6 4 Q 0 7 4 3 2 P 3 8 5 3 0 P 3 0 5 3 0P 3 3 9 8 4

Q 0 6 4 4 1P 3 5 4 4 1

P 1 1 9 0 8 P 3 1 9 0 9 P 0 1 0 2 6

Q 0 9 7 8 2

P 0 3 8 0 3P 3 2 1 5 5P 2 5 3 5 3

P 0 0 3 8 2

P 4 1 9 3 1P 0 2 6 3 6P 1 3 3 9 4

P 1 4 0 4 6 P 0 1 0 2 3 Q 0 1 8 3 3 Q 0 7 9 4 6 P 3 0 3 4 1

P 0 6 4 0 7

P 4 3 8 2 5

P 1 4 8 6 8

P 3 6 4 1 9

P 3 8 7 0 7

P 3 5 1 9 1

P 4 3 8 2 9

P 1 2 0 2 3

P 2 6 8 0 8 P 4 7 9 3 2

Q 0 0 6 5 3 P 4 7 6 3 2

P 8 0 3 1 3

P 0 4 8 0 8

Q 0 4 5 5 4

P 1 4 2 8 3P 1 2 3 4 9

P 3 7 4 6 4 P 4 0 2 7 5 P 1 2 3 4 8

P 4 5 3 8 0

P 1 3 0 1 5

P 1 2 1 5 5

P 3 1 5 2 2P 4 0 8 7 9

P 1 9 3 7 5

P 0 0 4 9 9

P 0 1 2 1 1

P 4 3 8 5 3

P 0 5 3 7 4

P 2 8 4 8 0

P 0 0 8 9 4

P 4 8 1 1 3

Q 0 2 1 4 0 P 2 5 6 0 5 P 0 7 0 7 5 P 3 2 6 7 5 P 1 5 7 1 0P 1 3 1 3 4P 0 9 2 3 1P 4 0 6 0 4Q 9 9 2 8 9 Q 0 9 1 7 5

P 0 7 3 0 5

P 4 6 5 8 6

P 0 9 9 3 3

P 2 8 7 1 5

P 1 9 2 1 4

P 3 5 6 8 9

Q 0 2 2 5 6 P 0 1 2 1 4 P 2 2 0 0 5P 3 7 3 2 9

P 0 0 4 3 6P 2 8 2 9 8 P 3 7 1 3 6

Q 0 1 5 8 1

P 0 9 4 7 0

P 3 7 7 3 4

P 1 5 1 0 9P 4 6 8 3 1

P 0 6 7 9 8Q 0 0 0 5 6P 2 3 2 2 8

P 1 0 5 4 9P 1 4 2 2 6P 3 7 2 3 1Q 0 3 1 8 1P 3 4 7 5 4P 0 8 9 3 4

P 4 9 1 1 2

P 2 6 3 1 1

P 3 6 1 9 2 P 1 6 4 6 6

P 0 8 4 8 7

P 0 1 0 4 8

P 4 5 1 7 2

P 1 5 0 0 1

P 4 9 2 9 3

P 1 0 6 8 6 P 1 2 2 7 3

P 1 5 3 2 0 P 0 7 8 3 8

P 3 3 7 7 5 P 4 7 1 9 0

P 2 0 5 8 5 P 1 3 7 0 5

P 4 6 5 4 2

P 2 5 0 9 0

P 2 7 0 2 8 P 0 7 2 1 0

Q 0 0 6 8 0

P 2 7 2 7 6

P 0 8 3 1 8

P 3 3 6 9 6P 2 4 0 8 1

P 4 2 9 0 6P 4 4 8 4 6

P 1 5 1 5 1

P 3 9 7 6 8P 1 0 1 7 0

P 4 0 3 0 7 P 2 2 1 4 1P 4 8 1 4 7

P 3 1 5 5 4P 0 9 1 3 1

P 3 2 5 0 6

P 0 9 9 1 8P 4 6 0 6 7P 1 2 7 4 5P 1 2 7 4 7P 3 5 9 8 6Q 0 8 4 7 0Q 0 1 0 1 5 P 0 8 1 1 1 P 1 4 2 6 9 P 1 5 2 9 2

P 0 0 8 4 8Q 0 8 8 9 0P 1 6 0 9 2 P 2 1 8 0 3P 1 2 1 5 9 P 2 2 3 0 4

P 3 1 7 0 8 P 4 4 3 3 0 P 1 0 5 0 3P 2 3 3 8 6P 1 5 7 5 0

P 0 4 1 4 4 P 1 8 2 5 4 P 0 1 3 3 8 P 1 6 1 0 4 P 4 0 2 7 9 P 0 1 2 2 9P 4 7 6 4 5 P 4 5 6 5 7Q 0 2 9 1 7

P 0 9 1 8 1P 3 9 5 1 8P 2 7 8 6 4P 0 9 0 3 0P 1 2 0 4 2P 3 5 8 5 2P 4 4 9 4 7P 0 4 2 2 0P 0 1 8 6 0 P 2 6 9 8 2P 3 6 3 5 4 P 8 0 2 9 9

P 2 6 0 0 7 P 1 5 7 2 2P 2 3 2 2 9P 2 6 3 6 7 P 4 7 2 3 8 P 2 2 5 5 7 P 0 8 6 8 0 P 3 2 5 9 7 Q 0 0 2 6 9 P 1 4 3 8 1 P 2 9 1 3 0P 9 8 0 8 5 P 0 7 5 6 7P 4 9 1 0 8P 2 7 2 0 2P 2 6 6 7 8P 2 6 6 7 7 P 2 9 1 7 5P 0 9 8 8 9

P 4 2 7 8 9 P 1 8 3 9 5 P 2 4 3 0 4 P 1 6 5 4 9 P 3 4 9 1 3Q 0 0 9 6 6P 1 2 6 2 3

P 3 7 2 7 1Q 0 3 0 4 6P 2 0 6 9 8P 2 0 6 9 6P 4 2 1 7 9P 3 0 9 5 8P 3 7 4 7 4P 3 6 6 4 9 P 4 5 5 8 5 P 1 8 5 4 8P 0 0 2 5 9

P 4 2 6 7 5 P 4 2 6 7 6 P 0 6 8 1 1 Q 0 1 6 7 9 P 4 3 7 7 5 Q 0 4 6 0 9 P 0 7 6 5 4 P 4 5 1 9 0 P 0 6 6 7 0

P 4 0 8 5 1P 1 3 0 8 9P 0 1 0 9 3P 4 9 2 5 3P 0 8 8 1 9P 4 6 4 8 3P 1 1 4 7 2P 0 7 1 1 7

Q 0 1 9 6 9 P 3 5 4 4 6 P 3 5 4 4 7 P 2 3 5 4 9 P 0 9 7 5 8P 1 6 4 2 2 P 1 0 4 8 1P 0 7 9 8 4

P 3 9 7 6 6P 2 5 1 0 5P 1 7 9 5 5P 3 7 1 9 8 P 4 1 0 0 6P 2 3 5 9 1 P 3 3 2 1 7 P 2 1 5 5 6P 0 7 2 2 3P 1 5 4 0 7 P 4 3 2 1 9 P 0 9 6 8 1 P 4 9 6 9 8 P 2 2 4 4 9 P 4 4 6 0 2 P 4 5 3 5 4 P 3 0 3 0 9

Q 0 5 1 4 6P 1 4 6 3 5P 2 4 8 6 0P 2 0 1 0 3P 3 0 1 9 5P 4 5 0 0 3P 1 8 7 7 6P 4 3 5 5 0

P 3 0 6 7 2 P 2 9 1 7 6P 4 2 5 5 7 P 4 1 4 4 0P 1 9 0 9 7P 1 3 2 1 5 P 4 2 6 9 7P 0 6 6 1 5 P 1 5 3 6 8P 2 0 5 9 1

P 3 6 5 8 1P 3 5 5 6 5P 3 8 7 5 5Q 0 2 2 0 1P 0 9 5 4 5P 2 8 0 3 1Q 0 9 7 6 8 P 1 3 2 8 8P 4 8 5 0 6P 1 5 6 9 8P 3 4 4 2 5 P 4 3 4 4 9 P 3 3 2 4 0 P 3 9 3 9 6 P 4 5 5 1 0

P 2 9 1 5 5

P 0 8 9 7 0

P 4 6 8 1 3

P 2 9 2 1 8

Q 0 8 8 7 5

P 4 2 1 7 6

P 3 5 2 3 2

Q 0 2 2 0 7P 2 5 4 1 6P 3 9 8 6 8

P 2 6 4 9 5

P 0 8 1 5 9

P 3 9 2 7 6

P 3 6 5 7 4

P 2 7 9 8 1

Q 0 5 5 2 8P 4 9 1 0 2

P 4 8 0 4 4P 1 1 7 0 1

P 4 9 3 3 1

P 3 0 5 1 8P 2 9 3 3 6

P 3 7 9 8 6

P 1 9 3 1 8

P 3 9 1 4 8 P 3 5 4 2 5P 3 8 0 2 5P 1 5 6 4 4P 1 5 3 0 9Q 1 0 1 1 3Q 0 3 4 6 0P 0 3 7 1 0P 4 8 8 4 8P 0 3 2 8 4

P 1 2 6 8 0 P 3 8 9 3 8

P 3 2 0 8 6 P 4 8 8 2 8

P 4 4 7 9 5 P 3 4 8 9 5

P 3 5 1 6 5 P 4 8 8 9 2P 1 2 3 5 2 P 4 2 7 1 2

P 3 1 4 6 0 P 0 8 3 2 5 P 0 9 8 3 2 P 3 6 9 2 4 P 3 7 2 3 2 P 2 8 2 3 5 P 1 5 6 8 7P 3 4 0 5 5 P 1 3 9 1 1P 3 2 4 8 5

P 1 1 6 2 1P 2 7 3 3 6P 3 4 7 5 0P 2 0 6 4 6P 2 9 0 9 3P 0 3 5 1 9 P 3 7 1 2 7 Q 0 2 2 5 1 P 2 3 1 7 6

P 1 1 0 6 6

P 3 6 2 6 0

P 1 1 3 4 9

P 1 7 5 1 8

P 0 7 9 8 5P 0 7 7 3 8P 3 6 2 6 5

P 2 4 1 3 1

P 1 6 0 9 7

P 3 9 7 7 3P 3 6 2 6 6

P 4 3 9 2 0

P 0 0 7 2 2

P 0 0 3 0 9P 3 0 1 8 3 P 4 4 8 3 6

P 2 7 9 6 7

P 3 2 7 4 7

P 4 9 0 8 6 P 1 4 8 5 3

P 1 8 6 4 2 P 4 4 7 1 5P 2 4 2 2 0

P 1 3 6 4 9P 3 7 8 8 7

P 3 2 2 3 2P 0 9 4 8 9 P 2 7 5 2 6

P 3 2 1 5 4 P 0 6 4 9 0

P 0 5 8 7 6 P 3 2 6 7 2

P 3 5 5 2 0 P 1 0 2 1 2 P 2 8 7 7 2P 0 7 3 7 5

P 2 6 3 8 2

P 0 3 3 6 3P 2 5 9 7 1

P 4 3 8 3 3P 0 6 1 7 9

P 1 0 6 1 4P 1 2 8 7 0

P 0 9 8 9 1 P 2 8 3 4 8

P 0 3 5 7 9

P 2 3 7 7 6P 0 6 9 6 0

Q 0 0 5 5 6

P 1 0 7 6 8P 1 7 5 6 1 P 0 3 5 4 4P 1 5 1 8 3

P 1 0 6 1 5P 1 4 2 6 3

P 3 5 5 3 8 Q 0 3 8 4 5 P 2 0 8 1 0 P 2 7 3 2 1 P 0 4 3 2 4

Q 0 7 3 0 7P 3 6 7 8 8P 2 5 0 6 6

P 3 9 0 0 7P 4 6 9 7 5

P 2 5 5 1 5P 3 8 9 7 2P 4 6 4 5 6

P 1 9 2 1 7

P 4 9 2 3 7

P 1 6 2 8 4

Q 0 3 4 6 7 P 4 4 9 2 0

P 3 2 8 4 2 P 3 7 3 7 7

P 1 4 6 1 4P 0 5 8 5 7

P 4 8 7 7 7

P 3 3 6 1 3P 3 6 6 4 2

P 4 5 6 7 7P 4 5 6 7 8

P 3 7 3 7 9

P 3 4 5 5 8

P 2 4 1 2 8

P 3 7 3 9 8

Q 0 8 4 8 1

P 2 1 0 3 2

P 4 8 6 1 2

P 1 9 4 1 0 P 1 6 0 9 9

P 4 9 6 0 8

P 0 3 3 6 2

Q 0 3 6 1 0

P 1 4 0 7 8

P 3 7 0 3 2

P 1 1 9 7 6

P 0 2 3 8 2

P 0 0 5 4 9

Q 0 8 6 8 4

P 2 1 5 3 0

P 1 5 5 0 9

Q 0 8 0 9 9

P 3 3 8 7 9

Q 0 1 0 8 5P 3 1 4 8 3

P 4 3 0 0 2

P 1 9 7 1 1

P 1 1 0 9 5

P 0 7 5 4 7P 0 2 4 8 2P 3 3 7 5 2P 4 0 9 5 4P 2 5 7 6 5

P 3 8 9 7 1P 3 4 6 5 0

P 3 8 0 9 2

Q 0 3 0 6 5 P 2 7 6 9 3

Q 0 6 7 5 8P 2 5 4 1 5P 3 2 2 4 2P 8 0 2 0 5P 3 9 5 2 9Q 0 9 9 2 3P 2 9 0 2 9

Q 0 0 9 9 3 P 2 0 5 3 3 P 4 0 9 0 8 P 4 0 4 6 7 P 3 2 2 4 3 P 8 0 2 0 6 P 1 1 6 3 5P 4 3 8 7 9

P 4 5 0 3 5P 2 9 4 6 5P 3 0 5 9 4P 0 5 8 2 5 P 3 0 5 7 2 P 2 9 9 6 1P 3 0 5 9 7

Q 0 1 6 5 7

P 1 6 4 5 1P 2 7 7 4 7

P 1 6 2 6 3

P 1 3 5 1 6

Q 0 1 2 0 5

P 4 5 1 1 8

P 1 6 5 2 1

P 0 6 9 5 9

P 2 5 9 9 7

P 1 2 6 9 5

P 0 3 9 5 6

Q 0 2 9 7 5

P 3 0 2 9 6

P 4 5 1 7 0

Q 0 9 4 2 7

Q 0 4 9 8 2

P 2 1 8 5 2

P 3 8 0 4 6

P 4 0 0 2 4

P 4 5 1 0 5P 4 7 3 0 3

P 4 5 6 0 0

P 3 1 0 6 0

P 4 2 4 3 6

P 4 5 3 2 1

P 0 9 8 3 3

P 1 0 6 3 6

P 4 5 0 5 2

P 4 5 0 5 1P 4 1 2 3 3

P 4 1 6 4 7

P 0 8 2 6 6

P 2 4 1 3 6

P 0 8 0 0 7

P 3 9 1 0 9

P 2 1 4 3 9

Q 0 3 5 1 9

Q 0 0 6 1 9

P 3 6 3 7 1

P 3 8 7 3 5

P 3 3 3 1 0

P 1 9 7 7 1

P 4 5 8 6 1

P 2 1 4 4 8 P 3 3 3 1 1

P 4 9 5 0 1

P 2 1 4 4 1P 2 2 0 3 6

P 1 3 5 6 8

P 4 5 1 6 7

P 3 7 6 2 4

P 2 4 6 8 3

P 1 8 7 6 6

P 2 4 1 3 7

P 4 5 1 7 1

P 3 3 9 4 1

P 4 3 0 7 4

P 3 3 2 0 0

P 4 2 3 3 7

P 0 9 0 1 2

P 1 2 3 8 3

P 3 3 3 0 2

P 3 0 9 6 3

P 4 6 9 2 0P 1 5 1 8 7

P 2 6 3 6 1

P 1 6 6 8 4

P 0 3 5 9 3

Q 0 3 0 2 5 P 3 6 0 2 8

P 4 0 9 6 7

Q 0 2 5 9 2

P 4 5 7 9 1P 1 1 0 9 2

P 3 6 3 3 0

P 0 3 5 5 6

P 0 5 8 4 4

P 3 4 9 5 6

P 2 4 7 9 4P 0 6 0 1 9

P 2 1 4 8 0

P 0 3 8 7 8

P 0 0 3 9 7

Q 0 9 8 9 3

P 1 9 0 2 8P 2 2 0 5 6

P 2 2 4 9 5P 3 6 3 3 1 P 1 9 1 9 9

P 1 7 7 5 7 P 2 7 4 1 0

P 3 5 9 2 8

P 0 9 8 1 4

P 0 8 3 6 4

P 1 9 5 6 1

Q 0 5 0 5 7P 1 8 2 4 7

P 3 1 6 3 0

P 1 3 8 9 7P 1 3 9 0 0

P 0 6 9 3 5

P 2 5 0 5 9

Q 0 2 5 9 7

P 1 0 9 7 8

P 1 6 6 0 4

P 0 3 2 0 0

P 1 7 5 9 3Q 0 4 5 4 4

P 0 3 3 0 5P 0 3 3 0 6

P 2 9 3 2 4P 3 6 3 2 7

P 1 6 6 9 1P 3 6 6 3 8

P 2 4 5 8 6P 0 3 3 1 6P 2 7 2 8 5

P 4 4 0 4 7

P 2 9 1 7 2

P 1 0 3 0 6

P 4 4 9 1 7

P 0 3 3 1 4

P 2 0 1 2 6

P 0 5 9 5 9Q 0 4 5 3 8

P 0 3 5 9 9

P 1 3 5 6 1

Q 0 0 9 6 2

P 3 6 3 0 4

P 3 6 3 0 9

P 2 7 2 8 2

P 2 7 4 0 9 Q 0 4 6 1 0

P 0 3 3 0 4

P 1 9 9 0 1

P 1 3 5 2 9

P 0 8 7 6 8

P 1 1 2 0 4

P 1 0 2 7 2

P 1 9 5 6 0

P 0 3 3 0 2P 3 1 8 2 2

P 2 2 3 2 1P 1 7 1 2 4

P 2 1 9 1 7

P 0 8 1 7 2

P 0 8 9 1 2

P 3 2 2 1 1

P 3 5 3 7 2

P 1 8 8 2 5

P 2 5 4 7 3

P 2 1 0 8 4

P 4 7 8 9 8

P 0 8 1 7 3

P 3 3 5 3 3

Q 0 4 5 7 3

P 4 7 7 4 8

P 3 0 0 9 8

P 4 1 1 4 3

P 4 7 7 5 1P 4 6 0 9 0

P 3 4 9 7 5

P 3 5 3 7 1

P 4 1 1 4 4

P 3 5 3 5 0 P 3 0 5 4 9

P 0 5 3 6 3

P 1 4 1 2 6

P 4 9 5 7 8

P 1 1 2 2 9

P 2 0 3 0 9

P 4 2 2 8 9

P 2 5 9 6 2

P 0 4 2 7 4

P 4 7 9 0 1

P 0 6 7 2 4

P 4 8 9 7 4

P 1 0 7 2 0

P 3 8 8 6 7

P 2 2 3 3 2

P 4 2 3 4 7

P 4 3 1 1 5

P 3 4 9 8 0

P 3 5 4 0 8

P 2 8 3 3 6

P 3 2 5 1 2P 4 6 7 3 7

P 0 4 0 0 1

P 3 2 2 4 0

Q 0 5 3 9 4

P 3 0 8 7 4P 2 8 6 4 6

P 3 0 8 7 2

P 3 0 6 8 0

P 3 1 3 9 1

P 3 2 7 4 5

P 0 1 4 5 2

P 3 2 2 3 6

P 3 1 3 8 9

P 2 1 4 5 0P 3 0 5 4 6

P 3 2 9 4 0

P 2 0 3 4 6P 2 3 3 6 2

P 2 8 0 8 8

P 4 8 7 4 8

P 2 3 1 6 3

P 2 8 6 8 0

P 3 5 3 8 3

P 2 5 1 1 5

P 2 9 2 7 6

P 1 8 9 0 1

P 3 0 9 3 8

P 3 0 9 3 6

P 2 5 9 3 0

P 2 8 8 2 7

P 2 2 7 3 5

P 4 1 3 8 1 P 3 4 6 8 9

P 3 9 6 8 7

P 4 2 3 0 5

P 2 2 2 9 7

Q 0 1 7 1 7

P 1 9 3 9 8

P 2 0 6 3 8

P 3 6 1 7 6

P 3 5 4 0 7

P 4 3 5 0 5

P 4 6 0 2 3

P 2 1 4 6 3P 1 6 4 7 3

P 4 7 7 9 9

P 3 7 9 7 2

P 3 2 3 1 1

P 3 2 4 8 2

P 1 5 4 0 9P 2 4 9 3 9P 0 5 0 7 8

P 2 5 1 0 6

P 3 2 3 0 6

P 1 1 6 1 3

P 4 3 2 5 3

P 3 0 9 8 9

P 4 3 6 5 7

Figure 5.10. The Protein network using only circular patterns




Pfam-B_5023

Pfam-B_5022

Pfam-B_5558

Pfam-B_3815

Pfam-B_10198 Pfam-B_186 Pfam-B_1752 Pfam-B_10873 Pfam-B_2487

Pfam-B_8467

Pfam-B_8672

Pfam-B_1596


Pfam-B_522


Pfam-B_4146

Pfam-B_676

Pfam-B_2192


Pfam-B_5896

Pfam-B_1095

Pfam-B_6872

Pfam-B_8637


Pfam-B_1563

Pfam-B_519


Pfam-B_648



Pfam-B_6579

Pfam-B_802

s i g m a 5 4

Pfam-B_631

Pfam-B_1115

Pfam-B_10

Pfam-B_21

Pfam-B_860


Pfam-B_9601


Pfam-B_7768

Pfam-B_2642


Pfam-B_1963


Pfam-B_1964

Pfam-B_10092

Pfam-B_10086

Pfam-B_2638

Pfam-B_10087

Pfam-B_8799


Pfam-B_10031


Pfam-B_6097 Pfam-B_917Pfam-B_6327 Pfam-B_6768Pfam-B_3808

Pfam-B_7975

Pfam-B_2507

Pfam-B_1103Pfam-B_5615Pfam-B_2813 Pfam-B_4434Pfam-B_3623


Pfam-B_3820

Pfam-B_3454

Pfam-B_1754

Pfam-B_8757


Pfam-B_7637 Pfam-B_7406Pfam-B_4685 Pfam-B_8911

Pfam-B_1952

Pfam-B_4173Pfam-B_8049 Pfam-B_5280 Pfam-B_6088Pfam-B_4992 Pfam-B_3606Pfam-B_10328c o n n e x i nPfam-B_4983 Pfam-B_11615Pfam-B_7452

Pfam-B_1012

Pfam-B_9321

Pfam-B_5064Pfam-B_2218 Pfam-B_9762 Pfam-B_3348 Pfam-B_2536

Pfam-B_1317


Pfam-B_4704

Pfam-B_2023

Pfam-B_1221


Pfam-B_115 P r i b o s y l t r a n Pfam-B_4341Pfam-B_10795Pfam-B_558Pfam-B_1442Pfam-B_9318Pfam-B_5944 Pfam-B_8012Pfam-B_7540Pfam-B_8708




Pfam-B_2003

Pfam-B_665 Pfam-B_8795Pfam-B_6609

Pfam-B_2987 Pfam-B_1766Pfam-B_853Pfam-B_2873Pfam-B_8356 Pfam-B_6982Pfam-B_7683 Pfam-B_5130Pfam-B_7833Pfam-B_883 Pfam-B_6868Pfam-B_11630Pfam-B_934Pfam-B_7547 Pfam-B_3399 Pfam-B_6731

Pfam-B_3034 Pfam-B_2625Pfam-B_2198Pfam-B_2199Pfam-B_8325Pfam-B_7292

Pfam-B_6519 Pfam-B_5997Pfam-B_5996 Pfam-B_11524Pfam-B_8499Pfam-B_9290 Pfam-B_5300 s o d f ePfam-B_3885Pfam-B_429Pfam-B_10635Pfam-B_7765Pfam-B_4633Pfam-B_1201Pfam-B_4775Pfam-B_4776Pfam-B_6735

Pfam-B_8237Pfam-B_8403Pfam-B_908Pfam-B_9860 Pfam-B_7003Pfam-B_7104Pfam-B_7190Pfam-B_7316 Pfam-B_7216 Pfam-B_6955Pfam-B_7254Pfam-B_7438Pfam-B_7835 Pfam-B_7658 Pfam-B_7614 Pfam-B_7579 Pfam-B_7497 Pfam-B_7494 Pfam-B_7470 Pfam-B_7448Pfam-B_1524Pfam-B_10399Pfam-B_6437Pfam-B_1117Pfam-B_521Pfam-B_1522Pfam-B_319Pfam-B_1866


l i p a s e

Pfam-B_5059

Pfam-B_841


Pfam-B_3589

Pfam-B_753




Pfam-B_10663

Pfam-B_475 Pfam-B_628 Pfam-B_545Pfam-B_1910 Pfam-B_1415Pfam-B_2255Pfam-B_5520

Pfam-B_4152



Pfam-B_2753

Pfam-B_2833



HSP70

Pfam-B_785

Pfam-B_930

Pfam-B_2286


Pfam-B_5123

Pfam-B_10588 Pfam-B_3361 Pfam-B_230 Pfam-B_10030 Pfam-B_3077Pfam-B_1597Pfam-B_2647Pfam-B_3061

Pfam-B_3070

Pfam-B_567

Pfam-B_7140

Pfam-B_10887

Pfam-B_289

Pfam-B_6172


K H - d o m a i n

Pfam-B_1587


Pfam-B_1410

Pfam-B_586

tsp_1

Pfam-B_996

Pfam-B_11397

Pfam-B_6173

Pfam-B_1950

Pfam-B_1697

Pfam-B_1943

Pfam-B_1147

Pfam-B_9962

Pfam-B_953

Pfam-B_9878

Pfam-B_4820


Pfam-B_417


Pfam-B_6032

Pfam-B_9963


Pfam-B_1180

Pfam-B_4312

Pfam-B_4310

Pfam-B_4327

Pfam-B_2850


Pfam-B_4311

Pfam-B_2512

Pfam-B_2229

Pfam-B_2724

Pfam-B_3343


Pfam-B_10319

Pfam-B_4328

Pfam-B_3843

Pfam-B_1406

Pfam-B_2112

Pfam-B_6363


Pfam-B_9612

Pfam-B_4344

Pfam-B_3735


Pfam-B_3846

Pfam-B_491


DNA_pol

P fam-B_9162

Pfam-B_4313



Pfam-B_7001

Pfam-B_4050

Pfam-B_7744

a d h _ s h o r t

P fam-B_650Pfam-B_3305

Pfam-B_8797


Pfam-B_10598

Pfam-B_1565

Pfam-B_5390

Pfam-B_3301


Pfam-B_1280

Pfam-B_8881


Pfam-B_2001

Pfam-B_6851

Pfam-B_5526

Pfam-B_6163


Pfam-B_2799

Pfam-B_6852

Pfam-B_3283

Pfam-B_5662

Pfam-B_3091

Pfam-B_5663

Pfam-B_816

Pfam-B_3108



Pfam-B_3443

Pfam-B_7552

Pfam-B_1416



Pfam-B_10079

Pfam-B_2356

Pfam-B_7747

Pfam-B_3105

Pfam-B_3010


Pfam-B_2954

Pfam-B_4501

Pfam-B_2807

Pfam-B_224

Pfam-B_1962

Pfam-B_2657

Pfam-B_10679

Pfam-B_6939

Pfam-B_1132

Pfam-B_1769

Pfam-B_5686

Pfam-B_8585

Pfam-B_10666


Pfam-B_873

Pfam-B_2662

Pfam-B_3585

Pfam-B_2495 Pfam-B_2369 Pfam-B_2338 Pfam-B_2287 Pfam-B_2171 Pfam-B_1762 Pfam-B_1749

Pfam-B_10129Pfam-B_10186Pfam-B_6306Pfam-B_10268Pfam-B_462

Pfam-B_11832


Pfam-B_2064


Pfam-B_6065

Pfam-B_3430

Pfam-B_6047

Pfam-B_2306

Pfam-B_5685

Pfam-B_2354

Pfam-B_5231

Pfam-B_9635

Pfam-B_9993

Pfam-B_2815

Pfam-B_9992

Pfam-B_2816

Pfam-B_9991

Pfam-B_8699

Pfam-B_902

Pfam-B_6044

Pfam-B_4984

Pfam-B_6231


Pfam-B_4637

Pfam-B_983

Pfam-B_8650

Pfam-B_10452

Pfam-B_1626

Pfam-B_3860

Pfam-B_835

Pfam-B_2109

Pfam-B_7285

Pfam-B_1579

Pfam-B_1278


Pfam-B_301

Pfam-B_2591

Pfam-B_362




Pfam-B_5781

Pfam-B_7625



Pfam-B_10463


Pfam-B_10464

Pfam-B_2760


Pfam-B_3164

Pfam-B_1678

Pfam-B_2293

Pfam-B_10470

Pfam-B_984


Pfam-B_464

Pfam-B_55

Pfam-B_9342

Pfam-B_214


Pfam-B_7755

Pfam-B_7134

Pfam-B_9030

Pfam-B_1356


Pfam-B_2837


Pfam-B_2124



Pfam-B_510

Pfam-B_2870


Pfam-B_2371


Pfam-B_421

Pfam-B_5318

Pfam-B_10731


Pfam-B_1101


Pfam-B_6450

Pfam-B_5991


Pfam-B_6529

Pfam-B_190


Pfam-B_7580



Pfam-B_7594

Pfam-B_3344 Pfam-B_6991 Pfam-B_6317 Pfam-B_2303Pfam-B_1225



Pfam-B_2382

Pfam-B_4384



Pfam-B_3321

Pfam-B_2098

Pfam-B_2005

s u b t i l a s e




Pfam-B_4803

Pfam-B_3915

Pfam-B_7007



Pfam-B_3569

Pfam-B_868


Pfam-B_3366

Pfam-B_1380

Pfam-B_2947

Pfam-B_11369


Pfam-B_1322Pfam-B_4993Pfam-B_8563Pfam-B_9409 Pfam-B_3521 Pfam-B_323Pfam-B_3284Pfam-B_3378Pfam-B_6089

Pfam-B_2415


Pfam-B_10693

Pfam-B_10529

Pfam-B_8755

Pfam-B_3565


Pfam-B_2908


Pfam-B_9320


Pfam-B_10926Pfam-B_5424 Pfam-B_639 COX2 Pfam-B_10932 Pfam-B_4183 Pfam-B_871Pfam-B_8932 Pfam-B_11111Pfam-B_5859Pfam-B_5611Pfam-B_4579

Pfam-B_5553

Pfam-B_4193

Pfam-B_7207


Pfam-B_1919

Pfam-B_562

Pfam-B_6742

Pfam-B_1790

Pfam-B_435

Pfam-B_377


Pfam-B_4010

Pfam-B_3558

Pfam-B_1518

Pfam-B_1214

p y r _ r e d o x

Pfam-B_10724

Pfam-B_584

Pfam-B_1864

Pfam-B_2991

Pfam-B_9415



Pfam-B_6606

Pfam-B_4518

Pfam-B_3183


Pfam-B_457

Pfam-B_8127

Pfam-B_5379

Pfam-B_8109

Pfam-B_9319

Pfam-B_2599

Pfam-B_4470

Pfam-B_4925


Pfam-B_8742

Pfam-B_8688


Pfam-B_1487

Pfam-B_5627

Pfam-B_8043

Pfam-B_5628

Pfam-B_2748

7 t m _ 2

Pfam-B_5019


Pfam-B_2317


Pfam-B_5104

Pfam-B_1759

Pfam-B_1553

Pfam-B_3667


Pfam-B_1552

Pfam-B_5804

Pfam-B_1877



Pfam-B_7417

Pfam-B_153

S 1 2

Pfam-B_1248




Pfam-B_7662


Pfam-B_48


Pfam-B_2295

Pfam-B_4414

Pfam-B_11162

Pfam-B_11141


Pfam-B_1061

Pfam-B_1062

Pfam-B_2294



Pfam-B_4407


Pfam-B_9547

Pfam-B_11864

p h o t o R C Pfam-B_6946

Pfam-B_70

Pfam-B_35

response_regPfam-B_4887


Pfam-B_3162

Pfam-B_2927

Pfam-B_2443

Pfam-B_5377


Pfam-B_9341

Pfam-B_6995

Pfam-B_6996

Pfam-B_4710

Pfam-B_1172

Pfam-B_6994

Pfam-B_7605

Pfam-B_7600

Pfam-B_1421

Pfam-B_7602




Pfam-B_4817

Pfam-B_613

Pfam-B_1562

Pfam-B_2928

Pfam-B_3412



Pfam-B_9565

Pfam-B_856Pfam-B_8

Pfam-B_7465

Pfam-B_5803


Pfam-B_7237

Pfam-B_2454


Pfam-B_839 Pfam-B_7464 Pfam-B_7098 Pfam-B_4054 Pfam-B_3185Pfam-B_10677


Pfam-B_10688






Pfam-B_3168


Pfam-B_1780


Pfam-B_11276

Pfam-B_7062

Pfam-B_1991


Pfam-B_2453

Pfam-B_10637

Pfam-B_10640

Pfam-B_6537

Pfam-B_838

Pfam-B_56

Pfam-B_981

tRNA-syn t_1

Pfam-B_1936

Zn_c lus

a l d e d hPfam-B_1946

Pfam-B_3931

Pfam-B_7834

Pfam-B_4641


Pfam-B_4194v w d

Pfam-B_8680


Pfam-B_200

Pfam-B_4804

Pfam-B_204

Pfam-B_10633

Pfam-B_2017

Pfam-B_374

Pfam-B_6810

Pfam-B_9144

Pfam-B_8670

Pfam-B_5835

Pfam-B_3330


Pfam-B_469

Pfam-B_10775

Pfam-B_647


Pfam-B_8718

Pfam-B_3736

Pfam-B_900

Pfam-B_4703

Pfam-B_10281

Pfam-B_2072

Pfam-B_2560

Pfam-B_8716

Pfam-B_8719

Pfam-B_4634

Pfam-B_2478

Pfam-B_2559

Pfam-B_2466

Pfam-B_3780

Pfam-B_4961


Pfam-B_8667

Pfam-B_682

Pfam-B_3500

Pfam-B_5131



Pfam-B_4801

Pfam-B_1551

Pfam-B_7553

Pfam-B_4593

Pfam-B_10288

Pfam-B_4592

Pfam-B_6424

Pfam-B_1658

Pfam-B_6513

Pfam-B_4619

Pfam-B_6415

Pfam-B_3529

Pfam-B_1659

ce l l u l ase

Pfam-B_8891

Pfam-B_677

Pfam-B_1051

Pfam-B_9192

Pfam-B_5583

Pfam-B_1301

Pfam-B_10625

Pfam-B_3943

Pfam-B_6617

Pfam-B_914

Pfam-B_1990

Pfam-B_283

Pfam-B_348

Pfam-B_4618

Pfam-B_10368

Pfam-B_2166

Pfam-B_5482

Pfam-B_10546

Pfam-B_8358

Pfam-B_2068

lec t in_ legA

Pfam-B_8423

Pfam-B_2050

Pfam-B_2616

Pfam-B_8430

Pfam-B_7248

Pfam-B_1395


Pfam-B_1783


Pfam-B_3022

Pfam-B_7246

Pfam-B_832

Pfam-B_10511

Pfam-B_4822

Pfam-B_10509

Pfam-B_10510


Pfam-B_427

Pfam-B_6875

Pfam-B_707

Pfam-B_4314


Pfam-B_9637


Pfam-B_2723

Pfam-B_4424

Pfam-B_1680

Pfam-B_9638

Pfam-B_3341

Pfam-B_6384

Pfam-B_4213

Pfam-B_255

Pfam-B_1474

Pfam-B_1165


cy toch rome_c

Pfam-B_8546

Pfam-B_8548

Pfam-B_5584

Pfam-B_5687

Pfam-B_684

Pfam-B_1308

Pfam-B_5814

Pfam-B_10813

Pfam-B_4721

Pfam-B_2848

Pfam-B_1477

Pfam-B_1663

Pfam-B_8734

Pfam-B_4212

Pfam-B_199

Pfam-B_3672

Pfam-B_10512

Pfam-B_4024

Pfam-B_4873

Pfam-B_1399

Pfam-B_10289


Pfam-B_1289

Pfam-B_10358

Pfam-B_6423

Pfam-B_10290

Pfam-B_6622

Pfam-B_1685

Pfam-B_1728


Pfam-B_4591


Pfam-B_11192

Pfam-B_11301

Pfam-B_5734

Pfam-B_1668

Pfam-B_8026

Pfam-B_9255

Pfam-B_8027

Pfam-B_2826

Pfam-B_7119

Pfam-B_9180

Pfam-B_1893


Pfam-B_1961

Pfam-B_7664

Pfam-B_11775

Pfam-B_9436

Pfam-B_6485


Pfam-B_9898

Pfam-B_1509

Pfam-B_6180


Pfam-B_9905

Pfam-B_350

Pfam-B_5750

Pfam-B_9044

Pfam-B_337

Pfam-B_9041

Pfam-B_2801

Pfam-B_4142

Pfam-B_9045



Pfam-B_10904

Pfam-B_646

Pfam-B_2729

Pfam-B_9241

Pfam-B_11280

Pfam-B_5533


Pfam-B_2421

Pfam-B_2907


Pfam-B_6588



Pfam-B_1373

Pfam-B_2814

Pfam-B_9974

Pfam-B_9

Pfam-B_22

Pfam-B_233

Pfam-B_174

Pfam-B_4712

Pfam-B_10736


Pfam-B_10760

Pfam-B_2419

Pfam-B_10752


Pfam-B_4415

Pfam-B_1445

Pfam-B_8322

GTP_EFTU

Pfam-B_7695

Pfam-B_982

Pfam-B_4006

Pfam-B_7932

Pfam-B_30

Pfam-B_4140

Pfam-B_7816

Pfam-B_10015

Pfam-B_585

Pfam-B_9928

Pfam-B_1390


Pfam-B_4939

Pfam-B_10170

Pfam-B_210

Pfam-B_11329

Pfam-B_229

Pfam-B_3972


c a d h e r i n

Pfam-B_874

Pfam-B_8204

Pfam-B_3978

Pfam-B_2028

Pfam-B_5708

Pfam-B_2274

Pfam-B_8862

Pfam-B_7155

Pfam-B_675



Pfam-B_10473

Pfam-B_10435

Pfam-B_6033f e r 4Pfam-B_9562

Pfam-B_7286

Pfam-B_7274

Pfam-B_5350

Pfam-B_2900

Pfam-B_7981

Pfam-B_7845

Pfam-B_7849

t r e f o i l

P fam-B_3081

Pfam-B_6957

Pfam-B_1107

Pfam-B_4606


Pfam-B_6398s e r p i n

Pfam-B_40

Pfam-B_9356

Pfam-B_8302

v w c

Pfam-B_1275

Pfam-B_1274

f ib r inogen_C


t r y p s i nKuni tz_BPTI

Pfam-B_6988

h o m e o b o x

Pfam-B_10294



Pfam-B_1043


Pfam-B_7640

Pfam-B_8076

Pfam-B_5196


Pfam-B_4269

Pfam-B_7453


Pfam-B_8380

Pfam-B_2643

Pfam-B_3223

Pfam-B_10351

Pfam-B_11100

Pfam-B_6417

Pfam-B_8371



Pfam-B_1144

Pfam-B_7091

Pfam-B_1628

p o u

Pfam-B_3322

Pfam-B_89

Pfam-B_5759

Pfam-B_5758

Pfam-B_1299

Pfam-B_8545

Pfam-B_3324

Pfam-B_8547

Pfam-B_8544

Pfam-B_2684


Pfam-B_3303

Pfam-B_8763

Pfam-B_11649

k e t o a c y l - s y n tPfam-B_50

Pfam-B_117

Pfam-B_235

Pfam-B_1908

Pfam-B_5747

Pfam-B_927

Pfam-B_5703


Pfam-B_9998

Pfam-B_2346


Pfam-B_4509

Pfam-B_4510

Pfam-B_6237

Pfam-B_6337

Pfam-B_6336

Pfam-B_6335

Pfam-B_7195

Pfam-B_6333

Pfam-B_10226

Pfam-B_5000

Pfam-B_517

Pfam-B_133

Pfam-B_177

Pfam-B_126

Pfam-B_2988

Pfam-B_1233

Pfam-B_234

Pfam-B_6862

Pfam-B_405

Pfam-B_6863

Pfam-B_36

Pfam-B_17

Pfam-B_52

Pfam-B_53Pfam-B_287

Pfam-B_24

Pfam-B_6492

Pfam-B_6334

Pfam-B_1232

Pfam-B_6819

Pfam-B_1761

Pfam-B_6822

Pfam-B_560

Pfam-B_10508

Pfam-B_10370

Pfam-B_1354

Pfam-B_1079

Pfam-B_3528

Pfam-B_6426

Pfam-B_10371

Pfam-B_10373

Pfam-B_2956

Pfam-B_6736

Pfam-B_355

Pfam-B_3670


Pfam-B_5669

Pfam-B_5768

Pfam-B_750

Pfam-B_8919

Pfam-B_1650

Pfam-B_4255

Pfam-B_8927

Pfam-B_5361

Pfam-B_5435

Pfam-B_6419

Pfam-B_359

Pfam-B_6915

Pfam-B_9606

Pfam-B_6987

Pfam-B_11092

Pfam-B_1876

Pfam-B_11075

tRNA-syn t_2

Pfam-B_4199

Pfam-B_4908

Pfam-B_3083


Pfam-B_11898

Pfam-B_11897


Pfam-B_4645

Pfam-B_196

Pfam-B_7505

Pfam-B_20

Pfam-B_37

Pfam-B_830

Pfam-B_671p i l i n

P fam-B_11130

Pfam-B_3883

Pfam-B_4497


Pfam-B_2342

Pfam-B_1694

Pfam-B_1695

Pfam-B_9931

Pfam-B_9933TGF-be ta


Pfam-B_3715

Pfam-B_11256



Pfam-B_9932



Pfam-B_370

Pfam-B_3463

Pfam-B_136

Pfam-B_2834


Pfam-B_1197

Pfam-B_8869

Pfam-B_1457

Pfam-B_6650

Pfam-B_11278


Pfam-B_1782

Pfam-B_4646

Pfam-B_3547


Pfam-B_2350


Pfam-B_2836

Pfam-B_10189

Pfam-B_11068

Pfam-B_727

Pfam-B_3575

Pfam-B_6380

Pfam-B_803

Pfam-B_2488

Pfam-B_3474

Pfam-B_6378

Pfam-B_2351


Pfam-B_2841

Pfam-B_3507

Pfam-B_4483

Pfam-B_969

Pfam-B_4675

Pfam-B_6339

Pfam-B_5654

Pfam-B_7900

Pfam-B_11123

Pfam-B_8313


Pfam-B_6330

Pfam-B_1755

Pfam-B_505

Pfam-B_423


Pfam-B_6332

Pfam-B_10245

Pfam-B_10242

Pfam-B_1521

Pfam-B_11487

Pfam-B_11490

Pfam-B_826

Pfam-B_10237


Pfam-B_1706

Pfam-B_3505

Pfam-B_827

Pfam-B_765

Pfam-B_6341

Pfam-B_6338

Pfam-B_4997

Pfam-B_10236

Pfam-B_10244

Pfam-B_10241

Pfam-B_7350

Pfam-B_7349

Pfam-B_1349


Pfam-B_6287

Pfam-B_3074

Pfam-B_1302

Pfam-B_3059

Pfam-B_7999

Pfam-B_8447

Pfam-B_11896

Pfam-B_11155

Pfam-B_11145

Pfam-B_2416

Pfam-B_10001

f i l a m e n t

Pfam-B_324

Pfam-B_2555

Pfam-B_2262

Pfam-B_162

Pfam-B_148

Pfam-B_122

Pfam-B_1116

Pfam-B_176


Pfam-B_768

Pfam-B_6371

Pfam-B_7665

Pfam-B_7667

Pfam-B_7666

Pfam-B_7668

Pfam-B_701

Pfam-B_6369

Pfam-B_915

Pfam-B_3574


Pfam-B_9692


Pfam-B_3767

Pfam-B_6578

Pfam-B_831

Pfam-B_10720

Pfam-B_1282

Pfam-B_3327

DAG_PE-bind


Pfam-B_10397

Pfam-B_9767

Pfam-B_8150

s u g a r _ t r

P fam-B_5411

Pfam-B_7235

Pfam-B_1110

Pfam-B_1574

Pfam-B_10299

Pfam-B_2773

Pfam-B_9661

Pfam-B_2846

Pfam-B_10947

Pfam-B_4594


Pfam-B_11038

Pfam-B_5211

Pfam-B_1067

h e m o p e x i n

Pfam-B_4128

Pfam-B_358

Pfam-B_5084

Pfam-B_4129

Pfam-B_6217

Pfam-B_6216


Pfam-B_4945

Pfam-B_941

Pfam-B_729

Pfam-B_730

Pfam-B_6221

Pfam-B_7355

Pfam-B_5267

Pfam-B_5265

Pfam-B_7354


Pfam-B_11928


Pfam-B_2288




Pfam-B_10585

Pfam-B_60

Pfam-B_10587

Pfam-B_31


Pfam-B_1089

Pfam-B_10727

Pfam-B_159


Pfam-B_3593



Pfam-B_75



Pfam-B_955zf-CCHC

Pfam-B_1152


Pfam-B_3572

Pfam-B_2147

Pfam-B_2009

Pfam-B_4663

Pfam-B_3120

Pfam-B_5266

Pfam-B_1360

Pfam-B_7838

Pfam-B_2007

Pfam-B_13Pfam-B_422

Pfam-B_109

Pfam-B_7196

Pfam-B_766

Pfam-B_411

Pfam-B_3508

Pfam-B_10243

Pfam-B_6340

Pfam-B_170

Pfam-B_5368

Pfam-B_11828

Pfam-B_10576

Pfam-B_6006

Pfam-B_7491

Pfam-B_10403

Pfam-B_10575

Pfam-B_10573

Pfam-B_10572




Pfam-B_2881

Pfam-B_2893

Pfam-B_1533



Pfam-B_10711

Pfam-B_10580

Pfam-B_11744

Pfam-B_4881

Pfam-B_993

Pfam-B_10726

Pfam-B_1234




Pfam-B_4666

Pfam-B_284

Pfam-B_6012


Pfam-B_160

Pfam-B_2758

Pfam-B_6580

Pfam-B_919

Pfam-B_1241

Pfam-B_2989

Pfam-B_774

Pfam-B_2292

Pfam-B_2290

Pfam-B_3564

Pfam-B_1677

Pfam-B_3879



r v t

P fam-B_881

r h v

Pfam-B_3263

Pfam-B_3299

Pfam-B_10556

Pfam-B_3212


Pfam-B_5111

Pfam-B_10607

Pfam-B_7486

Pfam-B_5069

Pfam-B_7372

Pfam-B_1323

Pfam-B_2528

r n a s e HPfam-B_859

Pfam-B_1803


Pfam-B_497

Pfam-B_4261H L H

Pfam-B_6516


Pfam-B_134

Pfam-B_3703

Pfam-B_10617

Pfam-B_125

Pfam-B_10563

Pfam-B_10558

Pfam-B_49

Pfam-B_2327

Pfam-B_42


Pfam-B_41

Pfam-B_4243

Pfam-B_110

Pfam-B_3999


Pfam-B_6511

Pfam-B_1260

Pfam-B_3068


Pfam-B_1804

Pfam-B_10597


Pfam-B_3050

Pfam-B_97

Pfam-B_11591

Pfam-B_7451

Pfam-B_11753d s r m

RuBisCO_smal l

P fam-B_4976

Pfam-B_4061

Pfam-B_8120

Pfam-B_2204

Pfam-B_303

Pfam-B_5007

c o p p e r - b i n d

Pfam-B_8749

Pfam-B_870

r a s

Pfam-B_11716

Pfam-B_11715

Pfam-B_9936

Pfam-B_10471

Pfam-B_5293

Pfam-B_2434

Pfam-B_11906

Pfam-B_1357

Pfam-B_597

Pfam-B_2784

Pfam-B_4931

Pfam-B_7301

Pfam-B_3901

Pfam-B_2362

Pfam-B_2995


Pfam-B_7873

Pfam-B_7017

Pfam-B_474

Pfam-B_5973

Pfam-B_6721

lamin in_B

Pfam-B_8034

Pfam-B_6532

Pfam-B_6534


Pfam-B_7846

Pfam-B_7174


Pfam-B_307


Pfam-B_11763

a n k

Pfam-B_8273

Pfam-B_1084

Pfam-B_7033

Pfam-B_633

Pfam-B_3121

Pfam-B_5780

Pfam-B_9398

Pfam-B_4888

Pfam-B_5958

a p p l e


Pfam-B_9028

Pfam-B_3852


Pfam-B_2464

Pfam-B_9544

Pfam-B_8209

Pfam-B_452

Pfam-B_7201

Pfam-B_4409

Pfam-B_896

Pfam-B_478

Pfam-B_11209

Pfam-B_2015

Pfam-B_540


Pfam-B_10608

COX1

Pfam-B_7477

Pfam-B_23

Pfam-B_2527

Pfam-B_5490

Pfam-B_3875

Pfam-B_440

Pfam-B_4163

Pfam-B_9912

Pfam-B_8494

Pfam-B_3136

Pfam-B_6304

Pfam-B_8528

Pfam-B_6303

Pfam-B_2149



Pfam-B_1820


Pfam-B_3985

Pfam-B_3122

Pfam-B_7565

Pfam-B_731

Pfam-B_8334

Pfam-B_481

Pfam-B_2572

Pfam-B_3988

Pfam-B_5643

Pfam-B_5797

Pfam-B_6323

Pfam-B_2838

Pfam-B_6480

Pfam-B_3905



Pfam-B_4279

Pfam-B_1375

Pfam-B_6435

Pfam-B_10787

Pfam-B_6878

Pfam-B_7769

Pfam-B_10791

Pfam-B_7771


Pfam-B_1258

Pfam-B_533


Pfam-B_5376

Pfam-B_9827

Pfam-B_10789

Pfam-B_6598

Pfam-B_4885

Pfam-B_10793


Pfam-B_11805

Pfam-B_11806

Pfam-B_6599


Pfam-B_10784

ATP-synt_C

Pfam-B_777

Pfam-B_11350

Pfam-B_2914

Pfam-B_5412

Pfam-B_1871

Pfam-B_4382

Pfam-B_9455

Pfam-B_1913

Pfam-B_9303

Pfam-B_7426

Pfam-B_6728

ld l_ recept_b

Pfam-B_6723

Pfam-B_4046

Pfam-B_9285

Pfam-B_9305

Pfam-B_9049

Pfam-B_6478

Pfam-B_7569

Pfam-B_2240

ho rmone_ rec


Pfam-B_563

Pfam-B_3904


Pfam-B_8727

Pfam-B_9365

Pfam-B_8717

Pfam-B_3314

Pfam-B_2530

Pfam-B_4396

Pfam-B_4395


Pfam-B_2524

Pfam-B_4397

Pfam-B_8116

Pfam-B_4398

Pfam-B_2405

Pfam-B_400

Pfam-B_7483

Pfam-B_10574


Pfam-B_496

Pfam-B_1531

Pfam-B_2297

Pfam-B_9543

Pfam-B_9662

Pfam-B_2289


Pfam-B_5163

m i t o _ c a r r

P fam-B_3612

Pfam-B_302

Pfam-B_3062

Pfam-B_9545

Pfam-B_1801

Pfam-B_1730

Pfam-B_4402

Pfam-B_8210

Pfam-B_8483

Pfam-B_6182



Pfam-B_901


Pfam-B_6086

Pfam-B_81

Pfam-B_10600

Pfam-B_757

Pfam-B_85

Pfam-B_5113

Pfam-B_7481


Pfam-B_555

Pfam-B_2403

Pfam-B_5115

Pfam-B_6018

Pfam-B_10594

Pfam-B_19r v p

Pfam-B_339

Pfam-B_6017

Pfam-B_7479

Pfam-B_6801

Pfam-B_1361

Pfam-B_11467

Pfam-B_6802

Pfam-B_9531

Pfam-B_297

Pfam-B_5199

Pfam-B_5510


Pfam-B_7056

Pfam-B_7681


Pfam-B_368

Pfam-B_594


Pfam-B_2296

Pfam-B_331

Pfam-B_9215


Pfam-B_4126

Pfam-B_10723

Pfam-B_8443

Pfam-B_8460

Pfam-B_8370

Pfam-B_10722

Pfam-B_5540

Pfam-B_4659

Pfam-B_2230

Pfam-B_8383

Pfam-B_8005

Pfam-B_2298

Pfam-B_4403

Pfam-B_9548

Pfam-B_9533

Pfam-B_1824

Pfam-B_6004

Pfam-B_4082

Pfam-B_1802


Pfam-B_10718

Pfam-B_8409

Pfam-B_3758

Pfam-B_8372

Pfam-B_5045

ABC_tran

r r m

Pfam-B_4724

zf -C4

Pfam-B_5323

Pfam-B_975

Pfam-B_7751

Pfam-B_6916

Pfam-B_3517

Pfam-B_447

Pfam-B_5182

Pfam-B_5428

Pfam-B_5810

Pfam-B_889

Pfam-B_5808

Pfam-B_2139

Pfam-B_10993

Pfam-B_8954

Pfam-B_814

Pfam-B_9276

Pfam-B_9682

Pfam-B_100

Pfam-B_3249

Pfam-B_8523

Pfam-B_10527

Pfam-B_1649

Pfam-B_4206

Pfam-B_5964

Pfam-B_425

Pfam-B_3492

Pfam-B_4631

Pfam-B_3222

Pfam-B_5317




Pfam-B_8379

Pfam-B_7997

Pfam-B_335

Pfam-B_1800

Pfam-B_3994

Pfam-B_5340

Pfam-B_1414


Pfam-B_5514

Pfam-B_2630

Pfam-B_948

Pfam-B_5339

Pfam-B_9684

Pfam-B_4894

Pfam-B_8163

Pfam-B_6845

Pfam-B_2188


Pfam-B_3097

Pfam-B_5197

Pfam-B_5109

Pfam-B_2131


Pfam-B_3237

Pfam-B_3540

Pfam-B_910

Pfam-B_4394

Pfam-B_607

Pfam-B_2245

Pfam-B_8468

Pfam-B_6013

Pfam-B_2578

Pfam-B_3444

Pfam-B_5162

Pfam-B_5304

Pfam-B_1698

Pfam-B_819

Pfam-B_11574

Pfam-B_5078

Y_phosphatase

P fam-B_5075

Pfam-B_1056

Pfam-B_4110

Pfam-B_5960

Pfam-B_11741

Pfam-B_9439



Pfam-B_9848



Pfam-B_2978


Pfam-B_3848

Pfam-B_8108


Pfam-B_2516

Pfam-B_1206

Pfam-B_794

Pfam-B_506


Pfam-B_8105



Pfam-B_1158

Pfam-B_1890

Pfam-B_9870

Pfam-B_3620

Pfam-B_9575

Pfam-B_1737

Pfam-B_7374


Pfam-B_7378


Pfam-B_550

Pfam-B_5760

Pfam-B_8928

t h i o r e dPfam-B_9149

Pfam-B_3306

Pfam-B_2912


Pfam-B_10099

Pfam-B_7653

Pfam-B_3935

Pfam-B_7654

Pfam-B_5079

lec t in_ legB

Pfam-B_1614

Pfam-B_5375


Pfam-B_9264


Pfam-B_1481

Pfam-B_1828

Pfam-B_4028

Pfam-B_1960

Pfam-B_4134

Pfam-B_7411

Pfam-B_1448

Pfam-B_10303



Pfam-B_1368

Pfam-B_5794

Pfam-B_7897

Pfam-B_8516

Pfam-B_166

Pfam-B_8515

Pfam-B_9047


Pfam-B_2715z f -C2H2



Pfam-B_7118

Pfam-B_11833

Pfam-B_9182

Pfam-B_4277

Pfam-B_7669

Pfam-B_9043

Pfam-B_1985

Pfam-B_6565

Pfam-B_4726

Pfam-B_9437

Pfam-B_5757

Pfam-B_5129

Pfam-B_9558

Pfam-B_9983

Pfam-B_1846

Pfam-B_5634

Pfam-B_5421

Pfam-B_9667

Pfam-B_3707

Pfam-B_2061

Pfam-B_2981

Pfam-B_3102

Pfam-B_8228

Pfam-B_2575



Pfam-B_1617

Pfam-B_28

Pfam-B_864

Pfam-B_1136

Pfam-B_2855

Pfam-B_863

Pfam-B_1618

Pfam-B_1840

Pfam-B_7826

Pfam-B_8141

Pfam-B_8196

Pfam-B_1656

Pfam-B_3570

Pfam-B_6903

Pfam-B_3172

Pfam-B_2601

Pfam-B_7708

Pfam-B_3708

Pfam-B_9649

RIP

Pfam-B_514

Pfam-B_604

Pfam-B_7136

Pfam-B_4698

Pfam-B_6576

Pfam-B_842

Pfam-B_4686

Pfam-B_10698

Pfam-B_3259

Pfam-B_1328

Pfam-B_6572

Pfam-B_1088

Pfam-B_11292

Pfam-B_9731


Pfam-B_10699

Pfam-B_10692

Pfam-B_2105


Pfam-B_6778


Pfam-B_2785

Pfam-B_5532

Pfam-B_564

Pfam-B_1359


Pfam-B_3453

Pfam-B_3597


Pfam-B_4340

Pfam-B_691


Pfam-B_7784

Pfam-B_3246

GATase

Pfam-B_4962

Pfam-B_1179

Pfam-B_3064

Pfam-B_4400

Pfam-B_9537

Pfam-B_9523

Pfam-B_9358

Pfam-B_5934


Pfam-B_3053

h o r m o n e 2

Pfam-B_7998

Pfam-B_198

Pfam-B_7413


Pfam-B_11468

Pfam-B_244

Pfam-B_2706

Pfam-B_2705

Pfam-B_1321

Pfam-B_3568

Pfam-B_3563

Pfam-B_8461


Pfam-B_4069


Pfam-B_642

Pfam-B_9957

Pfam-B_10710


Pfam-B_9259




Pfam-B_4941

Pfam-B_4338

Pfam-B_11264


Pfam-B_891

Pfam-B_1586


Pfam-B_1784


Pfam-B_7738

Pfam-B_4574

Pfam-B_7006

Pfam-B_1102

Pfam-B_1151

UPAR_LY6 Pfam-B_947Pfam-B_9938

Pfam-B_2668


Pfam-B_543

Pfam-B_454

Pfam-B_29

Pfam-B_485

Pfam-B_266


Pfam-B_7277

Pfam-B_7353

Pfam-B_7125

Pfam-B_1005

Pfam-B_929

Pfam-B_406

Pfam-B_9345

Pfam-B_11361

Pfam-B_11365

Pfam-B_4437

Pfam-B_11404

Pfam-B_192

Pfam-B_5942

Pfam-B_4778

Pfam-B_6536

Pfam-B_4798

Pfam-B_9751

Pfam-B_1530

Pfam-B_1063

i n s

Pfam-B_9982

Pfam-B_2600

Pfam-B_9623

Pfam-B_8284

Pfam-B_3360

Pfam-B_102

Pfam-B_1713

Pfam-B_4057

Pfam-B_4807

Pfam-B_5908

Pfam-B_11809

i gp k i n a s e

Pfam-B_5909

Pfam-B_1525

Pfam-B_3960

Pfam-B_10822

Pfam-B_6657

Pfam-B_629

Pfam-B_1446

Pfam-B_33

Pfam-B_11098

Pfam-B_3240

Pfam-B_322

Pfam-B_1124


Pfam-B_7727

Pfam-B_8195

Pfam-B_8161

Pfam-B_1855


Pfam-B_7392

Pfam-B_3898

Pfam-B_5406

Pfam-B_2777

Pfam-B_2839

Pfam-B_1466

Pfam-B_6943

Pfam-B_7595

Pfam-B_5136

Pfam-B_8508

Pfam-B_10215

Pfam-B_7977


Pfam-B_11029

Pfam-B_8360

Pfam-B_4107

Pfam-B_296

Pfam-B_741


Pfam-B_295

Pfam-B_6066

Pfam-B_4089

Pfam-B_3194

Pfam-B_9968

Pfam-B_2812

Pfam-B_1818

Pfam-B_4943

Pfam-B_2568

Pfam-B_4490

Pfam-B_8238

Pfam-B_5028

Pfam-B_5740

Pfam-B_7809

Pfam-B_2234

Pfam-B_453

Pfam-B_2611

Pfam-B_5714

Pfam-B_686

Pfam-B_7263

Pfam-B_10827

Pfam-B_818

Pfam-B_2763

Pfam-B_5718

Pfam-B_5208

Pfam-B_4150

Pfam-B_438


Pfam-B_5167



Pfam-B_1931

Pfam-B_3952

Pfam-B_2659

Pfam-B_3373

Pfam-B_5101

Pfam-B_2697

Pfam-B_2696

Pfam-B_7008

Pfam-B_11199

Pfam-B_4552

Pfam-B_7830

Pfam-B_1157

Pfam-B_5041


Pfam-B_8808

Pfam-B_4451

Pfam-B_8193

HSP20

Pfam-B_3730

Pfam-B_1821

S H 2

Pfam-B_1883

Pfam-B_6414


Pfam-B_8407

Pfam-B_2840

Pfam-B_2540


Pfam-B_3072

Pfam-B_1331

Pfam-B_8825

Pfam-B_2518

Pfam-B_2949

Pfam-B_1134

Pfam-B_7557

Pfam-B_6257

Pfam-B_3873Pfam-B_4700h o r m o n e

Pfam-B_6489

Pfam-B_735

Pfam-B_2016

Pfam-B_9650

Pfam-B_8145

Pfam-B_1025

Pfam-B_2827

Pfam-B_2161

Pfam-B_8760

Pfam-B_2506

Pfam-B_8944

Pfam-B_8138

Pfam-B_2701

Pfam-B_8445

Pfam-B_8987

Pfam-B_5187

Pfam-B_2002

Pfam-B_5348

Pfam-B_913

Pfam-B_10298

Pfam-B_6586

Pfam-B_2026

Pfam-B_267

Pfam-B_5316

Pfam-B_4

Pfam-B_3516

Pfam-B_466


Pfam-B_3877


Pfam-B_5085

Pfam-B_4139

Pfam-B_3557

Pfam-B_123

Pfam-B_595

Pfam-B_138

Pfam-B_395

Pfam-B_1204

Pfam-B_2892

Pfam-B_8323

Pfam-B_6619

Pfam-B_11001

Pfam-B_1510

Pfam-B_4713


Pfam-B_11737

Pfam-B_3137

Pfam-B_10918

Pfam-B_2148

Pfam-B_7200

Pfam-B_2340

Pfam-B_9546

Pfam-B_1895

Pfam-B_5272

Pfam-B_4664

Pfam-B_8019

Pfam-B_3633

Pfam-B_5089

Pfam-B_5214

Pfam-B_65

Pfam-B_6374

Pfam-B_1075

Pfam-B_121

Pfam-B_3942

Pfam-B_6098

HTH_1

Pfam-B_8157

Pfam-B_9605


Pfam-B_1996



Pfam-B_5048

Pfam-B_2006

Pfam-B_2877

Pfam-B_2878

Pfam-B_2669

Pfam-B_7836

Pfam-B_6523

Pfam-B_6959


Pfam-B_9494

Pfam-B_2013

Pfam-B_1490

Pfam-B_4749


Pfam-B_706


Pfam-B_10802

Pfam-B_668



Pfam-B_1860


Pfam-B_932

Pfam-B_9984

Pfam-B_2187

Pfam-B_5487

Pfam-B_11160


Pfam-B_10955

Pfam-B_9973

Pfam-B_1863

Pfam-B_4102

Pfam-B_666

Pfam-B_5509

Pfam-B_5516

Pfam-B_8312

Pfam-B_8364

Pfam-B_5513


Pfam-B_8419

Pfam-B_4138


Pfam-B_8289

Pfam-B_1629

Pfam-B_4159



Pfam-B_926

Pfam-B_8368

Pfam-B_8326

Pfam-B_8305

Pfam-B_11491

Pfam-B_9702


adh_z inc


Pfam-B_9624

Pfam-B_609

Pfam-B_8974

Pfam-B_3981

Pfam-B_7970

Pfam-B_8989

Pfam-B_8988

Pfam-B_4104

Pfam-B_7646

Pfam-B_9380

Pfam-B_9225

Pfam-B_6225

STphospha tase P fam-B_2126

Pfam-B_3197


Pfam-B_4410

Pfam-B_3467

Pfam-B_3933

Pfam-B_1123

Pfam-B_6227


Pfam-B_1029

Pfam-B_1693

Pfam-B_9561

Pfam-B_4088

Pfam-B_9560

Pfam-B_2682


Pfam-B_380

Pfam-B_442

Pfam-B_11331



Pfam-B_6023

Pfam-B_4852

Pfam-B_1512

Pfam-B_1286

Pfam-B_11324

Pfam-B_645

Pfam-B_59

Pfam-B_6724

Pfam-B_4087

Pfam-B_3002

Pfam-B_51

Pfam-B_8239

Pfam-B_10818

Pfam-B_4499

Pfam-B_10817

Pfam-B_11332

Pfam-B_10819

Pfam-B_10824

Pfam-B_1177

Pfam-B_194

Pfam-B_9971

Pfam-B_3196

Pfam-B_9981

Pfam-B_6544

Pfam-B_4240

Pfam-B_804

Pfam-B_1391

Pfam-B_8182

Pfam-B_4070

Pfam-B_1839

Pfam-B_9927

Pfam-B_2084

Pfam-B_2998


Pfam-B_8241

Pfam-B_8245

Pfam-B_11846

Pfam-B_8249

Pfam-B_8242

Pfam-B_3969

Pfam-B_1216

Pfam-B_2408

Pfam-B_5444

w a p

Pfam-B_9942

Pfam-B_1326

Pfam-B_4813

Pfam-B_1060

Pfam-B_9527

Pfam-B_8206

Pfam-B_8205

Pfam-B_9418

Pfam-B_1508

Pfam-B_455

Pfam-B_1622

Pfam-B_9903

Pfam-B_456


Pfam-B_390

Pfam-B_10545

Pfam-B_8912

Pfam-B_3195

v w a

Pfam-B_2178

Pfam-B_2756

f n 3Pfam-B_9563

Pfam-B_688

Pfam-B_10826

Pfam-B_687

f n 2

Pfam-B_7182

Pfam-B_3013

Pfam-B_3436

Pfam-B_6062

Pfam-B_3369

Pfam-B_461



Pfam-B_5742

c y c l i nPfam-B_8142Pfam-B_700


Pfam-B_3663

Pfam-B_83

Pfam-B_5278


Pfam-B_8139

Pfam-B_2598

Pfam-B_4062

Pfam-B_8143


Pfam-B_5741

Pfam-B_6600

Pfam-B_8975

Pfam-B_5754


Pfam-B_4059


Pfam-B_11830l i p o c a l i n

Pfam-B_3082

Pfam-B_9309

Pfam-B_11241

Pfam-B_3706

Pfam-B_11743

Pfam-B_381

Pfam-B_4610



Pfam-B_4538

Pfam-B_3916


Pfam-B_3817

Pfam-B_7442

Pfam-B_11808


Pfam-B_3580

Pfam-B_5604

Pfam-B_699

Pfam-B_1465

Pfam-B_2716

Pfam-B_2775

Pfam-B_5799

Pfam-B_135

Pfam-B_4076

Pfam-B_4513

Pfam-B_5398


Pfam-B_3897

Pfam-B_6144

Pfam-B_93

Pfam-B_2747

Pfam-B_3882

Pfam-B_5857

Pfam-B_4058

Pfam-B_7049

Pfam-B_554

Pfam-B_733

Pfam-B_4045

Pfam-B_11112

Pfam-B_407

myos in_head




Pfam-B_8192

n o t c h

Pfam-B_3080

Pfam-B_3713

Pfam-B_1807



Pfam-B_2980


Pfam-B_3066Pfam-B_11330 Pfam-B_10320Pfam-B_10305Pfam-B_822Pfam-B_1104Pfam-B_6706 Pfam-B_4967 Pfam-B_10120Pfam-B_11259

Pfam-B_4763 Pfam-B_4759 Pfam-B_4581Pfam-B_4597Pfam-B_4787 Pfam-B_3812Pfam-B_3838Pfam-B_3922Pfam-B_3923Pfam-B_3993Pfam-B_4413Pfam-B_4512Pfam-B_4528 Pfam-B_3043Pfam-B_3085Pfam-B_3132Pfam-B_3679 Pfam-B_3073Pfam-B_3362Pfam-B_3749 Pfam-B_3171Pfam-B_3190 Pfam-B_3093


Pfam-B_6201p e r o x i d a s e

Pfam-B_2704


Pfam-B_2690


Pfam-B_3368 Pfam-B_2602 Pfam-B_4433 Pfam-B_1300Pfam-B_2222 Pfam-B_4848Pfam-B_4838 Pfam-B_1838Pfam-B_3447Pfam-B_4435

Pfam-B_4660




Pfam-B_1513 Pfam-B_6256Pfam-B_1342 Pfam-B_155Pfam-B_4530 Pfam-B_128 Pfam-B_11497Pfam-B_11498Pfam-B_11545Pfam-B_4786Pfam-B_11579Pfam-B_11578Pfam-B_11795Pfam-B_11796 Pfam-B_525Pfam-B_11362Pfam-B_6787Pfam-B_11443 Pfam-B_6645Pfam-B_6789 Pfam-B_6788 Pfam-B_11351Pfam-B_11440Pfam-B_11442 Pfam-B_11363

Pfam-B_6060 Pfam-B_8848 Pfam-B_8840 Pfam-B_4849 Pfam-B_5274 Pfam-B_11095Pfam-B_4839 Pfam-B_409 Pfam-B_2474 Pfam-B_2709 Pfam-B_1367

Pfam-B_726Pfam-B_8878 COeste rase

Pfam-B_9985

Pfam-B_11889

Pfam-B_8883

Pfam-B_6228


Pfam-B_4040

g l u t s

Pfam-B_8913

Pfam-B_11253

Pfam-B_9946

b e t a - l a c t a m a s e

P fam-B_2557

Pfam-B_1453

Pfam-B_486

Pfam-B_4195

Pfam-B_18

Pfam-B_8707

Pfam-B_27

Pfam-B_2175


Pfam-B_9989

Pfam-B_5170

Pfam-B_11774

Pfam-B_2480

Pfam-B_3789

Pfam-B_2404

Pfam-B_2373

Pfam-B_7628

Pfam-B_7212

Pfam-B_3907

Pfam-B_2556

Pfam-B_3908


Pfam-B_7725


Pfam-B_1362

Pfam-B_1568

Pfam-B_9238

Pfam-B_2014

k a z a l

P fam-B_2273

Pfam-B_2768

Pfam-B_11

Pfam-B_3279

Pfam-B_14

Pfam-B_6512


Pfam-B_6571


Pfam-B_709

Pfam-B_39

Pfam-B_72Pfam-B_101

Pfam-B_1100


Pfam-B_11414

Pfam-B_197

Pfam-B_172

Pfam-B_848

c p n 6 0

Figure 5.11. The Family network (using both CPs and non-CPs)





Pfam-B_10543

Pfam-B_3829

Pfam-B_9563t r y p s i n

Pfam-B_4088

Pfam-B_453

Pfam-B_818

Pfam-B_1725

f e r 4Pfam-B_10818

Pfam-B_10826



f n 3

Pfam-B_10822

v w a

Pfam-B_4499

Pfam-B_4852

v w c Pfam-B_2763

Pfam-B_7263

Pfam-B_8241

Pfam-B_8249


w a p

Pfam-B_3969

f n 2

Pfam-B_8302


Pfam-B_5350


Pfam-B_2178

Pfam-B_9942

Pfam-B_7274

Kuni tz_BPTIPfam-B_2756

Pfam-B_2682



Pfam-B_1060


Pfam-B_6021

Pfam-B_1693

Pfam-B_7091

Pfam-B_667

Pfam-B_4126

Pfam-B_10527

Pfam-B_4659

Pfam-B_3557

Pfam-B_3059

Pfam-B_5197

Pfam-B_2131

Pfam-B_8443

Pfam-B_4139

Pfam-B_8157

Pfam-B_4130

Pfam-B_6217

h o m e o b o x

Pfam-B_7681

Pfam-B_10723

Pfam-B_1593

Pfam-B_5317


Pfam-B_8383

Pfam-B_9682

Pfam-B_7999

Pfam-B_8397

Pfam-B_4658

Pfam-B_8076

Pfam-B_11590

Pfam-B_3222

Pfam-B_2139lamin in_G


Pfam-B_8205

Pfam-B_3302

Pfam-B_4237



Pfam-B_8204


Pfam-B_6489

Pfam-B_2635

t r e f o i l

P fam-B_4044

Pfam-B_8034

Pfam-B_4128

Pfam-B_5514

Pfam-B_5750

Cys_knot

P fam-B_4945

Pfam-B_948

Pfam-B_3946

Pfam-B_814

Pfam-B_1508

Pfam-B_5709

Pfam-B_5708

Pfam-B_8912

Pfam-B_97

HSP20

Pfam-B_10703

Pfam-B_2900

Pfam-B_843

Pfam-B_309

Pfam-B_1157

Pfam-B_1068

Pfam-B_1084

Pfam-B_4045

Pfam-B_7561

Pfam-B_1684


Pfam-B_3249


Pfam-B_9264

Pfam-B_1469


Pfam-B_1614

Pfam-B_9255

Pfam-B_9274

Pfam-B_8027

Pfam-B_1683

Pfam-B_11301

Pfam-B_9267

Pfam-B_6023

Pfam-B_7809

Pfam-B_2445

Pfam-B_11331

Pfam-B_2002

Pfam-B_7970

Pfam-B_9558

Pfam-B_8989

Pfam-B_5386

Pfam-B_2568

Pfam-B_1063

Pfam-B_2234

Pfam-B_338

Pfam-B_1107

Pfam-B_7545

Pfam-B_10925

Pfam-B_2483



Pfam-B_5129

Pfam-B_7546

Pfam-B_9936

s e r p i n

Pfam-B_9006

Pfam-B_633

Pfam-B_2314

Pfam-B_1467

Pfam-B_11717

Pfam-B_11715

Pfam-B_2494


Pfam-B_1620



Pfam-B_3981

Pfam-B_1829

Pfam-B_7646

Pfam-B_1960

Pfam-B_4552

Pfam-B_4550

Pfam-B_305

Pfam-B_7659

Pfam-B_7334

Pfam-B_10473

Pfam-B_4648

Pfam-B_10471

Pfam-B_675

Pfam-B_3121

Pfam-B_3014


Pfam-B_7033


Pfam-B_1815

Pfam-B_1817


Pfam-B_7708

Pfam-B_8986

Pfam-B_1816

Pfam-B_3081

Pfam-B_11716

Pfam-B_7053

Pfam-B_8749

Pfam-B_473


Pfam-B_3015

Pfam-B_1868

Pfam-B_2583

Pfam-B_554

Pfam-B_7845

Pfam-B_5163

Pfam-B_105

Pfam-B_2284

Pfam-B_7844

Pfam-B_733

Pfam-B_7849

Pfam-B_728

Pfam-B_7843

Pfam-B_9008


Pfam-B_4606

Pfam-B_10440

Pfam-B_7982

Pfam-B_961

Pfam-B_2285

Pfam-B_7846

Pfam-B_7174

Pfam-B_7034

Pfam-B_307

Pfam-B_308

Pfam-B_352

Pfam-B_3013

Pfam-B_334

Pfam-B_7848


myos in_head

Pfam-B_3853

Pfam-B_8323

Pfam-B_1445

Pfam-B_3564

Pfam-B_2989

Pfam-B_1241


Pfam-B_38Pfam-B_65



Pfam-B_4881

Pfam-B_1234

Pfam-B_2400

Pfam-B_8322

Pfam-B_10562

Pfam-B_4664


Pfam-B_10617

Pfam-B_10578

Pfam-B_4665

Pfam-B_2327

r h v




Pfam-B_4397

Pfam-B_4666


Pfam-B_11744

Pfam-B_110

Pfam-B_7233


Pfam-B_9698

Pfam-B_5162

Pfam-B_1895

Pfam-B_2578

m i t o _ c a r rP fam-B_302

Pfam-B_901

Pfam-B_2420

Pfam-B_10760

Pfam-B_466

Pfam-B_3612

i l 8

Pfam-B_437

Pfam-B_4533

Pfam-B_4714

Pfam-B_2530

Pfam-B_5407

Pfam-B_4148

Pfam-B_4917

Pfam-B_8578

Pfam-B_4920

Pfam-B_4851



Pfam-B_7322

Pfam-B_10668

Pfam-B_11643

Pfam-B_5007

Pfam-B_376

Pfam-B_7077


Pfam-B_8193

Pfam-B_5398

Pfam-B_3169

Pfam-B_711

Pfam-B_4075


Pfam-B_1436

Pfam-B_2949

Pfam-B_4051




Pfam-B_7745

Pfam-B_225

Pfam-B_6257

Pfam-B_135

Pfam-B_735

Pfam-B_3897

Pfam-B_10758

Pfam-B_2421

Pfam-B_2729

Pfam-B_303

Pfam-B_1373

Pfam-B_1967

Pfam-B_7677

Pfam-B_2907


Pfam-B_16


Pfam-B_2026

Pfam-B_6586

Pfam-B_1047

Pfam-B_5533

Pfam-B_4Pfam-B_6588

Pfam-B_646

Pfam-B_10751

Pfam-B_4713Pfam-B_9

Pfam-B_9506

Pfam-B_233

Pfam-B_267

Pfam-B_22

Pfam-B_7155

Pfam-B_6957

Pfam-B_102

Pfam-B_7049

Pfam-B_9191

Pfam-B_2840

Pfam-B_4059

Pfam-B_6308

Pfam-B_1821

Pfam-B_9309

Pfam-B_10748

Pfam-B_3824

Pfam-B_467

Pfam-B_167

Pfam-B_10884


Pfam-B_10021

Pfam-B_4712

Pfam-B_3080

Pfam-B_7635

Pfam-B_6909

Pfam-B_11199

Pfam-B_11198

Pfam-B_11009


n o t c h


Pfam-B_8143



Pfam-B_2716

Pfam-B_275

Pfam-B_9609

Pfam-B_436

Pfam-B_7558


l i p o c a l i n

Pfam-B_1466

Pfam-B_5421

Pfam-B_2747

Pfam-B_2519


Pfam-B_4244

Pfam-B_4058

Pfam-B_1283

Pfam-B_3082

Pfam-B_8139

Pfam-B_7952

Pfam-B_8153

Pfam-B_548

Pfam-B_1807

Pfam-B_8197

Pfam-B_1136

Pfam-B_8230

Pfam-B_381

Pfam-B_2241

Pfam-B_8257


Pfam-B_747

Pfam-B_11241

Pfam-B_3706

d s r m

Pfam-B_11753

i g

Pfam-B_7392

Pfam-B_3916

Pfam-B_546

Pfam-B_1331

Pfam-B_229


Pfam-B_3713

Pfam-B_6943


Pfam-B_2982


Pfam-B_2777

Pfam-B_10702

Pfam-B_10946

Pfam-B_10707

Pfam-B_8372

Pfam-B_5510

Pfam-B_3017

Pfam-B_7981

Pfam-B_2570


Pfam-B_7329

Pfam-B_5070

Pfam-B_11771

Pfam-B_2515

Pfam-B_5591

Pfam-B_1883

Pfam-B_5073

Pfam-B_7375

Pfam-B_7798

Pfam-B_8808

Pfam-B_9904

Pfam-B_9846

Pfam-B_880

Pfam-B_3758

Pfam-B_909

RuBisCO_smal l


Pfam-B_2150

Pfam-B_4074

Pfam-B_6098

Pfam-B_606

Pfam-B_2028


Pfam-B_7653

Pfam-B_3816

Pfam-B_3935

Pfam-B_4701

EGF

Pfam-B_2963

Pfam-B_7719

Pfam-B_8944

Pfam-B_2980

Pfam-B_5316


Pfam-B_9437

s u s h i

P fam-B_11735

Pfam-B_4924

Pfam-B_11830

c y c l i n

Pfam-B_1025

Pfam-B_4076

Pfam-B_9667

Pfam-B_4976

Pfam-B_4939

Pfam-B_1589

Pfam-B_7009

Pfam-B_5808

Pfam-B_8717

Pfam-B_7311

Pfam-B_5041


Pfam-B_9983

Pfam-B_9968

Pfam-B_9980

Pfam-B_3570


Pfam-B_3933

Pfam-B_3197

Pfam-B_2126STphospha tase

P fam-B_1846

Pfam-B_7642

Pfam-B_11098

Pfam-B_3196

Pfam-B_1123

Pfam-B_8801

Pfam-B_9561

Pfam-B_10374

Pfam-B_6428

Pfam-B_2797

Pfam-B_2008

A A A

Pfam-B_1726

Pfam-B_1622


Pfam-B_11859

Pfam-B_1308

Pfam-B_4424


Pfam-B_363

Pfam-B_2723

Pfam-B_4314

Pfam-B_199

Pfam-B_255

cy toch rome_c


Pfam-B_1660

Pfam-B_1474

Pfam-B_4213

Pfam-B_364

Pfam-B_3341

Pfam-B_758

Pfam-B_9638

Pfam-B_9637

Pfam-B_1165

Pfam-B_1663

Pfam-B_2848

Pfam-B_6451

Pfam-B_8734

Pfam-B_1477

Pfam-B_10303

Pfam-B_4277

Pfam-B_166

Pfam-B_1985

Pfam-B_7411

Pfam-B_1448

Pfam-B_4134

Pfam-B_5187

Pfam-B_8676

Pfam-B_7119

Pfam-B_9041

Pfam-B_6040

Pfam-B_1368


Pfam-B_5794

Pfam-B_8516


Pfam-B_11833


Pfam-B_6965

Pfam-B_7897

Pfam-B_964

Pfam-B_6963

Pfam-B_9436

Pfam-B_8515

E1-E2_ATPase


Pfam-B_9649

Pfam-B_5335

Pfam-B_4482


Pfam-B_10130

Pfam-B_697

Pfam-B_92

Pfam-B_6291

Pfam-B_698

Pfam-B_10142

Pfam-B_10144

Pfam-B_4555

Pfam-B_2361

Pfam-B_575


Pfam-B_332

Pfam-B_6146

Pfam-B_150

Pfam-B_3714

Pfam-B_1990

Pfam-B_3971

Pfam-B_3972

Pfam-B_11192

Pfam-B_4618

Pfam-B_3943

Pfam-B_10814

Pfam-B_11130

Pfam-B_6423

Pfam-B_6535

Pfam-B_1316

Pfam-B_6622

Pfam-B_4721

Pfam-B_2034


Pfam-B_6513

Pfam-B_8910

Pfam-B_2324

Pfam-B_9838

Pfam-B_32

Pfam-B_2790

Pfam-B_9138

Pfam-B_9139

Pfam-B_968

Pfam-B_4632

Pfam-B_4894





lamin in_B

laminin_EGF

Pfam-B_5964

p o u



Pfam-B_1628

Pfam-B_3223

Pfam-B_4140


Pfam-B_2188

Pfam-B_2425

Pfam-B_2694

Pfam-B_804


Pfam-B_5718

Pfam-B_3002

f ib r inogen_C

Pfam-B_5714

Pfam-B_5793

Pfam-B_8544

Pfam-B_1644


Pfam-B_5583

Pfam-B_2684

Pfam-B_8763


Pfam-B_8547

Pfam-B_159

Pfam-B_1220

Pfam-B_6578

Pfam-B_2792

Pfam-B_993Pfam-B_58

Pfam-B_3837

Pfam-B_7350

Pfam-B_10727

Pfam-B_10371

Pfam-B_1354

Pfam-B_1079


Pfam-B_6426

Pfam-B_10370

Pfam-B_283

adh_z incPfam-B_9973

Pfam-B_9984


Pfam-B_11160

Pfam-B_10419

Pfam-B_11159

Pfam-B_9971

Pfam-B_2601

Pfam-B_10955

Pfam-B_2812

Pfam-B_40

Pfam-B_2600

Pfam-B_4093

Pfam-B_5435


Pfam-B_4083

Pfam-B_1440

Pfam-B_8224

Pfam-B_4084

c o p p e r - b i n d

Pfam-B_2581


Pfam-B_5334

Pfam-B_2157

Pfam-B_10965

Pfam-B_6144

Pfam-B_1532

Pfam-B_2362


Pfam-B_5313

Pfam-B_2434Pfam-B_2

Pfam-B_7923

Pfam-B_11763




Pfam-B_7994

Pfam-B_2887

Pfam-B_2977

Pfam-B_1826

Pfam-B_8057

Pfam-B_8095

Pfam-B_3153

Pfam-B_5360

Pfam-B_1433

Pfam-B_483

Pfam-B_626

Pfam-B_2211

Pfam-B_4142


Pfam-B_1509

h o r m o n e

Pfam-B_2339

Pfam-B_5383

Pfam-B_2801

Pfam-B_11112

Pfam-B_8445

Pfam-B_5385

Pfam-B_6534

Pfam-B_6532


Pfam-B_1291

Pfam-B_2995

Pfam-B_8096

l a m i n i n _ N t e r m Pfam-B_2608

Pfam-B_5361

Pfam-B_9192

Pfam-B_677

Pfam-B_8891

Pfam-B_8548

Pfam-B_8546

Pfam-B_5703

ce l l u l ase

Pfam-B_4255

Pfam-B_1650

Pfam-B_1908

Pfam-B_684

Pfam-B_1051

Pfam-B_8545

Pfam-B_1299

Pfam-B_3322

Pfam-B_5758

Pfam-B_3324

Pfam-B_750

Pfam-B_8927

Pfam-B_5768

Pfam-B_5669

Pfam-B_5584

Pfam-B_1301


Pfam-B_1728

Pfam-B_5747 Pfam-B_5759 Pfam-B_89Pfam-B_8919



Pfam-B_10289



Pfam-B_4591

Pfam-B_5679

Pfam-B_4224

Pfam-B_1685

Pfam-B_5812

Pfam-B_1659

Pfam-B_4619

Pfam-B_10358


Pfam-B_6617

Pfam-B_1658

Pfam-B_9935



Pfam-B_6195

Pfam-B_370

Pfam-B_2342

Pfam-B_3463


Pfam-B_10678


Pfam-B_8684

Pfam-B_402

Pfam-B_1339


Pfam-B_7299

Pfam-B_5740

Pfam-B_350

Pfam-B_10259

Pfam-B_4494

Pfam-B_7425

Pfam-B_2035

Pfam-B_7971


Pfam-B_3901

Pfam-B_3133

Pfam-B_2994

Pfam-B_5293

Pfam-B_7504


Pfam-B_2310

p i l i nP fam-B_1110

Pfam-B_7505

Pfam-B_671

Pfam-B_9931

Pfam-B_9933

TGF-be ta


Pfam-B_3883

Pfam-B_3495

Pfam-B_1695


Pfam-B_10299

Pfam-B_1349

Pfam-B_10947

Pfam-B_5348

Pfam-B_3516

Pfam-B_2377

Pfam-B_11021

Pfam-B_6380

Pfam-B_7235

Pfam-B_830

Pfam-B_20

Pfam-B_1182

Pfam-B_10246

Pfam-B_6332

Pfam-B_6340

Pfam-B_7196

Pfam-B_13

Pfam-B_123

Pfam-B_109

Pfam-B_595

Pfam-B_3508

Pfam-B_2844

Pfam-B_7197

Pfam-B_10241

Pfam-B_138

Pfam-B_395


Pfam-B_6338

Pfam-B_8105

Pfam-B_2516

Pfam-B_6341

Pfam-B_422

Pfam-B_3848

Pfam-B_10244

Pfam-B_10833

Pfam-B_424

Pfam-B_767

Pfam-B_8108

Pfam-B_9767

Pfam-B_2173

Pfam-B_5411

Pfam-B_954

s u g a r _ t r

P fam-B_2172

Pfam-B_10297

DAG_PE-bind

Pfam-B_7665

Pfam-B_1871

Pfam-B_701

Pfam-B_2773

Pfam-B_10298

Pfam-B_37


Pfam-B_6287

Pfam-B_913

Pfam-B_9661

Pfam-B_2846

Pfam-B_4594

Pfam-B_10852


Pfam-B_2789


Pfam-B_118

Pfam-B_8313

Pfam-B_288

Pfam-B_718

Pfam-B_573


Pfam-B_189

Pfam-B_6619

Pfam-B_11001


Pfam-B_10799

Pfam-B_5133

Pfam-B_5131

Pfam-B_8869


Pfam-B_1782

Pfam-B_1240

Pfam-B_196

Pfam-B_270

Pfam-B_6650

Pfam-B_1457

Pfam-B_794

Pfam-B_6342

Pfam-B_10243

Pfam-B_11487

Pfam-B_7384

Pfam-B_11123

Pfam-B_10239

Pfam-B_10478

Pfam-B_597

Pfam-B_3530

Pfam-B_2125

Pfam-B_1498

Pfam-B_4652

Pfam-B_4653

Pfam-B_552

Pfam-B_1357

Pfam-B_1574

Pfam-B_4646

Pfam-B_11256

Pfam-B_359

Pfam-B_4645


Pfam-B_11490

Pfam-B_3507

Pfam-B_1521

Pfam-B_3505

Pfam-B_1758

Pfam-B_10237

Pfam-B_2488

Pfam-B_7667

Pfam-B_727

Pfam-B_4749


Pfam-B_1706

Pfam-B_827


Pfam-B_7830

Pfam-B_1731


Pfam-B_889

Pfam-B_5809

Pfam-B_2659


r r m

Pfam-B_1125

Pfam-B_4350

S H 2

Pfam-B_7069

Pfam-B_3953

Pfam-B_6595

Pfam-B_4149

Pfam-B_3587

Pfam-B_2518

Pfam-B_2775

Pfam-B_7595

Pfam-B_8161

Pfam-B_6949

Pfam-B_4061

Pfam-B_2701

ld l_recept_a

Pfam-B_403

Pfam-B_1612

Pfam-B_796


Pfam-B_10357

Pfam-B_7008

Pfam-B_590c a d h e r i n

Pfam-B_10993

Pfam-B_6417

Pfam-B_1365

Pfam-B_4921

Pfam-B_4922



Pfam-B_4689

Pfam-B_1556

Pfam-B_6847

Pfam-B_1602

Pfam-B_10082

Pfam-B_11104

Pfam-B_6671

Pfam-B_3310

Pfam-B_7713

Pfam-B_2409



Pfam-B_4688

Pfam-B_3611

Pfam-B_9919

Pfam-B_4684

Pfam-B_2504

Pfam-B_4918


Pfam-B_8090

Pfam-B_8089

Pfam-B_3689

Pfam-B_7022

Pfam-B_4847

Pfam-B_341

Pfam-B_8759

Pfam-B_1388

Pfam-B_990

Pfam-B_4164

Pfam-B_5820

Pfam-B_6548

Pfam-B_7616

Pfam-B_313

Pfam-B_5128

Pfam-B_5175

Pfam-B_7111

Pfam-B_9916

Pfam-B_2140

Pfam-B_7282

Pfam-B_7321

Pfam-B_11644

Pfam-B_11419

Pfam-B_3952

Pfam-B_11625

Pfam-B_11584

7 t m _ 1

Pfam-B_4150



Pfam-B_989

Pfam-B_1788

Pfam-B_2703

Pfam-B_130

Pfam-B_4225

Pfam-B_7078

Pfam-B_885

Pfam-B_725

Pfam-B_9918

Pfam-B_3288

Pfam-B_5141


Pfam-B_8581

Pfam-B_3939



Pfam-B_1265


Pfam-B_5077


Pfam-B_764

Pfam-B_870

Pfam-B_7372

Pfam-B_2978

Pfam-B_9439

Pfam-B_4724

Pfam-B_10270

Pfam-B_1830

Pfam-B_1831

Pfam-B_8103

Pfam-B_2063

Pfam-B_773

Pfam-B_2463

Pfam-B_3037

Pfam-B_335

Pfam-B_3633

Pfam-B_4243

Pfam-B_5182

Pfam-B_5428

Pfam-B_6378

Pfam-B_1705

Pfam-B_1302

Pfam-B_1106


h e m o p e x i n

f i l a m e n t

P fam-B_11038

Pfam-B_1067

Pfam-B_1351

Pfam-B_969

Pfam-B_1089

Pfam-B_30

Pfam-B_160

Pfam-B_4483

Pfam-B_3540

Pfam-B_367

Pfam-B_2555

Pfam-B_607

Pfam-B_7668

Pfam-B_7756


Pfam-B_3165



Pfam-B_201



Pfam-B_11591

Pfam-B_9431

Pfam-B_888

Pfam-B_2696

Pfam-B_4451

Pfam-B_8890

Pfam-B_1600

Pfam-B_5143

Pfam-B_8758

Pfam-B_9130



Pfam-B_7052

Pfam-B_3442

Pfam-B_7648

Pfam-B_1601

Pfam-B_10010

Pfam-B_2802

Pfam-B_2675

Pfam-B_2697


Pfam-B_2223

Pfam-B_6565

Pfam-B_1965

Pfam-B_6484

Pfam-B_7117

Pfam-B_2715

Pfam-B_1961

Pfam-B_9046

Pfam-B_813

Pfam-B_330

Pfam-B_6721

z f -C2H2

Pfam-B_3173

Pfam-B_8954

Pfam-B_5413

Pfam-B_368

Pfam-B_8373

Pfam-B_7453

Pfam-B_5198

Pfam-B_3237

Pfam-B_8409

HTH_1

Pfam-B_2278

Pfam-B_415

Pfam-B_3314

Pfam-B_9650

Pfam-B_9182

Pfam-B_9043

Pfam-B_3741


Pfam-B_5048

Pfam-B_8163

Pfam-B_7669

Pfam-B_8955

Pfam-B_2416


Pfam-B_5541


Pfam-B_9276

Pfam-B_5341

Pfam-B_4129

Pfam-B_9943

Pfam-B_9140

Pfam-B_6294

Pfam-B_5842



Pfam-B_1516

Pfam-B_1971

Pfam-B_3494

Pfam-B_2120


Pfam-B_10143

Pfam-B_10131

Pfam-B_4556

Pfam-B_1970

ATP-synt_C

Pfam-B_11279



Pfam-B_4418

Pfam-B_2305

e f h a n d

Pfam-B_9648

Pfam-B_11014


Pfam-B_9047

Pfam-B_9042

Pfam-B_10726

Pfam-B_3593


Pfam-B_2759

Pfam-B_1489

Pfam-B_1490

Pfam-B_28

Pfam-B_5742

Pfam-B_11766

Pfam-B_1134


Pfam-B_4709

Pfam-B_4806

Pfam-B_8371

Pfam-B_3994

Pfam-B_7056

Pfam-B_1822

Pfam-B_3097

Pfam-B_6858

Pfam-B_9683

Pfam-B_1414

Pfam-B_8460

Pfam-B_8370

Pfam-B_8376

Pfam-B_3898

Pfam-B_2225

Pfam-B_4077

Pfam-B_3187

Pfam-B_4538

Pfam-B_5754

Pfam-B_7557

Pfam-B_4062

Pfam-B_1941

Pfam-B_7680

Pfam-B_5340

Pfam-B_5339

Pfam-B_7252

Pfam-B_3492

P H

Pfam-B_9605

Pfam-B_5540


Pfam-B_974

Pfam-B_8145

Pfam-B_3882

Pfam-B_6915

Pfam-B_6216

Pfam-B_5084

Pfam-B_4269

Pfam-B_358

Pfam-B_2304

Pfam-B_7640


Pfam-B_6845

Pfam-B_4127

Pfam-B_83

Pfam-B_5278

Pfam-B_8527

Pfam-B_1656

Pfam-B_8918

Pfam-B_1465

Pfam-B_1840


Pfam-B_10217

Pfam-B_4072

Pfam-B_8583

Pfam-B_2204

Pfam-B_8017


Pfam-B_6600

Pfam-B_3580

p k i n a s e

Pfam-B_699

Pfam-B_11809




Pfam-B_6944

Pfam-B_700

Pfam-B_5154

Pfam-B_1113

Pfam-B_2541

Pfam-B_11808

Pfam-B_461

Figure 5.12. The Family network using only non-circular patterns





Pfam-B_3310

Pfam-B_6580

Pfam-B_159

Pfam-B_1220

Pfam-B_2792

Pfam-B_10727

Pfam-B_969

Pfam-B_1351

Pfam-B_1089

Pfam-B_160

Pfam-B_3593

Pfam-B_372

Pfam-B_1788

Pfam-B_7321

Pfam-B_11104

Pfam-B_9130

Pfam-B_1388

Pfam-B_2058

Pfam-B_1602

Pfam-B_5142


Pfam-B_7022


Pfam-B_5070

Pfam-B_5072

Pfam-B_11771

Pfam-B_3165


Pfam-B_1323

Pfam-B_1056

Pfam-B_1556

Pfam-B_2504



Pfam-B_7713

Pfam-B_3256

Pfam-B_5141

Pfam-B_885


Pfam-B_725

7 t m _ 1

Pfam-B_10543

Pfam-B_6209

Pfam-B_5096

Pfam-B_691

Pfam-B_341

Pfam-B_10726

f e r 2

Pfam-B_2068

Pfam-B_932

Pfam-B_456

Pfam-B_11159

Pfam-B_6903

Pfam-B_9973

Pfam-B_4410


Pfam-B_9980

Pfam-B_6544

Pfam-B_194

Pfam-B_6227

Pfam-B_9971



Pfam-B_9129

Pfam-B_5144

Pfam-B_5820

Pfam-B_1931

Pfam-B_4533

Pfam-B_2823

Pfam-B_10756

Pfam-B_6671

Pfam-B_9861

Pfam-B_11160

Pfam-B_3467


Pfam-B_4148

Pfam-B_4356

Pfam-B_11927

Pfam-B_9310

Pfam-B_4149

Pfam-B_4918

Pfam-B_10807

Pfam-B_3373

Pfam-B_4357

Pfam-B_4920

Pfam-B_6529

Pfam-B_10741


Pfam-B_3108

Pfam-B_11545

Pfam-B_3091

Pfam-B_11320

w n t

P fam-B_11716

Pfam-B_305


Pfam-B_5538

Pfam-B_3645

Pfam-B_11714

Pfam-B_966

Pfam-B_1558

Pfam-B_106

Pfam-B_186

Pfam-B_970

Pfam-B_1953

Pfam-B_9936

Pfam-B_9820

Pfam-B_9819

Pfam-B_1232

Pfam-B_11649

Pfam-B_9822

Pfam-B_1233

Pfam-B_355

Pfam-B_9818

Pfam-B_9817

k e t o a c y l - s y n t

P fam-B_4873

Pfam-B_11511

Pfam-B_177

Pfam-B_3672

Pfam-B_9821

Pfam-B_53

Pfam-B_591

Pfam-B_126

Pfam-B_10508

Pfam-B_6862

Pfam-B_927

Pfam-B_517

Pfam-B_6822

Pfam-B_1761

Pfam-B_5537

Pfam-B_3045

Pfam-B_7357

Pfam-B_7359



Pfam-B_36

Pfam-B_17

Pfam-B_52

Pfam-B_24

Pfam-B_287

Pfam-B_133

Pfam-B_2988

Pfam-B_6736

Pfam-B_1123

Pfam-B_10668

Pfam-B_3611

Pfam-B_6674

Pfam-B_5234

Pfam-B_2546


Pfam-B_3586

Pfam-B_8578

Pfam-B_5175

Pfam-B_4688



Pfam-B_4919

Pfam-B_8090

Pfam-B_2479

Pfam-B_4350

Pfam-B_9331

Pfam-B_4689

Pfam-B_4917

Pfam-B_6183

Pfam-B_940

Pfam-B_8581

Pfam-B_8580

Pfam-B_5186

Pfam-B_1265

Pfam-B_11419

Pfam-B_3945

Pfam-B_3939

Pfam-B_2764

Pfam-B_2140


Pfam-B_5185

Pfam-B_8696

Pfam-B_1365


Pfam-B_9916

Pfam-B_3962

Pfam-B_3940

Pfam-B_2618

t h i o r e d

Pfam-B_3055

Pfam-B_8108

Pfam-B_3944

Pfam-B_1158

Pfam-B_1890

t o x i n

Pfam-B_10082

Pfam-B_7616

Pfam-B_4258

Pfam-B_11584

Pfam-B_865

Pfam-B_4150

Pfam-B_3587

Pfam-B_8758

ld l_recept_a

Pfam-B_6847

Pfam-B_6923

Pfam-B_6547

Pfam-B_11644

Pfam-B_7648


i l 8


Pfam-B_1299

Pfam-B_10884

Pfam-B_2184Pfam-B_2199Pfam-B_2198Pfam-B_2218Pfam-B_884Pfam-B_8803Pfam-B_2227Pfam-B_2317 Pfam-B_1864Pfam-B_2175Pfam-B_3218Pfam-B_945 Pfam-B_917Pfam-B_1463 Pfam-B_1754Pfam-B_10311Pfam-B_1522Pfam-B_1545 Pfam-B_3728Pfam-B_8761Pfam-B_1338Pfam-B_6172Pfam-B_1366Pfam-B_1929Pfam-B_1367 Pfam-B_1231 Pfam-B_11894

Pfam-B_3699 Pfam-B_10530 t h i o l a s e Pfam-B_1147 Pfam-B_2806Pfam-B_9964Pfam-B_11726Pfam-B_2462 Pfam-B_9938Pfam-B_10259Pfam-B_9898 Pfam-B_10932 Pfam-B_4433Pfam-B_9657Pfam-B_1190 Pfam-B_9656Pfam-B_9410Pfam-B_9880Pfam-B_10545Pfam-B_5129Pfam-B_5220Pfam-B_5221 Pfam-B_1807

Pfam-B_3110 Pfam-B_2577Pfam-B_2985 Pfam-B_2559Pfam-B_2560Pfam-B_6596

thy rog lobu l i n_1 P fam-B_8740Pfam-B_8741Pfam-B_1121Pfam-B_2521Pfam-B_7851 Pfam-B_5258Pfam-B_5257

COeste rasePfam-B_2114Pfam-B_1407Pfam-B_1415Pfam-B_1122Pfam-B_1426Pfam-B_1292Pfam-B_1432

Pfam-B_2662 Pfam-B_2647 Pfam-B_4915 Pfam-B_1838Pfam-B_1131 Pfam-B_5133 Pfam-B_5821Pfam-B_10926

Pfam-B_10202Pfam-B_2312Pfam-B_8179Pfam-B_5803 Pfam-B_3921Pfam-B_2571 Pfam-B_9676Pfam-B_5273 Pfam-B_4175 Pfam-B_1346Pfam-B_5978Pfam-B_9470 Pfam-B_9219Pfam-B_9220 Pfam-B_217Pfam-B_2449Pfam-B_2448 Pfam-B_8208



Pfam-B_9000 Pfam-B_8322Pfam-B_4415Pfam-B_8563Pfam-B_8564Pfam-B_9320Pfam-B_3918Pfam-B_9409 Pfam-B_5830



Pfam-B_1484

Pfam-B_2507

Pfam-B_3820

Pfam-B_3821

Pfam-B_6701

Pfam-B_7593

Pfam-B_329

Pfam-B_9030


Pfam-B_3579

Pfam-B_1087

Pfam-B_811

Pfam-B_857

Pfam-B_7594

Pfam-B_4720

Pfam-B_8702

Pfam-B_8703

Pfam-B_6029

Pfam-B_4719

Pfam-B_6317

Pfam-B_1178

Pfam-B_1431

Pfam-B_1130

Pfam-B_493

Pfam-B_9304


Pfam-B_538

Pfam-B_1320

Pfam-B_2268





Pfam-B_8712


Pfam-B_7654

Pfam-B_2023

Pfam-B_679

Pfam-B_4704


Pfam-B_7285

Pfam-B_1579

Pfam-B_4186

Pfam-B_8650



Pfam-B_835

Pfam-B_7405

Pfam-B_682

Pfam-B_9146

Pfam-B_5835

Pfam-B_1586

Pfam-B_3061

Pfam-B_200

Pfam-B_8845Pfam-B_3437Pfam-B_8584Pfam-B_2593Pfam-B_2595Pfam-B_6719Pfam-B_2943 Pfam-B_2224

Pfam-B_7818

Pfam-B_2415

Pfam-B_5222

Pfam-B_1098Pfam-B_4973Pfam-B_2865Pfam-B_2866Pfam-B_1847Pfam-B_183Pfam-B_321Pfam-B_2048Pfam-B_2663Pfam-B_2205 Pfam-B_11399Pfam-B_2938Pfam-B_8589


Pfam-B_10363

Pfam-B_6512

h i s t o n e

Pfam-B_3364Pfam-B_4144Pfam-B_3242 Pfam-B_7782 Pfam-B_304Pfam-B_11891Pfam-B_2828

Pfam-B_2833


Pfam-B_8848

Pfam-B_3070



Pfam-B_2398

Pfam-B_7496

Pfam-B_3847


Pfam-B_3305 a d h _ s h o r t

P fam-B_4050

Pfam-B_10693




Pfam-B_2770

Pfam-B_10659

Pfam-B_1363

Pfam-B_6570

Pfam-B_4697

Pfam-B_1735

Pfam-B_10658

Pfam-B_5064




Pfam-B_11

Pfam-B_3279

Pfam-B_3301


Pfam-B_5720


Pfam-B_14


Pfam-B_6994

Pfam-B_1172

Pfam-B_1806

Pfam-B_1653

Pfam-B_6996

Pfam-B_2121

Pfam-B_10588Pfam-B_1412c o n n e x i nPfam-B_2284Pfam-B_3539Pfam-B_11321 Pfam-B_4821Pfam-B_2346 Pfam-B_8473

Pfam-B_740 Pfam-B_739Pfam-B_2101Pfam-B_7232 Pfam-B_6140Pfam-B_6256Pfam-B_2864 Pfam-B_3348Pfam-B_1342 Pfam-B_6138Pfam-B_6444


Pfam-B_6657Pfam-B_5439Pfam-B_662Pfam-B_1582Pfam-B_6577Pfam-B_8797Pfam-B_650Pfam-B_10775Pfam-B_3197Pfam-B_7134

Pfam-B_319Pfam-B_303Pfam-B_228Pfam-B_2320 Pfam-B_1967Pfam-B_962 Pfam-B_227Pfam-B_2300 Pfam-B_1814

Pfam-B_1378 Pfam-B_11324Pfam-B_3640Pfam-B_11292 Pfam-B_11264Pfam-B_5678Pfam-B_11366 HSP70 Pfam-B_11162Pfam-B_11253


Pfam-B_647Pfam-B_3046Pfam-B_6327 Pfam-B_6202Pfam-B_7580Pfam-B_7212Pfam-B_7478Pfam-B_5118Pfam-B_9194Pfam-B_3981 Pfam-B_7581 Pfam-B_729Pfam-B_7646 Pfam-B_481 Pfam-B_9238



A A ACys_knotP fam-B_8034Pfam-B_10120

Pfam-B_7468Pfam-B_265 Pfam-B_433Pfam-B_1047Pfam-B_1044 Pfam-B_5942 Pfam-B_8795Pfam-B_1094



Pfam-B_2163Pfam-B_1416Pfam-B_860 Pfam-B_7464Pfam-B_6515 Pfam-B_8802Pfam-B_8894Pfam-B_3423Pfam-B_8545 Pfam-B_4858



Pfam-B_7551 Pfam-B_7465Pfam-B_9521Pfam-B_9876 Pfam-B_7329Pfam-B_750Pfam-B_8012




s i g m a 5 4


Pfam-B_3525

Pfam-B_2001


Pfam-B_7975

Pfam-B_7146

Pfam-B_6323

Pfam-B_1595

Pfam-B_2838

Pfam-B_3905

Pfam-B_10218

Pfam-B_6480

Pfam-B_5558

Pfam-B_8489

Pfam-B_6478


Pfam-B_1596

Pfam-B_6428Pfam-B_6118Pfam-B_6045Pfam-B_5686Pfam-B_5062Pfam-B_504 Pfam-B_3405Pfam-B_3378Pfam-B_3552

Pfam-B_780

Pfam-B_3614


Pfam-B_7153



Pfam-B_470

Pfam-B_9731


Pfam-B_8461



Pfam-B_5131

Pfam-B_8585

HSP20


Pfam-B_1737


Pfam-B_10716

k a z a l

P fam-B_9989


Pfam-B_2480


Pfam-B_965

Pfam-B_5944

Pfam-B_301

Pfam-B_1278

Pfam-B_2003


Pfam-B_8253

Pfam-B_362

Pfam-B_4552

Pfam-B_2803

Pfam-B_250

Pfam-B_5618

Pfam-B_8749

Pfam-B_7569

Pfam-B_6032

Pfam-B_3268

Pfam-B_2591

Pfam-B_3426



Pfam-B_4511

Pfam-B_5548

Pfam-B_1090

Pfam-B_1224

Pfam-B_1180

Pfam-B_4146

Pfam-B_6173

Pfam-B_1485

Pfam-B_1509

Pfam-B_6180

Pfam-B_10902

Pfam-B_1566

Pfam-B_3031

Pfam-B_7006

Pfam-B_713


Pfam-B_2888



Pfam-B_4344

tsp_1


Pfam-B_2627

Pfam-B_3443

Pfam-B_10602

Pfam-B_2406


Pfam-B_876Pfam-B_7410Pfam-B_4677Pfam-B_10706Pfam-B_609adh_z incPfam-B_6632Pfam-B_96

Pfam-B_787


Pfam-B_107

Pfam-B_1183

Pfam-B_1591

Pfam-B_3343

Pfam-B_1281

Pfam-B_3943

Pfam-B_1990

Pfam-B_4913

Pfam-B_10303

Pfam-B_1985

Pfam-B_2826

Pfam-B_1448



Pfam-B_6485

Pfam-B_1368

Pfam-B_6484

Pfam-B_9650

Pfam-B_9041

Pfam-B_9045


Pfam-B_166

Pfam-B_7117

Pfam-B_9182

Pfam-B_9047

Pfam-B_1965


Pfam-B_8516

Pfam-B_7462

Pfam-B_5571

Pfam-B_5959

Pfam-B_4277

Pfam-B_6964

Pfam-B_6963

Pfam-B_8515

Pfam-B_1893

Pfam-B_5048

Pfam-B_9046

Pfam-B_2715

z f -C2H2

Pfam-B_7897

Pfam-B_7118

Pfam-B_7664

Pfam-B_9042


Pfam-B_10904

Pfam-B_11280

Pfam-B_889

Pfam-B_3516

Pfam-B_6369

Pfam-B_6371

Pfam-B_9662

Pfam-B_10947



Pfam-B_1110

Pfam-B_982

Pfam-B_11896

Pfam-B_1832

Pfam-B_1116

ABC_tran

Pfam-B_162

Pfam-B_3428

Pfam-B_11146

Pfam-B_1203

Pfam-B_148


Pfam-B_762

Pfam-B_2400

Pfam-B_6716

Pfam-B_803

Pfam-B_5173


Pfam-B_730

Pfam-B_11737

Pfam-B_3988

Pfam-B_4261

Pfam-B_3137

Pfam-B_3985

Pfam-B_1820

Pfam-B_3136

Pfam-B_2148

Pfam-B_731

Pfam-B_9912

Pfam-B_3327

Pfam-B_9974

Pfam-B_2814

Pfam-B_1349

Pfam-B_2304

Pfam-B_2188

Pfam-B_5339

Pfam-B_2131

Pfam-B_6858

Pfam-B_4894

Pfam-B_5340

Pfam-B_8376

Pfam-B_7686

Pfam-B_5198

Pfam-B_1593

Pfam-B_2630

Pfam-B_10719

Pfam-B_5199

Pfam-B_9684

Pfam-B_7680


Pfam-B_5413

Pfam-B_3492

Pfam-B_4945

Pfam-B_3222

Pfam-B_8443

Pfam-B_5109

Pfam-B_8371

Pfam-B_5196


Pfam-B_8373

Pfam-B_8372

Pfam-B_6916

Pfam-B_7681

Pfam-B_11590

Pfam-B_1800

Pfam-B_948

Pfam-B_5512

Pfam-B_8383

Pfam-B_10722

Pfam-B_358

Pfam-B_8460

Pfam-B_4658

Pfam-B_6859

Pfam-B_4701

Pfam-B_9605



Pfam-B_2291

Pfam-B_814

Pfam-B_1143


Pfam-B_3947

r r m

Pfam-B_8163

Pfam-B_2147

Pfam-B_5540

Pfam-B_915

Pfam-B_3574

Pfam-B_8717

Pfam-B_3237

Pfam-B_7836

Pfam-B_10589Pfam-B_60zf-CCHC

Pfam-B_674

C 2

Pfam-B_582






Pfam-B_38

Pfam-B_5117

Pfam-B_8019

Pfam-B_3703

Pfam-B_4664

Pfam-B_10558


Pfam-B_5153

Pfam-B_4881

Pfam-B_11744

Pfam-B_10578

Pfam-B_10617

Pfam-B_10581

Pfam-B_4666

Pfam-B_3564

Pfam-B_10580

Pfam-B_4665

z n - p r o t e a s e

Pfam-B_7233

Pfam-B_10707

Pfam-B_700

Pfam-B_4610

Pfam-B_7209

Pfam-B_1331

Pfam-B_747

Pfam-B_1466

s u s h i

P fam-B_4295

Pfam-B_2881

H L H

Pfam-B_3122


Pfam-B_3987

Pfam-B_31

Pfam-B_75


Pfam-B_2893

Pfam-B_5654

Pfam-B_1533

Pfam-B_1729

Pfam-B_10587

Pfam-B_4924

l i p o c a l i n

P fam-B_11830

Pfam-B_4976

Pfam-B_1068

Pfam-B_1589

Pfam-B_2420

Pfam-B_16

Pfam-B_10760

Pfam-B_466

Pfam-B_10736

Pfam-B_2906

Pfam-B_2643

Pfam-B_2421

Pfam-B_605

Pfam-B_4712

Pfam-B_5533

Pfam-B_2907

Pfam-B_2729

Pfam-B_5316

Pfam-B_8323

GTP_EFTU

Pfam-B_1739

Pfam-B_9241

Pfam-B_1373

Pfam-B_7695

Pfam-B_9

Pfam-B_267


Pfam-B_10758


Pfam-B_4713

Pfam-B_6588

Pfam-B_233

Pfam-B_2419

Pfam-B_3824

Pfam-B_646

Pfam-B_10751

Pfam-B_22

a m i n o t r a n

Pfam-B_8095

Pfam-B_1440

Pfam-B_1433

Pfam-B_8223

Pfam-B_3200

TGF-be ta

Pfam-B_8222

lamin in_B

Pfam-B_4036

Pfam-B_5589

Pfam-B_2464


Pfam-B_1928

Pfam-B_5780

Pfam-B_4888

Pfam-B_6182

Pfam-B_10625

Pfam-B_6622


Pfam-B_914

Pfam-B_5815

Pfam-B_6451

Pfam-B_6423

Pfam-B_199

Pfam-B_1728

Pfam-B_1685

Pfam-B_4618

Pfam-B_6415

Pfam-B_5687

Pfam-B_8891

Pfam-B_4593

Pfam-B_1658

Pfam-B_549

Pfam-B_10215


Pfam-B_1520


Pfam-B_706

Pfam-B_758

Pfam-B_363

Pfam-B_1680

Pfam-B_9638

Pfam-B_8919

Pfam-B_5583

Pfam-B_4255

Pfam-B_5768

Pfam-B_5669

Pfam-B_3529

Pfam-B_4592

Pfam-B_6513

Pfam-B_4591

Pfam-B_4224

Pfam-B_10289

Pfam-B_5679

Pfam-B_10358

Pfam-B_5812

Pfam-B_4619

Pfam-B_2383


Pfam-B_6424

Pfam-B_11192

Pfam-B_3341

Pfam-B_9637

Pfam-B_3424

Pfam-B_4314


Pfam-B_4424

Pfam-B_5703

Pfam-B_1301

Pfam-B_684ce l l u l ase

Pfam-B_8548

Pfam-B_5584

Pfam-B_8546

Pfam-B_1308

Pfam-B_1660

Pfam-B_1474

Pfam-B_6384

Pfam-B_4213

Pfam-B_2187

Pfam-B_8326

Pfam-B_295

Pfam-B_5509


Pfam-B_1848

Pfam-B_8368


Pfam-B_5487

Pfam-B_1863

Pfam-B_1333

Pfam-B_5482

Pfam-B_294

Pfam-B_8312

Pfam-B_3434

Pfam-B_3303

Pfam-B_5793

Pfam-B_8544

Pfam-B_296

Pfam-B_4107

Pfam-B_1644

Pfam-B_2684

Pfam-B_8547




Pfam-B_1853

Pfam-B_629

Pfam-B_1860

Pfam-B_800

Pfam-B_4154

Pfam-B_3240

Pfam-B_576

Pfam-B_33

Pfam-B_8430

Pfam-B_1446

Pfam-B_4141

Pfam-B_1627

Pfam-B_1629

Pfam-B_8289

Pfam-B_8358

Pfam-B_9624

Pfam-B_9623

Pfam-B_4011

Pfam-B_8439

Pfam-B_7977

Pfam-B_2616

Pfam-B_8423

Pfam-B_4159

Pfam-B_10799

Pfam-B_10802

Pfam-B_347

Pfam-B_668

Pfam-B_2648

Pfam-B_8493

Pfam-B_1285

Pfam-B_799

Pfam-B_4212

cy toch rome_c

Pfam-B_255

Pfam-B_1165

Pfam-B_2848

Pfam-B_1477

Pfam-B_8734

Pfam-B_1663

Pfam-B_1846

Pfam-B_5350

Pfam-B_1512

Pfam-B_9942

Pfam-B_3970v w c

Pfam-B_9563


Pfam-B_9544

Pfam-B_7477

Pfam-B_7200

Pfam-B_3875

Pfam-B_757

Pfam-B_4409

Pfam-B_540

Pfam-B_10601

Pfam-B_7482

Pfam-B_10015


Pfam-B_1390

Pfam-B_282


Pfam-B_1927

Pfam-B_9927

Pfam-B_6038

Pfam-B_9648

Pfam-B_2998

Pfam-B_2084

Pfam-B_1391

Pfam-B_2085

Pfam-B_11046

Pfam-B_8676

Pfam-B_10145


Pfam-B_4138

e f h a n d

Pfam-B_2305

Pfam-B_11279

Pfam-B_4418

Pfam-B_10918

Pfam-B_3876

Pfam-B_8210

Pfam-B_8482

Pfam-B_5958

Pfam-B_9546

Pfam-B_1531

Pfam-B_1801


COX1

Pfam-B_23

Pfam-B_1730

Pfam-B_4474

Pfam-B_442

Pfam-B_2408

Pfam-B_2694

Pfam-B_7481

Pfam-B_478

Pfam-B_6516

Pfam-B_11209

Pfam-B_452


Pfam-B_2528


Pfam-B_5112w a p

i n s

Pfam-B_1286

Pfam-B_455

Pfam-B_8204

Pfam-B_390

Pfam-B_8206

Pfam-B_804

Pfam-B_2763

Pfam-B_1063

Pfam-B_8205

Pfam-B_4490

Pfam-B_8862

f n 1

Pfam-B_5709

Pfam-B_9527

Pfam-B_5708

f ib r inogen_C

Pfam-B_1274

Pfam-B_6033

Pfam-B_688

Pfam-B_7263

Pfam-B_7286

Pfam-B_687

Pfam-B_453

Pfam-B_818

Pfam-B_11846

Pfam-B_5718

Pfam-B_2797

Pfam-B_4240

Pfam-B_1216

f n 3

Pfam-B_4852

Pfam-B_10826

Pfam-B_10827

Pfam-B_7274

t r y p s i n

Pfam-B_1177

Pfam-B_10822

Pfam-B_1060

Pfam-B_686

Pfam-B_4813

Pfam-B_10817


Pfam-B_1693

Pfam-B_4475

Pfam-B_9562f e r 4

Pfam-B_8239

Pfam-B_380

Pfam-B_4087



Pfam-B_5028


Pfam-B_5444

Kuni tz_BPTI

v w a

Pfam-B_4499


Pfam-B_1326

Pfam-B_10819

Figure 5.13. The Family network using only circular patterns


Chapter 6

Circular Pattern Discovery

6.1 Introduction

The circular pattern discovery (CPD) problem is to identify “interesting” circular patterns in a text T. Here, “interesting” is typically defined in terms of constraints on the search, for instance, based on occurrence frequency, pattern length, or proximity between patterns. When T is a database of sequences, additional constraints may be imposed, for example the coverage or quorum constraint [86]. In biological applications, for instance, interesting circular patterns are likely to have biological relevance, e.g., they could point to proteins with related functions, even with low sequence similarity.

The CPD problem is related to the better-known circular pattern matching (CPM) problem discussed in Chapter 5.

To our knowledge, no existing work has explicitly studied pattern discovery involving circular patterns. Motivated by the increasing significance of circular permutations in various applications, from computational biology to pattern analysis, we propose methods to address the circular pattern discovery problem.

Main Results. We define and solve the ECPD and ACPD problems to find the “interesting” circular patterns, as defined using specified constraints.


In this chapter, we present an algorithm to solve the ECPD problem. We also present two algorithms to solve the ACPD problem: one based on Maes’ [81] CPM algorithm, and another based on our ACPM2 algorithm. On average, the ACPD algorithm based on ACPM2 is faster than the one based on Maes’ [81] CPM algorithm.

The following two theorems represent our main contribution on the CPD problem.

Theorem 6.1: Given a sequence database SeqDB with r sequences and the parameters $m_1, m_2, k, f, g$, the ECPD algorithm uses suffix trees and suffix links to solve the exact circular pattern discovery problem in $O(m_2^2 N)$ time, where N is the total number of symbols in SeqDB.

Theorem 6.2: Given a sequence database SeqDB with r sequences and the parameters $m_1, m_2, k, f, g$, the ACPD algorithm based on the ACPM2 algorithm uses a suffix array to solve the ACPD problem in $O(k m_2^3 N^2)$ time in the worst case and $O(k m_2^3 N)$ time on average, where N is the total number of symbols in SeqDB.

Organization. In the next section, we define the circular pattern discovery problems. The algorithm for the ECPD problem is presented and analyzed in Section 6.3. The ACPD problem is introduced and solved in Section 6.4. In Section 6.5, we show experiments on analyzing circular permutations in multidomain proteins using our algorithms. In Section 6.6, we summarize our work on circular pattern discovery problems.

6.2 The Circular Pattern Discovery Problem

Although a lot of progress has been made in pattern discovery, sequential data mining, and pattern matching, little attention has been paid to pattern discovery with cyclic patterns. While the pattern matching problem assumes that a pattern of interest is provided before matching can start, pattern discovery does not require any initial pattern. Starting with no specific query pattern, one may seek to find “interesting” circular substrings within the sequence, or database of sequences, for example, circular substrings that occur at least a minimum number of times. We call this the circular pattern discovery (CPD) problem. We consider three variations of the CPD problem below.


Exact Circular Pattern Discovery Problem (ECPD). Given a text T and a number f, return all the high-frequency cyclic substrings s (i.e., with frequency ≥ f) and their respective circular shifts in T.

Approximate Circular Pattern Discovery Problem (ACPD). Given a text T and the numbers f and k, return all the high-frequency circular substrings s (i.e., with frequency ≥ f) and their respective circular shifts with k-approximate matches in T.

Circular Pattern Discovery Problem (parameterized form). Given a sequence database Q with r sequences and the parameters $m_1, m_2, k, f, g$, return all circular substrings s and their respective circular shifts that have a k-approximate match in Q, have a high occurrence frequency (frequency ≥ f), occur in at least g sequences (where g ≤ r), and whose matching substrings have length m satisfying the constraint $m_1 \le m \le m_2$.

The parameter g models the coverage or quorum constraints often imposed in motif discovery for biological sequences [86]. For instance, we may want a subsequence to appear in a certain proportion of the members of a protein family before we accept the subsequence as a valid motif for that family. The ECPD problem corresponds to the case with k = 0. Clearly, the parameterization could be modified to impose more or fewer constraints on the discovery, as desired.

The Challenge. To see the difficulty involved in the CPD problem, we can consider the complexity of the naïve algorithm for the problem. First, consider pattern discovery for patterns of a specific length, say m (that is, $m_1 = m_2 = m$), on a text T of length n. For the ECPD problem, we will have $(n - m + 1)$ substrings, each with m cyclic shifts, with each m-length pattern requiring $O(nm)$ time to search in T. The overall time will be in $O(m^2 n^2)$. We can improve this to $O(m^2 n)$ using standard linear-time pattern matching algorithms, such as the KMP or Boyer-Moore algorithms. When we consider a range of pattern lengths (for example, $m_1 \le m \le m_2$), the overall complexity becomes $O(m_2^3 n)$. When no length is specified, we will need to consider all possible pattern lengths; hence, the overall time complexity will be in $O(n^4)$. Using a similar analysis, for the ACPD problem, we will need time in $O(k m^2 n^2)$ for one single pattern length m, and in $O(k m_2^3 n^2)$ for a range of pattern lengths ($m_1 \le m \le m_2$). We have assumed the use of Ukkonen’s $O(kn)$ algorithm for k-approximate matching of a given m-length substring of the text. With no specified constraint on the length, we will require time in $O(k n^5)$ for the parameterized ACPD problem.
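The naïve exact procedure described above can be sketched in a few lines of Python. This is only an illustrative baseline under our own assumptions, not the ECPD algorithm of this chapter: the function names (`rotations`, `count_occurrences`, `naive_ecpd`) are hypothetical, occurrences are counted with overlaps, and the frequency of a pattern is taken to be the total number of occurrences of all its distinct circular shifts.

```python
def rotations(p: str) -> set[str]:
    """Return all distinct circular shifts of p."""
    return {p[i:] + p[:i] for i in range(len(p))}


def count_occurrences(text: str, pattern: str) -> int:
    """Count exact occurrences of pattern in text, overlaps included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_ecpd(text: str, m: int, f: int) -> dict[str, int]:
    """Map each m-length substring to the total occurrence count of its
    circular shifts, keeping only those with count >= f."""
    result: dict[str, int] = {}
    for i in range(len(text) - m + 1):
        p = text[i:i + m]
        if p in result:
            continue  # this substring was already evaluated
        total = sum(count_occurrences(text, r) for r in rotations(p))
        if total >= f:
            result[p] = total
    return result
```

Every candidate substring triggers up to m full scans of the text, which is the source of the high cost discussed above.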

The CPD problem is related to the CPM problem, but is much more complicated. Applying CPM algorithms directly would either assume that we already know what we are trying to “discover”, or require an exhaustive consideration of all cyclic substrings. In the following, we first propose a fast ECPD algorithm that exploits suffix links, which are typically part of a standard suffix tree. We then address the ACPD problem using suffix arrays.

6.3 The ECPD Algorithm

Our ECPD algorithm is based on our algorithm for the ECPM problem, presented in Section 5.2.

First, the ECPD algorithm (Algorithm 6.1) builds a suffix tree ST from the sequence T. Then the algorithm checks each m-length pattern from T for possible circular pattern matches, using the suffix tree. Checking one m-length pattern costs $O(m)$ time using the ECPM algorithm. There are $O((m_2 - m_1 + 1) N)$ patterns whose length m is in the range $[m_1, m_2]$. So the time complexity of the ECPD algorithm is in $O((m_2 - m_1 + 1)(m_2 + m_1) N) = O(m_2^2 N)$.
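As a quick check of this bound, the summation step can be written out explicitly, assuming at most N substrings of each length m and $O(m)$ time per ECPM query, as stated in the text:

\[
\sum_{m=m_1}^{m_2} N \cdot O(m)
  = O\!\left( N \cdot \frac{(m_2 - m_1 + 1)(m_1 + m_2)}{2} \right)
  = O(m_2^2 N).
\]

The last step uses $m_2 - m_1 + 1 \le m_2$ and $m_1 + m_2 \le 2 m_2$.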

Algorithm 6.1: ECPD Algorithm

ECPD(T, N, m1, m2, f)
    ST ← BuildSuffixTree(T)
    for m = m1 to m2 do
        for each m-length substring P of T do
            ηocc ← ECPM(ST, P)
            if ηocc ≥ f then
                Output the circular pattern P
            end if
        end for
    end for
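For experimenting with the outer loop of Algorithm 6.1 without a suffix tree, one possible stand-in is to group substrings by a canonical rotation. The sketch below is a swapped-in technique, not the suffix-tree and suffix-link method of this chapter: `canonical_rotation` and `ecpd_by_canonical_form` are hypothetical names, the canonical form is the lexicographically smallest circular shift, and the count of a class is the number of text positions whose m-length substring belongs to it. It assumes $m_1 \ge 1$.

```python
from collections import Counter


def canonical_rotation(p: str) -> str:
    """Lexicographically smallest circular shift of p (quadratic, for clarity)."""
    return min(p[i:] + p[:i] for i in range(len(p)))


def ecpd_by_canonical_form(text: str, m1: int, m2: int, f: int) -> dict[str, int]:
    """Count, for every length in [m1, m2], how many substrings fall in each
    circular-equivalence class, and keep the classes with count >= f."""
    counts: Counter[str] = Counter()
    for m in range(m1, m2 + 1):
        for i in range(len(text) - m + 1):
            counts[canonical_rotation(text[i:i + m])] += 1
    return {p: c for p, c in counts.items() if c >= f}
```

Each reported key stands for a whole class of circular shifts, so a caller interested in the individual shifts would still need to enumerate them.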

Based on the foregoing, we summarize our results on the ECPD problem in the following theorem:

Theorem 6.1: Given a sequence database SeqDB with r sequences and the parameters $m_1, m_2, k, f, g$, the ECPD algorithm uses suffix trees and suffix links to solve the exact circular pattern discovery problem in $O(m_2^2 N)$ time, where N is the total number of symbols in SeqDB.

6.4 The ACPD Algorithm

We first apply an existing ACPM algorithm directly to the ACPD problem. This models the use of a generic ACPM algorithm for the ACPD problem. In particular, we modify Maes’ ACPM algorithm to solve the ACPD problem. Subsequently, we describe our ACPD algorithm and compare the two methods.

6.4.1 ACPD using Maes’ Algorithm

Algorithm 6.2 shows a method to solve the ACPD problem based on Maes’ algorithm. The algorithm compares each m-length pattern from the text with each (m + k)-length subtext, where $m_1 \le m \le m_2$. For each length m, there are $\min\{O(N), O(|\Sigma|^m)\}$ patterns and $O(N)$ subtexts, so there are $O(N^2)$ comparisons. The total number of comparisons is in $O((m_2 - m_1) N^2)$. For each comparison, the time cost is in $O(m^2 \log m)$. Thus, the time complexity of Algorithm 6.2 is

$\sum_{m=m_1}^{m_2} O(N^2 m^2 \log m) = O(m_2^3 N^2 \log m_2)$.
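To make the comparison step concrete, the following sketch computes a cyclic edit distance by brute force: it runs the standard edit-distance dynamic program once per circular shift of the pattern and keeps the minimum. This is explicitly not Maes’ algorithm, only a simple reference against which a faster routine could be checked. The names `edit_distance` and `cyclic_edit_distance` are hypothetical, unit costs are assumed for insertions, deletions, and substitutions, and the pattern is assumed to be non-empty.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, computed row by row in O(|a| * |b|) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def cyclic_edit_distance(p: str, s: str) -> int:
    """Minimum edit distance between s and any circular shift of p."""
    return min(edit_distance(p[i:] + p[:i], s) for i in range(len(p)))
```

A window of the text would then count as a k-approximate circular match of p when `cyclic_edit_distance(p, window) <= k`.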

Landau’s Algorithm

Algorithm 6.3 is based on Landau’s incremental algorithm [67]. Similar to Maes’ algorithm, this algorithm compares each m-length pattern with each (m + k)-length subtext, where m is the length of the pattern and $m_1 \le m \le m_2$. For each length m, there are $\min\{O(N), O(|\Sigma|^m)\}$ patterns and $O(N)$ subtexts, so there are $O(N^2)$ comparisons. The total number of comparisons is $O((m_2 - m_1) N^2)$. For each comparison, the time cost is $O(km)$. Thus, the time complexity of Algorithm 6.3 is

$\sum_{m=m_1}^{m_2} O(N^2 k m) = O(k m_2^2 N^2)$.


Algorithm 6.2: ACPD Based on Maes’ algorithm

ACPD-MAES(T, N, k, m1, m2, f)
    for m = m1 to m2 do
        for each m-length substring P of T do
            ηocc ← 0
            for i = 1 to N − m − k do
                if MAESALGORITHM(P, T[i...i+m+k]) is true then
                    ηocc ← ηocc + 1
                end if
            end for
            if ηocc ≥ f then
                Output P
            end if
        end for
    end for

Algorithm 6.3: ACPD Based on Landau’s algorithm

ACPD-LANDAU(T, N, k, m1, m2, f)
    ST ← BuildSuffixTree(T)
    for m = m1 to m2 do
        for each m-length substring Pm do
            for i = 1 to N − m − k do
                LANDAUALGORITHM(Pm, T[i...i+m+k])
            end for
        end for
    end for


6.4.2 Proposed ACPD Algorithm

Our ACPD algorithm (Algorithm 6.4) uses the same framework as the ACPM algorithm described in Section 5.3.4. The algorithm first constructs hypotheses on potential circular pattern matches using q-gram filtration. For each hypothesis (i.e., each matching q-gram in T), the ACPD algorithm verifies all possible circular shifts that involve this q-gram match.

The algorithm constructs $O(N^2)$ hypotheses in the worst case, with parameter $q = \lfloor m_1/(k+1) \rfloor$, where N is the total length of the concatenated sequences (for a database of sequences). On average, the number of hypotheses is $O(N^2/|\Sigma|^q)$. When $|\Sigma|^q$ is close to $O(N)$, the number of hypotheses reduces to $O(N)$.
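The hypothesis-generation step can be illustrated with a small sketch. Algorithm 6.4 groups equal q-grams through the suffix array and LCP array; the version below uses a plain dictionary instead, purely as an assumption made for brevity, and the function name `qgram_hypotheses` is hypothetical. It returns every pair of text positions that share the same q-gram, with $q = \lfloor m_1/(k+1) \rfloor$.

```python
from collections import defaultdict
from itertools import combinations


def qgram_hypotheses(text: str, m1: int, k: int) -> list[tuple[int, int]]:
    """Return all position pairs (i, j), i < j, whose q-grams are identical."""
    q = m1 // (k + 1)
    if q < 1:
        raise ValueError("m1 must be at least k + 1 so that q >= 1")
    positions: dict[str, list[int]] = defaultdict(list)
    for i in range(len(text) - q + 1):
        positions[text[i:i + q]].append(i)
    hypotheses: list[tuple[int, int]] = []
    for occurrences in positions.values():
        # Every pair of occurrences of one q-gram is a candidate match.
        hypotheses.extend(combinations(occurrences, 2))
    return hypotheses
```

On a text made of a single repeated symbol every pair of positions is a hypothesis, which corresponds to the quadratic worst case mentioned above.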

At verification, the algorithm checks the circular shifts of length $m_1, m_1 + 1, \ldots, m_2$. For each pattern of length m, there are $O(m)$ circular shifts to check, where $m_1 \le m \le m_2$. The time for checking all m-length circular shifts of one circular pattern involving the current hypothesis is in $O(km)$ using the verification algorithm (Algorithm 5.5). In each hypothesis, the number of circular patterns of length m is m. So the time cost of checking all circular shifts of the circular patterns of length m in a hypothesis is $O(k m^2)$. The total time for verification will be $O(\sum_{m=m_1}^{m_2} k m^2) = O(k m_2^3)$. The worst-case time complexity of Algorithm 6.4 will thus be in $O(k m_2^3 N^2)$. On average, the running time will be in $O(k m_2^3 N^2/|\Sigma|^q)$. When $|\Sigma|^q$ is close to $O(N)$, as is typically the case, the time complexity reduces to $O(k m_2^3 N)$.

A further improvement is to exploit the huge redundancy in going from a pattern of length m to a pattern of length (m + 1). We observe that going from an m-length pattern starting at position i in T to an (m + 1)-length pattern starting at the same position involves adding just one symbol at the end. Most k-approximate cyclic pattern matches at one length will also be cyclic matches at the next length, if the cyclic edit distance is less than k − 1. Also, non-matching regions in T for the m-length pattern cannot be a match for the (m + 1)-length pattern if the circular edit distance with the m-length pattern is greater than k + 1. Exploiting this redundancy by keeping the full dynamic programming table leads to a further reduction of the overall complexity to $O(m_2^2 N^2)$ in the worst case and $O(m_2^2 N)$ on average. This is made possible only by the nature of the CPD problem.


Based on the foregoing, we summarize our results on the ACPD problem in the following theorem:

Theorem 6.2: Given a sequence database SeqDB with r sequences and the parameters $m_1, m_2, k, f, g$, Algorithm ACPD solves the ACPD problem in $O(k m_2^3 N^2)$ time in the worst case and $O(k m_2^3 N)$ time on average, where N is the total number of symbols in SeqDB.

In the above CPD algorithms, we have assumed one single input sequence, T, for simplicity. For a database of sequences, we can simply concatenate the sequences in the database to form one long sequence, and then construct the generalized suffix tree (or suffix array) for the long sequence. Then, we can keep track of the number of occurrences of each circular pattern in each sequence in the database to support quorum constraints.
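The bookkeeping needed for the quorum constraint can be sketched as follows. The helper names (`concatenate`, `sequence_of`, `quorum`) and the separator symbol `$` are assumptions made for illustration; the separator must not occur in the alphabet, and candidate matches that span a separator would have to be discarded by the caller.

```python
from bisect import bisect_right


def concatenate(seqs: list[str], sep: str = "$") -> tuple[str, list[int]]:
    """Join the sequences with sep; also return each sequence's start offset."""
    starts: list[int] = []
    offset = 0
    for s in seqs:
        starts.append(offset)
        offset += len(s) + len(sep)
    return sep.join(seqs), starts


def sequence_of(pos: int, starts: list[int]) -> int:
    """Index of the sequence that contains position pos of the joined text."""
    return bisect_right(starts, pos) - 1


def quorum(match_positions: list[int], starts: list[int]) -> int:
    """Number of distinct sequences in which a pattern has a match."""
    return len({sequence_of(p, starts) for p in match_positions})
```

A pattern would then satisfy the quorum constraint when `quorum(positions, starts) >= g`.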

6.4.3 Comparison

The time complexity of the ACPD solution via Maes’ algorithm is $O(m_2^3 N^2 \log m_2)$. The time complexity of the ACPD solution via Landau’s algorithm is $O(k m_2^2 N^2)$. The time complexity of the proposed ACPD algorithm is $O(k m_2^3 N^2)$. The number of patterns is $\min\{N, |\Sigma|^m\}$. The proposed ACPD algorithm is based on q-gram filtration. The average number of hypotheses is in $O(N^2/|\Sigma|^{m/k})$. When $|\Sigma|^{m/k}$ is close to $O(N)$, the time complexity of the ACPD algorithm reduces to $O(k m_2^3 N)$. By exploiting the nature of the CPD problem in terms of the redundancy between m-length patterns and (m + 1)-length patterns starting at the same location in the text, the complexity of the proposed ACPD algorithm can be reduced to $O(m_2^2 N^2)$ in the worst case and $O(m_2^2 N)$ on average.

Algorithm 6.4: Proposed ACPD algorithm

ACPD-QGRAM(T, N, k, m1, m2, f)
    <SA, LCP> ← COMPUTESA(T)
    q ← ⌊m1/(k+1)⌋, Qset ← NULL
    for i = 1 to N do
        if LCP[i] ≥ q then
            Qset ← c(Qset, SA[i])
        else
            for j = 1 to length(Qset) do
                for l = j+1 to length(Qset) do
                    Ppos ← Qset[j], Tpos ← Qset[l]
                    for m = m1 to m2 do
                        T1 ← T[Tpos+q+1...Tpos+m+k]; T2 ← T[Tpos−(m−q+k)...Tpos−1]
                        for h = 0 to m−q do
                            P ← T[Ppos+q+1...Ppos+q+h] ◦ T[Ppos−m+q+h...Ppos−1]
                            if BIDIRECTIONALED2(P, T1, T2, k) is true then
                                if P is the first occurrence then
                                    ηocc(P) ← 1
                                else
                                    ηocc(P) ← ηocc(P) + 1
                                end if
                            end if
                        end for
                    end for
                end for
            end for
            Qset ← NULL
        end if
    end for
    ∀P, Output P if ηocc(P) ≥ f

Algorithm 6.5: Modified ACPM Hypothesis Verification

BIDIRECTIONALED2(P, T1, T2, k)
    P2 ← P^R
    ED1 ← DP(P, T1, k)
    ED2 ← DP(P2, T2^R, k)
    ED ← ED2^R ∪ rep(0, q) ∪ ED1
    for h = 1 to |P|+1 do
        if (ED[h] + ED[h+m−1] ≤ k) then
            return true
        end if
    end for

6.5 Experiments

We performed circular pattern discovery on the same protein multidomain sequences with pattern length m, where 4 ≤ m ≤ 30. Figure 6.1 shows the variation of the number of distinct patterns (both circular and non-circular) with pattern length. When the length is 4, the number of distinct patterns is large; as the length increases, the number of distinct patterns decreases quickly. Figure 6.2 shows the variation of the number of circular patterns (i.e., excluding the non-circular patterns) with pattern length. As expected, the number decreases rapidly with increasing length. In Figure 6.3 we show the maximum number of occurrences of one pattern at different pattern lengths. Table 6.1 shows the number of distinct patterns and the number of non-circular patterns at different pattern lengths, together with the ratio (number of distinct patterns)/(number of non-circular patterns). Interestingly, some circular patterns were discovered with lengths as large as 26 symbols. That is, 26 domains were repeated, but in the form of a cyclic permutation between two multidomain proteins in the database.

Table 6.2 shows an example of a circular pattern with five domains. The pattern is {PDA1N075, PD000041, PD000041, PD248344, PD000041}. We assign the symbol “A” to protein domain PDA1N075, the symbol “B” to protein domain PD000041, and the symbol “C” to protein domain PD248344. The pattern occurs in the multi-domain protein sequences Q99407_HUMAN, ANK1_HUMAN and Q13768_HUMAN in different circular orders.

Table 6.3 shows an example of a circular pattern with thirteen domains. The pattern is {PD000768, PD000767, PD005993, PD000768, PD000767, PDA1J766, PD005993, PD000768, PD000767, PDA1J7O1, PD000165, PD272501, PDA1J3F2}. We assign the string {FGHFGIHFGJKLM} to the pattern. It occurs in the multi-domain protein sequences Q9Y4V9_HUMAN, Q5JR23_HUMAN and Q9UJ57_HUMAN in different circular orders.


[Figure 6.1: plot titled “Number of Distinct Patterns”; x-axis: Pattern Length (5 to 30); y-axis: Number of Distinct Normal Patterns.]

Figure 6.1. Variation of the number of distinct patterns (including non-circular and circular patterns) with pattern length.

[Figure 6.2: plot titled “Number of True CPs”; x-axis: Pattern Length (5 to 30); y-axis: Number of True CPs.]

Figure 6.2. Variation of number of circular patterns with pattern length.


Table 6.1. The number of distinct patterns with pattern length

Pattern   # Distinct   # Non-circular   # Circular   Ratio
Length    Patterns     Patterns         Patterns

4         80333        79934            399          100.50%
5         8199         7944             255          103.21%
6         4720         4705             15           100.32%
7         2754         2677             77           102.88%
8         836          830              6            100.72%
9         550          532              18           103.38%
10        855          855              0            100.00%
11        937          921              16           101.74%
12        1011         983              28           102.85%
13        727          676              51           107.54%
14        299          296              3            101.01%
15        265          262              3            101.15%
16        141          141              0            100.00%
17        139          139              0            100.00%
18        116          100              16           116.00%
19        50           50               0            100.00%
20        58           57               1            101.75%
21        51           51               0            100.00%
22        19           19               0            100.00%
23        37           37               0            100.00%
24        18           18               0            100.00%
25        34           34               0            100.00%
26        18           17               1            105.88%
27        19           19               0            100.00%
28        2            2                0            100.00%
29        8            8                0            100.00%
30        5            5                0            100.00%

Table 6.2. Sample discovered circular patterns with length five.
A≡PDA1N075, B≡PD000041, C≡PD248344

Protein         Pos   Order   Pattern
Q99407_HUMAN    6     0       ABBCB
Q13768_HUMAN    7     0       ABBCB
ANK1_HUMAN      6     3       CBABB
Q13768_HUMAN    6     4       BABBC
Q99407_HUMAN    5     4       BABBC


[Figure 6.3 plot. Title: Maximum Number of Occurrences. x-axis: Pattern Length (5 to 30). y-axis: Maximum Number of Occurrences.]

Figure 6.3. Variation of maximum number of occurrences with pattern length.

Table 6.3. Sample discovered circular patterns with length thirteen.
F≡PD000768, G≡PD000767, H≡PD005993, I≡PDA1J766, J≡PDA1J7O1, K≡PD000165, L≡PD272501, M≡PDA1J3F2

Protein         Pos   Order   Pattern
Q9Y4V9_HUMAN    20    0       FGHFGIHFGJKLM
Q5JR23_HUMAN    21    1       GHFGIHFGJKLMF
Q9UJ57_HUMAN    38    2       HFGIHFGJKLMFG


6.6 Summary

In this chapter, we introduced the ECPD and ACPD problems, and proposed two algorithms for their solution. The first algorithm uses suffix trees and suffix links to solve the ECPD problem in O(m_2^2 N) time. The second algorithm solves the more challenging ACPD problem in O(k m_2^2 N^2) time in the worst case, and O(k m_2^2 N) on average, using suffix arrays. By exploiting the redundancy between patterns of different lengths that start at the same position in the text, the overall complexity is reduced to O(k m_2 N^2) in the worst case, and O(k m_2 N) on average. Our results can be compared with the approach that directly applies Maes' algorithm, one of the best available ACPM algorithms, to the ACPD problem, which runs in O(m_2^3 N^2 log m_2) time, or O(m_2^3 N log m_2) for small values of m_2. Although pattern discovery has been well studied, to our knowledge, this is the first attempt at a focused study on the CPD problem.
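To make the ECPD problem statement concrete, here is a naive baseline sketch. It is not the suffix-tree algorithm of this chapter and does not attain the bounds stated above: it simply groups every length-m window by its lexicographically smallest rotation and reports the groups in which two or more different rotations occur. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def canonical_rotation(window):
    """Lexicographically smallest rotation, used as the class representative."""
    return min(window[r:] + window[:r] for r in range(len(window)))


def naive_ecpd(sequences, m):
    """Map each canonical rotation to the set of distinct rotations observed."""
    seen = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - m + 1):
            window = seq[i:i + m]
            seen[canonical_rotation(window)].add(window)
    # A class is a circular pattern only if two or more different rotations occur.
    return {rep: rots for rep, rots in seen.items() if len(rots) >= 2}


# Toy input (hypothetical): ABBCB and its rotation BABBC occur in different sequences.
print(naive_ecpd(["XABBCBY", "ZBABBCW"], 5))
```

The baseline recomputes a canonical rotation for every window, which is the kind of repeated work that the suffix-structure algorithms are designed to avoid.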

We presented an experiment on discovering exact circular patterns in the ProDom database, and reported interesting exact circular patterns discovered by our algorithm. To our knowledge, this is the first experimental study focused on discovering exact circular patterns.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

In this work, we present two novel data structures to solve the space problem of the suffix tree and its applications. These two structures are the virtual suffix tree (VST) and the Probabilistic Suffix Array (PSA). We consider the circular pattern matching problem and define the circular pattern discovery problem. We present algorithms using suffix data structures to solve both the exact and inexact variants of the CPM problem. Our focus is on efficiently answering enumerative queries involving circular patterns. We also present algorithms based on our CPM algorithms to solve the CPD problem. To our knowledge, this is the first attempt at a focused study on discovering circular patterns.

The Virtual Suffix Tree. We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. The VST provides the same functionality as the suffix tree, including support for pattern matching and suffix links. But the VST requires much less space than the suffix tree and the other recently proposed space-efficient data structures for suffix trees and suffix arrays. With n the length of the sequence, the worst-case space is 18n bytes, compared with 20n bytes for the other data structures for suffix trees and suffix arrays (such as the enhanced suffix array [2] and the linearized suffix tree [62]). On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.

The Probabilistic Suffix Array. We have presented the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov models. The PSA provides the same functionality as the probabilistic suffix tree (PST), but its space is significantly smaller than that of the PST. We construct the PSA in linear time and linear space, independent of the order of the Markov model. The space depends on the number of interval nodes of the suffix array. In the worst case, the memory requirement is 33N bytes. On average, the needed space is 26N bytes, which includes the work space of the construction phase. The space needed for the PSA is significantly smaller than that for the PST implemented on a regular suffix tree. Prediction using the PSA takes O(m log(n/|Σ|)) time, where m is the pattern length and Σ is the symbol alphabet.

Circular Pattern Matching. We defined the circular pattern matching (CPM) problems in the related work. We present a linear time algorithm to solve the ECPM problem, which aims at finding exact occurrences of a circular pattern P in the text T. The ACPM problem is to find approximate occurrences of a circular pattern P in the text T. We present three algorithms for the ACPM2 problem and one greedy algorithm that produces an incomplete result. Of all the algorithms reported in the literature for the ACPM problem, our ACPM q-gram-based bidirectional algorithm with suffix trees provides the best results on average, with respect to both time and space complexity. Using our algorithms, we performed experiments on the analysis of circular permutations in multidomain proteins. Based on the results, we developed a method for function prediction for such multidomain proteins.
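As a reminder of what an ECPM query returns, the following naive sketch reports every position in T where some rotation of P occurs, together with the rotation offset. It is a brute-force baseline for illustration only, not the linear time algorithm summarised above, and the name naive_ecpm and the toy text are hypothetical.

```python
def naive_ecpm(pattern, text):
    """Return (position, rotation offset) for each window of text that is a rotation of pattern."""
    m = len(pattern)
    doubled = pattern + pattern  # contains every rotation of pattern as a substring
    hits = []
    for i in range(len(text) - m + 1):
        r = doubled.find(text[i:i + m])
        if r != -1:
            hits.append((i, r))
    return hits


# Toy text (hypothetical) containing two rotations of ABBCB.
print(naive_ecpm("ABBCB", "XXCBABBYBABBC"))
```

Each window is compared against the doubled pattern independently, so the cost grows with both the text length and the pattern length; the suffix-structure algorithms avoid this per-window rescanning.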

Circular Pattern Discovery. Based on the work on the circular pattern matching problem, we defined the Exact Circular Pattern Discovery (ECPD) problem and the Approximate Circular Pattern Discovery (ACPD) problem. We present an ECPD algorithm that uses suffix trees and suffix links to solve the exact circular pattern discovery problem in O(m_2^2 N) time. We also present an efficient algorithm to solve the ACPD problem based on our bidirectional ACPM2 algorithm. Compared with the ACPD algorithm based on Maes' CPM algorithm [81], our algorithm is the better solution on average.


7.2 Future Work

We briefly describe some potential future work, based on the material presented in this work.

7.2.1 Circular Pattern Discovery

We have presented algorithms for the circular pattern discovery problem, based on our CPM algorithms. In our ACPD algorithm, many operations that compare the circular shifts against the subtext are repeated. In future work, we may find a way to avoid these repeated comparisons.

7.2.2 Network Analysis for Circular Multidomain Proteins

In this work we proposed efficient algorithms for rapid detection of cyclic permutations in multidomain proteins, using suffix trees and suffix arrays. Based on the results, we formed networks linking different multidomain proteins based on cyclic patterns observed in the proteins, using current data in the ProDom database [21]. Using these networks we performed functional annotation of the multidomain proteins. The significance of functional relatedness between two multidomain proteins was assessed using z-scores and p-scores. We performed only a simple analysis of the networks, for instance, based on simple in-degree and out-degree characteristics of the nodes in the network. The overall performance of the method on a network formed with the top 500 proteins (as ranked by degree) was as follows: sensitivity 0.81, precision 0.88, F-measure 0.84. Although based only on sequence data, and with no alignment, these results are comparable with ProFunc [69, 70] and ProKnow [101], both of which perform at around 70% accuracy. These are web-based servers that predict protein function from 3D structure, using initial structure alignment and a combination of algorithms. Motivated by these impressive results, an interesting future direction will be a more detailed study of these networks of cyclic patterns, built on just protein sequence data. Network characterization using more rigorous methods such as betweenness [43], centrality [52], and pairwise disconnectivity [102] could shed more light on the potential relationships between related multidomain proteins, or non-related multidomain proteins with potentially similar function.
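The reported F-measure can be reproduced from the reported sensitivity and precision, assuming it is the balanced F-measure (the harmonic mean of precision and sensitivity). That assumption is ours; the function name f_measure is introduced only for this check.

```python
def f_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2.0 * precision * sensitivity / (precision + sensitivity)


# Values reported for the network built from the top 500 proteins.
print(round(f_measure(0.88, 0.81), 2))  # 0.84
```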

7.2.3 From PSA to PFA

The probabilistic suffix automaton is a subclass of the probabilistic finite automaton (PFA), which is a simple model for learning in Markov models with order L. Ron et al. [109] proved that “every distribution generated by a probabilistic suffix automata can equivalently be generated by a PST”. In future work, one can consider how to construct a probabilistic suffix automaton from our probabilistic suffix array data structure. Previous work [9, 109] has constructed the probabilistic suffix automaton from the probabilistic suffix tree (PST), but the space and time complexity is O(L n^2), which is very large. The compressed suffix array [45] and the compact suffix array [82, 83] are space-efficient suffix data structures. It would be interesting to study how these space-efficient suffix data structures can be used to improve the PSA.

7.2.4 Approximate Pattern Matching Using PSA

Dynamic programming [114] is a well-known method for finding approximate patterns, but its time complexity is O(m^2 n) to find all the occurrences, where m is the length of the pattern and n is the length of the text. Ukkonen's algorithm [121] improved the time complexity of this problem to O(mnk), where k is the maximum number of errors allowed in the pattern. This is still large in practice. We think that the probability generated from PSA prediction may give us a useful measure for the approximate pattern matching problem. This simple idea will work for existential queries in O(n + m log |Σ|) time, where O(n) is the PSA construction cost and O(m log |Σ|) is the prediction cost. But for enumerative queries, finding all the possible matching positions using the PSA will be a major challenge.
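For reference, a minimal sketch of the dynamic programming approach to approximate matching is given below. It keeps one column of the edit-distance table per text symbol and reports only the end positions of occurrences with at most k errors; it does not recover start positions or alignments. The function name approx_occurrences is hypothetical, and this is a textbook-style illustration rather than the exact formulation of [114] or [121].

```python
def approx_occurrences(pattern, text, k):
    """End positions j (1-based) where a substring of text ending at j is within edit distance k of pattern."""
    m = len(pattern)
    # prev[i]: best edit distance between pattern[:i] and a substring ending at the previous text position.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0 because an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra symbol in the text
                         cur[i - 1] + 1)      # symbol of the pattern left unmatched
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


print(approx_occurrences("abc", "xxabcxx", 0))
print(approx_occurrences("abc", "xxabcxx", 1))
```

Answering an existential query only needs the first reported position, whereas an enumerative query needs the whole list, which is where the challenge noted above arises.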

7.2.5 Prediction with PSA using Inexact Matching

Markov models in information retrieval are often based on item frequency and document frequency. However, in the real world, sequences often contain mismatches or errors (i.e., insertions, deletions, and substitutions). If we could compute the Markov model based on an approximation of the item frequency and document frequency, we would get a more robust model for prediction. Here, approximate frequencies are computed by considering possible mutations in the sequence.

Calculating Approximate Probability

The transition matrix of the Markov model will be calculated based on an approximate probability. The approximate probability is calculated from the approximate item frequency and document frequency. We use the following equation in place of the regular equation, which calculates the transition matrix using the exact term frequency (TF) or document frequency (DF).

\[
P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = \arg\max_{ED(Y,X) \le k} P(Y_{m+n} = j \mid Y_n Y_{m+n-1} Y_{m+n-2} \ldots Y_{m+n-L})
\]

where ED(Y,X) = ED(Y[m+n-L+1 ... m+n], X[n-L+1 ... n]).
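One possible reading of the equation above is sketched below: among all length-L contexts Y observed in a training sequence whose edit distance to the query context is at most k, take the largest exact conditional probability of the next symbol. This is only an illustrative interpretation (it takes the maximum value rather than the maximising argument), it scans the training sequence naively instead of using the PSA, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def approx_transition(train, context, symbol, k):
    """Max over length-L contexts Y with ED(Y, context) <= k of the exact P(symbol | Y)."""
    L = len(context)
    follow = defaultdict(Counter)  # follow[Y][s] = number of times s follows context Y
    for i in range(L, len(train)):
        follow[train[i - L:i]][train[i]] += 1
    best = 0.0
    for y, counts in follow.items():
        if edit_distance(y, context) <= k:
            best = max(best, counts[symbol] / sum(counts.values()))
    return best


# Toy training sequence (hypothetical). The context "xb" never occurs exactly,
# but it is within one substitution of the observed context "ab".
print(approx_transition("abcabdabc", "xb", "c", 0))
print(approx_transition("abcabdabc", "xb", "c", 1))
```

With k = 0 the sketch reduces to the exact conditional probability, which is the regular equation that the approximate version replaces.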

Difficulty

For this extension of the PSA, we need to cope with two hard problems.

1. How to calculate the approximate term frequency (TF) and document frequency (DF) efficiently in both time and space. This is similar to the motif discovery problem. We have to find a better algorithm to calculate the approximate term frequency (TF) and document frequency (DF).

2. How to associate nodes in the PSA with the approximate probability. In our current PSA, symbols on the same edge share the same node. Hence they have the same probability, tf and df. Thus the space for the PSA is linear with respect to the length of the sequence. In this future work, symbols on the same edge may not share the same probability. Thus, in a naive implementation, the space for the PSA will be quadratic with respect to the length of the sequence. Reducing this space requirement is an interesting challenge.

If these two problems can be solved, a robust representation for Markov models will be achieved.

7.3 Publications from the Dissertation

1. Jie Lin, Yue Jiang, and Don Adjeroh. The Virtual Suffix Tree: An efficient data structure for suffix trees and suffix arrays. The Prague Stringology Conference (PSC), 2008.

2. Jie Lin, Yue Jiang, and Donald A. Adjeroh. The Virtual Suffix Tree. International Journal of Foundations of Computer Science, Vol. 20, No. 6, pp. 1109-1133, 2009.

3. Jie Lin and Don Adjeroh. All-against-all circular pattern matching. 2011. Under review.

4. Jie Lin and Don Adjeroh. Circular Pattern Discovery. 21st International Workshop on Combinatorial Algorithms, 2011.

5. Jie Lin, Don Adjeroh and Bing-Hua Jiang. Probabilistic Suffix Array: Efficient Modelling and Prediction of Protein Families. 2011. Under review.

6. Jie Lin, Don Adjeroh and Bing-Hua Jiang. Algorithms for Efficient Detection of CPs in Multidomain Protein. 2011. To be submitted.

Bibliography

[1] N Abe and M Warmuth. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205–260, 1992.

[2] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2:53–86, 2004.

[3] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching. Springer-Verlag, 2008.

[4] Don Adjeroh and Fei Nan. Suffix sorting via Shannon-Fano-Elias codes. Algorithms, 3(2):145–167, 2010.

[5] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18:333–340, June 1975.

[6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, October 1990.

[7] Amihood Amir, Yonatan Aumann, Gad M. Landau, Moshe Lewenstein, and Noa Lewenstein. Pattern matching with swaps. In FOCS, pages 144–153, 1997.

[8] Arne Andersson and Stefan Nilsson. Efficient implementation of suffix trees. 25(2):129–141, 1995.

[9] Alberto Apostolico and Gill Bejerano. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. Journal of Computational Biology, 7(3-4):381–393, 2000.

[10] Alberto Apostolico and Maxime Crochemore. Optimal canonization of all substrings of a string. Inf. Comput., 95(1):76–95, 1991.

[11] Alberto Apostolico and Raffaele Giancarlo. The Boyer Moore Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.

[12] H. Arimura, H. Asaka, H. Sakamoto, and S. Arikawa. Efficient discovery of proximity patterns with suffix arrays. In A. Amir and G.M. Landau, editors, CPM, volume 2089 of Lecture Notes in Computer Science, pages 152–156. Springer, 2001.

[13] R.A. Baeza-Yates and G.H. Gonnet. A new approach to text searching. In N.J. Belkin and C.J. van Rijsbergen, editors, SIGIR 89, Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, volume 23, pages 168–75. ACM, New York (published as a special issue of SIGIR Forum, Vol. 23, 1-2, Fall 88/Winter 89), 1989.

[14] Ricardo A. Baeza-Yates and Chris H. Perleberg. Fast and practical approximate string matching. Information Processing Letters, 59(1):21–27, 1996.

[15] Alex Bateman, Lachlan Coin, Richard Durbin, Robert D. Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik L. L. Sonnhammer, David J. Studholme, Corin Yeats, and Sean R. Eddy. The Pfam protein families database. Nucleic Acids Res, 32 (Database issue):D138–D141, 2004.

[16] Ron Begleiter, Ran El-Yaniv, and Golan Yona. On prediction using variable order Markov models. J. Artif. Intell. Res. (JAIR), 22:385–421, 2004.

[17] Gill Bejerano and Golan Yona. Variations on probabilistic suffix trees: Statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.

[18] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, LATIN, volume 1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000.

[19] Kellogg S. Booth. Lexicographically least circular substrings. Inf. Process. Lett., 10(4/5):240–242, 1980.

[20] R.S. Boyer and J.S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):62–72, 1977.

[21] Catherine Bru, Emmanuel Courcelle, Sébastien Carrère, Yoann Beausse, Sandrine Dalmar, and Daniel Kahn. The ProDom database of protein domain families: More emphasis on 3D. Nucleic Acids Res, 33:212–215, 2005.

[22] H. Bunke and U. Buhler. Applications of approximate string matching to 2D shape recognition. Pattern Recognition, 26(12):1797–1812, December 1993.

[23] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, May 1994.

[24] Y. Cao, A. Janke, P.J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J Mol Evol., 47:307–322, 1998.

[25] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48(5):1073–1082, 1988.

[26] William I. Chang and Jordan Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In CPM ’92: Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching, pages 175–184, London, UK, 1992. Springer-Verlag.

[27] William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327–344, 1994.

[28] John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Computer Journal, 40(2/3):67–75, 1997.

[29] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. In STOC, pages 407–415, 2000.

[30] S Cong, J Han, and D A Padua. Parallel mining of closed sequential patterns. In KDD. ACM, 2005.

[31] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.

[32] Florence Corpet, Florence Servant, Jerome Gouzy, and Daniel Kahn. ProDom and ProDom-CG: Tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res, 28(1):267–269, 2000.

[33] Paul D. Cotter, Colin Hill, and R. Paul Ross. Bacterial lantibiotics: Strategies to improve therapeutic potential. Current Protein and Peptide Science, 6:61–75, 2005.

[34] Maxime Crochemore and Thierry Lecroq. Tight bounds on the complexity of the Apostolico-Giancarlo algorithm. Information Processing Letters, 63:195–203, 1997.

[35] Craik DJ. Circling the enemy: Cyclic proteins in plant defence. Trends Plant Sci., 14(6):328–335, 2009.

[36] Jean-Pierre Duval. Factorizing words over an ordered alphabet. J. Algorithms, 4(4):363–381, 1983.

[37] Yariv Ephraim, Amir Dembo, and Lawrence R. Rabiner. A minimum discrimination information approach for hidden Markov modeling. IEEE Transactions on Information Theory, 35(5):1001–1013, 1989.

[38] Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. J. ACM, 47(6):987–1011, 2000.

[39] Robert D. Finn, John Tate, Jaina Mistry, Penny C. Coggill, Stephen John Sammut, Hans-Rudolf Hotz, Goran Ceric, Kristoffer Forslund, Sean R. Eddy, Erik L. L. Sonnhammer, and Alex Bateman. The Pfam protein families database. Nucleic Acids Res, 36:281–288, 2008.

[40] S. Garcia-Vallve, A. Rojas, J. Palau, and A. Romeu. Circular permutants in beta-glucosidases (family 3) within a predicted double-domain topology that includes a (beta/alpha)8-barrel. Proteins, 31:214–223, 1998.

[41] Robert Giegerich, Stefan Kurtz, and Jens Stoye. Efficient implementation of lazy suffix trees. Software: Practice and Experience, 33(11), 2003.

[42] David Gillman and Michael Sipser. Inference and minimization of hidden Markov chains. In COLT, pages 147–158, 1994.

[43] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. PNAS, 99(12):7821–7826, June 2002.

[44] Jens Gregor and Michael G. Thomason. Dynamic programming alignment of sequences representing cyclic patterns. IEEE Trans. Pattern Anal. Mach. Intell., 15(2):129–135, 1993.

[45] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005.

[46] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK, 1997.

[47] U. Heinemann and M. Hahn. Circular permutation of polypeptide chains: Implications for protein folding and stability. Prog. Biophys. Mol. Biol., 64:121–143, 1995.

[48] R.N. Horspool. Practical fast searching in strings. 10(6):501–6, 1980.

[49] M Hu, J Yang, and W Su. Permu-pattern: Discovery of mutable permutation patterns. In KDD. ACM, 2008.

[50] James W. Hunt and Thomas G. Szymanski. A fast algorithm for computing longest common subsequences. Commun. ACM, 20(5):350–353, 1977.

[51] Costas S. Iliopoulos and M. Sohel Rahman. Indexing circular patterns. In WALCOM, pages 46–57, 2008.

[52] H. Jeong, S. P. Mason, A. L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, May 2001.

[53] Petteri Jokinen and Esko Ukkonen. Two algorithms for approximate string matching in static texts (extended abstract). In A. Tarlecki, editor, Mathematical Foundations of Computer Science 1991: Proc. of the 16th International Symposium, pages 240–248. Springer, Berlin, Heidelberg, 1991.

[54] Maizel JV Jr and Lenk RP. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proc Natl Acad Sci U S A, 78(12):7665–7669, 1981.

[55] Jongsun Jung and Byungkook Lee. Circularly permuted proteins in the protein structure database. Protein Sci., 10(9):1881–1886, 2001.

[56] Juha Kärkkäinen. Suffix cactus: A cross between suffix tree and suffix array. In CPM: 6th Symposium on Combinatorial Pattern Matching, 1995.

[57] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918–936, 2006.

[58] S. Karlin, G. Ghandour, F. Ost, S. Tavare, and L.J. Korn. New approaches for computer analysis of nucleic acid sequences. Proceedings, National Academy of Sciences, 80(18):5660–5664, 1983.

[59] R.M. Karp and M.O. Rabin. Efficient randomized pattern-matching algorithms. 31(2):249–260, 1987.

[60] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size. In 12th Annual Symposium on Combinatorial Pattern Matching, 2001.

[61] Dong Kyue Kim, Jeong Eun Jeon, and Heejin Park. An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size. In SPIRE 2004, 2004.

[62] Dong Kyue Kim, Minhwan Kim, and Heejin Park. Linearized suffix tree: an efficient index data structure with the capabilities of suffix trees and suffix arrays. Algorithmica, 2007.

[63] Dong Kyue Kim, Jeong Seop Sim, Heejin Park, and Kunsoo Park. Constructing suffix arrays in linear time. J. Discrete Algorithms, 3(2-4):126–142, 2005.

[64] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. 6(2):323–350, 1977.

[65] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. J. Discrete Algorithms, 3(2-4):143–156, 2005.

[66] R. Kohli and C. Walsh. Enzymology of acyl chain macrocyclization in natural product biosynthesis. Chemical Communications, 3:297–307, 2003.

[67] Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM Journal on Computing, 27:557–582, 1998.

[68] G.M. Landau and U. Vishkin. Fast string matching with k differences. Journal of Computer and System Sciences, 37(1):63–78, 1988.

[69] Roman A. Laskowski, James D. Watson, and Janet M. Thornton. Profunc: A server for predicting protein function from 3D structure. Nucleic Acids Research, 33(Web-Server-Issue):89–93, 2005.

[70] Roman A. Laskowski, James D. Watson, and Janet M. Thornton. Protein function prediction using local 3D templates. Journal of Molecular Biology, 351:614–626, 2005.

[71] Florencia G. Leonardi. A generalization of the PST algorithm: Modeling the sparse nature of protein sequences. Bioinformatics, 22(11):1302–1307, 2006.

[72] Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8):707–710, 1966. Original in Doklady Akademii Nauk SSSR 163(4): 845–848 (1965).

[73] M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17:149–154, 2001.

[74] Jie Lin and Don Adjeroh. All-against-all circular pattern matching. 2011. Under review.

[75] Jie Lin and Don Adjeroh. Circular pattern discovery. 21st International Workshop on Combinatorial Algorithms, 2011.

[76] Jie Lin, Don Adjeroh, and Binghua Jiang. Algorithms for efficient detection of CPs in multidomain protein. 2011. To be submitted.

[77] Jie Lin, Don Adjeroh, and Binghua Jiang. Probabilistic suffix array: Efficient modelling and prediction of protein families. 2011. Under review.

[78] Jie Lin, Yue Jiang, and Don Adjeroh. The virtual suffix tree: An efficient data structure for suffix trees and suffix arrays. In Jan Holub and Jan Žďárek, editors, Proceedings of the Prague Stringology Conference 2008, pages 68–83, Czech Technical University in Prague, Czech Republic, 2008.

[79] Jie Lin, Yue Jiang, and Don Adjeroh. The virtual suffix tree. Int. J. Found. Comput. Sci., 20(6):1109–1133, 2009.

[80] Moritz G. Maaß. Computing suffix links for suffix trees and arrays. Information Processing Letters, 101(6), 2007.

[81] Maurice Maes. On a cyclic string-to-string correction problem. Inf. Process. Lett., 35(2):73–78, 1990.

[82] Veli Mäkinen. Compact suffix array – a space-efficient full-text index. Fundam. Inform., 56(1-2):191–210, 2003.

[83] Veli Mäkinen and Gonzalo Navarro. Compressed compact suffix arrays. Combinatorial Pattern Matching, pages 420–433, 2004.

[84] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Computing, 22(5):935–948, 1993.

[85] Tobias Marschall and Sven Rahmann. Probabilistic arithmetic automata and their application to pattern matching statistics. Combinatorial Pattern Matching, pages 95–106, 2008.

[86] Tobias Marschall and Sven Rahmann. Efficient exact motif discovery. Bioinformatics, 25(12), 2009.

[87] Andres Marzal and Sergio Barrachina. Speeding up the computation of the edit distance for cyclic strings. Pattern Recognition, Int’l Conference on, 2:2891, 2000.

[88] Geoffrey Mazeroff, Jens Gregor, Michael G. Thomason, and Richard Ford. Probabilistic suffix models for API sequence analysis of Windows XP applications. Pattern Recognition, 41(1):90–101, 2008.

[89] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262–272, April 1976.

[90] R. A. Mollineda, E. Vidal, and F. Casacuberta. Cyclic sequence alignments: Approximate versus optimal techniques. International Journal of Pattern Recognition and Artificial Intelligence, 16:291–299, 2002.

[91] R. A. Mollineda, E. Vidal, and F. Casacuberta. A windowed weighted approach for approximate cyclic string matching. In ICPR ’02: Proceedings of the 16th International Conference on Pattern Recognition (ICPR’02) Volume 4, page 40188, Washington, DC, USA, 2002. IEEE Computer Society.

[92] Ramon Alberto Mollineda, Enrique Vidal, and Francisco Casacuberta. Efficient techniques for a very accurate measurement of dissimilarities between cyclic patterns. In Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition, pages 337–346, London, UK, 2000. Springer-Verlag.

[93] Krisztian Monostori, Arkady Zaslavsky, and Heinz Schmidt. Suffix vector: Space- and time-efficient alternative to suffix trees. In Michael J. Oudshoorn, editor, Twenty-Fifth Australasian Computer Science Conference (ACSC2002), Melbourne, Australia, 2002. ACS.

[94] J. Ian Munro, Venkatesh Raman, and S. Srinivasa Rao. Space efficient suffix trees. J. Algorithms, 39(2):205–222, 2001.

[95] G. Navarro and J. Tarhio. Boyer-Moore string matching over Ziv-Lempel compressed text. Proceedings, Combinatorial Pattern Matching, LNCS 1848, pages 166–180, 2000.

[96] Saul Ben Needleman and Christian Dennis Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[97] Ge Nong, Sen Zhang, and Wai Hong Chan. Linear time suffix array construction using d-critical substrings. In CPM, pages 54–67, 2009.

[98] Jose Oncina. The Cocke-Younger-Kasami algorithm for cyclic strings. In ICPR ’96: Proceedings of the 13th International Conference on Pattern Recognition, page 413, Washington, DC, USA, 1996. IEEE Computer Society.

[99] Hasan H. Otu and Khalid Sayood. A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(16):2122–2130, November 2003.

[100] Philipp Pagel, Matthias Oesterheld, Volker Stümpflen, and Dmitrij Frishman. The DIMA web resource - exploring the protein domain network. Bioinformatics, 22(8):997–998, 2006.

[101] Debnath Pal. Inference of protein function from protein structure. Structure, 13:121–130, 2005.

[102] Anatolij Potapov, Björn Goemann, and Edgar Wingender. The pairwise disconnectivity index as a new metric for the topological analysis of regulatory networks. BMC Bioinformatics, 9, 2008.

[103] Elise Prieur and Thierry Lecroq. From suffix trees to suffix vectors. In Prague Stringology Conference (PSC2005), Prague, 2005.

[104] Simon J. Puglisi, William F. Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2), 2007.

[105] A. Reyes, C. Gissi, G. Pesole, F.M. Catzeflis, and C. Saccone. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Mol. Biol. Evol., 17:979–983, 2000.

[106] J. Rissanen. Complexity of strings in the class of Markov sources. IEEE Transactions on Information Theory, IT-32(4):526–532, 1986.

[107] Jorma Rissanen. Universal coding information prediction and estimation. IEEE Transactions on Information Theory, IT-30(4):629–636, 1984.

[108] Dana Ron, Yoram Singer, and Naftali Tishby. Learning probabilistic automata with vari-able memory length. In COLT, pages 35–46, 1994.

[109] Dana Ron, Yoram Singer, and Naftali Tishby. The power of amnesia: Learning prob-abilistic automata with variable memory length. Machine Learning, 25(2-3):117–149,1996.

[110] Luıs M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. Fully-compressed suffixtrees. In LATIN’08: Proceedings of the 8th Latin American Conference on TheoreticalInformatics, pages 362–373, Berlin, Heidelberg, 2008. Springer-Verlag.

[111] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Comput-ing Systems, 41(4):589–607, 2007.

[112] GK Sandve and F Drabls. A survey of motif discovery methods in an integrated frame-work. Biology Direct, 1(11), 2006.

[113] Thomas B. Sebastian, Philip N. Klein, and Benjamin B. Kimia. On aligning curves.IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:116–125, 2003.

[114] P.H. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. Algorithms, 1:359–373, 1980.

[115] Y. Shiloach. Fast canonization of circular strings. J. Algorithms, 2(2):107–121, 1981.

[116] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147(1):195–197, March 1981.

[117] W.F. Smyth. Computing Patterns in Strings. Addison-Wesley, 2003.

[118] Kenneth Sorensen. Distance measures based on the edit distance for permutation-type representations. Journal of Heuristics, 13:35–47, 2007.

[119] YQ. Tang, J. Yuan, G. Osapay, K. Osapay, D. Tran, CJ. Miller, AJ. Ouellette, and ME. Selsted. A cyclic antimicrobial peptide produced in primate leukocytes by the ligation of two truncated alpha-defensins. Science, 286(5439):498–502, 1999.

[120] Steven L. Tanimoto. A method for detecting structure in polygons. Pattern Recognition, 13(6):389–394, 1981.

[121] Esko Ukkonen. Finding approximate patterns in strings. J. Algorithms, 6(1):132–137, 1985.

[122] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

[123] Shai Uliel, Amit Fliess, Amihood Amir, and Ron Unger. A simple algorithm for detecting circular permutations in proteins. Bioinformatics, 15(11):930–936, 1999.

[124] Shai Uliel, Amit Fliess, and Ron Unger. Naturally occurring circular permutations in proteins. Protein Eng., 14(8):533–542, August 2001.

[125] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1(4):337–348, 1994.

[126] J. Weiner and E. Bornberg-Bauer. Evolution of circular permutations in multidomain proteins. Mol. Biol. Evol., 23(4):734–743, 2006.

[127] J. Weiner, G. Thomas, and E. Bornberg-Bauer. Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics, 21(7):932–937, 2005.

[128] P. Weiner. Linear pattern matching algorithm. Proceedings, 14th IEEE Symposium on Switching and Automata Theory, 21:1–11, 1973.

[129] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.

[130] S. Wu and U. Manber. Agrep: a fast approximate pattern matching tool. In Proceedings of the Winter 1992 USENIX Conference, pages 153–62. USENIX Association, Berkeley, CA, 1992.

[131] Mikio Yamamoto and Kenneth Ward Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1–30, 2001.

[132] J Yang, W Wang, P Yu, and J Han. Mining long sequential patterns in a noisy environment. In SIGMOD. ACM, 2002.

[133] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

[134] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.
