
Information Sciences 177 (2007) 231–238

www.elsevier.com/locate/ins

Parallel comparison of run-length-encoded strings on a linear systolic array

Alessandro Bogliolo *, Valerio Freschi

STI – University of Urbino, Piazza della Repubblica, 13, Urbino 61029, Italy

Received 20 July 2005; received in revised form 13 July 2006; accepted 15 July 2006

Abstract

The length of the longest common subsequence (LCS) between two strings of M and N characters can be computed by an O(M · N) dynamic programming algorithm, which can be executed in O(M + N) steps by a linear systolic array. It has been recently shown that the LCS between run-length-encoded (RLE) strings of m and n runs can be computed by an O(nM + Nm − nm) algorithm that could be executed in O(m + n) steps by parallel hardware. However, the algorithm cannot be directly mapped on a linear systolic array because of its irregular structure.

In this paper, we propose a modified algorithm that exhibits a more regular structure at the cost of a marginal reduction of the efficiency of RLE. We outline the algorithm and we discuss its mapping on a linear systolic array.
© 2006 Elsevier Inc. All rights reserved.

Keywords: Algorithms; Longest common subsequence; Run-length encoding; Systolic array; String comparison

1. Introduction

The length of the longest common subsequence (LCS) between two strings defined over a common finite alphabet is a similarity metric used in many applications, including genomics, natural language processing, image processing and database search [8,10]. A subsequence of a given string S is any string obtained from S by deleting some of its elements. Given two strings X and Y, a common subsequence for X and Y is a subsequence of both strings.

The LCS between two strings of M and N characters can be exactly computed by means of dynamic programming [6] with computational time complexity proportional to the product of the lengths of the two strings: O(N · M). Furthermore, the dynamic programming algorithm exhibits a low-level parallelism that can be exploited to perform LCS computation in O(M + N) steps [7]. The parallel implementation makes use of a systolic matrix of M · N locally-connected identical computational units. Since the number of units that can execute in parallel is limited to min{M, N}, parallel implementations have been proposed that make use of only min{M, N} shared units organized in a linear systolic array [7].

doi:10.1016/j.ins.2006.07.024


* Corresponding author.
E-mail addresses: [email protected] (A. Bogliolo), [email protected] (V. Freschi).


Several algorithms have been proposed in the last decade to exploit string compression (either run-length encoding (RLE) [9] or Lempel–Ziv compression (LZ78) [11]) to further reduce computational complexity [2,3,1,4]. All these approaches, however, reduce the asymptotic complexity at the cost of increasing the computational cost of each step and/or reducing the inherent parallelism of the algorithm. A new algorithm exploiting RLE both to reduce asymptotic complexity and to enhance parallelism has been recently proposed [5]. The algorithm, hereafter referred to as RLE–LCS, has complexity O(mN + Mn − mn) and can be executed in O(m + n) steps on parallel hardware, where m and n denote the numbers of runs (sequences of consecutive characters taking the same symbol) in the two strings. However, the RLE–LCS algorithm has an irregular structure that makes it difficult to map on a linear systolic array.

In this paper, we discuss the issues related to the parallel implementation of RLE–LCS and we propose a modified version that is particularly suitable for comparing a short string (of N characters) against a much longer one (of M ≫ N characters). A parallel implementation is proposed for the modified RLE–LCS that executes in O(m̃ + N) steps on a linear systolic array, where m̃ is the number of runs in the longer string X once run lengths are limited to a fixed parameter LMax.

2. Basic LCS algorithm

The computation of the LCS between two strings (namely X = x1, x2, ..., xM and Y = y1, y2, ..., yN) by means of dynamic programming entails the incremental computation of a matrix (hereafter denoted by LCS) of M + 1 rows and N + 1 columns. Elements of X are associated with rows 1 to M, while elements of Y are associated with columns 1 to N. Entry LCS(i, j) represents the LCS between the first i characters of X and the first j characters of Y, and can be incrementally computed from LCS(i − 1, j), LCS(i − 1, j − 1) and LCS(i, j − 1) by means of the following recursive equation:

$$
\mathrm{LCS}(i,j) =
\begin{cases}
0, & i = 0 \ \text{or}\ j = 0\\
\max\{\mathrm{LCS}(i-1,j),\ \mathrm{LCS}(i,j-1)\}, & x_i \neq y_j \ \text{and}\ i,j > 0\\
\mathrm{LCS}(i-1,j-1) + 1, & x_i = y_j \ \text{and}\ i,j > 0
\end{cases}
\qquad (1)
$$

The computation of entry LCS(M, N), representing the global LCS between X and Y, requires M · N incremental computations.
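As a concrete reference, the following is a minimal Python sketch of recurrence (1); the function and variable names are ours, and the example strings in the comment are our reading of the running example of Figs. 1 and 4.

```python
def lcs_dp(x, y):
    """Length of the LCS of x and y by plain dynamic programming, Eq. (1)."""
    M, N = len(x), len(y)
    # V has M+1 rows and N+1 columns; row 0 and column 0 stay 0.
    V = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            if x[i - 1] == y[j - 1]:                   # match: diagonal + 1
                V[i][j] = V[i - 1][j - 1] + 1
            else:                                      # mismatch: best neighbor
                V[i][j] = max(V[i - 1][j], V[i][j - 1])
    return V[M][N]

# Assuming the example strings are X = "aaaccccgggggga" and Y = "aatttg":
# lcs_dp("aaaccccgggggga", "aatttg") -> 3
```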

Fig. 1(a) shows the dynamic programming matrix used to compare two strings of 14 and 6 elements. The first column and row are not shown since in the LCS computation algorithm they are initialized to 0 and do not need to be explicitly represented.

Fig. 1. Basic dynamic programming algorithm for LCS: (a) 14 · 6 dynamic programming matrix; (b) computation steps on a systolic array of six elements; (c) binding between computation tasks and systolic array elements.


The mapping of the LCS algorithm on a 2-dimensional array of M · N computational units is obtained by assigning to each unit the homologous matrix element. Since each unit implements recursive equation (1), it needs to be connected only to the three neighbors with lower row and column indexes. On the other hand, each unit can start its computation only when its inputs are available. Starting from entry (1, 1), enabling conditions are simultaneously satisfied for all units along a secondary diagonal, so that the maximum number of units that actually execute in parallel is min{M, N}. Fig. 1(b) shows the 19 computation steps required to compute all matrix entries.

The above observations suggest that a linear systolic array of min{M, N} units would be sufficient to fully exploit the parallelism of the LCS algorithm. The same units will be re-used at each step to compute matrix entries along a different diagonal. The binding between the matrix entries and the shared units of a systolic array of six elements is shown in Fig. 1(c). Data dependencies between subsequent computation steps are also shown in the figure. Numbers denote computational units, while superscripts in brackets denote computation steps. We observe that the relative data dependencies of each unit do not change over time (see, for instance, unit 4 at steps 9 and 10). This is the key property exploited by the linear systolic implementations of LCS.
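The diagonal schedule can be mirrored in software. The sketch below (ours, not the paper's) evaluates Eq. (1) one anti-diagonal per step: cells on the same anti-diagonal are mutually independent, so at most min{M, N} of them are updated concurrently, and M + N − 1 steps suffice. For the 14 · 6 example this yields the 19 steps of Fig. 1(b).

```python
def lcs_wavefront(x, y):
    """Evaluate Eq. (1) by anti-diagonals, mimicking one systolic step
    per diagonal; cells on the same diagonal are independent."""
    M, N = len(x), len(y)
    V = [[0] * (N + 1) for _ in range(M + 1)]
    for s in range(1, M + N):                         # M + N - 1 steps
        # cells (i, j) with i + j == s + 1; at most min(M, N) of them
        for i in range(max(1, s + 1 - N), min(M, s) + 1):
            j = s + 1 - i
            if x[i - 1] == y[j - 1]:
                V[i][j] = V[i - 1][j - 1] + 1
            else:
                V[i][j] = max(V[i - 1][j], V[i][j - 1])
    return V[M][N]
```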

3. RLE–LCS algorithm

If X and Y are run-length-encoded (RLE) strings [9] of m and n runs, the encoding induces a partitioning of the LCS matrix into sub-matrices (blocks) associated with ordered pairs of runs, as shown in Fig. 2(a). With respect to a block B we call: input elements the entries of LCS that do not belong to B but provide inputs to some of its elements; root element the input element with the smallest row and column indexes; output elements the elements of B that feed entries of LCS that do not belong to B; inner elements all other elements of B.

It has been shown [5] that an output element LCS(i, j), belonging to a block rooted in LCS(i₀, j₀), can be incrementally computed by means of the following recursive equation:

$$
\mathrm{LCS}(i,j) =
\begin{cases}
0, & i = 0 \ \text{or}\ j = 0\\
\max\{\mathrm{LCS}(i_0,j),\ \mathrm{LCS}(i,j_0)\}, & x_i \neq y_j \ \text{and}\ i,j > 0\\
\mathrm{LCS}(i-d,j-d) + d, & x_i = y_j \ \text{and}\ i,j > 0
\end{cases}
\qquad (2)
$$

where d = min{i − i₀, j − j₀} is the minimum distance of LCS(i, j) from the block inputs. Data dependencies are pictorially shown in Fig. 2(a) for matrix entry LCS(11, 5).

The dynamic programming algorithm based on Eq. (2), denoted by RLE–LCS, is shown in Fig. 3. It is worth noting that (i) the inner elements of the blocks are never visited, (ii) there are no functional dependencies among the output elements of the same block, and (iii) the comparison between the symbols in X and Y is performed only once for each block.

Fig. 2. RLE–LCS dynamic programming algorithm: (a) 14 · 6 dynamic programming matrix; (b) computation steps on a systolic array of 16 elements; (c) binding between computation tasks and systolic array elements.


Fig. 3. RLE–LCS algorithm for computing LCS between RLE strings.



Since inner elements are never visited, the complexity of RLE–LCS reduces to O(mN + Mn − mn), that is, the total size of the block boundaries. In addition, the lack of functional dependencies among the outputs of the same block grants the algorithm improved parallelism that can be exploited to perform the computation in O(m + n) steps.
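To make the boundary-only computation concrete, here is a Python sketch of the recurrence of Eq. (2). It is our reading of the algorithm of [5], not the pseudocode of Fig. 3: it stores only the block-boundary entries (the last row and last column of each block) in a dictionary, so memory holds exactly the O(mN + Mn − mn) boundary entries.

```python
def rle(s):
    """Run-length encode s into (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_lcs(x, y):
    """LCS length via Eq. (2): only output elements of each block (its
    last row and last column) are computed; inner elements are skipped."""
    V = {}                                    # sparse boundary entries
    def get(i, j):
        return 0 if i == 0 or j == 0 else V[(i, j)]
    i0 = 0
    for cx, lx in rle(x):                     # blocks in row-major order
        i1 = i0 + lx
        j0 = 0
        for cy, ly in rle(y):
            j1 = j0 + ly
            for j in range(j0 + 1, j1 + 1):   # last row of the block
                if cx == cy:
                    d = min(i1 - i0, j - j0)
                    V[(i1, j)] = get(i1 - d, j - d) + d
                else:
                    V[(i1, j)] = max(get(i0, j), get(i1, j0))
            for i in range(i0 + 1, i1 + 1):   # last column of the block
                if cx == cy:
                    d = min(i - i0, j1 - j0)
                    V[(i, j1)] = get(i - d, j1 - d) + d
                else:
                    V[(i, j1)] = max(get(i0, j1), get(i, j0))
            j0 = j1
        i0 = i1
    return get(len(x), len(y))
```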

The ideal steps of a parallel implementation of RLE–LCS are shown in Fig. 2(b), while a possible binding to the units of a linear systolic array is shown in Fig. 2(c).

We remark that (i) the degree of parallelism changes at each step (the maximum number of units executing in parallel being 16, rather than 6), and (ii) data dependencies among computational units change over time. These observations make it difficult to map the RLE–LCS algorithm onto a linear systolic array, because of the need for complex control/steering logic that dynamically changes the interconnect network at each step.

In the following section we propose a modified version of the RLE–LCS algorithm, conceived to compare a short string against a much longer one (i.e., M ≫ N), that solves the above-mentioned mappability issues. The idea is to enforce a regular structure of data dependencies at the cost of decompressing the shorter string (say Y) and imposing an upper bound on the length of the runs of the longer string (say X), thus obtaining a computation time of O(m̃ + N), where m̃ is the number of run-length-limited runs of X (m ≤ m̃ ≤ M).

4. RLLE–LCS algorithm

The main problem in mapping RLE–LCS on a systolic array is the variable length of the runs of X and Y to be compared. We address mappability issues by reducing this variation, thus granting the algorithm a regular structure.

First, we impose an upper bound (LMax) on the maximum run length used to represent the longer string X. Runs longer than LMax are split into two or more contiguous runs associated with the same character. For instance, if LMax = 4, the run-length-limited encoding (RLLE) of our example string becomes "3a, 4c, 4g, 2g, 1a". Second, we leave the shorter string (Y) uncompressed. Since a non-encoded string can be viewed as a RLLE string with LMax = 1, in practice we limit the run lengths of strings X and Y to a maximum of LMax and 1 characters, respectively. The algorithm for computing the LCS between RLLE strings will be hereafter denoted by RLLE(LMax,1)–LCS.
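A minimal sketch of the run-length-limited encoding step, reusing rle() from the sketch above (again ours, for illustration):

```python
def rlle(s, lmax):
    """Run-length-limited encoding: split every run longer than lmax into
    contiguous runs of at most lmax characters of the same symbol."""
    runs = []
    for ch, length in rle(s):
        while length > lmax:
            runs.append((ch, lmax))
            length -= lmax
        runs.append((ch, length))
    return runs

# With LMax = 4 the example string is encoded as in the text ("3a,4c,4g,2g,1a"):
# rlle("aaaccccgggggga", 4) -> [('a', 3), ('c', 4), ('g', 4), ('g', 2), ('a', 1)]
```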

The upper bounds on the lengths of the runs of X and Y limit the size of the blocks of the LCS matrix associated with the run pairs. This allows us to ideally allocate a maximum-size block to each pair of runs under comparison, obtaining an oversized matrix of m̃·LMax × N entries, where m̃ is the number of length-limited runs of X. The oversized matrix for our example is shown in Fig. 4(a), for LMax = 4.

We align each run of X to the end of the corresponding block, adding gaps to all runs shorter than LMax. Matrix entries associated with gaps are shaded in Fig. 4(a) since they do not need to be computed (nor represented in practice). All other entries are incrementally computed according to recursive equation (2).

Fig. 4. RLLE–LCS dynamic programming algorithm: (a) 14 · 6 dynamic programming matrix; (b) computation steps on a systolic array of 20 elements; (c) binding between computation tasks and systolic array elements.

The complexity of the algorithm is given by the number of entries to be computed, that is, M · N as in the original LCS algorithm. Hence, the restrictions applied to the length of the runs impair the benefits of string compression in terms of computational complexity. However, the RLLE of X can still be used to speed up parallel execution, as shown in Fig. 4(b). In fact, up to 14 matrix entries can be updated in parallel, and the overall computation requires only 10 (rather than 19) steps.

In general, the number of steps required to compute RLLE(LMax,1)–LCS is O(m̃ + N), where m̃ is the number of limited-length runs of X (evaluated according to the imposed upper bound) and N is the number of characters of Y. The maximum number of computational units required to fully exploit the inherent parallelism is LMax · min{m̃, N}.
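For the running example (m̃ = 5 runs after limiting with LMax = 4, and N = 6), these formulas match the figures quoted above:

$$
\tilde m + N - 1 = 5 + 6 - 1 = 10 \ \text{steps}, \qquad
L_{Max}\cdot \min\{\tilde m, N\} = 4 \cdot 5 = 20 \ \text{units}.
$$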

The binding between matrix entries and computational units is shown in Fig. 4(c). It is worth noting that relative data dependencies among computational units are kept unchanged during subsequent steps, thus enabling the mapping on a linear systolic array. There are only two exceptions to be handled:

• computational units are inactive when they are associated with dummy matrix entries;
• in case of a match, input data need to be taken from two different sources depending on the size of the run.

Implementation issues will be addressed in the next section.

5. Systolic array implementation

To describe the systolic implementation of the RLLE(LMax,1)–LCS algorithm we refer to the block in position (i, j) of the extended matrix, associated with the ith run of X and the jth element of Y. The LMax elements of each block are indexed from 0 to LMax − 1. To denote element k of block (i, j) we use the following notation: (i, j)[k].

Fig. 5(a) shows the position of block (i, j) in the original and extended LCS matrix, while Fig. 5(b) shows the data dependencies of its elements on those of blocks (i, j − 1), (i − 1, j − 1) and (i − 1, j). In case of a match between characters xi and yj, element (i, j)[k] is computed by adding 1 to the value found either in (i, j − 1)[k + 1] or in (i − 1, j − 1)[0], depending on whether or not the ith run of X is longer than k + 1. In case of a mismatch, the value of (i, j)[k] is the maximum between (i, j − 1)[k] and (i − 1, j)[0].

Fig. 5. (a) Generic block in position (i, j) of basic and extended LCS matrices. (b) Local connections required to provide input data to the elements of block (i, j). (c) Mapping of the RLLE(LMax,1)–LCS algorithm on a linear systolic array. (d) Schematic of the basic computational unit.


The interconnect structure shown in Fig. 5(b) is general enough to support all possible situations.

In practice, Fig. 5(b) represents the mapping of the algorithm onto a bi-dimensional systolic array. The mapping on a linear array of computational elements is shown in Fig. 5(c), where only two blocks are shown and the computational units associated with the same block are vertically aligned for the sake of readability.

At a generic execution step, the right-most elements are processing block (i, j), while the left-most elements are processing block (i + 1, j − 1). The inputs needed to compute the entries of block (i, j) have been computed by the same processing elements during the last two execution steps. Hence, the output values need to be stored in registers in order to be made available for subsequent computations. Memory elements are represented in Fig. 5(c) by means of small squares, shaded using the same patterns used in Fig. 5(a) and (b). In general, all the outputs of each block need to be kept in memory for one clock cycle, while the output of element 0 needs to be stored for two clock cycles.

The RTL schematic of the computational unit associated with each element is shown in Fig. 5(d). The four inputs that represent data dependencies are shown to the right, while two additional control inputs are used to select among different internal paths. Signal match is a Boolean flag that represents the result of the comparison between xi and yj. Signal empty is a Boolean flag (raised whenever entry (i, j − 1)[k + 1] is empty) used to select between the two alternative inputs to be used in case of a match. The comparator is not represented within the basic computational element, since it is shared among all elements belonging to the same block.
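As a behavioral reading of the unit of Fig. 5(d), the following sketch mirrors the two recurrence cases; the match and empty signals come from the text, while the port names (left_k, left_k1, diag0, up0) are ours:

```python
def rlle_cell(match, empty, left_k, left_k1, diag0, up0):
    """One RLLE-LCS computational unit, updating element (i, j)[k].

    left_k  : (i, j-1)[k]      left_k1 : (i, j-1)[k+1]
    diag0   : (i-1, j-1)[0]    up0     : (i-1, j)[0]
    """
    if match:                        # xi == yj
        # 'empty' selects the root input when the run is not longer than k+1
        return (diag0 if empty else left_k1) + 1
    return max(left_k, up0)          # mismatch: max over block inputs
```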

6. Discussion and conclusions

The basic computational unit shown in Fig. 5(d) has the same structure as those used in the systolic implementations of the basic LCS algorithm. The only differences are the presence of the MUX controlled by signal empty and the possibility of sharing the comparator among all units belonging to the same block.


Since comparators and MUXes have similar complexity, the basic computational elements of both algorithms have approximately the same area and critical path delay. Hence, using a common target technology and design style, the two algorithms may be executed at the same clock frequency, so that the reduced number of steps required by RLLE(LMax,1)–LCS corresponds to a shorter computation time.

The speedup achieved depends on the effectiveness of the compression provided by RLLE on string X. For a given string X of M characters, the number of runs generated according to a given upper bound LMax is m̃ ≥ m, where m is the number of unconstrained runs. The larger the upper bound LMax, the lower the number of runs and the higher the speedup.

Fig. 6(a) shows the compression ratio m̃/M as a function of LMax for two different types of strings: a DNA sequence and a bitmap representing scanned handwritten text. When LMax = 1 there is no compression (m̃ = M), while m̃ approaches the asymptotic value m as LMax gets larger. From Fig. 6(a) we observe that RLE (and RLLE) are much more effective on a bitmap than on a DNA sequence, the asymptotic compression ratios being 0.102 and 0.724, respectively, leading to speedups of about 90% and 28% on LCS computation. We also observe that most of the opportunities for compression are offered by short runs. For LMax = 5, the compression ratio has already reached the asymptotic value on the DNA sequence, while for the bitmap it takes the value 0.224. This means that RLLE(5,1)–LCS provides a speedup of about 80% over the basic LCS algorithm, if applied to the comparison of two bitmaps. This is almost the same speedup theoretically achievable by an ideal parallel implementation of the RLE–LCS algorithm.
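A back-of-the-envelope check of the quoted speedups (our reading, assuming M ≫ N so that step counts are dominated by M and m̃, respectively):

$$
\text{speedup} \approx 1 - \frac{\tilde m + N}{M + N} \approx 1 - \frac{\tilde m}{M},
\qquad 1 - 0.102 \approx 90\%, \qquad 1 - 0.724 \approx 28\%.
$$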

To further investigate the efficiency of the proposed approach we computed the product of the computation time and the number of computational units in the linear systolic array (hereafter denoted by time–space product) for different values of LMax. When LMax = 1, the RLLE(LMax,1) algorithm reduces to the basic LCS algorithm: its parallel implementation requires only min{M, N} units and executes in O(M + N) steps. For LMax > 1, a larger number of computational units is used in order to exploit the improved parallelism granted by the encoding. In the ideal situation, all additional units work in parallel so that the time–space product is constant. In practice, the actual exploitability of the parallel hardware depends on the effectiveness of RLLE. The graph of Fig. 6(b) represents the behavior of the time–space product as a function of LMax. The time–space product of the original LCS algorithm (corresponding to LMax = 1) is used for normalization. The ideal situation (maximum efficiency) is represented by the horizontal line, while the worst-case situation (minimum efficiency) is represented by the diagonal line. As expected, the proposed approach is much more effective on bit streams than on DNA. It is worth noting that the normalized time–space product of the bit stream is close to the horizontal line. This demonstrates the efficiency of the proposed parallel implementation.

In conclusion, the run-length limit (LMax) can be used to span the tradeoff between performance and area. In fact, the higher LMax, the lower the number of execution steps and the higher the number of computational units required to fully exploit the parallelism of the RLLE(LMax,1)–LCS algorithm.

Fig. 6. (a) Efficiency of RLLE, expressed as the ratio between the number of runs in the RLLE(LMax) representation and the number of characters in the original string (in symbols, m̃/M). (b) Normalized time–space product for the parallel implementation of RLLE(LMax,1).


References

[1] A. Apostolico, G. Landau, S. Skiena, Matching for run-length encoded strings, Journal of Complexity 15 (1999) 4–16.
[2] O. Arbell, G. Landau, J. Mitchell, Edit distance of run-length encoded strings, Information Processing Letters 83 (2002) 307–314.
[3] H. Bunke, J. Csirik, An improved algorithm for computing the edit distance of run length coded strings, Information Processing Letters 54 (1995) 93–96.
[4] M. Crochemore, G. Landau, M. Ziv-Ukelson, A subquadratic sequence alignment algorithm for unrestricted scoring matrices, SIAM Journal on Computing 32 (2003) 1654–1673.
[5] V. Freschi, A. Bogliolo, Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism, Information Processing Letters 90 (2004) 167–173.
[6] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, UK, 1999.
[7] D. Lopresti, P-NAC: a systolic array for comparing nucleic acid sequences, Computer 24 (1987) 6–13.
[8] M.S. Sepesy, Z. Kacic, B. Horvat, Modelling highly inflected languages, Information Sciences 166 (2004) 249–269.
[9] K. Sayood, Introduction to Data Compression, Morgan Kaufmann, San Francisco, CA, 2000.
[10] Y.H. Wang, Image indexing and similarity retrieval based on spatial relationship model, Information Sciences 154 (2003) 39–58.
[11] J. Ziv, A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory 24 (1978) 530–536.