
Parallel Parsing: The Earley and Packrat Algorithms

Seth Fowler and Joshua Paul

May 19, 2009

1 Introduction

Parsing plays a critical role in our modern computer infrastructure: scripting languages such as Python and JavaScript, layout languages such as HTML, CSS, and Postscript/PDF, and data exchange languages such as XML and JSON are all interpreted, and so require parsing. Moreover, by some estimates, the time spent parsing while producing a rendered page from HTML, CSS, and JavaScript accounts for as much as 40% of the total. Motivated by the importance of parsing, and also by the widespread adoption of chip multiprocessors, we are interested in investigating the potential for parallel parsing algorithms.

We begin with one classical and one more modern parsing algorithm, the Earley and the Packrat algorithm, respectively, and consider methods for parallelizing both. The general strategy is the same in both cases: break the string to be parsed into contiguous blocks, process each block in parallel, and finally re-join the blocks. We find that using this general framework, we are able to obtain a meaningful speedup relative to serial implementations of each algorithm. In particular, we obtain a speedup of 5.4 using the parallel Earley implementation on 8 processors, and a speedup of 2.5 using the parallel Packrat algorithm on 5 processors.

Another algorithmic property that we consider is work efficiency, a measure of the amount of extra work done by the parallel algorithm relative to the serial algorithm. Poor work efficiency indicates that the parallel algorithm is doing considerably more work than is required, and thereby using more energy. With the relative importance of energy increasing, even a parallel algorithm that achieves significant speedup may not be worthwhile if it exhibits poor work efficiency. Though it is difficult to formulate a precise definition of work efficiency in terms of program operations, we qualitatively demonstrate that our algorithms, with some caveats, exhibit reasonable work efficiency.

Related work: There is a substantial literature related to the hardness of parsing context-free grammars (CFGs), most of it related to the result of Valiant[8] that CFG parsing is equivalent to Boolean matrix multiplication. This equivalence provides, in theory, parallel algorithms for parsing, but these (to the authors' knowledge) have not been put into practice.

There are several published algorithms (see, for example, Hill and Wayne[6]) for parallelizing a relative of the Earley algorithm, the CYK algorithm. Unfortunately, in practice CYK has a much longer running time than Earley (even though it has the same worst-case complexity of O(n³)), and so it is not typically used.

For the Earley algorithm itself, there are very few parallelization methods published (though there are many optimizations; see, for example, Aycock and Horspool[1]). One such method, by Chiang and Fu[3], uses a decomposition similar to the one we develop, but goes on to develop the algorithm for use on specialized VLSI hardware. Similarly, Sandstrom[7] develops an algorithm based on a similar decomposition, but implements it using a message-passing architecture. In both cases, the grain of the parallelization is much finer than what we propose, and there is no notion of speculation, a central component of our algorithm.

Finally, to the authors' knowledge, there are no published works on parallelizing the Packrat algorithm.

2 Background

Context-free grammars: A formal grammar is a set of formation rules that implicitly describe syntactically valid sequences of output symbols from some alphabet. A context-free grammar is a specific type of grammar, specified by:

• Σ, a set of terminal symbols.
• V, a set of non-terminal symbols.
• R, a set of production rules, each taking a single non-terminal to an arbitrary sequence of terminal and non-terminal symbols.
• S, the designated start non-terminal.

Using the production rules, it is possible to construct a sequence of terminals, beginning with the start non-terminal. Such a sequence is said to be derived from the grammar, and the steps in the construction comprise a derivation of the sequence. The parsing problem can then be stated formally as follows: given a sequence of terminals, determine whether it can be derived from the grammar, and if so, produce one or more such derivations.

When referring to a context-free grammar, the following notation will be used hereafter: A, B will be used to denote non-terminals (e.g. A, B ∈ V); a, b will be used to denote terminals (e.g. a, b ∈ Σ); α, β, γ will be used to denote arbitrary (possibly empty) sequences of terminals and non-terminals (e.g. α, β, γ ∈ (V ∪ Σ)∗).

The Earley algorithm: Given as input a length-n sequence of terminals Tn = x1x2 . . . xn, the Earley algorithm[4] constructs n + 1 Earley sets: an initial set S0, and a set Si associated with each input terminal xi. Elements of these sets are Earley items, each of which is intuitively an 'in-progress' production rule of the grammar. The Earley sets are constructed in such a way that Si contains all possible Earley items for terminals x1x2 . . . xi; upon completion, Sn therefore contains all possible Earley items for the entire input sequence.

Figure 1: Illustration of the Scan, Predict, and Complete mechanisms, supposing that our grammar contains the rule B → a and that xi = a.

Formally, an Earley item comprises a production rule, a position in the right-hand side of the rule indicating how much of the rule has been seen, and an index to an earlier set indicating where the rule began (we frequently refer to this as the origin of the item). Earley items are written:

[A → α • β, j],

where • denotes the position in the right-hand side of the rule and j indexes the Earley set Sj. Items may be added to the set Si using the following three mechanisms:

Scan: If i > 0, [A → α • aβ, j] is in Si−1, and a = xi, add [A → αa • β, j] to Si.

Predict: If [A → α • Bβ, j] is in Si and B → γ is a production rule, add [B → •γ, i] to Si.

Complete: If [A → α•, j] is in Si and [B → β • Aγ, k] is in Sj, add [B → βA • γ, k] to Si.

See Figure 1 for an illustration of each mechanism, and observe that constructing Earley set Si depends only on {Sj | j ≤ i}. Thus, dynamic programming may be used, with Earley sets constructed in increasing order.

Recalling that S is the start non-terminal, Earley set S0 is initialized to contain [S → •α, 0] for all rules S → α. Upon completion, the input sequence can be derived if and only if Sn contains [S → α•, 0] for some rule S → α; if the sequence can be derived, one or more derivations may be obtained by traversing backwards through the Earley sets. See Algorithm 1 for pseudocode.

Algorithm 1 Earley(Tn)
  S0 ← {[S → •α, 0] | S → α ∈ R}
  for i = 0 to n do
    repeat
      apply Scan, Predict, Complete to Si
    until nothing new added to Si
  end for
  if [S → α•, 0] ∈ Sn for some S → α then
    return true
  else
    return false
  end if
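As an illustrative sketch (not the implementation evaluated in Section 4), a minimal serial Earley recognizer in Java might look as follows; the Item class and the rule representation here are simplified choices for brevity (single-character terminals, a map from each non-terminal to its alternatives), and ε-rules would need the extra care described by Aycock and Horspool[1].

import java.util.*;

final class EarleyRecognizer {
    // An Earley item: a rule (lhs -> rhs), a dot position, and an origin index.
    static final class Item {
        final String lhs; final List<String> rhs; final int dot; final int origin;
        Item(String lhs, List<String> rhs, int dot, int origin) {
            this.lhs = lhs; this.rhs = rhs; this.dot = dot; this.origin = origin;
        }
        boolean complete() { return dot == rhs.size(); }
        String next() { return complete() ? null : rhs.get(dot); }
        Item advance() { return new Item(lhs, rhs, dot + 1, origin); }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Item)) return false;
            Item x = (Item) o;
            return dot == x.dot && origin == x.origin
                    && lhs.equals(x.lhs) && rhs.equals(x.rhs);
        }
        @Override public int hashCode() { return Objects.hash(lhs, rhs, dot, origin); }
    }

    final Map<String, List<List<String>>> rules; // non-terminal -> alternative right-hand sides
    final String start;

    EarleyRecognizer(Map<String, List<List<String>>> rules, String start) {
        this.rules = rules; this.start = start;
    }

    boolean recognize(String input) {
        int n = input.length();
        List<Set<Item>> S = new ArrayList<>();
        for (int k = 0; k <= n; k++) S.add(new LinkedHashSet<>());
        for (List<String> rhs : rules.get(start)) S.get(0).add(new Item(start, rhs, 0, 0));
        for (int i = 0; i <= n; i++) {
            Deque<Item> work = new ArrayDeque<>(S.get(i));
            while (!work.isEmpty()) {
                Item it = work.poll();
                if (it.complete()) {
                    // Complete: advance items in S[origin] that were waiting on it.lhs.
                    for (Item p : new ArrayList<>(S.get(it.origin)))
                        if (it.lhs.equals(p.next())) add(S.get(i), work, p.advance());
                } else if (rules.containsKey(it.next())) {
                    // Predict: start every rule for the expected non-terminal at i.
                    for (List<String> rhs : rules.get(it.next()))
                        add(S.get(i), work, new Item(it.next(), rhs, 0, i));
                } else if (i < n && it.next().equals(String.valueOf(input.charAt(i)))) {
                    // Scan: the expected terminal matches the next input symbol.
                    S.get(i + 1).add(it.advance());
                }
            }
        }
        for (Item it : S.get(n))
            if (it.complete() && it.lhs.equals(start) && it.origin == 0) return true;
        return false;
    }

    // Add an item to the current Earley set, queuing it for processing if new.
    private static void add(Set<Item> set, Deque<Item> work, Item it) {
        if (set.add(it)) work.add(it);
    }
}

For instance, with the rules S → aS and S → a supplied as rules.put("S", List.of(List.of("a", "S"), List.of("a"))), recognize("aaa") returns true.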

Parsing expression grammars: Though their roots lie in Top-Down Parsing Language[2], a formal grammar invented in the 1970s to characterize top-down recursive parsers, parsing expression grammars (PEGs) are a relatively recent family of grammars which first appeared in the literature in 2002[5]. PEGs, as their heritage suggests, describe the process by which a parser may recognize that a string belongs to a language, rather than the process by which a string belonging to a language may be generated. They are defined by a 4-tuple of the same form as that used for CFGs, but the interpretation of the rules and the set of operators that may be used within a rule are different. Table 1 displays the operators available in a PEG, many of which resemble regular expression operators.

Operator           Example
Sequence           [A ← BC]
Ordered choice     [A ← B/C]
Zero or more       [A ← B∗]
One or more        [A ← B+]
Optional           [A ← B?]
Followed by        [A ← &B]
Not followed by    [A ← !B]

Table 1: PEG operators

In a PEG, each rule is treated as a deterministic algorithm; for example, the simple rule [A ← BC] instructs the parser that it may recognize an A by first recognizing a B and then, if that is successful, by recognizing a C. This property means that PEGs, unlike CFGs, cannot directly support left recursion; [A ← AB], which requires that a parser recognize an A by first recognizing an A, can never match successfully. In practice, this is not a problem, since left-recursive rules can always be rewritten to eliminate left recursion[5].
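For instance, the left-recursive rule [A ← Aa/b], which is intended to match a b followed by any number of a's, can be rewritten without left recursion using the zero-or-more operator from Table 1 as [A ← ba∗]; the rewritten rule recognizes the same strings but can be evaluated top-down.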

The alternation operator |, used either explicitly or implicitly in CFGs when there are multiple ways to construct the same nonterminal, does not have a precise equivalent in PEGs. Instead, PEGs use the ordered choice operator /. From the point of view of the parser, the alternation operator has the effect that all of the alternatives are considered simultaneously. It is possible for more than one of the alternatives to match, in which case both possibilities must be explored further by the parser. In contrast, the ordered choice operator is interpreted in an algorithmic, "if-else" fashion; for example, [A ← B/C] instructs the parser to recognize an A by first attempting to recognize a B, and only if that fails to attempt to recognize a C. Because of the difference between these two operators, CFGs can express ambiguous grammars, while PEGs cannot, even when ambiguity is desired. PEGs are therefore better suited to machine-readable languages than to natural languages.
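As a concrete illustration of the difference, consider the input ab. A CFG with the alternation A → a | ab can derive the entire input using the second alternative. The PEG rule [A ← a/ab], by contrast, commits to the first alternative as soon as it succeeds: A consumes only the single character a, the second alternative is never tried, and the trailing b must be matched by whatever rule invoked A.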

Packrat parsing: Just as parsing expression grammars are a modern extension of TDPL, packrat parsing extends a theoretical linear-time algorithm for parsing TDPL which also dates from the 1970s[5]. Packrat parsing, like its ancestor, requires constructing a table in memory that grows in size linearly with the string to be parsed; this almost always requires more memory than popular parsing algorithms for LL(1) and LR(1) grammars, which only require memory proportional to the depth of the parse tree for a given input string. This memory requirement made the algorithm impractical for decades, but modern machines have enough memory to make packrat parsing realistic.


The packrat table is two-dimensional, with the rows representing nonterminals in the PEG and the columns representing positions in the input string. Each cell C[i, j] in the table stores the result of an attempt to recognize nonterminal i at position j. If the nonterminal is matched successfully, the cell stores the first position after that match – for example, if the parser consumed two characters, starting at position j, in the process of matching nonterminal i, the cell will hold the value j + 2. If nonterminal i is not matched successfully, the cell will hold a special value indicating that the recognition attempt failed.

The rule for matching nonterminal i may refer to another nonterminal. If nonterminal u is referenced at input string position v, the packrat parser proceeds by examining the value of cell C[u, v]; it may either find a reference to a later position v + n, in which case it advances to that position and attempts to match the next part of the rule for i, or it may find the special failure value, which signals that nonterminal u does not match at position v. The packrat parser may also find that cell C[u, v] has not yet been evaluated; in this case, it recursively evaluates C[u, v] before completing its original computation. Each cell memoizes the result of a potentially expensive recursive computation; by ensuring that no cell is ever evaluated more than once, the packrat parser can evaluate any PEG in linear time with respect to the length of the input string.

To parse a string using the packrat algorithm, the parser begins by evaluating the cell corresponding to the first position in the string and the start symbol of the grammar. Evaluation of cells continues recursively until it is possible to determine whether the start symbol matched or not. The start symbol matches if and only if the parse was a success and the input string belongs to the language specified by the grammar. In the common case, only a fraction of the cells in the packrat table will have been evaluated at this point; the majority are never touched because they could not possibly contribute to matching the start symbol. In this way, the packrat algorithm uses lazy evaluation to perform a minimal amount of work despite the large size of its underlying data structure.
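The memoization at the heart of the algorithm is easy to see in code. Below is a minimal packrat-style recognizer in Java for the single hypothetical rule S ← aS/a (one or more a's); the table degenerates to a single row because the grammar has only one nonterminal, but the cell discipline – evaluate each cell at most once, then reuse the result – is exactly as described above. This is a sketch of the idea, not the implementation evaluated in Section 4.

import java.util.Arrays;

final class PackratSketch {
    static final int FAIL = -1, UNSET = -2;
    final String input;
    final int[] memoS; // memoS[j] = first position after matching S at j, or FAIL

    PackratSketch(String input) {
        this.input = input;
        this.memoS = new int[input.length() + 1];
        Arrays.fill(memoS, UNSET); // no cell has been evaluated yet
    }

    // Evaluate cell C[S, j] for the rule S <- aS / a, memoizing the result
    // so that each cell is computed at most once.
    int matchS(int j) {
        if (memoS[j] != UNSET) return memoS[j];      // cell already evaluated
        int result = FAIL;
        if (j < input.length() && input.charAt(j) == 'a') {
            int rest = matchS(j + 1);                // first choice: a S
            result = (rest != FAIL) ? rest : j + 1;  // ordered fallback: just a
        }
        return memoS[j] = result;
    }

    public static void main(String[] args) {
        PackratSketch p = new PackratSketch("aaaa");
        // The parse succeeds iff S matches the entire input from position 0.
        System.out.println(p.matchS(0) == p.input.length()); // prints true
    }
}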

Figure 2: Illustration of the partition of the Earley sets S0, S1, . . . , S35 into the blocks B1, B2, B3, and the associated decomposition into Earley subsets and sub-blocks.

3 Parallelizing the Earley and Packrat algorithms

3.1 Parallelizing Earley

We proceed towards a parallel Earley algorithm in two steps: first, we show that the operation of the serial Earley algorithm can be decomposed into sub-blocks that depend upon one another in a very regular way; second, we propose a method for removing a key set of dependencies among the sub-blocks, thereby allowing them to be computed in parallel.

Sub-block decomposition: Recall that the serial Earley algorithm associates an Earley set Si with each terminal xi. The first step in the decomposition is to partition the Earley sets into a sequence of m contiguous blocks, B1, B2, . . . , Bm. An Earley item belonging to an Earley set in block Bi may have its origin in any block Bj with j ≤ i. Thus motivated, the second step in the decomposition is to partition each Earley set Sp in block Bi into Earley subsets Sp,0, Sp,1, . . . , Sp,i−1, with:

Sp,j = {e ∈ Sp | block origin of e = i − j}.

Additionally define Bi,j = {Sp,j | Sp ∈ Bi}, and refer to this as a sub-block of block Bi. An Earley subset Sp,j in sub-block Bi,j contains only Earley items with an origin in block Bi−j. See Figure 2 for an illustration of the decomposition and terminology.
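As a worked instance of this definition: in Figure 2, the Earley set S14 lies in block B2 (so i = 2), and is therefore partitioned into S14,0, containing items whose origin lies in B2 itself, and S14,1, containing items whose origin lies in B1; likewise S35 in block B3 splits into S35,0, S35,1, and S35,2, with origins in B3, B2, and B1 respectively.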

Figure 3: Illustration of the L-Scan and L-Complete mechanisms for several representative cases. Compare with the ordinary Scan and Complete mechanisms in Figure 1.

Having characterized the decomposition of Earley sets into sub-blocks, we now describe how to efficiently compute the Earley subsets therein. Suppose that we are interested in computing the Earley subset Sp,j in sub-block Bi,j (with j > 0). There are two mechanisms by which items may be added, both modifications of the mechanisms described for the ordinary Earley algorithm:

L-Scan: Suppose Sp−1 ∈ Bi: then if [A → α • aβ, p′] is in Sp−1,j and a = xp, add [A → αa • β, p′] to Sp,j. Otherwise, Sp−1 ∈ Bi−1: if [A → α • aβ, p′] is in Sp−1,j−1 and a = xp, add [A → αa • β, p′] to Sp,j.

L-Complete: If [A → α•, p′] is in Sp,k with k ≤ j, and [B → β • Aγ, p′′] is in Sp′,j−k (in sub-block Bi−k,j−k), add [B → βA • γ, p′′] to Sp,j.

These mechanisms are restricted relative to the analogous mechanisms in the serial Earley algorithm, since Earley subsets in sub-block Bi,j contain only Earley items with origin in block Bi−j. In particular, the scan mechanism may depend only on Sp−1,j ∈ Bi,j or Sp−1,j−1 ∈ Bi−1,j−1, since the resulting item must have origin in block Bi−j = B(i−1)−(j−1). Similarly, if a completion is made with respect to an Earley item in block Bi−k (corresponding to the item in Sp,k ∈ Bi,k), it must be in sub-block Bi−k,j−k so that the resulting item has origin in block B(i−k)−(j−k) = Bi−j. Finally, there is no predict mechanism, since predict would create Earley items with origin p (and thus in block Bi ≠ Bi−j).

See Figure 3 for an illustration of the mechanisms, and observe that, analogously to the ordinary Earley algorithm, the computation of sub-block Bi,j (with j > 0) depends only on sub-blocks {Bi′,j′ | i′ − j′ ≤ i − j}. We have thus far avoided discussion of sub-block Bi,0, as it does not naturally fit into the previous analysis. Indeed, straightforward computation of the sub-block Bi,0 depends on sub-block Bi−1,0, since successfully applying the predict mechanism requires all Earley items, regardless of their origin. This leads to the sub-block dependence graph depicted in Figure 4.

Figure 4: Computational dependencies for sub-blocks. The dotted lines are the dependencies associated with the sub-blocks Bi,0 under a naïve computation.

Elimination of key dependencies: We now propose a method for computing the Earley subset Sp,0 of sub-block Bi,0 without relying on the aforementioned dependencies. Intuitively, rather than explicitly depending upon previous sub-blocks, the method assumes that every Earley item in previous sub-blocks is present. Such 'speculative' Earley items are represented as ordinary Earley items, but with an unspecified origin:

[A → α • β, −1].

There are three mechanisms by which items may be added to Earley subset Sp,0, all modifications of the mechanisms described for the ordinary Earley algorithm:

S-Scan: Suppose Sp−1 ∈ Bi: then if [A → α • aβ, p′] is in Sp−1,0 and a = xp, add [A → αa • β, p′] to Sp,0. Otherwise, Sp−1 ∈ Bi−1: add the speculative item [A → αa • β, −1] to Sp,0 for all rules A → αaβ with a = xp.

S-Predict: If [A → α • Bβ, p′] is in Sp,0 and B → γ is a rule in R, add [B → •γ, p] to Sp,0.

S-Complete: Suppose [A → α•, p′] is in Sp,0. If p′ ≥ 0 and [B → β • Aγ, p′′] is in Sp′,0, add [B → βA • γ, p′′] to Sp,0. If p′ = −1, add the speculative item [B → βA • γ, −1] to Sp,0 for all rules B → βAγ.


Figure 5: Illustration of the S-Scan and S-Complete mechanisms, supposing that our grammar contains the rules A → αBβ and C → γBδ, and that xi = a.

As in the ordinary Earley algorithm, S0,0 is initialized to contain [S → •α, 0] for all rules S → α. See Figure 5 for a simple illustration of these mechanisms.

The method described above is potentially computationally costly, since many speculative items may have to be processed. Additionally, it may produce 'excess' Earley items originating in block i, namely those that are not produced by the serial algorithm. Characterizing the number of speculative and excess items and their effect on the work efficiency and speedup of the parallelized algorithm will be an objective of the following sections.

Parallel computation: Having eliminated the key dependencies, sub-blocks can be computed in parallel. In particular, sub-blocks Bi,0 for 1 ≤ i ≤ m do not depend on any prior computation, and sub-block Bi,j (for j > 0) directly depends only on the computation of sub-block Bi−1,j−1. Thus, it is natural to express parallelism at the level of blocks, with the associated sub-blocks computed one by one, starting with Bi,0. Algorithm 2 provides pseudocode for computing block Bi.

Algorithm 2 ParallelEarley(Tn, i)
  s ← start index of block Bi
  e ← end index of block Bi
  if s = 0 then
    Ss,0 ← {[S → •α, 0] | S → α ∈ R}
  else
    Ss,0 ← {[A → αa • β, −1] | A → αaβ ∈ R, a = xs}
  end if
  for p = s to e do
    repeat
      apply S-Scan, S-Predict, S-Complete to Sp,0
    until nothing new added to Sp,0
  end for
  notify sub-block Bi+1,1
  for j = 1 to i − 1 do
    wait for sub-block Bi−1,j−1
    for p = s to e do
      repeat
        apply L-Scan, L-Complete to Sp,j
      until nothing new added to Sp,j
    end for
    notify sub-block Bi+1,j+1
  end for

Implementation details: Though the parallel Earley algorithm as written above is a complete description, several implementation details help to produce a program that achieves a significant speedup. The most important such details follow.

Memory allocation and caching: Consider the following properties of the parallel Earley algorithm:

• After an Earley subset is computed, it can never be written again.

• An Earley subset may be read several times thereafter, and each time the reader is interested exclusively in those Earley items that have 'terminated' or those that are 'in-progress'.

Thus, though a set data type is used to compute an Earley subset, upon completion the set is converted to a semi-ordered array to expedite future accesses. This conversion has the added benefit of saving memory (sets have significant overhead) and improving locality (arrays are contiguous in memory).

Moreover, such a conversion requires an iteration over the set, and so it becomes computationally inexpensive to empty the set in the process. Afterwards, the set may be re-used to compute the next Earley subset, sparing the allocation and initialization of a new set object.
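As a short sketch of this conversion (ours; Item stands for whatever Earley-item type the implementation uses):

// Freeze a finished Earley subset into a compact, contiguous array and
// recycle the (now empty) set object for the next Earley subset.
static Item[] freezeAndRecycle(java.util.Set<Item> working) {
    Item[] frozen = working.toArray(new Item[0]); // read-only, cache-friendly snapshot
    working.clear();                              // spares allocating a fresh set
    return frozen;
}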

Thread and synchronization model: As described above, the parallel Earley algorithm allocates a single thread to each block, with sub-blocks being calculated one by one beginning with sub-block Bi,0. As a result, the only communication required between threads is ensuring that, prior to computing Bi,j (for j > 0), Bi−1,j−1 has completed (a result that follows from induction and the aforementioned read-only property of computed sub-blocks).

This can be accomplished with a counting semaphore between each pair of neighboring blocks: when the thread associated with block Bi−1 completes sub-block Bi−1,j the semaphore is incremented, and before the thread associated with block Bi begins computing Bi,j (j > 0), the semaphore is decremented.
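A minimal sketch of this pipeline in Java, using java.util.concurrent.Semaphore (our illustration, with the sub-block computation stubbed out; blocks are 0-indexed here, so block i has sub-blocks j = 0, . . . , i):

import java.util.concurrent.Semaphore;

final class BlockPipeline {
    // done[i] counts how many sub-blocks of block i have completed.
    static Semaphore[] done;

    // Placeholder for applying the S-/L- mechanisms to sub-block B(i,j).
    static void computeSubBlock(int i, int j) {
        System.out.printf("computed B(%d,%d)%n", i, j);
    }

    public static void main(String[] args) throws InterruptedException {
        final int m = 4; // number of blocks, one thread per block
        done = new Semaphore[m];
        for (int i = 0; i < m; i++) done[i] = new Semaphore(0);

        Thread[] workers = new Thread[m];
        for (int b = 0; b < m; b++) {
            final int i = b;
            workers[i] = new Thread(() -> {
                try {
                    for (int j = 0; j <= i; j++) {
                        if (j > 0) done[i - 1].acquire(); // wait for B(i-1, j-1)
                        computeSubBlock(i, j);
                        done[i].release();                // signal the thread for block i+1
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }
}

Because block i completes its sub-blocks in order j = 0, 1, . . . , the k-th acquire by the thread for block i + 1 cannot succeed until B(i, k − 1) has finished, which is exactly the diagonal dependency of Bi+1,k on Bi,k−1 described above.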

Finally, note that it may be worthwhile to partition the input terminal sequence into a number of blocks greater than the number of processors. Intuitively, this allows a processor that completes its assigned block to move on to a new block, increasing hardware utilization and performance. We empirically investigate the consequences of such a partition in the next section.

3.2 Parallelizing Packrat

The naïve approach: In the serial packrat algorithm, a single thread evaluates cells in the table recursively until it is able to determine whether the start symbol matches successfully. A naïve approach to parallelizing this algorithm is to add additional worker threads working on the same table. Each additional thread selects cells to work on based upon two criteria: the current position of the main thread, which is moving along the table from left to right, and a heuristic that it uses to choose the cells that are the most likely to be evaluated in the future by the main thread.

In the most general case, each worker thread may evaluate cells at any position. This creates synchronization problems; each time a thread chooses a cell to work on, it must ensure that no other threads are evaluating that cell, or work will be duplicated. Furthermore, all of the threads must constantly observe the position of the main thread; the further to the left of the main thread's position a cell lies, the more backtracking would be required for the main thread to reach that cell, and therefore the less likely it is that the cell will influence the result of the parse if the grammar is well-behaved. This design is even more problematic on NUMA machines or multicore machines which do not share the L2 cache between processors; the constant reads and writes to different parts of the packrat table may cause cache contention that will have a potentially profound effect on performance.

The problems of the general case can be solved by dividing the input string into blocks, and assigning each worker thread to a block. Each worker thread no longer has to observe what all of the other worker threads are doing; it can simply evaluate cells in its block until the main thread reaches the block's left side. When the main thread is about to enter a worker thread's block, it notifies that thread to terminate; this notification is the only synchronized communication that needs to occur. There are still problems, however. Terminating each worker thread as its block is reached means that later blocks have many more cells speculatively evaluated than earlier blocks, so that work efficiency gets worse and worse as we move from left to right in the packrat table. Depending upon the details of the grammar and the placement of the block boundaries, the main thread may enter and exit the same block several times; even if it restarts the block's worker thread when it exits the block, time that could have been spent speculating is wasted, and unnecessary thread creation and termination overhead must occur. Finally, we still have not eliminated the requirement that at least two threads must access every block at some point during execution, and the main thread must still access every block in the packrat table; this is potentially costly on NUMA machines or in a cluster.

Start symbol synthesis: The existence of the main thread is desirable because it ensures that the algorithm does not fall too far behind what a serial algorithm would do. However, the main thread makes it necessary to share the entire packrat table between multiple threads, and can cause poor behavior as it enters and exits a block. To maintain much of the advantage the main thread provides while eliminating its undesirable properties, we propose a message-passing model based on start symbol synthesis. The input string is divided into blocks, and each block is assigned a worker thread. Each block uses heuristics to speculate until the block to its left has finished. When a block finishes, its final act is to send the block to its right a synthesized start symbol. This new start symbol encapsulates all of the unfinished work of the block – cells that it could not evaluate using only the information within the block. It takes the form of a PEG rule with the following property: the rule matches successfully if and only if the original start symbol would have matched if its evaluation had continued to the same point. By evaluating this synthesized start symbol, the next block can continue the computation that the original start symbol began.

Preprocessing the grammar is useful for efficient start symbol synthesis. The repetition operators and the optional operator are replaced with the expressions in Table 2. These expressions make use of (), the empty string, a special terminal that always matches without consuming any input. Simplifying the operators reduces the complexity of the final algorithm; in addition, simplifying the repetition operators in this way is always necessary to guarantee linear-time parsing[5]. Next, the grammar is transformed so that the rule for each nonterminal only references one operator; this again simplifies the final algorithm, although it may involve creating new nonterminals. Sequences in the grammar are also simplified, in a manner inspired by Lisp: each sequence with more than two elements is replaced with a sequence with the same first element and a new nonterminal as the second element. The new nonterminal's rule states that it matches when a new sequence, consisting of the remaining elements of the original sequence, matches. If this new sequence contains more than two elements, it is subdivided in the same way. After this transformation is complete, every sequence in the grammar has exactly two elements (see the example below). By simplifying the sequences in this way, we gain simplicity and efficiency: every time a sequence crosses a block boundary, the part that extends into the next block is always an existing nonterminal or terminal. Terminals can always be evaluated very quickly. Existing nonterminals may already be memoized in the next block; even if they aren't, since a row exists for them in the packrat table, they can be memoized if they need to be evaluated more than once. In contrast, the portion of an arbitrary sequence that extends into the next block may be a subsequence that does not correspond to any nonterminal; if it has to be evaluated more than once, it can't be memoized, which will make the algorithm take more than linear time.

Original     Transformed
[A ← B∗]    [A ← BA/()]
[A ← B+]    [A ← BA/B]
[A ← B?]    [A ← B/()]

Table 2: PEG operator transformations

The start symbol synthesis process fits naturally into the recursive evaluation of cells in the packrat algorithm. Each cell is expanded to hold two optional fields: a complete future expression, and a position in the block corresponding to an incomplete future expression. A complete future expression is a portion of the final synthesized start symbol that the block will produce. It is complete because it reaches all of the way to the end of the block, so no more cells in the current block need to be evaluated to finish constructing it. Complete future expressions are combined using the PEG operators to produce the final synthesized start symbol. An incomplete future expression, on the other hand, is suspected to be impossible to evaluate without leaving the block, but its construction is not yet complete. Incomplete future expressions may be proved not to match within the current block, but they cannot be proved to match. They are generated by the ordered choice operator when a choice with higher priority extends outside the block, but a choice with lower priority does not. As each cell in the packrat table is evaluated recursively, it combines all of the complete future expressions of its children and resolves incomplete future expressions as much as it possibly can. The way in which it accomplishes these goals depends upon the operator used by the nonterminal corresponding to that cell:

Ordered choice: Let c be the first of the choices that either matches successfully within the block, or yields an incomplete future expression position, p, when evaluated. Let f be the set of complete future expressions encountered up to this point. If f is empty, and c was a successful match, yield success. If f is not empty, combine the elements of f using the ordered choice operator to form e. If c was found, yield e as this cell's complete future expression, and either p or, if p is not defined, the position following c as this cell's incomplete future expression position. If no c was found, yield e, unless f was empty, in which case yield failure.

Sequence: Let s1 be the first element of the sequence and s2 be the second element. If evaluating s1 yields failure, yield failure. If evaluating s1 yields a complete future expression c1, then if s1 did not also yield an incomplete future expression position p1, yield a sequence containing c1 followed by s2 as this cell's complete future expression. If s2 yields success and s1 yielded success, yield success. If s2 yields success and p1 is defined, yield c1 as this cell's complete future expression and the position following s2 as this cell's incomplete future expression position. If s2 yields a complete future expression c2 and a possible incomplete future expression position p2, then if p1 is defined, yield c1 combined with c2 using the ordered choice operator as this cell's complete future expression and p2, if defined, as this cell's incomplete future expression position. If c2 is defined, yield c2 as this cell's complete future expression and p2, if defined, as this cell's incomplete future expression position.

Not followed by: If the current position is at the end of the block, yield the current nonterminal as this cell's complete future expression. Otherwise, yield the normal result of evaluating this rule.

Others: If the current position is at the end of the input string, yield failure. If the current position is at the end of the block, yield the current nonterminal as this cell's complete future expression. Otherwise, yield the normal result of evaluating this rule.

Eventually, the recursive cell evaluations of each block will return to the start symbol for that block. At that point, there will be no incomplete future expression position, and for most blocks, the start symbol will either yield failure or a complete future expression that forms the synthesized start symbol for the next block. The rightmost block's start symbol, however, will either indicate success or failure when its evaluation is complete; whichever value it indicates determines the result of the parse as a whole.

The processing window: A final enhancement to the algorithm is the processing window, which serves two purposes: it reduces the imbalance in work efficiency between the blocks near the beginning of the string and the blocks at the end, and it can be used to make the parallel packrat algorithm require an amount of space constant in the size of the input string. With a processing window in use, a fixed size is chosen for each block; ideally, a packrat table of the size chosen will fit entirely in the cache of each processing node in the parallel machine the algorithm is being run on. The ideal size will vary depending on the number of nonterminals in the grammar of interest. As many worker threads are then created as there are processing nodes, unless the input string is short enough that even fewer are required. These worker threads then begin parsing the input string, with the thread corresponding to the leftmost block processing its start symbol and the others working speculatively, as described above. If the input string is long enough, there will be more blocks than worker threads; it is in this situation that the processing window kicks in.

When the leftmost block has been completely evaluated and the synthesized start symbol it created has been passed to the right, the thread that was evaluating that block is reused. It is given the block just to the right of the rightmost block that currently has a thread working on it. The effect is as though there is a window being held in front of the blocks, and only the blocks that show through the window have threads assigned to them; when a block is completed, the window is moved to the right one block, so that a new block receives a thread to work on it. This process continues until the processing window has moved all of the way to the right and there are no blocks which are not either complete or currently being evaluated.

This technique is especially beneficial for packrat parsing since packrat tables can be so large. To save space, packrat tables can be allocated only for the blocks within the processing window; as the window moves, the old tables can either be wiped clean and returned to a pristine state, or deallocated and replaced with freshly allocated ones. If this is done, the space required by the algorithm becomes constant instead of linear in the input length, which makes packrat parsing an even more compelling alternative to conventional parsing algorithms.

4 Results

4.1 Parallel Earley algorithm results

Our implementation of the above, written in Java, was coupled with an 'off-the-shelf' context-free grammar for Java. Using the parallel and serial Earley algorithms, the results below are produced deterministically, with the exception of the speedup results. The timing results used to compute reported speedups are the arithmetic mean of the timing results of the final 20 out of 25 consecutive runs.

Cost of speculation: Recall from Section 3.1 that the final step in creating a parallel version of the Earley algorithm is breaking the dependence of sub-block Bi,0 on Bi−1,0. As described, this requires incorporating speculative Earley items into the computation. Moreover, in using speculative items it is possible that 'excess' items, non-speculative items not produced by the serial Earley algorithm, are constructed. In this section, we attempt to provide some intuition about, and quantification of, how and when Earley items are constructed as a by-product of speculation.

We first measure the total number of items in each Earley subset using both the serial Earley algorithm and the parallel Earley algorithm with four blocks.

Figure 6: Total number of Earley items in each Earley subset using the serial Earley algorithm (gray) and parallel Earley algorithm using four blocks (black), when run on a typical Java file.

The results for a typical (small) Java file are summarized in Figure 6, and for an atypical Java file consisting of 30 (empty) nested classes in Figure 7.

In both of these plots, the beginning of each block is distinguished by a significant elevation in the number of Earley items, nearly all of which are speculative items produced by the S-Scan mechanism. Though the number of speculative and excess items typically returns to 0 very quickly, infrequent spikes not explained by the start of a block remain.

Further analysis has shown that these subsequent spikes occur when a speculative production (namely, a production that did not begin in the current block) terminates, causing a cascade of new speculative items via the S-Complete mechanism. In particular, this frequently occurs at the end of functions and classes, when an 'unmatched' brace is encountered; this explanation matches the observed large number of speculative items towards the end of both Java files.

To quantify the effect of the number of blocks, we measure the total number of items over all Earley subsets produced by the parallel Earley algorithm for various numbers of blocks. The results for both the typical and atypical Java files, normalized to the number of items produced by the serial algorithm, are summarized in Figure 8.


Figure 7: Total number of Earley items in each Earley subset using the serial Earley algorithm (gray) and parallel Earley algorithm using four blocks (black), when run on an atypical, deeply-nested Java file.

Note that, for small numbers of blocks, the normalized value for the deeply-nested Java file grows more rapidly than that for the typical Java file. Once again, this is explained by the previous observation that unmatched braces produce a large number of speculative items: in this case, unmatched braces constituting the end of the file (and the resultant large number of speculative items) are encountered even for small numbers of blocks. Once this source of speculative items is saturated, however, both files add speculative items at similar rates as the number of blocks grows.

Observed work efficiency / speedup: Though the situation portrayed in Figure 8 seems difficult to overcome, consider that nearly all Java files are 'typical', and are also much longer. As a result, occasional spikes as in Figure 6 are amortized over long, well-behaved regions. We consider a more direct measurement of the work efficiency of the parallel Earley algorithm over a long (roughly 12,000-line) Java file, namely the 'speedup' of the parallel algorithm when run using a single thread and various numbers of blocks. Also included in Figure 9 are the results when multiple threads are utilized.

Figure 8: Ratio of the number of Earley items produced by the parallel algorithm to the number of Earley items produced by the serial algorithm for the given block size: a typical Java file (•), and a deeply-nested Java file (◦).

Focusing on the case with a single thread, we see that using multiple blocks produces a speedup of less than one, as expected. Even for 16 blocks, however, the speedup is greater than 0.8, suggesting that the work efficiency is actually quite good (a reasonable interpretation is that > 80% of the time is spent doing useful work). Thus, parallelization is feasible and, as the speedups associated with multiple threads show, realizable. The plot indicates a roughly linear relationship between speedup and the number of processors (so long as the number of blocks is sufficiently high). In the best case, a speedup of 5.44 was obtained using 8 threads and 8 blocks.

Figure 9: Speedup of the parallel Earley algorithm over the serial Earley algorithm as a function of the number of threads for: 2 blocks (◦), 4 blocks (▲), 8 blocks (•), and 16 blocks (+).

4.2 Parallel Packrat algorithm results

The parallel packrat algorithm described above was implemented completely, except for the constant-memory-usage feature. The algorithm was evaluated using an off-the-shelf PEG for parsing expression grammars. It was run against two inputs: a small 648-byte grammar for simple arithmetic (Calc.peg), and a larger 2,916-byte grammar for parsing expression grammars (PEG.peg). The speculation heuristic used evaluated a nonterminal near the top of the grammar's parse tree, Definition, and assumed that any matches were true matches, skipping over the positions used in the match when performing further speculation. Trials were conducted for worker thread counts between 1 and 8 and for block sizes of 250, 500, 750, and 1000. A special 'even' block size, signifying that the input string was to be divided evenly among the worker threads, was also evaluated. For each combination of thread count, block size, and input, 100 trials were conducted; results are based on the arithmetic mean of these trials. The machine used to run these benchmarks was an 8-core Mac Pro.

Work efficiency: The packrat tables used in the worker threads were inspected after execution of the algorithm was complete, and the number of cells evaluated was recorded. The results are depicted in Figures 10 and 11, which present the ratio of the number of cells evaluated by the parallel algorithm to the number that would have been evaluated using the serial algorithm. We can see that smaller block sizes generally result in greater speculation in both cases. The 'even' line in the Calc.peg example particularly stands out; Calc.peg is such a small file that evenly dividing its characters between more than three threads results in tiny block sizes that create extreme inefficiency. At the far right side of the figure, with 8 threads, the block size is only 81 characters; block sizes as small as this cause a great deal of additional speculative work to be done, because the threads associated with later blocks have time to evaluate a much higher percentage of the cells they consider, and each thread must wait for all previous blocks to complete before it stops speculating. The 'even' measurement is additionally hurt because, since it divides the entire input string evenly between the workers, it does not take advantage of the processing window to prevent work on later parts of the string from getting out of control. Finally, in both figures, the larger block sizes remain essentially flat after a certain number of threads has been reached; this is because, with a large enough block size, only the first few threads have any work to do.

Figure 10: Ratio of the number of cells evaluated by the parallel packrat algorithm with a given number of worker threads to the number of cells evaluated without speculation, for PEG.peg.

Figure 11: Ratio of the number of cells evaluated by the parallel packrat algorithm with a given number of worker threads to the number of cells evaluated without speculation, for Calc.peg.

Speedup: The time taken to execute the parallel parsing algorithm was recorded using the built-in Java function System.nanoTime(). The time taken to allocate data structures, such as the packrat tables, was excluded, since these data structures can be reused between invocations of the algorithm and it seems realistic that such an optimization may be employed in the real world. The results are depicted in Figures 12 and 13. Both graphs demonstrate a peak – for PEG.peg, at 5 threads, and for Calc.peg, at 4 threads – but their behavior after the peak is reached is quite different. Calc.peg's speedup plunges precipitously both to the left and the right of its peak; it is probable that this peak represents a block-size sweet spot at 162 characters per block that has more to do with the specific content of the input at boundaries between blocks than with a consistent trend. PEG.peg displays a better overall speedup – its peak is at about 2.5, instead of 2.3 for Calc.peg – but its results, combined with those of Calc.peg, suggest that the parallel packrat algorithm may be limited in its ability to take advantage of more threads. In both cases, we see speedup climb to a certain level and then remain more or less flat. PEG.peg, which is about 4.5 times as large as Calc.peg, showed an improvement in speedup potential, but speedup didn't increase linearly with the increase in input size. These data suggest that the speedup of the parallel packrat algorithm, while promising, is far from linear both in terms of the number of threads and the size of the input.

Figure 12: Speedup of the parallel packrat algorithm as compared to the single-threaded case, for PEG.peg.

Figure 13: Speedup of the parallel packrat algorithm as compared to the single-threaded case, for Calc.peg.

5 Conclusion

We have presented parallel versions of two well-known parsing algorithms, Earley and Packrat. Though the algorithms work differently internally, they are both fundamentally based upon the idea that the input to a parsing algorithm can be divided into blocks such that a significant amount of useful work can be done on a block without having parsed the input which preceded it. Our results have confirmed this intuition and shown that parsing is an area with a substantial amount of untapped parallelization potential. Both of our algorithms achieved speedup on realistic inputs; a maximum speedup of 5.44 was observed in the case of Earley, and 2.5 in the case of Packrat. Our algorithms are reasonably work-efficient for real-world inputs, and the parallel Packrat algorithm is capable of achieving better space complexity than the original serial algorithm.

We believe that further improvements to the algorithms and their implementations may yield still better results. Both algorithms would benefit from a heuristic to choose where block divisions occur; as we varied the sizes of the blocks our algorithms used, our results changed in nonlinear ways that suggest opportunities to further improve performance. In addition, the Packrat algorithm depends heavily on the heuristic used to choose cells to speculatively evaluate, and little work has been invested in optimizing this part of the algorithm. Beyond heuristics, neither algorithm currently takes advantage of vector or SIMD instructions, which may be a fruitful area of further research. It may even be possible to formulate new vector-style instructions tailored for parsers that could help these algorithms improve even further.

These parallel algorithms will become increasingly relevant as a new wave of computing devices with large numbers of small, simple cores becomes more and more popular. Even for traditional computers, the ubiquity of the Web and interpreted languages leads us to believe that parallel parsing algorithms like these will prove useful in the future. We are excited to have taken a first step towards the parallel parsing algorithms of tomorrow.

References

[1] J. Aycock and R. Horspool. Practical Earley parsing. The Computer Journal, 46(6):620–630, 2002.

[2] A. Birman. The TMG Recognition Schema. PhD thesis, Princeton University, Princeton, NJ, USA, 1970.

[3] Y. Chiang and K. Fu. Parallel parsing algorithms and VLSI implementations for syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(3):302–314, 1984.

[4] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13:94–102, 1970.

[5] B. Ford. Packrat parsing: a practical linear-time algorithm with backtracking. Master's thesis, Massachusetts Institute of Technology, 2002.

[6] J. Hill and A. Wayne. A CYK approach to parsing in parallel: A case study. In Proceedings of the Twenty-Second SIGCSE Technical Symposium on Computer Science Education, pages 240–245, 1991.

[7] G. Sandstrom. A parallel extension of Earley's parsing algorithm. 1994.

[8] L. Valiant. General context-free recognition in less than cubic time. Journal of Computer and System Sciences, 10:308–315, 1975.
