Perl 6 Update - PGE and Pugs Dr. Patrick R. Michaud April 26, 2005

Perl 6 Update - PGE and Pugs

Dr. Patrick R. Michaud
April 26, 2005

Rules and Grammars

Perl 6 completely redesigns the regular expression syntax
Regular expressions are now "rules"
Rules can call/embed other rules
Groups of rules can be combined into Grammars

Current events in Perl 6

Parrot 0.1.2 released
The Perl Foundation receives $25,000 for completion of Parrot milestones
New Parrot pumpking - Chip Salzenberg
New version of Parrot Grammar Engine (PGE / Perl 6 rules) to be released this week
Pugs - Autrijus Tang
  Perl 6 test suite

Pugs

Perl 6 compiler written in Haskell
Started by Autrijus Tang
Compiles directly to Haskell or to Parrot AST
Being used to develop Perl 6 tests and experiment with Perl 6 design
Available at http://pugscode.org
Discussion on [email protected] mailing list

Perl 6 rules / Parrot Grammar Engine

The heart of the Perl 6 compiler is the Perl/Parrot Grammar Engine (PGE)
Implements the Perl 6 rules syntax, compiles to Parrot code
Perl 6 rules compiler currently written in C
Bootstrap to Perl 6

Steps to Perl 6 compiler

Finish PGE bootstrap in C
  Parse p6 "rule" statements and grammars
Use p6 rules to define the Perl 6 grammar
P6 grammar can be used to generate Parrot abstract syntax trees from Perl 6 programs
Compile, (optimize), execute the abstract syntax tree to get a working Perl 6 program
Use Perl 6 to rewrite the grammar engine in Perl 6 (faster)
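The parse, compile, execute pipeline in these steps can be illustrated by analogy. The sketch below is not Perl 6 or Parrot; it uses Python's standard `ast` module purely to show the same three stages on a one-line program. The source string and the `<sketch>` filename are made up for illustration.

```python
import ast

source = "result = 2 + 3 * 4"

# Stage 1: parse source text into an abstract syntax tree.
tree = ast.parse(source)

# The tree is ordinary data that a compiler can walk, check, or rewrite.
assign = tree.body[0]
print(type(assign).__name__)        # Assign
print(type(assign.value).__name__)  # BinOp

# Stage 2: compile the tree to executable code.
code = compile(tree, filename="<sketch>", mode="exec")

# Stage 3: execute it and read back the result.
namespace = {}
exec(code, namespace)
print(namespace["result"])          # 14
```

An optimizer would sit between stages 1 and 2, rewriting the tree before it is compiled.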

Current state of PGE

Handles concatenation, alternation, quantifiers, captures*, subpatterns, subrules
Capture semantics redefined in Dec 2004, still not final
To be added next
  Character classes (note: Unicode)
  Patterns containing scalars, arrays, hashes

P6 rule syntax

Changes from Perl 5
  No more trailing /e, /x, /s options
  [...] denotes non-capturing groups
  ^ and $ are beginning/end of string
  ^^ and $$ are beginning/end of line
  . matches any character, including newline
  \n and \N match newline/non-newline
  # marks a comment (to end of line)
  Quantifiers are *, +, ?, and **{m..n}

Character classes

[aeiou] changed to <[aeiou]>
[^0-9] now <-[0..9]>
Properties defined as <alpha> <digit> <alnum>

Combine classes using +/- syntax: <+<alpha>-[aeiou]>
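For readers more familiar with traditional regex engines, here is a rough Python `re` rendering of the three class forms above. It is an approximation only: Python has no class subtraction, so the combined class is emulated with a negative lookahead, and `<alpha>` is assumed here to mean letters only.

```python
import re

# <[aeiou]>  is roughly  [aeiou]
vowel = re.compile(r"[aeiou]")

# <-[0..9]>  is roughly  [^0-9]
non_digit = re.compile(r"[^0-9]")

# <+<alpha>-[aeiou]>  is roughly "a letter that is not a lowercase vowel".
# [^\W\d_] is a common idiom for "any letter"; the lookahead does the subtraction.
consonant = re.compile(r"(?![aeiou])[^\W\d_]")

print(vowel.findall("rules"))            # ['u', 'e']
print(non_digit.findall("a1b2"))         # ['a', 'b']
print(consonant.findall("Perl 6 rules")) # ['P', 'r', 'l', 'r', 'l', 's']
```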

Subrules

Patterns are now called "rules"
Analogous to subroutines and closures
Like {...}, /.../ compiles into a "rule" subroutine
P6 rule statement allows named rules:

rule ident / [<alpha>|_] \w* /;

Named rules can be easily used in other rules:

m / <ident> \:= (.*) /;
rule expr / <term> [ <[+-]> <term> ]* /;
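Traditional regex engines have no named subrules, but the idea of building larger patterns from named pieces can be approximated by composing pattern strings. The Python sketch below does that for the `ident` and `expr` examples. `TERM` is a hypothetical stand-in (digits only), since the slides never define `<term>`.

```python
import re

# rule ident / [<alpha>|_] \w* /
IDENT = r"[A-Za-z_]\w*"

# Hypothetical: the slides do not define <term>; digits are assumed here.
TERM = r"\d+"

# rule expr / <term> [ <[+-]> <term> ]* /
EXPR = rf"{TERM}(?:\s*[+-]\s*{TERM})*"

# m / <ident> \:= (.*) /
assignment = re.compile(rf"(?P<ident>{IDENT})\s*:=\s*(.*)")

m = assignment.search("total := 1 + 2")
print(m.group("ident"))                         # total
print(m.group(2))                               # 1 + 2
print(bool(re.fullmatch(EXPR, "1 + 2 - 3")))    # True
```

Unlike real subrules, string composition cannot express recursion, which is one reason rules are compiled as subroutines rather than expanded textually.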

Interpolation

Variables no longer interpolate directly, thus

/ $var /

matches the contents of $var literally, even if it contains rule metacharacters. (No \Q and \E) To treat $var as a rule, use

/ <$var> /

Interpolated arrays match as an alternation:

/ @cmds /

is equivalent to

/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
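The three interpolation behaviours on this slide have rough counterparts in Python's `re` module, sketched below. The variable contents and the `cmds` list are invented for illustration.

```python
import re

var = "a.b*"

# / $var /  : the contents match literally, metacharacters and all.
literal = re.compile(re.escape(var))

# / <$var> / : the contents are treated as a pattern.
as_pattern = re.compile(var)

# / @cmds / : the elements match as an alternation of literals.
cmds = ["list", "load", "quit"]
alternation = re.compile("|".join(re.escape(c) for c in cmds))

print(bool(literal.search("xa.b*y")))      # True
print(bool(literal.search("aXb")))         # False
print(bool(as_pattern.search("aXb")))      # True
print(bool(alternation.fullmatch("load"))) # True
```

In Python the literal case needs an explicit `re.escape`; in Perl 6 rules the literal match is the default and the pattern case is the one that needs extra syntax.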

Interpolation, cont'd

Hashes match the keys of the hash, and the value of the hash is either
  Executed if it is a closure
  Treated as a subrule if it's a string or rule object
  Succeeds if value is 1
  Fails for any other value
Useful for parsed languages

rule expr / <term> [ %infixop <expr> ]? /

< metasyntax >

The < ... > introduce various forms of metasyntax
A leading alphabetic character indicates a subrule or grammatical assertion

<alpha>
<expr>
<before pattern>
<after pattern>

A leading ! negates the match

<!before pattern>

< metasyntax >, cont'd

Leading ' matches a literal string

<'match this exactly (whitespace matters)'>

Leading " matches an interpolated string

<"match $THIS exactly (whitespace matters)">

Leading '+' or '-' are character classes

/<-[a..z]> <-<alpha>>/

< metacharacters >

Leading '(' indicates code assertion

/(\d**{1..3}) <( $1 < 256 )>/

# (fail if $1 is not less than 256)

A $, @, or % indicates a variable subrule, where each value (or key) is a subrule to be matched

<$myrule>

<@cmds>

<%commands>
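The code assertion `<( $1 < 256 )>` shown earlier on this slide has no direct equivalent in most regex engines. The Python sketch below emulates it by hand, under the assumption that a failed assertion makes the engine backtrack and retry with a shorter capture. `match_octet` is a hypothetical helper name.

```python
import re

def match_octet(text):
    """Emulate /(\\d**{1..3}) <( $1 < 256 )>/ at the start of text."""
    # Try the greedy (longest) capture first, then backtrack to shorter ones.
    for length in (3, 2, 1):
        m = re.match(r"\d{%d}" % length, text)
        # The assertion: the captured digits must be less than 256.
        if m and int(m.group()) < 256:
            return m.group()
    return None

print(match_octet("255"))  # 255
print(match_octet("300"))  # 30
print(match_octet("abc"))  # None
```

Note the second result: because the assertion takes part in backtracking, "300" still matches on its first two digits. Rejecting it outright would need an extra anchor or boundary after the digits.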

A cool and somewhat scary example

%cmd{'^\d+'} = { say "You entered a number" };
%cmd{'^hello'} = { say "world" };
%cmd{'^print \s (.*)'} = { say $1; };
%cmd{'^exit'} = { exit() };

while =$*IN {
    /<%cmd>/ || say "Unrecognized command";
}
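The same dispatch idea can be approximated in Python with a dictionary that maps patterns to callables. This is a loose analogy rather than a translation: the loop tries the patterns one by one in insertion order, results are collected in a list instead of printed, and the exit command is left out so the sketch can be run safely. `commands`, `dispatch`, and `output` are illustrative names.

```python
import re

output = []

# Pattern -> action, like the %cmd hash of rule keys and closure values.
commands = {
    r"^\d+": lambda m: output.append("You entered a number"),
    r"^hello": lambda m: output.append("world"),
    r"^print\s(.*)": lambda m: output.append(m.group(1)),
}

def dispatch(line):
    """Run the action for the first pattern that matches the line."""
    for pattern, action in commands.items():
        m = re.search(pattern, line)
        if m:
            action(m)
            return True
    output.append("Unrecognized command")
    return False

for line in ["42", "hello", "print hi there", "bogus"]:
    dispatch(line)
print(output)
```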

Backtracking control

A single colon makes backtracking skip the previous atom

m/ \( <expr> [ , <expr> ]* : \) /

(if we don't find closing paren, no point in trying to match fewer <expr>s)

Two colons break an alternation:

m:w/ [ if :: <expr> <block>
     | for :: <list> <block>
     | loop :: <loop_controls>? <block>
     ]
/

(once we've found "if", "for", or "loop", no point in trying the other branches of the alternation)

Backtracking control, cont'd

Three colons (:::) fail the current rule when backtracked over
The <commit> assertion fails the entire match when backtracked over (including any rules that called the current rule)
The <cut> assertion matches successfully, removes the matched portion of the string up to the <cut>, and if backtracked over fails the match entirely
  Useful for throwing away successfully processed input when matching from an input stream
  Like, say, when writing a compiler :-)

Backslash

\L, \U, \Q, \E, \A, \z gone from rules
\n and \N match newline/not newline
\s matches any Unicode space
Backreferences are gone, use $1, $2, $3 (non-interpolated)
Perl 6 allows defining custom backslash sequences for use in rules

Closures

Anything in curlies is executed as a Perl 6 closure

/ (\w+) { say "Got $1"; } /
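Python's `re` module cannot embed code in a pattern, but a callback passed to `re.sub` is a loose analogy: the function runs each time the pattern matches. The sketch below limits it to the first match and records a message instead of printing. `on_word` and `seen` are illustrative names.

```python
import re

seen = []

def on_word(match):
    # Plays the role of the { say "Got $1"; } closure.
    seen.append(f"Got {match.group(1)}")
    return match.group(0)  # leave the text unchanged

re.sub(r"(\w+)", on_word, "PGE and Pugs", count=1)
print(seen)  # ['Got PGE']
```

The difference is that a Perl 6 closure runs in the middle of matching, at the point where it appears in the rule, while this callback only runs after a match has completed.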

Capture semantics

Captures are different in Perl 6
The result of a match is a "match object"
If a match succeeds, the match object has:
  Boolean value true
  Numeric value 1 (except for global matches)
  String value the matched substring
  Array component is matched subpatterns
  Hash component is matched subrules
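Python's `re.Match` object offers a partial analogy to the match object described here, with separate accessors for the boolean, string, positional, and named views. It is shown below only as a familiar reference point; the pattern and input are invented, and the Perl 6 match object is richer because each capture is itself a match object.

```python
import re

m = re.search(r"(?P<word>\w+)-(\d+)", "see item-42 here")

print(bool(m))        # True: boolean value
print(m.group(0))     # item-42: string value, the matched substring
print(m.groups())     # ('item', '42'): positional captures
print(m.groupdict())  # {'word': 'item'}: named captures
```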

Subpattern captures

Part of a rule in parentheses is a subpattern
Each subpattern produces its own match object

/Scooby (dooby) (doo)!/
         $1      $2

Quantified subpatterns produce arrays of match objects:

/Scooby (\w+ \s+)* (doo)!/
         $1         $2

$1 is a (possibly empty) array of matches
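For contrast, a traditional engine such as Python's `re` keeps only the last repetition of a quantified group, so collecting every repetition takes a second pass. The sketch below shows both behaviours; the input string is invented, and whitespace is written explicitly because Python patterns treat spaces as literal.

```python
import re

text = "Scooby dooby dabba doo!"

# A quantified group keeps only its final repetition.
m = re.match(r"Scooby (\w+\s+)*(doo)!", text)
print(repr(m.group(1)))   # 'dabba '
print(m.group(2))         # doo

# Workaround: capture the whole repeated region, then split it up.
m2 = re.match(r"Scooby ((?:\w+\s+)*)(doo)!", text)
words = re.findall(r"\w+", m2.group(1))
print(words)              # ['dooby', 'dabba']
```

The Perl 6 design removes the need for the second pass by making `$1` an array of matches.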

Non-capturing groups

Brackets do not capture, thus they don't result in a match object

/Scooby [ (\w+ \s+)* (doo) ]!/
           $1         $2

Quantified brackets replace nested subpatterns with the last component matched:

/Scooby [ (\w+ \s+)* (doo) ]+ !/
           $1         $2

Nested capturing subpatterns

Each capturing subpattern introduces a new lexical scope, with nested captures inside the new match object:

/Scooby ( (\w+ \s+)* (doo) ) !/
           $1[0]     $1[1]
        <------- $1 ------->

Alternations

Alternations introduce a new lexical scope, thus subpatterns restart counting at zero for each alternative branch (unlike p5):

           $1       $2
m/ Scooby (dooby)* (doo)!
 | Yabba  (dabba)* (doo) /
           $1       $2

This avoids lots of empty subpatterns when an alternation doesn't match.
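The "empty subpatterns" problem is easy to see in an engine that numbers groups across the whole pattern, as Perl 5 does. The Python sketch below uses the same alternation, with literal spaces since Python treats pattern whitespace as significant.

```python
import re

pattern = re.compile(r"Scooby (dooby)* (doo)!|Yabba (dabba)* (doo)")

m = pattern.match("Yabba dabba doo")

# Groups 1 and 2 belong to the branch that did not match, so they are empty.
print(m.groups())  # (None, None, 'dabba', 'doo')
```

Under the Perl 6 numbering on this slide, the second branch would fill `$1` and `$2` directly.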

Subrules

Subrules capture into a hash keyed by the name of the subrule:

rule ident / [<alpha>|_] \w* /;
rule num / \d+ /;

m/ <ident> \:= <num> /;

places match objects into $<ident> and $<num>

Quantified subrules

Like subpatterns, quantified subrules produce arrays of matches

m:w / dir <file>* /

produces matches in $<file>[0], $<file>[1], etc.

Nested parens in a subrule capture to the subrule's match object

Named captures

Portions of a match can be captured directly into a match object without a subrule:

m:w/ $<name> := \w+ , $<val> := \d+ /

captures the first sequence of alphanumerics into $<name>, and digits following the comma into $<val>.
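Named groups in Python's `re` module give a similar result to these named captures. The sketch below assumes that `:w` allows optional whitespace around the comma; the input string is invented.

```python
import re

# Approximates  m:w/ $<name> := \w+ , $<val> := \d+ /
m = re.search(r"(?P<name>\w+)\s*,\s*(?P<val>\d+)", "width, 80")

print(m.group("name"))  # width
print(m.group("val"))   # 80
print(m.groupdict())    # {'name': 'width', 'val': '80'}
```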

Grammars

Rules can be packaged together into separate name spaces to form Grammars

grammar Perl6 {
    rule ident { ... };
    rule term { ... };
    rule expr { ... };
}

:parsetree

The :parsetree flag to a rule causes the grammar engine to keep all information about a match. Thus, one can do something like

$parse = ($source ~~ Perl6::program);

to get the entire parsetree for a program (including comments)

Questions?
