Rules and Grammars
Perl 6 completely redesigns the regular expression syntax
Regular expressions are now "rules"
Rules can call/embed other rules
Groups of rules can be combined into Grammars
Current events in Perl 6
Parrot 1.2 released
The Perl Foundation receives $25,000 for completion of Parrot milestones
New Parrot pumpking - Chip Salzenberg
New version of Parrot Grammar Engine (PGE / Perl 6 rules) to be released this week
Pugs - Autrijus Tang
Perl 6 test suite
Pugs
Perl 6 compiler written in Haskell
Started by Autrijus Tang
Compiles directly to Haskell or to Parrot AST
Being used to develop Perl 6 tests and experiment with Perl 6 design
Available at http://pugscode.org
Discussion on [email protected] mailing list
Perl 6 rules / Parrot Grammar Engine
The heart of the Perl 6 compiler is the Perl/Parrot Grammar Engine (PGE)
Implements the Perl 6 rules syntax, compiles to Parrot code
Perl 6 rules compiler currently written in C
Bootstrap to Perl 6
Steps to Perl 6 compiler
Finish PGE bootstrap in C
  Parse p6 "rule" statements and grammars
Use p6 rules to define the Perl 6 grammar
P6 grammar can be used to generate Parrot abstract syntax trees from Perl 6 programs
Compile, (optimize), execute the abstract syntax tree to get working Perl 6 program
Use Perl 6 to rewrite the grammar engine in Perl 6 (faster)
Current state of PGE
Handles concatenation, alternation, quantifiers, captures*, subpatterns, subrules
Capture semantics redefined in Dec 2004, still not final
To be added next
  Character classes (note: Unicode)
  Patterns containing scalars, arrays, hashes
P6 rule syntax
Changes from perl 5
  No more trailing /e, /x, /s options
  [...] denotes non-capturing groups
  ^ and $ are beginning/end of string
  ^^ and $$ are beginning/end of line
  . matches any character, including newline
  \n and \N match newline/non-newline
  # marks a comment (to end of line)
  Quantifiers are *, +, ?, and **{m..n}
Character classes
[aeiou] changed to <[aeiou]>
[^0-9] now <-[0..9]>
Properties defined as <alpha> <digit> <alnum>
Combine classes using +/- syntax: <+<alpha>-[aeiou]>
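
As a rough illustration only (Python's `re` module, not Perl 6), the set-difference idea behind `<+<alpha>-[aeiou]>` can be approximated with a negative lookahead. The names `CONSONANT` and `consonants` are made up for this sketch, and `[^\W\d_]` is only an approximation of `<alpha>`.

```python
import re

# Approximate analogue of <+<alpha>-[aeiou]>:
# "a letter, but not one of a, e, i, o, u".
# (?![aeiou]) rejects the vowels; [^\W\d_] is a common idiom for "a letter".
CONSONANT = re.compile(r"(?![aeiou])[^\W\d_]")

def consonants(text):
    """Return every letter in `text` that is not a lowercase vowel."""
    return CONSONANT.findall(text)
```

For example, `consonants("perl6 rules")` keeps the letters p, r, l, r, l, s and drops the vowels, the digit, and the space.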
Subrules
Patterns are now called "rules"
Analogous to subroutines and closures
Like {...}, /.../ compiles into a "rule" subroutine
P6 rule statement allows named rules:
rule ident / [<alpha>|_] \w* /;
Named rules can be easily used in other rules:
m / <ident> \:= (.*) /;
rule expr / <term> [ <[+-]> <term> ]* /;
Interpolation
Variables no longer interpolate directly, thus
/ $var /
matches the contents of $var literally, even if it contains rule metacharacters. (No need for \Q and \E.) To treat $var as a rule, use
/ <$var> /
Interpolated arrays match as an alternation:
/ @cmds /
is equivalent to
/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
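
The three interpolation behaviours can be mimicked, loosely, with Python's `re` module. This is a sketch by analogy, not Perl 6 code; the helper names `literal_rule`, `subrule`, and `alternation` are invented here, and the literal treatment of array elements is an assumption carried over from the scalar case.

```python
import re

def literal_rule(var):
    """Analogue of / $var /: the contents of var match literally."""
    return re.compile(re.escape(var))

def subrule(var):
    """Analogue of / <$var> /: the contents of var are treated as a pattern."""
    return re.compile(var)

def alternation(cmds):
    """Analogue of / @cmds /: match any one element (assumed literal)."""
    return re.compile("|".join(re.escape(c) for c in cmds))
```

With `var = "a.c"`, `literal_rule(var)` finds only the text `a.c`, while `subrule(var)` also matches `abc`, because the dot is a metacharacter again.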
Interpolation, cont'd
Hashes match the keys of the hash, and the value of the hash is either
  Executed if it is a closure
  Treated as a subrule if it's a string or rule object
  Succeeds if value is 1
  Fails for any other value
Useful for parsed languages
rule expr / <term> [ %infixop <expr> ]? /
< metasyntax >
The < ... > introduce various forms of metasyntax
A leading alphabetic character indicates a subrule or grammatical assertion
  <alpha>
  <expr>
  <before pattern>
  <after pattern>
A leading ! negates the match
  <!before pattern>
< metasyntax >
Leading ' matches a literal string
<'match this exactly (whitespace matters)'>
Leading " matches an interpolated string
<"match $THIS exactly (whitespace matters)">
Leading '+' or '-' are character classes
/<-[a..z]> <-<alpha>>/
< metacharacters >
Leading '(' indicates code assertion
/(\d**{1..3}) <( $1 < 256 )>/
# (fail if $1 is not less than 256)
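
Python's `re` module has no code assertions, so the sketch below applies the `$1 < 256` check by hand after matching. The names `OCTET` and `match_octet` are invented for illustration. The loop emulates what a backtracking engine would presumably do when the assertion fails: retry with fewer digits.

```python
import re

OCTET = re.compile(r"\d{1,3}")

def match_octet(text):
    """Match 1-3 leading digits whose numeric value is below 256."""
    m = OCTET.match(text)
    if m is None:
        return None
    digits = m.group()
    # Emulate backtracking: try the longest run first, then shorter ones.
    for end in range(len(digits), 0, -1):
        if int(digits[:end]) < 256:
            return digits[:end]
    return None
```

Under these assumptions, `match_octet("255")` returns `"255"`, while `match_octet("300")` falls back to `"30"`, because the quantifier can give up a digit to satisfy the assertion.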
A $, @, or % indicates a variable subrule, where each value (or key) is a subrule to be matched
<$myrule>
<@cmds>
<%commands>
A cool and somewhat scary example
%cmd{'^\d+'} = { say "You entered a number" };
%cmd{'^hello'} = { say "world" };
%cmd{'^print \s (.*)'} = { say $1; };
%cmd{'^exit'} = { exit() };
while =$*IN {
    /<%cmd>/ || say "Unrecognized command";
}
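
The same dispatch-table idea can be sketched in Python, where a dictionary maps patterns to handler functions. This is an analogy rather than a translation: `make_dispatcher` and `dispatch` are hypothetical names, the handlers return strings instead of printing so they are easy to check, and the `exit` command is left out.

```python
import re

def make_dispatcher():
    """Map each pattern to a handler that receives the match object."""
    return {
        r"^\d+": lambda m: "You entered a number",
        r"^hello": lambda m: "world",
        r"^print\s(.*)": lambda m: m.group(1),
    }

def dispatch(cmds, line):
    """Run the handler of the first pattern that matches `line`."""
    for pattern, handler in cmds.items():
        m = re.search(pattern, line)
        if m:
            return handler(m)
    return "Unrecognized command"
```

A read loop would then call `dispatch(cmds, line)` for each input line and print the result.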
Backtracking control
A single colon prevents backtracking into the previous atom
m/ \( <expr> [ , <expr> ]* : \) /
(if we don't find closing paren, no point in trying to match fewer <expr>s)
Two colons break an alternation:
m:w/ [ if :: <expr> <block>
     | for :: <list> <block>
     | loop :: <loop_controls>? <block>
     ] /
(once we've found "if", "for", or "loop", no point in trying the other branches of the alternation)
Backtracking control
Three colons (:::) fail the current rule
The <commit> assertion fails the entire match (including any rules that called the current rule)
The <cut> assertion matches successfully, removes the matched portion of the string up to the <cut>, and if backtracked over fails the match entirely
  Useful for throwing away successfully processed input when matching from an input stream
  Like, say, when writing a compiler :-)
Backslash
\L, \U, \Q, \E, \A, \z gone from rules
\n and \N match newline/not newline
\s matches any Unicode space
Backreferences are gone, use $1, $2, $3 (non-interpolated)
Perl 6 allows defining custom backslash sequences for use in rules
Capture semantics
Captures are different in Perl 6
The result of a match is a "match object"
If a match succeeds, the match object has:
  Boolean value true
  Numeric value 1 (except for global matches)
  String value the matched substring
  Array component is matched subpatterns
  Hash component is matched subrules
Subpattern captures
Part of a rule in parentheses is a subpattern
Each subpattern produces its own match object
/Scooby (dooby) (doo)!/
          $1     $2
Quantified subpatterns produce arrays of match objects:
/Scooby (\w+ \s+)* (doo)!/
           $1       $2
$1 is a (possibly empty) array of matches
Non-capturing groups
Brackets do not capture, thus they don't result in a match object
/Scooby [ (\w+ \s+)* (doo) ]!/
             $1       $2
Quantified brackets replace nested subpatterns with the last component matched:
/Scooby [ (\w+ \s+)* (doo) ]+ !/
             $1       $2
Nested capturing subpatterns
Each capturing subpattern introduces a new lexical scope, with nested captures inside the new match object:
/Scooby ( (\w+ \s+)* (doo) ) !/
            $1[0]    $1[1]
       <-------- $1 --------->
Alternations
Alternations introduce a new lexical scope, thus subpatterns restart counting at zero for each alternative branch (unlike p5):
             $1     $2
m/ Scooby (dooby)* (doo)!
 | Yabba  (dabba)* (doo) /
             $1     $2
This avoids lots of empty subpatterns when an alternation doesn't match.
Subrules
Subrules capture into a hash keyed by the name of the subrule:
rule ident / [<alpha>|_] \w* /;
rule num / \d+ /;
m/ <ident> \:= <num> /;
places match objects into $<ident> and $<num>
Quantified subrules
Like subpatterns, quantified subrules produce arrays of matches
m:w / dir <file>* /
produces matches in $<file>[0], $<file>[1], etc.
Nested parens in a subrule capture to the subrule's match object
Named captures
Portions of a match can be captured directly into a match object without a subrule:
m:w/ $<name> := \w+ , $<val> := \d+ /
captures the first sequence of alphanumerics into $<name>, and digits following the comma into $<val>.
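
For comparison, Python's `re` module offers named groups, which play a loosely similar role to `$<name>` and `$<val>`. The sketch below is an analogy under stated assumptions: `PAIR` and `parse_pair` are invented names, and the explicit `\s*` stands in for the whitespace handling that the `:w` modifier provides.

```python
import re

# (?P<name>...) is Python's named-group syntax; \s* approximates :w.
PAIR = re.compile(r"\s*(?P<name>\w+)\s*,\s*(?P<val>\d+)")

def parse_pair(text):
    """Return the named captures as a dict, or None if nothing matches."""
    m = PAIR.search(text)
    return m.groupdict() if m else None
```

Here `parse_pair("width, 42")` yields a dictionary with `name` set to `"width"` and `val` set to `"42"`, much as the rule fills `$<name>` and `$<val>`.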
Grammars
Rules can be packaged together into separate name spaces to form Grammars
grammar Perl6 {
    rule ident { ... };
    rule term { ... };
    rule expr { ... };
}
:parsetree
The :parsetree flag to a rule causes the grammar engine to keep all information about a match. Thus, one can do something like
$parse = ($source ~~ Perl6::program);
to get the entire parsetree for a program (including comments)