
A PEG-based pattern matching library EXtended by back reference with regex-like notation in Scala

Kota Mizushima

Graduate School of Systems and Information Engineering, University of Tsukuba

About myself

Name: Kota Mizushima
Country: Japan
Ph.D. student at the University of Tsukuba

Research: Parsing Algorithm (especially Packrat Parsing)

Packrat Parsing will be supported in Scala 2.8 (scala.util.parsing.combinator.PackratParsers)

I'm interested in programming languages
Currently developing my own programming language, Onion

Object Oriented and Statically Typed

Wish

Please speak slowly when asking questions about my presentation
I'm not good at English, so I can't catch a question correctly if you speak fast

Scala in Japan

The number of people interested in Scala is increasing rapidly

The reasons are:

Widely known web services started to use Scala
    Twitter
    Foursquare

Some advanced Java programmers started to use Scala
    e.g. @ymnk (a committer of Lift), @yuroyoro, @keisuke_n

Functional languages are expected to help solve multi-core problems

Scala Books in Japan

Currently, four books have been published in Japan

はじめての Scala (meaning "Scala for Beginners")

やさしい Scala入門 (meaning "An Easy Introduction to Scala"; unfortunately, it is not a good book)

Scala スケーラブルプログラミング (Japanese translation of "Programming in Scala" book)

Scalaプログラミング入門 (Japanese translation of "Beginning Scala" book)

Why PEGEX? Scala's (PEG) Parser Combinator vs. Regex

Parser Combinator (scala.util.parsing.combinator)
    Pros: powerful and extensible
    Cons: more verbose than Regex
        e.g. "abc" instead of abc, e1 ~ e2 instead of e1e2

Regex (scala.util.matching)
    Pros: brevity
    Cons: not powerful and not extensible
        cannot handle recursive structures such as (), (()), ...

What is PEGEX?

I wanted something with both the power of PEG parser combinators and the conciseness of Regex

I also wanted back references as in Regex, which are useful for cases such as matching corresponding XML tags

PEGEX is an abbreviation of the following: PEG-based pattern matching library EXtended by back reference with regex-like notation

named by @kinaba http://twitter.com/kinaba/status/8714614395 (in Japanese)

Syntax of PEGEX (1)

(Name = e ;)+
    A repetition of rules
    Name is the name of a nonterminal

e (expression) consists of the following:
    a: character (e.g. x)
    [...]: character class (e.g. [a-z])
    $: end of input
    .: any character
    e*: zero or more repetitions (e.g. a*)
    e+: one or more repetitions (e.g. a+)
    e?: zero or one (e.g. a?)

Syntax of PEGEX (2)

e1e2: sequence (e.g. ab)
e1|e2: ordered choice (e.g. a|b)
&e: and-predicate (e.g. &(a|b) b)
!e: not-predicate (e.g. !a .)
    predicates don't consume input
#(Name): reference to the nonterminal Name
#(Tag:e): assign the name "Tag" to the parsing result of e
##(Name): back reference to the result previously named Name, e.g. <#(Tag:Ident)></##(Tag)>
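
For readers who know back references from Regex, the idea behind <#(Tag:Ident)></##(Tag)> can be illustrated with the standard library alone. The following is only a sketch using scala.util.matching.Regex, not PEGEX itself; the object and method names are made up for illustration. Group 1 plays the role of #(Tag:...), and \1 plays the role of ##(Tag).

```scala
import scala.util.matching.Regex

object BackrefDemo {
  // Group 1 captures the tag name; \1 must repeat exactly the same text.
  val tagPair: Regex = """<([a-z]+)></\1>""".r

  // True only if the whole string is one open tag followed by the same close tag.
  def matchesPair(s: String): Boolean = tagPair.pattern.matcher(s).matches()
}
```

Unlike the PEGEX rule, this regular expression cannot describe nested elements, which is the gap PEGEX aims to fill.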

Syntax of PEGEX in PEGEX

PEGEX=#(S)(#(Name)#(Eq)#(Expression)\;)+;
Name=[a-zA-Z_][a-zA-Z_0-9]+#(S);
Eq==#(S);
Expression=#(Sequence)(#(BAR)#(Sequence))*;
Sequence=#(Prefix)+;
Prefix=#(Primary)(#(QUESTION)#(Primary)|...)*;
...

Basic Usage: Identifier

    val ident: Pegex = """
    L=#(IdentStart)#(IdentRest)*$;
    IdentStart=[a-zA-Z_];
    IdentRest=#(IdentStart)|[0-9];
    """.e
    // Represents an identifier. Invoking method e creates an instance of Pegex.
    println(ident.matches("HogeFooBar"))   // Some(HogeFooBar)
    println(ident.matches("Hoge_Foo_Bar")) // Some(Hoge_Foo_Bar)
    println(ident.matches("Hoge10"))       // Some(Hoge10)
    println(ident.matches("10Hoge"))       // None

Example: Nested Comment

Parser Combinator
    lazy val C: Parser[Any] = "/*" ~ (C | not("*/") ~ ".".r).* ~ "*/"

PEGEX
    C=/\*(#(C)|!(\*/).)*\*/;

Regex
    impossible (usually)

The PEGEX version is terser than the Parser Combinator version
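
To make the meaning of the rule C=/\*(#(C)|!(\*/).)*\*/; concrete, here is a hand-written matcher that follows the same PEG semantics step by step. It is only a sketch and does not use PEGEX or the parser combinator library; the object and method names are hypothetical.

```scala
object NestedComment {
  // Returns the index just past one nested comment starting at `pos`, or None.
  def comment(s: String, pos: Int): Option[Int] = {
    if (!s.startsWith("/*", pos)) return None          // /\*
    var i = pos + 2
    var going = true
    while (going) {
      comment(s, i) match {
        case Some(next) => i = next                    // first choice: #(C)
        case None =>
          // second choice: !(\*/). (any one character that does not start "*/")
          if (i < s.length && !s.startsWith("*/", i)) i += 1
          else going = false
      }
    }
    if (s.startsWith("*/", i)) Some(i + 2) else None   // \*/
  }

  // True if the whole input is exactly one (possibly nested) comment.
  def matches(s: String): Boolean = comment(s, 0).contains(s.length)
}
```

The loop never gives characters back once they are consumed, which mirrors the possessive behaviour of the PEG repetition operator.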

Example: XML-like Language

Parser Combinator
    lazy val E: Parser[String] =
      ("<" ~> "[a-z]+".r <~ ">") ~ E.* ~ ("</" ~> "[a-z]+".r <~ ">") ^? {
        case t1 ~ _ ~ t2 if t1 == t2 => t1
      }

PEGEX
    E=<#(tag:I)>#(E)*</##(tag)>;
    I=[a-z]+;

Regex
    impossible (usually)
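
The rule E=<#(tag:I)>#(E)*</##(tag)>; combines recursion with a back reference. A hand-written equivalent, given here only as an illustrative sketch with hypothetical names (it is not taken from the PEGEX sources), shows what the rule has to check: the closing tag must repeat the text captured as "tag".

```scala
object XmlLike {
  // Returns the index just past one element starting at `pos`, or None.
  def element(s: String, pos: Int): Option[Int] = {
    if (pos >= s.length || s.charAt(pos) != '<') return None
    var i = pos + 1
    while (i < s.length && s.charAt(i) >= 'a' && s.charAt(i) <= 'z') i += 1
    val tag = s.substring(pos + 1, i)                  // #(tag:I), I=[a-z]+
    if (tag.isEmpty || !s.startsWith(">", i)) return None
    i += 1
    var child = element(s, i)                          // #(E)*
    while (child.isDefined) {
      i = child.get
      child = element(s, i)
    }
    val close = "</" + tag + ">"                       // </##(tag)>
    if (s.startsWith(close, i)) Some(i + close.length) else None
  }

  // True if the whole input is exactly one element.
  def matches(s: String): Boolean = element(s, 0).contains(s.length)
}
```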

Implementation

Written in Scala (about 1300 lines)
    PEGEX parser, which creates the PEGEX AST
    Compiler from the AST to Parsing VM instructions
    Parsing VM

Includes several implementations for experiments
    AST interpreter (greedy)
    AST interpreter (possessive)
    Parsing VM (greedy)

PEGEX parser

About 300 lines
    PEGEX parser: 150 lines
    PEG parser (notation is like normal PEG): 150 lines

Written using scala.util.parsing.combinator
    Pros: on-the-fly error checking (with an IDE plugin)
    Cons: error reporting is poor; operator precedence is confusing
        e.g. ~> and <~ have different precedence (~> binds tighter than <~, because Scala derives precedence from an operator's first character)

AST to Parsing VM Instructions compiler

Simple and straightforward
    Only about 70 lines

Pattern matching and first-class functions are excellent features
    No need to write boilerplate code for the Visitor pattern
    foldLeft and map simplify the code
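
As an illustration of why pattern matching and foldLeft keep such a compiler short, here is a minimal sketch of an AST-to-instructions compiler in the style of the parsing machine for PEGs. The AST and instruction names are assumptions made for this example and are not the ones used in PEGEX. Jump offsets are relative to the position of the instruction that carries them.

```scala
object PegCompiler {
  // Hypothetical AST for a small subset of PEG expressions.
  sealed trait Exp
  case class Chr(c: Char) extends Exp
  case object AnyChar extends Exp
  case class Sequence(es: List[Exp]) extends Exp
  case class Alt(l: Exp, r: Exp) extends Exp
  case class Star(e: Exp) extends Exp
  case class Not(e: Exp) extends Exp

  // Hypothetical instruction set; offsets are relative to the instruction itself.
  sealed trait Insn
  case class IChar(c: Char) extends Insn
  case object IAny extends Insn
  case class IChoice(offset: Int) extends Insn  // push a backtrack entry
  case class ICommit(offset: Int) extends Insn  // pop it and jump
  case object IFail extends Insn

  def compile(e: Exp): List[Insn] = e match {
    case Chr(c)       => List(IChar(c))
    case AnyChar      => List(IAny)
    case Sequence(es) => es.foldLeft(List.empty[Insn])((acc, x) => acc ++ compile(x))
    case Alt(l, r) =>
      val cl = compile(l)
      val cr = compile(r)
      // Choice L1; <l>; Commit L2; L1: <r>; L2:
      List(IChoice(cl.size + 2)) ++ cl ++ List(ICommit(cr.size + 1)) ++ cr
    case Star(b) =>
      val cb = compile(b)
      // L1: Choice L2; <b>; Commit L1; L2:
      List(IChoice(cb.size + 2)) ++ cb ++ List(ICommit(-(cb.size + 1)))
    case Not(b) =>
      val cb = compile(b)
      // Choice L2; <b>; Commit L1; L1: Fail; L2:
      List(IChoice(cb.size + 3)) ++ cb ++ List(ICommit(1), IFail)
  }
}
```

Each case is a direct transcription of one compilation rule, so no Visitor classes are needed.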

Parsing VM

About 220 lines
Based on Medeiros' "A parsing machine for PEGs"

Consists of:
    array of instructions
    input string
    pc (index into the array of instructions)
    cursor (index into the input string)
    stack of pc and cursor
    ...
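
The components listed above can be sketched as a very small machine. This is an illustrative reduction, not the PEGEX VM: the instruction names are assumptions, jump targets are absolute indices, and instructions for rule calls and captures are left out.

```scala
object ParsingVm {
  sealed trait Insn
  case class Chr(c: Char) extends Insn          // match one specific character
  case object AnyChar extends Insn              // match any one character
  case class Choice(target: Int) extends Insn   // push backtrack entry (target, cursor)
  case class Commit(target: Int) extends Insn   // pop backtrack entry, jump to target
  case object Fail extends Insn                 // force a failure
  case object End extends Insn                  // successful end of the program

  /** Runs `program` on `input`; returns the final cursor on success. */
  def run(program: Array[Insn], input: String): Option[Int] = {
    var pc = 0                                  // index into the instruction array
    var cursor = 0                              // index into the input string
    var stack = List.empty[(Int, Int)]          // saved (pc, cursor) pairs
    while (true) {
      val ok = program(pc) match {
        case End => return Some(cursor)
        case Chr(c) if cursor < input.length && input.charAt(cursor) == c =>
          pc += 1; cursor += 1; true
        case AnyChar if cursor < input.length =>
          pc += 1; cursor += 1; true
        case Choice(t) =>
          stack = (t, cursor) :: stack; pc += 1; true
        case Commit(t) =>
          stack = stack.tail; pc = t; true
        case _ => false                         // Fail, or a character test that did not match
      }
      // On failure, resume from the most recent backtrack entry, if any.
      if (!ok) stack match {
        case (p, c) :: rest => pc = p; cursor = c; stack = rest
        case Nil            => return None
      }
    }
    None
  }
}
```

For example, the expression a*b corresponds to the program Choice(3), Chr('a'), Commit(0), Chr('b'), End. Replacing Chr('b') with Chr('a') gives a*a, which fails on "aaa" because the loop never gives a character back.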

Implementation Issue: Packrat or not

Packrat parsing: a parsing technique presented by Ford (2002)
    Pros: guarantees linear-time parsing by memoization
    Cons:
        memory consumption is large: O(n), where n is the size of the input
        execution overhead

Memoization code is currently removed from PEGEX
    The execution overhead was large in my interpreter
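
To show what the memoization in packrat parsing looks like, here is a minimal sketch for a small hypothetical grammar, A = P '+' A | P and P = '(' A ')' | 'x'. It is not PEGEX code (PEGEX currently has no memoization); all names are made up for illustration. Each (rule, position) pair is evaluated at most once, and the table that guarantees this is also the source of the memory cost.

```scala
import scala.collection.mutable

object PackratSketch {
  // Hypothetical grammar: A = P '+' A | P ;  P = '(' A ')' | 'x'
  final class Parser(input: String) {
    // One entry per (rule name, position): the end position, or None on failure.
    private val memo = mutable.Map.empty[(String, Int), Option[Int]]
    var calls = 0  // number of rule bodies actually evaluated

    private def memoized(rule: String, pos: Int)(body: => Option[Int]): Option[Int] =
      memo.get((rule, pos)) match {
        case Some(result) => result
        case None =>
          calls += 1
          val result = body
          memo((rule, pos)) = result
          result
      }

    private def char(c: Char, pos: Int): Option[Int] =
      if (pos < input.length && input.charAt(pos) == c) Some(pos + 1) else None

    def a(pos: Int): Option[Int] = memoized("A", pos) {
      // The second alternative re-uses the memoized result of P at `pos`.
      p(pos).flatMap(char('+', _)).flatMap(a).orElse(p(pos))
    }

    def p(pos: Int): Option[Int] = memoized("P", pos) {
      char('(', pos).flatMap(a).flatMap(char(')', _)).orElse(char('x', pos))
    }
  }

  def matches(s: String): Boolean = new Parser(s).a(0).contains(s.length)
}
```

Without the table, P would be re-parsed at the same position whenever the first alternative of A fails, and deeply nested parentheses would take exponential time.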

Implementation Issue: Possessive or Greedy

PEG's operators (e*, e+, e1|e2) behave like possessive (not greedy) operators in Regex
    e.g. ("a".* ~ "a") in the PEG parser combinator doesn't match any input, because "a".* consumes all of the input
    Users of Regex may be confused

PEGEX supports a "greedy" flag that makes PEGEX operators behave like greedy operators in Regex
    new Pegex("a*a", likeRegex=true)
    However, this flag makes parsers slow
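
The difference between the two behaviours can be observed with the JVM's own regular expressions, which offer both greedy (a*) and possessive (a*+) quantifiers. This sketch uses only java.lang.String.matches and does not involve PEGEX; the object name is made up for illustration.

```scala
object GreedyVsPossessive {
  // Greedy: a* first takes every 'a', then gives one back so the final 'a' can match.
  val greedy: Boolean = "aaa".matches("a*a")

  // Possessive: a*+ never gives characters back, so the final 'a' has nothing left to match.
  val possessive: Boolean = "aaa".matches("a*+a")
}
```

The possessive case is how the plain PEG repetition behaves; the greedy case is what the flag emulates.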

Current Status

Currently, PEGEX still has rough edges
    The API will change frequently
    Documentation is insufficient

Source code is available on GitHub: http://github.com/kmizu/pegex

Feedback is welcome!

Future Prospects

Better API for actual use
    Support handling semantic values by actions

    val pegex = "L=[1-9][0-9]*|0;".e(v => v.toInt)
    println(pegex.parse("100").asInstanceOf[Int])

Support many Regex flags, such as (?i), which represents case insensitivity

Speeding up
    Compile PEGEX to Java bytecode

Conclusions

Introduction to PEGEX
    Syntax
    Examples
    Implementation overview

Future prospects of PEGEX
    Support handling semantic values
    Better API
    Support many flags in Regex
    Speeding up

Thanks for listening