What has attracted researchers from the 1960s until today? What do computer science engineers study at university and then promptly forget? What is at the heart of the compilers that every software developer uses daily? Parsers! Taking a practical point of view with a small pill of theory, this session sheds light on questions such as: if there are so many parser generators based on formal theory, why are javac, GCC and Clang all hand-written? And how do we, insiders of the world of parsing, do this at SonarSource for languages like Java, C/C++, C#, JavaScript, Python and COBOL?
@dbolkensteyn @_godin_#parsing
The Art of Parsing
Evgeny Mandrikov @_godin_
Dinesh Bolkensteyn @dbolkensteyn
http://sonarsource.com
// TODO: don't forget to add a huge disclaimer that all opinions hereinbelow are our own and not our employer's (they wish they had them)
I want to create a parser
«Done»!
Use Yacc, JavaCC, ANTLR, SSLR, …
or hand-written?
What is the plan?
Why
• javac and GCC are hand-written
• do we use parser-generators?
Together we will implement a parser for
• arithmetic expressions
• common constructions from Java
• C++ ;)
Java formal grammar
JLS8
JLS7
Answer is
42
Pill of theory
NUM ➙ 42
Nonterminal: NUM
Production: NUM ➙ 42
Terminal (token): 42
Grammar for numbers
NUM ➙ NUM DIGIT | DIGIT
DIGIT ➙ 0|1|2|3|4|5|6|7|8|9
Alternatives: separated by «|»
Examples: 4, 8, 15, 16, 23, 42, …
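The number grammar can be read almost directly as code. The following is a minimal sketch, assuming the input is a plain string of ASCII digits; the class name `NumGrammar` and the methods `digit` and `parseNum` are made up for this illustration. The left-recursive rule NUM ➙ NUM DIGIT is written as a loop: whatever has been read so far plays the role of the inner NUM, and each further character is one more DIGIT.

```java
final class NumGrammar {
    // DIGIT ➙ 0|1|2|3|4|5|6|7|8|9
    // Returns the numeric value of the digit, or -1 if c is not a digit.
    static int digit(char c) {
        return (c >= '0' && c <= '9') ? c - '0' : -1;
    }

    // NUM ➙ NUM DIGIT | DIGIT
    // Requires at least one DIGIT and rejects any other character.
    static int parseNum(String input) {
        if (input.isEmpty()) {
            throw new IllegalArgumentException("expected a digit");
        }
        int value = 0;
        for (int i = 0; i < input.length(); i++) {
            int d = digit(input.charAt(i));
            if (d < 0) {
                throw new IllegalArgumentException("unexpected character at " + i);
            }
            value = value * 10 + d; // NUM DIGIT: shift the inner NUM, add the DIGIT
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseNum("42"));
    }
}
```

Values that do not fit in an `int` overflow silently in this sketch; a real lexer would have to handle that.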
Arithmetic expressions
4 – 3 – 2 = ?
expr ➙ expr – expr | NUM
The grammar expr ➙ expr – expr | NUM allows two parse trees for 4 – 3 – 2, so it is ambiguous:
• grouping to the left: (4 – 3) – 2 = -1
• grouping to the right: 4 – (3 – 2) = 3
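The ambiguity of expr ➙ expr – expr | NUM can be made concrete by enumerating every parse tree and evaluating each one. This is only an illustrative sketch: the class `AmbiguousSubtraction` and its method `allValues` are invented here, and the input is given as an array of numbers with a «–» implied between neighbours.

```java
import java.util.Set;
import java.util.TreeSet;

final class AmbiguousSubtraction {
    // expr ➙ expr – expr | NUM
    // Returns the value of every parse tree for nums[from] – ... – nums[to].
    static Set<Integer> allValues(int[] nums, int from, int to) {
        Set<Integer> results = new TreeSet<>();
        if (from == to) {
            results.add(nums[from]); // expr ➙ NUM
            return results;
        }
        // expr ➙ expr – expr: every «–» may be the root of the tree.
        for (int split = from; split < to; split++) {
            for (int left : allValues(nums, from, split)) {
                for (int right : allValues(nums, split + 1, to)) {
                    results.add(left - right);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // 4 – 3 – 2 has two parse trees with two different values.
        System.out.println(allValues(new int[] {4, 3, 2}, 0, 2)); // prints [-1, 3]
    }
}
```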
Two unambiguous rewrites of expr ➙ expr – expr | NUM:
• expr ➙ NUM – expr | NUM groups to the right: 4 – (3 – 2) = 3
• expr ➙ expr – NUM | NUM groups to the left: (4 – 3) – 2 = -1
Show me the code
int expr() {
  int res = expr();  // calls itself before reading any token: never terminates
  if (token == '-')
    return res - num();
  return num();
}
expr ➙ expr – NUM | NUM
Show me the right code
int expr() {
  int res = num();
  while (token == '-')   // advancing to the next token is left out on the slide
    res = res - num();
  return res;
}
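To see the difference between the two versions, here is a minimal runnable sketch. It makes a few simplifying assumptions that are not in the slides: the token stream is just an array of numbers with a «–» implied between neighbours, and the names `LeftRecursionDemo`, `exprNaive`, `exprLoop` and `naiveOverflows` are hypothetical.

```java
final class LeftRecursionDemo {
    private final int[] nums; // the NUM tokens, e.g. {4, 3, 2} for 4 - 3 - 2
    private int pos = 0;      // index of the next NUM to read

    LeftRecursionDemo(int... nums) {
        this.nums = nums;
    }

    private int num() {
        return nums[pos++];
    }

    // Simplification: a '-' is assumed between any two consecutive numbers.
    private boolean atMinus() {
        return pos < nums.length;
    }

    // Literal transcription of expr ➙ expr – NUM | NUM.
    int exprNaive() {
        int res = exprNaive(); // recurses before consuming any token
        if (atMinus()) return res - num();
        return num();
    }

    // Same language, with the left recursion replaced by a loop.
    int exprLoop() {
        int res = num();
        while (atMinus()) res = res - num();
        return res;
    }

    // Runs the naive version and reports whether it exhausted the stack.
    static boolean naiveOverflows() {
        try {
            new LeftRecursionDemo(4, 3, 2).exprNaive();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("naive version overflows: " + naiveOverflows());
        System.out.println("loop version: " + new LeftRecursionDemo(4, 3, 2).exprLoop());
    }
}
```

The loop version evaluates 4 – 3 – 2 to -1, which is the left grouping that the grammar expr ➙ expr – NUM | NUM describes.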
Arithmetic expressions
4 – 3 * 2 = ?
4 – 3 * 2 = -2
expr ➙ expr – NUM | expr * NUM | NUM
This grammar ignores operator precedence and groups strictly from left to right:
• expected: 4 – (3 * 2) = -2
• parsed as: (4 – 3) * 2 = 2
Arithmetic expressions
subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM
4 – (3 * 2) = -2
Show me the code
// subs ➙ subs – mult | mult
int subs() {
  int res = mult();
  while (token == '-')
    res = res - mult();
  return res;
}
// mult ➙ mult * NUM | NUM
int mult() {
  int res = num();
  while (token == '*')
    res = res * num();
  return res;
}
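A self-contained version of the two functions might look like the sketch below. It is one possible reading of the slide code, not the authors' implementation: the class `SubsMultParser`, the helper `accept` and the entry point `eval` are made up, spaces are simply stripped from the input, and the ASCII hyphen `-` stands in for the «–» of the grammar.

```java
final class SubsMultParser {
    private final String input;
    private int pos = 0;

    private SubsMultParser(String input) {
        this.input = input.replace(" ", "");
    }

    // Parses and evaluates the whole input, rejecting trailing characters.
    static int eval(String input) {
        SubsMultParser parser = new SubsMultParser(input);
        int result = parser.subs();
        if (parser.pos != parser.input.length()) {
            throw new IllegalArgumentException("unexpected input at " + parser.pos);
        }
        return result;
    }

    // One character of lookahead: consumes c if it is next, otherwise does nothing.
    private boolean accept(char c) {
        if (pos < input.length() && input.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    // subs ➙ subs – mult | mult
    private int subs() {
        int res = mult();
        while (accept('-')) res = res - mult();
        return res;
    }

    // mult ➙ mult * NUM | NUM
    private int mult() {
        int res = num();
        while (accept('*')) res = res * num();
        return res;
    }

    private int num() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("4 - 3 * 2")); // prints -2
    }
}
```

Because `subs` only ever calls `mult`, every multiplication is finished before a subtraction is applied, which is how the layering of the grammar encodes precedence.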
LL(1)
● back to 1969
● one token lookahead
● no left-recursion
What is the plan?
✔ arithmetic expressions
✔ LL(1)
• a few common constructions from Java
• C++ ;)
The real deal
expr-stmt ➙ expr ;
expr ➙ field-access
| method-call
| assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
| id
method-call ➙ qualified-id ()
assignment ➙ qualified-id = expr
Examples: obj.method(); a = obj.field;
Show me the code
int qualified_id() { /* easy */ }
int field_access() { /* easy */ }
int method_call() { /* easy */ }
int assignment() { /* easy */ }
int expr() {
  // ???
}
The LL(1) way
expr ➙ field-access
     | method-call
     | assignment
int expr() {
  String id = qualified_id();  // all three alternatives start with a qualified-id
  if (token == '(')            // one token of lookahead picks the alternative
    return method_call();
  else if (token == '=')
    return assignment();
  else
    return field_access();
}
Examples: obj.method(); a = obj.field;
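The same idea as a runnable sketch: parse the common qualified-id prefix once, then let a single token of lookahead choose the alternative. The assumptions here are mine, not the slides': the input is already split into string tokens, the parser returns a textual description of the tree instead of an `int`, and the names `Ll1ExprParser`, `parse`, `peek` and `expect` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class Ll1ExprParser {
    private final List<String> tokens;
    private int pos = 0;

    private Ll1ExprParser(List<String> tokens) {
        this.tokens = tokens;
    }

    // expr-stmt ➙ expr ;
    static String parse(String... tokens) {
        Ll1ExprParser parser = new Ll1ExprParser(Arrays.asList(tokens));
        String tree = parser.expr();
        parser.expect(";");
        return tree;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<eof>";
    }

    private void expect(String token) {
        if (!peek().equals(token)) {
            throw new IllegalStateException("expected " + token + " but found " + peek());
        }
        pos++;
    }

    private String identifier() {
        String token = peek();
        if (!Character.isJavaIdentifierStart(token.charAt(0))) {
            throw new IllegalStateException("expected an identifier but found " + token);
        }
        pos++;
        return token;
    }

    // qualified-id ➙ qualified-id . id | id (left recursion written as a loop)
    private String qualifiedId() {
        StringBuilder name = new StringBuilder(identifier());
        while (peek().equals(".")) {
            pos++;
            name.append('.').append(identifier());
        }
        return name.toString();
    }

    // expr ➙ field-access | method-call | assignment
    private String expr() {
        String id = qualifiedId();
        if (peek().equals("(")) {
            pos++;
            expect(")");
            return "method-call(" + id + ")";
        } else if (peek().equals("=")) {
            pos++;
            return "assignment(" + id + ", " + expr() + ")";
        }
        return "field-access(" + id + ")";
    }

    public static void main(String[] args) {
        System.out.println(parse("obj", ".", "method", "(", ")", ";"));
        System.out.println(parse("a", "=", "obj", ".", "field", ";"));
    }
}
```

Note what the price is: the three grammar rules no longer exist as separate functions, because their shared prefix had to be factored out by hand.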
Reality
http://hg.openjdk.java.net/jdk8/jdk8/langtools/.../JavacParser.java
The better way
expr ➙ field-access
     | method-call
     | assignment
int expr() {
  try {
    return field_access();
  } catch (RE e1) {
    try {
      return method_call();
    } catch (RE e2) {
      return assignment();
    }
  }
}
Examples: obj.method(); a = obj.field;
Show me the right code
expr ➙ method-call
     / assignment
     / field-access
Order matters: a field-access also matches the beginning of a method call or of an assignment, so it has to be tried last.
int expr() {
  try {
    return method_call();
  } catch (RE e1) {
    try {
      return assignment();
    } catch (RE e2) {
      return field_access();
    }
  }
}
Examples: obj.method(); a = obj.field;
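The try/catch version only works if every failed alternative puts the input position back where it started. The sketch below adds that bookkeeping and also lets you run the "wrong" order for comparison. Everything beyond the slide code is an assumption of this sketch: the string tokens, the textual tree, the `fieldAccessFirst` flag, and the names `PegExprParser` and `parse`; `RE` is taken to be some unchecked recognition exception.

```java
import java.util.Arrays;
import java.util.List;

final class PegExprParser {
    // Assumed meaning of the slides' RE: an unchecked "recognition" failure.
    static final class RE extends RuntimeException {
        RE(String message) { super(message); }
    }

    private final List<String> tokens;
    private final boolean fieldAccessFirst;
    private int pos = 0;

    private PegExprParser(boolean fieldAccessFirst, List<String> tokens) {
        this.fieldAccessFirst = fieldAccessFirst;
        this.tokens = tokens;
    }

    // expr-stmt ➙ expr ;
    static String parse(boolean fieldAccessFirst, String... tokens) {
        PegExprParser parser = new PegExprParser(fieldAccessFirst, Arrays.asList(tokens));
        String tree = parser.expr();
        parser.expect(";");
        return tree;
    }

    private String expr() {
        return fieldAccessFirst ? exprFieldAccessFirst() : exprMethodCallFirst();
    }

    // expr ➙ method-call / assignment / field-access
    private String exprMethodCallFirst() {
        int start = pos;
        try { return methodCall(); } catch (RE e1) { pos = start; } // backtrack
        try { return assignment(); } catch (RE e2) { pos = start; } // backtrack
        return fieldAccess();
    }

    // expr ➙ field-access / method-call / assignment
    private String exprFieldAccessFirst() {
        int start = pos;
        try { return fieldAccess(); } catch (RE e1) { pos = start; }
        try { return methodCall(); } catch (RE e2) { pos = start; }
        return assignment();
    }

    private String fieldAccess() {
        return "field-access(" + qualifiedId() + ")";
    }

    private String methodCall() {
        String id = qualifiedId();
        expect("(");
        expect(")");
        return "method-call(" + id + ")";
    }

    private String assignment() {
        String id = qualifiedId();
        expect("=");
        return "assignment(" + id + ", " + expr() + ")";
    }

    private String qualifiedId() {
        StringBuilder name = new StringBuilder(identifier());
        while (peek().equals(".")) {
            pos++;
            name.append('.').append(identifier());
        }
        return name.toString();
    }

    private String identifier() {
        String token = peek();
        if (!Character.isJavaIdentifierStart(token.charAt(0))) {
            throw new RE("expected an identifier but found " + token);
        }
        pos++;
        return token;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<eof>";
    }

    private void expect(String token) {
        if (!peek().equals(token)) throw new RE("expected " + token + " but found " + peek());
        pos++;
    }
}
```

With `fieldAccessFirst` set to `false`, `parse(false, "obj", ".", "method", "(", ")", ";")` yields `method-call(obj.method)`. With `true`, the field-access alternative succeeds on `obj.method`, the choice is never revisited, and the statement then fails at `(` with an `RE`.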
Parsing Expression Grammars
● 2002
● ordered choice «/»
● backtracking
● no left-recursion
DSL for PEG
expr ➙ method-call
     / assignment
     / field-access
enum Nonterminals { EXPR, METHOD_CALL, … }
void grammar() {
  rule(EXPR).is(
    firstOf(
      METHOD_CALL,
      ASSIGNMENT,
      FIELD_ACCESS));
}
Examples: obj.method(); a = obj.field;
What is the plan?
✔ arithmetic expressions
✔ LL(1)
✔ common constructions from Java
✔ PEG
• C++ ;)
#DVXFR14
Tea Break
Quiz
if (false)
if (true) System.out.println("foo");
else System.out.println("bar");
«Dangling else»
if-stmt ➙ IF (cond) stmt ELSE stmt / IF (cond) stmt
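In Java the `else` belongs to the nearest `if`, and the PEG rule above gives the same result, because the alternative with ELSE is tried first while parsing the inner statement. A small sketch to check the quiz; the class `DanglingElseQuiz` and its two methods are made up, and the output is collected in a string instead of being printed so that it can be inspected.

```java
final class DanglingElseQuiz {
    // The quiz statement as written, without braces.
    static String asWritten() {
        StringBuilder out = new StringBuilder();
        if (false) if (true) out.append("foo"); else out.append("bar");
        return out.toString();
    }

    // The same statement with the grouping spelled out:
    // the else is part of the inner if, which the outer if skips entirely.
    static String withBraces() {
        StringBuilder out = new StringBuilder();
        if (false) {
            if (true) {
                out.append("foo");
            } else {
                out.append("bar");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + asWritten() + "]");  // prints []
        System.out.println("[" + withBraces() + "]"); // prints []
    }
}
```

Both methods return the empty string: the quiz prints neither "foo" nor "bar".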
Java is awesome
(A)*B
C++ all the pains of the world
int *B;
typedef int A;
(A)*B; // cast to type 'A' (an alias of 'int') of the dereference of expression 'B'
int A, B;
(A)*B; // multiplication of 'A' and 'B', with redundant parentheses around 'A'
Java is good because it was influenced by the bad experience of C++.
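The two C++ snippets contain exactly the same tokens for `(A)*B`; only the earlier declarations differ. The toy sketch below makes that dependency explicit. It is purely illustrative and not a technique proposed in the slides: the class `CastOrMultiply`, the method `classify` and the idea of passing in a set of known type names are assumptions made for this example.

```java
import java.util.Collections;
import java.util.Set;

final class CastOrMultiply {
    // Classifies the C++ expression "(name)*operand".
    // typeNames holds the names declared as types (e.g. via typedef) so far.
    static String classify(String name, String operand, Set<String> typeNames) {
        if (typeNames.contains(name)) {
            return "cast of *" + operand + " to type " + name;
        }
        return "multiplication of " + name + " and " + operand;
    }

    public static void main(String[] args) {
        // After "typedef int A;"
        System.out.println(classify("A", "B", Collections.singleton("A")));
        // After "int A, B;"
        System.out.println(classify("A", "B", Collections.<String>emptySet()));
    }
}
```

A grammar alone has no access to `typeNames`, which is why the syntax of `(A)*B` cannot be settled by the parser without knowing what `A` was declared as.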
Hit the wall!
(A)*B
rule(MUL_EXPR).is(
  UNARY_EXPR,
  zeroOrMore('*', UNARY_EXPR));
rule(UNARY_EXPR).is(
  firstOf(
    sequence('(', TYPE_ID, ')', UNARY_EXPR),
    PRIMARY,
    sequence('*', UNARY_EXPR)));
rule(PRIMARY).is(
  firstOf(
    sequence('(', EXPR, ')'),
    ID));
Dream
mul-expr ➙ mul-expr * unary-expr | unary-expr
unary-expr ➙ ( type-id ) unary-expr | * unary-expr | primary
primary ➙ ( expr ) | id
(A)*B
Generalized parsers
● Earley (1968)
  ● slow
● GLR (1984)
  ● complex
Chicken and egg problem
(A)*B
[Figure: (A)*B derived in two ways, as a unary-expr (the cast (A) applied to *B) and as a mul-expr ((A) multiplied by B)]
Back to the future «dangling else»
if (…) if (…) then-stmt else else-stmt
[Figure: two parse trees for the statement above; in one the else-stmt belongs to the inner-if, in the other it follows the inner-if and belongs to the outer-if]
GLL: How does it work?
mul-expr ➙ mul-expr * unary-expr
| unary-expr
Generalized LL
● 2010
● no grammar left behind (left-recursive, ambiguous)
● simpler than GLR
● syntactic ambiguities
Summary
LL(1)
• trivial
• major grammar changes
• only good for arithmetic expressions
• on steroids (as in JavaCC), usable for real languages
PEG
• trivial
• fewer grammar changes
• no ambiguities
• usable for real languages
• nice tools such as SSLR
• dead-end for C/C++
GLL
• any grammar
• relatively simple
• ambiguities
• reasonable performance
• the only clean choice for C/C++
• only «academic» tools for now... ;)
Hand-written
● based on LL(1)
● precise error-reporting and recovery
● best performance
● maintenance hell
Q & A