Reachability and error diagnosis in LR(1)...

Reachability and error diagnosis in LR(1) parsers

François Pottier

Inria, ParisFebruary 15, 2016

40 years ago

While it is at least honest, a compiler that quits upon first detecting anerror will not be popular with its users.

Many runs may be required just to remove trivial keypunching errorsfrom a program.

– James J. Horning, “What the compiler should tell the user” (1976)

The finest tools have shortcomings

let f x == 3

$ ocamlc -c f.ml

File "f.ml", line 1, characters 8-10:Error: Syntax error

Diagnosis using yacc’s error token is hard to get right

module StringSet = Set.Make(String)let add (x : int) (xs : StringSet) =

StringSet.add (string_of_int x) xs

$ ocamlc -vThe Objective Caml compiler, version 3.10.0$ ocamlc -c s.ml

File "s.ml", line 2, characters 33-34:Syntax error: ’)’ expectedFile "s.ml", line 2, characters 18-19:This ’(’ might be unmatched

Diagnosis using yacc’s error token is hard to get right

$ ocamlc -vThe OCaml compiler, version 4.02.1$ ocamlc -c s.ml

File "s.ml", line 2, characters 33-34:Error: Syntax error: type expected.

Should this particular error be so hard to explain ?

$ echo "implementation: LET LIDENT LPAREN LIDENT COLON UIDENT RPAREN" \> | menhir --lalr --interpret-error parser.mly

#### Ends in an error in state: 194.#### mod_ext_longident -> mod_ext_longident . DOT UIDENT [ LPAREN DOT ]## mod_ext_longident -> mod_ext_longident . LPAREN mod_ext_longident

RPAREN [ LPAREN DOT ]## type_longident -> mod_ext_longident . DOT LIDENT [ ... 63 tokens ]

Only 3 continuations are possible at this point.

The last one is a good suggestion, as it allows a more interesting reduction.

$ echo "implementation: LET LIDENT LPAREN LIDENT COLON UIDENT RPAREN" \> | menhir --lalr --interpret-error parser.mly

#### Ends in an error in state: 194.#### mod_ext_longident -> mod_ext_longident . DOT UIDENT [ LPAREN DOT ]## mod_ext_longident -> mod_ext_longident . LPAREN mod_ext_longident

RPAREN [ LPAREN DOT ]## type_longident -> mod_ext_longident . DOT LIDENT [ ... 63 tokens ]

Only 3 continuations are possible at this point.

The last one is a good suggestion, as it allows a more interesting reduction.

We might hope to see this :

$ ocamlc -vThe OCaml compiler, version 4.242640687$ ocamlc -c s.ml

File "s.ml", line 2, characters 33-34:Syntax error: ill-formed type.Up to this point, an extended module path has been recognized.If this path is complete, then at this point,a dot ’.’, followed with a type constructor, is expected.

What’s the idea ?

Jeffery’s idea (2003) :

I Associate a handwritten diagnostic message...I ...with this invalid sentence (LET LIDENT LPAREN LIDENT COLON UIDENT RPAREN).I Let a tool translate the sentence to a state number (194).

This way, build a collection of state/message pairs...

...BY HAND.

What’s the idea ?

I Associate a handwritten diagnostic message...I ...with this invalid sentence (LET LIDENT LPAREN LIDENT COLON UIDENT RPAREN).I Let a tool translate the sentence to a state number (194).

This way, build a collection of state/message pairs...

...BY HAND.

Yea, right.

$ menhir --lalr -lg 1 -la 1 parser.mly

Grammar has 206 nonterminal symbols, among which 7 start symbols.Grammar has 118 terminal symbols.Grammar has 749 productions.Built an LR(1) automaton with 1551 states.

Research questions

This raises two obvious questions :

I How to come up with a collection of sentences that covers all error states ?I How to maintain this collection as the grammar evolves ?

An unanticipated question came up during this work :

I Is it possible to write an accurate diagnostic message for every error state ?

Research questions

This raises two obvious questions :

I How to come up with a collection of sentences that covers all error states ?I How to maintain this collection as the grammar evolves ?

An unanticipated question came up during this work :

I Is it possible to write an accurate diagnostic message for every error state ?

Per-state diagnostic messages ?

Menhir’s reachability algorithm and new features

CompCert’s new diagnostic messages

Conclusion

How to diagnose syntax errors ?

Choose a diagnostic message based on the LR automaton’s state,ignoring its stack entirely.

Is this a reasonable idea ?

Let’s have a look at a few example situations...

Is this a reasonable idea ? – Yes

Sometimes, yes, clearly the state alone contains enough information.

int f (int x) { do {} while (x--) }

The error is detected in a state that looks like this :

statement: DO statement WHILE LPAREN expr RPAREN . SEMICOLON [...]

It is easy enough to give an accurate message :

$ ccomp -c dowhile.c

dowhile.c:1:34: syntax error after ’)’ and before ’}’.Ill-formed statement.At this point, a semicolon ’;’ is expected.

Is this a reasonable idea ? – Yes, it seems... ?

Here is another example where things seem to work out as hoped :

int f (int x) { return x + 1 }

The error is detected in a state that looks like this :

expr -> expr . COMMA assignment_expr [ SEMICOLON COMMA ]expr? -> expr . [ SEMICOLON ]

We decide to omit the first possibility, and say a semicolon is expected.

$ ccomp -c return.c

return.c:1:29: syntax error after ’1’ and before ’}’.Up to this point, an expression has been recognized:

’x + 1’If this expression is complete,then at this point, a semicolon ’;’ is expected.

Yet, ’,’ and ’;’ are clearly not the only permitted futures ! What is going on ?

Is this a reasonable idea ? – Uh, oh...

Let us change just the incorrect token in the previous example :

int f (int x) { return x + 1 2; }

The error is now detected in a different state, which looks like this :

postfix_expr -> postfix_expr . LBRACK expr RBRACK [ ... ]postfix_expr -> postfix_expr . LPAREN arg_expr_list? RPAREN [ ... ]postfix_expr -> postfix_expr . DOT general_identifier [ ... ]postfix_expr -> postfix_expr . PTR general_identifier [ ... ]postfix_expr -> postfix_expr . INC [ ... ]postfix_expr -> postfix_expr . DEC [ ... ]unary_expr -> postfix_expr . [ SEMICOLON RPAREN and 34 more tokens ]

Based on this information, what diagnostic message can one propose ?

Is this a reasonable idea ? – No !

Based on this, the diagnostic message could say that :

I The “postfix expression” x + 1 can be continued in 6 different ways ;I Or maybe this “postfix expression” forms a complete “unary expression”...I ...and in that case, it could be followed with 36 different tokens...I among which ’;’ appears, but also ’)’, ’]’, ’}’, and others !

I there is a lot of worthless information,I yet there is still not enough information :I we cannot see that ’;’ is permitted, while ’)’ is not.

The missing information is not encoded in the state : it is buried in the stack.

Two problems

We face two problems :

I depending on which incorrect token we look ahead at,the error is detected in different states ;

I in some of these states, there is not enough informationto propose a good diagnostic message.

What can we do about this ?

We propose two solutions to these problems :

I Selective duplication.In the grammar, distinguish “expressions that can be followed with asemicolon”, “expressions that can be followed with a closing parenthesis”, etc.

(Uses Menhir’s expansion of parameterized nonterminal symbols.)

This fixes the problematic states by building more information into them.I Reduction on error.

In the automaton, perform one more reduction to get us out of the problematicstate before the error is detected.

(Uses Menhir’s new %on_error_reduce directive.)

This avoids the problematic states.

How do we know what we are doing ?

I how do we find all states where an error can be detected ?I in a canonical LR(1) automaton, this is easy...I in a non-canonical automaton and in the presence of conflicts, it is not !

I after tweaking the grammar or automaton, how do we know for surethat we have fixed or avoided the problematic states ?

We need tool support.

Conclusion

Finding all error states

How do we find all states where an error can be detected ?

I if the grammar is LR(1) and the automaton is canonical,then they are exactly the targets of terminal transitions.

I no longer true if the grammar has conflicts or the automaton is noncanonical !

For every state s and terminal symbol z, if (s, z) is an error entry, we must ask :

I is the configuration (s, z) reachable ?

We need a reachability algorithm.

Finding all error states

How do we find all states where an error can be detected ?

I if the grammar is LR(1) and the automaton is canonical,then they are exactly the targets of terminal transitions.

I no longer true if the grammar has conflicts or the automaton is noncanonical !

For every state s and terminal symbol z, if (s, z) is an error entry, we must ask :

I is the configuration (s, z) reachable ?

We need a reachability algorithm.

The algorithm’s specification : a big-step semantics of LR(1) automata

sε / ε−−� s [z]

Step-Terminal

sα /w−−−� s′ [z] A ` s′ z

−→ s′′

sαz /wz−−−−−� s′′ [z′]

Step-NonTerminal

sα /w−−−� s′ [z] A ` s′ A

−→ s′′

s′A /w′−−−−→ s′′ [z′] z = first(w ′z′)

sαA /ww′−−−−−−� s′′ [z′]

Reduce

A ` s A−→ s′′ s

α /w−−−� s′ [z]

A ` s′ reduces A → α on z

sA /w−−−→ s′′ [z]

Figure: Inductive characterization of the predicates sα /w−−−−� s′ [z] and s

A /w−−−−→ s′ [z].

The algorithm’s performance

●●

●● ●●

●●

●●●

●●

●●●

●●●●

●●●

●●CompCert

●Unicon

100 1000grammar size

●●● ●

●●

●●●

●●

●●●●●

●●●

●●

●●●

● ●

●●

●●● ●●●●●

●● ●●●●●●●

●●● ●●●●● ●

● ● ●●●

●● CompCert

●Unicon

1e+05 1e+06 1e+07# recorded facts

●●●

● ●

●●● ●

●●

●●●●● ●●●

●●

●●●

● ●● ●

●●

●●●

●●●●●●● ●●● ●●

●●●●●

● ●●

●●CompCert

●Unicon

1e+05 1e+06 1e+07# recorded facts

●● ●

●●

● ●

● ●●●

●●

●●●

●●

●●●●

● ●

●●

●● ●

●●●

●●

●●● ●●●● ●●

● ●●● ●●●●

● ●●

● ●● ●●●

●● ● ●

●●CompCert

●Unicon

1e+07 1e+08 1e+09total star size * # terminals ^ 2

Menhir’s new features

Menhir can now :

I list all states where an error can be detected,together with example sentences that cause these errors.

The grammar author :

I manually constructs a diagnostic message for each error state ;I adjusts the grammar or automaton to make this task easier.

Menhir :

I updates the list of example sentences and messages as the grammar evolves ;I checks that this list remains correct, irredundant, and complete.

A few figures

(One version of) CompCert’s ISO C99 parser :

I 145 nonterminal symbols, 93 terminal symbols, 365 productions ;I 677-state LALR(1) automaton ;I 263 error states found in 43 seconds using 1Gb of memory ;I 150 distinct hand-written diagnostic messages.

Conclusion

Show the past, show (some) futures

color->y = (sc.kd * amb->y + il.y + sc.ks * is.y * sc.y;

$ ccomp -c render.c

render.c:70:57: syntax error after ’y’ and before ’;’.Up to this point, an expression has been recognized:

’sc.kd * amb->y + il.y + sc.ks * is.y * sc.y’If this expression is complete,then at this point, a closing parenthesis ’)’ is expected.

Guidelines :

I Show the past : what has been recently understood.I Show the future : what is expected next...I ...but do not show every possible future.

Stay where we are

multvec_i[i = multvec_j[i] = 0;

$ ccomp -c subsumption.c

subsumption.c:71:34: syntax error after ’0’ and before ’;’.Ill-formed expression.Up to this point, an expression has been recognized:

’i = multvec_j[i] = 0’If this expression is complete,then at this point, a closing bracket ’]’ is expected.

Guidelines :

I Show where the problem was detected,I even if the actual error took place earlier.

Show high-level futures ; show enough futures

void f (void) { return; }}

$ gcc -c braces.c

braces.c:1: error: expected identifier or ‘(’ before ‘}’ token

$ clang -c braces.c

braces.c:1:26: error: expected external declaration

$ ccomp -c braces.c

braces.c:1:25: syntax error after ’}’ and before ’}’.At this point, one of the following is expected:

a function definition; ora declaration; ora pragma; orthe end of the file.

Show high-level futures ; show enough futures

Guidelines :

I Do not just say what tokens are allowed next :I instead, say what high-level constructs are allowed.I List all permitted futures, if that is reasonable.

Show enough futures

int f(void) { int x;) }

$ gcc -c extra.c

extra.c: In function ‘f’:extra.c:1: error: expected statement before ‘)’ token

$ clang -c extra.c

extra.c:1:7: error: expected expression

$ ccomp -c extra.c

extra.c:1:20: syntax error after ’;’ and before ’)’.At this point, one of the following is expected:

a declaration; ora statement; ora pragma; ora closing brace ’}’.

Show the goal(s)

int main (void) { static const x; }

$ ccomp -c staticconstlocal.c

staticconstlocal.c:1:31: syntax error after ’const’ and before ’x’.Ill-formed declaration.At this point, one of the following is expected:

a storage class specifier; ora type qualifier; ora type specifier.

Guidelines :

I If possible and useful, show the goal.I Here, we definitely hope to recognize a “declaration”.

Show the goal(s)

static const x;

$ ccomp -c staticconstglobal.c

staticconstglobal.c:1:13: syntax error after ’const’ and before ’x’.Ill-formed declaration or function definition.At this point, one of the following is expected:

a storage class specifier; ora type qualifier; ora type specifier.

Guidelines :

I Show multiple goals when the choice has not been made yet.I Here, we hope to recognize a “declaration” or a “function definition”.

Conclusion

Contribution

I We equip the Menhir parser generator with tools that help :I understand and fine-tune the landscape of syntax errors ;I build and maintain a complete collection of diagnostic messages.

I We apply this approach to the CompCert C99 (pre-)parser.

You can do it, too !

Reachability and error diagnosis in LR(1)...

Documents

11 Outline 6.0 Introduction 6.1 Shift-Reduce Parsers 6.2 LR Parsers 6.3 LR(1) Parsing 6.4 SLR(1)Parsing 6.5 LALR(1) 6.6 Calling Semantic

TRACING OUR ANCESTORS - Riksavisen · 2017-04-16 · are sometimes incredible. The Great Menhir near Locmariaquer, now thrown down and broken {probably by an earthquake}, was nearly

Obelix - Atari 2600 - Manual - gamesdatabase...Obelix drop a menhir. Le jeu se joue en quatre manches: la fin d'une manche, tous soldats disparaissent momentanëment de l'écran et

Performativity and Metaphor in New Materialist Media Theoryeprints.lincoln.ac.uk/25947/1/409-1-744-1-10-20160215 (1... · 2017-01-31 · Performativity and Metaphor in New Materialist

De la DOULEUR AIGUË POSTOPERATOIRE à la DOULEUR CHRONIQUE ? Département d'Anesthésie et Centre dEvaluation et de Traitement de la Douleur DOULEUR LALR

Nevzat Ilhan.ppt [Read-Only] - TEElibrary.tee.gr/digital/ker/ker_m311/ker_m311_ilhan.pdfLalapasha can be followed in Bulgaria and Greece. DOLMENS IN THRACE A Menhir. BY THE BEGINNING

InBio Pro Installation Guide-20160215zkteco.in/.../2016/07/InBio-Pro-Installation-Guide-20160215-2.pdf · INSTALLATION GUIDE InBio Pro Series ... 6 LED Indicators ... InBio Pro Series

Table of Contents - Celtic studiesceltic.cmrs.ucla.edu/csana/newsletter/csana_25.2.pdf · supernatural powers. Obelix can carry a menhir upon his back, but only because he fell into

Lark - An LALR(2) Parser Generator With Backtracking J - CoCoLab

eMAGAZINE | FEBRUARY 2016s3.amazonaws.com/gloportalemag/Zambezi/20160215/issue_17.pdf · This eMagazine is produced especially for residents of De Zalze Winelands Golf Estate to provide

merschmersch · The «Hinkelstein» of Reckingen is the first archeological menhir known in Luxembourg. Wichtelslee T he «Wichtelslee» is a refuge- stronghold built to protect the

Brave New World: MiFID2 and MiFIR – The changes facing … february/20160215... · Brave New World: MiFID2 and MiFIR – The changes facing the Financial Markets . Charlotte Stalin

DATA SHEET > CNV 508T5 > 20160215 CNV 508T5gomeasure.dk/wp-content/uploads/2017/02/CNV508T5.pdfDATA SHEET > CNV 508T5 > 20160215 CNV 508T5 ... additional protection circuit is integrated

Meubles du Menhir

GINKGO BILOBA ‘LEMONLIME SPIRE’ (MENHIR) · 2019. 4. 4. · GINKGO BILOBA ‘LEMONLIME SPIRE’ (MENHIR) FEATURES USES CARE Picture and information intended as a guide only. Visit

7 Syntactic Analysis LALR

Visualized guideline model for measuring the …selab.hongik.ac.kr/down/conference_p/20160215...2016/02/15 · 1 Visualized guideline model for measuring the Assessment Model based

Efficient Computation of LALR(1) Look-Ahead Sets

save sheet Quest Dial #1 - Awaken Realmsawakenrealms.com/wp-content/uploads/2019/10/... · save sheet Location Dial Value Menhir #1 Menhir #2 Menhir #3 Location Dial Value Quest Dial

Ontario Archaeological Society Arch Notes€¦ · Ontario Archaeological Society Arch Notes ... 17 Menhir hunting and other French pastimes, by Mima Kapches Notices 18 2006 OAS Symposium