cst-to-ast-sablecc


CST to AST Tutorial

Fidel Viegas

August 18, 2004

1 Introduction

Prior to SableCC 3.0, the user had no control over the construction of Abstract Syntax Trees. This led to huge trees with many unnecessary nodes.

To overcome this problem, SableCC 3.0 introduced a new section: the Abstract Syntax Tree section. In this section, the user designates the types of nodes that may be part of a tree.

The tree is constructed by the parser as before, but now the user specifies how it is constructed in the Productions section of the grammar specification. This is an advantage, because experienced users can reduce the size of the tree generated by the parser, which reduces memory consumption and in turn improves speed.

In this document I will introduce users to the new features found in SableCC 3.0.

To be more precise, this tutorial is an extension of a tutorial I published on my web site [2]. Users unfamiliar with SableCC should read that tutorial before going through this one.

With the inclusion of the Abstract Syntax Tree section, this is how the file specification will look (see footnote 1):

Package declaration

Tokens declaration

Ignored Tokens declaration

Productions

Abstract Syntax Tree

In the sections that follow, I will take you through the new syntax for Productions and the Abstract Syntax Tree.

Footnote 1: All of these sections are optional.


2 What to do before creating the AST section?

Before you create an Abstract Syntax Tree section in your file specification, you should first write the grammar as you would in SableCC prior to version 3.0, so that you have a working parser. Once you have a working parser, you can start thinking about your AST: look at your grammar, pick out the most important elements, and add them to the AST section.

Obvious things that you may remove are keywords and operators (e.g. mathematical and logical operators). In this tutorial, we are going to use the SmallPascal grammar as an example.

Before describing the new syntax for the Productions section, we are going to describe the Abstract Syntax Tree section and what SableCC generates out of it.

3 Abstract Syntax Tree section

In versions of SableCC prior to 3.0, we wrote the Productions section as our grammar to generate the parser, and out of that grammar SableCC would also generate the tree node types. To recapitulate, let's have a look at the following Productions section from the grammar found in Appendix B:

    Productions
        exp =
            {plus} exp plus term |
            {minus} exp minus term;

        term =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor =
            {number} number |
            {exp} l_par exp r_par;

From this Productions section, SableCC generates the parser for the language, and also generates the types. It generates the types as follows:

- For each production, it generates a type named P followed by the name of the production with its first letter capitalised (e.g. for exp it generates PExp, for term it generates PTerm, etc.).

- For each alternative, it generates a type named A followed by the name of the alternative and then the name of the production (e.g. for plus it generates the type APlusExp, for mult it generates AMultTerm, etc.).
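The naming rule can be sketched in a few lines of Java. This is only an illustration: NodeTypeNames and its methods are hypothetical helpers, not part of SableCC, and the sketch assumes simple names without underscores.

```java
// Hypothetical helper (not part of SableCC): reproduces the naming rule
// described above, for simple underscore-free names only.
final class NodeTypeNames {
    private NodeTypeNames() {}

    // Capitalise the first letter, e.g. "exp" -> "Exp".
    static String capitalise(String name) {
        return Character.toUpperCase(name.charAt(0)) + name.substring(1);
    }

    // Production type: "P" followed by the capitalised production name.
    static String productionType(String production) {
        return "P" + capitalise(production);
    }

    // Alternative type: "A", then the capitalised alternative name,
    // then the capitalised production name.
    static String alternativeType(String alternative, String production) {
        return "A" + capitalise(alternative) + capitalise(production);
    }

    public static void main(String[] args) {
        System.out.println(productionType("exp"));           // PExp
        System.out.println(productionType("term"));          // PTerm
        System.out.println(alternativeType("plus", "exp"));  // APlusExp
        System.out.println(alternativeType("mult", "term")); // AMultTerm
    }
}
```

The four printed names are the ones quoted in the list above.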

Now, if we replace the name Productions by Abstract Syntax Tree, we have our Abstract Syntax Tree section.

That is all you have to know to create an Abstract Syntax Tree section. The only difference is that this section only generates the types; it doesn't generate the parser. The parser is generated from the grammar in the Productions section.

Also, bear in mind that if you include an Abstract Syntax Tree section in your file specification, you have to do a CST-AST mapping. This is done in the Productions section, which we are going to describe below.

If you include an Abstract Syntax Tree section and don't do a mapping in the Productions section, SableCC will report an error. If your Productions section is exactly the same as your Abstract Syntax Tree section, it won't report an error, because SableCC does the mapping automatically. But this would be the same as omitting the Abstract Syntax Tree section, and wouldn't make much sense, because the whole purpose of this new section is to reduce the size of the trees constructed by the parser by removing the unnecessary nodes.

In the following section, I'll introduce you to the new Productions section in SableCC 3.0.

4 The New Productions Section

The new Productions section has a new syntax (see footnote 2), which looks like this:

    production {-> prod_transform1 prod_transform2 ... prod_transformN} =
        element1 element2 ... elementN {-> alt_transform1 alt_transform2 ... alt_transformN} ;

In the new Productions section, we can specify two kinds of transformation: the transformation of the production itself and the transformation of each of its alternatives. As you can see above, there is one set of transforms inside {-> (see footnote 3) and } before the = sign, and another set before the semicolon. The set before the = sign is the production transformation, and the set before the semicolon is the alternative transformation. The {-> token means "this transforms to" (see footnote 4). So, in the case above, production transforms to (or returns) prod_transform1, prod_transform2, up to prod_transformN. That is, the parser will generate a node of type prod_transform1, followed by a node of type prod_transform2, up to a node of type prod_transformN.

There are a couple of rules we must follow for the production to be accepted:

1. The number of prod_transforms in the production transformation has to equal the number of alt_transforms in the alternative transformation. That is, the number of elements in {-> prod_transform1 prod_transform2 ... prod_transformN} has to be the same as the number of elements in {-> alt_transform1 alt_transform2 ... alt_transformN}.

2. The type of each element should match on both sides. That is, prod_transform1 should be the same type as alt_transform1, prod_transform2 should be the same type as alt_transform2, and so on. This means that prod_transform1 and alt_transform1 are the same type as defined in the Abstract Syntax Tree section, that they are both lists of homogeneous tokens or AST nodes, or that they match the same token.

Footnote 2: SableCC 3.0 still accepts the old SableCC 2.0 syntax, which means that you can still generate code from your old grammars.

Footnote 3: This is treated as a single token composed of the { character, the - character and the > character. No spaces are permitted in between.

Footnote 4: In programming-language parlance, we could say that it is the return variable of a function or method.
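The two rules can be pictured as a small check over the two lists of types. The Java below is a sketch written only for this tutorial: TransformCheck is a hypothetical class, types are represented as plain strings such as "exp" or "exp*", and nothing here claims to be how SableCC itself validates a grammar.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical checker that illustrates rules 1 and 2 only. Types are plain
// strings such as "exp" or "exp*" (a trailing * marks a homogeneous list).
final class TransformCheck {
    private TransformCheck() {}

    static boolean matches(List<String> prodTransforms, List<String> altTransforms) {
        // Rule 1: both sides must have the same number of elements.
        if (prodTransforms.size() != altTransforms.size()) {
            return false;
        }
        // Rule 2: the types must agree position by position.
        for (int i = 0; i < prodTransforms.size(); i++) {
            if (!prodTransforms.get(i).equals(altTransforms.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One exp on each side: accepted.
        System.out.println(matches(Arrays.asList("exp"), Arrays.asList("exp")));
        // Two production transforms but one alternative transform: rule 1 fails.
        System.out.println(matches(Arrays.asList("exp", "exp*"), Arrays.asList("exp")));
        // A list on one side and a single node on the other: rule 2 fails.
        System.out.println(matches(Arrays.asList("exp*"), Arrays.asList("exp")));
    }
}
```

Comparing strings is a simplification; it is enough to show that both the count and the position of every type matter.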

The best way to understand how this works is to pretend that the production is a function (or method) that returns multiple values. So, let's say that production is our function and prod_transform1 prod_transform2 ... prod_transformN are the types of the values it should return. Rewriting this as a function, we would have something like this:

    function production returns prod_transform1, prod_transform2, ... prod_transformN

Now, as in a function, we do some processing and then return some values; prod_transform1 ... prod_transformN are the variables that are going to hold those values. In a function, to return the values we would use a return statement, which would look like this:

    return alt_transform1, alt_transform2, ... alt_transformN

As you can see, when we return our values, the types of prod_transform1 and alt_transform1 must match, and so on. The same applies when we work with SableCC production transformations. We now know that the output variables correspond to prod_transform1 ... prod_transformN, that our return values correspond to alt_transform1 ... alt_transformN, and that before we return the values we need to do some processing. The same is true in SableCC.

One thing I did not mention when I described the syntax is the elements (element1 element2 ... elementN) before the alternative transformation. You were probably wondering what they are used for. These elements are the rules that govern the production. That is, they are what goes after the = sign in SableCC prior to version 3.0, and they are what the alternative transformation is built from: they supply the alt_transformN inside the alternative transformation. There are five ways of constructing an alternative transformation: identity, new node, homogeneous list, elimination and empty. If you do not understand the terminology as you go through each of them, don't worry. It will become clear when we go through the SmallPascal grammar in detail.

Below are the descriptions of each one with a practical example.

4.1 Identity

This, as its name implies, returns the element itself. The element can be either a token or the return value of a production (see footnote 5). Here is an example.

Let's look at the program_heading production of the SmallPascal grammar in Appendix A:

    program_heading {-> identifier} =
        T.program identifier semicolon {-> identifier} ;

As you can see, the production transformation returns an identifier, so on the right-hand side we must also return an identifier. That identifier is the one that follows T.program. This is what we call identity: you just place the token itself.
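In terms of the function analogy, the identity transformation is a function that receives every element of the alternative and hands one of them back unchanged. The Java below is a hand-written illustration; IdentityAnalogy and its Token class are hypothetical stand-ins, not code that SableCC generates.

```java
// Hand-written illustration of the "production as a function" analogy.
// Token is a hypothetical stand-in for a SableCC token class.
final class IdentityAnalogy {
    private IdentityAnalogy() {}

    static final class Token {
        final String text;
        Token(String text) { this.text = text; }
    }

    // program_heading {-> identifier} = T.program identifier semicolon {-> identifier}
    // The function sees all three elements of the alternative but returns only
    // the identifier, untouched: that is the identity transformation.
    static Token programHeading(Token program, Token identifier, Token semicolon) {
        return identifier;
    }

    public static void main(String[] args) {
        Token name = new Token("demo");
        Token result = programHeading(new Token("program"), name, new Token(";"));
        // The very same token object is passed through.
        System.out.println(result == name);
    }
}
```

The program keyword and the semicolon are consumed by the parser but never reach the tree, which is exactly what the alternative transformation above asks for.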

Now let's look at an example where we use the return value (remember, we are still thinking in terms of functions):

    expression =
        {term} term {-> term.expression} |
        {plus} expression plus term {-> New expression.plus(expression, term.expression)} |
        {minus} expression minus term {-> New expression.minus(expression, term.expression)} ;

    term {-> expression} =
        {factor} factor {-> factor.expression} |
        {mult} term mult factor {-> New expression.mult(term.expression, factor.expression)} |
        {div} term div factor {-> New expression.div(term.expression, factor.expression)} ;

If you look at production expression, it omits the production transformation. That is, there is nothing between the production name and the = sign. This is because we have a production of the same name in the Abstract Syntax Tree section. When this happens, we don't have to provide the production transformation, because SableCC includes it by default. That is, it transforms to a tree node of the same name.

Now, if you look at production term, you can see that it returns an expression, and if you look at alternative term of production expression, we have {-> term.expression}: term.expression is the return value of production term. If we were thinking in terms of functions, we would get the return value of term, which in this case is expression. Basically, you only use this notation when there is an explicit return value. As we stated before, production expression does not return a value explicitly, so when we place it inside the alternative transformation, we just place its name without the dot notation. That is, we use its own identity. In the subsection below I will explain the usage of expression.

Footnote 5: Remember the function concept above? More on this later.

4.2 New Node

This is used when we want to create a new node. If we look at the productions shown above, we can see it used in alternatives plus, minus, mult and div.

The syntax for this kind of transformation is:

    New production[.alternative](arg1, arg2, ..., argN)

Here [.alternative] is an optional part, and arg1 ... argN are what goes in the alternative. production is an Abstract Syntax Tree production and alternative is an alternative of that same production. To better understand it, let's include a production from the Abstract Syntax Tree section:

    expression =
        {plus} [left]:expression [right]:expression |
        {minus} [left]:expression [right]:expression |
        {mult} [left]:expression [right]:expression |
        {div} [left]:expression [right]:expression |
        {identifier} identifier |
        {number} number |
        {expression} expression ;

Here we have a production expression, with alternatives plus, minus, mult, div, identifier, number and expression. The arguments of the transformation are what goes after the alternative name (e.g. [left]:expression [right]:expression in the plus alternative).

So, if we want to create a node of type APlusExpression, we would do it as follows:

    New expression.plus(expression1, expression2)

Here expression1 and expression2 are nodes of type expression. Looking at the grammar in Appendix A, this is an example:

    {plus} expression plus term {-> New expression.plus(expression, term.expression)}

As you can see, the arguments to this constructor are expression and term.expression, both of type expression. We construct the new node from the elements of the alternative, that is, expression and term. If we look at production term, we can see that it returns a node of type expression. This is what we pass to the constructor for plus, because that is what it takes as an argument.
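Seen from the Java side, New behaves much like a constructor call on the node type of the chosen alternative. The classes below are hand-written stand-ins whose names follow the naming rule from section 3; they are an assumption made for illustration, and the classes SableCC really generates contain more than this. NewNodeDemo is a hypothetical name.

```java
// Hand-written stand-ins named after the expression production and two of
// its alternatives. Assumption: generated classes are richer than these.
abstract class PExpression {}

final class ANumberExpression extends PExpression {
    final String number;
    ANumberExpression(String number) { this.number = number; }
}

final class APlusExpression extends PExpression {
    final PExpression left;   // [left]:expression
    final PExpression right;  // [right]:expression

    APlusExpression(PExpression left, PExpression right) {
        this.left = left;
        this.right = right;
    }
}

final class NewNodeDemo {
    private NewNodeDemo() {}

    // Mirrors {-> New expression.plus(expression, term.expression)}:
    // both arguments are already nodes of type expression.
    static PExpression plus(PExpression expression, PExpression termExpression) {
        return new APlusExpression(expression, termExpression);
    }

    public static void main(String[] args) {
        PExpression sum = plus(new ANumberExpression("1"), new ANumberExpression("2"));
        System.out.println(sum instanceof APlusExpression);
    }
}
```

The plus token never appears as a constructor argument: the node type already records which operator was parsed.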

As you can see, for expression we used its identity. This is because expression does not have a production transformation, so it does not explicitly return anything (remember the concept of functions?). If, on the other hand, our expression production looked like this:

    expression {-> expression} =
        {term} term {-> term.expression} |
        {plus} expression plus term {-> New expression.plus(expression, term.expression)} |
        {minus} expression minus term {-> New expression.minus(expression, term.expression)} ;

Then we would have to change our alternative transformation as follows:

    {plus} expression plus term {-> New expression.plus(expression.expression, term.expression)}

Now we are also using the node being returned by expression, which is:

    {-> expression}


4.3 Homogeneous List

There are times when a production actually returns a list of homogeneous nodes, or when we are constructing a node that takes a list of homogeneous nodes. In these cases, we append a * character to the element name in the production transformation, and in the alternative transformation we create a list with the list constructor [elm1, elm2, ..., elmN].

Here is an example:

    exp_list {-> exp*} = exp exp_list_tail* {-> [exp, exp_list_tail.exp]};

    exp_list_tail {-> exp} = comma exp {-> exp};

In this example, we have production exp_list, which returns a list of exp (remember, this exp is from the Abstract Syntax Tree section). As you can see, we appended a * to exp in {-> exp*}. When we construct the list on the right-hand side, that is, in the alternative transformation, we use the list operator: [exp, exp_list_tail.exp]. In that list we include both exp and exp_list_tail.exp. Note that we are including the node returned by exp_list_tail, which is exp. Remember, this is a homogeneous list. In the production transformation {-> exp*}, we are saying that exp_list returns a list of exp. So, on the right-hand side, you have to return a list of exp.

If you want to return an empty list, just return [].
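One way to picture the list constructor is as plain list concatenation. In the Java sketch below, strings stand in for exp nodes and HomogeneousListDemo is a hypothetical class; it models the shape of the result only and is not generated code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of [exp, exp_list_tail.exp]: the first exp and the exp
// returned by each repetition of exp_list_tail end up in one flat list.
// Strings stand in for exp nodes.
final class HomogeneousListDemo {
    private HomogeneousListDemo() {}

    static List<String> expList(String exp, List<String> expListTailExps) {
        List<String> result = new ArrayList<>();
        result.add(exp);                 // exp
        result.addAll(expListTailExps);  // exp_list_tail.exp, one per repetition
        return result;
    }

    public static void main(String[] args) {
        List<String> tails = new ArrayList<>();
        tails.add("b");
        tails.add("c");
        System.out.println(expList("a", tails));                   // [a, b, c]
        System.out.println(expList("a", new ArrayList<String>())); // [a]
    }
}
```

Every entry has the same type, which is what makes the list homogeneous; when exp_list_tail* matches nothing, the list simply holds the first exp.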

4.4 Elimination Node

There isn't much to say about this one, because I haven't come across a grammar that uses it. Basically, you return null in place of a node that is optional.

Here is an example:

    {plus} [exp1]:expression plus [exp2]:expression? {-> New expression.plus(expression, Null)}

As you can see, we used Null in place of the second expression. You can only use the Null operator when your nonterminal or terminal is optional, that is, when it has a ? appended to it.

4.5 Empty

Finally, there are times when we don't want a production to produce any node. In cases like that, we just provide a production transformation and an alternative transformation without any elements, as follows:

    type {-> } =
        {boolean} boolean {-> } |
        {integer} integer {-> } ;

When we do this, the parser will ignore the production and won't include it in the tree. This is quite different from the Null operator: the Null operator generates null, whereas this one won't generate anything in the tree.

Now we are going to have a look at the grammar in Appendix B, and explain how we created the Abstract Syntax Tree shown in the grammar in Appendix C.


5 CST-AST mapping

The first thing to do is to create our SableCC file specification from our EBNF grammar. So, for the EBNF grammar:

    exp =
        exp "+" term |
        exp "-" term

    term =
        term "*" factor |
        term "/" factor |
        factor

    factor =
        number |
        "(" exp ")"

    number = digit+

    digit = "0" .. "9"

We create a SableCC specification file as found in Appendix B. Then, from the grammar found in Appendix B, we create something like this:

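    Package expression;

    Helpers
        digit = ['0' .. '9'];
        tab = 9;
        cr = 13;
        lf = 10;
        eol = cr lf | cr | lf; // this takes care of the different platforms
        blank = (' ' | tab | eol)+;

    Tokens
        l_par = '(';
        r_par = ')';
        plus = '+';
        minus = '-';
        mult = '*';
        div = '/';
        comma = ',';
        blank = blank;
        number = digit+;

    Ignored Tokens
        blank;

    Productions
        exp =
            {plus} exp plus term |
            {minus} exp minus term;

        term =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor =
            {number} number |
            {exp} l_par exp r_par;

    Abstract Syntax Tree
        exp =
            {plus} exp plus term |
            {minus} exp minus term;

        term =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor =
            {number} number |
            {exp} l_par exp r_par;
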

Now that we have our Abstract Syntax Tree, we look for elements which aren't needed in our tree. By looking at the grammar, we can see that we don't need any of the operators, because to know whether a node is a plus or a minus, we just need its type, that is, APlusExp, AMinusExp, etc. These are sufficient to allow us to process the tree later on. We also don't need l_par or r_par. So, removing all of these terminals will give us an Abstract Syntax Tree as follows:

    Abstract Syntax Tree
        exp =
            {plus} exp term |
            {minus} exp term;

        term =
            {mult} term factor |
            {div} term factor |
            {factor} factor;

        factor =
            {number} number |
            {exp} exp ;

Now, if you look at our AST grammar, you can see that alternatives plus and minus look exactly the same, and mult and div also look the same. If this were in the Productions section, it wouldn't be allowed, because you would have a reduce/reduce conflict. The same does not happen in the AST section, because we are creating the types, not the parser. So, this is valid.

We could leave our AST grammar as it is, but usually we create it as a single production. So, we get rid of productions term and factor, move everything to production exp, and replace any occurrences of term and factor by exp. Here is our final AST grammar:

    Abstract Syntax Tree
        exp =
            {plus} [exp1]:exp [exp2]:exp |
            {minus} [exp1]:exp [exp2]:exp |
            {mult} [exp1]:exp [exp2]:exp |
            {div} [exp1]:exp [exp2]:exp |
            {number} number |
            {exp} exp ;
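To see why the node types are enough to process the tree later on, here is a small evaluator over this AST, sketched in Java. Everything in it is an assumption made for illustration: the classes are hand-written stand-ins named after the generated types, BinaryExp and AstEvaluator are hypothetical helpers, the number is stored as an int instead of a token, and a real SableCC tree would normally be walked with the generated visitor classes instead of instanceof tests.

```java
// Hand-written stand-ins for the node types of the final AST grammar.
// Assumption: real SableCC output has accessors and visitor support.
abstract class PExp {}

final class ANumberExp extends PExp {
    final int number; // simplified: an int instead of a number token
    ANumberExp(int number) { this.number = number; }
}

final class AExpExp extends PExp {
    final PExp exp;
    AExpExp(PExp exp) { this.exp = exp; }
}

// Hypothetical helper base class for the four two-operand alternatives.
abstract class BinaryExp extends PExp {
    final PExp exp1;
    final PExp exp2;
    BinaryExp(PExp exp1, PExp exp2) { this.exp1 = exp1; this.exp2 = exp2; }
}

final class APlusExp extends BinaryExp { APlusExp(PExp a, PExp b) { super(a, b); } }
final class AMinusExp extends BinaryExp { AMinusExp(PExp a, PExp b) { super(a, b); } }
final class AMultExp extends BinaryExp { AMultExp(PExp a, PExp b) { super(a, b); } }
final class ADivExp extends BinaryExp { ADivExp(PExp a, PExp b) { super(a, b); } }

final class AstEvaluator {
    private AstEvaluator() {}

    // The node type alone selects the operation; no operator token is stored.
    static int eval(PExp node) {
        if (node instanceof ANumberExp) return ((ANumberExp) node).number;
        if (node instanceof AExpExp) return eval(((AExpExp) node).exp);
        BinaryExp b = (BinaryExp) node;
        int left = eval(b.exp1);
        int right = eval(b.exp2);
        if (node instanceof APlusExp) return left + right;
        if (node instanceof AMinusExp) return left - right;
        if (node instanceof AMultExp) return left * right;
        if (node instanceof ADivExp) return left / right; // integer division
        throw new IllegalArgumentException("unknown node type");
    }

    public static void main(String[] args) {
        // Tree for (1 + 2) * 3
        PExp tree = new AMultExp(
            new AExpExp(new APlusExp(new ANumberExp(1), new ANumberExp(2))),
            new ANumberExp(3));
        System.out.println(eval(tree)); // 9
    }
}
```

No plus, minus, mult, div or parenthesis token is stored anywhere in the tree, yet the expression can still be evaluated.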

Remember how, in prior versions of SableCC, when you had more than one occurrence of a nonterminal or a terminal, you had to prefix it with [name]:? The same holds true for the AST section. When you have two nonterminals or two terminals of the same name, you have to distinguish them by giving them a name. You do this by prefixing each with [name]: as shown in the grammar above.

Now that we have our AST grammar ready, it is time to map it to our Productions grammar.

Let's look at our Productions again:

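    Productions
        exp =
            {plus} exp plus term |
            {minus} exp minus term;

        term =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor =
            {number} number |
            {exp} l_par exp r_par;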

By looking at the grammar above, we can see some resemblance to our new AST section. Let's start from the top of our grammar. We want to map exp to a production in our AST section. The first thing to do is to create the production transformation for each production. Since we only have one type in our AST, all our productions return an exp in their production transformation. The grammar will now look like this:

    Productions
        exp {-> exp} =
            {plus} exp plus term |
            {minus} exp minus term;

        term {-> exp} =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor {-> exp} =
            {number} number |
            {exp} l_par exp r_par;

Next, we create the alternative transformations. If we begin with production factor, we know that we have to return a node of type exp. Looking at alternative exp of this production, we can see that we need to return the return value of exp, so we just return its identity. For alternative number, on the other hand, we need to construct a new node, so we need the New operator as described earlier in this document. Here is how the grammar looks after we have mapped production factor:

    Productions
        exp {-> exp} =
            {plus} exp plus term |
            {minus} exp minus term;

        term {-> exp} =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor {-> exp} =
            {number} number {-> New exp.number(number)} |
            {exp} l_par exp r_par {-> exp.exp} ;

If you look at alternative number, you can see that we used the New operator, whereas in alternative exp we used the identity.

Remember when I said that, if you have a production in Productions with the same name as a production in Abstract Syntax Tree, you can omit the production transformation? Well, we do have exp. So, in this case, we can get rid of it. Our grammar will now look like this:

    Productions
        exp =
            {plus} exp plus term |
            {minus} exp minus term;

        term {-> exp} =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor {-> exp} =
            {number} number {-> New exp.number(number)} |
            {exp} l_par exp r_par {-> exp} ;

As you can see, in alternative exp of production factor, we return exp only. This is because we omitted the transformation from the production itself. That is, exp does not return a node explicitly any more. It still does, but implicitly, so you don't use the dot notation.

Now, following the same principle, we do the other transformations, and our final grammar will look like this:

    Productions
        exp =
            {plus} exp plus term {-> New exp.plus(exp, term.exp)} |
            {minus} exp minus term {-> New exp.minus(exp, term.exp)};

        term {-> exp} =
            {mult} term mult factor {-> New exp.mult(term.exp, factor.exp)} |
            {div} term div factor {-> New exp.div(term.exp, factor.exp)} |
            {factor} factor {-> factor.exp} ;

        factor {-> exp} =
            {number} number {-> New exp.number(number)} |
            {exp} l_par exp r_par {-> exp} ;

Looking at alternative factor of production term, we can see that we used the identity. In fact, we are using the node returned by production factor: factor.exp, which is the same exp as in factor {-> exp}. The same applies to all the others.

If you want a more complex grammar, have a look at the one in Appendix A. This is just a preliminary version of the tutorial. I am going to refine it so that it makes sense to users and is easy to read.

If you have any queries regarding this tutorial, please drop me an email.

All the best,
Fidel.


Appendix A

    Package org.sablecc.pascal; // package name

    Helpers
        /**
         * Pascal is a case-insensitive language. So, we'll use helpers
         * to simplify our regular expressions. E.g. instead of writing,
         * for instance, end = ('e' | 'E') ('n' | 'N') ('d' | 'D'), we
         * may write: end = e n d, which takes less space and makes the
         * regular expression more readable.
         */

        a = 'a' | 'A' ; // this could also be written as ['a' + 'A']
        b = 'b' | 'B' ; // but I prefer the old lex style
        d = 'd' | 'D' ;
        e = 'e' | 'E' ;
        g = 'g' | 'G' ;
        i = 'i' | 'I' ;
        l = 'l' | 'L' ;
        m = 'm' | 'M' ;
        n = 'n' | 'N' ;
        o = 'o' | 'O' ;
        p = 'p' | 'P' ;
        r = 'r' | 'R' ;
        t = 't' | 'T' ;
        v = 'v' | 'V' ;
        w = 'w' | 'W' ;

        l_curly_bracket = '{' ;
        r_curly_bracket = '}' ;
        ascii_char = [32 .. 127] ;

        // letters and digits
        letter = [['a' .. 'z'] + ['A' .. 'Z']];
        digit = ['0' .. '9'] ;

        // un-printable characters
        tab = 9 ;
        cr = 13 ;
        lf = 10 ;
        blank = ' ' ;

    Tokens
        // reserved words
        end = e n d ;
        div = d i v ; // integer division
        var = v a r ;
        begin = b e g i n ;
        program = p r o g r a m ;
        writeln = w r i t e l n ;
        // I prefer to let the parser do the job
        // of tracking the standard type rather
        // than processing it in the semantic phase
        integer = i n t e g e r ;

        // arithmetic symbols
        plus = '+' ;
        minus = '-' ;
        mult = '*' ;
        assignop = ':=' ;

        // symbols separators
        comma = ',' ;
        colon = ':' ;
        semicolon = ';' ;
        dot = '.' ;
        l_paren = '(' ;
        r_paren = ')' ;

        // identifiers
        identifier = letter (letter | digit)* ;

        // numbers
        number = digit+ ; // integer numbers only

        // comments
        comment = l_curly_bracket [ascii_char - [l_curly_bracket + r_curly_bracket]]*
                  r_curly_bracket ;

        // blanks
        blanks = blank | cr lf | cr | lf | tab ;

    Ignored Tokens
        comment,
        blanks ;

    Productions

        program =
            program_heading
            declarations
            body
            dot {-> New program(program_heading.identifier, [declarations.identifier], [body.statement])};

        program_heading {-> identifier} =
            // program must be prefixed with T. because there is a token and a production with
            // the same name
            T.program identifier semicolon {-> identifier};

        // declarations

        declarations {-> identifier*} =
            variables_declaration? {-> [variables_declaration.identifier]};

        variables_declaration {-> identifier*} =
            var variables_definition_list {-> [variables_definition_list.identifier]};

        variables_definition_list {-> identifier*} =
            {single} variables_definition {-> [variables_definition.identifier]} |
            {multiple} variables_definition_list variables_definition {-> [variables_definition_list.identifier, variables_definit

        variables_definition {-> identifier*} =
            identifier_list colon type semicolon {-> [identifier_list.identifier]};

        identifier_list {-> identifier*} =
            {single} identifier {-> [identifier]} |
            {multiple} identifier_list comma identifier {-> [identifier_list.identifier, identifier]};

        type =
            integer ; // only data type allowed is the integer data type

        // body definition
        body {-> statement*} =
            begin
            statement_sequence
            end {-> [statement_sequence.statement]};

        // statements
        statement_sequence {-> statement*} =
            {single} statement {-> [statement]} |
            {multiple} statement_sequence semicolon statement {-> [statement_sequence.statement, statement]};

        statement =
            {writeln} writeln l_paren expression r_paren {-> New statement.writeln(expression)} |
            {assignment} identifier assignop expression {-> New statement.assignment(identifier, expression)} |
            {empty} ;


        // expressions

        expression =
            {term} term {-> term.expression} |
            {plus} expression plus term {-> New expression.plus(expression, term.expression)} |
            {minus} expression minus term {-> New expression.minus(expression, term.expression)};

        term {-> expression} =
            {factor} factor {-> factor.expression} |
            {mult} term mult factor {-> New expression.mult(term.expression, factor.expression)} |
            {div} term div factor {-> New expression.div(term.expression, factor.expression)};

        factor {-> expression} =
            {identifier} identifier {-> New expression.identifier(identifier)} |
            {number} number {-> New expression.number(number)} |
            {expression} l_paren expression r_paren {-> New expression.expression(expression)} ;

    Abstract Syntax Tree

        program =
            identifier
            identifier*
            statement* ;

        statement =
            {writeln} expression |
            {assignment} identifier expression |
            {empty} ;

        expression =
            {plus} [left]:expression [right]:expression |
            {minus} [left]:expression [right]:expression |
            {mult} [left]:expression [right]:expression |
            {div} [left]:expression [right]:expression |
            {identifier} identifier |
            {number} number |
            {expression} expression ;

    // end of grammar.

Appendix B

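    Package expression;

    Helpers
        digit = ['0' .. '9'];
        tab = 9;
        cr = 13;
        lf = 10;
        eol = cr lf | cr | lf; // this takes care of the different platforms
        blank = (' ' | tab | eol)+;

    Tokens
        l_par = '(';
        r_par = ')';
        plus = '+';
        minus = '-';
        mult = '*';
        div = '/';
        comma = ',';
        blank = blank;
        number = digit+;

    Ignored Tokens
        blank;

    Productions
        exp =
            {plus} exp plus term |
            {minus} exp minus term;

        term =
            {mult} term mult factor |
            {div} term div factor |
            {factor} factor;

        factor =
            {number} number |
            {exp} l_par exp r_par;
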
Appendix C

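    Package expression;

    Helpers
        digit = ['0' .. '9'];
        tab = 9;
        cr = 13;
        lf = 10;
        eol = cr lf | cr | lf; // this takes care of the different platforms
        blank = (' ' | tab | eol)+;

    Tokens
        l_par = '(';
        r_par = ')';
        plus = '+';
        minus = '-';
        mult = '*';
        div = '/';
        comma = ',';
        blank = blank;
        number = digit+;

    Ignored Tokens
        blank;

    Productions
        exp =
            {plus} exp plus term {-> New exp.plus(exp, term.exp)} |
            {minus} exp minus term {-> New exp.minus(exp, term.exp)};

        term {-> exp} =
            {mult} term mult factor {-> New exp.mult(term.exp, factor.exp)} |
            {div} term div factor {-> New exp.div(term.exp, factor.exp)} |
            {factor} factor {-> factor.exp} ;

        factor {-> exp} =
            {number} number {-> New exp.number(number)} |
            {exp} l_par exp r_par {-> exp} ;

    Abstract Syntax Tree
        exp =
            {plus} [exp1]:exp [exp2]:exp |
            {minus} [exp1]:exp [exp2]:exp |
            {mult} [exp1]:exp [exp2]:exp |
            {div} [exp1]:exp [exp2]:exp |
            {number} number |
            {exp} exp ;
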

References

[1] Etienne Gagnon, SableCC, An Object-Oriented Compiler Framework, Master's thesis, McGill University, Montreal, Quebec, March 1998.

[2] Fidel Viegas, SableCC Tutorial, 2003. World-Wide Web Page URL: http://www.brainycreatures.co.uk/compiler/sablecc.asp/.
