LectureAll - University of Wisconsin–Madisonpages.cs.wisc.edu/~fischer/cs536.s06/lectures/LectureAll.pdfall). This supports transportable code, that can run on a variety of computers

1CS 536 Spring 2006©

CS 536

Introduction to Programming Languages

and Compilers

Charles N. Fischer

Spring 2006

http://www.cs.wisc.edu/~fischer/cs536.html


Class MeetsMondays, Wednesdays & Fridays,11:00 — 11:501325 Computer Sciences

InstructorCharles N. Fischer6367 Computer SciencesTelephone: 262-6635E-mail: [email protected] Hours:

10:30 - Noon, Tuesdays &Thursdays, or by appointment


Teaching AssistantKevin Roundy5385 Computer SciencesTelephone: 262-1079E-mail: [email protected] Hours:

9:30 - 11:00, Mondays,1:00-2:30, Fridays,or by appointment


Key Dates• February 1: Assignment #1

(Symbol Table Routines)• February 22: Assignment #2

(CSX Scanner)• March 8: Midterm Exam (tentative)• March 22: Assignment #3

(CSX Parser)• April 14: Assignment #4

(CSX Type Checker)• May 5: Assignment #5

(CSX Code Generator)• May 11: Final Exam

2:45 pm-4:45 pm


Class Text• Draft Chapters from:

Crafting a Compiler featuring Java.(Available from DoIt Tech Store)

• Handouts and Web-based reading willalso be used.

Reading Assignment• Chapters 1-2 of CaC (as background)

Class Notes• The transparencies used for each lecture

will be made available prior to, and after,that lecture on the class Web page(under the “Lecture Nodes” link).


Instructional ComputersDepartmental Unix (Linux) Machines(royal01-royal30, emperor01-emperor40) have been assigned to CS536. All necessary compilers andtools will be available on thesemachines.

You may also use your own PC orworkstation. It will be yourresponsibility to load needed software(instructions on where to find neededsoftware are included on the classweb page).

The Systems Lab teaches brieftutorials on Unix if you are unfamiliarwith that OS.


Academic Misconduct Policy

• You must do your assignments—nocopying or sharing of solutions.

• You may discuss general concepts andIdeas.

• All cases of Misconduct must be reportedto the Dean’s office.

• Penalties may be severe.


Program & Homework LatePolicy• An assignment may be handed in up to

one week late.

• Each late day will be debited 3%, up to amaximum of 21%.

Approximate Grade WeightsProgram 1 - Symbol Tables 5%Program 2 - Scanner 12%Program 3 - Parser 12%Program 4 - Type Checker 12%Program 5 - Code Generator 12%Homework 1 9%Midterm Exam 19%Final Exam (non-cumulative) 19%


Partnership Policy• Program #1 and the written homework

must be done individually.

• For undergraduates, programs 2 to 5may be done individually or by twoperson teams (your choice). Graduatestudents must do all assignmentsindividually.

10CS 536 Spring 2006©

CompilersCompilers are fundamental to moderncomputing.They act as translators, transforminghuman-oriented programminglanguages into computer-orientedmachine languages.To most users, a compiler can beviewed as a “black box” that performsthe transformation shown below.

ProgrammingLanguage MachineLanguage

Compiler

11CS 536 Spring 2006©

A compiler allows programmers toignore the machine-dependent detailsof programming.Compilers allow programs andprogramming skills to be machine-independent.Compilers also aid in detectingprogramming errors (which are all toocommon).Compiler techniques can also help toimprove computer security. Forexample, the Java Bytecode Verifierhelps to guarantee that Java securityrules are satisfied.Compilers currently help in protectionof intellectual property (usingobfuscation) and provenance(through watermarking).

12CS 536 Spring 2006©

History of CompilersThe term compiler was coined in theearly 1950s by Grace Murray Hopper.Translation was viewed as the“compilation” of a sequence ofmachine-language subprogramsselected from a library.One of the first real compilers wasthe FORTRAN compiler of the late1950s. It allowed a programmer touse a problem-oriented sourcelanguage.Ambitious “optimizations” were usedto produce efficient machine code,which was vital for early computerswith quite limited capabilities.Efficient use of machine resources isstill an essential requirement formodern compilers.

13CS 536 Spring 2006©

Compilers EnableProgramming Languages

Programming languages are used formuch more than “ordinary”computation.• TeX and LaTeX use compilers to

translate text and formattingcommands into intricate typesettingcommands.

• Postscript, generated by text-formatters like LaTeX, Word, andFrameMaker, is really a programminglanguage. It is translated andexecuted by laser printers anddocument previewers to produce areadable form of a document. Astandardized documentrepresentation language allows

14CS 536 Spring 2006©

documents to be freely interchanged,independent of how they werecreated and how they will be viewed.

• Mathmatica is an interactive systemthat intermixes programming withmathematics; it is possible to solveintricate problems in both symbolicand numeric form. This system reliesheavily on compiler techniques tohandle the specification, internalrepresentation, and solution ofproblems.

• Verilog and VHDL support thecreation of VLSI circuits. A siliconcompiler specifies the layout andcomposition of a VLSI circuit mask,using standard cell designs. Just as anordinary compiler understands andenforces programming language

15CS 536 Spring 2006©

rules, a silicon compiler understandsand enforces the design rules thatdictate the feasibility of a givencircuit.

• Interactive tools often need aprogramming language to supportautomatic analysis and modificationof an artifact.How do you automatically filter orchange a MS Word document? Youneed a text-based specification thatcan be processed, like a program, tocheck validity or produce an updatedversion.

16CS 536 Spring 2006©

When do We Run a Compiler?• Prior to execution

This is standard. We compile a programonce, then use it repeatedly.

• At the start of each executionWe can incorporate values set at thestart of the run to improve performance.A program may be partially complied,then completed with values set atexecution-time.

• During executionUnused code need not be compiled.Active or “hot” portions of a programmay be specially optimized.

• After executionWe can profile a program, looking forheavily used routines, which can bespecially optimized for later runs.

17CS 536 Spring 2006©

What do Compilers Produce?

Pure Machine CodeCompilers may generate code for aparticular machine, not assuming anyoperating system or library routines.This is “pure code” because it includesnothing beyond the instruction set.This form is rare; it is sometimes usedwith system implementationlanguages, that define operatingsystems or embedded applications(like a programmable controller). Purecode can execute on bare hardwarewithout dependence on any othersoftware.

18CS 536 Spring 2006©

Augmented Machine CodeCommonly, compilers generate codefor a machine architecture augmentedwith operating system routines andrun-time language support routines.To use such a program, a particularoperating system must be used and acollection of run-time supportroutines (I/O, storage allocation,mathematical functions, etc.) must beavailable. The combination ofmachine instruction and OS and run-time routines define a virtualmachine—a computer that exists onlyas a hardware/software combination.

19CS 536 Spring 2006©

Virtual Machine CodeGenerated code can consist entirely ofvirtual instructions (no native code atall). This supports transportable code,that can run on a variety ofcomputers.Java, with its JVM (Java VirtualMachine) is a great example of thisapproach.If the virtual machine is kept simpleand clean, the interpreter can bequite easy to write. Machineinterpretation slows execution speedby a factor of 3:1 to perhaps 10:1over compiled code.A “Just in Time” (JIT) compiler cantranslate “hot” portions of virtualcode into native code to speedexecution.

20CS 536 Spring 2006©

Advantages of VirtualInstructions

Virtual instructions serve a variety ofpurposes.• They simplify a compiler by providing

suitable primitives (such as procedurecalls, string manipulation, and so on).

• They contribute to compilertransportability.

• They may decrease in the size ofgenerated code since instructions aredesigned to match a particularprogramming language (for example,JVM code for Java).

Almost all compilers, to a greater orlesser extent, generate code for avirtual machine, some of whoseoperations must be interpreted.

21CS 536 Spring 2006©

Formats of TranslatedPrograms

Compilers differ in the format of thetarget code they generate. Targetformats may be categorized asassembly language, relocatable binary,or memory-image.

• Assembly Language (Symbolic) FormatA text file containing assemblersource code is produced. A number ofcode generation decisions (jumptargets, long vs. short address forms,and so on) can be left for theassembler. This approach is good forinstructional projects. Generatingassembler code supports cross-compilation (running a compiler onone computer, while its target is asecond computer). Generating

22CS 536 Spring 2006©

assembly language also simplifiesdebugging and understanding acompiler (since you can see thegenerated code).C rather than a specific assemblylanguage can generated, using C as a“universal assembly language.” C isfar more machine-independent thanany particular assembly language.However, some aspects of a program(such as the run-time representationsof program and data) are inaccessibleusing C code, but readily accessible inassembly language.

• Relocatable Binary FormatTarget code may be generated in abinary format with external referencesand local instruction and dataaddresses are not yet bound. Instead,

23CS 536 Spring 2006©

addresses are assigned relative to thebeginning of the module or relativeto symbolically named locations. Alinkage step adds support librariesand other separately compiledroutines and produces an absolutebinary program format that isexecutable.

• Memory-Image (Absolute Binary) FormCompiled code may be loaded intomemory and immediately executed.This is faster than going through theintermediate step of link/editing. Theability to access library andprecompiled routines may be limited.The program must be recompiled foreach execution. Memory-imagecompilers are useful for student anddebugging use, where frequent

24CS 536 Spring 2006©

changes are the rule and compilationcosts far exceed execution costs.Java is designed to use and shareclasses defined and implemented at avariety of organizations. Rather thanuse a fixed copy of a class (which maybe outdated), the JVM supportsdynamic linking of externally definedclasses. When first referenced, a classdefinition may be remotely fetched,checked, and loaded during programexecution. In this way “foreign code”can be guaranteed to be up-to-dateand secure.

25CS 536 Spring 2006©

The Structure of a CompilerA compiler performs two major tasks:• Analysis of the source program being

compiled

• Synthesis of a target programAlmost all modern compilers aresyntax-directed: The compilationprocess is driven by the syntacticstructure of the source program.A parser builds semantic structure outof tokens, the elementary symbols ofprogramming language syntax.Recognition of syntactic structure is amajor part of the analysis task.Semantic analysis examines themeaning (semantics) of the program.Semantic analysis plays a dual role.

26CS 536 Spring 2006©

It finishes the analysis task byperforming a variety of correctnesschecks (for example, enforcing typeand scope rules). Semantic analysisalso begins the synthesis phase.The synthesis phase may translatesource programs into someintermediate representation (IR) or itmay directly generate target code.If an IR is generated, it then serves asinput to a code generator componentthat produces the desired machine-language program. The IR mayoptionally be transformed by anoptimizer so that a more efficientprogram may be generated.

27CS 536 Spring 2006©

Type Checker

Optimizer

Code

Scanner

Symbol Tables

Parser

SourceProgram

(CharacterStream)

Tokens SyntaxTree

(AST)

DecoratedAST

IntermediateRepresentation

(IR)

IR

Generator

Target MachineCode

Translator

Abstract

The Structure of a Syntax-Directed Compiler

28CS 536 Spring 2006©

ScannerThe scanner reads the sourceprogram, character by character. Itgroups individual characters intotokens (identifiers, integers, reservedwords, delimiters, and so on). Whennecessary, the actual character stringcomprising the token is also passedalong for use by the semantic phases.The scanner does the following:• It puts the program into a compact

and uniform format (a stream oftokens).

• It eliminates unneeded information(such as comments).

• It sometimes enters preliminaryinformation into symbol tables (for

29CS 536 Spring 2006©

example, to register the presence of aparticular label or identifier).

• It optionally formats and lists thesource program

Building tokens is driven by tokendescriptions defined using regularexpression notation.Regular expressions are a formalnotation able to describe the tokensused in modern programminglanguages. Moreover, they can drivethe automatic generation of workingscanners given only a specification ofthe tokens. Scanner generators (likeLex, Flex and Jlex) are valuablecompiler-building tools.

30CS 536 Spring 2006©

ParserGiven a syntax specification (as acontext-free grammar, CFG), theparser reads tokens and groups theminto language structures.Parsers are typically created from aCFG using a parser generator (likeYacc, Bison or Java CUP).The parser verifies correct syntax andmay issue a syntax error message.As syntactic structure is recognized,the parser usually builds an abstractsyntax tree (AST), a conciserepresentation of program structure,which guides semantic processing.

31CS 536 Spring 2006©

Type Checker(Semantic Analysis)

The type checker checks the staticsemantics of each AST node. Itverifies that the construct is legal andmeaningful (that all identifiersinvolved are declared, that types arecorrect, and so on).If the construct is semanticallycorrect, the type checker “decorates”the AST node, adding type or symboltable information to it. If a semanticerror is discovered, a suitable errormessage is issued.Type checking is purely dependent onthe semantic rules of the sourcelanguage. It is independent of thecompiler’s target machine.

32CS 536 Spring 2006©

Translator(Program Synthesis)

If an AST node is semantically correct,it can be translated. Translationinvolves capturing the run-time“meaning” of a construct.For example, an AST for a while loopcontains two subtrees, one for theloop’s control expression, and theother for the loop’s body. Nothing inthe AST shows that a while looploops! This “meaning” is capturedwhen a while loop’s AST is translated.In the IR, the notion of testing thevalue of the loop control expression,and conditionally executing the loopbody becomes explicit.The translator is dictated by thesemantics of the source language.

33CS 536 Spring 2006©

Little of the nature of the targetmachine need be made evident.Detailed information on the nature ofthe target machine (operationsavailable, addressing, registercharacteristics, etc.) is reserved forthe code generation phase.In simple non-optimizing compilers(like our class project), the translatorgenerates target code directly,without using an IR.More elaborate compilers may firstgenerate a high-level IR (that issource language oriented) and thensubsequently translate it into a low-level IR (that is target machineoriented). This approach allows acleaner separation of source andtarget dependencies.

34CS 536 Spring 2006©

OptimizerThe IR code generated by thetranslator is analyzed andtransformed into functionallyequivalent but improved IR code bythe optimizer.The term optimization is misleading:we don’t always produce the bestpossible translation of a program,even after optimization by the best ofcompilers.Why?Some optimizations are impossible todo in all circumstances because theyinvolve an undecidable problem.Eliminating unreachable (“dead”)code is, in general, impossible.

35CS 536 Spring 2006©

Other optimizations are too expensiveto do in all cases. These involve NP-complete problems, believed to beinherently exponential. Assigningregisters to variables is an example ofan NP-complete problem.Optimization can be complex; it mayinvolve numerous subphases, whichmay need to be applied more thanonce.Optimizations may be turned off tospeed translation. Nonetheless, a welldesigned optimizer can significantlyspeed program execution bysimplifying, moving or eliminatingunneeded computations.

36CS 536 Spring 2006©

Code GeneratorIR code produced by the translator ismapped into target machine code bythe code generator. This phase usesdetailed information about the targetmachine and includes machine-specific optimizations like registerallocation and code scheduling.Code generators can be quite complexsince good target code requiresconsideration of many special cases.Automatic generation of codegenerators is possible. The basicapproach is to match a low-level IRto target instruction templates,choosing instructions which bestmatch each IR instruction.A well-known compiler usingautomatic code generation

37CS 536 Spring 2006©

techniques is the GNU C compiler.GCC is a heavily optimizing compilerwith machine description files forover ten popular computerarchitectures, and at least twolanguage front ends (C and C++).

38CS 536 Spring 2006©

Symbol TablesA symbol table allows information tobe associated with identifiers andshared among compiler phases. Eachtime an identifier is used, a symboltable provides access to theinformation collected about theidentifier when its declaration wasprocessed.

39CS 536 Spring 2006©

ExampleOur source language will be CSX, ablend of C, C++ and Java.Our target language will be the JavaJVM, using the Jasmin assembler.

• Our source line is a = bb+abs(c-7);this is a sequence of ASCII characters ina text file.

• The scanner groups characters intotokens, the basic units of a program.

a = bb+abs ( c- 7) ; After scanning, we have the followingtoken sequence: Ida Asg Idbb Plus Idabs Lparen Idc Minus IntLiteral7 Rparen Semi

40CS 536 Spring 2006©

• The parser groups these tokens intolanguage constructs (expressions,statements, declarations, etc.)represented in tree form:

(What happened to the parenthesesand the semicolon?)

Asg

Ida Plus

Idbb Call

Idabs Minus

Idc IntLiteral7

41CS 536 Spring 2006©

• The type checker resolves types andbinds declarations within scopes:

Asg

Ida Plus

Idbb Call

Idabs Minus

Idc IntLiteral7

int

intintloc

intloc int

int

intlocint

method

42CS 536 Spring 2006©

• Finally, JVM code is generated for eachnode in the tree (leaves first, then roots):

iload 3 ; push local 3 (bb)iload 2 ; push local 2 (c)ldc 7 ; Push literal 7isub ; compute c-7invokestatic java/lang/Math/abs(I)Iiadd ; compute bb+abs(c-7)istore 1 ; store result into local 1(a)

43CS 536 Spring 2006©

InterpretersThere are two different kinds ofinterpreters that support execution ofprograms, machine interpreters andlanguage interpreters.

Machine InterpretersA machine interpreter simulates theexecution of a program compiled fora particular machine architecture.Java uses a bytecode interpreter tosimulate the effects of programscompiled for the JVM. Programs likeSPIM simulate the execution of aMIPS program on a non-MIPScomputer.

44CS 536 Spring 2006©

Language InterpretersA language interpreter simulates theeffect of executing a programwithout compiling it to any particularinstruction set (real or virtual).Instead some IR form (perhaps anAST) is used to drive execution.Interpreters provide a number ofcapabilities not found in compilers:

• Programs may be modified as executionproceeds. This provides a straightforwardinteractive debugging capability.Depending on program structure,program modifications may requirereparsing or repeated semantic analysis.In Python, for example, any stringvariable may be interpreted as a Pythonexpression or statement and executed.

45CS 536 Spring 2006©

• Interpreters readily support languages inwhich the type of a variable denotes maychange dynamically (e.g., Python orScheme). The user program iscontinuously reexamined as executionproceeds, so symbols need not have afixed type. Fluid bindings are much moretroublesome for compilers, since dynamicchanges in the type of a symbol makedirect translation into machine codedifficult or impossible.

• Interpreters provide better diagnostics.Source text analysis is intermixed withprogram execution, so especially gooddiagnostics are available, along withinteractive debugging.

• Interpreters support machineindependence. All operations areperformed within the interpreter. To

46CS 536 Spring 2006©

move to a new machine, we justrecompile the interpreter.

However, interpretation can involvelarge overheads:

• As execution proceeds, program text iscontinuously reexamined, with bindings,types, and operations sometimesrecomputed at each use. For verydynamic languages this can represent a100:1 (or worse) factor in executionspeed over compiled code. For morestatic languages (such as C or Java), thespeed degradation is closer to 10:1.

• Startup time for small programs isslowed, since the interpreter must beload and the program partiallyrecompiled before execution begins.

47CS 536 Spring 2006©

• Substantial space overhead may beinvolved. The interpreter and all supportroutines must usually be kept available.Source text is often not as compact as ifit were compiled. This size penalty maylead to restrictions in the size ofprograms. Programs beyond these built-in limits cannot be handled by theinterpreter.

Of course, many languages (including, C,C++ and Java) have both interpreters(for debugging and programdevelopment) and compilers (forproduction work).

48CS 536 Spring 2006©

Symbol Tables & ScopingProgramming languages use scopes tolimit the range an identifier is active(and visible).Within a scope a name may bedefined only once (thoughoverloading may be allowed).A symbol table (or dictionary) iscommonly used to collect all thedefinitions that appear within ascope.At the start of a scope, the symboltable is empty. At the end of a scope,all declarations within that scope areavailable within the symbol table.A language definition may or may notallow forward references to anidentifier.

49CS 536 Spring 2006©

If forward references are allowed, youmay use a name that is defined laterin the scope (Java does this for fieldand method declarations within aclass).If forward references are not allowed,an identifier is visible only after itsdeclaration. C, C++ and Java do thisfor variable declarations.In CSX no forward references areallowed.In terms of symbol tables, forwardreferences require two passes over ascope. First all declarations aregathered. Next, all references areresolved using the complete set ofdeclarations stored in the symboltable.

50CS 536 Spring 2006©

If forward references are disallowed,one pass through a scope suffices,processing declarations and uses ofidentifiers together.

51CS 536 Spring 2006©

Block Structured Languages• Introduced by Algol 60, includes C, C++,

CSX and Java.

• Identifiers may have a non-global scope.Declarations may be local to a class,subprogram or block.

• Scopes may nest, with declarationspropagating to inner (contained) scopes.

• The lexically nearest declaration of anidentifier is bound to uses of thatidentifier.

52CS 536 Spring 2006©

Example (drawn from C):

int x,z;void A() {

float x,y; print(x,y,z);

}void B() { print (x,y,z)

}

floatfloat

int

int

intundeclared

53CS 536 Spring 2006©

Block Structure Concepts• Nested Visibility

No access to identifiers outside theirscope.

• Nearest Declaration AppliesUsing static nesting of scopes.

• Automatic Allocation and Deallocationof Locals

Lifetime of data objects is bound tothe scope of the Identifiers thatdenote them.

54CS 536 Spring 2006©

Block-Structured SymbolTables

Block structured symbol tables aredesigned to support the scoping rulesof block structured languages.For our CSX project we’ll use classSymb to represent symbols andSymbolTable to implementedoperations needed for a block-structured symbol table.

Class Symb will contain a methodpublic String name()

that returns the name associated witha symbol.

55CS 536 Spring 2006©

Class SymbolTable contains thefollowing methods:• public void openScope() {

A new and empty scope is opened.

• public void closeScope() throws

EmptySTException

The innermost scope is closed. Anexception is thrown if there is noscope to close.

• public void insert(Symb s)throws DuplicateException,

EmptySTException

A Symb is inserted in the innermostscope. An exception is thrown if aSymb with the same name is alreadyin the innermost scope or if there isno symbol table to insert into.

56CS 536 Spring 2006©

• public Symb localLookup(String s)

The innermost scope is searched for aSymb whose name is equal to s . Nullis returned if no Symb named s isfound.

• public Symb globalLookup(String s)

All scopes, from innermost tooutermost, are searched for a Symbwhose name is equal to s . The firstSymb that matches s is found;otherwise null is returned if nomatching Symb is found.

57CS 536 Spring 2006©

Is Case Significant?In some languages (C, C++, Java andmany others) case is significant inidentifiers. This means aa and AA aredifferent symbols that may haveentirely different definitions.In other languages (Pascal, Ada,Scheme, CSX) case is not significant.In such languages aa and AA are twoalternative spellings of the sameidentifier.Data structures commonly used toimplement symbol tables usually treatdifferent cases as different symbols.This is fine when case is significant ina language. When case isinsignificant, you probably will needto strip case before entering orlooking up identifiers.

58CS 536 Spring 2006©

This just means that identifiers areconverted to a uniform case beforethey are entered or looked up. Thus ifwe choose to use lower caseuniformly, the identifiers aaa , AAA,and AaA are all converted to aaa forpurposes of insertion or lookup.BUT, inside the symbol table theidentifier is stored in the form it wasdeclared so that programmers see theform of identifier they expect inlistings, error messages, etc.

59CS 536 Spring 2006©

How are Symbol TablesImplemented?

There are a number of data structuresthat can reasonably be used toimplement a symbol table:• An Ordered List

Symbols are stored in a linked list,sorted by the symbol’s name. This issimple, but may be a bit too slow ifmany identifiers appear in a scope.

• A Binary Search TreeLookup is much faster than in alinked list, but rebalancing may beneeded. (Entering identifiers in sortedorder can turn a search tree into alinked list.)

• Hash TablesThe most popular choice.

60CS 536 Spring 2006©

Implementing Block-Structured Symbol Tables

To implement a block structuredsymbol table we need to be able toefficiently open and close individualscopes, and limit insertion to theinnermost current scope.This can be done using one symboltable structure if we tag individualentries with a “scope number.”It is far easier (but more wasteful ofspace) to allocate one symbol tablefor each scope. Open scopes arestacked, pushing and popping tablesas scopes are opened and closed.Be careful though—manypreprogrammed stackimplementations don’t allow you to

61CS 536 Spring 2006©

“peek” at entries below the stack top.This is necessary to lookup anidentifier in all open scopes.If a suitable stack implementation(with a peek operation) isn’tavailable, a linked list of symboltables will suffice.

62CS 536 Spring 2006©

More on HashtablesHashtables are a very useful datastructure. Java provides a predefinedHashtable class. Python includes abuilt-in dictionary type.Every Java class has a hashCodemethod, which allows any object tobe entered into a Java Hashtable .For most objects, hash codes arepretty simple (the address of thecorresponding object is often used).But for strings Java uses a much more

elaborate hash function:

n is the length of the string, ci is thei-th character and all arithmetic isdone without overflow checking.

ci 37×i

i 0=

n 1–

∑

63CS 536 Spring 2006©

Why such an elaborate hashfunction?Simpler hash functions can havemajor problems.

Consider (add the characters).

For short identifiers the sum growsslowly, so large indices won’t often beused (leading to non-uniform use ofthe hash table).

We can try (product of

characters), but now (surprisingly)the size of the hash table becomes anissue. The problem is that if even onecharacter is encoded as an evennumber, the product must be even.

cii 0=

n 1–

∑

cii 0=

n 1–

∏

64CS 536 Spring 2006©

If the hash table is even in size (anatural thing to do), most hash tableentries will be at even positions.Similarly, if even one character isencoded as a multiple of 3, the wholeproduct will be a multiple of 3, sohash tables that are a multiple ofthree in size won’t be uniformly used.To see how bad things can get,consider a hash table with size 210(which is equal to 2×3×5×7). Thisshould be a particularly bad table sizeif a product hash is used. (Why?)Is it? As an experiment, all the wordsin the Unix spell checker’s dictionary(26000 words) where entered. Over50% (56.7% actually) hit position 0in the table!

65CS 536 Spring 2006©

Why such non-uniformity?If an identifier contains charactersthat are multiples of 2, 3, 5 and 7,then their hash will be a multiple of210 and will map to position 0.For example, in Wisconsin , n has anASCII code of 110 (2×55) and i has acode of 105 (7×5×3).If we change the table size ever soslightly, to 211, no table entry getsmore than 1% of the 26000 wordshashed, which is very good.Why such a big difference? Well 211is prime and there is a bit a folk-wisdom that states that primenumbers are good choices for hashtable sizes. Now our product hash willcover table entries far more uniformly

66CS 536 Spring 2006©

(small factors in the hash don’t dividethe table size evenly).Now the reason for Java’s morecomplex string hash functionbecomes evident—it can uniformly filla hash table whose size isn’t prime.

67CS 536 Spring 2006©

How are Collisions Handled?Since identifiers are often unlimitedin length, the set of possibleidentifiers is infinite. Even if we limitourselves is short identifiers (say 10of fewer characters), the range ofvalid identifiers is greater than 2610.This means that all hash tables needto contend with collisions, when twodifferent identifiers map to the sameplace in the table.How are collisions handled?The simplest approach is linearresolution. If identifier id hashes toposition p in a hash table of size sand position p is already filled, we try(p+1) mod s , then (p+2) mod s ,until a free position is found.

68CS 536 Spring 2006©

As long as the table is not too filled,this approach works well. When weapproach an almost-filled situation,long search chains form, and wedegenerate to an unordered list.If the table is 100% filled, linearresolution fails.Some hash table implementations,including Java’s, set a load factorbetween 0 and 1.0. When the fractionof filled entries in the table exceedsthe load factor, table size is increasedand all entries are rehashed.Note that bundling of a hashCodemethod within all Java objects makesrehashing easy to do automatically. Ifthe hash function is external to thesymbol table entries, rehashing mayneed to be done manually by the user.

69CS 536 Spring 2006©

An alternative to linear resolution ischained resolution, in which symboltable entries contain pointers tochains of symbols rather than a singlesymbol. All identifiers that hash tothe same position appear on the samechain. Now overflowing table size isnot catastrophic—as the table fills,chains from each table position getlonger. As long as the table is not toooverfilled, average chain length willbe small.

70CS 536 Spring 2006©

Reading AssignmentGet and read Chapter 3 ofCrafting a Compiler featuring Java.(Available from DoIt Tech Store)

71CS 536 Spring 2006©

ScanningA scanner transforms a characterstream into a token stream.A scanner is sometimes called a lexicalanalyzer or lexer.Scanners use a formal notation(regular expressions) to specify theprecise structure of tokens.But why bother? Aren’t tokens verysimple in structure?Token structure can be more detailedand subtle than one might expect.Consider simple quoted strings in C,C++ or Java. The body of a string canbe any sequence of characters excepta quote character (which must beescaped). But is this simple definitionreally correct?

72CS 536 Spring 2006©

Can a newline character appear in astring? In C it cannot, unless it isescaped with a backslash.C, C++ and Java allow escapednewlines in strings, Pascal forbidsthem entirely. Ada forbids allunprintable characters.Are null strings (zero-length)allowed? In C, C++, Java and Adathey are, but Pascal forbids them.(In Pascal a string is a packed array ofcharacters, and zero length arrays aredisallowed.)A precise definition of tokens canensure that lexical rules are clearlystated and properly enforced.

73CS 536 Spring 2006©

Regular ExpressionsRegular expressions specify simple(possibly infinite) sets of strings.Regular expressions routinely specifythe tokens used in programminglanguages.Regular expressions can drive ascanner generator.Regular expressions are widely used incomputer utilities:• The Unix utility grep uses regular

expressions to define search patternsin files.

• Unix shells allow regular expressionsin file lists for a command.

74CS 536 Spring 2006©

• Most editors provide a “contextsearch” command that specifiesdesired matches using regularexpressions.

• The Windows Find utility allows someregular expressions.

75CS 536 Spring 2006©

Regular SetsThe sets of strings defined by regularexpressions are called regular sets.When scanning, a token class will bea regular set, whose structure isdefined by a regular expression.Particular instances of a token classare sometimes called lexemes, thoughwe will simply call a string in a tokenclass an instance of that token. Thuswe call the string abc an identifier ifit matches the regular expression thatdefines valid identifier tokens.Regular expressions use a finitecharacter set, or vocabulary(denoted Σ).This vocabulary is normally thecharacter set used by a computer.

76CS 536 Spring 2006©

Today, the ASCII character set, whichcontains a total of 128 characters, isvery widely used.Java uses the Unicode character setwhich includes all the ASCIIcharacters as well as a wide variety ofother characters.An empty or null string is allowed(denoted λ, “lambda”). Lambdarepresents an empty buffer in whichno characters have yet been matched.It also represents optional parts oftokens. An integer literal may beginwith a plus or minus, or it may beginwith λ if it is unsigned.

77CS 536 Spring 2006©

CatenationStrings are built from characters inthe character set Σ via catenation.As characters are catenated to astring, it grows in length. The stringdo is built by first catenating d to λ,and then catenating o to the string d.The null string, when catenated withany string s, yields s. That is, s λ ≡λ s ≡ s. Catenating λ to a string islike adding 0 to an integer—nothingchanges.Catenation is extended to sets ofstrings:Let P and Q be sets of strings. (Thesymbol ∈ represents set membership.)If s1 ∈ P and s2 ∈ Q then strings1s2 ∈(P Q).

78CS 536 Spring 2006©

AlternationSmall finite sets are convenientlyrepresented by listing their elements.Parentheses delimit expressions, and |,the alternation operator, separatesalternatives.For example, D, the set of the tensingle digits, is defined asD = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9).The characters (, ), ' , ∗, +, and | aremeta-characters (punctuation andregular expression operators).Meta-characters must be quotedwhen used as ordinary characters toavoid ambiguity.

79CS 536 Spring 2006©

For example the expression( '(' | ')' | ; | , )defines four single character tokens(left parenthesis, right parenthesis,semicolon and comma). Theparentheses are quoted when theyrepresent individual tokens and arenot used as delimiters in a largerregular expression.Alternation is extended to sets ofstrings:Let P and Q be sets of strings.Then string s ∈ (P | Q) if and only ifs ∈ P or s ∈ Q.For example, if LC is the set oflower-case letters and UC is the setof upper-case letters, then (LC | UC)is the set of all letters (in either case).

80CS 536 Spring 2006©

Kleene ClosureA useful operation is Kleene closurerepresented by a postfix ∗ operator.

Let P be a set of strings. Then P *represents all strings formed by thecatenation of zero or more selections(possibly repeated) from P.Zero selections are represented by λ.

For example, LC* is the set of allwords composed only of lower-caseletters, of any length (including thezero length word, λ).

Precisely stated, a string s ∈ P * ifand only if s can be broken into zeroor more pieces: s = s1 s2 ... sn sothat each si ∈ P (n ≥ 0, 1 ≤ i ≤ n).

We allow n = 0, so λ is always in P* .

81CS 536 Spring 2006©

Definition of RegularExpressions

Using catenations, alternation andKleene closure, we can define regularexpressions as follows:• ∅ is a regular expression denoting

the empty set (the set containing nostrings). ∅ is rarely used, but isincluded for completeness.

• λ is a regular expression denoting theset that contains only the emptystring. This set is not the same as theempty set, because it contains oneelement.

• A string s is a regular expressiondenoting a set containing the singlestring s.

82CS 536 Spring 2006©

• If A and B are regular expressions,then A | B, A B , and A* are alsoregular expressions, denoting thealternation, catenation, and Kleeneclosure of the corresponding regularsets.

Each regular expression denotes a setof strings (a regular set). Any finiteset of strings can be represented by aregular expression of the form(s1 | s2 | … | sk ). Thus the reservedwords of ANSI C can be defined as(auto | break | case | …).

83CS 536 Spring 2006©

The following additional operationsuseful. They are not strictly necessary,because their effect can be obtainedusing alternation, catenation, Kleeneclosure:

• P + denotes all strings consisting ofone or more strings in P catenatedtogether:P* = (P+| λ) and P+ = P P*.For example, ( 0 | 1 )+ is the set of allstrings containing one or more bits.

• If A is a set of characters, Not(A)denotes (Σ − A); that is, all charactersin Σ not included in A. Since Not(A)can never be larger than Σ and Σ isfinite, Not(A) must also be finite,and is therefore regular. Not(A) doesnot contain λ since λ is not a

84CS 536 Spring 2006©

character (it is a zero-length string).For example, Not(Eol) is the set ofall characters excluding Eol (the endof line character, '\n' in Java or C).

It is possible to extend Not to strings,rather than just Σ. That is, if S is aset of strings, we define S to be(Σ* − S); the set of all strings exceptthose in S. Though S is usuallyinfinite, it is also regular if S is.

• If k is a constant, the set Ak

represents all strings formed bycatenating k (possibly different)strings from A.That is, Ak = (A A A …) (k copies).Thus ( 0 | 1 )32 is the set of all bitstrings exactly 32 bits long.

85CS 536 Spring 2006©

ExamplesLet D be the ten single digits and letL be the set of all 52 letters. Then• A Java or C++ single-line comment

that begins with // and ends with Eolcan be defined as:

Comment = // Not(Eol) * Eol

• A fixed decimal literal (e.g., 12.345 )can be defined as:

Lit = D +. D+

• An optionally signed integer literalcan be defined as:

IntLiteral = ( '+' | − | λ ) D+

(Why the quotes on the plus?)

86CS 536 Spring 2006©

• A comment delimited by ## markers,which allows single #’s within thecomment body:

Comment2 =## ((# | λ) Not(#) ) * ##

All finite sets and many infinite setsare regular. But not all infinite setsare regular. Consider the set ofbalanced brackets of the form

[ [ […] ] ] .This set is defined formally as{ [m ]m | m ≥ 1 }.This set is known not to be regular.Any regular expression that tries todefine it either does not get allbalanced nestings or it includes extra,unwanted strings.

87CS 536 Spring 2006©

Finite Automata and ScannersA finite automaton (FA) can be usedto recognize the tokens specified by aregular expression. FAs are simple,idealized computers that recognizestrings belonging to regular sets. AnFA consists of:• A finite set of states

• A set of transitions (or moves) fromone state to another, labeled withcharacters in Σ

• A special state called the start state

• A subset of the states called theaccepting, or final, states

88CS 536 Spring 2006©

These four components of a finiteautomaton are often representedgraphically:

Finite automata (the plural ofautomaton is automata) arerepresented graphically usingtransition diagrams. We start at thestart state. If the next input charactermatches the label on a transition

is a transition

is the start state

is an accepting state

is a state

89CS 536 Spring 2006©

from the current state, we go to thestate it points to. If no move ispossible, we stop. If we finish in anaccepting state, the sequence ofcharacters read forms a valid token;otherwise, we have not seen a validtoken.

In this diagram, the valid tokens arethe strings described by the regularexpression (a b (c)+ )+.

a b c

c

a

90CS 536 Spring 2006©

Deterministic Finite AutomataAs an abbreviation, a transition maybe labeled with more than onecharacter (for example, Not(c)). Thetransition may be taken if the currentinput character matches any of thecharacters labeling the transition.If an FA always has a uniquetransition (for a given state andcharacter), the FA is deterministic(that is, a deterministic FA, or DFA).Deterministic finite automata areeasy to program and often drive ascanner.If there are transitions to more thanone state for some character, then theFA is nondeterministic (that is, anNFA).

91CS 536 Spring 2006©

A DFA is conveniently represented ina computer by a transition table. Atransition table, T, is a twodimensional array indexed by a DFAstate and a vocabulary symbol.Table entries are either a DFA state oran error flag (often represented as ablank table entry). If we are in states, and read character c, then T[s,c]will be the next state we visit, orT[s,c] will contain an error markerindicating that c cannot extend thecurrent token. For example, theregular expression

// Not(Eol) * Eol

which defines a Java or C++ single-line comment, might be translatedinto

92CS 536 Spring 2006©

The corresponding transition table is:

A complete transition table containsone column for each character. Tosave space, table compression may beused. Only non-error entries areexplicitly represented in the table,using hashing, indirection or linkedstructures.

State Character/ Eol a b …

1 22 33 3 4 3 3 34

eof

Eol/ /

Not(Eol)

1 2 3 4

93CS 536 Spring 2006©

All regular expressions can betranslated into DFAs that accept(as valid tokens) the strings definedby the regular expressions. Thistranslation can be done manually by aprogrammer or automatically using ascanner generator.A DFA can be coded in:• Table-driven form

• Explicit control formIn the table-driven form, thetransition table that defines a DFA’sactions is explicitly represented in arun-time table that is “interpreted”by a driver program.In the direct control form, thetransition table that defines a DFA’sactions appears implicitly as thecontrol logic of the program.

94CS 536 Spring 2006©

For example, suppose CurrentChar isthe current input character. End offile is represented by a specialcharacter value, eof . Using the DFAfor the Java comments shown earlier,a table-driven scanner is:State = StartState

while (true){

if (CurrentChar == eof)break

NextState =T[State][CurrentChar]

if(NextState == error)break

State = NextState

read(CurrentChar)

}

if (State in AcceptingStates)

// Process valid token

else // Signal a lexical error

95CS 536 Spring 2006©

This form of scanner is produced by ascanner generator; it is definition-independent. The scanner is a driverthat can scan any token if T containsthe appropriate transition table.Here is an explicit-control scanner forthe same comment definition:if (CurrentChar == '/'){

read(CurrentChar)

if (CurrentChar == '/')

repeat

read(CurrentChar)

until (CurrentChar in

{eol, eof})

else //Signal lexical error

else // Signal lexical error

if (CurrentChar == eol)

// Process valid token

else //Signal lexical error

96CS 536 Spring 2006©

The token being scanned is“hardwired” into the logic of thecode. The scanner is usually easy toread and often is more efficient, butis specific to a single token definition.

97CS 536 Spring 2006©

More Examples• A FORTRAN-like real literal (which

requires digits on either or both sidesof a decimal point, or just a string ofdigits) can be defined as

RealLit = (D + (λ | . )) | (D* . D+)

This corresponds to the DFA

. D

DD

D .

98CS 536 Spring 2006©

• An identifier consisting of letters,digits, and underscores, which beginswith a letter and allows no adjacentor trailing underscores, may bedefined as

ID = L (L | D) * ( _ (L | D)+)*

This definition includes identifierslike sum or unit_cost , butexcludes _one and two_ andgrand___total . The DFA is:

L | D

L

L | D

_

99CS 536 Spring 2006©

Lex/Flex/JLexLex is a well-known Unix scannergenerator. It builds a scanner, in C,from a set of regular expressions thatdefine the tokens to be scanned.Flex is a newer and faster version ofLex.Jlex is a Java version of Lex. Itgenerates a scanner coded in Java,though its regular expressiondefinitions are very close to thoseused by Lex and Flex.Lex, Flex and JLex are largely non-procedural. You don’t need to tell thetools how to scan. All you need to tellit what you want scanned (by givingit definitions of valid tokens).

100CS 536 Spring 2006©

This approach greatly simplifiesbuilding a scanner, since most of thedetails of scanning (I/O, buffering,character matching, etc.) areautomatically handled.

101CS 536 Spring 2006©

JLexJLex is coded in Java. To use it, youenterjava JLex.Main f.jlex

Your CLASSPATH should be set tosearch the directories where JLex’sclasses are stored.(The CLASSPATH we gave youincludes JLex’s classes).After JLex runs (assuming there areno errors in your tokenspecifications), the Java source filef.jlex.java is created. (f stands forany file name you choose. Thuscsx.jlex might hold tokendefinitions for CSX, andcsx.jlex.java would hold thegenerated scanner).

102CS 536 Spring 2006©

You compile f.jlex.java just likeany Java program, using your favoriteJava compiler.After compilation, the class fileYylex.class is created.It contains the methods:• Token yylex() which is the actual

scanner. The constructor for Yylextakes the file you want scanned, sonew Yylex(System.in)will build a scanner that reads fromSystem.in . Token is the token classyou want returned by the scanner;you can tell JLex what class you wantreturned.

• String yytext() returns thecharacter text matched by the lastcall to yylex .

103CS 536 Spring 2006©

A simple example of using JLex is in~cs536-1/pubic/jlexJust entermake test

104CS 536 Spring 2006©

Input to JLexThere are three sections, delimited by%%. The general structure is:User Code

%%

Jlex Directives

%%

Regular Expression rules

The User Code section is Java sourcecode to be copied into the generatedJava source file. It contains utilityclasses or return type classes youneed. Thus if you want to return aclass IntlitToken (for integerliterals that are scanned), you includeits definition in the User Codesection.

105CS 536 Spring 2006©

JLex directives are variousinstructions you can give JLex tocustomize the scanner you generate.These are detailed in the JLex manual.The most important are:• %{

Code copied into the Yylexclass (extra fields ormethods you may want)%}

• %eof{Java code to be executed whenthe end of file is reached%eof}

• %type classnameclassname is the return type youwant for the scanner method,yylex()

106CS 536 Spring 2006©

Macro DefinitionsIn section two you may also definemacros, that are used in section three.A macro allows you to give a name toa regular expression or characterclass. This allows you to reusedefinitions and make regularexpression rule more readable.Macro definitions are of the formname = def

Macros are defined one per line.Here are some simple examples:Digit=[0-9]

AnyLet=[A-Za-z]

In section 3, you use a macro byplacing its name within { and } . Thus{Digit} expands to the characterclass defining the digits 0 to 9.

107CS 536 Spring 2006©

Regular Expression RulesThe third section of the JLex inputfile is a series of token definitionrules of the formRegExpr {Java code}

When a token matching the givenRegExpr is matched, thecorresponding Java code (enclosed in“{“ and “}”) is executed. JLex figuresout what RegExpr applies; you needonly say what the token looks like(using RegExpr ) and what you wantdone when the token is matched (thisis usually to return some tokenobject, perhaps with some processingof the token text).

108CS 536 Spring 2006©

Here are some examples:"+" {return new Token(sym.Plus);}(" ")+ {/* skip white space */}{Digit}+ {return new

IntToken(sym.Intlit,new Integer(yytext()).intValue());}

109CS 536 Spring 2006©

Regular Expressions in JLexTo define a token in JLex, the user toassociates a regular expression withcommands coded in Java.When input characters that match aregular expression are read, thecorresponding Java code is executed.As a user of JLex you don’t need totell it how to match tokens; you needonly say what you want done when aparticular token is matched.Tokens like white space are deletedsimply by having their associatedcommand not return anything.Scanning continues until a commandwith a return in it is executed.The simplest form of regularexpression is a single string thatmatches exactly itself.

110CS 536 Spring 2006©

For example,if {return new Token(sym.If);}

If you wish, you can quote the stringrepresenting the reserved word("if" ), but since the string containsno delimiters or operators, quoting itis unnecessary.For a regular expression operator, like+, quoting is necessary:"+" {return new Token(sym.Plus);}

111CS 536 Spring 2006©

Character ClassesOur specification of the reserved wordif, as shown earlier, is incomplete. Wedon’t (yet) handle upper or mixed-case.To extend our definition, we’ll use avery useful feature of Lex and JLex—character classes.Characters often naturally fall intoclasses, with all characters in a classtreated identically in a tokendefinition. In our definition ofidentifiers all letters form a classsince any of them can be used toform an identifier. Similarly, in anumber, any of the ten digitcharacters can be used.

112CS 536 Spring 2006©

Character classes are delimited by [and ] ; individual characters are listedwithout any quotation or separators.However \ , ^ , ] and - , because oftheir special meaning in characterclasses, must be escaped. Thecharacter class [xyz] can match asingle x , y, or z .The character class [\])] can matcha single ] or ) .(The ] is escaped so that it isn’tmisinterpreted as the end of characterclass.)Ranges of characters are separated bya - ; [x-z] is the same as [xyz] .[0-9] is the set of all digits and[a-zA-Z] is the set of all letters,upper- and lower-case. \ is theescape character, used to represent

113CS 536 Spring 2006©

unprintables and to escape specialsymbols.Following C and Java conventions, \nis the newline (that is, end of line),\t is the tab character, \\ is thebackslash symbol itself, and \010 isthe character corresponding to octal10.The ^ symbol complements acharacter class (it is JLex’srepresentation of the Not operation).[^xy] is the character class thatmatches any single character exceptx and y. The ^ symbol applies to allcharacters that follow it in acharacter class definition, so [ ^0-9]is the set of all characters that aren’tdigits. [ ^ ] can be used to match allcharacters.

114CS 536 Spring 2006©

Here are some examples of characterclasses:

CharacterClass Set of Characters Denoted[abc] Three characters: a, b and c[cba] Three characters: a, b and c[a-c] Three characters: a, b and c[aabbcc] Three characters: a, b and c[^abc] All characters except a, b

and c[\^\-\]] Three characters: ^ , - and ][^] All characters"[abc]" Not a character class. This

is one five character string:[abc]

115CS 536 Spring 2006©

Regular Operators in JLexJLex provides the standard regularoperators, plus some additions.• Catenation is specified by the

juxtaposition of two expressions; noexplicit operator is used.Outside of character class brackets,individual letters and numbers matchthemselves; other characters shouldbe quoted (to avoid misinterpretationas regular expression operators).

Case is significant.

Regular Expr Characters Matcheda b cd Four characters: abcd(a)(b)(cd) Four characters: abcd[ab][cd] Four different strings: ac or

ad or bc or bdwhile Five characters: while" while " Five characters: while[w][h][i][l][e] Five characters: while

116CS 536 Spring 2006©

• The alternation operator is | .Parentheses can be used to controlgrouping of subexpressions.If we wish to match the reservedword while allowing any mixtureof upper- and lowercase, we can use(w|W)(h|H)(i|I)(l|L)(e|E)or[wW][hH][iI][lL][eE]

Regular Expr Characters Matchedab|cd Two different strings: ab or cd(ab)|(cd) Two different strings: ab or cd[ab]|[cd] Four different strings: a or b or

c or d

117CS 536 Spring 2006©

• Postfix operators:* Kleene closure: 0 or more matches(ab)* matches λ or ab or abab orababab ...

+ Positive closure: 1 or more matches(ab)+ matches ab or abab orababab ...

? Optional inclusion:expr?

matches expr zero times or once.expr? is equivalent to (expr) | λand eliminates the need for anexplicit λ symbol.[-+]?[0-9]+ defines an optionallysigned integer literal.

118CS 536 Spring 2006©

• Single match:The character ". " matches any singlecharacter (other than a newline).

• Start of line:The character ^ (when used outside acharacter class) matches thebeginning of a line.

• End of line:The character $ matches the end of aline. Thus,

^A.*e$matches an entire line that beginswith A and ends with e.

119CS 536 Spring 2006©

Overlapping DefinitionsRegular expressions map overlap(match the same input sequence).In the case of overlap, two rulesdetermine which regular expression ismatched:• The longest possible match is

performed. JLex automatically bufferscharacters while deciding how manycharacters can be matched.

• If two expressions match exactly thesame string, the earlier expression (inthe JLex specification) is preferred.Reserved words, for example, areoften special cases of the patternused for identifiers. Their definitionsare therefore placed before the

120CS 536 Spring 2006©

expression that defines an identifiertoken.

Often a “catch all” pattern is placedat the very end of the regularexpression rules. It is used to catchcharacters that don’t match any ofthe earlier patterns and hence areprobably erroneous. Recall that ". "matches any single character (otherthan a newline). It is useful in acatch-all pattern. However, avoid apattern like .* which will consumeall characters up to the next newline.In JLex an unmatched character willcause a run-time error.

121CS 536 Spring 2006©

The operators and special symbolsmost commonly used in JLex aresummarized below. Note that asymbol sometimes has one meaningin a regular expression and an entirelydifferent meaning in a character class(i.e., within a pair of brackets). If youfind JLex behaving unexpectedly, it’sa good idea to check this table to besure of how the operators andsymbols you’ve used behave. Ordinaryletters and digits, and symbols notmentioned (like @) representthemselves. If you’re not sure if acharacter is special or not, you canalways escape it or make it part of aquoted string.

122CS 536 Spring 2006©

SymbolMeaning in RegularExpressions

Meaning inCharacterClasses

( Matches with ) to groupsub-expressions.

Representsitself.

) Matches with ( to groupsub-expressions.

Representsitself.

[ Begins a character class. Representsitself.

] Represents itself. Ends a charac-ter class.

{ Matches with } to signalmacro-expansion.

Representsitself.

} Matches with { to signalmacro-expansion.

Representsitself.

" Matches with " to delimitstrings(only \ is special withinstrings).

Representsitself.

\ Escapes individual charac-ters.Also used to specify acharacter by its octal code.

Escapes individ-ual characters.Also used tospecify a char-acter by its octalcode.

. Matches any one characterexcept \n.

Representsitself.

123CS 536 Spring 2006©

| Alternation (or) operator. Representsitself.

* Kleene closure operator(zero or more matches).

Representsitself.

+ Positive closure operator(one or more matches).

Representsitself.

? Optional choice operator(one or zero matches).

Representsitself.

/ Context sensitive matchingoperator.

Representsitself.

^ Matches only at beginningof a line.

Complementsremainingcharacters in theclass.

$ Matches only at end of aline.

Representsitself.

- Represents itself. Range of char-acters operator.

SymbolMeaning in RegularExpressions

Meaning inCharacterClasses

124CS 536 Spring 2006©

Potential Problems in UsingJLex

The following differences from“standard” Lex notation appear inJLex:• Escaped characters within quoted

strings are not recognized. Hence"\n" is not a new line character.Escaped characters outside of quotedstrings (\n ) and escaped characterswithin character classes ([\n] ) areOK.

• A blank should not be used within acharacter class (i.e., [ and ] ). Youmay use \040 (which is the charactercode for a blank).

125CS 536 Spring 2006©

• A doublequote must be escapedwithin a character class. Use [\"]instead of ["] .

• Unprintables are defined to be allcharacters before blank as well as thelast ASCII character. These can berepresented as: [\000-\037\177]

126CS 536 Spring 2006©

JLex ExamplesA JLex scanner that looks for fiveletter words that begin with “P” andend with “T”.This example is in

~cs536-1/public/jlex

127CS 536 Spring 2006©

The JLex specification file is:class Token {

String text;Token(String t){text = t;}

}%%Digit=[0-9]AnyLet=[A-Za-z]Others=[0-9’&.]WhiteSp=[\040\n]// Tell JLex to have yylex() return aToken%type Token// Tell JLex what to return when eof offile is hit%eofval{return new Token(null);%eofval}%%[Pp]{AnyLet}{AnyLet}{AnyLet}[Tt]{WhiteSp}+

{return new Token(yytext());}

({AnyLet}|{Others})+{WhiteSp}+{/*skip*/}

128CS 536 Spring 2006©

The Java program that uses thescanner is:import java.io.*;

class Main {

public static void main(String args[])throws java.io.IOException {

Yylex lex = new Yylex(System.in);Token token = lex.yylex();

while ( token.text != null ) {System.out.print("\t"+token.text);token = lex.yylex(); //get next token

}}}

129CS 536 Spring 2006©

In case you care, the words that arematched include:

Pabst

paint

petit

pilot

pivot

plant

pleat

point

posit

Pratt

print

131CS 536 Spring 2006©

The JLex specification file is:import java_cup.runtime.*;

/* Expand this into your solution forproject 2 */

class CSXToken {int linenum;int colnum;CSXToken(int line,int col){linenum=line;colnum=col;};

}

class CSXIntLitToken extends CSXToken {int intValue;CSXIntLitToken(int val,int line,

int col){super(line,col);intValue=val;};

}

class CSXIdentifierToken extendsCSXToken {String identifierText;CSXIdentifierToken(String text,int line,

int col){super(line,col);identifierText=text;};

}

132CS 536 Spring 2006©

class CSXCharLitToken extends CSXToken {char charValue;

CSXCharLitToken(char val,int line,int col){

super(line,col);charValue=val;};}

class CSXStringLitToken extends CSXToken{

String stringText;CSXStringLitToken(String text,

int line,int col){super(line,col);stringText=text; };

}

// This class is used to track line andcolumn numbers// Feel free to change to extend itclass Pos {static int linenum = 1;/* maintain this as line number current

token was scanned on */static int colnum = 1;

/* maintain this as column numbercurrent token began at */

static int line = 1;/* maintain this as line number after

scanning current token */

133CS 536 Spring 2006©

static int col = 1;/* maintain this as column number

after scanning current token */static void setpos() {

//set starting pos for current tokenlinenum = line;colnum = col;}

}

%%Digit=[0-9]

// Tell JLex to have yylex() return aSymbol, as JavaCUP will require

%type Symbol

// Tell JLex what to return when eof offile is hit%eofval{return new Symbol(sym.EOF,

new CSXToken(0,0));%eofval}

%%"+" {Pos.setpos(); Pos.col +=1;

return new Symbol(sym.PLUS,new CSXToken(Pos.linenum,

Pos.colnum));}

134CS 536 Spring 2006©

"!=" {Pos.setpos(); Pos.col +=2;return new Symbol(sym.NOTEQ,

new CSXToken(Pos.linenum,Pos.colnum));}

";" {Pos.setpos(); Pos.col +=1;return new Symbol(sym.SEMI,

new CSXToken(Pos.linenum,Pos.colnum));}

{Digit}+ {// This def doesn’t check// for overflow

Pos.setpos();Pos.col += yytext().length();return new Symbol(sym.INTLIT,

new CSXIntLitToken(new Integer(yytext()).intValue(),Pos.linenum,Pos.colnum));}

\n {Pos.line +=1; Pos.col = 1;}" " {Pos.col +=1;}

135CS 536 Spring 2006©

The Java program that uses thisscanner (P2) is:class P2 {

public static void main(String args[])throws java.io.IOException {

if (args.length != 1) {System.out.println("Error: Input file must be named on

command line." );System.exit(-1);

}java.io.FileInputStream yyin = null;try {

yyin =new java.io.FileInputStream(args[0]);

} catch (FileNotFoundExceptionnotFound){

System.out.println("Error: unable to open input file.”);

System.exit(-1);}

// lex is a JLex-generated scanner that// will read from yyin

Yylex lex = new Yylex(yyin);

136CS 536 Spring 2006©

System.out.println("Begin test of CSX scanner.");

/**********************************You should enter code here thatthoroughly test your scanner.

Be sure to test extreme cases,like very long symbols or lines,illegal tokens, unrepresentableintegers, illegals strings, etc.The following is only a starting point.

***********************************/Symbol token = lex.yylex();

while ( token.sym != sym.EOF ) {System.out.print(

((CSXToken) token.value).linenum+ ":"+ ((CSXToken) token.value).colnum+ " ");

switch (token.sym) {case sym.INTLIT:

System.out.println("\tinteger literal(" +((CSXIntLitToken)token.value).intValue + ")");

break;

137CS 536 Spring 2006©

case sym.PLUS:System.out.println("\t+");break;

case sym.NOTEQ:System.out.println("\t!=");break;

default:throw new RuntimeException();

}

token = lex.yylex(); // get next token}

System.out.println("End test of CSX scanner.");

}}}

138CS 536 Spring 2006©

Other Scanner IssuesWe will consider other practical issuesin building real scanners for realprogramming languages.Our finite automaton modelsometimes needs to be augmented.Moreover, error handling must beincorporated into any practicalscanner.

139CS 536 Spring 2006©

Identifiers vs. Reserved WordsMost programming languages containreserved words like if , while ,switch , etc. These tokens look likeordinary identifiers, but aren’t.It is up to the scanner to decide ifwhat looks like an identifier is really areserved word. This distinction is vitalas reserved words have differenttoken codes than identifiers and areparsed differently.How can a scanner decide whichtokens are identifiers and which arereserved words?• We can scan identifiers and reserved

words using the same pattern, andthen look up the token in a special“reserved word” table.

140CS 536 Spring 2006©

• It is known that any regularexpression may be complemented toobtain all strings not in the originalregular expression. Thus A, thecomplement of A, is regular if A is.Using complementation we can writea regular expression for nonreserved

identifiers:Since scanner generators don’tusually support complementation ofregular expressions, this approach ismore of theoretical than practicalinterest.

• We can give distinct regularexpression definitions for eachreserved word, and for identifiers.Since the definitions overlap (if willmatch a reserved word and thegeneral identifier pattern), we give

ident if while …( )

141CS 536 Spring 2006©

priority to reserved words. Thus atoken is scanned as an identifier if itmatches the identifier pattern anddoes not match any reserved wordpattern. This approach is commonlyused in scanner generators like Lexand JLex.

142CS 536 Spring 2006©

Converting Token ValuesFor some tokens, we may need toconvert from string form intonumeric or binary form.For example, for integers, we need totransform a string a digits into theinternal (binary) form of integers.We know the format of the token isvalid (the scanner checked this), but:• The string may represent an integer

too large to represent in 32 or 64 bitform.

• Languages like CSX and ML use anon-standard representation fornegative values (~123 instead of-123 )

143CS 536 Spring 2006©

We can safely convert from string tointeger form by first converting thestring to double form, checkingagainst max and min int, and thenconverting to int form if the value isrepresentable.Thus d = new Double(str) willcreate an object d containing thevalue of str in double form. If str istoo large or too small to berepresented as a double, plus or minusinfinity is automatically substituted.d.doubleValue() will give d’s valueas a Java double, which can becompared againstInteger.MAX_VALUE orInteger.MIN_VALUE .

144CS 536 Spring 2006©

If d.doubleValue() represents avalid integer,(int) d.doubleValue()will create the appropriate integervalue.If a string representation of aninteger begins with a “~” we can stripthe “~”, convert to a double and thennegate the resulting value.

145CS 536 Spring 2006©

Scanner TerminationA scanner reads input characters andpartitions them into tokens.What happens when the end of theinput file is reached? It may be usefulto create an Eof pseudo-characterwhen this occurs. In Java, forexample, InputStream.read() ,which reads a single byte, returns -1when end of file is reached. Aconstant, EOF, defined as -1 can betreated as an “extended” ASCIIcharacter. This character then allowsthe definition of an Eof token thatcan be passed back to the parser.An Eof token is useful because itallows the parser to verify that thelogical end of a program corresponds

146CS 536 Spring 2006©

to its physical end. Most parsersrequire an end of file token.Lex and Jlex automatically create anEof token when the scanner theybuild tries to scan an EOF character(or tries to scan when eof() is true).

147CS 536 Spring 2006©

Multi Character LookaheadWe may allow finite automata to lookbeyond the next input character.This feature is necessary to implementa scanner for FORTRAN.In FORTRAN, the statement

DO 10 J = 1,100specifies a loop, with index J rangingfrom 1 to 100 .The statement

DO 10 J = 1.100is an assignment to the variableDO10J. (Blanks are not significantexcept in strings.)A FORTRAN scanner decides whetherthe O is the last character of a DOtoken only after reading as far as thecomma (or period).

148CS 536 Spring 2006©

A milder form of extended lookaheadproblem occurs in Pascal and Ada.The token 10.50 is a real literal,whereas 10..50 is three differenttokens.We need two-character lookaheadafter the 10 prefix to decide whetherwe are to return 10 (an integerliteral) or 10.50 (a real literal).

149CS 536 Spring 2006©

Suppose we use the following FA.

Given 10..100 we scan threecharacters and stop in a non-accepting state.Whenever we stop reading in a non-accepting state, we back up alongaccepted characters until anaccepting state is found.Characters we back up over arerescanned to form later tokens. If noaccepting state is reached duringbackup, we have a lexical error.

.D

D D

D

..

150CS 536 Spring 2006©

Performance ConsiderationsBecause scanners do so muchcharacter-level processing, they canbe a real performance bottleneck inproduction compilers.Speed is not a concern in our project,but let’s see why scanning speed canbe a concern in production compilers.Let’s assume we want to compile at arate of 1000 lines/sec. (so that mostprograms compile in just a fewseconds).Assuming 30 characters/line (onaverage), we need to scan 30,000char/sec.

151CS 536 Spring 2006©

On a 30 SPECmark machine (30million instructions/sec.), we have1000 instructions per character tospend on all compiling steps.If we allow 25% of compiling to bescanning (a compiler has a lot moreto do than just scan!), that’s just 250instructions per character.A key to efficient scanning is togroup character-level operationswhenever possible. It is better to doone operation on n characters ratherthan n operations on singlecharacters.In our examples we’ve read input onecharacter as a time. A subroutine callcan cost hundreds or thousands ofinstructions to execute—far too muchto spend on a single character.

152CS 536 Spring 2006©

We prefer routines that do blockreads, putting an entire block ofcharacters directly into a buffer.Specialized scanner generators canproduce particularly fast scanners.The GLA scanner generator claimsthat the scanners it produces run asfast as:while(c != Eof) {

c = getchar();

}

153CS 536 Spring 2006©

Lexical Error RecoveryA character sequence that can’t bescanned into any valid token is alexical error.Lexical errors are uncommon, butthey still must be handled by ascanner. We won’t stop compilationbecause of so minor an error.Approaches to lexical error handlinginclude:• Delete the characters read so far and

restart scanning at the next unreadcharacter.

• Delete the first character read by thescanner and resume scanning at thecharacter following it.

154CS 536 Spring 2006©

Both of these approaches arereasonable.The first is easy to do. We just resetthe scanner and begin scanning anew.The second is a bit harder but also is abit safer (less is immediately deleted).It can be implemented using scannerbackup.Usually, a lexical error is caused bythe appearance of some illegalcharacter, mostly at the beginning ofa token.(Why at the beginning?)In these case, the two approaches areequivalent.

155CS 536 Spring 2006©

The effects of lexical error recoverymight well create a later syntax error,handled by the parser.Consider

...for$tnight.. .The $ terminates scanning of for .Since no valid token begins with $, itis deleted. Then tnight is scanned asan identifier. In effect we get

...for tnight.. .which will cause a syntax error. Such“false errors” are unavoidable, thougha syntactic error-repair may help.

156CS 536 Spring 2006©

Error TokensCertain lexical errors require specialcare. In particular, runaway stringsand runaway comments ought toreceive special error messages.In Java strings may not cross lineboundaries, so a runaway string isdetected when an end of a line is readwithin the string body. Ordinaryrecovery rules are inappropriate forthis error. In particular, deleting thefirst character (the double quotecharacter) and restarting scanning isa bad decision.It will almost certainly lead to acascade of “false” errors as the stringtext is inappropriately scanned asordinary input.

157CS 536 Spring 2006©

One way to handle runaway strings isto define an error token.An error token is not a valid token; itis never returned to the parser.Rather, it is a pattern for an errorcondition that needs special handling.We can define an error token thatrepresents a string terminated by anend of line rather than a doublequote character.For a valid string, in which internaldouble quotes and back slashes areescaped (and no other escapedcharacters are allowed), we can use" ( Not( " | Eol | \ ) | \" | \\ )* "For a runaway string we use" ( Not( " | Eol | \ ) | \" | \\ )* Eol(Eol is the end of line character.)

158CS 536 Spring 2006©

When a runaway string token isrecognized, a special error messageshould be issued.Further, the string may be “repaired”into a correct string by returning anordinary string token with the closingEol replaced by a double quote.This repair may or may not be“correct.” If the closing double quoteis truly missing, the repair will begood; if it is present on a succeedingline, a cascade of inappropriatelexical and syntactic errors willfollow.Still, we have told the programmerexactly what is wrong, and that is ourprimary goal.

159CS 536 Spring 2006©

In languages like C, C++, Java andCSX, which allow multiline comments,improperly terminated (runaway)comments present a similar problem.A runaway comment is not detecteduntil the scanner finds a closecomment symbol (possibly belongingto some other comment) or until theend of file is reached. Clearly aspecial, detailed error message isrequired.Let’s look at Pascal-style commentsthat begin with a { and end with a }.Comments that begin and end with apair of characters, like /* and */ inJava, C and C++, are a bit trickier.

160CS 536 Spring 2006©

Correct Pascal comments are definedquite simply:

{ Not( } )* }To handle comments terminated byEof , this error token can be used:

{ Not( } )* EofWe want to handle commentsunexpectedly closed by a closecomment belonging to anothercomment:{... missing close comment... { normal comment }...

We will issue a warning (this form ofcomment is lexically legal).Any comment containing an opencomment symbol in its body is mostprobably a missing } error.

161CS 536 Spring 2006©

We split our legal comment definitioninto two token definitions.The definition that accepts an opencomment in its body causes a warningmessage ("Possible unclosedcomment") to be printed.We now use:{ Not( { | } )* } and{ (Not( { | } )* { Not( { | } )* )+ }The first definition matches correctcomments that do not contain anopen comment in their body.The second definition matchescorrect, but suspect, comments thatcontain at least one open comment intheir body.

162CS 536 Spring 2006©

Single line comments, found in Java,CSX and C++, are terminated by Eol.They can fall prey to a more subtleerror—what if the last line has no Eolat its end?The solution?Another error token for single linecomments:

// Not(Eol) *

This rule will only be used forcomments that don’t end with an Eol,since scanners always match thelongest rule possible.

163CS 536 Spring 2006©

Regular Expressions and FiniteAutomata

Regular expressions are fullyequivalent to finite automata.The main job of a scanner generatorlike JLex is to transform a regularexpression definition into anequivalent finite automaton.First it transforms a regularexpression into a nondeterministicfinite automaton (NFA).Unlike an ordinary deterministic finiteautomaton, an NFA need not make aunique (deterministic) choice of asuccessor state to visit. For example,as shown below, an NFA is allowed tohave a state that has two transitions(arrows) coming out of it, labeled by

164CS 536 Spring 2006©

the same symbol. An NFA may alsohave transitions labeled with λ.

Transitions are normally labeled withindividual characters in Σ, andalthough λ is a string (the string withno characters in it), it is definitely nota character. In the above example,when the automaton is in the state atthe left and the next input characteris a, it may choose to use the

a

a

a

λa

165CS 536 Spring 2006©

transition labeled a or first follow theλ transition (you can always find λwherever you look for it) and thenfollow an a transition. FAs thatcontain no λ transitions and thatalways have unique successor statesfor any symbol are deterministic.

166CS 536 Spring 2006©

Building Finite Automata FromRegular Expressions

We make an FA from a regularexpression in two steps:• Transform the regular expression into

an NFA.

• Transform the NFA into adeterministic FA.

The first step is easy.Regular expressions are all built outof the atomic regular expressions a(where a is a character in Σ) and λ byusing the three operationsA B and A | B and A*.

167CS 536 Spring 2006©

Other operations (like A+) are justabbreviations for combinations ofthese.NFAs for a and λ are trivial:

a

λ

168CS 536 Spring 2006©

Suppose we have NFAs for A and Band want one for A | B. We constructthe NFA shown below:

The states labeled A and B were theaccepting states of the automata forA and B; we create a new acceptingstate for the combined automaton.A path through the top automatonaccepts strings in A, and a paththrough the bottom automationaccepts strings in B, so the wholeautomaton matches A | B .

A

B

FiniteAutomaton

for A

FiniteAutomaton

for B

λ

λ

λ

λ

169CS 536 Spring 2006©

As shown below, the construction forA B is even easier. The acceptingstate of the combined automaton isthe same state that was the acceptingstate of B. We must follow a paththrough A’s automaton, then throughB’s automaton, so overall A B ismatched.We could also just merge theaccepting state of A with the initialstate of B. We chose not to onlybecause the picture would be moredifficult to draw.

AFinite

Automatonfor A

FiniteAutomaton

for B

λ

170CS 536 Spring 2006©

Finally, let’s look at the NFA for A*.The start state reaches an acceptingstate via λ, so λ is accepted.Alternatively, we can follow a paththrough the FA for A one or moretimes, so zero or more strings thatbelong to A are matched.

AFinite

Automatonfor A

λ

λ

λ

λ

171CS 536 Spring 2006©

Creating DeterministicAutomata

The transformation from an NFA N toan equivalent DFA D works by what issometimes called the subsetconstruction.Each state of D corresponds to a set ofstates of N.The idea is that D will be in state{x, y, z} after reading a given inputstring if and only if N could be in anyone of the states x, y, or z, dependingon the transitions it chooses. Thus Dkeeps track of all the possible routesN might take and runs themsimultaneously.Because N is a finite automaton, it hasonly a finite number of states. The

172CS 536 Spring 2006©

number of subsets of N’s states is alsofinite, which makes tracking varioussets of states feasible.An accepting state of D will be any setcontaining an accepting state of N,reflecting the convention that Naccepts if there is any way it couldget to its accepting state by choosingthe “right” transitions.The start state of D is the set of allstates that N could be in withoutreading any input characters—thatis, the set of states reachable fromthe start state of N following only λtransitions. Algorithm closecomputes those states that can bereached following only λ transitions.Once the start state of D is built, webegin to create successor states:

173CS 536 Spring 2006©

We take each state S of D, and eachcharacter c, and compute S’ssuccessor under c.S is identified with some set of N’sstates, {n1, n2,...}.

We find all the possible successorstates to {n1, n2,...} under c,obtaining a set {m1, m2,...}.

Finally, we computeT = CLOSE({ m1, m2,...}).T becomes a state in D, and atransition from S to T labeled with cis added to D.We continue adding states andtransitions to D until all possiblesuccessors to existing states areadded.

174CS 536 Spring 2006©

Because each state corresponds to afinite subset of N’s states, the processof adding new states to D musteventually terminate.Here is the algorithm for λ-closure,called close . It starts with a set ofNFA states, S, and adds to S all statesreachable from S using only λtransitions.void close(NFASet S) {

while (x in S and x →λ yand y notin S) {

S = S U {y}}}

Using close , we can define theconstruction of a DFA, D, from anNFA, N:

175CS 536 Spring 2006©

DFA MakeDeterministic(NFA N) {DFA D ; NFASet TD.StartState = { N.StartState }close(D.StartState)D.States = { D.StartState }while (states or transitions can be

added to D) {Choose any state S in D.States

and any character c in Alphabet

T = {y in N.States such that

x →c y for some x in S}close(T);

if (T notin D. States) {D.States = D.States U {T}

D.Transitions =

D.Transitions U

{the transition S →c T} } }

D.AcceptingStates ={ S in D.States such that an

accepting state of N in S}}

176CS 536 Spring 2006©

ExampleTo see how the subset constructionoperates, consider the following NFA:

We start with state 1, the start state ofN, and add state 2 its λ-successor.D’s start state is {1,2}.Under a, {1,2}’s successor is {3,4,5}.State 1 has itself as a successor underb. When state 1’s λ-successor, 2, isincluded, {1,2}’s successor is {1,2}.

aλ1 2

3 4

5

b

a

b

a

a | b

177CS 536 Spring 2006©

{3,4,5}’s successors under a and b are{5} and {4,5}.{4,5}’s successor under b is {5}.Accepting states of D are those statesets that contain N’s accepting statewhich is 5.The resulting DFA is:

b1,2

5

4,5

b

a

a | b

a3,4,5

5

178CS 536 Spring 2006©

It is not too difficult to establish thatthe DFA constructed byMakeDeterministic is equivalent to theoriginal NFA.The idea is that each path to anaccepting state in the original NFA hasa corresponding path in the DFA.Similarly, all paths through theconstructed DFA correspond to paths inthe original NFA.What is less obvious is the fact that theDFA that is built can sometimes bemuch larger than the original NFA.States of the DFA are identified withsets of NFA states.

If the NFA has n states, there are 2n

distinct sets of NFA states, and hencethe DFA may have as many as 2n

states. Certain NFAs actually exhibit

Documents

LectureAll - University of Wisconsin–Madisonpages.cs.wisc.edu/~fischer/cs536.s06/lectures/LectureAll.pdfall). This supports transportable code, that can run on a variety of computers