02 Lexical Analysis Part 1

Lexical Analysis - Part 1

Y.N. Srikant

Department of Computer Science and Automation, Indian Institute of Science

Bangalore 560 012

NPTEL Course on Principles of Compiler Design


Outline of the Lecture

What is lexical analysis?
Why should LA be separated from syntax analysis?
Tokens, patterns, and lexemes
Difficulties in lexical analysis
Recognition of tokens - finite automata and transition diagrams
Specification of tokens - regular expressions and regular definitions
LEX - A Lexical Analyzer Generator


Compiler Overview


What is Lexical Analysis?

The input is a high level language program, such as a C program, in the form of a sequence of characters
The output is a sequence of tokens that is sent to the parser for syntax analysis
Strips off blanks, tabs, newlines, and comments from the source program
Keeps track of line numbers and associates error messages from various parts of a compiler with line numbers
Performs some preprocessor functions such as #define and #include in C


Separation of Lexical Analysis from Syntax Analysis

Simplification of design - software engineering reason
I/O issues are limited to LA alone
More compact and faster parser
  Comments, blanks, etc., need not be handled by the parser
  A parser is more complicated than a lexical analyzer, and shrinking the grammar makes the parser faster
  No rules for numbers, names, comments, etc., are needed in the parser
Lexical analyzers based on finite automata are more efficient to implement than the pushdown automata used for parsing, which need a stack


Tokens, Patterns, and Lexemes

Running example: float abs_zero_Kelvin = -273;
Token (also called word)
  A string of characters which logically belong together
  float, identifier, equal, minus, intnum, semicolon
  Tokens are treated as terminal symbols of the grammar specifying the source language
Pattern
  The set of strings for which the same token is produced
  The pattern is said to match each string in the set
  float, l(l+d+_)*, =, -, d+, ;
  (l stands for a letter and d for a digit; inside the parentheses + means "or", whereas d+ means one or more digits)
Lexeme
  The sequence of characters matched by a pattern to form the corresponding token
  float, abs_zero_Kelvin, =, -, 273, ;
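The token/pattern/lexeme distinction on the running example can be illustrated with a small scanner. This is only a sketch: the token names come from the slide, but the Python regular expressions, the extra `skip` rule for whitespace, and the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are assumptions made for illustration.

```python
import re

# (token name, pattern) pairs. Token names follow the slide; the patterns are
# an assumed Python rendering of float, l(l+d+_)*, d+, =, - and ;.
TOKEN_SPEC = [
    ("float",      r"float\b"),                # keyword, tried before identifier
    ("identifier", r"[A-Za-z][A-Za-z0-9_]*"),  # letter (letter | digit | _)*
    ("intnum",     r"[0-9]+"),                 # one or more digits
    ("equal",      r"="),
    ("minus",      r"-"),
    ("semicolon",  r";"),
    ("skip",       r"[ \t\n]+"),               # whitespace is stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs for the given source text."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "skip":
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs
```

Calling `tokenize("float abs_zero_Kelvin = -273;")` pairs each token name with the lexeme that its pattern matched, for example `("identifier", "abs_zero_Kelvin")`.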


Tokens in Programming Languages

Keywords, operators, identifiers (names), constants, literal strings, punctuation symbols such as parentheses, brackets, commas, semicolons, and colons, etc.
A unique integer representing the token is passed by LA to the parser
Attributes for tokens (apart from the integer representing the token)
  identifier: the lexeme of the token, or a pointer into the symbol table where the lexeme is stored by the LA
  intnum: the value of the integer (similarly for floatnum, etc.)
  string: the string itself
  The exact set of attributes depends on the compiler designer
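One possible way to carry these attributes is sketched below. The integer codes, the `Token` and `SymbolTable` classes, and `make_token` are all hypothetical; as the slide says, the actual attribute set is up to the compiler designer.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical integer codes for three token classes.
IDENTIFIER, INTNUM, STRING = 1, 2, 3


@dataclass
class Token:
    code: int                                    # the integer passed to the parser
    attribute: Optional[Union[int, str]] = None  # extra information, if any


class SymbolTable:
    """Stores each distinct lexeme once and hands out its index."""

    def __init__(self):
        self._index = {}
        self.lexemes = []

    def intern(self, lexeme):
        if lexeme not in self._index:
            self._index[lexeme] = len(self.lexemes)
            self.lexemes.append(lexeme)
        return self._index[lexeme]


def make_token(code, lexeme, table):
    """Attach the attribute that suits each token class."""
    if code == IDENTIFIER:
        return Token(code, table.intern(lexeme))  # index into the symbol table
    if code == INTNUM:
        return Token(code, int(lexeme))           # the value of the integer
    if code == STRING:
        return Token(code, lexeme)                # the string itself
    return Token(code)                            # operators etc. need no attribute
```

Here the identifier attribute is a symbol table index rather than the lexeme itself, which is one of the two options the slide mentions.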


Difficulties in Lexical Analysis

Certain languages do not have any reserved words, e.g., while, do, if, else, etc., are reserved in C, but not in PL/1
In FORTRAN, some keywords are context-dependent
  In the statement DO 10 I = 10.86, DO10I is an identifier, and DO is not a keyword
  But in the statement DO 10 I = 10, 86, DO is a keyword
  Such features require substantial lookahead for resolution
Blanks are not significant in FORTRAN and can appear in the midst of identifiers, but not so in C
LA cannot catch any significant errors except for simple errors such as illegal symbols, etc.
  In such cases, LA skips characters in the input until a well-formed token is found
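The lookahead needed for the FORTRAN DO example can be sketched as follows. This is a deliberately simplified illustration, not how any particular FORTRAN compiler works: the function name `classify_do_statement` is hypothetical, and string literals and other statement forms are ignored.

```python
def classify_do_statement(statement):
    """Return 'do_loop' or 'assignment' for a statement that starts with DO.

    The decision needs lookahead past the '=': a comma outside parentheses
    means DO is a keyword; otherwise the text before '=' is one identifier.
    """
    text = statement.replace(" ", "").upper()  # blanks are not significant
    if not text.startswith("DO") or "=" not in text:
        raise ValueError("expected a statement of the form DO ... = ...")
    depth = 0
    for ch in text[text.index("=") + 1:]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "do_loop"
    return "assignment"
```

With this sketch, `DO 10 I = 10, 86` is classified as a loop and `DO 10 I = 10.86` as an assignment to `DO10I`, but only after the scanner has looked at the whole right-hand side.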


Specification and Recognition of Tokens

Regular definitions, a mechanism based on regular expressions, are very popular for specification of tokens
  They have been implemented in the lexical analyzer generator tool, LEX
  We study regular expressions first, and then token specification using LEX
Transition diagrams, a variant of finite state automata, are used to implement regular definitions and to recognize tokens
  Transition diagrams are usually used to model LA before translating them to programs by hand
  LEX automatically generates optimized FSA from regular definitions
  We study FSA and their generation from regular expressions in order to understand transition diagrams and LEX
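As an illustration of translating a transition diagram to a program by hand, the sketch below codes a two-state diagram for identifiers of the form letter (letter | digit | _)*. The diagram, its state numbering, and the function `scan_identifier` are assumptions made for this example; only ASCII letters and digits are treated as l and d.

```python
def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) if an identifier starts at pos, else None.

    State 0: start state, expects a letter.
    State 1: inside an identifier, loops on letter, digit, or underscore;
             any other character ends the token without being consumed.
    """
    state = 0
    i = pos
    while True:
        ch = text[i] if i < len(text) else ""  # "" stands for end of input
        is_letter = ch.isascii() and ch.isalpha()
        is_digit = ch.isascii() and ch.isdigit()
        if state == 0:
            if is_letter:
                state = 1
                i += 1
            else:
                return None                    # no identifier starts here
        else:  # state == 1
            if is_letter or is_digit or ch == "_":
                i += 1                         # stay in state 1
            else:
                return text[pos:i], i          # accept, leave ch for the next token
```

The character that ends the identifier is looked at but not consumed, which mirrors the retract step usually drawn on accepting states of a transition diagram.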


Languages

Symbol: An abstract entity, not defined
  Examples: letters and digits
String: A finite sequence of juxtaposed symbols
  abcb, caba are strings over the symbols a, b, and c
  $|w|$ is the length of the string $w$, i.e., the number of symbols in it
  $\epsilon$ is the empty string and is of length 0
Alphabet: A finite set of symbols
Language: A set of strings of symbols from some alphabet
  $\emptyset$ and $\{\epsilon\}$ are languages
  The set of palindromes over {0,1} is an infinite language
  The set of strings {01, 10, 111} over {0,1} is a finite language
If $\Sigma$ is an alphabet, $\Sigma^*$ is the set of all strings over $\Sigma$
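The set of all strings over an alphabet can be listed shortest first, which makes the definitions above concrete. The generator below is a minimal sketch; the names `all_strings` and `is_palindrome` are chosen here for illustration.

```python
from itertools import count, islice, product


def all_strings(alphabet):
    """Yield every string over the alphabet, shortest first.

    The first string produced is the empty string (length 0).
    """
    symbols = sorted(alphabet)
    for length in count(0):
        for letters in product(symbols, repeat=length):
            yield "".join(letters)


def is_palindrome(w):
    """Membership test for the language of palindromes."""
    return w == w[::-1]


# The first seven strings over {0, 1} and the first five palindromes among them.
first_strings = list(islice(all_strings({"0", "1"}), 7))
first_palindromes = list(islice(filter(is_palindrome, all_strings({"0", "1"})), 5))
```

A language is then any subset of what `all_strings` enumerates; the palindromes form an infinite subset, so `filter` never runs dry, whereas a finite language such as {01, 10, 111} could simply be written out as a Python set.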


Language Representations

Each subset of $\Sigma^*$ is a language
This set of languages over $\Sigma$ is uncountably infinite
Each language must have a finite representation
  A finite representation can be encoded by a finite string
  Thus, each string of $\Sigma^*$ can be thought of as representing some language over the alphabet $\Sigma$
  $\Sigma^*$ is countably infinite
  Hence, there are more languages than language representations
Regular expressions (type-3 or regular languages), context-free grammars (type-2 or context-free languages), context-sensitive grammars (type-1 or context-sensitive languages), and type-0 grammars are finite representations of respective languages
$RL \subset CFL \subset CSL \subset$ type-0 languages

Examples of Languages

Let $\Sigma = \{a, b, c\}$
$L_1 = \{a^m b^n \mid m, n \ge 0\}$ is regular
$L_2 = \{a^n b^n \mid n \ge 0\}$ is context-free but not regular
$L_3 = \{a^n b^n c^n \mid n \ge 0\}$ is context-sensitive but neither regular nor context-free
Exhibiting a language that is type-0 but not a CSL, CFL, or RL is very intricate and is omitted
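Membership tests for the three example languages are sketched below, assuming the bounds read m, n ≥ 0 (the relation symbol is not legible in the source). The function names are illustrative. Note that a general-purpose language can test all three easily; the classification on the slide is about which grammar or automaton class suffices, not about how hard the test is to program.

```python
import re


def in_L1(w):
    """a^m b^n with m, n >= 0 (assumed bounds): a regular expression suffices."""
    return re.fullmatch(r"a*b*", w) is not None


def in_L2(w):
    """a^n b^n: compare w with the only candidate string of the same length."""
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n


def in_L3(w):
    """a^n b^n c^n: same idea, with three equal blocks."""
    n = len(w) // 3
    return len(w) % 3 == 0 and w == "a" * n + "b" * n + "c" * n
```

For example, `aab` is in L1 but not in L2, and `aabbcc` is in L3. The test for L1 needs no counting at all, whereas the tests for L2 and L3 must compare block lengths, which is what a finite automaton cannot do for unbounded n.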


Automata

Automata are machines that accept languages
  Finite State Automata accept RLs (corresponding to REs)
  Pushdown Automata accept CFLs (corresponding to CFGs)
  Linear Bounded Automata accept CSLs (corresponding to CSGs)
  Turing Machines accept type-0 languages (corresponding to type-0 grammars)

Applications of Automata
  Switching circuit design
  Lexical analyzer in a compiler
  String processing (grep, awk), etc.
  State charts used in object-oriented design
  Modelling control applications, e.g., elevator operation
  Parsers of all types
  Compilers


Finite State Automaton

An FSA is an acceptor or recognizer of regular languages
An FSA is a 5-tuple, $(Q, \Sigma, \delta, q_0, F)$, where
  $Q$ is a finite set of states
  $\Sigma$ is the input alphabet
  $\delta$ is the transition function, $\delta : Q \times \Sigma \rightarrow Q$
  That is, $\delta(q, a)$ is a state for each state $q$ and input symbol $a$
  $q_0$ is the start state
  $F$ is the set of final or accepting states
In one move from some state $q$, an FSA reads an input symbol, changes the state based on $\delta$, and gets ready to read the next input symbol
An FSA accepts its input string if, starting from $q_0$, it consumes the entire input string and reaches a final state
If the last state reached is not a final state, then the input string is rejected
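The acceptance rule can be written as a short loop over the input. The sketch below represents the transition function as a Python dictionary keyed by (state, symbol); the function name `run_dfa` and the small even-number-of-a's automaton used to exercise it are hypothetical and not taken from the lecture.

```python
def run_dfa(delta, start, finals, text):
    """Return True if the automaton accepts text.

    delta maps (state, symbol) pairs to states, start is the start state,
    and finals is the set of final states.
    """
    state = start
    for symbol in text:
        state = delta[(state, symbol)]  # one move: read a symbol, change state
    return state in finals              # accept only if the last state is final


# Hypothetical automaton over {a, b}: accepts strings with an even number of a's.
EVEN_A_DELTA = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
```

For instance, `run_dfa(EVEN_A_DELTA, "even", {"even"}, "abab")` follows four moves and ends in the final state `even`, so the string is accepted. A symbol outside the alphabet raises `KeyError`, since the dictionary, like the transition function, is defined only on Q × Σ.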


FSA Example - 1


FSA Example - 1 (Contd.)

$Q = \{q_0, q_1, q_2, q_3\}$
$\Sigma = \{a, b, c\}$
$q_0$ is the start state and $F = \{q_0, q_2\}$
The transition function $\delta$ is defined by the table below

state   symbol
        a    b    c
q0      q1   q3   q3
q1      q1   q1   q2
q2      q3   q3   q3
q3      q3   q3   q3

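The accepted language is the set of all strings beginning with an a and ending with a c ($\epsilon$ is also accepted)
Note that, by the table, the final c must be the only c in the string, since every move out of q2 leads to the non-accepting trap state q3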
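The automaton of this example can be run directly from its transition table. The sketch below assumes the table is transcribed faithfully from the slide; the names `TABLE`, `START`, `FINALS`, and `accepts` are illustrative.

```python
# Transition table of FSA Example 1: TABLE[state][symbol] is the next state.
TABLE = {
    "q0": {"a": "q1", "b": "q3", "c": "q3"},
    "q1": {"a": "q1", "b": "q1", "c": "q2"},
    "q2": {"a": "q3", "b": "q3", "c": "q3"},
    "q3": {"a": "q3", "b": "q3", "c": "q3"},
}
START = "q0"
FINALS = {"q0", "q2"}


def accepts(text):
    """Run the automaton on text and report whether it ends in a final state."""
    state = START
    for symbol in text:
        state = TABLE[state][symbol]
    return state in FINALS
```

For instance, `accepts("abbac")` and `accepts("")` hold, while `accepts("bc")` and `accepts("a")` do not.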


FSA Example - 2

$Q = \{q_0, q_1, q_2, q_3\}$, $q_0$ is the start state
$F = \{q_0\}$, $\delta$ is as in the figure
Language accepted is the set of all strings of 0s and 1s in which the number of 0s and the number of 1s are both even
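Since the figure is not reproduced here, the transition function below is a reconstruction under an assumption: each state records the parity of the 0s and 1s read so far, with q0 standing for "both even". The actual labelling of q1, q2, and q3 in the figure may differ; the accepted language does not depend on that choice.

```python
# Assumed meaning of the states (the labelling in the figure may differ):
#   q0: even 0s, even 1s    q1: odd 0s, even 1s
#   q2: odd 0s, odd 1s      q3: even 0s, odd 1s
PARITY_DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q3",
    ("q1", "0"): "q0", ("q1", "1"): "q2",
    ("q2", "0"): "q3", ("q2", "1"): "q1",
    ("q3", "0"): "q2", ("q3", "1"): "q0",
}


def accepts_even_even(text):
    """True if text has an even number of 0s and an even number of 1s."""
    state = "q0"
    for symbol in text:
        state = PARITY_DELTA[(state, symbol)]
    return state == "q0"  # F = {q0}
```

Reading a 0 flips the parity of 0s and reading a 1 flips the parity of 1s, so four states are enough no matter how long the input is.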


Regular Languages

The language accepted by an FSA is the set of all strings accepted by it, i.e., all strings $x$ with $\delta(q_0, x) \in F$
This is a regular language or a regular set
Later we will define regular expressions and regular grammars, which are generators of regular languages
It can be shown that for every regular expression an FSA can be constructed, and vice versa


Nondeterministic FSA

NFAs are FSA which allow 0, 1, or more transitions from a state on a given input symbol
An NFA is a 5-tuple as before, but the transition function $\delta$ is different
  $\delta(q, a)$ = the set of all states $p$ such that there is a transition labelled $a$ from $q$ to $p$
  $\delta : Q \times \Sigma \rightarrow 2^Q$
A string is accepted by an NFA if there exists a sequence of transitions corresponding to the string that leads from the start state to some final state
Every NFA can be converted to an equivalent deterministic FA (DFA) that accepts the same language as the NFA
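Both ideas, acceptance by an NFA and conversion to a DFA, can be sketched by tracking the set of states the NFA could be in. The code assumes an NFA without ε-transitions, as in the definition above; the function names and the sample NFA for strings over {a, b} ending in ab are hypothetical. The conversion shown is the usual subset construction, which the slide does not name.

```python
def nfa_accepts(delta, start, finals, text):
    """Accept if some sequence of transitions on text reaches a final state."""
    current = {start}
    for symbol in text:
        # Missing entries mean "no transition", i.e. the empty set of states.
        current = {p for q in current for p in delta.get((q, symbol), set())}
    return bool(current & finals)


def subset_construction(delta, start, finals, alphabet):
    """Build an equivalent DFA whose states are frozensets of NFA states."""
    start_set = frozenset({start})
    dfa_delta = {}
    seen = {start_set}
    work = [start_set]
    while work:
        subset = work.pop()
        for a in alphabet:
            target = frozenset(p for q in subset for p in delta.get((q, a), set()))
            dfa_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    dfa_finals = {s for s in seen if s & finals}
    return dfa_delta, start_set, dfa_finals


# Hypothetical NFA over {a, b}: accepts exactly the strings that end in "ab".
ENDS_AB = {
    (0, "a"): {0, 1},  # two choices on a: stay in 0 or guess the suffix starts
    (0, "b"): {0},
    (1, "b"): {2},
}
```

Only subsets reachable from the start state are generated, so the resulting DFA is often much smaller than the 2^|Q| states that the powerset allows in the worst case.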


Nondeterministic FSA Example - 1

