02 Lexical Analysis Part 1

  • Upload
    kalai

  • View
    229

  • Download
    0

Embed Size (px)

Citation preview

  • Lexical Analysis - Part 1

    Y.N. Srikant

    Department of Computer Science and AutomationIndian Institute of Science

    Bangalore 560 012

    NPTEL Course on Principles of Compiler Design

    Y.N. Srikant Lexical Analysis - Part 1

  • Outline of the Lecture

    What is lexical analysis?Why should LA be separated from syntax analysis?Tokens, patterns, and lexemesDifficulties in lexical analysisRecognition of tokens - finite automata and transitiondiagramsSpecification of tokens - regular expressions and regulardefinitionsLEX - A Lexical Analyzer Generator

    Y.N. Srikant Lexical Analysis - Part 1

  • Compiler Overview

    Y.N. Srikant Lexical Analysis - Part 1

  • What is Lexical Analysis?

    The input is a high level language program, such as a Cprogram in the form of a sequence of charactersThe output is a sequence of tokens that is sent to theparser for syntax analysisStrips off blanks, tabs, newlines, and comments from thesource programKeeps track of line numbers and associates errormessages from various parts of a compiler with linenumbersPerforms some preprocessor functions such as #defineand #include in C

    Y.N. Srikant Lexical Analysis - Part 1

  • Separation of Lexical Analysis from Syntax Analysis

    Simplification of design - software engineering reasonI/O issues are limited LA aloneMore compact and faster parser

    Comments, blanks, etc., need not be handled by the parserA parser is more complicated than a lexical analyzer andshrinking the grammar makes the parser faster

    No rules for numbers, names, comments, etc., are needed inthe parser

    LA based on finite automata are more efficient toimplement than pushdown automata used for parsing (dueto stack)

    Y.N. Srikant Lexical Analysis - Part 1

  • Tokens, Patterns, and Lexemes

    Running example: float abs_zero_Kelvin = -273;Token (also called word)

    A string of characters which logically belong togetherfloat, identifier, equal, minus, intnum, semicolonTokens are treated as terminal symbols of the grammarspecifying the source language

    PatternThe set of strings for which the same token is producedThe pattern is said to match each string in the setfloat, l(l+d+_)*, =, -, d+, ;

    LexemeThe sequence of characters matched by a pattern to formthe corresponding tokenfloat, abs_zero_Kelvin, =, -, 273, ;

    Y.N. Srikant Lexical Analysis - Part 1

  • Tokens in Programming Languages

    Keywords, operators, identifiers (names), constants, literalstrings, punctuation symbols such as parentheses,brackets, commas, semicolons, and colons, etc.A unique integer representing the token is passed by LA tothe parserAttributes for tokens (apart from the integer representingthe token)

    identifier: the lexeme of the token, or a pointer into thesymbol table where the lexeme is stored by the LAintnum: the value of the integer (similarly for floatnum, etc.)string: the string itselfThe exact set of attributes are dependent on the compilerdesigner

    Y.N. Srikant Lexical Analysis - Part 1

  • Difficulties in Lexical Analysis

    Certain languages do not have any reserved words, e.g.,while, do, if, else, etc., are reserved in C, but not in PL/1In FORTRAN, some keywords are context-dependent

    In the statement, DO 10 I = 10.86, DO10I is an identifier,and DO is not a keywordBut in the statement, DO 10 I = 10, 86, DO is a keywordSuch features require substantial look ahead for resolution

    Blanks are not significant in FORTRAN and can appear inthe midst of identifiers, but not so in CLA cannot catch any significant errors except for simpleerrors such as, illegal symbols, etc.In such cases, LA skips characters in the input until awell-formed token is found

    Y.N. Srikant Lexical Analysis - Part 1

  • Specification and Recognition of Tokens

    Regular definitions, a mechansm based on regularexpressions are very popular for specification of tokens

    Has been implemented in the lexical analyzer generatortool, LEXWe study regular expressions first, and then, tokenspecification using LEX

    Transition diagrams, a variant of finite state automata, areused to implement regular definitions and to recognizetokens

    Transition diagrams are usually used to model LA beforetranslating them to programs by handLEX automatically generates optimized FSA from regulardefinitionsWe study FSA and their generation from regularexpressions in order to understand transition diagrams andLEX

    Y.N. Srikant Lexical Analysis - Part 1

  • Languages

    Symbol: An abstract entity, not definedExamples: letters and digits

    String: A finite sequence of juxtaposed symbolsabcb, caba are strings over the symbols a,b, and c|w | is the length of the string w, and is the #symbols in it is the empty string and is of length 0

    Alphabet: A finite set of symbolsLanguage: A set of strings of symbols from some alphabet

    and {} are languagesThe set of palindromes over {0,1} is an infinite languageThe set of strings, {01, 10, 111} over {0,1} is a finitelanguage

    If is an alphabet, is the set of all strings over

    Y.N. Srikant Lexical Analysis - Part 1

  • Language Representations

    Each subset of is a languageThis set of languages over is uncountably infiniteEach language must have by a finite representation

    A finite representation can be encoded by a finite stringThus, each string of can be thought of as representingsome language over the alphabet is countably infiniteHence, there are more languages than languagerepresentations

    Regular expressions (type-3 or regular languages),context-free grammars (type-2 or context-freelanguages), context-sensitive grammars (type-1 orcontext-sensitive languages), and type-0 grammars arefinite representations of respective languagesRL

  • Examples of Languages

    Let = {a,b, c}L1 = {ambn|m,n 0} is regularL2 = {anbn|n 0} is context-free but not regularL3 = {anbncn|n 0} is context-sensitive but neitherregular nor context-freeShowing a language that is type-0, but none of CSL, CFL,or RL is very intricate and is omitted

    Y.N. Srikant Lexical Analysis - Part 1

  • Automata

    Automata are machines that accept languagesFinite State Automata accept RLs (corresponding to REs)Pushdown Automata accept CFLs (corresponding to CFGs)Linear Bounded Automata accept CSLs (corresponding toCSGs)Turing Machines accept type-0 languages (correspondingto type-0 grammars)

    Applications of AutomataSwitching circuit designLexical analyzer in a compilerString processing (grep, awk), etc.State charts used in object-oriented designModelling control applications, e.g., elevator operationParsers of all typesCompilers

    Y.N. Srikant Lexical Analysis - Part 1

  • Finite State Automaton

    An FSA is an acceptor or recognizer of regular languagesAn FSA is a 5-tuple, (Q,, ,q0,F ), where

    Q is a finite set of states is the input alphabet is the transition function, : Q QThat is, (q,a) is a state for each state q and input symbol aq0 is the start stateF is the set of final or accepting states

    In one move from some state q, an FSA reads an inputsymbol, changes the state based on , and gets ready toread the next input symbolAn FSA accepts its input string, if starting from q0, itconsumes the entire input string, and reaches a final stateIf the last state reached is not a final state, then the inputstring is rejected

    Y.N. Srikant Lexical Analysis - Part 1

  • FSA Example - 1

    Y.N. Srikant Lexical Analysis - Part 1

  • FSA Example -1 (Contd.)

    Q = {q0,q1,q2,q3} = {a,b, c}q0 is the start state and F = {q0,q2}The transition function is defined by the table below

    state symbola b c

    q0 q1 q3 q3q1 q1 q1 q2q2 q3 q3 q3q3 q3 q3 q3

    The accepted language is the set of all strings beginning withan a and ending with a c ( is also accepted)

    Y.N. Srikant Lexical Analysis - Part 1

  • FSA Example - 2

    Q = {q0,q1,q2,q3},q0 is the start stateF = {q0}, is as in the figureLanguage accepted is the set of all strings of 0s and 1s, inwhich the no. of 0s and the no. of 1s are even numbers

    Y.N. Srikant Lexical Analysis - Part 1

  • Regular Languages

    The language accepted by an FSA is the set of all stringsaccepted by it, i.e., (q0, x)FThis is a regular language or a regular setLater we will define regular expressions and regulargrammars which are generators of regular languagesIt can be shown that for every regular expression, an FSAcan be constructed and vice-versa

    Y.N. Srikant Lexical Analysis - Part 1

  • Nondeterministic FSA

    NFAs are FSA which allow 0, 1, or more transitions from astate on a given input symbolAn NFA is a 5-tuple as before, but the transition function is different(q,a) = the set of all states p, such that there is atransition labelled a from q to p : Q 2QA string is accepted by an NFA if there exists a sequenceof transitions corresponding to the string, that leads fromthe start state to some final stateEvery NFA can be converted to an equivalent deterministicFA (DFA), that accepts the same language as the NFA

    Y.N. Srikant Lexical Analysis - Part 1

  • Nondeterministic FSA Example - 1

    Y.N. Srikant Lexical Analysis - Part 1