
2G1508-L01-6 Intro to Lexical Analysis.pdf


  • 2G1508-L01: Introduction, Lexical Analysis

    Christian Schulte, IMIT, KTH

    www.imit.kth.se/~schulte/

    2005-10-25

    Overview
      Organizational
      Course overview
      Compiler structure
      Lexical analysis

    Organizational

    Textbook
      Andrew W. Appel, Modern Compiler Implementation in Java, 2nd edition, Cambridge University Press, 2002.

    Kursnämnd (course committee)
      Two volunteers needed!

    Elect and Sign Up!
      Sign up on the list (most likely you'll have to write down all your details)
      Do not forget to elect the course

    No labs
      There will be no labs this time
        lab sessions are cancelled
      Lab part of course
        three assignments (10 points each) to be submitted
        corrected by Mikael Lagerkvist
        at least 15 points required to pass
        points valid as bonus points on exam if submitted in time (only this academic year)

    Examination
      course passed labs passed full exam 240 points

    Course Overview

    Reading Suggestion
      Chapters 1 and 2

    Compiler and Execution Environments
      General question: how to execute a program written in some high-level programming language
      Two aspects
        compilation: transform into language good for execution
        execution: execute program

    Compiler
      Compiler translates program from one programming language into another
        language compiled from: source language
        language compiled to: target language
      Source language: for programming
        examples: Java, C, C++, Oz, ...
      Target language: for execution
        examples: assembler (x86, MIPS, ...), JVM code

    Execution Environments
      Can be concrete hardware
        how to manage memory
        how to link and load programs
        take advantage of architectural features
      Can be abstract machine
        how to interpret abstract machine code efficiently
        how to further compile at runtime

    Compilation: Basic structure and tasks

    Compilation Phases
      Frontend depends on source language
      Backend depends on target language
      Factorize dependencies
      [Figure: source program → frontend → intermediate representation → backend → target program]

    Frontend: Tasks
      Lexical analysis
        how program is composed into tokens (words)
        typical token classes: identifier, number, keywords, ...
        creates token stream
      Syntax analysis
        phrasal structure of program (sentences)
        grammar rules describing how expressions, statements, etc. are formed
        creates abstract syntax tree
      Semantic analysis
        perform identifier analysis (scope), type checking, ...
        creates intermediate representation trees
        after that: canonicalize and clean up

    Backend: Basic Tasks
      Optimization
        reduce execution time and program size
        typically independent of target architecture
        intermediate and complex component: "midend"
      Instruction selection
        which instruction for a certain abstract operation
      Register allocation
        which variables are kept in which registers?
        which variables go to memory?
        more generic: memory allocation
      Code emission

    Optimization
      Common subexpression elimination (CSE)
        reuse intermediate results
      Dead-code elimination
        remove code that can never be executed
      Strength reduction
        make operations in loops cheaper: instead of multiplying with n, increment by n (array access)
      Constant/value propagation
        propagate information on values of variables
      Code motion
        move invariant code out of loops
      Many, many more, ...
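The strength reduction idea (replace a multiplication by n inside a loop with a running increment by n) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and n stands for an array element size used to compute offsets.

```python
def offsets_multiply(count, n):
    """Offset of element i computed with one multiplication per iteration."""
    return [i * n for i in range(count)]


def offsets_increment(count, n):
    """Same offsets after strength reduction: one addition per iteration."""
    result, offset = [], 0
    for _ in range(count):
        result.append(offset)
        offset += n  # replaces the multiplication i * n
    return result


print(offsets_multiply(5, 4) == offsets_increment(5, 4))
```

Both functions produce the same offsets; a compiler would perform this rewrite on the intermediate representation rather than on source code.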

    Lexical Analysis

    Overall Structure
      Compiler has two main phases
        analysis: understand program ("front end")
        synthesis: put it together in different way ("back end")
      Analysis typically broken up into
        lexical: break into words or "tokens"
        syntax: parse phrase structure of program
        semantic: calculate program's meaning

    Lexical Analyzer
      Also: lexer
      Takes a stream of characters
      Produces a stream of tokens
        names
        keywords
        punctuation marks
      Discards white space and comments
      Simple task

    Lexical Tokens
      Sequence of characters treated as unit in grammar of programming language
      Programming language classifies tokens into finite set of token types
        some tokens have semantic value attached (ID, NUM, ...)
      Punctuation tokens such as IF, VOID, RETURN
        constructed from characters: reserved words
        cannot be used as identifiers
      Non-tokens
        comments, preprocessor directives, whitespace

    Example Token Types
      ID       foo  n14  last
      NUM      73  0  00  5151
      REAL     3.75  .2  1e23  5.5e-10
      IF       if
      COMMA    ,
      NOTEQ    !=
      LPAREN   (
      RPAREN   )

    Example Program
      float match0(char* s) {
        /* find a zero */
        if (!strncmp(s, "0.0", 3))
          return 0.;
      }

    Example Token Stream
      FLOAT ID(match0) LPAREN CHAR STAR ID(s) RPAREN LBRACE
      IF LPAREN BANG ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
      RETURN REAL(0.0) SEMI
      RBRACE EOF

    Approach
      Specification of lexical tokens
        regular expression (regexp)
      Implementation of lexer
        deterministic finite automaton (DFA)
      Computing DFA from regexp
        nondeterministic finite automaton (NFA)

    Regular Expressions
      Language: set of strings
      String: finite sequence of symbols
        symbols are taken from finite alphabet
      Example
        language of primes: decimal digit strings representing prime numbers
        alphabet is ASCII character set
      Regular expression: stands for set of strings
        possibly infinite set

    Regular Expressions
      Symbol a
        denotes language just containing string a
      Alternation M|N
        where M and N are regular expressions
        string in language of M|N, if string in language of M or in language of N
      Concatenation MN
        where M and N are regular expressions
        string in language of MN, if concatenation of strings α and β such that α in language of M and β in language of N

    Regular Expressions
      Epsilon ε
        denotes language just containing the empty string
      Repetition M*
        where M is regular expression
        called Kleene closure
        string in language of M*, if concatenation of zero or more strings in language of M

    Regular Expression Examples
      a|b          {"a","b"}
      (a|b)a       {"aa","ba"}
      (ab)|ε       {"ab",""}
      ((a|b)a)*    {"","aa","ba","aaaa","aaba","baaa","baba",...}

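The regular expression examples can be checked mechanically. The sketch below uses Python's `re` module (not part of the lecture) to enumerate all strings over {a, b} up to a small length and keep those a pattern matches completely. The helper name `language` and the length bound are assumptions for illustration; epsilon is written as an empty alternative, so `(ab)|ε` becomes `(ab)|`.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=4):
    """Strings over `alphabet` of length <= max_len that fully match `pattern`."""
    regex = re.compile(pattern)
    words = (
        "".join(letters)
        for n in range(max_len + 1)
        for letters in product(alphabet, repeat=n)
    )
    return [w for w in words if regex.fullmatch(w)]


examples = {
    "a|b": language("a|b"),
    "(a|b)a": language("(a|b)a"),
    "(ab)|": language("(ab)|"),          # (ab)|ε
    "((a|b)a)*": language("((a|b)a)*"),  # infinite language, cut off at length 4
}

for pattern, strings in examples.items():
    print(pattern, strings)
```

Because a language can be infinite, the enumeration only shows its finite prefix up to `max_len`.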
    Conventions
      Sometimes omit · or ε
        ab means a·b
        (a|) means (a|ε)
      Kleene closure binds tighter than concatenation
        ab* means a(b)*
      Concatenation binds tighter than alternation
        ab|c means (ab)|c

    Lexical Specification Examples
      Even binary numbers
        (0|1)*0
      Strings of a's and b's with no consecutive a's
        b*(abb*)*(a|ε)
      Strings of a's and b's with consecutive a's
        (a|b)*aa(a|b)*

    Abbreviations
      [abcd]       means a | b | c | d
      [b-g]        means [bcdefg]
      [a-cA-C01]   means [abcABC01]
      M?           means (M|ε)
      M+           means (MM*)
      .            any character but newline
      "xyz+-*"     stands for itself
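As a sanity check, the lexical specification examples and two of the abbreviations can be compared against their plain-language descriptions by brute force. This is a sketch using Python's `re` module, which is an assumption of this example rather than a tool from the lecture; the helper `words` and the length bounds are hypothetical.

```python
import re
from itertools import product


def words(alphabet, max_len):
    """Yield every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


even_binary = re.compile(r"(0|1)*0")
no_double_a = re.compile(r"b*(abb*)*(a|)")      # (a|) stands for (a|ε)
has_double_a = re.compile(r"(a|b)*aa(a|b)*")

# Each regexp should agree with the property it is meant to describe.
even_ok = all(
    bool(even_binary.fullmatch(w)) == (int(w, 2) % 2 == 0)
    for w in words("01", 8) if w
)
no_aa_ok = all(
    bool(no_double_a.fullmatch(w)) == ("aa" not in w) for w in words("ab", 8)
)
aa_ok = all(
    bool(has_double_a.fullmatch(w)) == ("aa" in w) for w in words("ab", 8)
)

# Abbreviations: M? means (M|ε), M+ means (MM*), here with M = ab.
optional_ok = all(
    bool(re.fullmatch(r"(ab)?", w)) == bool(re.fullmatch(r"(ab|)", w))
    for w in words("ab", 6)
)
plus_ok = all(
    bool(re.fullmatch(r"(ab)+", w)) == bool(re.fullmatch(r"(ab)(ab)*", w))
    for w in words("ab", 6)
)

print(even_ok, no_aa_ok, aa_ok, optional_ok, plus_ok)
```

A bounded check like this is not a proof, but it quickly exposes a specification that accepts or rejects the wrong strings.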

    Programming Language Token Specifications
      if                                      IF
      [a-z][a-z0-9]*                          ID
      [0-9]+                                  NUM
      ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)     REAL
      (" "|"\t"|"\n"|"\r")                    no token
      .                                       error

      Lexical specification needs to be complete

    Disambiguation
      Does if8 match ID or IF NUM(8)?
      Disambiguation rules commonly used
        longest match: longest initial substring that can match any regexp is token
        rule priority: for particular longest initial substring, first matched regexp determines token type; order is significant
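The two disambiguation rules can be sketched directly on top of the token specification. The code below is a minimal illustration, not a lexer generator: it tries every rule at the current position with Python's `re` module, keeps the longest match, and breaks ties in favour of the earlier rule. The names `RULES` and `tokenize` and the `ERROR` token type are assumptions made for this example.

```python
import re

# Rules in priority order, following the token specification.
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("REAL", r"([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)"),
    (None, r"[ \t\n\r]"),   # whitespace: no token
    ("ERROR", r"."),        # any other single character
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Longest match: only a strictly longer match replaces the best.
            # Rule priority: on a tie the earlier rule is kept.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_end == pos:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name is not None:
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("if8 if 8 .5"))
```

With these rules `if8` becomes a single ID (longest match), while a lone `if` becomes IF because IF is listed before ID (rule priority).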

    Finite Automata
      Regular expressions for specification
      Finite automata for implementation
      Finite automaton has
        finite set of states
        edges leading from state to state, labeled with symbol
        one start state
        set of final states

    Finite Automaton for IF
      Start state: 1
      Final states: 3
      [Figure: 1 --i--> 2 --f--> 3]

    Finite Automaton for ID
      Start state: 1
      Final states: 2
      [Figure: 1 --a-z--> 2, with a loop on state 2 labeled a-z, 0-9]

    Finite Automata
      Deterministic finite automaton (DFA)
        no two edges leaving the same state are labeled with the same symbol
      Otherwise: nondeterministic finite automaton (NFA)

    Accepted Language
      DFA accepts or rejects a string
        start from start state
        for each input character, follow exactly one edge according to next character to next state
        no edge exists: reject
        after n transitions for an n character string: if in final state, accept string, otherwise reject
      Language accepted by DFA
        set of accepted strings
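The acceptance procedure for a DFA translates almost line by line into code. The sketch below encodes the automata for IF and ID as dictionaries from (state, symbol) to next state; the helper names `make_dfa` and `accepts` and the dictionary representation are assumptions chosen for brevity.

```python
import string


def make_dfa(edges):
    """Build a transition table from (source, symbols, target) triples."""
    table = {}
    for src, symbols, dst in edges:
        for symbol in symbols:
            table[(src, symbol)] = dst
    return table


# Automaton for IF: 1 --i--> 2 --f--> 3, final state 3.
IF_DFA = make_dfa([(1, "i", 2), (2, "f", 3)])

# Automaton for ID: 1 --a-z--> 2, loop on 2 for a-z and 0-9, final state 2.
ID_DFA = make_dfa([
    (1, string.ascii_lowercase, 2),
    (2, string.ascii_lowercase + string.digits, 2),
])


def accepts(table, start, finals, text):
    state = start
    for ch in text:
        if (state, ch) not in table:
            return False              # no edge exists: reject
        state = table[(state, ch)]    # follow exactly one edge
    return state in finals            # accept only in a final state


print(accepts(IF_DFA, 1, {3}, "if"), accepts(ID_DFA, 1, {2}, "n14"))
```

Determinism shows up in the data structure: a dictionary key (state, symbol) can map to only one target state.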

    Example DFA
      How does accepting a string work
      [Figure: DFA with states 1 to 5 and edges labeled a and b]

    Accepting abab
      String to process    State
      abab                 1 (start state)
      bab                  4
      ab                   5
      b                    4
      (empty)              5 (accept: final state!)
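The trace for abab can be replayed in code. The sketch below is deliberately partial: it contains only the three edges that the trace itself visits (1 to 4 on a, 4 to 5 on b, 5 to 4 on a) and treats 5 as the only final state, since the remaining edges and final states of the example DFA cannot be read off the trace. The names `EDGES`, `FINAL_STATES` and `trace` are hypothetical.

```python
# Only the edges visited while processing abab (assumption: the full
# example DFA has more edges, which are not reproduced here).
EDGES = {(1, "a"): 4, (4, "b"): 5, (5, "a"): 4}
FINAL_STATES = {5}


def trace(text, start=1):
    """Return the visited (remaining input, state) pairs and the verdict."""
    state, steps = start, [(text, start)]
    for i, ch in enumerate(text):
        state = EDGES.get((state, ch))
        if state is None:
            return steps, False       # no edge exists: reject
        steps.append((text[i + 1:], state))
    return steps, state in FINAL_STATES


steps, accepted = trace("abab")
for remaining, state in steps:
    print(f"{remaining!r:8} state {state}")
print("accept" if accepted else "reject")
```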

    Combining DFAs
      Formally: little later
      Idea: label final states of each DFA with token type it accepts
        watch out for rule priority: label according to priority
      Implement as transition matrix
        state number × character → state number
        final states: bitvector, etc.
        dead state: for no transition

    Recognizing Longest Match
      Keep track of longest match so far
      Remember
        last final state
        position in string when at last final state
      When dead state entered
        last final state: which token matched
        position: where matching ended, where to start for next token
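A combined DFA with a transition matrix and longest-match bookkeeping might look like the sketch below. It is hand-built for just three token types (IF, ID, NUM) over ASCII input; the state numbering, the `FINAL` labels and the function `next_token` are assumptions made for this illustration, with state 0 serving as the dead state.

```python
DEAD, START = 0, 1
LETTERS = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"

# Transition matrix: table[state][character code] -> state; default is DEAD.
table = [[DEAD] * 128 for _ in range(6)]

# Token type accepted in each state (None: not final). State 3 is reached
# by "if", which matches both IF and ID; IF has priority, so it is labeled IF.
FINAL = [None, None, "ID", "IF", "ID", "NUM"]


def edge(src, chars, dst):
    for ch in chars:
        table[src][ord(ch)] = dst


edge(1, LETTERS, 4)
edge(1, "i", 2)
edge(1, DIGITS, 5)
edge(2, LETTERS + DIGITS, 4)
edge(2, "f", 3)
edge(3, LETTERS + DIGITS, 4)
edge(4, LETTERS + DIGITS, 4)
edge(5, DIGITS, 5)


def next_token(text, start):
    """Return (token type, end position) of the longest match from `start`."""
    state, pos = START, start
    last_final, last_pos = None, start
    while pos < len(text) and ord(text[pos]) < 128:
        state = table[state][ord(text[pos])]
        if state == DEAD:
            break
        pos += 1
        if FINAL[state] is not None:
            last_final, last_pos = FINAL[state], pos   # remember last final state
    return last_final, last_pos


print(next_token("if8", 0), next_token("if+", 0), next_token("42x", 0))
```

The returned position is where scanning for the next token would resume; a result of `(None, start)` means no token matched at all.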

    Nondeterministic Finite Automata

    NFAs
      NFA can have multiple edges for same symbol
      NFA can have edges labeled with ε
        follow edge without eating any symbol
      How to accept?
        guessing is difficult to implement
        use trick: maintain all states that so far could have been reached!
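The trick of maintaining all states that could have been reached can be sketched as follows. The NFA used here is a hypothetical one for the language a|(ab)*, chosen only because it needs ε edges; it is not the automaton from the lecture. ε is represented by `None`, and the function names are assumptions.

```python
EPS = None  # label used for ε edges

# Hypothetical NFA for a | (ab)*: state 0 branches by ε to both alternatives.
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},      # alternative 1: the single string "a"
    (3, "a"): {4},      # alternative 2: (ab)*
    (4, "b"): {3},
}
NFA_FINALS = {2, 3}


def closure(nfa, states):
    """All states reachable from `states` by following only ε edges."""
    stack, seen = list(states), set(states)
    while stack:
        state = stack.pop()
        for target in nfa.get((state, EPS), ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def nfa_accepts(nfa, start, finals, text):
    current = closure(nfa, {start})
    for ch in text:
        moved = {t for s in current for t in nfa.get((s, ch), ())}
        current = closure(nfa, moved)   # set of states that could be reached
    return bool(current & finals)


print([w for w in ["", "a", "ab", "aba", "abab", "b"]
       if nfa_accepts(NFA, 0, NFA_FINALS, w)])
```

No guessing is needed: every possible choice is followed at once, at the cost of manipulating sets of states for every input character.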

    Example NFA
      To process: abbb
      [Figure: NFA with states 1 to 5 and edges labeled a and b]

    Accepting abbb
      String to process    Set of states
      abbb                 {1} (containing start state)
      bbb                  {2,4}
      bb                   {2,3,5}
      b                    {2,3}
      (empty)              {2,3} (accepted: final state 3 ∈ {2,3})

    NFA versus DFA
      NFA
        used for creating from regexp
        bad for processing: sets are expensive!
      DFA
        used for processing
        turn NFA into DFA: "subset" construction
        use idea as in example: sets of states, do transitions immediately
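The subset construction can be sketched in a few lines: each DFA state is a set of NFA states, and the transitions between these sets are computed once, ahead of time. The sketch assumes an NFA without ε edges to stay short, and uses a hypothetical NFA for (a|b)*ab; all names are illustrative.

```python
def subset_construction(nfa, start, finals, alphabet):
    """Turn an ε-free NFA into a DFA whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for ch in alphabet:
            target = frozenset(t for s in current for t in nfa.get((s, ch), ()))
            if not target:
                continue              # no transition: implicit dead state
            dfa[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_finals = {states for states in seen if states & finals}
    return dfa, start_set, dfa_finals


def dfa_accepts(dfa, start, finals, text):
    state = start
    for ch in text:
        if (state, ch) not in dfa:
            return False
        state = dfa[(state, ch)]
    return state in finals


# Hypothetical NFA for (a|b)*ab: state 0 has two edges labeled a.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
DFA, DFA_START, DFA_FINALS = subset_construction(NFA, 0, {2}, "ab")

print(len({src for src, _ in DFA}), dfa_accepts(DFA, DFA_START, DFA_FINALS, "aab"))
```

Only sets that are actually reachable from the start set are generated, which usually keeps the DFA far smaller than the worst case of all subsets.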

    Summary
      Compilers
        translate from source to target language
        have frontend and backend
      Programs executed in Execution Environment
      Lexical analysis
        lexical structure: character stream to token stream
        specification: regular expressions
        computation: DFA
        transformation from regexp to DFA: NFA