2005-10-25 2G1508-L01, Christian Schulte
2G1508-L01
Introduction
Lexical Analysis
Christian Schulte
IMIT, KTH
www.imit.kth.se/~schulte/
Overview
- Organizational
- Course overview
- Compiler structure
- Lexical analysis
Organizational
Textbook
- Andrew W. Appel, Modern Compiler Implementation in Java, 2nd edition, Cambridge University Press, 2002.
Course committee (kursnämnd)
- Two volunteers needed!
Elect and Sign Up!
- Sign up on the list (most likely you'll have to write down all your details)
- Do not forget to elect the course
No labs
- There will be no labs this time
  - lab sessions are cancelled
- Lab part of course
  - three assignments (10 points each)
  - to be submitted
  - corrected by Mikael Lagerkvist
  - at least 15 points required to pass
  - points valid as bonus points on exam if submitted in time (only this academic year)
Examination
- course passed
  - labs passed
  - full exam 240 points
Course Overview
Reading Suggestion
- Chapters 1 and 2
Compiler and Execution Environments
- General question: how to execute program written in some high-level programming language
- Two aspects
  - compilation: transform into language good for execution
  - execution: execute program
Compiler
- Compiler translates program from one programming language into another
  - language compiled from: source language
  - language compiled to: target language
- Source language: for programming
  - examples: Java, C, C++, Oz, ...
- Target language: for execution
  - examples: assembler (x86, MIPS, ...), JVM code
Execution Environments
- Can be concrete hardware
  - how to manage memory
  - how to link and load programs
  - take advantage of architectural features
- Can be abstract machine
  - how to interpret abstract machine code efficiently
  - how to further compile at runtime
Compilation: Basic structure and tasks
Compilation Phases
- Frontend depends on source language
- Backend depends on target language
- Factorize dependencies
[Figure: source program -> frontend -> intermediate representation -> backend -> target program]
Frontend: Tasks
- Lexical analysis
  - how program is broken into tokens (words)
  - typical token classes: identifier, number, keywords, ...
  - creates token stream
- Syntax analysis
  - phrasal structure of program (sentences)
  - grammar rules describing how expressions, statements, etc. are formed
  - creates abstract syntax tree
- Semantic analysis
  - perform identifier analysis (scope), type checking, ...
  - creates intermediate representation trees
  - after that: canonicalize and clean up
Backend: Basic Tasks
- Optimization
  - reduce execution time and program size
  - typically independent of target architecture
  - intermediate and complex component: "midend"
- Instruction selection
  - which instruction for a certain abstract operation
- Register allocation
  - which variables are kept in which registers?
  - which variables go to memory?
  - more generic: memory allocation
- Code emission
Optimization
- Common subexpression elimination (CSE)
  - reuse intermediate results
- Dead-code elimination
  - remove code that can never be executed
- Strength reduction
  - make operations in loops cheaper: instead of multiplying with n, increment by n (array access)
- Constant/value propagation
  - propagate information on values of variables
- Code motion
  - move invariant code out of loops
- Many, many more, ...
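As a small illustration of strength reduction, the sketch below computes array offsets twice: once with a multiplication in every iteration, once with a running sum. The function names are hypothetical, and the rewrite is done by hand here; a compiler would apply it to its intermediate representation.

```python
def offsets_multiply(count, n):
    """Array offsets computed with one multiplication per iteration."""
    return [i * n for i in range(count)]


def offsets_strength_reduced(count, n):
    """Same offsets, with the multiplication replaced by a running sum."""
    result = []
    offset = 0
    for _ in range(count):
        result.append(offset)
        offset += n  # increment by n instead of multiplying with n
    return result
```

Both functions return the same list for any non-negative count; only the operation inside the loop differs.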
Lexical Analysis
Overall Structure
- Compiler has two main phases
  - analysis: understand program ("front end")
  - synthesis: put it together in different way ("back end")
- Analysis typically broken up into
  - lexical: break into words or "tokens"
  - syntax: parse phrase structure of program
  - semantic: calculate program's meaning
Lexical Analyzer
- Also: lexer
- Takes a stream of characters
- Produces a stream of tokens
  - names
  - keywords
  - punctuation marks
  - discards white space and comments
- Simple task
Lexical Tokens
- Sequence of characters treated as unit in grammar of programming language
- Programming language classifies tokens into finite set of token types
  - some tokens have semantic value attached (ID, NUM, ...)
- Punctuation tokens such as IF, VOID, RETURN
  - constructed from characters: reserved words
  - cannot be used as identifiers
- Non-tokens
  - comments, preprocessor directives, whitespace
Example Token Types
  ID       foo n14 last
  NUM      73 0 00 5151
  REAL     3.75 .2 1e23 5.5e-10
  IF       if
  COMMA    ,
  NOTEQ    !=
  LPAREN   (
  RPAREN   )
Example Program
float match0(char* s) {
  /* find a zero */
  if (!strncmp(s, "0.0", 3))
    return 0.;
}
Example Token Stream
  FLOAT ID(match0) LPAREN CHAR STAR ID(s) RPAREN LBRACE
  IF LPAREN BANG ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
  RETURN REAL(0.0) SEMI
  RBRACE EOF
Approach
- Specification of lexical tokens: regular expression (regexp)
- Implementation of lexer: deterministic finite automaton (DFA)
- Computing DFA from regexp: nondeterministic finite automaton (NFA)
Regular Expressions
- Language: set of strings
- String: finite sequence of symbols
  - symbols are taken from finite alphabet
- Example
  - language of primes: decimal digit strings representing prime numbers
  - alphabet is ASCII character set
- Regular expression: stands for set of strings
  - possibly infinite set
Regular Expressions
- Symbol a
  - denotes language just containing string a
- Alternation M|N
  - where M and N are regular expressions
  - string in language of M|N, if string in language of M or in language of N
- Concatenation M·N
  - where M and N are regular expressions
  - string in language of M·N, if concatenation of strings α and β such that α in language of M and β in language of N
Regular Expressions
- Epsilon ε
  - denotes language just containing the empty string
- Repetition M*
  - where M is regular expression
  - called Kleene closure
  - string in language of M*, if concatenation of zero or more strings in language of M
Regular Expression Examples
  a|b          {"a","b"}
  (a|b)a       {"aa","ba"}
  (ab)|ε       {"ab",""}
  ((a|b)a)*    {"","aa","ba","aaaa","aaba","baaa","baba",...}
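The example languages can be checked by brute force. The sketch below uses Python's `re` module as a stand-in for the textbook notation, which is an assumption of this example: ε is written as an empty alternative, and the helper `language` is a hypothetical name. It lists every string over {a, b} up to a length bound that a pattern matches in full.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=4):
    """All strings over `alphabet` up to `max_len` matched entirely by `pattern`."""
    regex = re.compile(pattern)
    words = []
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if regex.fullmatch(word):
                words.append(word)
    return words


print(language("a|b"))        # ['a', 'b']
print(language("(a|b)a"))     # ['aa', 'ba']
print(language("(ab)|"))      # ['', 'ab']
print(language("((a|b)a)*"))  # ['', 'aa', 'ba', 'aaaa', 'aaba', 'baaa', 'baba']
```

The last language is infinite; the bound `max_len` only cuts off the enumeration.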
Conventions
- Sometimes omit · or ε
  - ab means a·b
  - (a|) means (a|ε)
- Kleene closure binds tighter than concatenation
  - ab* means a(b)*
- Concatenation binds tighter than alternation
  - ab|c means (ab)|c
Lexical Specification Examples
- Even binary numbers
    (0|1)*0
- Strings of a's and b's with no consecutive a's
    b*(abb*)*(a|ε)
- Strings of a's and b's with consecutive a's
    (a|b)*aa(a|b)*
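These three specifications can be compared against a direct statement of each property. The sketch below is only a bounded brute-force check, not a proof; it uses Python's `re` syntax (ε written as an empty alternative), and the names `words` and `check` are made up for this example.

```python
import re
from itertools import product

EVEN_BINARY = re.compile(r"(0|1)*0")
NO_CONSECUTIVE_AS = re.compile(r"b*(abb*)*(a|)")
CONSECUTIVE_AS = re.compile(r"(a|b)*aa(a|b)*")


def words(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def check(max_len=8):
    """Compare each regexp with the property it is meant to describe."""
    for w in words("01", max_len):
        # A binary numeral is even exactly when its last digit is 0.
        assert bool(EVEN_BINARY.fullmatch(w)) == w.endswith("0")
    for w in words("ab", max_len):
        has_aa = "aa" in w
        assert bool(NO_CONSECUTIVE_AS.fullmatch(w)) == (not has_aa)
        assert bool(CONSECUTIVE_AS.fullmatch(w)) == has_aa
    return True
```

Calling `check()` raises an `AssertionError` if any string up to the bound is classified differently by the regexp and by the plain property.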
Abbreviations
- [abcd] means a | b | c | d
- [b-g] means [bcdefg]
- [a-cA-C01] means [abcABC01]
- M? means (M|ε)
- M+ means (MM*)
- . any character but newline
- "xyz+-*" stands for itself
Programming Language Token Specifications
  if                                    IF
  [a-z][a-z0-9]*                        ID
  [0-9]+                                NUM
  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   REAL
  (" "|"\t"|"\n"|"\r")                  no token
  .                                     error
- Lexical specification needs to be complete
Disambiguation
- Does if8 match ID or IF NUM(8)?
- Disambiguation rules commonly used
  - longest match: longest initial substring that can match any regexp is token
  - rule priority: for particular longest initial substring, first matched regexp determines token-type; order is significant
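The two disambiguation rules can be tried out with a small lexer. This is a minimal sketch that leans on Python's `re` module instead of a DFA; the rule table follows the token specifications in these slides, while `WS`, `tokenize`, and the error handling are assumptions of the example.

```python
import re

# Rules in priority order: on equal match length the earlier rule wins.
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("REAL", r"([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)"),
    ("WS", r"[ \t\n\r]"),  # white space: matched but produces no token
]
COMPILED = [(name, re.compile(rx)) for name, rx in RULES]


def tokenize(text):
    """Split `text` into (token type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Longest match: only a strictly longer match replaces the best.
            # Rule priority: on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            # Plays the role of the catch-all "." error rule.
            raise ValueError(f"illegal character {text[pos]!r} at {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("if8 if 3.75"))
# [('ID', 'if8'), ('IF', 'if'), ('REAL', '3.75')]
```

Here `if8` becomes a single ID because of longest match, while plain `if` matches both IF and ID with the same length and is resolved by rule priority.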
Finite Automata
- Regular expressions for specification
- Finite automata for implementation
- Finite automaton has
  - finite set of states
  - edges leading from state to state, labeled with symbol
  - one start state
  - set of final states
Finite Automaton for IF
- Start state: 1
- Final states: 3
[Figure: 1 --i--> 2 --f--> 3]
Finite Automaton for ID
- Start state: 1
- Final states: 2
[Figure: 1 --a-z--> 2, with a loop on state 2 labeled a-z and 0-9]
Finite Automata
- Deterministic finite automaton (DFA)
  - no two edges leaving the same state are labeled with the same symbol
- Otherwise: nondeterministic finite automaton (NFA)
Accepted Language
- DFA accepts or rejects a string
  - start from start state
  - for each input character, follow exactly one edge according to next character to next state
  - no edge exists: reject
  - after n transitions for an n character string: if in final state, accept string, otherwise reject
- Language accepted by DFA
  - set of accepted strings
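The acceptance procedure can be sketched in a few lines. The example encodes the ID automaton from the earlier slide (start state 1, final state 2) as a dictionary; the dictionary encoding and the name `accepts` are choices made for this illustration.

```python
import string

START = 1
FINAL = {2}

# Transition function of the ID automaton: (state, symbol) -> next state.
DELTA = {}
for ch in string.ascii_lowercase:
    DELTA[(1, ch)] = 2  # first character must be a letter
    DELTA[(2, ch)] = 2
for ch in string.digits:
    DELTA[(2, ch)] = 2  # digits are allowed after the first character


def accepts(word):
    """Run the DFA on `word` and report whether it ends in a final state."""
    state = START
    for ch in word:
        if (state, ch) not in DELTA:
            return False  # no edge exists: reject
        state = DELTA[(state, ch)]
    return state in FINAL


print(accepts("n14"))  # True
print(accepts("14n"))  # False
```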
Example DFA
- How does accepting a string work?
[Figure: DFA with states 1 to 5 and edges labeled a and b]
Accepting abab
  String to process   State
  abab                1 (start state)
  bab                 4
  ab                  5
  b                   4
  (empty)             5, accept: final state!
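The trace can be replayed in code. The sketch below contains only the three edges that the trace itself confirms (1 to 4 on a, 4 to 5 on b, 5 to 4 on a) and final state 5; the remaining edges of the diagram are not recoverable from the text and are left out, so this is a partial automaton, and `trace` is a hypothetical helper name.

```python
# Only the edges visited while processing "abab"; the full diagram has more.
EDGES = {(1, "a"): 4, (4, "b"): 5, (5, "a"): 4}
FINAL_STATES = {5}


def trace(word, start=1):
    """Return the (remaining input, state) pairs and the accept decision."""
    state = start
    steps = [(word, state)]
    for i, ch in enumerate(word):
        state = EDGES[(state, ch)]  # KeyError for edges not listed above
        steps.append((word[i + 1:], state))
    return steps, state in FINAL_STATES


steps, accepted = trace("abab")
for remaining, state in steps:
    print(repr(remaining), state)
print("accept" if accepted else "reject")  # accept
```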
Combining DFAs
- Formally: little later
- Idea: label final states of each DFA with token-type it accepts
  - watch out for rule priority: label according to priority
- Implement as transition matrix
  - (state number, character) -> state number
  - final states: bitvector, etc.
  - dead state: for no transition
Recognizing Longest Match
- Keep track of longest match so far
- Remember
  - last final state
  - position in string when at last final state
- When dead state entered
  - last final state: which token matched
  - position: where matching ended, where to start for next token
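A table-driven lexer with longest match might look as follows. This is a sketch under assumptions: the combined DFA was built by hand for IF, ID, NUM and a simplified REAL rule (digits, a dot, then at least one digit, which differs from the REAL rule in these slides), the state numbers are arbitrary, and state 0 is used as the dead state.

```python
DEAD, START, NUM_STATES = 0, 1, 8
LETTERS = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"

# Transition matrix: TABLE[state][character code] -> state.
# Every entry starts as the dead state, meaning "no transition".
TABLE = [[DEAD] * 128 for _ in range(NUM_STATES)]


def edge(src, chars, dst):
    for ch in chars:
        TABLE[src][ord(ch)] = dst


edge(1, LETTERS, 4); edge(1, "i", 2); edge(1, DIGITS, 5)
edge(2, LETTERS + DIGITS, 4); edge(2, "f", 3)
edge(3, LETTERS + DIGITS, 4)
edge(4, LETTERS + DIGITS, 4)
edge(5, DIGITS, 5); edge(5, ".", 6)
edge(6, DIGITS, 7)  # state 6 is not final: "3." alone is no token here
edge(7, DIGITS, 7)

# Final states labeled with token types. State 3 accepts both IF and ID;
# rule priority labels it IF.
FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM", 7: "REAL"}


def next_token(text, pos):
    """Return (token type, lexeme, position after the token)."""
    state, i = START, pos
    last_final, last_final_pos = None, pos
    while i < len(text) and ord(text[i]) < 128:
        state = TABLE[state][ord(text[i])]
        if state == DEAD:
            break
        i += 1
        if state in FINAL:  # remember last final state and its position
            last_final, last_final_pos = FINAL[state], i
    if last_final is None:
        raise ValueError(f"illegal character {text[pos]!r} at {pos}")
    return last_final, text[pos:last_final_pos], last_final_pos


print(next_token("if8+", 0))  # ('ID', 'if8', 3)
print(next_token("3.x", 0))   # ('NUM', '3', 1)
```

In the second call the automaton reads past the dot into a non-final state before hitting the dead state, then falls back to the remembered final state and position.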
Nondeterministic Finite Automata
NFAs
- NFA can have multiple edges for same symbol
- NFA can have edges labeled with ε
  - follow edge without eating any symbol
- How to accept?
  - guessing is difficult to implement
  - use trick: maintain all states that so far could have been reached!
Example NFA
- To process: abbb
[Figure: NFA with states 1 to 5 and edges labeled a and b]
Accepting abbb
  String to process   Set of states
  abbb                {1} (containing start state)
  bbb                 {2,4}
  bb                  {2,3,5}
  b                   {2,3}
  (empty)             {2,3}, accepted: final state 3 is in {2,3}
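The set-of-states trick can be sketched directly. The NFA below is a hypothetical one chosen for this example, not the automaton from the slide (whose edges cannot be recovered from the text): it accepts a*b or ab*, with two ε edges out of the start state. ε is represented as `None`.

```python
EPSILON = None

# Hypothetical NFA for a*b | ab*: (state, symbol) -> set of next states.
EDGES = {
    (1, EPSILON): {2, 4},  # choose one of the two sub-automata
    (2, "a"): {2},
    (2, "b"): {3},
    (4, "a"): {5},
    (5, "b"): {5},
}
START = 1
FINAL = {3, 5}


def closure(states):
    """All states reachable from `states` using only epsilon edges."""
    result = set(states)
    todo = list(states)
    while todo:
        state = todo.pop()
        for nxt in EDGES.get((state, EPSILON), ()):
            if nxt not in result:
                result.add(nxt)
                todo.append(nxt)
    return result


def step(states, symbol):
    """States reachable from `states` by eating `symbol`, then epsilon edges."""
    moved = set()
    for state in states:
        moved |= EDGES.get((state, symbol), set())
    return closure(moved)


def accepts(word):
    current = closure({START})
    for symbol in word:
        current = step(current, symbol)
    # Accept if at least one state that could have been reached is final.
    return bool(current & FINAL)


print(accepts("aab"), accepts("abb"), accepts("ba"))  # True True False
```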
NFA versus DFA
- NFA
  - used for creating from regexp
  - bad for processing: sets are expensive!
- DFA
  - used for processing
  - turn NFA into DFA: "subset" construction
  - use idea as in example: sets of states, do transitions immediately
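A compact version of the subset construction is sketched below. It assumes a small hypothetical NFA for a*b | ab* (ε written as `None`), not an automaton from the slides, and represents each DFA state as a frozenset of NFA states. The empty set would be the dead state and is left implicit.

```python
EPSILON = None
ALPHABET = "ab"
NFA_EDGES = {
    (1, EPSILON): {2, 4},
    (2, "a"): {2},
    (2, "b"): {3},
    (4, "a"): {5},
    (5, "b"): {5},
}
NFA_START, NFA_FINAL = 1, {3, 5}


def closure(states):
    """All NFA states reachable from `states` using only epsilon edges."""
    result, todo = set(states), list(states)
    while todo:
        for nxt in NFA_EDGES.get((todo.pop(), EPSILON), ()):
            if nxt not in result:
                result.add(nxt)
                todo.append(nxt)
    return result


def move(states, symbol):
    """NFA states reachable from `states` by one edge labeled `symbol`."""
    target = set()
    for state in states:
        target |= NFA_EDGES.get((state, symbol), set())
    return target


def subset_construction():
    """Return (start state, edges, final states) of the equivalent DFA."""
    start = frozenset(closure({NFA_START}))
    edges, seen, todo = {}, {start}, [start]
    while todo:
        current = todo.pop()
        for symbol in ALPHABET:
            target = frozenset(closure(move(current, symbol)))
            if not target:
                continue  # empty set: dead state, no edge recorded
            edges[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    finals = {state for state in seen if state & NFA_FINAL}
    return start, edges, finals


def dfa_accepts(word):
    state, edges, finals = subset_construction()
    for symbol in word:
        if (state, symbol) not in edges:
            return False
        state = edges[(state, symbol)]
    return state in finals
```

All set operations happen once, while the DFA is built; processing a string afterwards is a plain table lookup per character.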
Summary
- Compilers
  - translate from source to target language
  - have frontend and backend
- Programs executed in Execution Environment
- Lexical analysis
  - lexical structure: character stream to token stream
  - specification: regular expressions
  - computation: DFA
  - transformation from regexp to DFA: NFA