Upload
vananh
View
218
Download
0
Embed Size (px)
Citation preview
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Convert source file characters into token stream.
Remove content-free characters (comments, whitespace, ...)
Detect lexical errors (badly-formed literals, illegal characters, ...)
Output of lexical analysis is input to syntax analysis.
Could just do lexical analysis as part of syntax analysis.
But choose to handle separately for better modularity and portability, and to allow make syntax analysis easier.
Idea: Look for patterns in input character sequence, convert to tokens with attributes, and pass them to parser in stream.
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 2
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
What do we want to do? Example:
if (i == j)
Z = 0;
else
Z = 1;
The input is just a string of characters:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Goal: Partition input string into substrings Where the substrings are tokens
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 3
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
A syntactic category In English:
noun, verb, adjective, …
In a programming language:
Identifier, Integer, Keyword, Whitespace, …
Tokens correspond to sets of strings.
Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: “else” or “if” or “begin” or …
Whitespace: a non-empty sequence of blanks, newlines, and tabs
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 4
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Classify program substrings according to role
Output of lexical analysis is a stream of tokens . . .
. . . which is input to the parser
Parser relies on token distinctions An identifier is treated differently than a keyword
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 5
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Define a finite set of tokens Tokens describe all items of interest
Choice of tokens depends on language, design of parser
Example:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Useful tokens for this expression:
Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;
N.B., (, ), =, ; are tokens, not characters, here
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 6
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Describe which strings belong to each token
Recall: Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: “else” or “if” or “begin” or …
Whitespace: a non-empty sequence of blanks, newlines, and tabs
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 7
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
An implementation must do three things:
1. Recognize substrings corresponding to tokens
2. Return the value or lexeme of the token A lexeme is the string in the input that actually matched the pattern for some
token.
The lexeme is the substring
3. Identify attributes
Attributes represent lexemes converted to a more useful form
Eg. strings, symbols (like strings, but perhaps handled separately), numbers (integers, reals, ...), enumerations
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 8
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
The lexer usually discards “uninteresting” tokens that don’t contribute to parsing.
Examples: Whitespace (spaces, tabs, new lines, ...)
Comments
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 9
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Source code:
if x>17 then count:= 2
else (* oops !*) print "bad!“
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 10
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Lexeme Token Attribute
if IF
x ID "x"
> RELOP GT
17 NUM 17
then THEN
count ID "count"
:= ASSIGN
2 NUM 2
else ELSE
print PRINT
"bad!" STRING "bad!"
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 11
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
A way to describe the lexemes of each token
A way to resolve ambiguities
is if two variables i and f?
Is == two equal signs = =?
Example:
How can we formalize this pattern description?
“An identifier is a letter followed by any number of letters or digits.”
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 12
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Exactly what is a letter?
LETTER → a | b | c | d | e | f | g | h | i | j
| k | l | m | n | o | p | q | r | s | t
| u | v | w | x | y | z | A | B | C | D
| E | F | G | H | I | J | K | L | M | N
| O | P | Q | R | S | T | U | V | W | X
| Y | Z
Exactly what is a digit?
DIGIT → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 13
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
How can we express “letters or digits” ?
LORD → LETTER | DIGIT
How can we express “any number of” ?
LORDS → LORD*
How can we express “followed by” ?
IDENT → LETTER LORDS
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 14
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
A regular expression (R.E.) is a concise formal characterization and standard notation of a regular language (or regular set).
Example: The regular language containing all IDENTs is described by the regular expression
letter (letter | digit)*
where “| ” means “or” and “e*” means “zero or more copies of e.”
Regular languages are one particular kind of formal languages.
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 15
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
An alphabet is a set of symbols (e.g., the ASCII character set).
A language over an alphabet is a set of strings of symbols from that alphabet.
We write ϵ for the empty string (containing zero characters); some authors use λ instead.
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 16
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
If x and y are strings, then the concatenation xy is the string consisting of the the characters of x followed by the characters of y.
If L and M are languages, then their concatenation LM = {xy | x ∈ L, y ∈ M}.
The exponentiation of a language L is defined thus: L0 = {ϵ}, the language containing just the empty string, and Li = Li−1L for i > 0.
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 17
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Each R.E. over an alphabet ∑ denotes a regular language over ∑, according to the following inductive definition:
Base rules:
The R.E. ϵ denotes {ϵ }.
For each a ∈ ∑, the R.E. a denotes {a}, the language containing the single string containing just a.
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 18
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Inductive rules:
If the R.E. R denotes LR and the R.E. S denotes LS, then
R| S denotes LR ∪ LS.
R · S (or just RS) denotes LRLS.
R∗ denotes (LR)*, the “Kleene closure” (the concatenation of zero or more strings from LR).
(R) denotes LR.
Precedence rules: () before ∗ before · before | .
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 19
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Over alphabet {a, b}
a∗ zero or more a’s
(a|b)∗ all strings of a’s and b’s of length ≥ 0
(a∗b∗)∗ all strings of a’s and b’s of length ≥ 0
(aa|ab|ba|bb)∗ all strings of a’s and b’s of even length
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 20
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Consider [email protected]
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 21
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Give names to R.E.’s and then use these as a shorthand.Must avoid recursive definitions!
Examples: ID → letter (letter| digit)*
NUM → digit (digit)* or this shorthand: digit+
RELOP → < | > | <= | >= | = or: <(ϵ | =)| >(ϵ | =)| = ASSGN → := STRING → "(nonquote)*" letter → a| b| . . . | z| A| . . . | Z or this shorthand: [a-zA-Z] digit → 0| 1| 2| 3| 4| 5| 6| 7| 8| 9 or this shorthand: [0-9] nonquote → letter| digit| !| $| %| . . .
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 22
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau
Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 23
Jurusan Teknik Informatika
Fakultas Teknik Universitas Islam Riau