23
Jurusan Teknik Informatika Fakultas Teknik Universitas Islam Riau

Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

  • Upload
    vananh

  • View
    218

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 2: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Convert source file characters into token stream.

Remove content-free characters (comments, whitespace, ...)

Detect lexical errors (badly-formed literals, illegal characters, ...)

Output of lexical analysis is input to syntax analysis.

Could just do lexical analysis as part of syntax analysis.

But choose to handle separately for better modularity and portability, and to allow make syntax analysis easier.

Idea: Look for patterns in input character sequence, convert to tokens with attributes, and pass them to parser in stream.

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 2

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 3: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

What do we want to do? Example:

if (i == j)

Z = 0;

else

Z = 1;

The input is just a string of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Goal: Partition input string into substrings Where the substrings are tokens

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 3

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 4: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

A syntactic category In English:

noun, verb, adjective, …

In a programming language:

Identifier, Integer, Keyword, Whitespace, …

Tokens correspond to sets of strings.

Identifier: strings of letters or digits, starting with a letter

Integer: a non-empty string of digits

Keyword: “else” or “if” or “begin” or …

Whitespace: a non-empty sequence of blanks, newlines, and tabs

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 4

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 5: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Classify program substrings according to role

Output of lexical analysis is a stream of tokens . . .

. . . which is input to the parser

Parser relies on token distinctions An identifier is treated differently than a keyword

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 5

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 6: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Define a finite set of tokens Tokens describe all items of interest

Choice of tokens depends on language, design of parser

Example:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Useful tokens for this expression:

Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

N.B., (, ), =, ; are tokens, not characters, here

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 6

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 7: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Describe which strings belong to each token

Recall: Identifier: strings of letters or digits, starting with a letter

Integer: a non-empty string of digits

Keyword: “else” or “if” or “begin” or …

Whitespace: a non-empty sequence of blanks, newlines, and tabs

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 7

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 8: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

An implementation must do three things:

1. Recognize substrings corresponding to tokens

2. Return the value or lexeme of the token A lexeme is the string in the input that actually matched the pattern for some

token.

The lexeme is the substring

3. Identify attributes

Attributes represent lexemes converted to a more useful form

Eg. strings, symbols (like strings, but perhaps handled separately), numbers (integers, reals, ...), enumerations

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 8

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 9: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

The lexer usually discards “uninteresting” tokens that don’t contribute to parsing.

Examples: Whitespace (spaces, tabs, new lines, ...)

Comments

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 9

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 10: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Source code:

if x>17 then count:= 2

else (* oops !*) print "bad!“

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 10

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 11: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Lexeme Token Attribute

if IF

x ID "x"

> RELOP GT

17 NUM 17

then THEN

count ID "count"

:= ASSIGN

2 NUM 2

else ELSE

print PRINT

"bad!" STRING "bad!"

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 11

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 12: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

A way to describe the lexemes of each token

A way to resolve ambiguities

is if two variables i and f?

Is == two equal signs = =?

Example:

How can we formalize this pattern description?

“An identifier is a letter followed by any number of letters or digits.”

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 12

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 13: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Exactly what is a letter?

LETTER → a | b | c | d | e | f | g | h | i | j

| k | l | m | n | o | p | q | r | s | t

| u | v | w | x | y | z | A | B | C | D

| E | F | G | H | I | J | K | L | M | N

| O | P | Q | R | S | T | U | V | W | X

| Y | Z

Exactly what is a digit?

DIGIT → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 13

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 14: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

How can we express “letters or digits” ?

LORD → LETTER | DIGIT

How can we express “any number of” ?

LORDS → LORD*

How can we express “followed by” ?

IDENT → LETTER LORDS

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 14

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 15: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

A regular expression (R.E.) is a concise formal characterization and standard notation of a regular language (or regular set).

Example: The regular language containing all IDENTs is described by the regular expression

letter (letter | digit)*

where “| ” means “or” and “e*” means “zero or more copies of e.”

Regular languages are one particular kind of formal languages.

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 15

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 16: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

An alphabet is a set of symbols (e.g., the ASCII character set).

A language over an alphabet is a set of strings of symbols from that alphabet.

We write ϵ for the empty string (containing zero characters); some authors use λ instead.

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 16

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 17: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

If x and y are strings, then the concatenation xy is the string consisting of the the characters of x followed by the characters of y.

If L and M are languages, then their concatenation LM = {xy | x ∈ L, y ∈ M}.

The exponentiation of a language L is defined thus: L0 = {ϵ}, the language containing just the empty string, and Li = Li−1L for i > 0.

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 17

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 18: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Each R.E. over an alphabet ∑ denotes a regular language over ∑, according to the following inductive definition:

Base rules:

The R.E. ϵ denotes {ϵ }.

For each a ∈ ∑, the R.E. a denotes {a}, the language containing the single string containing just a.

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 18

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 19: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Inductive rules:

If the R.E. R denotes LR and the R.E. S denotes LS, then

R| S denotes LR ∪ LS.

R · S (or just RS) denotes LRLS.

R∗ denotes (LR)*, the “Kleene closure” (the concatenation of zero or more strings from LR).

(R) denotes LR.

Precedence rules: () before ∗ before · before | .

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 19

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 20: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Over alphabet {a, b}

a∗ zero or more a’s

(a|b)∗ all strings of a’s and b’s of length ≥ 0

(a∗b∗)∗ all strings of a’s and b’s of length ≥ 0

(aa|ab|ba|bb)∗ all strings of a’s and b’s of even length

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 20

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 21: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Consider [email protected]

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 21

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 22: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Give names to R.E.’s and then use these as a shorthand.Must avoid recursive definitions!

Examples: ID → letter (letter| digit)*

NUM → digit (digit)* or this shorthand: digit+

RELOP → < | > | <= | >= | = or: <(ϵ | =)| >(ϵ | =)| = ASSGN → := STRING → "(nonquote)*" letter → a| b| . . . | z| A| . . . | Z or this shorthand: [a-zA-Z] digit → 0| 1| 2| 3| 4| 5| 6| 7| 8| 9 or this shorthand: [0-9] nonquote → letter| digit| !| $| %| . . .

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 22

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau

Page 23: Jurusan Teknik Informatika Fakultas Teknik Universitas ...frdaus/PenelusuranInformasi/File-Pdf/Lecture 3.pdfA syntactic category In English: noun, verb, adjective, … In a programming

Arbi Haza Nasution, B.IT(Hons), M.IT - Teknik Kompilasi Sem Ganjil 2013 23

Jurusan Teknik Informatika

Fakultas Teknik Universitas Islam Riau