Compiler Design Unit 2. LEXICAL ANALYSIS

UNIT 2: LEXICAL ANALYSIS

Sadique Nayeem, Asst. Professor, Dept. of CSE

Sitamarhi Institute of Technology, Sitamarhi

Lexical Analysis

Being the first phase of a compiler, the main task of the lexical analyzer is to:

Read the input characters of the source program,

Group them into lexemes, and

Produce as output a sequence of tokens for each lexeme in the source program.

The stream of tokens is sent to the parser for syntax analysis.

It is common for the lexical analyzer to interact with the symbol table as well.

Another task of the lexical analyzer is stripping out comments and whitespace (blank, newline, tab).

Another task is correlating error messages generated by the compiler with the source program.

getNextToken

Commonly, the interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
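The parser-driven interaction described above can be sketched roughly as follows. This is a minimal illustration, not the slides' implementation: the Token type, the TOKEN_SPEC patterns, the Lexer class, and the stand-in parse function are all assumed names chosen for the example.

```python
import re
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    name: str        # abstract token name, e.g. "id" or "number"
    attribute: str   # attribute value (here simply the lexeme)


# Hypothetical token patterns, tried in this order (an assumption for the sketch).
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"==|[-+*/=<>]"),
    ("punct",  r"[();{},]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))


class Lexer:
    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0

    def get_next_token(self) -> Optional[Token]:
        """Read characters until the next lexeme is identified and return its token."""
        # Whitespace is stripped here and never reaches the parser.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.source):
            return None  # end of input
        m = MASTER.match(self.source, self.pos)
        if m is None:
            raise SyntaxError(f"no token pattern matches at offset {self.pos}")
        self.pos = m.end()
        return Token(m.lastgroup, m.group())


def parse(source: str) -> List[Token]:
    """Stand-in for a parser: it only pulls tokens on demand and collects them."""
    lexer = Lexer(source)
    tokens = []
    while (tok := lexer.get_next_token()) is not None:
        tokens.append(tok)
    return tokens


tokens = parse("c = a + b;")
```

The point of the design is that the lexer does no work until the parser asks for a token, so the two phases can run interleaved over a single pass of the input.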

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
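The scanning step in (a) might look like the sketch below. The function name scan is an assumption, and the sketch handles only C-style comments. It deliberately ignores string literals, so a comment marker inside a string would be treated as a comment; a real scanner has to track that.

```python
import re


def scan(source: str) -> str:
    """Delete comments and compact runs of whitespace into a single blank."""
    # Remove /* ... */ comments; DOTALL lets them span several lines.
    without_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # Remove // comments up to the end of the line.
    without_line = re.sub(r"//[^\n]*", " ", without_block)
    # Compact consecutive whitespace (blank, newline, tab) into one blank.
    return re.sub(r"\s+", " ", without_line).strip()


cleaned = scan("int a = 10;   /* init */\n\tc = a + b; // sum")
```

The cleaned text would then be handed to lexical analysis proper, which only has to deal with single blanks between lexemes.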

All programs have:

Keywords

Operators

Identifiers

Constants (number and strings)

Punctuation marks

A token is a pair consisting of a token name and an optional attribute value.

The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.

The token names are the input symbols that the parser processes.

Pattern

A pattern is a description of the form that the lexemes of a token may take.

In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. (Example: if)

For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings. (Example: age)

Lexeme

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
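To make the three terms concrete, here is a small sketch that maps a lexeme to a token name by checking it against patterns. The keyword set and the identifier pattern are assumptions chosen for illustration, not a complete description of C.

```python
import re

# Assumed keyword set: the pattern for each keyword is just its own spelling.
KEYWORDS = {"if", "else", "int", "void"}

# Assumed identifier pattern: a letter or underscore, then letters, digits, underscores.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def classify(lexeme: str) -> str:
    """Return the token name whose pattern the lexeme matches."""
    if lexeme in KEYWORDS:
        return lexeme          # keyword tokens are named after the keyword
    if ID_PATTERN.fullmatch(lexeme):
        return "id"            # many different lexemes share the token name id
    raise ValueError(f"{lexeme!r} matches no pattern in this sketch")


examples = {lexeme: classify(lexeme) for lexeme in ["if", "age", "fi"]}
```

Here the lexeme if matches the keyword pattern, while age and fi both match the identifier pattern and so become instances of the same token id.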

#include <stdio.h>
void main()
{
    printf("SIT, Sitamarhi");
}

#include <stdio.h>
void main()
{
    int a = 10, b = 20, c;
    c = a + b;
    printf("%d", c);
}

Examples of Tokens

GATE 2000

printf("i = %d, &i = %x", i, &i);
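Assuming the exercise asks for the number of tokens in this statement, the count can be checked with a small tokenizer. The patterns below are an assumed, simplified subset of C (string literals, identifiers, numbers, single-character operators, punctuation), written only for this example.

```python
import re

# Assumed simplified C token patterns, tried in this order.
TOKEN_SPEC = [
    ("skip",   r"\s+"),
    ("string", r'"(?:\\.|[^"\\])*"'),   # a whole string literal is one token
    ("id",     r"[A-Za-z_]\w*"),
    ("number", r"\d+"),
    ("op",     r"[&=+\-*/%<>!]"),
    ("punct",  r"[(),;{}]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))


def tokenize(source: str):
    """Return a list of (token name, lexeme) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"no token pattern matches at offset {pos}")
        pos = m.end()
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
    return tokens


statement = 'printf("i = %d, &i = %x", i, &i);'
tokens = tokenize(statement)
print(len(tokens))  # prints 10
```

The ten tokens are printf, (, the string literal, the first comma, i, the second comma, &, i, ), and the semicolon. The characters inside the quotes do not count separately because the whole literal is a single lexeme.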

Lexical Errors

These errors are mainly spelling mistakes and the accidental insertion of foreign characters that the language does not allow.

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.

For instance, if the string fi is encountered for the first time in a C program in the context:

fi ( a == 10 )

A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.

Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.

The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.
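Panic-mode recovery can be sketched as follows. The token patterns and the function name tokenize_with_panic_mode are assumptions for the example. The sketch records each deleted run of characters so that an error message can still be tied back to a source offset.

```python
import re

# Assumed token patterns for the sketch.
MASTER = re.compile(
    r"(?P<number>\d+)|(?P<id>[A-Za-z_]\w*)|(?P<op>[-+*/=])|(?P<punct>[();])"
)


def tokenize_with_panic_mode(source: str):
    """Return (tokens, errors); errors holds (offset, deleted text) pairs."""
    tokens, errors, pos = [], [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = MASTER.match(source, pos)
        if m is not None:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
            continue
        # Panic mode: delete successive characters until a well-formed token
        # (or skippable whitespace) starts at the beginning of what is left.
        start = pos
        while (pos < len(source)
               and not source[pos].isspace()
               and MASTER.match(source, pos) is None):
            pos += 1
        errors.append((start, source[start:pos]))
    return tokens, errors


tokens, errors = tokenize_with_panic_mode("a = 10 @# b;")
```

For the input above, the characters @# match no pattern, so they are deleted and reported, and tokenization resumes at b. The other four recovery actions listed would instead try a single-character repair before giving up on the input.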

Specification of Tokens

Alphabet

String

Language

Operations on Languages (U, ., *, +)

Kleene Closure and Positive Closure

Regular Expression

Regular Definitions

Transition Diagram

Transition Table

Finite Automata

ε-Closure

RE to ε-NFA

ε-NFA to NFA

NFA to DFA

DFA Minimization

RE to DFA

A regular expression can be represented by its syntax tree, where the leaves correspond to operands and the interior nodes correspond to operators.

An interior node is called a cat-node, or-node, or star-node if it is labeled by the concatenation operator (dot), union operator |, or star operator *, respectively.

Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we attach a unique integer.

We refer to this integer as the position of the leaf and also as a position of its symbol.
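One way to build such a tree is a small recursive-descent parser. The sketch below is an illustration under assumptions: the nested-tuple node encoding and the function name build_syntax_tree are invented for this example, and the input is expected to be well formed, using only |, juxtaposition for concatenation, *, and parentheses.

```python
from itertools import count


def build_syntax_tree(regex: str):
    """Parse a regular expression into nested tuples.

    Nodes are ("leaf", symbol, position), ("or", left, right),
    ("cat", left, right) or ("star", child). Leaves labeled ε get
    position None; all other leaves are numbered 1, 2, 3, ... left to right.
    """
    positions = count(1)
    i = 0  # index of the next unread character

    def peek():
        return regex[i] if i < len(regex) else None

    def expr():            # expr -> term ('|' term)*
        nonlocal i
        node = term()
        while peek() == "|":
            i += 1
            node = ("or", node, term())
        return node

    def term():            # term -> factor factor*
        node = factor()
        while peek() is not None and peek() not in "|)":
            node = ("cat", node, factor())
        return node

    def factor():          # factor -> atom '*'*
        nonlocal i
        node = atom()
        while peek() == "*":
            i += 1
            node = ("star", node)
        return node

    def atom():            # atom -> '(' expr ')' | symbol
        nonlocal i
        ch = peek()
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected {ch!r} at offset {i}")
        i += 1
        if ch == "(":
            node = expr()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            i += 1
            return node
        if ch == "ε":
            return ("leaf", "ε", None)
        return ("leaf", ch, next(positions))

    return expr()


tree = build_syntax_tree("a(a|b)*#")
```

For a(a|b)*# the leaves a, a, b, # receive positions 1, 2, 3, 4, and the root is a cat-node whose right child is the endmarker leaf.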

Construct the syntax tree for each of the following:

a(a|b)*#

(a|b)c*#

(a|b)(a|b)#

(a|b)*(a|b)#

Functions Computed From the Syntax Tree

To construct a DFA directly from a regular expression, we construct its syntax tree and then compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each definition refers to the syntax tree for a particular augmented regular expression ( r ) #.

1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented by n has ε in its language. That is, the subexpression can be "made null" or the empty string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the first symbol of a string can come.)

3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the last symbol of a string can come.)

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree that can immediately follow p: there is some string in the language of ( r ) # in which one symbol is matched to position p and the very next symbol is matched to position q.

Computing nullable, firstpos, and lastpos

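For a syntax-tree node n with children c1 and c2 (or a single child c1):

Leaf labeled ε: nullable(n) = true; firstpos(n) = ∅; lastpos(n) = ∅

Leaf with position i: nullable(n) = false; firstpos(n) = {i}; lastpos(n) = {i}

Or-node n = c1 | c2: nullable(n) = nullable(c1) or nullable(c2); firstpos(n) = firstpos(c1) U firstpos(c2); lastpos(n) = lastpos(c1) U lastpos(c2)

Cat-node n = c1 c2: nullable(n) = nullable(c1) and nullable(c2); firstpos(n) = if (nullable(c1)) firstpos(c1) U firstpos(c2) else firstpos(c1); lastpos(n) = if (nullable(c2)) lastpos(c1) U lastpos(c2) else lastpos(c2)

Star-node n = c1*: nullable(n) = true; firstpos(n) = firstpos(c1); lastpos(n) = lastpos(c1)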

Computing followpos
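In the standard formulation, only two kinds of node contribute to followpos. For a cat-node with children c1 and c2, every position in firstpos(c2) is added to followpos(i) for each i in lastpos(c1). For a star-node n, every position in firstpos(n) is added to followpos(i) for each i in lastpos(n). A rough sketch that computes all four functions in one bottom-up pass is given below. The nested-tuple tree encoding and the function name analyze are assumptions, and the example tree for (a|b)*abb# is chosen for illustration.

```python
from collections import defaultdict


def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) for node and fill in followpos."""
    kind = node[0]
    if kind == "leaf":
        _, symbol, pos = node
        if symbol == "ε":
            return True, frozenset(), frozenset()
        return False, frozenset({pos}), frozenset({pos})
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                      # cat rule for followpos
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f1, l1 = analyze(node[1], followpos)
        for i in l1:                      # star rule for followpos
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind {kind!r}")


def leaf(symbol, pos):
    return ("leaf", symbol, pos)


# Syntax tree for (a|b)*abb# with positions 1..6 (assumed example).
tree = ("cat",
        ("cat",
         ("cat",
          ("cat", ("star", ("or", leaf("a", 1), leaf("b", 2))), leaf("a", 3)),
          leaf("b", 4)),
         leaf("b", 5)),
        leaf("#", 6))

followpos = defaultdict(set)
nullable, firstpos, lastpos = analyze(tree, followpos)
```

For this tree the root has firstpos {1,2,3} and lastpos {6}, and followpos comes out as 1 and 2 mapping to {1,2,3}, 3 to {4}, 4 to {5}, and 5 to {6}. Position 6, the endmarker, has no followpos entry.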

Converting a Regular Expression Directly to a DFA

Step 1. Construct a syntax tree T from the augmented regular expression ( r ) #.

Step 2. Compute nullable, firstpos, lastpos, and followpos for T.

Step 3. Construct Dstates (the set of states of DFA D) and Dtran (the transition function for D) using the following procedure.

The states of D are sets of positions in T.

Initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions.

The start state of D is firstpos(n0), where node n0 is the root of T.

The accepting states are those containing the position for the endmarker symbol #.

The value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D.

Let us call this set of positions A.

We must compute Dtran[A, a] and Dtran[A, b].

Among the positions of A, leaf 1 and leaf 3 correspond to a, while leaf 2 corresponds to b. Thus,

Dtran[A, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[A, b] = followpos(2) = {1,2,3} = A

Dtran[B, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[B, b] = followpos(2) U followpos(4) = {1,2,3,5} = C

Dtran[C, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[C, b] = followpos(2) U followpos(5) = {1,2,3,6} = D

Dtran[D, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[D, b] = followpos(2) = {1,2,3} = A
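The marking procedure of Step 3 can be sketched in a few lines. The followpos values and the symbol at each position are taken from the computation above; they appear to correspond to the regular expression (a|b)*abb#, which is an inference rather than something stated on the slide. The function name build_dfa and the data layout are assumptions.

```python
def build_dfa(start, followpos, symbol_at, alphabet):
    """Construct Dstates and Dtran from followpos by the marking procedure."""
    start = frozenset(start)
    dstates = [start]        # states in order of discovery
    unmarked = [start]
    dtran = {}
    while unmarked:
        state = unmarked.pop(0)          # mark the state
        for a in alphabet:
            # Union of followpos(p) over all positions p in state labeled a.
            target = frozenset().union(
                *(followpos[p] for p in state if symbol_at[p] == a)
            )
            if not target:
                continue                 # no transition on a
            if target not in dstates:
                dstates.append(target)
                unmarked.append(target)
            dtran[state, a] = target
    return dstates, dtran


# Values from the worked example; position 6 is the endmarker #.
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}}
symbol_at = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}

dstates, dtran = build_dfa({1, 2, 3}, followpos, symbol_at, "ab")
names = {state: name for state, name in zip(dstates, "ABCD")}
accepting = [names[s] for s in dstates if 6 in s]

for state in dstates:
    row = {a: names[dtran[state, a]] for a in "ab"}
    print(names[state], sorted(state), row)
```

Run on these values, the procedure discovers the four states in the order A, B, C, D and reproduces the eight transitions listed above, with D as the only accepting state because it is the only one containing position 6.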

[Figure: transition diagram of the resulting DFA with states A, B, C, D; A is the start state.]

Note: We can also minimize the resultant DFA.


Question Time

Q. Find the DFA for each of the following regular expressions.

a(a|b)*#

(a|b)c*#

THANK YOU!
