59
Lexical and Syntax Analysis (of Programming Languages) Introduction

Lexical and Syntax Analysis - University of York analysis converts a stream of symbols to a parse tree: –a proof that a given input is valid according to the language syntax;

Embed Size (px)

Citation preview

Lexical and Syntax Analysis(of Programming Languages)

Introduction

Lexical and Syntax Analysis(of Programming Languages)

Introduction

What is Parsing?

String ofcharacters

Easy for humansto write

Easy for programsto process

Parser

A parser also checks that the input stringis well-formed, and if not, rejects it.

Data structure

What is Parsing?

String ofcharacters

Easy for humansto write

Easy for programsto process

Parser

A parser also checks that the input stringis well-formed, and if not, rejects it.

Data structure

Example 1

Charlton, 49

Lineker, 48

Beckham, 17

CSV (Comma Separated Value)

Array of pairs

Parser

“Charlton”

49

“Lineker”

48

“Beckham”

17

Example 1

Charlton, 49

Lineker, 48

Beckham, 17

CSV (Comma Separated Value)

Array of pairs

Parser

“Charlton”

49

“Lineker”

48

“Beckham”

17

Data structure?

typedef struct {char* name;int goals;

} Player;

A data structure is typically avalue of a data type in someprogramming language, e.g. C.

typedef struct {Player* players;int size;

} Squad;

The type of a player:

The type of a squad of players:

Data structure?

typedef struct {char* name;int goals;

} Player;

A data structure is typically avalue of a data type in someprogramming language, e.g. C.

typedef struct {Player* players;int size;

} Squad;

The type of a player:

The type of a squad of players:

Why data structures?

int total(Squad s){

int i, sum = 0;for (i = 0; i < s.size; i++)

sum += s.players[i].goals;return sum;

}

Data structures are convenientto process by a computerprogram.

The total goals scored by allplayers in a squad:

Why data structures?

int total(Squad s){

int i, sum = 0;for (i = 0; i < s.size; i++)

sum += s.players[i].goals;return sum;

}

Data structures are convenientto process by a computerprogram.

The total goals scored by allplayers in a squad:

Example 1: the problem

Squad parse(char* input){

...}

We want to be able to parseCSV files to values of typeSquad so we can process themconveniently.

LSA will teach you how to fill inthe dots. (This is a rather easyexample, though!)

Example 1: the problem

Squad parse(char* input){

...}

We want to be able to parseCSV files to values of typeSquad so we can process themconveniently.

LSA will teach you how to fill inthe dots. (This is a rather easyexample, though!)

Everyday parsing

Our email clients parse emailheaders, allowing search by to& from address etc.

Our web browsers parse HTML,JavaScript, CSS, etc.

Our copies of Call of Duty parseconfiguration files and saved-game states.

Everyday parsing

Our email clients parse emailheaders, allowing search by to& from address etc.

Our web browsers parse HTML,JavaScript, CSS, etc.

Our copies of Call of Duty parseconfiguration files and saved-game states.

LSA of PLs

In LSA, we are interested inparsing in general.

But we have a special interest inparsing programming languages(PLs). Why?

“If we can parse a PL, we canparse anything.”

In practice, we often want toparse PL-like languages.

Preparation for CGO.

LSA of PLs

In LSA, we are interested inparsing in general.

But we have a special interest inparsing programming languages(PLs). Why?

“If we can parse a PL, we canparse anything.”

In practice, we often want toparse PL-like languages.

Preparation for CGO.

Example 2

foo := 20 + bar

A pascal statement An abstractsyntax tree

Parser

ASSIGN

PLUS

NUM VAR20

“foo”

“bar”

We will return to this example later!

Example 2

foo := 20 + bar

A pascal statement An abstractsyntax tree

Parser

ASSIGN

PLUS

NUM VAR20

“foo”

“bar”

We will return to this example later!

LSA & CGO

The connection between Lexical and SyntaxAnalysis (2nd year module) and CodeGeneration and Optimisation (3rd yearmodule).

LSA & CGO

The connection between Lexical and SyntaxAnalysis (2nd year module) and CodeGeneration and Optimisation (3rd yearmodule).

LSA & CGO

TargetProgram

Easy for machinesto execute

Source Program (String)

Easy for humansto write andunderstand

Easy for compilerto process

LSA

Source Program

(Data structure)

CGO

LSA & CGO

TargetProgram

Easy for machinesto execute

Source Program (String)

Easy for humansto write andunderstand

Easy for compilerto process

LSA

Source Program

(Data structure)

CGO

OTHER MOTIVATIONS

For studying parsing

OTHER MOTIVATIONS

For studying parsing

To understand how computers work

Computer

Architecture

LSA & CGO

Programming

Languages

To understand how computers work

Computer

Architecture

LSA & CGO

Programming

Languages

To practice applyingyour knowledge

LSA

TAD

POPTOC

To practice applyingyour knowledge

LSA

TAD

POPTOC

PARSING

=

LEXICAL ANALYSIS+

SYNTAX ANALYSIS

PARSING

=

LEXICAL ANALYSIS+

SYNTAX ANALYSIS

Lexical Analysis

Identifies the lexemes in asentence.

Lexeme: a minimal meaningfulunit of a language.

Converts each lexeme to atoken.

Throws away ignorable textsuch as spaces, new-lines, andcomments.

(Also known as “scanning”)

Lexical Analysis

Identifies the lexemes in asentence.

Lexeme: a minimal meaningfulunit of a language.

Converts each lexeme to atoken.

Throws away ignorable textsuch as spaces, new-lines, andcomments.

(Also known as “scanning”)

What is a token?

Every token has an identifier,used to denote the kind oflexeme that it represents, e.g.

Token identifier denotes a

PLUS + operator

ASSIGN := operator

VAR variable

NUM number

Some tokens have a componentvalue, conventionally written inparenthesis after the identifier,e.g. VAR(foo), NUM(12).

What is a token?

Every token has an identifier,used to denote the kind oflexeme that it represents, e.g.

Token identifier denotes a

PLUS + operator

ASSIGN := operator

VAR variable

NUM number

Some tokens have a componentvalue, conventionally written inparenthesis after the identifier,e.g. VAR(foo), NUM(12).

Lexical Analysis

Stream of characters

Stream of tokens

Example input:

foo := 20 + bar

Example output:

VAR(foo), ASSIGN, NUM(20),PLUS, VAR(bar)

Lexical Analysis

Stream of characters

Stream of tokens

Example input:

foo := 20 + bar

Example output:

VAR(foo), ASSIGN, NUM(20),PLUS, VAR(bar)

Lexical Analysis

Lexemes are specified by regularexpressions. For example:

digit = 0 | ... | 9letter = a | ... | znumber = digit⋅ digit*

variable = letter⋅ (letter | digit)*

1443634

xfoofoo2x1y20

Example numbers: Example variables:

Lexical Analysis

Lexemes are specified by regularexpressions. For example:

digit = 0 | ... | 9letter = a | ... | znumber = digit⋅ digit*

variable = letter⋅ (letter | digit)*

1443634

xfoofoo2x1y20

Example numbers: Example variables:

Syntax Analysis

Syntax: the set of rules definingvalid strings of a language.

Syntax analysis converts a streamof symbols to a parse tree:

– a proof that a given input is validaccording to the language syntax;

– also a structure-rich representationof the input that is convenient toprocess.

(Also known as “parsing”)

Syntax Analysis

Syntax: the set of rules definingvalid strings of a language.

Syntax analysis converts a streamof symbols to a parse tree:

– a proof that a given input is validaccording to the language syntax;

– also a structure-rich representationof the input that is convenient toprocess.

(Also known as “parsing”)

Syntax Analysis

The syntax of a language isusually specified by a grammar.

Example:

stmt → VAR(v) ASSIGN expr

expr → VAR(v)| NUM(n)| expr PLUS expr

Where v represents any variablename and n any number.

Syntax Analysis

The syntax of a language isusually specified by a grammar.

Example:

stmt → VAR(v) ASSIGN expr

expr → VAR(v)| NUM(n)| expr PLUS expr

Where v represents any variablename and n any number.

Syntax Analysis

Example input:

Example output:

VAR(foo), ASSIGN, NUM(20),PLUS, VAR(bar)

stmt

NUM(20) VAR(bar)

VAR(foo) ASSIGN

PLUSexpr

expr

expr

Stream of symbols

Parse tree

Syntax Analysis

Example input:

Example output:

VAR(foo), ASSIGN, NUM(20),PLUS, VAR(bar)

stmt

NUM(20) VAR(bar)

VAR(foo) ASSIGN

PLUSexpr

expr

expr

Stream of symbols

Parse tree

Syntax Analysissubsumes

Lexical Analysis

Any language that can beaccepted by a regular expressioncan be accepted by a grammar.

But not vice-versa!*

Hence Syntax Analysis is morepowerful than Lexical Analysis.

* Can anyone give a simple example?

Syntax Analysissubsumes

Lexical Analysis

Any language that can beaccepted by a regular expressioncan be accepted by a grammar.

But not vice-versa!*

Hence Syntax Analysis is morepowerful than Lexical Analysis.

* Can anyone give a simple example?

Why bother withLexical Analysis?

Convenience: regular expressionsmore convenient than grammarsto define regular strings.

Efficiency: there are efficientalgorithms for matching regularexpressions that do not apply inthe more general setting ofgrammars.

Modularity: split a problem intotwo smaller problems.

Why bother withLexical Analysis?

Convenience: regular expressionsmore convenient than grammarsto define regular strings.

Efficiency: there are efficientalgorithms for matching regularexpressions that do not apply inthe more general setting ofgrammars.

Modularity: split a problem intotwo smaller problems.

THE LSA MODULE

Structure and content

THE LSA MODULE

Structure and content

Objectives

To learn how to implementefficient parsers

and to learn the theory behindparser generation from regularexpressions and grammars.

using a general purposeprogramming language (C);

using an automatic parsergenerator (Flex & Bison).

Objectives

To learn how to implementefficient parsers

and to learn the theory behindparser generation from regularexpressions and grammars.

using a general purposeprogramming language (C);

using an automatic parsergenerator (Flex & Bison).

Practicals

4 practicals in the lab.

Idea is to practice the techniquesdeveloped in the lectures.

Room CSE/069.

Spring weeks 9 & 10.

Summer weeks 4 & 5.

Two practical groups: A and B. Findyour group on the LSA web page.

Practical Topic

1 Lexical Analysis

2 Flex, a Lexical Analyser Generator

3 Recursive Descent Parsing

4 Bison, a Parser Generator

Organisation of practicals

Practicals

4 practicals in the lab.

Idea is to practice the techniquesdeveloped in the lectures.

Room CSE/069.

Spring weeks 9 & 10.

Summer weeks 4 & 5.

Two practical groups: A and B. Findyour group on the LSA web page.

Practical Topic

1 Lexical Analysis

2 Flex, a Lexical Analyser Generator

3 Recursive Descent Parsing

4 Bison, a Parser Generator

Organisation of practicals

Lecture contents

Chapter Title

1 Introduction

2 Abstract Syntax

3 Lexical Analysis

4 Flex, a Lexical Analyser Generator

5 Grammars

6 Recursive Descent Parsing

7 Bison, a Parser Generator

8 Top-Down Parsing

9 Bottom-Up Parsing

Organisation of Lectures

14 lectures, with notes arranged into9 chapters.

A single chapter may be covered inless than or more than one lecture.

Lecture contents

Chapter Title

1 Introduction

2 Abstract Syntax

3 Lexical Analysis

4 Flex, a Lexical Analyser Generator

5 Grammars

6 Recursive Descent Parsing

7 Bison, a Parser Generator

8 Top-Down Parsing

9 Bottom-Up Parsing

Organisation of Lectures

14 lectures, with notes arranged into9 chapters.

A single chapter may be covered inless than or more than one lecture.

Recommended books

Recommended books

Don’t break the flow!

You view lecture notesin advance

I explainin lecture

You doexercises