The Semantics and Pragmatics of Natural Language


Daniela GÎFU

http://profs.info.uaic.ro/~daniela.gifu/

“ALEXANDRU IOAN CUZA” UNIVERSITY OF IAŞI

FACULTY OF COMPUTER SCIENCE

Course 1

SPNL OVERVIEW


https://profs.info.uaic.ro/~daniela.gifu/

Who am I?

“Alexandru Ioan Cuza” University of Iași

THE HALL OF THE LOST STEPS

Faculty of Computer Science

BE AMONG THE FIRST…..

Romanian Academy

What is this course about?

➢ Meaning and Natural Language Processing (NLP)

➢ Computational Semantics

➢ Computational Pragmatics


Familiarization

with relevant terminology

• Semantics

• Pragmatics

• Natural language

• Computational Linguistics

• Natural Language Processing

…

Simulation of human (natural) intelligence by machines

Interdisciplinary field ~ scientific study of language from a computational perspective

A discipline that spans theory and practice to understand computer systems and networks at a deep level.

Computational Linguistics (CL)

vs.

Natural Language Processing (NLP)

CL = provides the theoretical background (computational theories of language, linguistic models).

NLP = applied CL, including:

- natural language technology (NLT)

- human language technology (HLT)

Research

Engineering techniques have to be underpinned by scientific

understanding…

Good performance on some tasks when large amounts of annotated data are available

Spoken language

- speech processing (from speech to text to syntax and

semantics to speech) - https://speechlogger.appspot.com/ro/

Ex: mobile

Written language – my area of interest

Language in correlation with other modalities

(multimodality)

- speech

- intonation

- image

Ex: GPS (Global Positioning System)

Natural Language Technology

Document segmentation and interpretation

– cleaning (elimination of dots, enhancing contrast,

etc.)

– separation of text from image, curved lines...

– recognizing printed, semi-uncial characters, etc.

• Optical Character Recognition (OCR)

~100% accuracy when scanning printed material in Latin script

Challenge in OCR


Written Language Technologies

Students?


OCR Handwriting – Why?

= presents some unique particularities

= many varieties of cursive writing

see: https://pdf.iskysoft.com/ocr-pdf/handwriting-ocr.html


OCR Handwriting very challenging

= the interpretation of physicians' handwriting (Rasmussen, L.V. et al., 2012; Broda, B. & Piasecki, M., 2007)

= analysis of old handwritten documents (useful for linguists,

musicians, historians, etc.)

Document Image

Analysis

PR (pattern recognition) = a sub-topic of machine learning: the description or classification (recognition) of measurements.

Differences between CL Approaches

• Analysis and understanding of written language

– sub-syntactic processing

• lexical units

• sentence splitting

• clause borders

• part of speech and morphological information

• lemmas

• entity names

• groups (nominal, verbal, prepositional, etc.)

and lexical attractions (collocations)
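The first two sub-syntactic steps listed above (sentence splitting and lexical units) can be sketched with regular expressions. This is a deliberately naive illustration, not a tool used in the course: the function names `split_sentences` and `tokenize` are hypothetical, and the rules ignore abbreviations, numbers and other hard cases that real tokenizers handle.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenizer: words (optionally joined by - or ') and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


if __name__ == "__main__":
    text = "I saw the gentleman. He had a hat!"
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```

Later steps in the list (part of speech, lemmas, entity names, groups) need lexical resources or trained models and are not covered by this sketch.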


Written Language Technologies

• Language analysis and understanding

– semantic and discourse processing

• semantic disambiguation → word senses

• semantic role labeling → NLTK

• rhetorical structure of discourse and dialogue →

RST (Rhetorical Structure Theory)

• anaphora resolution → Stanford CoreNLP

• text summarization → Machine Learning
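To make the first item (semantic disambiguation → word senses) concrete, here is a minimal sketch of a simplified Lesk-style heuristic: pick the sense whose gloss shares the most content words with the context. The slides do not name this method; the sense inventory `SENSES`, its glosses, the stop-word list and the function names are all hypothetical toy data chosen for illustration.

```python
import re

# Hypothetical toy sense inventory (a real system would use a resource such as WordNet).
SENSES = {
    "bank": {
        "bank/finance": "an institution that accepts deposits and lends money",
        "bank/river": "the sloping land beside a river or lake",
    }
}

# Small hypothetical stop-word list.
STOP_WORDS = {"a", "an", "the", "that", "and", "or", "of", "to", "at", "on", "in", "i", "my", "we"}


def content_words(text):
    """Lower-cased alphabetic tokens minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def simplified_lesk(word, context):
    """Return the sense label whose gloss overlaps most with the context words."""
    ctx = content_words(context)
    glosses = SENSES[word]
    return max(glosses, key=lambda sense: len(ctx & content_words(glosses[sense])))


if __name__ == "__main__":
    print(simplified_lesk("bank", "I deposited money at the bank"))
    print(simplified_lesk("bank", "We sat on the bank of the river"))
```

With glosses this short the overlap is often zero, and ties are resolved arbitrarily (the first sense listed wins), which is one reason practical systems rely on larger resources or learned models.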


the study of mathematical structures and methods that are

of importance to linguistics.

→ Phonetics → Phonology → Morphology → Syntax → Semantics → … → Sociolinguistics → Language Acquisition.

Mathematical Linguistics

Mathematical Linguistics before Computational Linguistics….

ML ⇔ CL?

= the art of solving problems that require analyzing (or generating) natural language text.

Find the metrics that define a good solution to the engineering problem…

NLP

Google Translate – Don’t blame!!!!

Romanian = Luceafărul de dimineață

English = The morning gentleman (bad answer)

= Morning star (good answer)

Why????

explains how human translators do their job...


Let’s try!


NLP – a subdomain of

Artificial Intelligence & Linguistics

Thematic Areas

- Linguistics - mathematical linguistics - computational

linguistics

- Formal Language

- Linguistic and Language Processing

- The grammatical structure of utterances: the sentence,

constituents, phrase, classifications and structural rules,

syntactic processing ...

- Parser or Syntax Analyzer

- Semantics & Pragmatics

= an area of Artificial Intelligence (AI) devoted to

creating computers that use NL as input and/or

output.

NLP


AI-hard problem

= machine reading

comprehension

= produces language

as output on the basis

of data input

= developing computational methods/models of human linguistic behavior.

CL

▪ INFORMATION RETRIEVAL

▪ INFORMATION EXTRACTION

▪ MACHINE TRANSLATION

▪ QUESTION ANSWERING

▪ SUMMARIZATION

▪ MACHINE READABLE DICTIONARIES

▪ SPELLING & GRAMMAR CHECKERS


Let’s describe and exemplify


A discipline concerned with understanding written and spoken

language from a computational perspective.

- detecting synonymy (Grigonytė et al., 2010);

- developing WordNet (including Romanian - Gala and Mititelu, 2013), (Iftene and Balahur, 2007)...;

- WSD (Yang et al., 2010), (Lefever and Hoste, 2010), (Tufiș, 2002)...;

- semantic annotation (Garcia et al., 2012)...;

- reconstructing a diachronic morphology (Cristea et al.,

2007/2012)

- diachronic text classification (Mihalcea and Năstase, 2012;

Popescu and Strapparava, 2015), etc.

- epoch detection (Gifu, 2015/2016/2017)...;

CL – Applications

Tools developed

by students…


Linguistic & Language Processing

1. Linguistics

- Science of language. Includes:

✓ Sounds (phonology)

✓ Word formation (morphology)

✓ Sentence structure (syntax)

✓ Meaning (semantics) and understanding

(pragmatics)…

2. Levels of linguistic analysis

- Higher level → Speech Recognition (SR)

- Lower levels → Natural Language Processing (NLP)

Levels of Linguistic Analysis

[Figure: Speech Recognition covers acoustic signal → phonemes; NLP covers letters/strings → morphemes → words → phrases & sentences → meaning out of context → meaning in context]

Phonetics – production and perception of speech

Phonology – Sound patterns of language

Lexicon – Dictionary of words in a language

Morphology – Word formation and structure

Syntax – Sentence structure

Semantics – Intended meaning

Pragmatics – Understanding from external info

NLP Pipeline

Course purpose


MAIN CONCEPTS

1. Natural Language

- used by human beings for communication...

- sign, system, symbols, rule-set (or grammar)

2. Semantics

- literal meaning determined from a word, phrase, or sentence.

3. Pragmatics

- contextual meaning {situation, speaker, etc.}


Natural or ordinary language

• A system of speech symbols → (form criterion)

Types:

a) speech (spoken language)

b) writing (written language) - the representation of a spoken or gestural language.

• The most important means of human communication →

(function criterion)

Natural Language…

• Multiplicity of languages

Formal Language_I

1. Symbol

- a character, an abstract entity that has no meaning by

itself

Ex: letters, digits and special characters

2. Alphabet

- finite set of symbols

- often denoted by Σ

Ex:

B = {0, 1} – B is an alphabet of two symbols, 0 and 1

C = {a, b, c} – C is an alphabet of three symbols, a, b and c

* More about formal language:

http://www.its.caltech.edu/~matilde/FormalLanguageTheory.pdf


Formal Language_II

3. String or word

- a finite sequence of symbols from an alphabet

Ex: 01110 and 111 are strings over the alphabet B above

aaabccc and b are strings over the alphabet C above

4. Sentence

- a string of words.

Ex: I saw the gentleman with the hat.

String = a b c d e c f (each distinct word is one symbol, so both occurrences of "the" map to c)

Formal language_III

How can we define the possible relations between the parts of a string?

A.

[I] saw the gentleman [with the binoculars] = [a] b c d [e c f]

B.

I saw [the gentleman with the binoculars] = a b [c d e c f]

We can represent structures with trees…

[Figure: two parse trees for "I saw the gentleman with the binoculars", one for each reading]
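As a rough illustration of representing the two readings as trees, the sketch below encodes each one as a nested tuple of the form (label, child, child, ...). The category labels (S, NP, VP, PP, V) and the helper `leaves` are assumptions made for this example; the slides show only the bracketings.

```python
# Reading A: the "with ..." phrase attaches to the verb phrase
# (the speaker uses the binoculars to see).
reading_a = (
    "S",
    ("NP", "I"),
    ("VP",
        ("VP", ("V", "saw"), ("NP", "the", "gentleman")),
        ("PP", "with", ("NP", "the", "binoculars"))),
)

# Reading B: the "with ..." phrase attaches to the noun phrase
# (the gentleman has the binoculars).
reading_b = (
    "S",
    ("NP", "I"),
    ("VP",
        ("V", "saw"),
        ("NP",
            ("NP", "the", "gentleman"),
            ("PP", "with", ("NP", "the", "binoculars")))),
)


def leaves(tree):
    """Return the words at the leaves of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [word for child in children for word in leaves(child)]


if __name__ == "__main__":
    print(leaves(reading_a) == leaves(reading_b))  # same string of words
    print(reading_a == reading_b)                  # different structures
```

Both trees yield the same string of words, which is exactly why the string alone cannot tell the two readings apart.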

Formal Language_IV

5. Language

- a set of strings of symbols from an alphabet.

6. Natural Language or ordinary language

- open-ended = built on 3 different knowledge components: the

sound of words - phonology; the meaning of words -

semantics; the grammatical rules according to which words are

put together - syntax.

7. Formal language

- a set L of sequences/strings over some finite alphabet Σ

- described using formal grammars (a set of rules that specify which strings belong to it).

- many applications (e.g., Prognosis wearable system)
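The definitions of alphabet, string and language can be tried out directly. The sketch below is illustrative only: the function names and the sample language `even_ones` are hypothetical, and because Σ* is infinite the enumeration is cut off at a chosen maximum length.

```python
from itertools import product


def is_string_over(s, alphabet):
    """True if every symbol of s belongs to the alphabet (i.e. s is in alphabet*)."""
    return all(symbol in alphabet for symbol in s)


def kleene_star(alphabet, max_len):
    """Yield all strings over the alphabet with length 0..max_len, shortest first."""
    for length in range(max_len + 1):
        for symbols in product(sorted(alphabet), repeat=length):
            yield "".join(symbols)


B = {"0", "1"}
C = {"a", "b", "c"}

# A formal language is a set of strings over an alphabet.
# Hypothetical example: binary strings with an even number of 1s, up to length 3.
even_ones = [w for w in kleene_star(B, 3) if w.count("1") % 2 == 0]

if __name__ == "__main__":
    print(is_string_over("01110", B), is_string_over("aaabccc", C))
    print(list(kleene_star(C, 2)))
    print(even_ones)
```

The empty string "" plays the role of ε: it is the first element produced for any alphabet.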

Formal Language_V

Context-Free Grammars (CFG) - a finite set of grammar rules

https://www.tutorialspoint.com/automata_theory/context_free_grammar_introduction.htm

= a quadruple (N, T, P, S), where:

N = a finite set of non-terminal symbols (characters or variables).

Note! Each n ∈ N = a type of phrase/clause in the sentence.

T = a finite set of terminals (the alphabet defined by the grammar), disjoint from N: N ∩ T = ∅.

P = a finite set of (rewrite) rules or productions of the grammar, each rewriting one non-terminal as a string of terminals and non-terminals:

P: N → (N ∪ T)*

Note! The left-hand side of a production rule in P does not have any right context or left context.

* = Kleene star operation = a unary operation on sets of strings or sets of symbols (characters); applied to a set V it is written V* and yields all finite concatenations of elements of V (also used in regular expressions).

Ex: {"a", "b", "c"}* = {ε, "a", "b", "c", "aa", "ab", "ac", "ba", "bb", "bc", "ca", "cb", "cc", "aaa", "aab", ...}

∅* = {ε} (the language consisting only of the empty string)

S = the start symbol (S ∈ N), used to represent the whole sentence.
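A small sketch may help tie the quadruple (N, T, P, S) to the ambiguous sentence from the earlier slides. Everything below is an assumption made for illustration: the toy grammar, its category names, and the use of CKY-style chart counting (a standard technique the slides do not mention) to count how many parse trees the grammar assigns to a sentence.

```python
from collections import defaultdict

# Hypothetical toy grammar written as the quadruple (N, T, P, S).
NONTERMINALS = {"S", "NP", "VP", "PP", "Det", "Noun", "V", "Prep"}
TERMINALS = {"I", "saw", "the", "gentleman", "with", "binoculars"}
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),    # PP attached to the verb phrase
    ("NP", ("Det", "Noun")),
    ("NP", ("NP", "PP")),    # PP attached to the noun phrase
    ("PP", ("Prep", "NP")),
    ("NP", ("I",)),
    ("Det", ("the",)),
    ("Noun", ("gentleman",)),
    ("Noun", ("binoculars",)),
    ("V", ("saw",)),
    ("Prep", ("with",)),
]
START = "S"

# Well-formedness checks taken from the definition: N ∩ T = ∅ and S ∈ N.
assert NONTERMINALS.isdisjoint(TERMINALS)
assert START in NONTERMINALS


def count_parses(words):
    """Count parse trees for `words` (rules must be binary or single-terminal)."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, non-terminal) -> number of trees
    for i, word in enumerate(words):
        for lhs, rhs in PRODUCTIONS:
            if rhs == (word,):
                chart[i, i + 1, lhs] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhs in PRODUCTIONS:
                    if len(rhs) == 2:
                        left, right = rhs
                        chart[i, j, lhs] += chart[i, k, left] * chart[k, j, right]
    return chart[0, n, START]


if __name__ == "__main__":
    print(count_parses("I saw the gentleman with the binoculars".split()))
```

The two rules marked with comments are what make the sentence ambiguous under this grammar: removing either one leaves a single parse.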

Main Concepts - II

CONCLUSIONS

Computational semantics and pragmatics:

➢ automatic construction of semantic representations for NL

expressions (in context).

➢ automatic inferences over the representations.

Major Issues:

➢ Ambiguity at various levels:

lexical, syntactic, semantic, pragmatic

➢ Interface between the logical form (LF) derived from the linguistic form and the context of use (essential for modelling anaphora).

Tools used include:

➢ Information: syntax, world knowledge, lexical semantics,

corpora…

➢ Inference: logic (model checking and theorem proving), machine learning, statistics…

Semester Homework:

1. Each student has to present a paper about his/her SEMEVAL task, which guides the final project

- https://aclweb.org/anthology/

between 2018-2021

EMNLP (Conference on Empirical Methods in Natural Language Processing)

ACL (Association for Computational Linguistics)

EACL (European Chapter of the Association for Computational Linguistics)

COLING (International Conference on

Computational Linguistics) …


Final project: SEMEVAL 2022

Groups of 2-3 students:

- 1-2 humanists & 1 computer scientist prepare a paper for SEMEVAL-2022 based on their research, under constant supervision -

https://semeval.github.io/SemEval2022/tasks

Projects steps – next time

1. Form a team...

2. Choose a task

3. Define the teamwork

4. Establish the modular structure

5. Edit the paper – a possible structure

5. Edit the paper – making an outline

* Choosing a Title

* Abstract (executive summary) & Keywords

* Introduction (the new approach; background

information; research problem/question; theoretical

framework)

* SOTA (citation tracking; content alert services;

evaluating sources; primary sources; secondary sources…)

* Methodology (qualitative methods; quantitative

methods)

* Results

* Discussion

* Conclusions and future work

* References

Thank you!
