45
NATURAL LANGUAGE PROCESSING (NLP) CS-724 Wondwossen Mulugeta (PhD) email: [email protected] 1

Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

NATURAL LANGUAGE

PROCESSING (NLP)

CS-724

WondwossenMulugeta (PhD) email: [email protected]

1

Page 2: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

2

What and Why

What is Language?

What is Natural Language?

What is Language Processing?

Why do we need to process Natural Language?

How is computer used for Language Processing?

Page 3: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

3

What is NLP

Natural Language Processing (NLP) is both a

modern computational technology and a method of

investigating and evaluating claims about human

language itself.

Also called Computational Linguistics which links to

Artificial Intelligence (AI), the general study of

cognitive function by computational processes,

normally with an emphasis on the role of knowledge

representations, that is to say the need for

representations of our knowledge of the world in

order to understand human language with computers.

Page 4: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

4

Scope

Language Technology

Knowledge Technology

Multimedia and Multimodal

Technology

Text

TechnologySpeech

Technology

Page 5: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Course Code CS-724

Course Title Natural Language Processing

Course

Description

This course is an introduction to natural language

processing. The study of human language from a

computational perspective. The course is designed to

get students to the level with the current research in

the area. It covers topics such as morphology,

syntactic, semantic and pragmatic processing models,

emphasizing statistical or corpus-based methods and

algorithms. It also covers applications of these

methods and models in syntactic parsing, information

extraction, statistical machine translation, speech

systems, and summarization.

5

Page 6: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Topics

Topics Subtopics

1:

Introduction

Definitions, Scope and Coverage of NLP, Application

of NLP, Approach to NLP, Levels of language

processing, NLP from the different views, Course

project Idea

2: Basics of

Linguistics

Description of language, Levels of description:

Phonetics, Phonology, Morphology, Syntax, Semantics,

Discourse.

6

Page 7: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Topics

Topics Subtopics

3: Text Processing

Tokenization and word Segmentation, Stemming, Punctuation, Number, Lemmatization, part of speech tagging, Morphological Processing (Types of Morpheme, Morphological Types, Morphological Rules, Morphemes and Words, Inflectional and Derivational Morphology), Parsing (Introduction, Context Free Grammar, parsing)

Mid Term Exam (25%)

4: Language Modeling

Introduction to Language Modeling, Application, N-gram, Smoothing Techniques, evaluation

7

Page 8: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Topics

Topics Subtopics5: Speech Processing Systems

Introduction, Challenges, Automatic Speech Recognition (Approaches, Acoustic Modeling, Lexical Modeling), Text to Speech System (Text Analysis, Wave form synthesis), Evaluations

6: Machine Translation

Application, Challenges, Approaches

7: Word Sense Disambiguation (WSD)

Practical Problem, Word Sense relations, Disambiguation Approaches, Knowledge based WSD, Machine readable Dictionary, Lesk Algorithm, Selectional Preference for WSD

8: Text Summarization

Application, Types, Single doc. Summarization, Sentiment Analysis

8

Page 9: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Text Book and Reference

Text Book Jurafeski, Daniel, and James H. Martin (2009)

Speech and Language Processing: An introduction to

NLP, Speech recognition, and Computational

Linguistics, 2nd Ed

Reference

Materials

1.Christopher Manning and HinrichSchutze. Foundations

of Statistical NLP, MIT Press, 1999.

2.James Allen. Natural Language Understanding, 2nd

edition.

3.Roland R. Hausser. Foundations of Computational

Linguistics: Human-Computer Communication in

Natural Language, Springer Verlag, 2001.

4.Tom Mitchell. Machine Learning. McGraw Hill, 1997.

9

Page 10: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Evaluation

Assessment

Method

• Mid Term Exam: 25%

• Project: 25%

• Final Exam: 50%

• Assignments and Attendance ….##%

10

Page 11: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Language Technology

Coreference resolution

Question answering (QA)

Part-of-speech (POS) tagging

Word sense disambiguation (WSD)

Paraphrase

Named entity recognition (NER)

ParsingSummarization (Abstractive)

Information extraction (IE)

Machine translation (MT)Dialog

Sentiment analysis

mostly solved

making good progress

still challlenging

Spam detectionLet’s go to Agra!

Buy DraG…

✓✗

Colorless green ideas sleep furiously.ADJ ADJ NOUN VERB ADV

Einstein met with UN officials in PrincetonPERSON ORG LOC

You’re invited to our dinner party, Friday May 27 at 8:30

PartyMay 27add

Best roast chicken in San Francisco!

The waiter ignored us for 20 minutes.

Carter told Mubarak he shouldn’t run again.

I need new batteries for my mouse.

The 13th Shanghai International Film Festival…

第13届上海国际电影节开幕…

The Dow Jones is up

Housing prices rose

Economy is good

Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?

I can see Alcatraz from the window!

XYZ acquired ABC yesterday

ABC has been taken over by XYZ

Where is Citizen Kane playing in SF?

Castro Theatre at 7:30. Do you want a ticket?

The S&P500 jumped

Page 12: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Background

Solving the language-related problems, is the main

concern of the fields known as Natural Language

Processing, Computational Linguistics, and Speech

Recognition and Synthesis

Few applications of language processing

spelling correction,

grammar checking,

information retrieval, and

machine translation,

speech processing, etc

Page 13: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

13

Knowledge in NLP

Tasks of being capable of analyzing an incoming

audio signal and recovering the exact sequence of

words and generating its response require knowledge

about phonetics and phonology, which can help

model how words are pronounced in colloquial

speech.

Producing and recognizing the variations of individual

words (e.g., recognizing that doors is plural) requires

knowledge about morphology, which captures

information about the shape and behavior of words

in context.

Page 14: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Knowledge in NLP

Syntax: the knowledge needed to order and

group words together

I’m I do, sorry that afraid Dave I’m can’t.

(Dave, I’m sorry I’m afraid I can’t do that.)

ቤቴ ሄጄ እመጣለሁ vs እመጣለሁ ቤቴ ሄጄ vs ሄጄ እመጣለሁ ቤቴ

Lexical semantics: knowledge of the meanings of the

component words

Compositional semantics: knowledge of how these

components combine to form larger meanings

data + base = database ቤት + መጽሐፍ = ቤተመጽሐፍ

Page 15: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Knowledge in NLP

Pragmatics: the appropriate use of the kind of

polite and indirect languageo No or

o No, I won’t open the door.

◼ I’m sorry, I’m afraid, I can’t.

◼ I won’t.

Discourse conventions: knowledge of correctly structuring

these such conversations (intonation, gesturer, style, speech

act, etc)

◼ Dave, I’m sorry I’m afraid I can’t do that.

The word “that” is referring to something which is not part of the

sentences

Page 16: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

16

Knowledge in Language Processing

Phonetics and Phonology — The study of linguistic sounds

Morphology —The study of the meaningful components of

words

Syntax —The study of the structural relationships between

words

Semantics — The study of meaning

Pragmatics — The study of how language is used to

accomplish goals

Discourse—The study of linguistic units larger than a single

utterance

Page 17: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

17

Technologies

Speech recognition

Spoken language is recognized

and transformed in into text as in

dictation systems, into commands as

in command control systems, or into

some other internal representation.

Speech synthesis

Utterances in spoken language are

produced from text (text-to-speech

systems) or from internal

representations of words or

sentences (concept-to-speech

systems)

17

Page 18: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

18

Technologies

Text categorization

This technology assigns texts to categories. Texts may belong to more than one category, categories may contain other categories.

◼ Eg: News Classification?

Text Summarization

The most relevant portions of a text are extracted as a summary. The task depends on the needed lengths of the summaries. Summarization is harder if the summary has to be specific to a certain query.

◼ Eg: News Summerization?

Page 19: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

19

Technologies

Text Indexing

As a precondition for document retrieval,

texts are stored in an indexed database.

Usually a text is indexed for all word forms

or – after lemmatization – for all lemmas.

Sometimes indexing is combined with

categorization and summarization.

Text Retrieval

Texts are retrieved from a database that

best match a given query or document. The

candidate documents are ordered with

respect to their expected relevance.

Indexing, categorization, summarization

and retrieval are often subsumed under the

term information retrieval.

Page 20: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

20

Technologies

Information Extraction

Relevant information pieces of

information are discovered and marked

for extraction. The extracted pieces can

be: the topic, named entities such as

company, place or person names, simple

relations such as prices, destinations,

functions etc. or complex relations

describing accidents, company mergers

or football matches.

◼ Eg: Named Entity Recognition

Data and Text Mining

Extracted pieces of information from

several sources are combined in one

database. Previously undetected

relationships may be discovered.

Page 21: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

21

Technologies

Question Answering

Natural language queries are used to access information in a database. The database may be a base of structured data or a repository of digital texts in which certain parts have been marked as potential answers.

Report Generation

A report in natural language is produced that describes the essential contents or changes of a database.

Page 22: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

22

Technologies

Spoken Dialogue Systems

The system can carry out a dialogue with a human user in which the user can solicit information or conduct purchases, reservations or other transactions.

Translation Technologies

Technologies that translate texts or assist human translators. Automatic translation is called machine translation. Translation memories use large amounts of texts together with existing translations for efficient look-up of possible translations for words, phrases and sentences.

◼What do we need to do?

Page 23: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

23

Methods and Resources

The methods of language technology come from

several disciplines:

computer science,

computational and theoretical linguistics,

mathematics,

electrical engineering and

psychology.

Page 24: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

24

Methods and Resources

Generic CS Methods

Programming languages, algorithms for generic data types, and

software engineering methods for structuring and organizing software

development and quality assurance.

Specialized Algorithms

Dedicated algorithms have been designed for parsing, generation and

translation, for morphological and syntactic processing with finite state

automata/transducers and many other tasks.

Nondiscrete Mathematical Methods

Statistical techniques have become especially successful in speech

processing, information retrieval, and the automatic acquisition of

language models. Other methods in this class are neural networks and

powerful techniques for optimization and search.

Page 25: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

25

Methods and Resources

Logical and Linguistic Formalisms

For deep linguistic processing, constraint based grammar formalisms

are employed. Complex formalisms have been developed for the

representation of semantic content and knowledge.

Linguistic Knowledge

Linguistic knowledge resources for many languages are utilized:

dictionaries, morphological and syntactic grammars, rules for semantic

interpretation, pronunciation and intonation.

Corpora and Corpus Tools

Large collections of application-specific or generic collections of spoken

and written language are exploited for the acquisition and testing of

statistical or rule-based language models.

Page 26: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

26

Approach to NLP

Rule Based (Hand Crafted Rules)

Develop the rules to process different types of

natural language data based on known facts, rules

and exceptions cases.

Machine Learning

Capture rules from examples (corpus which is

annotated or otherwise) and apply on new

instances

◼ Supervised: learn by comparing with expected output

◼Unsupervised: blind learning. Create knowledge by association

rather than predefined output

Page 27: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

27

Approach to NLP

There are various approaches but the most

important models are:

◼ state machines,

◼ formal rule systems,

◼ logic,

◼ probability theory and

◼ other machine learning tools

The most important algorithms of these models:

state space search algorithms and

dynamic programming algorithms

Page 28: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

28

Approach to NLP

State machines are

formal models that consist of states, transitions among

states, and an input representation.

Some of the variations of this basic model:

Deterministic and non-deterministic finite-state

automata,

finite-state transducers, which can write to an output

device,

Markov models, and hidden Markov models, which

have a probabilistic component.

Page 29: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Finite State Transducer…Example

29

Finite state transducer can be used to generate a language as well as to

analyze a language.

Page 30: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Generation using FST30

Page 31: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Analysis using FST31

Page 32: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

32

Approach to NLP

Formal rule systems. regular grammars and regular relations, context-free grammars, as

well as probabilistic variants of these methods.

State machines and formal rule systems are the main tools used when dealing with knowledge of phonology, morphology, and syntax.

They deal with search through a space of states representing hypotheses about an input.

Among the algorithms that are often used for these tasks are well-known graph algorithms such as depth-first search, as well as heuristic variants such as best-first, and A* search.

The dynamic programming paradigm ensures that redundant computations are avoided.

Page 33: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

33

Approach to NLP

The third model for capturing knowledge of language

is logic.

first order logic,

feature-structures,

semantic networks, and

conceptual dependency.

These logical representations have traditionally been

the tool of choice when dealing with knowledge of

semantics, pragmatics, and discourse (used in

phonology, morphology, and syntax).

Page 34: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

34

Approach to NLP

state machines, formal rule systems, and logic can be

augmented with probabilities.

In these models probability theory is used to solve ambiguity

“given N choices for some ambiguous input, choose the most probable one”.

Eg: The word “Book” can be Noun or Verb based on a specific context. But what is the most probable choice?

Another major advantage of probabilistic models is that they are one of a class of machine learning models.

Machine learning models accelerates the learning process

Page 35: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

35

Approach to NLP

Machine learning research has focused on ways

to automatically learn the various representations described above;

automata, rule systems, search heuristics, classifiers.

These systems can be trained on large corpora and can be used as a powerful modeling technique, especially in places where we don’t yet have good causal models.

Page 36: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

36

Language, Thought, and Understanding

The effective use of language is intertwined with

our general cognitive abilities.

Turing Test by Alan Turing (1950)

He suggested an empirical test, a game, in which a

computer’s use of language would form the basis for

determining if it could think. If the machine could win the

game it would be judged intelligent.

Page 37: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Language, Thought, and Understanding

How to measure Intelligence of a Machine?

Turing test – Alan Turing (1950)

A machine can be accepted to be intelligent if it can fool a

judge/person that its human over a tele-typing exercise.

ELIZA by Weizenbaum (1966)

Pretends to be a psychiatrist and converses with a user on

his/her problems.

Uses Keyword pattern matching

Many users thought the machine really understood their

problem.

Many such systems exist now. E.g. Alan, Alice,

Page 38: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

38

Language, Thought, and Understanding

ELIZA program (Weizenbaum, 1966)

ELIZA was an early natural language processing system capable of

carrying on a limited form of conversation with a user.

Consider the following session with a version of ELIZA that

imitated the responses of a Rogerian psychotherapist.

User1: You are like my father in some ways.ELIZA1: WHAT RESEMBLANCE DO YOU SEEUser2: You are not very aggressive but I think you don’t want me to notice that.

ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVEUser3: You don’t argue with me.ELIZA3: WHY DO YOU THINK I DON’T ARGUE WITH YOUUser4: You are afraid of me.ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU

Page 39: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

NLP Applications

Question answering

Who is the first Taiwanese president?

Text Categorization/Routing

e.g., customer e-mails sent to appropriate expert in a firm.

Text Mining

Find everything about Coffee Abyssinica .

Machine (Assisted) Translation

Language Teaching/Learning

Usage checking, Grammar, Spelling, etc

Spelling correction

It is not just dictionary lookup? Dictionary is Expensive

Page 40: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Application areas

Text-to-Speech & Speech recognition

Natural Language Dialogue Interfaces to Databases

Information Retrieval

Information Extraction

Document Classification

Document Image Analysis

Automatic Summarization

Text Proofreading – Spelling & Grammar

Machine Translation

Story understanding systems

Plagiarism detection

Can you think of anything more ??

Page 41: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Big Challenge

Language = Words + Rules + Exceptions.

Ambiguity is there at all levels..

Very many different languages..

language has a cultural element which creates more variation.

Languages are not equivalent in expression..

Highly systematic but also complex when done by machines.

Keeps changing.. New words, New rules and New exceptions..

Source : Electronic texts / Printed texts / Acoustic Speech Signal..

they are noisy..

Language looks obvious to us.. But it is a Big Deal for computer and

for artificial intelligence

Page 42: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Project Idea42

Page 43: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Project Idea

1. Select topic for your project. The topic should be in line with the

NLP application we have identified (On one of Ethiopian Local

Languages).

2. Gather sufficient number of reference literatures and review

them in detail

3. Produce your project proposal

4. After getting the approval, write literature review having at

least the following sections.

a) Approaches that exists to address the problem

b) The model of the selected approach

c) Detail description of the selected approach

d) Detailed description of the methodology you are planning to follow

e) Detail description of tools you are planning to use

43

Page 44: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

Project Idea

5.After completing the literature review proceed to:

a) Design your system

b) Do the project implementation

c) Test your systems

d) Prepare the report for Submission

e) Present your findings

f) Submit your data, tools used and implementation

together with the implementation guide

44

Page 45: Natural Language Processing (NLP) CS724nlpcs724.weebly.com/uploads/6/6/1/2/66126761/cs724... · Course Title Natural Language Processing Course Description This course is an introduction

End of Topic 1: Introduction