School of Computing, FACULTY OF ENGINEERING. Natural Language Processing aka Computational Linguistics aka Text Analytics: Introduction and overview. Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)


Page 1: Eric Atwell, Language Research Group

School of Computing, FACULTY OF ENGINEERING

Natural Language Processing aka Computational Linguistics aka Text Analytics: Introduction and overview

Eric Atwell, Language Research Group

(with thanks to Katja Markert, Marti Hearst, and other contributors)

Page 2: Eric Atwell, Language Research Group

• Thanks to many others for much of the material; particularly…

• Katja Markert, Reader, School of Computing, Leeds University http://www.comp.leeds.ac.uk/markert http://www.comp.leeds.ac.uk/lng

• Marti Hearst, Associate Professor, School of Information, University of California at Berkeley http://www.ischool.berkeley.edu/people/faculty/martihearst http://courses.ischool.berkeley.edu/i256/f06/sched.html


Page 3: Eric Atwell, Language Research Group

Today

Module Objectives

Why NLP is difficult: language is a complex system

How to solve it? Corpus-based machine-learning approaches

Motivation: applications of “The Language Machine”

Page 4: Eric Atwell, Language Research Group

Objectives

On completion of this module, students should be able to:

• understand theory and terminology of empirical modelling of natural language;

• understand and use algorithms, resources and techniques for implementing and evaluating NLP systems;

• be familiar with some of the main language engineering and text analytics application areas;

• appreciate why unrestricted natural language processing is still a major research task.

Page 5: Eric Atwell, Language Research Group

Goals of this Module

Learn about the problems and possibilities of natural language analysis:

• What are the major issues?

• What are the major solutions?

• How well do they work?

• How do they work?

At the end you should:

• Agree that language is subtle and interesting!

• Feel some ownership over the algorithms

• Be able to assess NLP problems

• Know which solutions to apply when, and how

• Be able to read research papers in the field

Page 6: Eric Atwell, Language Research Group

Why is NLP difficult?

Computers are not brains

• There is evidence that much of language understanding is built into the human brain

Computers do not socialize

• Much of language is about communicating with people

Key problems:

• Representation of meaning

• Language presupposes knowledge about the world

• Language is ambiguous: a message can have many interpretations

• Language presupposes communication between people

Page 7: Eric Atwell, Language Research Group

2001: A Space Odyssey (1968)

Dave Bowman: “Open the pod bay doors, HAL”

HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”

Page 8: Eric Atwell, Language Research Group

Hidden Structure

English plural pronunciation

• Toy + s → toyz ; add z

• Book + s → books ; add s

• Church + s → churchiz ; add iz

• Box + s → boxiz ; add iz

• Sheep + s → sheep ; add nothing

What about new words?

• Bach + ’s → baXs ; why not baXiz?
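The pattern on this slide can be sketched as a rule over the final sound of the word, not its spelling. This is a simplified illustration, not a full phonological model: the sound symbols and the two sets below are assumptions made for the example, with "x" standing for the final sound of "Bach".

```python
# Hypothetical, simplified symbols for the final SOUND of a word (not its spelling).
SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}   # hissing sounds: church, box
VOICELESS = {"p", "t", "k", "f", "th", "x"}     # "x" = final sound of "Bach"


def plural_suffix(final_sound):
    """Return the pronounced plural ending for a word ending in final_sound."""
    if final_sound in SIBILANTS:
        return "iz"   # church -> churchiz, box -> boxiz
    if final_sound in VOICELESS:
        return "s"    # book -> books, Bach -> baXs
    return "z"        # everything else (vowels, voiced sounds): toy -> toyz


for word, sound in [("toy", "oy"), ("book", "k"), ("church", "ch"),
                    ("box", "s"), ("Bach", "x")]:
    print(word, "+", plural_suffix(sound))
```

Under these assumptions "Bach" takes "s" because its final sound is voiceless but not a sibilant, even though the spelling ends in "ch". Irregular words such as "sheep" are not covered by the rule and would have to be memorized.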

Page 9: Eric Atwell, Language Research Group

Language subtleties

Adjective order and placement

• A big black dog

• A big black scary dog

• A big scary dog

• A scary big dog

• A black big dog (sounds odd to native speakers)

Antonyms

• Which sizes go together?

• Big and little

• Big and small

• Large and small

• Large and little (sounds odd to native speakers)

Page 10: Eric Atwell, Language Research Group

World Knowledge is subtle

He arrived at the lecture.

He chuckled at the lecture.

He arrived drunk.

He chuckled drunk.

He chuckled his way through the lecture.

He arrived his way through the lecture.

Page 11: Eric Atwell, Language Research Group

Words are ambiguous: multiple functions and meanings

I know that.

I know that block.

I know that blocks the sun.

I know that block blocks the sun.
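One way to see the scale of the problem is to list every combination of word classes a program would have to consider. The sketch below uses a tiny hand-made lexicon; the tag names and the choice of possible tags per word are assumptions for illustration, not a standard tagset.

```python
from itertools import product

# Hypothetical mini-lexicon: each word maps to the word classes it could have.
LEXICON = {
    "i": ["PRON"],
    "know": ["VERB"],
    "that": ["DET", "PRON", "CONJ"],
    "block": ["NOUN", "VERB"],
    "blocks": ["NOUN", "VERB"],
    "the": ["DET"],
    "sun": ["NOUN"],
}


def tag_sequences(sentence):
    """Enumerate every possible assignment of word classes to the sentence."""
    words = sentence.lower().rstrip(".").split()
    options = [LEXICON[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*options)]


for s in ["I know that.", "I know that block blocks the sun."]:
    print(s, "->", len(tag_sequences(s)), "candidate analyses")
```

Only one of the candidate analyses of the longest sentence is the intended reading; choosing it needs information about the neighbouring words.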

Page 12: Eric Atwell, Language Research Group

How can a machine understand these differences?

• Get the cat with the gloves.

Page 13: Eric Atwell, Language Research Group

How can a machine understand these differences?

• Get the sock from the cat with the gloves.

• Get the glove from the cat with the socks.

Page 14: Eric Atwell, Language Research Group

How can a machine understand these differences?

• Decorate the cake with the frosting.

• Decorate the cake with the kids.

• Throw out the cake with the frosting.

• Throw out the cake with the kids.

Page 15: Eric Atwell, Language Research Group

News Headline Ambiguity

Iraqi Head Seeks Arms

Juvenile Court to Try Shooting Defendant

Teacher Strikes Idle Kids

Kids Make Nutritious Snacks

British Left Waffles on Falkland Islands

Red Tape Holds Up New Bridges

Bush Wins on Budget, but More Lies Ahead

Hospitals are Sued by 7 Foot Doctors

(Headlines leave out punctuation and function-words)

Lynne Truss, 2003. Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation

Page 16: Eric Atwell, Language Research Group

The Role of Memorization

Children learn words quickly

• Around age two they learn about 1 word every 2 hours.

• (Or 9 words/day)

• Often only need one exposure to associate meaning with word

• Can make mistakes, e.g., overgeneralization

“I goed to the store.”

• Exactly how they do this is still under study

Adult vocabulary

• Typical adult: about 60,000 words

• Literate adults: about twice that.

Page 17: Eric Atwell, Language Research Group

The Role of Memorization

Dogs can do word association too!

• Rico, a border collie in Germany

• Knows the names of each of 100 toys

• Can retrieve items called out to him with over 90% accuracy.

• Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child.

http://www.nature.com/news/2004/040607/pf/040607-8_pf.html

Page 18: Eric Atwell, Language Research Group

But there is too much to memorize!

establish

establishment (of the Church of England as the official state church)

disestablishment

antidisestablishment

antidisestablishmentarian

antidisestablishmentarianism: a political philosophy that is opposed to the separation of church and state.

MAYBE we don’t remember every word separately;

MAYBE we remember MORPHEMES and how to combine them
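A minimal sketch of the morpheme idea: store a short list of prefixes and suffixes and peel them off a word. The affix lists and the greedy stripping strategy are assumptions chosen to fit this slide's example; real morphological analysers are considerably more careful.

```python
# Hypothetical affix lists, just large enough for the example word.
PREFIXES = ["anti", "dis"]
SUFFIXES = ["ism", "arian", "ment"]


def segment(word):
    """Greedily strip known prefixes, then known suffixes, from word."""
    prefixes, suffixes = [], []
    changed = True
    while changed:                      # peel prefixes from the left
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p):
                prefixes.append(p)
                word = word[len(p):]
                changed = True
    changed = True
    while changed:                      # peel suffixes from the right
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s)
                word = word[:-len(s)]
                changed = True
    return prefixes + [word] + suffixes


print(segment("antidisestablishmentarianism"))
```

With these lists, six stored pieces are enough to cover all seven words on the slide.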

Page 19: Eric Atwell, Language Research Group

Rules and Memorization

Current thinking in psycholinguistics is that we use a combination of rules and memorization

• However, this is controversial

Mechanism:

• If there is an applicable rule, apply it

• However, if there is a memorized version, that takes precedence. (Important for irregular words.)

• Artists paint “still lifes”

• Not “still lives”

• Past tense of

• think → thought

• blink → blinked

This is a simplification…
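The mechanism on this slide can be sketched as a lookup that falls back to a rule. The table of memorized forms is a hypothetical, tiny sample ("go" is added here to match the "I goed" example from the earlier slide), and the regular rule ignores spelling changes such as consonant doubling.

```python
# Memorized irregular forms take precedence over the rule (hypothetical sample).
IRREGULAR_PAST = {"think": "thought", "go": "went"}


def past_tense(verb):
    """Return the memorized past tense if there is one, else apply the rule."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"       # e.g. bake -> baked
    return verb + "ed"          # the regular rule: blink -> blinked


for v in ["think", "blink", "go"]:
    print(v, "->", past_tense(v))
```

Removing "go" from the table makes the function produce "goed", which is the overgeneralization children make before they have memorized the irregular form.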

Page 20: Eric Atwell, Language Research Group

Representation of Meaning

I know that block blocks the sun.

• How do we represent the meanings of “block”?

• How do we represent “I know”?

• How does that differ from “I know that…”?

• Who/what is “I”?

• How do we indicate that we are talking about earth’s sun vs. some other planet’s sun?

• When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?

Page 21: Eric Atwell, Language Research Group

How to tackle these problems?

The field was stuck for quite some time…

• Linguistic models for a specific example did not generalise

A new approach started around 1990: Corpus Linguistics

• Well, not really new, but from the 1950s to the 1980s researchers didn’t have the text, disk space, or GHz

Main idea: combine memorizing and rules, learn from data

How to do it:

• Get large text collection (a corpus; plural: several corpora)

• Compute statistics over the words in the text collection (corpus)

Surprisingly effective

• Even better now with the Web: Web-as-Corpus research
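As a minimal sketch of "compute statistics over the words", the snippet below counts single words and adjacent word pairs using only the Python standard library. The toy corpus string is made up for the example; a real corpus would be read from files and be many orders of magnitude larger.

```python
from collections import Counter
import re

# Toy stand-in for a corpus (an assumption for illustration only).
corpus = "The cat sat on the mat. The dog sat on the log."


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Frequency of each word."""
    return Counter(tokenize(text))


def bigram_counts(text):
    """Frequency of each pair of adjacent words."""
    toks = tokenize(text)
    return Counter(zip(toks, toks[1:]))


print(word_counts(corpus).most_common(3))
print(bigram_counts(corpus).most_common(3))
```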

Page 22: Eric Atwell, Language Research Group

Example Problem

Grammar checking example:

Which word to use?

<principal> <principle>

Empirical solution: look at which words surround each use:

• I am in my third year as the principal of Anamosa High School.

• School-principal transfers caused some upset.

• This is a simple formulation of the quantum mechanical uncertainty principle.

• Power without principle is barren, but principle without power is futile. (Tony Blair)

Page 23: Eric Atwell, Language Research Group

Using Very Large Corpora

Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:

• Principal: “high school”

• Principle: “rule”

At grammar-check time, choose the spelling best predicted by the probability of co-occurring with surrounding words.

No need to “understand the meaning” !?

Surprising results:

• Log-linear improvement even to a billion words!

• Getting more data is better than fine-tuning algorithms!
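The neighbour-counting idea can be sketched in a few lines. The training sentences are the four examples from the previous slide; the scoring (add-one-smoothed log probabilities of the context words) is one simple choice made for this illustration and is not claimed to be the method used by Banko & Brill.

```python
from collections import Counter
import math
import re

CONFUSABLE = ("principal", "principle")

# The four example sentences from the previous slide, used as a tiny corpus.
TRAINING = [
    "I am in my third year as the principal of Anamosa High School.",
    "School-principal transfers caused some upset.",
    "This is a simple formulation of the quantum mechanical uncertainty principle.",
    "Power without principle is barren, but principle without power is futile.",
]


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def train(sentences):
    """For each spelling, count the words that appear in the same sentence."""
    counts = {w: Counter() for w in CONFUSABLE}
    for sentence in sentences:
        toks = tokens(sentence)
        for target in CONFUSABLE:
            if target in toks:
                counts[target].update(t for t in toks if t not in CONFUSABLE)
    return counts


def choose(context, counts):
    """Pick the spelling whose known neighbours best match the context."""
    vocab = set().union(*counts.values())
    context_toks = [t for t in tokens(context) if t not in CONFUSABLE]

    def score(spelling):
        total = sum(counts[spelling].values())
        return sum(math.log((counts[spelling][t] + 1) / (total + len(vocab) + 1))
                   for t in context_toks)

    return max(CONFUSABLE, key=score)


counts = train(TRAINING)
print(choose("She is the ___ of the high school", counts))
print(choose("a simple quantum ___ without power", counts))
```

With only four training sentences the choice rests on a handful of counts, which is why the amount of training text matters so much.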

Page 24: Eric Atwell, Language Research Group

The Effects of LARGE Datasets

From Banko & Brill, 2001. Scaling to Very Very Large Corpora for Natural Language Disambiguation, Proc ACL

Page 25: Eric Atwell, Language Research Group

Motivation: Real-World Applications of NLP

Spelling Suggestions/Corrections

Grammar Checking

Synonym Generation

Information Extraction

Text Categorization

Automated Customer Service

Speech Recognition

Machine Translation

Question Answering

Chatbots

Improving Web Search Engine results

Automated Metadata Assignment

Online Dialogs

Page 26: Eric Atwell, Language Research Group

Machine Translation

Page 27: Eric Atwell, Language Research Group

Information Retrieval, e.g. Google … and scholar, books, products, AdWords, AdSense

Page 28: Eric Atwell, Language Research Group

Synonym Generation

Page 29: Eric Atwell, Language Research Group

Programming: Python and NLTK

Python: A suitable programming language

• Interpreted – easy to test ideas

• Object-oriented

• Easy to interface to other things (web, DBMS, Tk)

• Data structures, OO concepts etc. from Java, Lisp, Tcl, Perl

• Easy to learn, FUN! (?)

• Python NLTK: Natural Language Tool Kit with demos and tutorials

Suggested private study this week:

• Install Python and NLTK on your own PC: http://www.nltk.org/

• Read “The Language Machine” http://www.comp.leeds.ac.uk/eric/atwell99bc.pdf

• Read NLTK “Getting Started” http://www.nltk.org/getting-started

Page 30: Eric Atwell, Language Research Group

Summary: Intro to NLP

Module Objectives: learn about NLP and how to apply it

Why NLP is difficult: language is a complex system

How to solve it? Corpus-based machine-learning approaches

Motivation: applications of “The Language Machine”