
807 - TEXT ANALYTICS

Massimo Poesio

Lecture 5: Named Entity Recognition

Text Classification at Different Granularities

• Text Categorization:
  – Classify an entire document
• Information Extraction (IE):
  – Identify and classify small units within documents
• Named Entity Extraction (NE):
  – A subset of IE
  – Identify and classify proper names: people, locations, organizations

Adapted from slide by William Cohen

What is Information Extraction?

Filling slots in a database from sub-segments of text. As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Adapted from slide by William Cohen


IE fills the slots:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Adapted from slide by William Cohen

What is Information Extraction?

Information Extraction = segmentation + classification + association. As a family of techniques:


Applied to the same story, it yields the segments:

Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation

aka “named entity extraction”

Adapted from slide by William Cohen


INFORMATION EXTRACTION

• More general definition: extraction of structured information from unstructured documents

• IE tasks:
  – Named entity extraction
    • Named entity recognition
    • Coreference resolution
    • Relationship extraction
  – Semi-structured IE
    • Table extraction
  – Terminology extraction

Adapted from slide by William Cohen

Landscape of IE Tasks: Degree of Formatting

• Text paragraphs without formatting, e.g.:

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables

Adapted from slide by William Cohen

Landscape of IE Tasks: Intended Breadth of Coverage

Web site specific       Genre specific   Wide, non-specific
Amazon.com book pages   Resumes          University names
Formatting              Layout           Language

Landscape of IE Tasks: Complexity

Closed set (e.g., U.S. states):
  He was born in Alabama…
  The big Wyoming sky…

Regular set (e.g., U.S. phone numbers):
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299

Complex pattern (e.g., U.S. postal addresses):
  University of Arkansas
  P.O. Box 140
  Hope, AR 71802

  Headquarters:
  1128 Main Street, 4th Floor
  Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (e.g., person names):
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.

Adapted from slide by William Cohen

Landscape of IE Tasks: Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity ("named entity" extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship:
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut

N-ary record:
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt

Adapted from slide by William Cohen

State of the Art Performance: A Sample

• Named entity recognition from newswire text
  – Person, Location, Organization, …
  – F1 in the high 80s or low to mid 90s
• Binary relation extraction
  – Contained-in(Location1, Location2), Member-of(Person1, Organization1)
  – F1 in the 60s, 70s, or 80s
• Web site structure recognition
  – Extremely accurate performance obtainable
  – Human effort (~10 min?) required on each site

Slide by Chris Manning, based on slides by several others

Three generations of IE systems

• Hand-built systems: knowledge engineering [1980s– ]
  – Rules written by hand
  – Require experts who understand both the systems and the domain
  – Iterative guess-test-tweak-repeat cycle
• Automatic, trainable rule-extraction systems [1990s– ]
  – Rules discovered automatically from predefined templates, using automated rule learners
  – Require huge labeled corpora (the effort is just moved!)
• Statistical models [1997– ]
  – Use machine learning to learn which features indicate boundaries and types of entities
  – Learning usually supervised; may be partially unsupervised

Named Entity Recognition (NER)

Input:

Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.

Output (entity mentions marked; highlighted in the original slide):

[ORG Apple Inc.], formerly [ORG Apple Computer, Inc.], is an American multinational corporation headquartered in [LOC Cupertino, California] that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on [DATE April 1, 1976], by [PER Steve Jobs], [PER Steve Wozniak] and [PER Ronald Wayne].

Named Entity Recognition (NER)

• Locate and classify atomic elements in text into predefined categories (persons, organizations, locations, temporal expressions, quantities, percentages, monetary values, …)
• Input: a block of text
  – Jim bought 300 shares of Acme Corp. in 2006.
• Output: an annotated block of text
  – <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>
  – ENAMEX tags (as in MUC in the 1990s)
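As a concrete illustration (an assumption, not part of the original slides): a modern off-the-shelf tagger such as spaCy produces this kind of annotation in a few lines, although its label set (PERSON, ORG, DATE, CARDINAL) differs from the MUC tags. This sketch assumes the en_core_web_sm model has been downloaded:

    import spacy

    nlp = spacy.load("en_core_web_sm")   # pretrained English pipeline
    doc = nlp("Jim bought 300 shares of Acme Corp. in 2006.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Expected output along these lines:
    #   Jim PERSON
    #   300 CARDINAL
    #   Acme Corp. ORG
    #   2006 DATE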

THE STANDARD NEWS DOMAIN

• Most work on NER focuses on NEWS
• Variants of the repertoire of entity types first studied in MUC and then in ACE:
  – PERSON
  – ORGANIZATION
  – GPE (geo-political entity)
  – LOCATION
  – TEMPORAL ENTITY
  – NUMBER

HOW

• Two tasks:
  – Identifying the part of the text that mentions an entity (RECOGNITION)
  – Classifying it (CLASSIFICATION)
• The two tasks are reduced to a standard classification task by having the system classify WORDS

Basic Problems in NER

• Variation of NEs, e.g. John Smith, Mr Smith, John
• Ambiguity of NE types:
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"

Problems in NER

• Category definitions are intuitively quite clear, but there are many grey areas.
• Many of these grey areas are caused by metonymy:
  – Organisation vs. Location: "England won the World Cup" vs. "The World Cup took place in England"
  – Company vs. Artefact: "shares in MTV" vs. "watching MTV"
  – Location vs. Organisation: "she met him at Heathrow" vs. "the Heathrow authorities"

Solutions

• The task definition must be very clearly specified at the outset.
• The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and the "logic" behind the intuition.
• MUC essentially adopted the simplistic approach of disregarding metonymous uses of words: e.g., "England" was always identified as a location. However, this is not always useful for practical applications of NER (e.g., the football domain).
• Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.

More complex problems in NER

• Issues of style, structure, domain, genre, etc.
• Punctuation, spelling, spacing, and formatting all have an impact:

  Dept. of Computing and Maths
  Manchester Metropolitan University
  Manchester
  United Kingdom

  > Tell me more about Leonardo
  > Da Vinci

Approaches to NER: List Lookup

• A system that recognises only the entities stored in its lists (GAZETTEERS).
• Advantages: simple, fast, language independent, easy to retarget
• Disadvantages: collection and maintenance of the lists; cannot deal with name variants; cannot resolve ambiguity
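A minimal sketch of list lookup in Python (the gazetteer entries are invented for illustration); it takes the longest gazetteer match at each position:

    def gazetteer_ner(tokens, gazetteer, max_len=4):
        """Greedy longest-match lookup of token spans in a gazetteer.

        gazetteer maps tuples of tokens to an entity type,
        e.g. {("New", "York"): "LOC", ("Acme", "Corp."): "ORG"}.
        """
        entities, i = [], 0
        while i < len(tokens):
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                span = tuple(tokens[i:i + n])
                if span in gazetteer:
                    entities.append((i, i + n, gazetteer[span]))
                    i += n
                    break
            else:
                i += 1  # no match starting here; move on
        return entities

    # gazetteer_ner("He flew to New York".split(), {("New", "York"): "LOC"})
    # -> [(3, 5, 'LOC')]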

Approaches to NER: Shallow Parsing

• Names often have internal structure. These components can be either stored or guessed:

  Location: CapWord + {City, Forest, Center}        e.g. Sherwood Forest
  CapWord + {Street, Boulevard, Avenue, Crescent, Road}   e.g. Portobello Street

Shallow Parsing Approach (e.g., Mikheev et al. 1998)

• External evidence: names are often used in very predictive local contexts:

  Location:
    "to the" COMPASS "of" CapWord       e.g. to the south of Loitokitok
    "based in" CapWord                  e.g. based in Loitokitok
    CapWord "is a" (ADJ)? GeoWord       e.g. Loitokitok is a friendly city
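These patterns translate almost directly into regular expressions. A sketch (the COMPASS alternatives and suffix lists are abbreviated for illustration):

    import re

    CAP = r"[A-Z][a-z]+"                      # a capitalised word
    COMPASS = r"(?:north|south|east|west)"

    patterns = [
        # Internal evidence: CapWord + location suffix, e.g. "Sherwood Forest"
        (re.compile(rf"({CAP})\s+(?:City|Forest|Center|Street|Boulevard|Avenue|Crescent|Road)"), "LOC"),
        # External evidence: "to the south of Loitokitok", "based in Loitokitok"
        (re.compile(rf"to the {COMPASS} of ({CAP})"), "LOC"),
        (re.compile(rf"based in ({CAP})"), "LOC"),
    ]

    def match_locations(text):
        return [(m.group(0), label) for regex, label in patterns
                for m in regex.finditer(text)]

    # match_locations("He was based in Loitokitok, to the south of Nairobi.")
    # -> [('to the south of Nairobi', 'LOC'), ('based in Loitokitok', 'LOC')]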

Difficulties in the Shallow Parsing Approach

• Ambiguously capitalised words (first word in a sentence):
  [All American Bank] vs. All [State Police]
• Semantic ambiguity:
  "John F. Kennedy" = airport (location)
  "Philip Morris" = organisation
• Structural ambiguity:
  [Cable and Wireless] vs. [Microsoft] and [Dell]
  [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

Machine learning approaches to NER

• NER as classification: the IOB representation
• Supervised methods:
  – Support Vector Machines
  – Logistic regression (aka Maximum Entropy)
  – Sequence pattern learning
  – Hidden Markov Models
  – Conditional Random Fields
• Distant learning
• Semi-supervised methods

THE ML APPROACH TO NE: THE IOB REPRESENTATION
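In the IOB scheme, each token is labelled B-TYPE at the beginning of an entity mention, I-TYPE inside a mention, and O outside any mention. For instance, on the running example (an illustration in the IOB2 variant, where every mention starts with B-; the original IOB1 scheme, used in the CoNLL data later in this lecture, uses B- only between adjacent mentions of the same type):

  Jim     B-PER
  bought  O
  300     O
  shares  O
  of      O
  Acme    B-ORG
  Corp.   I-ORG
  in      O
  2006    O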

THE ML APPROACH TO NE: FEATURES
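A sketch of the kind of word-level features commonly used (illustrative; exact feature sets vary by system and may differ from the tables on the original slides):

    def token_features(tokens, i):
        """Feature dict for token i: identity, shape, affixes, and neighbours."""
        tok = tokens[i]
        return {
            "word": tok.lower(),
            "is_capitalized": tok[:1].isupper(),
            "is_all_caps": tok.isupper(),
            "has_digit": any(c.isdigit() for c in tok),
            "prefix3": tok[:3],
            "suffix3": tok[-3:],
            "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        }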

Supervised ML for NER

• Methods already seen:
  – Decision trees
  – Support Vector Machines
• Sequence learning:
  – Hidden Markov Models
  – Maximum Entropy Models
  – Conditional Random Fields

NER as a SEQUENCE CLASSIFICATION TASK

Sequence Labeling as Classification: POS Tagging

Classify each token independently, but use as input features information about the surrounding tokens (sliding window). Running the classifier across the sentence, one token at a time, yields:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN

Slides from Ray Mooney

Using Outputs as Inputs

• Better input features are usually the categories of the surrounding tokens, but these are not available yet.
• We can use the categories of either the preceding or succeeding tokens by going forward or backward through the sequence and reusing the previous outputs.

Slide from Ray Mooney
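A sketch of this idea in Python (illustrative: `classify` stands in for any trained classifier mapping a feature dict to a tag; it is not defined on the slides):

    def forward_classify(tokens, classify):
        """Tag left to right, feeding each previous prediction back in as a feature."""
        tags = []
        for i, tok in enumerate(tokens):
            features = {
                "word": tok,
                "prev_word": tokens[i - 1] if i > 0 else "<s>",
                "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
                "prev_tag": tags[-1] if tags else "<s>",  # earlier output reused as input
            }
            tags.append(classify(features))
        return tags

    # Backward classification is the mirror image: iterate over the reversed
    # sentence and feed in the tag of the *following* token instead.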

Forward Classification

Classify the tokens left to right; each prediction joins the growing left context and is available as a feature for the next token:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB …

Slides from Ray Mooney

Backward Classification

• Disambiguating "to" in this case would be even easier backward.

Classify the tokens right to left; each prediction joins the growing right context and is available as a feature for the next token to its left. The final sequence (the fourth token, the noun "saw", ends up tagged VBD here):

John/NNP saw/VBD the/DT saw/VBD and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN

Slides from Ray Mooney

NER as Sequence Labeling

Probabilistic Sequence Models

• Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determining the most likely global assignment.
• Two standard models:
  – Hidden Markov Model (HMM)
  – Conditional Random Field (CRF)
• The Maximum Entropy Markov Model (MEMM) can be seen as a simplified, locally normalized version of the CRF.

Hidden Markov Models (HMMs)

• Generative: find parameters to maximize P(X, Y)
• Assumes features are independent
• When labeling X_i, future observations are taken into account (forward-backward)

Conditional Random Fields (CRFs)

• Discriminative: find parameters to maximize P(Y | X)
• Doesn't assume that features are independent
• When labeling Y_i, future observations are taken into account
• The best of both worlds! (A minimal training sketch follows.)
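As a concrete illustration (an assumption; the slides do not prescribe a toolkit), the sklearn-crfsuite package trains a linear-chain CRF directly from per-token feature dicts and IOB tags:

    import sklearn_crfsuite

    # Toy training data: one sentence as a list of feature dicts plus IOB tags.
    # Real systems use the richer feature sets discussed earlier.
    X_train = [[{"word": "Jim", "is_capitalized": True},
                {"word": "bought", "is_capitalized": False},
                {"word": "Acme", "is_capitalized": True}]]
    y_train = [["B-PER", "O", "B-ORG"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))  # e.g. [['B-PER', 'O', 'B-ORG']]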


PROBABILISTIC CLASSIFICATION: GENERATIVE VS DISCRIMINATIVE

• Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
• Let X be the random variable describing an instance, a vector of values for n features <X1, X2, …, Xn>; let xk be a possible vector value for X and xij a possible value for Xi.
• For classification, we need to compute P(Y = yi | X = xk) for i = 1…m.
• This could be done using the joint distribution, but that requires estimating an exponential number of parameters.

Discriminative vs. Generative

• Generative model: a model of the joint distribution p(y, x), i.e. a model that can randomly generate the observed data.
  – Naïve Bayes: once the class label is known, all the features are independent.
• Discriminative model: directly estimate the posterior probability p(y | x); aim at modeling the "discrimination" between different outputs.
  – MaxEnt classifier: a linear combination of feature functions in the exponent.
• Both generative and discriminative models describe distributions over (y, x), but they work in different directions.


Simple Linear Chain CRF Features

• Modeling the conditional distribution is similar to that used in multinomial logistic regression.
• Create feature functions f_k(Y_t, Y_{t-1}, X_t):
  – A feature for each state-transition pair (i, j):
    f_{i,j}(Y_t, Y_{t-1}, X_t) = 1 if Y_t = i and Y_{t-1} = j, and 0 otherwise
  – A feature for each state-observation pair (i, o):
    f_{i,o}(Y_t, Y_{t-1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise
• Note: the number of features grows quadratically in the number of states (i.e. tags).
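A direct transcription of these indicator features into Python (a sketch; real CRF packages generate such features internally):

    def make_transition_feature(i, j):
        """1 if the current state is i and the previous state is j."""
        return lambda y_t, y_prev, x_t: 1 if (y_t == i and y_prev == j) else 0

    def make_observation_feature(i, o):
        """1 if the current state is i and the current observation is o."""
        return lambda y_t, y_prev, x_t: 1 if (y_t == i and x_t == o) else 0

    # One feature per state pair and per state/word pair:
    states = ["B-PER", "I-PER", "O"]
    vocab = ["Jim", "bought", "shares"]
    features = ([make_transition_feature(i, j) for i in states for j in states] +
                [make_observation_feature(i, o) for i in states for o in vocab])
    # len(features) == 3*3 + 3*3 == 18; the transition part grows quadratically.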


Conditional Distribution for a Linear Chain CRF

• Using these feature functions for a simple linear chain CRF, we can define:

  P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(Y_t, Y_{t-1}, X_t) \right)

  Z(X) = \sum_{Y} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(Y_t, Y_{t-1}, X_t) \right)

Adding Token Features to a CRF

• Can add token features X_{i,j}: each position t then carries a feature vector X_{t,1}, …, X_{t,m} instead of a single observation, alongside the label chain Y_1, Y_2, …, Y_T.
• Can add additional feature functions for each token feature to model the conditional distribution.

NER: EVALUATION

TYPICAL PERFORMANCE

NER Evaluation Campaigns

• English NER (CoNLL 2003): PER/ORG/LOC/MISC
  – Training set: 203,621 tokens
  – Development set: 51,362 tokens
  – Test set: 46,435 tokens
• Italian NER (Evalita 2009): PER/ORG/LOC/GPE
  – Development set: 223,706 tokens
  – Test set: 90,556 tokens
• Mention Detection (ACE 2005):
  – 599 documents

CoNLL 2003 shared task (1)

• English and German
• 4 types of NEs:
  – LOC Location
  – MISC Names of miscellaneous entities
  – ORG Organization
  – PER Person
• Training set for developing the system
• Test data for the final evaluation

CoNLL 2003 shared task (2)

• Data:
  – Columns separated by a single space
  – One word per line
  – An empty line after each sentence
  – Tags in IOB format
• An example:

  Milan  NNP B-NP I-ORG
  's     POS B-NP O
  player NN  I-NP O
  George NNP I-NP I-PER
  Weah   NNP I-NP I-PER
  meet   VBP B-VP O
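A sketch of a reader for this format (a direct transcription of the layout above: word, POS, chunk, NE tag per line, sentences separated by blank lines):

    def read_conll(path):
        """Yield sentences as lists of (word, pos, chunk, ne_tag) tuples."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:            # blank line ends a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                else:                   # columns are space-separated
                    word, pos, chunk, ne = line.split()
                    sentence.append((word, pos, chunk, ne))
        if sentence:                    # file may not end with a blank line
            yield sentence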

CoNLL 2003 shared task (3)

English     precision   recall    F1
[FIJZ03]    88.99%      88.54%    88.76%
[CN03]      88.12%      88.51%    88.31%
[KSNM03]    85.93%      86.21%    86.07%
[ZJ03]      86.13%      84.88%    85.50%
-----------------------------------------
[Ham03]     69.09%      53.26%    60.15%
baseline    71.91%      50.90%    59.61%

CURRENT RESEARCH ON NER

• New domains
• New approaches:
  – Semi-supervised
  – Distant
• Handling many NE types
• Integration with machine translation
• Handling difficult linguistic phenomena such as metonymy

NEW DOMAINS

• BIOMEDICAL
• CHEMISTRY
• HUMANITIES: MORE FINE-GRAINED TYPES

Bioinformatics Named Entities

• Protein
• DNA
• RNA
• Cell line
• Cell type
• Drug
• Chemical

NER IN THE HUMANITIES

• Example entity types: LOC, SITE, CULTURE

Semi-supervised learning

• Modest amounts of supervision:
  – Small training data
  – Supervisor input sought when necessary
• Aims to match supervised learning performance, but with much less human effort
• Bootstrapping (a generic sketch follows this slide):
  – Seeds used to identify contextual clues
  – Contextual clues used to find more NEs
• Examples: Brin (1998); Collins and Singer (1999); Riloff and Jones (1999); Cucchiarelli and Velardi (2001); Pasca et al. (2006); Heng and Grishman (2006); Nadeau et al. (2006); Liao and Veeramachaneni (2009)
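A minimal sketch of the bootstrapping loop common to these systems (the function arguments are placeholders; the concrete pattern induction and ranking steps are described for ASemiNER below):

    def bootstrap(seeds, corpus, iterations,
                  induce_patterns, extract_instances, rank):
        """Grow a seed set of NEs by alternating pattern induction and extraction."""
        instances = set(seeds)
        patterns = set()
        for _ in range(iterations):
            # 1. Find contextual patterns around the known instances.
            patterns |= induce_patterns(instances, corpus)
            # 2. Use the patterns to extract candidate instances.
            candidates = extract_instances(patterns, corpus)
            # 3. Keep only the most reliable candidates (e.g. by PMI; see below).
            instances |= set(rank(candidates, patterns, corpus))
        return instances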

Semi-supervised learning

• Input: a seed list of a few examples of a given NE type
  – e.g., 'Muhammad' and 'Obama' can be used as seed examples for entities of type person
• Parameters:
  – Number of iterations
  – Number of initial seeds
  – The ranking measure (reliability measure)

ASemiNER: Methodology

• Sentences containing a seed instance are retrieved.
• A number of tokens on each side of the seed is extracted, respecting sentence boundaries.

Pattern Induction: Initial Patterns

• TP pair = (Token/POS) pair
• Noun tokens are kept in their inflected forms.

Pattern Induction: Final Patterns

• Verb tokens are reduced to their stems.
• Lists of trigger nouns (e.g., alsayd 'Mr.', alraeys 'President', alduktur 'Dr.'):
  – Used as Arabic NE indicators or trigger words in the training phase.
  – Arabic Wikipedia articles are crawled randomly, prepared, and POS-tagged.
  – The nouns to the left and right of the named entity are extracted and collected.
  – The most frequent nouns (in inflected form) are picked and stored as "trigger" nouns.
• Lists of trigger verbs (e.g., rasam 'draw', naHat 'sculpt', etc.):
  – The most frequent verbs (as stems) are picked and stored as "trigger" verbs.

Pattern Induction: Final Patterns, "Trigger" Words

• Generalization (sketched in code after this slide):
  – TP pairs that contain nouns or verbs are stripped of their 'Token' parts, unless those tokens are in the corresponding lists of trigger words:
    alsayd/NN 'Mr./NN' stays alsayd/NN, as alsayd 'Mr.' is in the list of trigger nouns
    qalam/NN 'pen/NN' becomes /NN, as qalam 'pen' is not among the trigger nouns
  – TP pairs that contain a preposition are kept without changes.
  – TP pairs that contain other parts of speech (e.g., proper noun, adjective, coordinating conjunction) are stripped of their 'Token' parts:
    mufyd/JJ 'useful/JJ' becomes /JJ
• All POS tags used for verbs (e.g., VBP, VBD, VBN) are converted to one form, VB.
• All POS tags used for nouns (e.g., NN, NNS) are converted to one form, NN.
• All POS tags used for proper nouns (e.g., NNP, NNPS) are converted to one form, NNP.
• The seed instance is replaced with an NE class tag (e.g., <PersonName>, <Location>, <Organization>).
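A sketch of this generalization step in Python (a direct transcription of the rules above; trigger lists are passed in as sets, and the seed-replacement step is omitted):

    def generalize(tp_pairs, trigger_nouns, trigger_verbs):
        """Turn an initial (token, pos) pattern into a generalized final pattern."""
        out = []
        for token, pos in tp_pairs:
            if pos.startswith("VB"):
                pos = "VB"                      # collapse all verb tags
                keep = token in trigger_verbs
            elif pos.startswith("NNP"):
                pos = "NNP"                     # collapse proper-noun tags
                keep = False
            elif pos.startswith("NN"):
                pos = "NN"                      # collapse common-noun tags
                keep = token in trigger_nouns
            elif pos == "IN":                   # prepositions kept unchanged
                keep = True
            else:                               # adjectives, conjunctions, ...
                keep = False
            out.append((token if keep else "", pos))
        return out

    # generalize([("alsayd", "NN"), ("qalam", "NNS")], {"alsayd"}, set())
    # -> [('alsayd', 'NN'), ('', 'NN')]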

Pattern Induction

• Producing final patterns
  (the slides show a table of example initial patterns, the final patterns derived from them, and their English glosses, plus two more final patterns)

Pattern Induction

• The final pattern set (P) is modified and filtered every time a new pattern is added.
• Repeated patterns are rejected.
• A pattern consisting of fewer than six TP pairs should contain at least one 'Token' part:
  – /VB /NN <PersonName>/NNP /NNP

ASemiNER: Methodology (Instance Extraction)

• ASemiNER retrieves from the training corpus the set of instances (I) that match any of the patterns in (P), using regular expressions (regexes).
• ASemiNER automatically generates regexes from the final patterns without modification, regardless of the correctness of the POS tags that the tagger assigned to proper nouns.
• ASemiNER automatically adds the average NE length (2 tokens) to the produced regexes.

ASemiNER: Methodology (Instance Ranking/Selection)

• Extracted instances in (I) are ranked according to:
  – The number of distinct patterns that extract them (pattern variety is a better cue to semantics than absolute frequency)
  – Pointwise Mutual Information (PMI; see the formula below), computed from:
    • |i, p|: the frequency of instance i being extracted by pattern p
    • |i|: the frequency of instance i in the corpus
    • |p|: the frequency of pattern p in the corpus
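The slide defines the counts but leaves the formula implicit; the usual PMI-style reliability score over these quantities would be (an assumption based on the standard definition):

  PMI(i, p) = \log \frac{|i, p|}{|i| \cdot |p|}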

ASemiNER: Methodology (Instance Selection)

• Keep the top m instances, where m is set to the number of instances in the previous iteration + 1.

Experiments & Results

• Several experiments – Different values of the parameters

• No. of iterations.

• No. of initial seeds.

• Ranking measure.

• Training data– ACE 2005 (Linguistic Data Consortium, LDC)

– ANERcorp training set (Benajiba et al. 2007)

• Test data– ANERcorp test corpus

Experiments & Results

• Several experiments

– Standard NE types:

• Person

• Location

• Organization

– Specialised NE types:

• Politicians

• Sportspersons

• Artists

Simple Models (standard NE types)

• ANERcorp (training data)
• Without iterations
• Number of initial seeds: 5

ASemiNER (Specialised NE Types)

• Politicians, Artists, and Sportspersons
• Unlike supervised techniques, ASemiNER does not require additional annotated training data or re-annotation of the existing data.
• It requires only a minor modification: for each new NE type, generate new trigger-noun and trigger-verb lists.
  – Artists trigger nouns (e.g., actress, actor, painter, etc.)
  – Politicians trigger nouns (e.g., president, party, king, etc.)
  – Sportspersons trigger nouns (e.g., player, football, athletic, etc.)

ASemiNER (Specialised NE Types)

• On the specialised types, ASemiNER performs as well as it does on recognizing the standard person category.
• ASemiNER proved easily adaptable to extracting new types of NEs.

WIKIPEDIA AND NER

• Wikipedia:

  Giotto was called to work in Padua, and also in Rimini

Slide by Truc-Vien T. Nguyen, May 2012

THANKS

• I used slides from Bernardo Magnini, Chris Manning, Roberto Zanoli, Ray Mooney
