807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition


807 - TEXT ANALYTICS

Massimo Poesio

Lecture 5: Named Entity Recognition


Text Classification at Different Granularities

• Text Categorization:
  – Classify an entire document

• Information Extraction (IE):
  – Identify and classify small units within documents

• Named Entity Extraction (NE):
  – A subset of IE
  – Identify and classify proper names
    • People, locations, organizations


Adapted from slide by William Cohen

What is Information Extraction

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Adapted from slide by William Cohen

What is Information Extraction

As a family of techniques: Information Extraction = segmentation + classification + association

Segments extracted from the article above: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

aka “named entity extraction”


INFORMATION EXTRACTION

• More general definition: extraction of structured information from unstructured documents

• IE Tasks:
  – Named entity extraction
    • Named entity recognition
    • Coreference resolution
    • Relationship extraction

• Semi-structured IE
  – Table extraction

• Terminology extraction


Adapted from slide by William Cohen

Landscape of IE Tasks: Degree of Formatting

• Text paragraphs without formatting

  Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

• Grammatical sentences and some formatting & links

• Non-grammatical snippets, rich formatting & links

• Tables


Adapted from slide by William Cohen

Landscape of IE Tasks: Intended Breadth of Coverage

• Web site specific (formatting): Amazon.com Book Pages

• Genre specific (layout): Resumes

• Wide, non-specific (language): University Names


Landscape of IE Tasks: Complexity

• Closed set: U.S. states
  – He was born in Alabama…
  – The big Wyoming sky…

• Regular set: U.S. phone numbers
  – Phone: (413) 545-1323
  – The CALD main office can be reached at 412-268-1299

• Complex pattern: U.S. postal addresses
  – University of Arkansas, P.O. Box 140, Hope, AR 71802
  – Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

• Ambiguous patterns, needing context and many sources of evidence: person names
  – …was among the six houses sold by Hope Feldman that year.
  – Pawel Opalinski, Software Engineer at WhizBang Labs.
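The "regular set" case can be handled by a regular expression alone. The sketch below is a minimal illustration covering only the two phone formats shown on the slide; the pattern `US_PHONE` and the helper `find_phone_numbers` are hypothetical names, and real phone data would need more cases.

```python
import re

# Assumed pattern: covers "(413) 545-1323" and "412-268-1299" only.
US_PHONE = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}-)\d{3}-\d{4}\b")


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring of `text` that matches US_PHONE."""
    return US_PHONE.findall(text)


examples = [
    "Phone: (413) 545-1323",
    "The CALD main office can be reached at 412-268-1299",
    "He was born in Alabama",
]
for line in examples:
    print(find_phone_numbers(line))
```

Person names, by contrast, cannot be captured this way: "Hope" is a town in one example and a first name in another, which is why the last category needs context and several sources of evidence.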


Adapted from slide by William Cohen

Landscape of IE Tasks: Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

• Single entity (“named entity” extraction)
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

• Binary relationship
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut

• N-ary record
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt


Adapted from slide by William Cohen

State of the Art Performance: a sample

• Named entity recognition from newswire text
  – Person, Location, Organization, …
  – F1 in high 80’s or low- to mid-90’s

• Binary relation extraction
  – Contained-in (Location1, Location2)
  – Member-of (Person1, Organization1)
  – F1 in 60’s or 70’s or 80’s

• Web site structure recognition
  – Extremely accurate performance obtainable
  – Human effort (~10min?) required on each site


Slide by Chris Manning, based on slides by several others

Three generations of IE systems

• Hand-Built Systems – Knowledge Engineering [1980s– ]
  – Rules written by hand
  – Require experts who understand both the systems and the domain
  – Iterative guess-test-tweak-repeat cycle

• Automatic, Trainable Rule-Extraction Systems [1990s– ]
  – Rules discovered automatically using predefined templates, using automated rule learners
  – Require huge, labeled corpora (effort is just moved!)

• Statistical Models [1997 – ]
  – Use machine learning to learn which features indicate boundaries and types of entities.
  – Learning usually supervised; may be partially unsupervised


Named Entity Recognition (NER)

Input:

Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.

Output:

(The same passage with its named entities marked; the markup is not preserved in this text.)


Named Entity Recognition (NER)

• Locate and classify atomic elements in text into predefined categories (persons, organizations, locations, temporal expressions, quantities, percentages, monetary values, …)

• Input: a block of text
  – Jim bought 300 shares of Acme Corp. in 2006.

• Output: annotated block of text
  – <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
  – ENAMEX tags (MUC in the 1990s)
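To make the annotation format concrete, here is a small sketch that reads the inline tags back out of the example sentence. The regular expression and the helper names `extract_entities` and `strip_tags` are illustrative assumptions, not part of any MUC tooling, and the pattern does not handle nested tags.

```python
import re

# Matches inline tags such as <ENAMEX TYPE="PERSON">Jim</ENAMEX>.
TAG = re.compile(r'<(ENAMEX|NUMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>')


def extract_entities(annotated: str) -> list[tuple[str, str, str]]:
    """Return (tag, type, text) triples in document order."""
    return [m.groups() for m in TAG.finditer(annotated)]


def strip_tags(annotated: str) -> str:
    """Recover the plain input text by dropping the markup."""
    return TAG.sub(lambda m: m.group(3), annotated)


annotated = (
    '<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought '
    '<NUMEX TYPE="QUANTITY">300</NUMEX> shares of '
    '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
    '<TIMEX TYPE="DATE">2006</TIMEX>.'
)
for tag, etype, text in extract_entities(annotated):
    print(f"{text!r:14} {tag:7} {etype}")
print(strip_tags(annotated))
```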


THE STANDARD NEWS DOMAIN

• Most work on NER focuses on
  – NEWS
  – Variants of the repertoire of entity types first studied in MUC and then in ACE:
    • PERSON
    • ORGANIZATION
      – GPE
    • LOCATION
    • TEMPORAL ENTITY
    • NUMBER


HOW

• Two tasks:
  – Identifying the part of the text that mentions an entity (RECOGNITION)
  – Classifying it (CLASSIFICATION)

• The two tasks are reduced to a standard classification task by having the system classify WORDS


Basic Problems in NER

• Variation of NEs – e.g. John Smith, Mr Smith, John.

• Ambiguity of NE types
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)

• Ambiguity with common words, e.g. “may”


Problems in NER

• Category definitions are intuitively quite clear, but there are many grey areas.

• Many of these grey areas are caused by metonymy.
  – Organisation vs. Location: “England won the World Cup” vs. “The World Cup took place in England”.
  – Company vs. Artefact: “shares in MTV” vs. “watching MTV”
  – Location vs. Organisation: “she met him at Heathrow” vs. “the Heathrow authorities”


Solutions

• The task definition must be very clearly specified at the outset.

• The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and the “logic” behind the intuition.

• MUC essentially adopted the simplistic approach of disregarding metonymous uses of words, e.g. “England” was always identified as a location. However, this is not always useful for practical applications of NER (e.g. football domain).

• Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.


More complex problems in NER

• Issues of style, structure, domain, genre etc.
  – Punctuation, spelling, spacing, formatting, … all have an impact

    Dept. of Computing and Maths
    Manchester Metropolitan University
    Manchester
    United Kingdom

    > Tell me more about Leonardo
    > Da Vinci


Approaches to NER: List Lookup

• System that recognises only entities stored in its lists (GAZETTEERS).

• Advantages - Simple, fast, language independent, easy to retarget

• Disadvantages – collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
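A list-lookup recogniser can be sketched in a few lines. The gazetteer entries below are toy data chosen for illustration, and `lookup` is a hypothetical helper using greedy longest-match over tokens; it is a sketch of the idea, not a reference implementation.

```python
# Toy gazetteers; the entries are illustrative, not a real resource.
GAZETTEERS = {
    "PERSON": {("John", "Smith")},
    "LOCATION": {("Washington",), ("Sherwood", "Forest")},
    "ORGANIZATION": {("Free", "Software", "Foundation")},
}
MAX_LEN = max(len(name) for names in GAZETTEERS.values() for name in names)


def lookup(tokens: list[str]) -> list[tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end, type) token spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            label = next(
                (t for t, names in GAZETTEERS.items() if candidate in names),
                None,
            )
            if label:
                spans.append((i, i + n, label))
                i += n
                break
        else:  # no list entry starts at position i
            i += 1
    return spans


print(lookup("John Smith visited Sherwood Forest".split()))
print(lookup("Mr Smith went to Washington".split()))
```

The second call shows both disadvantages at once: the variant "Mr Smith" is missed because only "John Smith" is listed, and "Washington" is labelled a location whether or not the text means the person.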


Approaches to NER: Shallow Parsing

• Names often have internal structure. These components can be either stored or guessed.

Location:
  CapWord + {City, Forest, Center}, e.g. Sherwood Forest
  CapWord + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street


Shallow Parsing Approach (e.g., Mikheev et al 1998)

• External evidence - names are often used in very predictive local contexts

Location:
  “to the” COMPASS “of” CapWord, e.g. to the south of Loitokitok
  “based in” CapWord, e.g. based in Loitokitok
  CapWord “is a” (ADJ)? GeoWord, e.g. Loitokitok is a friendly city
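The internal-structure and context patterns for locations can be approximated with regular expressions. This is a rough sketch: the word lists behind COMPASS and GeoWord are assumptions (the slides do not enumerate them), the optional adjective is matched as any single word rather than by part of speech, and `find_locations` is a hypothetical helper.

```python
import re

CAP = r"[A-Z][a-z]+"
# Assumed word lists; the slides do not spell them out.
COMPASS = r"(?:north|south|east|west)"
GEO = r"(?:city|town|village)"

LOCATION_PATTERNS = [
    # Internal evidence: CapWord + {City, Forest, Center} / {Street, ...}
    re.compile(rf"\b({CAP} (?:City|Forest|Center))\b"),
    re.compile(rf"\b({CAP} (?:Street|Boulevard|Avenue|Crescent|Road))\b"),
    # External evidence: predictive local contexts
    re.compile(rf"\bto the {COMPASS} of ({CAP})\b"),
    re.compile(rf"\bbased in ({CAP})\b"),
    re.compile(rf"\b({CAP}) is a (?:\w+ )?{GEO}\b"),
]


def find_locations(text: str) -> list[str]:
    """Return location candidates matched by any pattern, without repeats."""
    found = []
    for pattern in LOCATION_PATTERNS:
        for match in pattern.finditer(text):
            if match.group(1) not in found:
                found.append(match.group(1))
    return found


for example in [
    "Sherwood Forest",
    "to the south of Loitokitok",
    "based in Loitokitok",
    "Loitokitok is a friendly city",
]:
    print(example, "->", find_locations(example))
```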


Difficulties in Shallow Parsing Approach

• Ambiguously capitalised words (first word in sentence)
  [All American Bank] vs. All [State Police]

• Semantic ambiguity
  “John F. Kennedy” = airport (location)
  “Philip Morris” = organisation

• Structural ambiguity
  [Cable and Wireless] vs. [Microsoft] and [Dell]
  [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith].


Machine learning approaches to NER

• NER as classification: the IOB representation

• Supervised methods
  – Support Vector Machines
  – Logistic regression (aka Maximum Entropy)
  – Sequence pattern learning
  – Hidden Markov Models
  – Conditional Random Fields

• Distant learning

• Semi-supervised methods


THE ML APPROACH TO NE: THE IOB REPRESENTATION
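As an illustration of the idea, the sketch below turns entity spans into one tag per word, which is what lets a word classifier do both recognition and classification. It uses the variant in which every entity starts with a B- tag; the CoNLL 2003 data shown later uses a variant where B- appears only between adjacent entities of the same type. The function name `spans_to_iob` and the PER/ORG labels are assumptions for the example.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end, type) token spans into B-/I-/O tags."""
    tags = ["O"] * len(tokens)          # O = outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # B = first token of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # I = inside an entity
    return tags


tokens = "Jim bought 300 shares of Acme Corp. in 2006 .".split()
spans = [(0, 1, "PER"), (5, 7, "ORG")]
for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(f"{token:8} {tag}")
```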


THE ML APPROACH TO NE: FEATURES


Supervised ML for NER

• Methods already seen
  – Decision trees
  – Support Vector Machines

• Sequence learning
  – Hidden Markov Models
  – Maximum Entropy Models
  – Conditional Random Fields


NER as a SEQUENCE CLASSIFICATION TASK


Sequence Labeling as Classification: POS Tagging

Classify each token independently, but use information about the surrounding tokens as input features (sliding window).

John saw the saw and decided to take it to the table.

Classifier output, one token at a time:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN

Slide from Ray Mooney
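A sliding window can be sketched as a feature function that looks a fixed number of tokens to each side of the one being classified. The feature names, the window size of 2 and the `<PAD>` marker are assumptions for illustration; any classifier that accepts feature dictionaries could consume the result.

```python
def window_features(tokens, i, size=2):
    """Features for token i drawn from a window of +/- `size` tokens."""
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "suffix3": tokens[i][-3:],
    }
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        # Positions beyond the sentence boundary get a padding marker.
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


tokens = "John saw the saw and decided to take it to the table .".split()
print(window_features(tokens, 1))  # first "saw": preceded by "john"
print(window_features(tokens, 3))  # second "saw": preceded by "the"
```

The two occurrences of "saw" have the same word feature but different neighbours, which is the only information an independent per-token classifier has for telling the verb from the noun.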


Using Outputs as Inputs

Better input features are usually the categories of the surrounding tokens, but these are not available yet

Can use category of either the preceding or succeeding tokens by going forward or back and using previous output

Slide from Ray Mooney


Forward Classification

Tag left to right; each decision can use the tags already assigned to the preceding tokens.

John saw the saw and decided to take it to the table.

Tags assigned so far: John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB

Slide from Ray Mooney
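The forward pass can be sketched as a loop that feeds each predicted tag back in as a feature for the next token. `toy_classifier` below is a hand-written stand-in for a trained classifier (an assumption made so the example runs on its own); only `forward_tag` illustrates the technique.

```python
def toy_classifier(word, prev_tag):
    """Hand-written stand-in for a trained classifier (an assumption)."""
    lexicon = {"John": "NNP", "the": "DT", "and": "CC",
               "decided": "VBD", "to": "TO", "take": "VB"}
    if word == "saw":
        # The previous *output* disambiguates noun vs. verb.
        return "NN" if prev_tag == "DT" else "VBD"
    return lexicon[word]


def forward_tag(tokens, classify):
    """Tag left to right, passing the previous output to the classifier."""
    tags = []
    for word in tokens:
        prev_tag = tags[-1] if tags else "<START>"
        tags.append(classify(word, prev_tag))
    return tags


tokens = "John saw the saw and decided to take".split()
print(list(zip(tokens, forward_tag(tokens, toy_classifier))))
```

A greedy pass like this cannot revise an early mistake once later evidence arrives, which is one motivation for the probabilistic sequence models that follow.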


Backward Classification

• Disambiguating “to” in this case would be even easier backward.

John saw the saw and decided to take it to the table.

Tags assigned right to left, one token at a time, starting from the/DT table/NN:

to/IN it/PRP take/VB to/TO decided/VBD and/CC saw/VBD the/DT saw/VBD John/NNP

Slide from Ray Mooney


NER as Sequence Labeling


Probabilistic Sequence Models

• Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment

• Two standard models
  – Hidden Markov Model (HMM)
  – Conditional Random Field (CRF)
    • Maximum Entropy Markov Model (MEMM) is a simplified version of CRF


Hidden Markov Models (HMMs)

• Generative
  – Find parameters to maximize P(X,Y)

• Assumes features are independent

• When labeling Xi, future observations are taken into account (forward-backward)
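To show how an HMM settles on a global assignment, here is a sketch of Viterbi decoding, which finds the single most likely tag sequence. Note that this is a different algorithm from the forward-backward procedure named above, which computes per-position marginals. All probabilities below are invented toy values, and the two-state tag set is an assumption.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for `obs` under an HMM (log-space)."""
    # best[t][s] = (log-prob of best path ending in s at time t, backpointer)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(trans_p[p][s]))
                 for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s][obs[t]]), prev)
        best.append(column)
    # Follow backpointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]


# Toy parameters, invented for illustration only.
states = ["PER", "O"]
start_p = {"PER": 0.3, "O": 0.7}
trans_p = {"PER": {"PER": 0.6, "O": 0.4}, "O": {"PER": 0.2, "O": 0.8}}
emit_p = {
    "PER": {"John": 0.5, "Smith": 0.4, "spoke": 0.1},
    "O": {"John": 0.1, "Smith": 0.1, "spoke": 0.8},
}
obs = ["John", "Smith", "spoke"]
print(viterbi(obs, states, start_p, trans_p, emit_p))
```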


Conditional Random Fields (CRFs)

• Discriminative
  – Find parameters to maximize P(Y|X)

• Doesn’t assume that features are independent

• When labeling Yi, future observations are taken into account

The best of both worlds!


PROBABILISTIC CLASSIFICATION: GENERATIVE VS DISCRIMINATIVE

• Let Y be the random variable for the class which takes values {y1,y2,…ym}.

• Let X be the random variable describing an instance consisting of a vector of values for n features <X1,X2…Xn>, let xk be a possible vector value for X and xij a possible value for Xi.

• For classification, we need to compute P(Y=yi | X=xk) for i = 1…m

• Could be done using joint distribution but this requires estimating an exponential number of parameters.


Discriminative Vs. Generative

• Generative model, p(y, x): a model that generates the observed data randomly
  – Naïve Bayes: once the class label is known, all the features are independent

• Discriminative model, p(y | x): directly estimates the posterior probability; aims at modeling the “discrimination” between different outputs
  – MaxEnt classifier: linear combination of feature functions in the exponent

Both generative models and discriminative models describe distributions over (y, x), but they work in different directions.



Simple Linear Chain CRF Features

• Modeling the conditional distribution is similar to that used in multinomial logistic regression.

• Create feature functions fk(Yt, Yt−1, Xt)
  – Feature for each state transition pair i, j
    • fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j and 0 otherwise
  – Feature for each state observation pair i, o
    • fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o and 0 otherwise

• Note: number of features grows quadratically in the number of states (i.e. tags).


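Conditional Distribution for Linear Chain CRF

• Using these feature functions for a simple linear chain CRF, we can define:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(Y_t, Y_{t-1}, X_t) \Big)$$

$$Z(X) = \sum_{Y} \exp\Big( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(Y_t, Y_{t-1}, X_t) \Big)$$

where λk is the learned weight of feature function fk and Z(X) normalizes over all label sequences Y.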
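The definition can be checked numerically on a tiny example. The sketch below scores a label sequence with indicator transition and observation features and computes Z(X) by enumerating every labeling, which is only feasible for toy inputs; practical CRF implementations use dynamic programming instead. The tag set, the weights and the `<START>` symbol are assumptions made up for the example.

```python
import itertools
import math

TAGS = ["O", "PER"]


def score(y, x, weights):
    """Sum over t and k of lambda_k * f_k(Y_t, Y_{t-1}, X_t)."""
    total, prev = 0.0, "<START>"
    for tag, word in zip(y, x):
        total += weights.get(("trans", prev, tag), 0.0)  # transition feature
        total += weights.get(("obs", tag, word), 0.0)    # observation feature
        prev = tag
    return total


def conditional_prob(y, x, weights):
    """P(Y|X) = exp(score(Y, X)) / Z(X), with Z(X) summed by brute force."""
    z = sum(
        math.exp(score(candidate, x, weights))
        for candidate in itertools.product(TAGS, repeat=len(x))
    )
    return math.exp(score(y, x, weights)) / z


# Hypothetical weights (lambda_k); unlisted features have weight 0.
weights = {
    ("obs", "PER", "John"): 2.0,
    ("obs", "O", "spoke"): 1.5,
    ("trans", "PER", "O"): 0.5,
}
x = ["John", "spoke"]
for y in itertools.product(TAGS, repeat=len(x)):
    print(y, round(conditional_prob(y, x, weights), 3))
```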


Adding Token Features to a CRF

• Can add token features Xi,j

[Figure: linear chain of labels Y1, Y2, …, YT; each Yt is connected to its token features Xt,1 … Xt,m]

• Can add additional feature functions for each token feature to model conditional distribution.


NER: EVALUATION


TYPICAL PERFORMANCE


NER Evaluation Campaigns

• English NER -- CoNLL 2003 - PER/ORG/LOC/MISC
  – Training set: 203,621 tokens
  – Development set: 51,362 tokens
  – Test set: 46,435 tokens

• Italian NER -- Evalita 2009 - PER/ORG/LOC/GPE
  – Development set: 223,706 tokens
  – Test set: 90,556 tokens

• Mention Detection -- ACE 2005
  – 599 documents


CoNLL2003 shared task (1)

• English and German language

• 4 types of NEs:
  – LOC Location
  – MISC Names of miscellaneous entities
  – ORG Organization
  – PER Person

• Training Set for developing the system

• Test Data for the final evaluation


CoNLL2003 shared task (2)

• Data
  – Columns separated by a single space
  – A word for each line
  – An empty line after each sentence
  – Tags in IOB format

• An example

  Milan NNP B-NP I-ORG
  's POS B-NP O
  player NN I-NP O
  George NNP I-NP I-PER
  Weah NNP I-NP I-PER
  meet VBP B-VP O
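A reader for this column format takes only a few lines. The sketch below assumes exactly four space-separated columns per line (word, POS tag, chunk tag, NE tag), as in the example; `read_conll` is a hypothetical helper, not part of the shared-task distribution.

```python
def read_conll(lines):
    """Group space-separated 'word POS chunk NE' lines into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                 # empty line = sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ne = line.split(" ")
        current.append((word, pos, chunk, ne))
    if current:                      # last sentence may lack a trailing blank
        sentences.append(current)
    return sentences


sample = """Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
""".splitlines()

for word, pos, chunk, ne in read_conll(sample)[0]:
    print(f"{word:8} {ne}")
```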


CoNLL2003 shared task (3)

English     precision  recall   F
[FIJZ03]    88.99%     88.54%   88.76%
[CN03]      88.12%     88.51%   88.31%
[KSNM03]    85.93%     86.21%   86.07%
[ZJ03]      86.13%     84.88%   85.50%
---------------------------------------
[Ham03]     69.09%     53.26%   60.15%
baseline    71.91%     50.90%   59.61%
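The F column is the harmonic mean of precision and recall. As a quick check, the short sketch below recomputes it for the top system and the baseline from the figures in the table; the function name `f1` is just a label for the standard formula.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    return 2 * precision * recall / (precision + recall)


print(round(f1(88.99, 88.54), 2))  # [FIJZ03]
print(round(f1(71.91, 50.90), 2))  # baseline
```

Because the harmonic mean is pulled toward the smaller value, the baseline's low recall drags its F score well below its precision.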


CURRENT RESEARCH ON NER

• New domains

• New approaches:
  – Semi-supervised
  – Distant

• Handling many NE types

• Integration with Machine Translation

• Handling difficult linguistic phenomena such as metonymy

Page 82: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

NEW DOMAINS

• BIOMEDICAL
• CHEMISTRY
• HUMANITIES: MORE FINE-GRAINED TYPES

Page 83: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Bioinformatics Named Entities

• Protein
• DNA
• RNA
• Cell line
• Cell type
• Drug
• Chemical

Page 84: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

NER IN THE HUMANITIES

LOC

SITE

CULTURE

Page 85: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Semi-supervised learning

• Modest amounts of supervision
  – Small amount of training data
  – Supervisor input sought when necessary

• Aims to match supervised learning performance, but with much less human effort

• Bootstrapping
  – Seeds are used to identify contextual clues
  – Contextual clues are used to find more NEs
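The bootstrapping loop can be illustrated with a deliberately tiny sketch. It is not any of the published systems: here a "contextual clue" is assumed to be just the pair (previous word, next word), every token found in a known context is accepted without ranking, and the function name `bootstrap` and the toy sentences are invented for the example.

```python
def bootstrap(sentences, seeds, iterations=2):
    """Grow a set of entity names from seeds by alternating clue and entity discovery."""
    entities = set(seeds)
    for _ in range(iterations):
        # Step 1: seeds (and entities found so far) identify contextual clues.
        contexts = set()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            for i in range(1, len(padded) - 1):
                if padded[i] in entities:
                    contexts.add((padded[i - 1], padded[i + 1]))
        # Step 2: contextual clues find more entities.
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            for i in range(1, len(padded) - 1):
                if (padded[i - 1], padded[i + 1]) in contexts:
                    entities.add(padded[i])
    return entities


corpus = [
    ["President", "Obama", "said"],
    ["President", "Merkel", "said"],
    ["Merkel", "visited", "Paris"],
    ["Macron", "visited", "Berlin"],
]
after_one = bootstrap(corpus, {"Obama"}, iterations=1)
after_two = bootstrap(corpus, {"Obama"}, iterations=2)
```

The first iteration picks up "Merkel" from the shared "President ... said" context; the second adds "Macron" through a context learned from "Merkel". Accepting every match like this is what lets errors snowball, which is why practical systems rank and filter candidates at each round.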

Page 86: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Semi-supervised learning

• Examples: (Brin 1998); (Collins and Singer 1999); (Riloff and Jones 1999); (Cucchiarelli and Velardi 2001); (Pasca et al. 2006); (Heng and Grishman 2006); (Nadeau et al. 2006); (Liao and Veeramachaneni 2009)

Page 87: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

ASemiNER - Methodology

Input
  – A seed list of a few examples of a given NE type
    • ‘Muhammad’ and ‘Obama’ can be used as seed examples for entities of type person.

Parameters
  – Number of iterations
  – Number of initial seeds
  – The ranking measure (reliability measure)

Page 88: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Pattern Induction - Initial Patterns

– Sentences containing a seed instance are retrieved.

– A number of tokens on each side of the seed are extracted.

• Sentence boundaries

TP pair = (Token/POS) pair

Page 89: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Pattern Induction - Final Patterns

– Token (noun): inflected forms

– Token (verb): stems

Page 90: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Pattern Induction - Final Patterns: “Trigger” words

– Lists of trigger nouns (e.g., alsayd ‘Mr.’, alraeys ‘President’, alduktur ‘Dr.’)

• They are used as Arabic NE indicators, or trigger words, in the training phase.

• Arabic Wikipedia articles are crawled randomly, prepared, and POS-tagged.

• The nouns to the left and right of each named entity are extracted and collected.

• The most frequent nouns (inflected forms) are picked and stored as “trigger” nouns.

– Lists of trigger verbs (e.g., rasam ‘draw’, naHat ‘sculpt’, etc.)

• The most frequent verbs (stems) are picked and stored as “trigger” verbs.

Page 91: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Pattern Induction

• Generalization

• TP pairs that contain nouns or verbs are stripped of their ‘Token’ parts, unless these tokens are in the corresponding lists of trigger words.
  – alsayd/NN ‘Mr./NN’ → alsayd/NN ‘Mr./NN’, as alsayd ‘Mr.’ is in the list of trigger nouns.
  – qalam/NN ‘pen/NN’ → /NN, as qalam ‘pen’ is not among the trigger nouns.

• TP pairs that contain a preposition are kept unchanged.

• TP pairs that contain other part-of-speech categories (e.g., proper noun, adjective, coordinating conjunction) are stripped of their ‘Token’ parts.
  – mufyd/JJ ‘useful/JJ’ → /JJ

• All POS tags used for verbs (e.g., VBP, VBD, VBN) are converted to one form, VB.

• All POS tags used for nouns (e.g., NN, NNS) are converted to one form, NN.

• All POS tags used for proper nouns (e.g., NNP, NNPS) are converted to one form, NNP.

• The seed instance is replaced with an NE class tag (e.g., <PersonName>, <Location>, <Organization>).
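The generalization rules can be sketched as a small function over (token, POS) pairs. This is an illustration under several assumptions, not ASemiNER's actual code: the seed is taken to be a single token, prepositions are assumed to carry the Penn-style tag IN, no stemming or morphological analysis is performed, and the names `generalize`, `normalize_pos` and `render` are invented. The trigger lists reuse the examples given on the slides; "fy" ('in') is added only to show the preposition rule.

```python
TRIGGER_NOUNS = {"alsayd", "alraeys", "alduktur"}  # trigger-noun examples from the slides
TRIGGER_VERBS = {"rasam", "naHat"}                 # trigger-verb examples from the slides


def normalize_pos(pos):
    """Collapse verb, noun and proper-noun tag variants to VB, NN and NNP."""
    if pos.startswith("VB"):
        return "VB"
    if pos.startswith("NNP"):
        return "NNP"
    if pos.startswith("NN"):
        return "NN"
    return pos


def generalize(pairs, seed_index, ne_tag):
    """Turn an initial pattern (list of TP pairs) into a final pattern."""
    final = []
    for i, (token, pos) in enumerate(pairs):
        pos = normalize_pos(pos)
        if i == seed_index:
            final.append((ne_tag, pos))            # seed replaced by the NE class tag
        elif pos == "NN":
            final.append((token if token in TRIGGER_NOUNS else "", pos))
        elif pos == "VB":
            final.append((token if token in TRIGGER_VERBS else "", pos))
        elif pos == "IN":                          # assumed preposition tag: kept unchanged
            final.append((token, pos))
        else:
            final.append(("", pos))                # other categories lose their token
    return final


def render(pattern):
    """Write a pattern in the Token/POS notation used on the slides."""
    return " ".join(f"{token}/{pos}" for token, pos in pattern)


# A made-up initial pattern (not a real sentence) with the seed 'Muhammad' at index 2.
initial = [("rasam", "VBD"), ("alsayd", "NN"), ("Muhammad", "NNP"),
           ("fy", "IN"), ("qalam", "NNS"), ("mufyd", "JJ")]
final = generalize(initial, seed_index=2, ne_tag="<PersonName>")
```

Rendering the result with `render(final)` gives `rasam/VB alsayd/NN <PersonName>/NNP fy/IN /NN /JJ`: the trigger words and the preposition keep their tokens, everything else is reduced to its POS tag.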

Page 92: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Pattern Induction

• Producing Final Patterns

[Example showing an Initial Pattern, the resulting Final Pattern, and an English Gloss]

Page 93: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Pattern Induction

• Two more Final Patterns

Page 94: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Pattern Induction

• The final pattern set (P) is updated and filtered every time a new pattern is added.

• Repeated patterns are rejected.

• A pattern consisting of fewer than six TP pairs must contain at least one ‘Token’ part.
  – /VB /NN <PersonName>/NNP /NNP
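The two filtering rules are simple enough to sketch directly. The code below is an assumption-laden illustration: patterns are tuples of (token, POS) pairs with an empty string for a stripped token, the NE class slot (e.g. `<PersonName>`) is assumed not to count as a ‘Token’ part, and `add_pattern` is a hypothetical helper name.

```python
def has_token_part(pattern):
    """True if at least one TP pair still carries a real token (NE class slots excluded)."""
    return any(token and not token.startswith("<") for token, _ in pattern)


def add_pattern(patterns, new_pattern, min_length=6):
    """Append new_pattern to patterns unless it is a repeat or too general; report the outcome."""
    if new_pattern in patterns:
        return False  # repeated patterns are rejected
    if len(new_pattern) < min_length and not has_token_part(new_pattern):
        return False  # short patterns need at least one 'Token' part
    patterns.append(new_pattern)
    return True


P = []
too_general = (("", "VB"), ("", "NN"), ("<PersonName>", "NNP"), ("", "NNP"))
with_trigger = (("", "VB"), ("alsayd", "NN"), ("<PersonName>", "NNP"), ("", "NNP"))

results = [
    add_pattern(P, too_general),   # short and without any token
    add_pattern(P, with_trigger),  # short but anchored by a trigger noun
    add_pattern(P, with_trigger),  # exact repeat
]
```

Under these assumptions only the pattern anchored by the trigger noun is kept.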

Page 95: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

ASemiNER - Methodology

Page 96: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Instance Extraction

• ASemiNER retrieves from the training corpus the set of instances (I) that match any of the patterns in (P), using regular expressions (regexes).

• ASemiNER generates the regexes automatically from the final patterns, without any modification and regardless of whether the POS tagger assigned the correct tags to the proper nouns.

Page 97: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Instance Extraction

• ASemiNER automatically adds the average NE length (2 tokens) to the generated regexes.
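One way to picture the pattern-to-regex step is shown below. How ASemiNER actually encodes its regexes is not given on the slides, so this is only a sketch under stated assumptions: the corpus is a string of space-separated Token/POS items with already normalized tags, an empty token matches any word with that tag, and the average NE length is interpreted as an upper bound (`max_ne_len`, a hypothetical parameter) on the number of tokens captured in the NE slot.

```python
import re


def pattern_to_regex(pattern, max_ne_len=2):
    """Compile a final pattern (list of (token, POS) pairs) into a regex over 'tok/POS' text."""
    parts = []
    for token, pos in pattern:
        tag = re.escape(pos)
        if token.startswith("<") and token.endswith(">"):
            # NE slot: capture between 1 and max_ne_len consecutive tokens with this tag.
            one = rf"\S+/{tag}"
            parts.append(rf"((?:{one})(?: {one}){{0,{max_ne_len - 1}}})")
        elif token:
            parts.append(rf"{re.escape(token)}/{tag}")  # trigger word or preposition: literal match
        else:
            parts.append(rf"\S+/{tag}")                 # stripped token: any word with this tag
    # The lookarounds keep the match aligned to whole Token/POS items.
    return re.compile(r"(?<!\S)" + " ".join(parts) + r"(?!\S)")


def extract_instances(regex, tagged_text):
    """Return the captured NE candidates with their POS tags removed."""
    instances = []
    for match in regex.finditer(tagged_text):
        tokens = [item.rsplit("/", 1)[0] for item in match.group(1).split()]
        instances.append(" ".join(tokens))
    return instances


pattern = [("alsayd", "NN"), ("<PersonName>", "NNP"), ("", "VB")]
regex = pattern_to_regex(pattern)
found = extract_instances(regex, "alsayd/NN Barack/NNP Obama/NNP rasam/VB")
```

With the two-token bound, the slot captures "Barack Obama" in the example, and a single-token name in the same context would match as well.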

Page 98: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

ASemiNER - Methodology

Page 99: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Instance Ranking/Selection

• Extracted instances in (I) are ranked according to:

  – The number of distinct patterns used to extract them (pattern variety is a better cue to semantics than absolute frequency)

  – Pointwise Mutual Information (PMI):

• |i,p|: the frequency of the instance i extracted by pattern p.

• |i|: the frequency of the instance i in the corpus.

• |p|: the frequency of the pattern p in the corpus.
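The PMI formula itself is not preserved in this transcript. A commonly used form that is consistent with the three quantities defined above is pmi(i, p) = log( |i,p| / (|i| · |p|) ); the sketch below assumes that form, and it treats the two ranking criteria as separate scores because the slides do not say how they are combined. The function names and the toy counts are illustrative.

```python
import math
from collections import defaultdict


def pmi(count_ip, count_i, count_p):
    """Assumed form: log of the joint frequency over the product of the two frequencies."""
    return math.log(count_ip / (count_i * count_p))


def rank_by_pattern_variety(extractions):
    """Rank instances by how many distinct patterns extracted them (ties broken alphabetically)."""
    patterns_per_instance = defaultdict(set)
    for instance, pattern_id in extractions:
        patterns_per_instance[instance].add(pattern_id)
    return sorted(patterns_per_instance,
                  key=lambda inst: (-len(patterns_per_instance[inst]), inst))


# Toy extraction log: (instance, id of the pattern that extracted it).
extractions = [("Obama", "p1"), ("Obama", "p2"), ("Obama", "p1"), ("Paris", "p1")]
ranking = rank_by_pattern_variety(extractions)

# Toy counts: instance seen 4 times, pattern seen 5 times, together 2 times.
score = pmi(count_ip=2, count_i=4, count_p=5)
```

In the toy log "Obama" is extracted by two different patterns and therefore ranks above "Paris", even though the repeated p1 hit adds nothing to its score.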

Page 100: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

ASemiNER - Methodology

Top m instances, where m is set to the number of instances in the previous iteration + 1

Page 101: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Experiments & Results

• Several experiments
  – Different values of the parameters
    • No. of iterations
    • No. of initial seeds
    • Ranking measure

• Training data
  – ACE 2005 (Linguistic Data Consortium, LDC)
  – ANERcorp training set (Benajiba et al. 2007)

• Test data
  – ANERcorp test corpus

Page 102: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Experiments & Results

• Several experiments

– Standard NE types:

• Person

• Location

• Organization

– Specialised NE types:

• Politicians

• Sportspersons

• Artists

Page 103: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

Simple Models (standard NE types)

• ANERcorp (Training data)

• Without iterations.

• No. of Initial seeds : 5

Page 104: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

ASemiNER (Specialised NE types)

• Politicians, Artists, and Sportspersons

• Unlike supervised techniques, ASemiNER does not require additional annotated training data or re-annotation of the existing data.

• It requires only a minor modification:

• For each new NE type, generate new trigger noun and trigger verb lists.
  – Artists trigger nouns (e.g., actress, actor, painter, etc.)
  – Politicians trigger nouns (e.g., president, party, king, etc.)
  – Sportspersons trigger nouns (e.g., player, football, athletic, etc.)

Page 105: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

ASemiNER (Specialised NE types)

– ASemiNER performs as well on the specialised types as it does on recognizing the standard person category

– ASemiNER proved to be easily adaptable when extracting new types of NEs

Page 106: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

WIKIPEDIA AND NER

• Query: Giotto was called to work in Padua, and also in Rimini

• Wikipedia:

Truc-Vien T. Nguyen, May 2012

Page 107: 807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition

THANKS

• I used slides from Bernardo Magnini, Chris Manning, Roberto Zanoli, Ray Mooney