
807 - TEXT ANALYTICS

Massimo Poesio

Lecture 5: Named Entity Recognition

Text Classification at Different Granularities

• Text Categorization:
  – Classify an entire document
• Information Extraction (IE):
  – Identify and classify small units within documents
• Named Entity Extraction (NE):
  – A subset of IE
  – Identify and classify proper names: people, locations, organizations

Adapted from slide by William Cohen

What is Information Extraction?

Filling slots in a database from sub-segments of text. As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Adapted from slide by William Cohen


IE fills the slots:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

Adapted from slide by William Cohen

What is Information Extraction?

Information Extraction = segmentation + classification + association. As a family of techniques:


Applied to the same story, it yields the segments:

Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation

aka “named entity extraction”

Adapted from slide by William Cohen


INFORMATION EXTRACTION

• More general definition: extraction of structured information from unstructured documents

• IE tasks:
  – Named entity extraction
    • Named entity recognition
    • Coreference resolution
    • Relationship extraction
  – Semi-structured IE
    • Table extraction
  – Terminology extraction

Adapted from slide by William Cohen

Landscape of IE Tasks: Degree of Formatting

• Text paragraphs without formatting, e.g.:

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables

Adapted from slide by William Cohen

Landscape of IE Tasks: Intended Breadth of Coverage

Web site specific       Genre specific   Wide, non-specific
Amazon.com book pages   Resumes          University names
Formatting              Layout           Language

Landscape of IE Tasks: Complexity

Closed set (e.g., U.S. states):
  He was born in Alabama…
  The big Wyoming sky…

Regular set (e.g., U.S. phone numbers):
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299

Complex pattern (e.g., U.S. postal addresses):
  University of Arkansas
  P.O. Box 140
  Hope, AR 71802

  Headquarters:
  1128 Main Street, 4th Floor
  Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (e.g., person names):
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.

Adapted from slide by William Cohen

Landscape of IE Tasks: Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity ("named entity" extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship:
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut

N-ary record:
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt

Adapted from slide by William Cohen

State of the Art Performance: A Sample

• Named entity recognition from newswire text
  – Person, Location, Organization, …
  – F1 in the high 80s or low to mid 90s
• Binary relation extraction
  – Contained-in(Location1, Location2), Member-of(Person1, Organization1)
  – F1 in the 60s, 70s, or 80s
• Web site structure recognition
  – Extremely accurate performance obtainable
  – Human effort (~10 min?) required on each site

Slide by Chris Manning, based on slides by several others

Three generations of IE systems

• Hand-built systems: knowledge engineering [1980s– ]
  – Rules written by hand
  – Require experts who understand both the systems and the domain
  – Iterative guess-test-tweak-repeat cycle
• Automatic, trainable rule-extraction systems [1990s– ]
  – Rules discovered automatically from predefined templates, using automated rule learners
  – Require huge labeled corpora (the effort is just moved!)
• Statistical models [1997– ]
  – Use machine learning to learn which features indicate boundaries and types of entities
  – Learning usually supervised; may be partially unsupervised

Named Entity Recognition (NER)

Input:

Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.

Output (entity mentions marked; highlighted in the original slide):

[ORG Apple Inc.], formerly [ORG Apple Computer, Inc.], is an American multinational corporation headquartered in [LOC Cupertino, California] that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on [DATE April 1, 1976], by [PER Steve Jobs], [PER Steve Wozniak] and [PER Ronald Wayne].

Named Entity Recognition (NER)

• Locate and classify atomic elements in text into predefined categories (persons, organizations, locations, temporal expressions, quantities, percentages, monetary values, …)
• Input: a block of text
  – Jim bought 300 shares of Acme Corp. in 2006.
• Output: an annotated block of text
  – <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>
  – ENAMEX tags (as in MUC in the 1990s)
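As a concrete illustration (an assumption, not part of the original slides): a modern off-the-shelf tagger such as spaCy produces this kind of annotation in a few lines, although its label set (PERSON, ORG, DATE, CARDINAL) differs from the MUC tags. This sketch assumes the en_core_web_sm model has been downloaded:

    import spacy

    nlp = spacy.load("en_core_web_sm")   # pretrained English pipeline
    doc = nlp("Jim bought 300 shares of Acme Corp. in 2006.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Expected output along these lines:
    #   Jim PERSON
    #   300 CARDINAL
    #   Acme Corp. ORG
    #   2006 DATE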

THE STANDARD NEWS DOMAIN

• Most work on NER focuses on NEWS
• Variants of the repertoire of entity types first studied in MUC and then in ACE:
  – PERSON
  – ORGANIZATION
  – GPE (geo-political entity)
  – LOCATION
  – TEMPORAL ENTITY
  – NUMBER

HOW

• Two tasks:
  – Identifying the part of the text that mentions an entity (RECOGNITION)
  – Classifying it (CLASSIFICATION)
• The two tasks are reduced to a standard classification task by having the system classify WORDS

Basic Problems in NER

• Variation of NEs, e.g. John Smith, Mr Smith, John
• Ambiguity of NE types:
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"

Problems in NER

• Category definitions are intuitively quite clear, but there are many grey areas.
• Many of these grey areas are caused by metonymy:
  – Organisation vs. Location: "England won the World Cup" vs. "The World Cup took place in England"
  – Company vs. Artefact: "shares in MTV" vs. "watching MTV"
  – Location vs. Organisation: "she met him at Heathrow" vs. "the Heathrow authorities"

Solutions

• The task definition must be very clearly specified at the outset.
• The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and the "logic" behind the intuition.
• MUC essentially adopted the simplistic approach of disregarding metonymous uses of words: e.g., "England" was always identified as a location. However, this is not always useful for practical applications of NER (e.g., the football domain).
• Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.

More complex problems in NER

• Issues of style, structure, domain, genre, etc.
• Punctuation, spelling, spacing, and formatting all have an impact:

  Dept. of Computing and Maths
  Manchester Metropolitan University
  Manchester
  United Kingdom

  > Tell me more about Leonardo
  > Da Vinci

Approaches to NER: List Lookup

• A system that recognises only the entities stored in its lists (GAZETTEERS).
• Advantages: simple, fast, language independent, easy to retarget
• Disadvantages: collection and maintenance of the lists; cannot deal with name variants; cannot resolve ambiguity
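A minimal sketch of list lookup in Python (the gazetteer entries are invented for illustration); it takes the longest gazetteer match at each position:

    def gazetteer_ner(tokens, gazetteer, max_len=4):
        """Greedy longest-match lookup of token spans in a gazetteer.

        gazetteer maps tuples of tokens to an entity type,
        e.g. {("New", "York"): "LOC", ("Acme", "Corp."): "ORG"}.
        """
        entities, i = [], 0
        while i < len(tokens):
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                span = tuple(tokens[i:i + n])
                if span in gazetteer:
                    entities.append((i, i + n, gazetteer[span]))
                    i += n
                    break
            else:
                i += 1  # no match starting here; move on
        return entities

    # gazetteer_ner("He flew to New York".split(), {("New", "York"): "LOC"})
    # -> [(3, 5, 'LOC')]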

Approaches to NER: Shallow Parsing

• Names often have internal structure. These components can be either stored or guessed:

  Location: CapWord + {City, Forest, Center}        e.g. Sherwood Forest
  CapWord + {Street, Boulevard, Avenue, Crescent, Road}   e.g. Portobello Street

Shallow Parsing Approach (e.g., Mikheev et al. 1998)

• External evidence: names are often used in very predictive local contexts:

  Location:
    "to the" COMPASS "of" CapWord       e.g. to the south of Loitokitok
    "based in" CapWord                  e.g. based in Loitokitok
    CapWord "is a" (ADJ)? GeoWord       e.g. Loitokitok is a friendly city
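These patterns translate almost directly into regular expressions. A sketch (the COMPASS alternatives and suffix lists are abbreviated for illustration):

    import re

    CAP = r"[A-Z][a-z]+"                      # a capitalised word
    COMPASS = r"(?:north|south|east|west)"

    patterns = [
        # Internal evidence: CapWord + location suffix, e.g. "Sherwood Forest"
        (re.compile(rf"({CAP})\s+(?:City|Forest|Center|Street|Boulevard|Avenue|Crescent|Road)"), "LOC"),
        # External evidence: "to the south of Loitokitok", "based in Loitokitok"
        (re.compile(rf"to the {COMPASS} of ({CAP})"), "LOC"),
        (re.compile(rf"based in ({CAP})"), "LOC"),
    ]

    def match_locations(text):
        return [(m.group(0), label) for regex, label in patterns
                for m in regex.finditer(text)]

    # match_locations("He was based in Loitokitok, to the south of Nairobi.")
    # -> [('to the south of Nairobi', 'LOC'), ('based in Loitokitok', 'LOC')]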

Difficulties in the Shallow Parsing Approach

• Ambiguously capitalised words (first word in a sentence):
  [All American Bank] vs. All [State Police]
• Semantic ambiguity:
  "John F. Kennedy" = airport (location)
  "Philip Morris" = organisation
• Structural ambiguity:
  [Cable and Wireless] vs. [Microsoft] and [Dell]
  [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

Machine learning approaches to NER

• NER as classification: the IOB representation
• Supervised methods:
  – Support Vector Machines
  – Logistic regression (aka Maximum Entropy)
  – Sequence pattern learning
  – Hidden Markov Models
  – Conditional Random Fields
• Distant learning
• Semi-supervised methods

THE ML APPROACH TO NE: THE IOB REPRESENTATION
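In the IOB scheme, each token is labelled B-TYPE at the beginning of an entity mention, I-TYPE inside a mention, and O outside any mention. For instance, on the running example (an illustration in the IOB2 variant, where every mention starts with B-; the original IOB1 scheme, used in the CoNLL data later in this lecture, uses B- only between adjacent mentions of the same type):

  Jim     B-PER
  bought  O
  300     O
  shares  O
  of      O
  Acme    B-ORG
  Corp.   I-ORG
  in      O
  2006    O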

THE ML APPROACH TO NE: FEATURES
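A sketch of the kind of word-level features commonly used (illustrative; exact feature sets vary by system and may differ from the tables on the original slides):

    def token_features(tokens, i):
        """Feature dict for token i: identity, shape, affixes, and neighbours."""
        tok = tokens[i]
        return {
            "word": tok.lower(),
            "is_capitalized": tok[:1].isupper(),
            "is_all_caps": tok.isupper(),
            "has_digit": any(c.isdigit() for c in tok),
            "prefix3": tok[:3],
            "suffix3": tok[-3:],
            "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        }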

Supervised ML for NER

• Methods already seen:
  – Decision trees
  – Support Vector Machines
• Sequence learning:
  – Hidden Markov Models
  – Maximum Entropy Models
  – Conditional Random Fields

NER as a SEQUENCE CLASSIFICATION TASK

Sequence Labeling as Classification: POS Tagging

Classify each token independently, but use as input features information about the surrounding tokens (sliding window). Running the classifier across the sentence, one token at a time, yields:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN

Slides from Ray Mooney

Using Outputs as Inputs

• Better input features are usually the categories of the surrounding tokens, but these are not available yet.
• We can use the categories of either the preceding or succeeding tokens by going forward or backward through the sequence and reusing the previous outputs.

Slide from Ray Mooney
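A sketch of this idea in Python (illustrative: `classify` stands in for any trained classifier mapping a feature dict to a tag; it is not defined on the slides):

    def forward_classify(tokens, classify):
        """Tag left to right, feeding each previous prediction back in as a feature."""
        tags = []
        for i, tok in enumerate(tokens):
            features = {
                "word": tok,
                "prev_word": tokens[i - 1] if i > 0 else "<s>",
                "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
                "prev_tag": tags[-1] if tags else "<s>",  # earlier output reused as input
            }
            tags.append(classify(features))
        return tags

    # Backward classification is the mirror image: iterate over the reversed
    # sentence and feed in the tag of the *following* token instead.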

Forward Classification

Classify the tokens left to right; each prediction joins the growing left context and is available as a feature for the next token:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB …

Slides from Ray Mooney

Backward Classification

• Disambiguating "to" in this case would be even easier backward.

Classify the tokens right to left; each prediction joins the growing right context and is available as a feature for the next token to its left. The final sequence (the fourth token, the noun "saw", ends up tagged VBD here):

John/NNP saw/VBD the/DT saw/VBD and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN

Slides from Ray Mooney

NER as Sequence Labeling

Probabilistic Sequence Models

• Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determining the most likely global assignment.
• Two standard models:
  – Hidden Markov Model (HMM)
  – Conditional Random Field (CRF)
• The Maximum Entropy Markov Model (MEMM) can be seen as a simplified, locally normalized version of the CRF.

Hidden Markov Models (HMMs)

• Generative: find parameters to maximize P(X, Y)
• Assumes features are independent
• When labeling X_i, future observations are taken into account (forward-backward)

Conditional Random Fields (CRFs)

• Discriminative: find parameters to maximize P(Y | X)
• Doesn't assume that features are independent
• When labeling Y_i, future observations are taken into account
• The best of both worlds! (A minimal training sketch follows.)
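As a concrete illustration (an assumption; the slides do not prescribe a toolkit), the sklearn-crfsuite package trains a linear-chain CRF directly from per-token feature dicts and IOB tags:

    import sklearn_crfsuite

    # Toy training data: one sentence as a list of feature dicts plus IOB tags.
    # Real systems use the richer feature sets discussed earlier.
    X_train = [[{"word": "Jim", "is_capitalized": True},
                {"word": "bought", "is_capitalized": False},
                {"word": "Acme", "is_capitalized": True}]]
    y_train = [["B-PER", "O", "B-ORG"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))  # e.g. [['B-PER', 'O', 'B-ORG']]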


PROBABILISTIC CLASSIFICATION: GENERATIVE VS DISCRIMINATIVE

• Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
• Let X be the random variable describing an instance, a vector of values for n features <X1, X2, …, Xn>; let xk be a possible vector value for X and xij a possible value for Xi.
• For classification, we need to compute P(Y = yi | X = xk) for i = 1…m.
• This could be done using the joint distribution, but that requires estimating an exponential number of parameters.

Discriminative vs. Generative

• Generative model: a model of the joint distribution p(y, x), i.e. a model that can randomly generate the observed data.
  – Naïve Bayes: once the class label is known, all the features are independent.
• Discriminative model: directly estimate the posterior probability p(y | x); aim at modeling the "discrimination" between different outputs.
  – MaxEnt classifier: a linear combination of feature functions in the exponent.
• Both generative and discriminative models describe distributions over (y, x), but they work in different directions.


Simple Linear Chain CRF Features

• Modeling the conditional distribution is similar to that used in multinomial logistic regression.
• Create feature functions f_k(Y_t, Y_{t-1}, X_t):
  – A feature for each state-transition pair (i, j):
    f_{i,j}(Y_t, Y_{t-1}, X_t) = 1 if Y_t = i and Y_{t-1} = j, and 0 otherwise
  – A feature for each state-observation pair (i, o):
    f_{i,o}(Y_t, Y_{t-1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise
• Note: the number of features grows quadratically in the number of states (i.e. tags).
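A direct transcription of these indicator features into Python (a sketch; real CRF packages generate such features internally):

    def make_transition_feature(i, j):
        """1 if the current state is i and the previous state is j."""
        return lambda y_t, y_prev, x_t: 1 if (y_t == i and y_prev == j) else 0

    def make_observation_feature(i, o):
        """1 if the current state is i and the current observation is o."""
        return lambda y_t, y_prev, x_t: 1 if (y_t == i and x_t == o) else 0

    # One feature per state pair and per state/word pair:
    states = ["B-PER", "I-PER", "O"]
    vocab = ["Jim", "bought", "shares"]
    features = ([make_transition_feature(i, j) for i in states for j in states] +
                [make_observation_feature(i, o) for i in states for o in vocab])
    # len(features) == 3*3 + 3*3 == 18; the transition part grows quadratically.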


Conditional Distribution for a Linear Chain CRF

• Using these feature functions for a simple linear chain CRF, we can define:

  P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(Y_t, Y_{t-1}, X_t) \right)

  Z(X) = \sum_{Y} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(Y_t, Y_{t-1}, X_t) \right)

Adding Token Features to a CRF

• Can add token features X_{i,j}: each position t then carries a feature vector X_{t,1}, …, X_{t,m} instead of a single observation, alongside the label chain Y_1, Y_2, …, Y_T.
• Can add additional feature functions for each token feature to model the conditional distribution.

NER: EVALUATION

TYPICAL PERFORMANCE

NER Evaluation Campaigns

• English NER (CoNLL 2003): PER/ORG/LOC/MISC
  – Training set: 203,621 tokens
  – Development set: 51,362 tokens
  – Test set: 46,435 tokens
• Italian NER (Evalita 2009): PER/ORG/LOC/GPE
  – Development set: 223,706 tokens
  – Test set: 90,556 tokens
• Mention Detection (ACE 2005):
  – 599 documents

CoNLL 2003 shared task (1)

• English and German
• 4 types of NEs:
  – LOC Location
  – MISC Names of miscellaneous entities
  – ORG Organization
  – PER Person
• Training set for developing the system
• Test data for the final evaluation

CoNLL 2003 shared task (2)

• Data:
  – Columns separated by a single space
  – One word per line
  – An empty line after each sentence
  – Tags in IOB format
• An example:

  Milan  NNP B-NP I-ORG
  's     POS B-NP O
  player NN  I-NP O
  George NNP I-NP I-PER
  Weah   NNP I-NP I-PER
  meet   VBP B-VP O
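A sketch of a reader for this format (a direct transcription of the layout above: word, POS, chunk, NE tag per line, sentences separated by blank lines):

    def read_conll(path):
        """Yield sentences as lists of (word, pos, chunk, ne_tag) tuples."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:            # blank line ends a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                else:                   # columns are space-separated
                    word, pos, chunk, ne = line.split()
                    sentence.append((word, pos, chunk, ne))
        if sentence:                    # file may not end with a blank line
            yield sentence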

CoNLL 2003 shared task (3)

English     precision   recall    F1
[FIJZ03]    88.99%      88.54%    88.76%
[CN03]      88.12%      88.51%    88.31%
[KSNM03]    85.93%      86.21%    86.07%
[ZJ03]      86.13%      84.88%    85.50%
-----------------------------------------
[Ham03]     69.09%      53.26%    60.15%
baseline    71.91%      50.90%    59.61%

CURRENT RESEARCH ON NER

• New domains
• New approaches:
  – Semi-supervised
  – Distant
• Handling many NE types
• Integration with machine translation
• Handling difficult linguistic phenomena such as metonymy

NEW DOMAINS

• BIOMEDICAL
• CHEMISTRY
• HUMANITIES: MORE FINE-GRAINED TYPES

Bioinformatics Named Entities

• Protein
• DNA
• RNA
• Cell line
• Cell type
• Drug
• Chemical

NER IN THE HUMANITIES

• Example entity types: LOC, SITE, CULTURE

Semi-supervised learning

• Modest amounts of supervision:
  – Small training data
  – Supervisor input sought when necessary
• Aims to match supervised learning performance, but with much less human effort
• Bootstrapping (a generic sketch follows this slide):
  – Seeds used to identify contextual clues
  – Contextual clues used to find more NEs
• Examples: Brin (1998); Collins and Singer (1999); Riloff and Jones (1999); Cucchiarelli and Velardi (2001); Pasca et al. (2006); Heng and Grishman (2006); Nadeau et al. (2006); Liao and Veeramachaneni (2009)
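A minimal sketch of the bootstrapping loop common to these systems (the function arguments are placeholders; the concrete pattern induction and ranking steps are described for ASemiNER below):

    def bootstrap(seeds, corpus, iterations,
                  induce_patterns, extract_instances, rank):
        """Grow a seed set of NEs by alternating pattern induction and extraction."""
        instances = set(seeds)
        patterns = set()
        for _ in range(iterations):
            # 1. Find contextual patterns around the known instances.
            patterns |= induce_patterns(instances, corpus)
            # 2. Use the patterns to extract candidate instances.
            candidates = extract_instances(patterns, corpus)
            # 3. Keep only the most reliable candidates (e.g. by PMI; see below).
            instances |= set(rank(candidates, patterns, corpus))
        return instances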

Semi-supervised learning

• Input: a seed list of a few examples of a given NE type
  – e.g., 'Muhammad' and 'Obama' can be used as seed examples for entities of type person
• Parameters:
  – Number of iterations
  – Number of initial seeds
  – The ranking measure (reliability measure)

ASemiNER: Methodology

• Sentences containing a seed instance are retrieved.
• A number of tokens on each side of the seed is extracted, respecting sentence boundaries.

Pattern Induction: Initial Patterns

• TP pair = (Token/POS) pair
• Noun tokens are kept in their inflected forms.

Pattern Induction: Final Patterns

• Verb tokens are reduced to their stems.
• Lists of trigger nouns (e.g., alsayd 'Mr.', alraeys 'President', alduktur 'Dr.'):
  – Used as Arabic NE indicators or trigger words in the training phase.
  – Arabic Wikipedia articles are crawled randomly, prepared, and POS-tagged.
  – The nouns to the left and right of the named entity are extracted and collected.
  – The most frequent nouns (in inflected form) are picked and stored as "trigger" nouns.
• Lists of trigger verbs (e.g., rasam 'draw', naHat 'sculpt', etc.):
  – The most frequent verbs (as stems) are picked and stored as "trigger" verbs.

Pattern Induction: Final Patterns, "Trigger" Words

• Generalization (sketched in code after this slide):
  – TP pairs that contain nouns or verbs are stripped of their 'Token' parts, unless those tokens are in the corresponding lists of trigger words:
    alsayd/NN 'Mr./NN' stays alsayd/NN, as alsayd 'Mr.' is in the list of trigger nouns
    qalam/NN 'pen/NN' becomes /NN, as qalam 'pen' is not among the trigger nouns
  – TP pairs that contain a preposition are kept without changes.
  – TP pairs that contain other parts of speech (e.g., proper noun, adjective, coordinating conjunction) are stripped of their 'Token' parts:
    mufyd/JJ 'useful/JJ' becomes /JJ
• All POS tags used for verbs (e.g., VBP, VBD, VBN) are converted to one form, VB.
• All POS tags used for nouns (e.g., NN, NNS) are converted to one form, NN.
• All POS tags used for proper nouns (e.g., NNP, NNPS) are converted to one form, NNP.
• The seed instance is replaced with an NE class tag (e.g., <PersonName>, <Location>, <Organization>).
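A sketch of this generalization step in Python (a direct transcription of the rules above; trigger lists are passed in as sets, and the seed-replacement step is omitted):

    def generalize(tp_pairs, trigger_nouns, trigger_verbs):
        """Turn an initial (token, pos) pattern into a generalized final pattern."""
        out = []
        for token, pos in tp_pairs:
            if pos.startswith("VB"):
                pos = "VB"                      # collapse all verb tags
                keep = token in trigger_verbs
            elif pos.startswith("NNP"):
                pos = "NNP"                     # collapse proper-noun tags
                keep = False
            elif pos.startswith("NN"):
                pos = "NN"                      # collapse common-noun tags
                keep = token in trigger_nouns
            elif pos == "IN":                   # prepositions kept unchanged
                keep = True
            else:                               # adjectives, conjunctions, ...
                keep = False
            out.append((token if keep else "", pos))
        return out

    # generalize([("alsayd", "NN"), ("qalam", "NNS")], {"alsayd"}, set())
    # -> [('alsayd', 'NN'), ('', 'NN')]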

Pattern Induction

• Producing final patterns
  (the slides show a table of example initial patterns, the final patterns derived from them, and their English glosses, plus two more final patterns)

Pattern Induction

• The final pattern set (P) is modified and filtered every time a new pattern is added.
• Repeated patterns are rejected.
• A pattern consisting of fewer than six TP pairs should contain at least one 'Token' part:
  – /VB /NN <PersonName>/NNP /NNP

ASemiNER: Methodology (Instance Extraction)

• ASemiNER retrieves from the training corpus the set of instances (I) that match any of the patterns in (P), using regular expressions (regexes).
• ASemiNER automatically generates regexes from the final patterns without modification, regardless of the correctness of the POS tags that the tagger assigned to proper nouns.
• ASemiNER automatically adds the average NE length (2 tokens) to the produced regexes.

ASemiNER: Methodology (Instance Ranking/Selection)

• Extracted instances in (I) are ranked according to:
  – The number of distinct patterns that extract them (pattern variety is a better cue to semantics than absolute frequency)
  – Pointwise Mutual Information (PMI; see the formula below), computed from:
    • |i, p|: the frequency of instance i being extracted by pattern p
    • |i|: the frequency of instance i in the corpus
    • |p|: the frequency of pattern p in the corpus
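The slide defines the counts but leaves the formula implicit; the usual PMI-style reliability score over these quantities would be (an assumption based on the standard definition):

  PMI(i, p) = \log \frac{|i, p|}{|i| \cdot |p|}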

ASemiNER: Methodology (Instance Selection)

• Keep the top m instances, where m is set to the number of instances in the previous iteration + 1.

Experiments & Results

• Several experiments – Different values of the parameters

• No. of iterations.

• No. of initial seeds.

• Ranking measure.

• Training data– ACE 2005 (Linguistic Data Consortium, LDC)

– ANERcorp training set (Benajiba et al. 2007)

• Test data– ANERcorp test corpus

Experiments & Results

• Several experiments

– Standard NE types:

• Person

• Location

• Organization

– Specialised NE types:

• Politicians

• Sportspersons

• Artists

Simple Models (standard NE types)

• ANERcorp (training data)
• Without iterations
• Number of initial seeds: 5

ASemiNER (Specialised NE Types)

• Politicians, Artists, and Sportspersons
• Unlike supervised techniques, ASemiNER does not require additional annotated training data or re-annotation of the existing data.
• It requires only a minor modification: for each new NE type, generate new trigger-noun and trigger-verb lists.
  – Artists trigger nouns (e.g., actress, actor, painter, etc.)
  – Politicians trigger nouns (e.g., president, party, king, etc.)
  – Sportspersons trigger nouns (e.g., player, football, athletic, etc.)

ASemiNER (Specialised NE Types)

• On the specialised types, ASemiNER performs as well as it does on recognizing the standard person category.
• ASemiNER proved easily adaptable to extracting new types of NEs.

WIKIPEDIA AND NER

• Wikipedia:

  Giotto was called to work in Padua, and also in Rimini

Slide by Truc-Vien T. Nguyen, May 2012

THANKS

• I used slides from Bernardo Magnini, Chris Manning, Roberto Zanoli, Ray Mooney
