
Page 1: NAME TAGGING

NAME TAGGING

Heng Ji

[email protected]

September 23, 2014

Page 2: NAME TAGGING

Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

Page 3: NAME TAGGING

Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment.

Trigger: quit (a "Personnel/End-Position" event)

Arguments:
• Role = Person: Barry Diller
• Role = Organization: Vivendi Universal Entertainment
• Role = Position: chief
• Role = Time-within: Wednesday (2003-03-04)

Using IE to produce "food" for Watson-style systems. (In this talk) Information Extraction (IE) = identifying the instances of facts (names/entities, relations and events) from semi-structured or unstructured text, and converting them into structured representations (e.g., databases).

Page 4: NAME TAGGING

Supervised Learning based IE
• 'Pipeline'-style IE
  • Split the task into several components
  • Prepare data annotation for each component
  • Apply supervised machine learning methods to address each component separately
• Most state-of-the-art ACE IE systems were developed in this way
• Provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
• Large progress has been achieved on some of these components, such as name tagging and relation extraction

Page 5: NAME TAGGING

Major IE Components
• Name/Nominal Extraction: "Barry Diller", "chief"
• Entity Coreference Resolution: "Barry Diller" = "chief"
• Relation Extraction: "Vivendi Universal Entertainment" is located in "France"
• Time Identification and Normalization: Wednesday (2003-03-04)
• Event Mention Extraction and Event Coreference Resolution: "Barry Diller" is the Person argument of the End-Position event triggered by "quit"

Page 6: NAME TAGGING

Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

Page 7: NAME TAGGING

Name Tagging
• Recognition + Classification ("Name Identification and Classification")
• NER as:
  • a tool or component of IE and IR
  • an input module for a robust shallow parsing engine
  • a component technology for other areas:
    • Question Answering (QA)
    • Summarization
    • Automatic translation
    • Document indexing
    • Text data mining
    • Genetics
    • …

Page 8: NAME TAGGING

Name Tagging
• NE Hierarchies
  • Person
  • Organization
  • Location
  • But also:
    • Artifact
    • Facility
    • Geopolitical entity
    • Vehicle
    • Weapon
    • Etc.
  • SEKINE & NOBATA (2004): 150 types, domain-dependent
  • Abstract Meaning Representation (amr.isi.edu): 200+ types

Page 9: NAME TAGGING

Name Tagging
• Handcrafted systems
  • Knowledge (rule) based
  • Patterns
  • Gazetteers
• Automatic systems
  • Statistical
  • Machine learning
  • Unsupervised
  • Analyze: character type, POS, lexical info, dictionaries
• Hybrid systems

Page 10: NAME TAGGING

Name Tagging
• Handcrafted systems
  • LTG
    • F-measure of 93.39 in MUC-7 (the best)
    • Ltquery, XML internal representation
    • Tokenizer, POS tagger, SGML transducer
  • Nominator (1997)
    • IBM
    • Heavy heuristics
    • Cross-document coreference resolution
    • Used later in IBM Intelligent Miner

Page 11: NAME TAGGING

Name Tagging
• Handcrafted systems
  • LaSIE (Large Scale Information Extraction)
    • MUC-6 (LaSIE II in MUC-7)
    • Univ. of Sheffield's GATE architecture (General Architecture for Text Engineering)
    • JAPE language
  • FACILE (1998)
    • NEA language (Named Entity Analysis)
    • Context-sensitive rules
  • NetOwl (MUC-7)
    • Commercial product
    • C++ engine, extraction rules

Page 12: NAME TAGGING

Automatic approaches
• Learning of statistical models or symbolic rules
• Use of an annotated text corpus
  • Manually annotated
  • Automatically annotated
• "BIO" tagging
  • Tags: Begin, Inside, Outside an NE
  • Probabilities:
    • Simple: P(tag_i | token_i)
    • With external evidence: P(tag_i | token_i-1, token_i, token_i+1)
• "OpenClose" tagging
  • Two classifiers: one for the beginning of an NE, one for the end

Page 13: NAME TAGGING

Automatic approaches
• Decision trees
  • Tree-oriented sequence of tests on every word
  • Determine the probabilities of having each BIO tag
  • Use a training corpus
  • Viterbi, ID3, C4.5 algorithms
  • Select the most probable tag sequence
  • SEKINE et al. (1998); BALUJA et al. (1999)
    • F-measure: 90%

Page 14: NAME TAGGING

Automatic approaches
• HMM
  • Markov models, Viterbi decoding
  • Separate statistical model for each NE category + a model for words outside NEs
  • Nymble (1997) / IdentiFinder (1999)
• Maximum Entropy (ME)
  • Separate, independent probabilities for every piece of evidence (external and internal features) are merged multiplicatively
  • MENE (NYU, 1998)
    • Capitalization, many lexical features, type of text
    • F-measure: 89%

Page 15: NAME TAGGING

Automatic approaches
• Hybrid systems
  • Combination of techniques
    • IBM's Intelligent Miner: Nominator + DB2 data mining
    • WordNet hierarchies
  • MAGNINI et al. (2002)
  • Stacks of classifiers
    • AdaBoost algorithm
  • Bootstrapping approaches
    • Small set of seeds
  • Memory-based ML, etc.

Page 16: NAME TAGGING

NER in various languages
• Arabic
  • TAGARAB (1998)
    • Pattern-matching engine + morphological analysis
    • Lots of morphological info (no differences in orthographic case)
• Bulgarian
  • OSENOVA & KOLKOVSKA (2002)
    • Handcrafted cascaded regular NE grammar
    • Pre-compiled lexicon and gazetteers
• Catalan
  • CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
    • Extract Catalan NEs with Spanish resources (F-measure 93%)
    • Bootstrap using Catalan texts

Page 17: NAME TAGGING

NER in various languages
• Chinese & Japanese
  • Many works
  • Special characteristics
    • Character- or word-based
    • No capitalization
  • CHINERS (2003)
    • Sports domain
    • Machine learning
    • Shallow parsing technique
  • ASAHARA & MATSUMOTO (2003)
    • Character-based method
    • Support Vector Machines
    • 87.2% F-measure in the IREX evaluation (outperformed most word-based systems)

Page 18: NAME TAGGING

NER in various languages
• Dutch
  • DE MEULDER et al. (2002)
    • Hybrid system
    • Gazetteers, grammars of names
    • Machine learning (Ripper algorithm)
• French
  • BÉCHET et al. (2000)
    • Decision trees
    • Le Monde news corpus
• German
  • Non-proper nouns are also capitalized
  • THIELEN (1995)
    • Incremental statistical approach
    • 65% of proper names correctly disambiguated

Page 19: NAME TAGGING

NER in various languages
• Greek
  • KARKALETSIS et al. (1998)
    • English-Greek GIE (Greek Information Extraction) project
    • GATE platform
• Italian
  • CUCCHIARELLI et al. (1998)
    • Merge rule-based and statistical approaches
    • Gazetteers
    • Context-dependent heuristics
    • ECRAN (Extraction of Content: Research at Near Market)
    • GATE architecture
    • Lack of linguistic resources: 20% of NEs undetected
• Korean
  • CHUNG et al. (2003)
    • Rule-based model, Hidden Markov Model, boosting approach over unannotated data

Page 20: NAME TAGGING

NER in various languages
• Portuguese
  • SOLORIO & LÓPEZ (2004, 2005)
    • Adapted the CARRERAS et al. (2002b) Spanish NER
    • Brazilian newspapers
• Serbo-Croatian
  • NENADIC & SPASIC (2000)
    • Hand-written grammar rules
    • Highly inflective language
    • Lots of lexical and lemmatization pre-processing
    • Dual alphabet (Cyrillic and Latin): pre-processing stores the text in an alphabet-independent format

Page 21: NAME TAGGING

NER in various languages
• Spanish
  • CARRERAS et al. (2002b)
    • Machine learning, AdaBoost algorithm
    • BIO and OpenClose approaches
• Swedish
  • SweNam system (DALIANIS & ASTROM, 2001)
    • Perl
    • Machine learning techniques and matching rules
• Turkish
  • TUR et al. (2000)
    • Hidden Markov Model and Viterbi search
    • Lexical, morphological and context clues

Page 22: NAME TAGGING

Exercise: find name identification errors in http://nlp.cs.rpi.edu/course/fall14/nameerrors.html

Page 23: NAME TAGGING

Name Tagging: Task
• Person (PER): named person or family
• Organization (ORG): named corporate, governmental, or other organizational entity
• Geo-political entity (GPE): name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)
• But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.
• Extended name hierarchy, 150 types, domain-dependent (Sekine and Nobata, 2004)

Convert it into a sequence labeling problem – "BIO" tagging (a conversion sketch follows below):

  George  W.     Bush   discussed  Iraq
  B-PER   I-PER  I-PER  O          B-GPE

<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
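A minimal sketch (not from the slides) of the span-to-BIO conversion described above; the token indices and span boundaries for the example sentence are assumed for illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled token spans to BIO tags.

    tokens: list of tokens, e.g. ["George", "W.", "Bush", "discussed", "Iraq"]
    spans:  list of (start_index, end_index_exclusive, type) over token positions
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the name
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens inside the name
    return tags


if __name__ == "__main__":
    tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
    spans = [(0, 3, "PER"), (4, 5, "GPE")]  # <PER>George W. Bush</PER>, <GPE>Iraq</GPE>
    print(list(zip(tokens, spans_to_bio(tokens, spans))))
    # [('George', 'B-PER'), ('W.', 'I-PER'), ('Bush', 'I-PER'),
    #  ('discussed', 'O'), ('Iraq', 'B-GPE')]
```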

Page 24: NAME TAGGING

Quiz Time!
• Faisalabad's Catholic Bishop John Joseph, who had been campaigning against the law, shot himself in the head outside a court in Sahiwal district when the judge convicted Christian Ayub Masih under the law in 1998.
• Next, film clips featuring Herrmann's Hollywood music mingle with a suite from "Psycho," followed by "La Belle Dame sans Merci," which he wrote in 1934 during his time at CBS Radio.

Page 25: NAME TAGGING

Supervised Learning for Name Tagging
• Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
• Decision Trees (Sekine et al., 1998)
• Class-based Language Model (Sun et al., 2002; Ratinov and Roth, 2009)
• Agent-based Approach (Ye et al., 2002)
• Support Vector Machines (Takeuchi and Collier, 2002)
• Sequence Labeling Models
  • Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
  • Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
  • Conditional Random Fields (CRFs) (McCallum and Li, 2003)

Page 26: NAME TAGGING

Typical Name Tagging Features
• N-gram: unigram, bigram and trigram token sequences in the context window of the current token
• Part-of-Speech: POS tags of the context words
• Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
• Word clusters: to reduce sparsity, use word clusters such as Brown clusters (Brown et al., 1992)
• Case and Shape: capitalization and morphology-based features
• Chunking: NP and VP chunking tags
• Global features: sentence-level and document-level features, for example whether the token is in the first sentence of a document
• Conjunction: conjunctions of various features
(A feature-extraction sketch follows below.)
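A minimal sketch of per-token feature extraction along the lines listed above; the window size, gazetteer contents and feature names are illustrative assumptions, not the feature set of any particular system.

```python
def token_features(tokens, pos_tags, i, gazetteer):
    """Extract a feature dict for token i, roughly following the list above."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<S>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</S>"
    return {
        # n-gram features in a +/-1 context window
        "w0": tok.lower(),
        "w-1": prev_tok.lower(),
        "w+1": next_tok.lower(),
        "w-1_w0": prev_tok.lower() + "_" + tok.lower(),
        # part-of-speech of the current and previous words
        "pos0": pos_tags[i],
        "pos-1": pos_tags[i - 1] if i > 0 else "<S>",
        # case and shape
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        # gazetteer membership
        "in_gazetteer": tok in gazetteer,
        # a simple sentence-level ("global") feature
        "is_first_token": i == 0,
    }

# Example with a hypothetical gazetteer:
tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
pos = ["NNP", "NNP", "NNP", "VBD", "NNP"]
gaz = {"George", "Bush", "Iraq"}
print(token_features(tokens, pos, 2, gaz))
```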

Page 27: NAME TAGGING

Markov Chain for a Simple Name Tagger

[Figure: a Markov chain with states START, PER, LOC, X and END. Arcs carry transition probabilities (for example, START → PER with 0.3), and each state carries emission probabilities over words (for example, PER emits "George" with 0.3, "W." with 0.3 and "Bush" with 0.3; LOC emits "Iraq" with 0.8 and "George" with 0.2; X emits "discussed" with 0.7; END emits "$" with 1.0).]

Transition Probability / Emission Probability
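A sketch of how such a toy HMM could be written down in code. The exact transition values below are illustrative placeholders in the spirit of the figure (several arcs are not recoverable from the slide), not a faithful copy of it.

```python
# Toy HMM in the spirit of the figure above (values are illustrative, not the slide's exact numbers).
states = ["PER", "LOC", "X"]

# P(next_state | current_state); START and END are handled separately.
transition = {
    "START": {"PER": 0.3, "LOC": 0.2, "X": 0.5},
    "PER":   {"PER": 0.6, "LOC": 0.1, "X": 0.2, "END": 0.1},
    "LOC":   {"PER": 0.1, "LOC": 0.3, "X": 0.3, "END": 0.3},
    "X":     {"PER": 0.2, "LOC": 0.3, "X": 0.3, "END": 0.2},
}

# P(word | state)
emission = {
    "PER": {"George": 0.3, "W.": 0.3, "Bush": 0.3, "Iraq": 0.1},
    "LOC": {"Iraq": 0.8, "George": 0.2},
    "X":   {"discussed": 0.7, "was": 0.3},
}
```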

Page 28: NAME TAGGING

Viterbi Decoding of Name Tagger

[Figure: a Viterbi trellis for "George W. Bush discussed Iraq" over the states START, PER, LOC, X and END at time steps t = 0 … 6. Each cell holds the probability of the best path ending in that state at that time (for example, the PER cell at t = 1 holds 1 * 0.3 * 0.3 = 0.09); most other cells are zero or much smaller.]

Current = Previous * Transition * Emission
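A minimal Viterbi sketch implementing the recurrence above (Current = Previous * Transition * Emission). It is written against the toy `states`, `transition` and `emission` dictionaries from the earlier sketch; it is a generic textbook implementation, not the decoder of any cited system.

```python
def viterbi(words, states, transition, emission):
    """Most probable state sequence via: current = previous * transition * emission."""
    # Initialization from START.
    V = [{s: transition["START"].get(s, 0.0) * emission[s].get(words[0], 0.0)
          for s in states}]
    back = [{}]

    # Recursion over the remaining words.
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            best_prev, best_score = states[0], 0.0
            for p in states:
                score = V[t - 1][p] * transition[p].get(s, 0.0) * emission[s].get(words[t], 0.0)
                if score > best_score:
                    best_prev, best_score = p, score
            V[t][s] = best_score
            back[t][s] = best_prev

    # Termination: transition into END, then follow back-pointers.
    last = max(states, key=lambda s: V[-1][s] * transition[s].get("END", 0.0))
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Uses the toy HMM defined in the previous sketch.
print(viterbi(["George", "W.", "Bush", "discussed", "Iraq"],
              states, transition, emission))
# ['PER', 'PER', 'PER', 'X', 'LOC']
```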

Page 29: NAME TAGGING

Limitations of HMMs
• Model the joint probability distribution p(y, x)
• Assume independent features
• Cannot represent overlapping features or long-range dependencies between observed elements
• Need to enumerate all possible observation sequences
• Very strict independence assumptions on the observations

Toward discriminative/conditional models
• Model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x)
• Allow arbitrary, non-independent features on the observation sequence x
• The probability of a transition between labels may depend on past and future observations
• Relax the strong independence assumptions of generative models

Page 30: NAME TAGGING

Maximum Entropy
• Why maximum entropy?
  • Maximize entropy = minimize commitment
• Model all that is known and assume nothing about what is unknown
  • Model all that is known: satisfy a set of constraints that must hold
  • Assume nothing about what is unknown: choose the most "uniform" distribution, i.e. the one with maximum entropy

Page 31: NAME TAGGING

Why Try to be Uniform?
• Most uniform = maximum entropy
• By making the distribution as uniform as possible, we make no assumptions beyond what is supported by the data
• Abides by the principle of Occam's Razor (fewest assumptions = simplest explanation)
• Fewer generalization errors (less over-fitting), hence more accurate predictions on test data

Page 32: NAME TAGGING

Learning Coreference by Maximum Entropy Model
• Suppose that if the feature "Capitalization" = "Yes" for token t, then
  P(t is the beginning of a name | Capitalization = Yes) = 0.7
• How do we adjust the rest of the distribution?
  P(t is not the beginning of a name | Capitalization = Yes) = 0.3
• What if we don't observe any "Has Title = Yes" samples?
  P(t is the beginning of a name | Has Title = Yes) = 0.5
  P(t is not the beginning of a name | Has Title = Yes) = 0.5

Page 33: NAME TAGGING

The basic idea
• Goal: estimate p
• Choose p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):

  H(p) = - Σ_{x ∈ A×B} p(x) log p(x), where x = (a, b), a ∈ A, b ∈ B

Page 34: NAME TAGGING

Setting
• From training data, collect (a, b) pairs:
  • a: the thing to be predicted (e.g., a class in a classification problem)
  • b: the context
  • Example, name tagging:
    • a = person
    • b = the words in a window and the previous two tags
• Learn the probability of each pair: p(a, b)

Page 35: NAME TAGGING

Ex1: Coin-flip example (Klein & Manning, 2003)
• Toss a coin: p(H) = p1, p(T) = p2
• Constraint: p1 + p2 = 1
• Question: what's your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes H(p) = - Σ_x p(x) log p(x)

[Figure: entropy H plotted as a function of p1, with the point p1 = 0.3 marked; the maximum lies at the uniform distribution p1 = p2 = 0.5.]

Page 36: NAME TAGGING

Coin-flip example (cont.)

[Figure: H plotted over (p1, p2) along the constraint line p1 + p2 = 1; the additional constraint p1 = 0.3 picks out the single distribution p = (0.3, 0.7).]
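A quick numerical check of the coin-flip example, a sketch not taken from the slides: under the single constraint p1 + p2 = 1 the entropy peaks at the uniform distribution, and adding p1 = 0.3 leaves only one feasible distribution.

```python
import math

def H(ps):
    """Shannon entropy H(p) = -sum_x p(x) log p(x)."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# Constraint p1 + p2 = 1: sweep p1 and find the entropy-maximizing distribution.
best = max((i / 100 for i in range(1, 100)), key=lambda p1: H([p1, 1 - p1]))
print(best)            # 0.5 -> the uniform distribution
print(H([0.5, 0.5]))   # ~0.693 (= log 2)

# Adding the constraint p1 = 0.3 leaves the single feasible point p = (0.3, 0.7).
print(H([0.3, 0.7]))   # ~0.611, lower entropy but consistent with the evidence
```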

Page 37: NAME TAGGING

Ex2: An MT example (Berger et al., 1996)
• Possible translations for the word "in" are: [list shown on the slide]
• Constraint: [formula shown on the slide]
• Intuitive answer: [formula shown on the slide]

Page 38: NAME TAGGING

An MT example (cont.)
• Constraints: [formulas shown on the slide]
• Intuitive answer: [formula shown on the slide]

Page 39: NAME TAGGING

Why ME?
• Advantages
  • Combine multiple knowledge sources
    • Local
      • Word prefix, suffix, capitalization (POS tagging: Ratnaparkhi, 1996)
      • Word POS, POS class, suffix (WSD: Chao & Dyer, 2002)
      • Token prefix, suffix, capitalization, abbreviation (sentence boundary: Reynar & Ratnaparkhi, 1997)
    • Global
      • N-grams (Rosenfeld, 1997)
      • Word window
      • Document title (Pakhomov, 2002)
      • Structurally related words (Chao & Dyer, 2002)
      • Sentence length, conventional lexicon (Och & Ney, 2002)
  • Combine dependent knowledge sources

Page 40: NAME TAGGING

Why ME?
• Advantages
  • Add additional knowledge sources easily
  • Implicit smoothing
• Disadvantages
  • Computational cost
    • Expected values recomputed at each iteration
    • Normalizing constant
  • Overfitting
    • Feature selection
      • Cutoffs
      • Basic feature selection (Berger et al., 1996)

Page 41: NAME TAGGING

Maximum Entropy Markov Models (MEMMs)
• A conditional model representing the probability of reaching a state given an observation and the previous state
• Considers observation sequences to be events to be conditioned upon:

  P(s | x) = P(s1 | x1) ∏_{i=2..n} P(si | si-1, xi)

• Have all the advantages of conditional models
• No longer assume that features are independent
• Do not take future observations into account (no forward-backward)
• Subject to the label bias problem: bias toward states with fewer outgoing transitions

Page 42: NAME TAGGING

Conditional Random Fields (CRFs)
• Conceptual overview
  • Each attribute of the data fits into a feature function that associates the attribute and a possible label
    • A positive value if the attribute appears in the data
    • A zero value if the attribute is not in the data
  • Each feature function carries a weight that gives the strength of that feature function for the proposed label
    • High positive weights: a good association between the feature and the proposed label
    • High negative weights: a negative association between the feature and the proposed label
    • Weights close to zero: the feature has little or no impact on the identity of the label
• CRFs have all the advantages of MEMMs without the label bias problem
  • An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
  • A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
  • Weights of different features at different states can be traded off against each other
• CRFs provide the benefits of discriminative models
(A scoring sketch follows below.)

Page 43: NAME TAGGING


Example of CRFs

Page 44: NAME TAGGING

Sequential Model Trade-offs

Model   Speed            Discriminative vs. Generative   Normalization
HMM     very fast        generative                      local
MEMM    mid-range        discriminative                  local
CRF     relatively slow  discriminative                  global

Page 45: NAME TAGGING

State-of-the-art and Remaining Challenges
• State-of-the-art performance
  • On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
  • On CoNLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov and Roth, 2009)
• Remaining challenges
  • Identification, especially of organizations
    • Boundary: "Asian Pulp and Paper Joint Stock Company, Lt. of Singapore"
    • Need coreference resolution or context event features: "FAW has also utilized the capital market to directly finance, and now owns three domestic listed companies" (FAW = First Automotive Works)
  • Classification
    • "Caribbean Union": ORG or GPE?

Page 46: NAME TAGGING

Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

Page 47: NAME TAGGING

NLP Re-Ranking Framework
Useful for:
• Name Tagging
• Parsing
• Coreference Resolution
• Machine Translation
• Semantic Role Labeling

[Pipeline: Input Sentence → Baseline System → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis]

(A sketch of this pipeline follows below.)
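A minimal sketch of the generic re-ranking pipeline above; `baseline_nbest`, `extract_features` and `weights` are stand-ins for the baseline tagger's n-best output, the global/cross-sentence feature extractor, and learned re-ranker weights, not components of any cited system.

```python
def rerank(sentence, baseline_nbest, extract_features, weights, n=30):
    """Generic n-best re-ranking: score each hypothesis with richer features
    than the baseline could use, then return the top-scoring one."""
    hypotheses = baseline_nbest(sentence, n)        # e.g. n-best output of an HMM tagger
    def score(hypo):
        feats = extract_features(sentence, hypo)    # global / cross-sentence features
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(hypotheses, key=score)
```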

Page 48: NAME TAGGING

Name Re-Ranking

Sentence: Slobodan Milosevic was born in Serbia.

Generate N-best hypotheses:
• Hypo0: Slobodan Milosevic was born in <PER>Serbia</PER> .
• Hypo1: <PER>Slobodan</PER> Milosevic was born in <LOC>Serbia</LOC> .
• Hypo2: <PER>Slobodan Milosevic</PER> was born in <LOC>Serbia</LOC> .
• …

Goal: re-rank the hypotheses and get a new best one:
• Hypo0': <PER>Slobodan Milosevic</PER> was born in <LOC>Serbia</LOC> .
• Hypo1': <PER>Slobodan</PER> Milosevic was born in <LOC>Serbia</LOC> .
• Hypo2': Slobodan Milosevic was born in <PER>Serbia</PER> .

Page 49: NAME TAGGING

Enhancing Name Tagging: Prior Work

System                 Name Identification   Name Classification   Joint Inference         Model
Roth and Yih, 02 & 04  No                    Yes                   Yes (relation)          Linear Programming
Collins, 02            Yes                   No                    No                      RankBoost
Ji and Grishman, 05    Yes                   Yes                   Yes (coref, relation)   MaxEnt-Rank

Page 50: NAME TAGGING

Supervised Name Re-Ranking: Learning Function

Hypotheses and their F-measures (%): h1 = 80, h2 = 90, h3 = 60

• Direct re-ranking by classification (e.g., our '05 work):
  f(h1) = -1, f(h2) = 1, f(h3) = -1
• Direct re-ranking by score (e.g., boosting style):
  f(h1) = 0.8, f(h2) = 0.9, f(h3) = 0.6
• Indirect re-ranking + confidence-based decoding using a multi-class solution (e.g., MaxEnt, SVM style):
  f(h1, h2) = -1, f(h1, h3) = 1, f(h2, h1) = 1, f(h2, h3) = 1, f(h3, h1) = -1, f(h3, h2) = -1

Page 51: NAME TAGGING

Supervised Name Re-Ranking: Algorithms
• MaxEnt-Rank
  • Indirect re-ranking + decoding
  • Computes a reliable ranking probability for each hypothesis
• SVM-Rank
  • Indirect re-ranking + decoding
  • Theoretically achieves low generalization error
  • Uses a linear kernel in this work
• p-Norm Push Ranking (Rudin, 2006)
  • Direct re-ranking by score
  • New boosting-style supervised ranking algorithm that generalizes RankBoost
  • "p" represents how much to concentrate at the top of the hypothesis list
  • Theoretically achieves low generalization error

Page 52: NAME TAGGING

Wider Context than a Traditional NLP Re-Ranker

[Diagram: the same pipeline (ith input sentence → baseline system → n-best hypotheses → re-ranking feature extractor → re-ranker → best hypothesis), but the re-ranking feature extractor can look at all sentences of the document (Sent 0, Sent 1, …, Sent i), not only the current sentence.]

• Use only the current sentence vs. use cross-sentence information
• Use only the baseline system vs. use joint inference

Page 53: NAME TAGGING

Name Re-Ranking in an IE View

[Diagram: ith input sentence → baseline HMM → n-best hypotheses → re-ranking feature extractor → re-ranker → best hypothesis; the feature extractor may also draw on a relation tagger, an event tagger and a coreference resolver (marked "?" on the slide).]

Page 54: NAME TAGGING

Look Outside of Stage and Sentence

[Diagram: a relation tagger, an event tagger and a coreference resolver link <PERSON> name hypotheses across sentences S11 and S12 of Text 1 and S22 of Text 2; name structure parsing analyzes an <ORGANIZATION> name as ORG modifier + ORG suffix.]

Example sentences:
• The previous president Milosevic was beaten by the Opposition Committee candidate Kostunica.
• This famous leader and his wife Zuolica live in an apartment in Belgrade.
• Milosevic was born in Serbia's central industrial city.

Page 55: NAME TAGGING

Take Wider Context

[Diagram: the re-ranking feature extractor draws on cross-sentence information (all sentences Sent 0 … Sent i of the document) and on cross-stage joint inference (relation tagger, event tagger, coreference resolver), on top of the baseline HMM pipeline.]

Page 56: NAME TAGGING

Even Wider Context

[Diagram: the same pipeline, now with a cross-document coreference resolver; the re-ranking feature extractor uses cross-stage joint inference (relation tagger, event tagger) and cross-document information drawn from Doc1, Doc2, …, Docn.]

Page 57: NAME TAGGING

Feature Encoding Example

"CorefNum" feature: the names in hypothesis h are referred to by CorefNum other noun phrases (the larger CorefNum, the more likely h is correct).

[Diagram: across Text 1 (sentences S11, S13) and Text 2 (sentence S21), the <PERSON> and <ORGANIZATION> name hypotheses in S11 receive 2, 1 and 1 coreferring mentions respectively, so for the current name hypothesis of S11: CorefNum = 2 + 1 + 1 = 4.]

Example sentences:
• The previous president Milosevic was beaten by the Opposition Committee candidate Kostunica.
• Kostunica joined the committee in 1992.
• Milosevic was born in Serbia's central industrial city.
(A small sketch of this feature follows below.)
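A tiny sketch of how a CorefNum-style global feature could be computed for a hypothesis given coreference counts from a resolver; the data structures are assumptions for illustration.

```python
def coref_num(hypothesis_names, coref_counts):
    """CorefNum: total number of other noun phrases that corefer with the names
    proposed in this hypothesis (a larger value suggests a more plausible hypothesis)."""
    return sum(coref_counts.get(name, 0) for name in hypothesis_names)

# Counts as in the slide's example: the names in S11 are referred to 2, 1 and 1 times.
counts = {"Milosevic": 2, "Opposition Committee": 1, "Kostunica": 1}
print(coref_num(["Milosevic", "Opposition Committee", "Kostunica"], counts))  # 4
```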

Page 58: NAME TAGGING

Experiment on Chinese Name Tagging: Results

Model                                   Precision   Recall   F-Measure
HMM Baseline                            87.4        87.6     87.5
Oracle (N=30)                           (96.2)      (97.4)   (96.8)
MaxEnt-Rank                             91.8        90.6     91.2
SVM-Rank                                89.5        90.1     89.8
p-Norm Push Ranking (p = 16)            91.2        90.8     91.0
RankBoost (p-Norm Push Ranking, p = 1)  89.3        89.7     89.5

• All re-ranking models achieved significantly better precision, recall and F-measure than the baseline.
• Recall and F-measure values for MaxEnt-Rank and p-Norm Push Ranking are within the confidence interval.

Page 59: NAME TAGGING

Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

Page 60: NAME TAGGING

NLP:
Words words words ---> Statistics
Words words words ---> Statistics
Words words words ---> Statistics

Data sparsity in NLP:
• "I have bought a pre-owned car"
• "I have purchased a used automobile"

How do we represent (unseen) words?

Page 61: NAME TAGGING

NLP: Not so well…
• We do well when we see words we have already seen in training examples and have enough statistics about them.
• When we see a word we haven't seen before, we try:
  • Part-of-speech abstraction
  • Prefix/suffix/number/capitalization abstraction
• We have a lot of text! Can we do better?

Page 62: NAME TAGGING

Word Class Models (Brown et al., 1992)
• Can be seen either as:
  • a hierarchical distributional clustering, or
  • iteratively reducing the number of states and assigning words to hidden states such that the joint probability of the data and the assigned hidden states is maximized.
(A sketch of using such clusters as features follows below.)
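A minimal sketch of the common way Brown clusters are used in name taggers: each word is mapped to a binary path in the cluster hierarchy, and prefixes of that path become features at several granularities. The cluster strings below are made-up placeholders, not real induced Brown cluster IDs.

```python
# Hypothetical word -> Brown-cluster bit-string mapping (real mappings are induced from large corpora).
brown_clusters = {
    "car":        "0110100",
    "automobile": "0110101",   # near-synonyms tend to share long prefixes
    "bought":     "1011000",
    "purchased":  "1011001",
}

def brown_features(word, prefix_lengths=(4, 6)):
    """Back off from sparse word identities to shared cluster prefixes of several lengths."""
    path = brown_clusters.get(word.lower())
    if path is None:
        return {}
    return {f"brown_prefix{k}={path[:k]}": 1 for k in prefix_lengths}

print(brown_features("car"))          # shares its prefix features with "automobile"
print(brown_features("automobile"))
```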

Page 63: NAME TAGGING

Gazetteers
• Weakness of Brown clusters and word embeddings: representing the word "Bill" in
  • "The Bill of Congress"
  • "Bill Clinton"
• We either need context-sensitive embeddings or embeddings of multi-token phrases
• Current simple solution: gazetteers
  • The Wikipedia category structure gives ~4M typed expressions

Page 64: NAME TAGGING

Results
• NER is a knowledge-intensive task.
• Surprisingly, the knowledge was particularly useful (*) on out-of-domain data, even though it was not used for C&W induction or Brown cluster induction.

Page 65: NAME TAGGING

Obtaining Gazetteers Automatically?
• Achieving really high performance for name tagging requires
  • deep semantic knowledge
  • large, costly hand-labeled data
• Many systems also exploit lexical gazetteers,
  • but this knowledge is relatively static, expensive to construct, and doesn't include any probabilistic information.

Page 66: NAME TAGGING

Obtaining Gazetteers Automatically?
• Data is power
• The Web is one of the largest text corpora; however, web search is slooooow (if you have a million queries)
• N-gram data: a compressed version of the Web
  • Already proven to be useful for language modeling
• Tools for large N-gram data sets are not widely available
• What are the uses of N-grams beyond language models?

Page 67: NAME TAGGING

Counts on the Web

Page 68: NAME TAGGING

Pattern: died in (a|an) _ accident
Fillers and counts: car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, common 11, canoe 11, skateboarding 10, ship 10, paragliding 10, paddock 10, moped 10, factory 10

Page 69: NAME TAGGING

A Typical Name Tagger
• Name-labeled corpora: 1,375 documents, about 16,500 name mentions
• Manually constructed name gazetteer including 245,615 names
• Census data including 5,014 person-gender pairs

Page 70: NAME TAGGING

Patterns for Gender and Animacy Discovery
(columns: property; pattern name; target [# of instances]; context; pronoun; example)
• Gender, Conjunction-Possessive: noun [292,212] | capitalized [162,426]; context: conjunction; pronoun: his|her|its|their; example: "John and his"
• Gender, Nominative-Predicate: noun [53,587]; context: am|is|are|was|were|be; pronoun: he|she|it|they; example: "he is John"
• Gender, Verb-Nominative: noun [116,607]; context: verb; pronoun: he|she|it|they; example: "John thought he"
• Gender, Verb-Possessive: noun [88,577] | capitalized [52,036]; context: verb; pronoun: his|her|its|their; example: "John bought his"
• Gender, Verb-Reflexive: noun [18,725]; context: verb; pronoun: himself|herself|itself|themselves; example: "John explained himself"
• Animacy, Relative-Pronoun: (noun|adjective) & not after (preposition|noun|adjective) [664,673]; context: comma|empty; pronoun: who|which|where|when; example: "John, who"

Page 71: NAME TAGGING

Lexical Property Mapping

Property  Pronoun                  Value
Gender    his|he|himself           masculine
Gender    her|she|herself          feminine
Gender    its|it|itself            neutral
Gender    their|they|themselves    plural
Animacy   who                      animate
Animacy   which|where|when         non-animate

Page 72: NAME TAGGING

Gender Discovery Examples
• If a mention indicates masculine or feminine gender with high confidence, it is likely to be a person mention.

Patterns for candidate mentions      male   female   neutral   plural
John Joseph bought/… his/…           32     0        0         0
Haifa and its/…                      21     19       92        15
screenwriter published/… his/…       144    27       0         0
it/… is/… fish                       22     41       1741      1186

Page 73: NAME TAGGING

Animacy Discovery Examples
• If a mention indicates animacy with high confidence, it is likely to be a person mention.

Patterns for candidate mentions    Animate   Non-Animate
Candidate                          who       when    where   which
supremo                            24        0       0       0
shepherd                           807       24      0       56
prophet                            7372      1066    63      1141
imam                               910       76      0       57
oligarchs                          299       13      0       28
sheikh                             338       11      0       0

Page 74: NAME TAGGING

Overall Procedure

[Diagram: Offline processing: gender and animacy knowledge discovery over the Google N-grams, followed by confidence estimation, producing Confidence(noun, masculine/feminine/animate). Online processing: a test document goes through token scanning and stop-word filtering to produce candidate name mentions and candidate nominal mentions, which are fuzzy-matched against the discovered knowledge to output person mentions.]

Page 75: NAME TAGGING

Unsupervised Mention Detection Using Gender and Animacy Statistics
• Candidate mention detection
  • Name: capitalized sequence of <= 3 words; filter stop words, nationality words, dates, numbers and title words
  • Nominal: un-capitalized sequence of <= 3 words without stop words
• Margin confidence estimation:
  (freq(best property) - freq(second-best property)) / freq(second-best property)
• Keep a candidate if Confidence(candidate, masculine/feminine/animate) exceeds a threshold
  • Full matching: candidate = full string
  • Composite matching: candidate = each token in the string
  • Relaxed matching: candidate = any two tokens in the string
(A matching sketch follows below.)
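A sketch of the matching strategies and the margin confidence described above; the property-count table is an illustrative assumption, and only the full and composite strategies are implemented (the relaxed fallback over any two tokens could be added in the same style).

```python
def margin_confidence(counts):
    """Margin confidence: (f_best - f_second) / f_second over a property-count dict."""
    freqs = sorted(counts.values(), reverse=True)
    f1, f2 = freqs[0], (freqs[1] if len(freqs) > 1 else 0)
    return float("inf") if f2 == 0 else (f1 - f2) / f2

def match_counts(candidate, table):
    """Full matching first; fall back to composite matching (each token)."""
    if candidate in table:                                        # full matching
        return [table[candidate]]
    return [table[t] for t in candidate.split() if t in table]   # composite matching

table = {  # illustrative counts in the spirit of the slide's examples
    "John Joseph": {"masculine": 32, "feminine": 0, "neutral": 0, "plural": 0},
    "Ayub": {"masculine": 87, "feminine": 0},
    "Masih": {"masculine": 117, "feminine": 0},
}
for cand in ["John Joseph", "Ayub Masih"]:
    print(cand, [margin_confidence(c) for c in match_counts(cand, table)])
    # large margins for the masculine property -> likely person mentions
```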

Page 76: NAME TAGGING

Property Matching Examples

Mention candidate       Matching method      String for matching   masculine   feminine   neutral   plural
John Joseph             Full matching        John Joseph           32          0          0         0
Ayub Masih              Composite matching   Ayub                  87          0          0         0
                                             Masih                 117         0          0         0
Mahmoud Salim Qawasmi   Relaxed matching     Mahmoud               159         13         0         0
                                             Salim                 188         13         0         0
                                             Qawasmi               0           0          0         0

Page 77: NAME TAGGING

Separate Wheat from Chaff: Confidence Estimation

Rank the properties for each noun according to their frequencies: f1 > f2 > … > fk

• Percentage: f1 / (f1 + f2 + … + fk)
• Margin: (f1 - f2) / f2
• Margin & frequency: (f1 - f2) / f2 combined with log(f1)

(A sketch of these metrics follows below.)
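A sketch of the percentage and margin confidence metrics reconstructed above; the "margin & frequency" variant is shown as a simple pairing of the margin with log f1, which is my reading of the garbled slide rather than a confirmed formula.

```python
import math

def confidence_metrics(freqs):
    """Confidence metrics over one noun's property frequencies f1 >= f2 >= ... >= fk."""
    f = sorted(freqs, reverse=True)
    f1, f2 = f[0], (f[1] if len(f) > 1 else 0)
    percentage = f1 / sum(f)                            # f1 / (f1 + ... + fk)
    margin = (f1 - f2) / f2 if f2 else float("inf")     # (f1 - f2) / f2
    margin_and_freq = (margin, math.log(f1) if f1 else 0.0)
    return percentage, margin, margin_and_freq

# Example: a noun seen 144 times in masculine contexts and 27 times in feminine ones.
print(confidence_metrics([144, 27, 0, 0]))
```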

Page 78: NAME TAGGING

Experiments: Data

Page 79: NAME TAGGING

Impact of Knowledge Sources on Mention Detection for the Dev Set

Patterns applied to n-grams for nominal mentions      P(%)    R(%)    F(%)
Conjunction-Possessive ("writer and his")             78.57   10.28   18.18
+ Predicate ("he is a writer")                        78.57   20.56   32.59
+ Verb-Nominative ("writer thought he")               65.85   25.23   36.49
+ Verb-Possessive ("writer bought his")               55.71   36.45   44.07
+ Verb-Reflexive ("writer explained himself")         64.41   35.51   45.78
+ Animacy ("writer, who")                             63.33   71.03   66.96

Patterns applied to n-grams for name mentions         P(%)    R(%)    F(%)
Conjunction-Possessive ("John and his")               68.57   64.86   66.67
+ Verb-Nominative ("John thought he")                 69.23   72.97   71.05
+ Animacy ("John, who")                               85.48   81.96   83.68

Page 80: NAME TAGGING

Impact of Confidence Metrics

[Chart: name gender (conjunction) confidence metric tuning on the dev set.]

Why some metrics don't work:
• High Percentage("The") = 95.9%; counts for "The": F:112, M:166, P:12
• High Margin&Freq("Under") = 16; counts for "Under": F:30, M:233, N:15, P:49

Page 81: NAME TAGGING

Exercise: suggest solutions to remove the name identification errors in http://nlp.cs.rpi.edu/course/fall14/nameerrors.html