A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information Extraction) 2011-04-01 JongHeum Yeon




INTRODUCTION

• Named Entity Recognition (NER)
  – Locate and classify parts of free text into a set of predefined categories
• Semi-Structured Named Entities (SSNE)
  – The entity string may syntactically contain
    1. digits and white spaces
    2. a finite set of non-letter characters
    3. a finite set of domain-specific terms (tokens)
  – Not SSNE (X): person, location and organization names
  – SSNE (O): phone or fax numbers, dates, times, monetary amounts, weights, lengths, percentages
• These entities differ between languages and regions.
• Traditional approaches for SSNE recognition
  – require a significant amount of manual effort
  – in the form of handcrafting rules
  – in the form of labeling examples
• They do not scale when many entity types, languages and regions need to be supported.


GOAL

• Provide a scalable solution for the detection of semi-structured named entities for many entity types, languages and regions
• A novel three-level framework
  – Text mining techniques
  – A novel two-step bootstrapping algorithm


Applications

• User-centric entity detection systems
• Web search engines
• Data extraction, integration and classification
• Content de-identification systems


Existing Solutions and Their Limitations

• Rule-based approaches
  – Regular expressions
  – Significant manual effort to generate
  – The regular expression used for phone number detection in Y! Mail contains more than 600 characters (large coverage, ambiguous patterns)
  – Differences: languages, regions
    • Editorial study
    • Development
    • Maintenance
  – Date and time detection
    • Ambiguity problems
      – “1/18”
      – “We will meet this may” / “This may not be ok”
      – “2000 is a nice year” / “Windows 2000 is an operating system”
• Machine-learning approaches
  – Supervised approaches
    • A large annotated corpus
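
The ambiguity problem listed above can be illustrated with a small sketch. The two patterns below are hypothetical, hand-written rules (they are not taken from the paper or from Y! Mail); they only show that a context-free rule fires on all four example sentences, including the two that do not mention a date.

```python
import re

# Naive, hand-written patterns (illustrative assumptions, not the authors' rules).
MONTH = re.compile(r"\bmay\b", re.IGNORECASE)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

sentences = [
    "We will meet this may",                # "may" is a month here
    "This may not be ok",                   # "may" is a modal verb
    "2000 is a nice year",                  # "2000" is a year
    "Windows 2000 is an operating system",  # "2000" is part of a product name
]

# A rule without context cannot tell these readings apart: every sentence matches.
matches = [bool(MONTH.search(s) or YEAR.search(s)) for s in sentences]
```

Resolving such cases needs the surrounding words, which is what the context-based approach on the following slides relies on.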


BOOTSTRAPPING ALGORITHM

• Bootstrapping algorithm steps
  – The system starts with a small number of seed examples, which are provided by the user
  – The system then finds occurrences of these examples in a large set of documents.
  – By analyzing these occurrences, the system generates contextual extraction patterns (rules) and assigns confidence scores to the patterns.
  – The system applies the extraction patterns to the documents and extracts new candidates.
  – Based on some validation mechanism, the system assigns scores to the extracted candidates, and chooses the best ones to add to the seed set.
  – Then the system starts over to perform many similar iterations, and at every iteration it learns more patterns and can extract more instances.
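
The loop described above can be sketched in a few lines. This is a toy version under strong assumptions: a "pattern" is just the (previous word, next word) pair around a seed, and the confidence scoring and validation steps are omitted entirely. The function name `bootstrap` and the sample documents are hypothetical.

```python
def bootstrap(docs, seeds, max_iters=5):
    """Toy bootstrapping loop; patterns are (previous word, next word) pairs."""
    seeds = set(seeds)
    for _ in range(max_iters):
        # Steps 1-3: find seed occurrences and turn their contexts into patterns.
        patterns = set()
        for doc in docs:
            toks = doc.split()
            for i in range(1, len(toks) - 1):
                if toks[i] in seeds:
                    patterns.add((toks[i - 1], toks[i + 1]))
        # Step 4: apply the patterns to extract new candidates.
        new = set()
        for doc in docs:
            toks = doc.split()
            for i in range(1, len(toks) - 1):
                if (toks[i - 1], toks[i + 1]) in patterns and toks[i] not in seeds:
                    new.add(toks[i])
        # Steps 5-6: a real system would score and filter here; we accept everything.
        if not new:
            break
        seeds |= new
    return seeds


docs = [
    "call 555-1234 now",
    "call 555-9876 now",
    "fax 555-9876 today",
    "fax 555-0000 today",
    "see page 12 now",
]
found = bootstrap(docs, {"555-1234"})
```

Starting from one seed, the second iteration learns the "fax ... today" pattern from a newly added seed, which is how later iterations reach instances that the original seed's contexts never covered.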


THREE-LEVEL FRAMEWORK


THREE-LEVEL FRAMEWORK: First Level

• The goal
  – Achieve a high recall rate without worrying about precision
• A very large set of candidates is extracted with their contexts.
• The context is a window of words that surrounds the candidate (20 words).
• This usually reduces the total data size significantly, since most parts of the corpus are eliminated.
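
A minimal sketch of first-level extraction with a context window follows. The slide does not say whether the 20 words are counted per side or in total, so `words_per_side` is an assumed parameter, and `extract_with_context` is a hypothetical helper name.

```python
import re


def extract_with_context(text, candidate_re, words_per_side=10):
    """Return (candidate, left_context, right_context) for every regex match."""
    results = []
    for m in candidate_re.finditer(text):
        # Keep only the words closest to the candidate on each side.
        left = text[:m.start()].split()[-words_per_side:]
        right = text[m.end():].split()[:words_per_side]
        results.append((m.group(), left, right))
    return results


triples = extract_with_context(
    "please call us at 555 1234 for more details",
    re.compile(r"\d{3} \d{4}"),
    words_per_side=2,
)
```

Only the candidates and their windows are kept for the later levels, which is why the rest of the corpus can be discarded after this pass.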


THREE-LEVEL FRAMEWORK: Second Level

• The goal
  – Identify the target entities with very high precision.
• Bootstrapping algorithm
  1. The candidates with their contexts are moved to a Positive Set if the candidate matches a seed example
  2. The confidence score of each remaining candidate is computed based on the similarity of its contexts to the positive set, and valid ones are moved to the positive set
  3. The positive set is analyzed and the qualifying entities are included in the seed set based on some statistical data
• Similarity Function
  – Cosine Similarity
  – Cosine Similarity with Distances
  – String Length
• Validate candidates
  – Local Ranking
  – Global Ranking
• Stop the bootstrapping algorithm when the positive set growth slows significantly (e.g., if it grows by less than 1%)
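
Two pieces of this level lend themselves to a short sketch: plain cosine similarity between bag-of-words contexts, and the growth-based stopping rule. The distance-weighted variant, the string-length signal and the local/global ranking are not shown because the slide gives no details. Both function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_stop(prev_size, new_size, min_growth=0.01):
    """Stop when the positive set grew by less than min_growth (1% by default)."""
    return prev_size > 0 and (new_size - prev_size) / prev_size < min_growth


positive_ctx = Counter("please call us at".split())
candidate_ctx = Counter("call us today at".split())
score = cosine(positive_ctx, candidate_ctx)
```

A candidate whose context shares no words with the positive set scores 0.0, so it would never be moved to the positive set by this measure alone.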


THREE-LEVEL FRAMEWORK: Third Level

• The goal
  – Create a final model using machine learning techniques to balance and further improve the overall precision and recall rates.
• Feature space
  – A bag of words model
  – Multi-term phrases using pointwise mutual information
  – Prefix and suffix strings that immediately precede or follow the candidates (or seeds) as features
• Model creation
  – Support vector machine (SVM)
  – One-class model: the positive set alone
  – Two-class model: the positive set + negative examples (most dissimilar candidates)
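
For the multi-term phrase features, pointwise mutual information of a word pair is log(p(x, y) / (p(x) p(y))). The sketch below computes it for adjacent word pairs from raw counts; how the authors threshold or smooth the scores is not stated on the slide, and `bigram_pmi` is a hypothetical helper.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """PMI (base 2) for every adjacent word pair in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "phone number is the phone number of the office".split()
pmi = bigram_pmi(tokens)
```

Pairs that co-occur more often than their individual frequencies predict, such as ("phone", "number") here, score higher and are the natural candidates for phrase features.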


Example(1): Phone Number Detection

• First Level
  – Finds sequences of digits that may be separated by the space character and the following characters: -+()./[]_*
  – 7 to 13 characters, fewer than 10 non-digit characters
• Second Level: A two-step algorithm
  – Adding new regular expressions
  – Adding new seed terms: tf*idf

• Third Level
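
The first-level phone filter can be sketched as below. The separator set is the one listed on the slide; reading "7 to 13" as a count of digits is an assumption, as are the function and parameter names.

```python
import re

# Digits, spaces, and the separator characters from the slide: -+()./[]_*
RUN = re.compile(r"[0-9 \-+()./\[\]_*]+")


def phone_candidates(text, min_digits=7, max_digits=13, max_non_digits=9):
    """High-recall candidate extraction; precision is left to later levels."""
    out = []
    for m in RUN.finditer(text):
        cand = m.group().strip()
        digits = sum(ch.isdigit() for ch in cand)
        non_digits = len(cand) - digits
        # Assumption: the 7-13 bound applies to the number of digits.
        if min_digits <= digits <= max_digits and non_digits <= max_non_digits:
            out.append(cand)
    return out


phones = phone_candidates("Call +1 (408) 555-1234 today or see page 12.")
noisy = phone_candidates("Dated 2011-04-01")
```

The second call shows the intended trade-off: a date passes the filter as a phone candidate, and it is the job of the second and third levels to reject it based on context.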


Example(2): Date and Time Detection


EVALUATION AND RESULTS

• English, German, Polish, Swedish and Turkish
• For each language, we use a corpus of 1GB, which contains about 200K random web pages from a large crawl
• Sample set
  – Phone: 250 instances
  – Date-time: 300 instances
• Each instance is then judged by a native speaker of the language
  – No: This is not a target entity
  – Yes: This is a target entity and boundaries are correct
  – Over-Selection: This is a target entity, but some unrelated text is included
  – Under-Selection: This is a target entity, but some part of the entity is not included
• Precision = TP/(TP+FP)
• Recall = TP/(TP+FN)
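
The two formulas above translate directly into code. The counts in the usage line are made-up numbers for illustration only, not results from the paper, and the slide does not say how Over-Selection and Under-Selection judgments are mapped to TP, FP and FN.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts, chosen only to exercise the formulas.
p, r = precision_recall(tp=90, fp=10, fn=30)
```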


EVALUATION AND RESULTS: Phone Number Detection


EVALUATION AND RESULTS: Phone Number Detection


Contributions

1. Propose a novel three-level bootstrapping framework
2. Adopt the framework for semi-structured entity detection and propose a two-step bootstrapping algorithm
3. Evaluate the proposed techniques extensively on English, German, Polish, Swedish and Turkish documents for phone, date and time entities
4. Discuss implementation details for the real-time detection of semi-structured entities