



Automatic Rule Refinement for Information Extraction

Bin Liu, University of Michigan
Laura Chiticariu, IBM Research - Almaden
Vivian Chu, IBM Research - Almaden
Frederic R. Reiss, IBM Research - Almaden
H. V. Jagadish, University of Michigan

VLDB 2010

Presenter: Ajay Gupta
Date: 20th Oct 2011

Outline

- Introduction
- Rules Representation
- Method Overview
- Experimental Setup
- Results
- Conclusion & Future Work


Information Extraction (IE)

- Distill structured data from unstructured text
- Exploit the extracted data in your applications

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Name             | Title   | Organization
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | Founder | Free Software Foundation

Frederic Reiss et al., SIGMOD 2010 Tutorial

Annotations


Rule Based Information Extraction

Most IE systems use rules to define important patterns in the text.

Example: Person name extractor
If a match of a dictionary of common first names occurs in the text, followed immediately by a capitalized word, mark the two words as a "candidate person name".
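As a rough Python sketch of this rule (the tiny first-name dictionary and the whitespace tokenization are illustrative assumptions for this sketch, not the paper's implementation):

```python
import re

# Illustrative first-name dictionary (an assumption, not the paper's actual file).
FIRST_NAMES = {"anna", "james", "john", "peter"}

def candidate_person_names(text):
    """Mark <first-name dictionary match> + <capitalized word> pairs
    as candidate person names."""
    tokens = re.findall(r"[A-Za-z]+", text)
    candidates = []
    for first, second in zip(tokens, tokens[1:]):
        # Dictionary match followed immediately by a capitalized word.
        if (first.lower() in FIRST_NAMES
                and second[0].isupper() and second[1:].islower()):
            candidates.append(f"{first} {second}")
    return candidates
```

On "Anna at James St. office" this marks "James St" as a candidate, which illustrates the kind of false positive the refinement method targets.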


Example Extraction Rules – When Things Go Wrong

Anna at James St. office (555-1234), or James, her assistant - 555-7789 have the details.

Person: Anna
Person: James   ← false positive ("James St." is a street name)
Phone:  555-1234
Person: James
Phone:  555-7789


Rule Development in Information Extraction

Develop → Test → Analyze → (repeat)

This iterative refinement process is:
- labor intensive
- time-consuming
- error prone


Rule Refinement Is Hard

Number of rules could be large.

Rule interactions could be complex.

Analyzing side effects of a change:
- Removing false positives → improves precision
- Removing correct results → decreases recall

Identifying the right change could take hours
- e.g., the Person extractor alone has 14 complex rules


Rules Representation


SQL is used to represent rules.
SQL subset: Select, Project, Join, Union All, Except All

SQL extensions:
- Data type: span
- Table: Document(text span)
- Predicate functions: Follows, FollowsTok, Contains
- Scalar functions: Merge, Between, LeftContext
- Table functions: Regex, Dictionary
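A minimal sketch of how the span predicates above might behave, assuming a span is simply a (begin, end) pair of character offsets (the actual SystemT semantics may differ in details such as token-based variants):

```python
# A span is a (begin, end) pair of character offsets into the document text.

def follows(s1, s2, min_dist, max_dist):
    """True if s2 begins between min_dist and max_dist characters after s1 ends."""
    return min_dist <= s2[0] - s1[1] <= max_dist

def contains(s1, s2):
    """True if span s1 covers span s2."""
    return s1[0] <= s2[0] and s2[1] <= s1[1]

def merge(s1, s2):
    """Shortest span covering both s1 and s2."""
    return (min(s1[0], s2[0]), max(s1[1], s2[1]))
```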


Rules Examples

Dictionary file first_names.dict: anna, james, john, peter…

R1: create view Phone as
    Regex('\d{3}-\d{4}', Document, text);

R2: create view FirstNameCand as
    Dictionary('first_names.dict', Document, text);

R3: create view FirstName as
    select * from FirstNameCand F
    where Not(ContainsDict('street_suffix.dict', RightContextTok(F.match, 1)));

Document:
t0: Anna at James St. office (555-1234), or James, her assistant - 555-7789 have the details.

Phone:
t1: 555-1234
t2: 555-7789

FirstNameCand:
t3: Anna
t4: James
t5: James

FirstName:
t6: Anna
t7: James


Rules Examples

R4: create view PersonPhoneAll as
    select Merge(F.match, P.match) as match
    from FirstName F, Phone P
    where Follows(F.match, P.match, 0, 60);

R5: create table PersonPhone(match span);
    insert into PersonPhone
      ( select * from PersonPhoneAll A )
      except all
      ( select A1.* from PersonPhoneAll A1, PersonPhoneAll A2
        where Contains(A1.match, A2.match)
          and Not(Equals(A1.match, A2.match)) );

PersonPhoneAll:
t8:  Anna at James St. office (555-1234)
t9:  James, her assistant - 555-7789
t10: Anna at James St. office (555-1234), or James, her assistant - 555-7789

PersonPhone:
t11: Anna at James St. office (555-1234)
t12: James, her assistant - 555-7789
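In plain Python, R4's join and R5's except-all containment filter could be sketched roughly as follows, with spans as (begin, end) offset pairs; this is an illustration of the semantics, not SystemT's implementation:

```python
def person_phone(first_names, phones, max_dist=60):
    """R4: join FirstName and Phone spans where the phone follows the name
    within max_dist characters; R5: subtract results that strictly contain
    another result."""
    all_matches = [(f[0], p[1]) for f in first_names for p in phones
                   if 0 <= p[0] - f[1] <= max_dist]
    return [a for a in all_matches
            if not any(a != b and a[0] <= b[0] and b[1] <= a[1]
                       for b in all_matches)]

# Spans from the example document t0:
# FirstName: Anna (0, 4), James (40, 45); Phone: (26, 34), (63, 71)
```

Run on the example spans, this keeps t11 and t12 and discards the all-covering span t10.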


Canonical Representation of Rules


Method Overview


Method Overview

Data provenance: Boris Glavic, Gustavo Alonso, ICDE 2009


Input: a set of correct and incorrect examples generated by an extractor

Goal: generate refinements of the extractor that remove the incorrect examples while minimizing the effect on the rest of the results

Basic idea: data provenance allows one to understand the origins of an output
- Cut any provenance link → the wrong output disappears

Method Overview

[Diagram: (simplified) provenance of a wrong output]

Doc → Dictionary('FirstNames.dict') → FirstNameCand: "Anna"
Doc → Regex('\d{3}-\d{4}') → Phone: "555-7789"
("Anna", "555-7789") → Join [Follows(name, phone, 0, 60)] → PersonPhoneAll: "Anna ... 555-7789"


Method Overview

Solution: two stages

Stage 1: Generate high-level changes (HLCs)
- "Remove tuple t from the output of operator Op in the canonical representation of the extractor."
- Problems: 1) feasibility, 2) side effects

Stage 2: Generate low-level changes (LLCs)
- How to modify the operator to implement a high-level change
- Ranking the candidate changes


High-Level Change

DEFINITION (HIGH-LEVEL CHANGE): Let t be a tuple in an output table V. A high-level change for t is a pair (t′, Op), where Op is an operator in the canonical operator graph of V and t′ is a tuple in the output of Op, such that eliminating t′ from the output of Op by modifying Op results in eliminating t from V.


Computing Provenance


Algorithm to Generate HLCs


HLC Example

[Diagram: provenance of the wrong output "Anna ... 555-7789"]

Doc → Dictionary('FirstNames.dict') → FirstNameCand: "Anna"
    → Select [Not(ContainsDict('street_suffix.dict', ...))] → FirstName: "Anna"
Doc → Regex('\d{3}-\d{4}') → Phone: "555-7789"
("Anna", "555-7789") → Join [Follows(name, phone, 0, 60)] → PersonPhoneAll: "Anna ... 555-7789"

HLC1: Remove "Anna" <-> "555-7789" from the output of Join in R4
HLC2: Remove "555-7789" from the output of Regex in R1
HLC3: Remove "Anna" from the output of Select in R3
HLC4: Remove "Anna" from the output of Dictionary in R2
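These HLCs can be read directly off the provenance tree. A rough sketch, assuming each output tuple has a single derivation (the paper handles the general case) and using a simple dict-based tree representation:

```python
def generate_hlcs(node):
    """Walk the provenance of a wrong output tuple and collect (tuple, operator)
    pairs whose removal would eliminate the wrong output."""
    hlcs = [(node["tuple"], node["op"])]
    for child in node.get("inputs", []):
        if child["op"] != "Doc":  # the input document itself cannot be changed
            hlcs.extend(generate_hlcs(child))
    return hlcs

# Provenance of the wrong output from the example:
prov = {"op": "Join(R4)", "tuple": "Anna<->555-7789", "inputs": [
    {"op": "Select(R3)", "tuple": "Anna", "inputs": [
        {"op": "Dictionary(R2)", "tuple": "Anna", "inputs": [
            {"op": "Doc", "tuple": "t0"}]}]},
    {"op": "Regex(R1)", "tuple": "555-7789", "inputs": [
        {"op": "Doc", "tuple": "t0"}]}]}
```

Running `generate_hlcs(prov)` yields the four HLCs listed on the slide.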


Generating Low-Level Changes from HLCs

HLC1 (remove "Anna" <-> "555-7789" from the output of Join in R4)
  → LLC: change the join predicate to Follows(name, phone, 0, 50)

HLC4 (remove "Anna" from the output of Dictionary in R2)
  → LLC: remove 'anna' from FirstNames.dict


Generating Low-Level Changes from HLCs: Naive Approach

Input: Set of HLCs

Output: List of LLCs, ranked based on effects

Algorithm:

1) For each operator Op, consider all HLCs

2) For each HLC, enumerate all possible LLCs

3) For each LLC:
   - Compute the set of local tuples it removes from the output of Op
   - Propagate these removals up through the provenance graph to compute the effect on the end-to-end result

4) Rank LLCs
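Steps 3 and 4 can be sketched as follows, assuming a `propagate` function that maps locally removed tuples to the end-to-end results that disappear, and gold labels for each result; the simple gain-minus-loss score here is an illustrative stand-in for the paper's actual ranking criterion:

```python
def rank_llcs(llc_local_removals, propagate, labels):
    """Score each candidate low-level change by the wrong end-to-end results
    it removes minus the correct results it removes, then rank best-first."""
    scored = []
    for llc, local_removed in llc_local_removals.items():
        end_removed = propagate(local_removed)  # end-to-end results that disappear
        gain = sum(1 for t in end_removed if labels[t] == "wrong")
        loss = sum(1 for t in end_removed if labels[t] == "correct")
        scored.append((gain - loss, llc))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [llc for _, llc in scored]
```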


Problems with Naive Approach

Problem 1: The number of possible LLCs for an HLC can be very large.

Example: removing an output tuple of a Dictionary operator
- A dictionary with 1000 entries → 2^999 − 1 possible LLCs!

Solution: limit the LLCs considered to a set of tractable size, while still considering all feasible combinations of HLCs for a given operator:
1) Generate a single LLC for each of the k most promising combinations of HLCs for the operator
2) k is the number of LLCs presented to the user


Problems with Naive Approach

Problem 2: Traversing the provenance graph is expensive: O(n^2), where n is the size of the operator tree.

Solution: remember the mapping from each high-level change back to the affected output tuple.


Specific Classes of Low-Level Changes

1) Modify numerical join parameters
   E.g., "Modify the max character distance of the Follows() predicate in the join operator of rule R4 from 60 to 20"

2) Remove dictionary entries
   E.g., "Modify the Dictionary operator of rule R2 by removing entry 'anna' from first_names.dict"

3) Add a filtering dictionary
   E.g., "Add predicate Not(ContainsDict('street_suffix.dict', RightContextTok(match, 1))) to the Dictionary operator of rule R3"

4) Add a filtering view (applies to an entire view)
   E.g., "Subtract from the result of rule R4 the PersonPhoneAll spans that are strictly contained within another PersonPhoneAll span"


LLC Generation: Removing Dictionary Entries

Dictionary entry in 'FirstNameDict' | Output of Dictionary('FirstNameDict') | Final output of FirstName extractor
'anna'  | Anna, Anna         | Anna XYZ, Anna ABC
'james' | James, James, James | James X, James Y, James Anderson

Generated LLCs (remove from dictionary FirstNameDict the following entries):
1. 'anna'
2. 'anna', 'james'
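The two generated LLCs above can be reproduced by ranking entries on net benefit and emitting one LLC per top-k prefix. A rough sketch, where the per-entry effects and the wrong/correct labels are illustrative assumptions:

```python
def dictionary_llcs(entry_effects, labels, k=2):
    """Rank dictionary entries by (wrong results removed - correct results
    removed), then generate one LLC for each of the top-k prefixes."""
    def benefit(entry):
        removed = entry_effects[entry]
        return (sum(1 for t in removed if labels[t] == "wrong")
                - sum(1 for t in removed if labels[t] == "correct"))
    ranked = sorted(entry_effects, key=benefit, reverse=True)
    # LLC i removes the i best entries together.
    return [ranked[:i] for i in range(1, k + 1)]
```

With 'anna' removing only wrong results and 'james' removing one correct result (James Anderson), this yields the two LLCs shown on the slide.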


Experiments

- Rule refinement approach implemented in the SystemT information extraction system
- Uses SystemT's AQL rule language

Goals:
- Quality evaluation of generated refinements
- Performance evaluation

Setup: Ubuntu Linux 9.10, 2.26 GHz Intel Xeon CPU, 8 GB RAM; 10-fold cross-validation.


Extraction Tasks and Rule Sets

Person task
- 14 complex rules for identifying person names
  E.g., "CapitalizedWord followed by FirstName",
        "LastName followed by Comma followed by CapitalizedWord"
- Rules for identifying other named entities (e.g., Organization, EmailAddress, Address)
  These can be used for filtering to enable refinement, e.g., "Morgan Stanley", "Georgia"

PersonPhone task
- 11 complex rules for identifying phone numbers
- High-quality Person extractor
- One rule to identify PersonPhone candidates: "Person followed by Phone within 0 to 60 characters"


Evaluation Datasets

             Training Set        Test Set
Dataset      #docs   #labels     #docs   #labels
ACE          273     5201        69      1220
CoNLL        946     6560        216     1842
Enron        434     4500        218     1969
EnronPP      322     157         161     46

ACE: collection of newswire reports, broadcast news and conversations with Person labeled data from the ACE05 dataset.

CoNLL: collection of news articles with Person labeled data from the CoNLL 2003 Shared Task.

Enron, EnronPP: collections of emails from the Enron corpus annotated with Person and PersonPhone labels, respectively.


Quality Evaluation


Quality Evaluation

- F1-measure improves by 6% to 26% within a few iterations
- Recall remains stable
- F1-measure and precision reach a plateau after the first few high-ranked refinements
- Some low-level changes are not implemented yet


Quality Evaluation: Comparison with Experts

- Two experts
- Enron dataset, Person task
- Time limit: one hour


Performance Evaluation

- Expert: one hour in total, 3 to 15 minutes per refinement
- System refinement time: 2 minutes


Conclusion & Future Work

- A database provenance technique for refining information extraction rules

Future work:
- Extensions: other types of LLCs, e.g., regular expression changes
- Addressing false negatives


Thank You