OPTIMIZING INFORMATION EXTRACTION PROGRAMS OVER EVOLVING TEXT
by
Fei Chen
A dissertation submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy
(Computer Science)
at the
UNIVERSITY OF WISCONSIN–MADISON
2010
© Copyright by Fei Chen 2010
All Rights Reserved
To my mother and father
ACKNOWLEDGMENTS
I owe my deepest gratitude to my two advisors, AnHai Doan and Raghu Ramakrishnan. I am
extremely lucky to work with both of them, and am very thankful that they shared their
knowledge, passion and vision in databases with me. I especially thank Raghu for providing me with
the funding for the first four years, which allowed me to focus on my research. I am also deeply
grateful to Raghu for introducing me to AnHai and the DBLife project that inspired this
dissertation. It is with AnHai that I learned so much about how to become a good researcher. He taught me
numerous lessons about writing academic papers and presenting research ideas. His intellectual
acuity always challenged me to think deeper and harder, and that always brought out my best ideas.
Without his encouragement and constant guidance this dissertation would not have been possible.
I am also greatly indebted to Jun Yang for working with AnHai and me on the Cyclex and
Delex projects and providing insightful feedback. I would also like to thank Luis Gravano for his
valuable comments on the Delex project. Special thanks go to Jeffrey F. Naughton, C. David Page,
and Jignesh M. Patel for serving on my Ph.D. committee.
This research also benefited tremendously from many graduate students and postdoctoral researchers. I
would like to thank Byron Gao for our discussions and his help on the Delex project. I owe
many thanks to several students on the DBLife project team: Xiaoyong Chai, Ting Chen, Pedro
DeRose, Chaitanya Gokhale, Warren Shen, and Ba-Quy Vuong. Thank you for your feedback
and support. I also thank fellow students Akanksha Baid, Spyridon Blanas, Bee-Chung Chen,
Lei Chen, Eric Chu, Yeye He, Allison Holloway, Willis Lang, SangKyun Lee, Junghee Lim, Eric
Paulson, Christine Reilly, Chong Sun, Khai Tran and Chen Zeng for their friendship and support.
Last but not least, I thank my parents for their unconditional support and love all these years,
and for their encouragement to pursue my interests. It is to them that I dedicate this dissertation.
TABLE OF CONTENTS
Page

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

   1.1 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   1.2 IE over Evolving Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
   1.3 Limitations of the Current Solutions . . . . . . . . . . . . . . . . . . . . . . 3
   1.4 Overview of Our Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
       1.4.1 Recycling for Single-IE-Blackbox Programs . . . . . . . . . . . . . . . . . 4
       1.4.2 Recycling for Complex IE Programs . . . . . . . . . . . . . . . . . . . . . 6
       1.4.3 Recycling for CRF-Based IE Programs . . . . . . . . . . . . . . . . . . . . 7
   1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
   1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Recycling for Single-IE-Blackbox Programs . . . . . . . . . . . . . . . . . . . . . . 11

   2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
   2.2 The Cyclex Solution Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 14
   2.3 The Page Matchers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
       2.3.1 Suffix Tree Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
       2.3.2 ST: The Suffix-Tree Matcher . . . . . . . . . . . . . . . . . . . . . . . . 18
   2.4 The Reuser + Extraction Module . . . . . . . . . . . . . . . . . . . . . . . . . 23
   2.5 The Cost-Based Matcher Selector . . . . . . . . . . . . . . . . . . . . . . . . . 28
   2.6 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
   2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3 Recycling for Complex IE Programs . . . . . . . . . . . . . . . . . . . . . . . . . . 35

   3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
   3.2 Capturing IE Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
   3.3 Reusing Captured IE Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
       3.3.1 Scope of Mention Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . 42
       3.3.2 Overall Processing Algorithm . . . . . . . . . . . . . . . . . . . . . . . 42
       3.3.3 IE Unit Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
       3.3.4 Identifying Reuse With Matchers . . . . . . . . . . . . . . . . . . . . . . 46
   3.4 Selecting a Good IE Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
       3.4.1 Space of Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
       3.4.2 Searching for Good Plans . . . . . . . . . . . . . . . . . . . . . . . . . 49
       3.4.3 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
   3.5 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
   3.6 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
   3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4 Recycling for CRF-Based IE Programs . . . . . . . . . . . . . . . . . . . . . . . . . 64

   4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
       4.1.1 Conditional Random Fields for Information Extraction . . . . . . . . . . . 65
       4.1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
       4.1.3 Challenges and Solution Outlines . . . . . . . . . . . . . . . . . . . . . 69
   4.2 Modeling CRFs for Reusing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
   4.3 Capturing CRF IE Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
   4.4 Reusing Captured Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
   4.5 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
   4.6 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
   4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

   6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
   6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

APPENDICES

   Appendix A: xlog Programs for Delex Experiments . . . . . . . . . . . . . . . . . . 99
LIST OF FIGURES
Figure Page
1.1 Two pages of the same URL, retrieved at different times . . . . . . . . . . . . . . . . 4
2.1 The Cyclex architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 An example of inserting a suffix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 An example of prefix links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Data flow of Cyclex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Data sets and extractors for our experiments . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Runtime of Cyclex versus the three algorithms that use different page matchers . . . . 31
2.7 Runtime decomposition of different plans . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 Accuracy of cost models as a function of (a) number of snapshots k, (b) sample size |S|, (c) α, (d) β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 Ratio of runtimes as a function of α and β . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 (a) A multi-blackbox IE program P in xlog, and (b) an execution plan for P . . . . . . 36
3.2 (a) An execution tree T , and (b) IE units of T . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Movement of data between disk and memory during the execution of IE unit U on page p1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 An illustration of executing an IE unit. . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 IE chains and sharing the work of matching across them. . . . . . . . . . . . . . . . . 48
3.6 Cost model parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.7 Data sets and IE programs for our experiments . . . . . . . . . . . . . . . . . . . . . 56
3.8 The execution plan used in our experiments for the “award” IE task. . . . . . . . . . . 57
3.9 Runtime of No-reuse, Shortcut, Cyclex, and Delex. . . . . . . . . . . . . . . . . . 58
3.10 Runtime decomposition of No-reuse, Shortcut, Cyclex and Delex. . . . . . . . . . . 59
3.11 Performance of the optimizer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.12 Sensitivity analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.13 Runtime comparison wrt number of mentions. . . . . . . . . . . . . . . . . . . . . . . 61
3.14 Runtime comparison on a learning based IE program. . . . . . . . . . . . . . . . . . . 63
4.1 (a) An example of using CRFs to extract persons and locations, and (b) an example of the Viterbi algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 An example of a path matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 An illustration of right contexts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 An illustration of left contexts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 An illustration of capturing CRF contexts from a path matrix. . . . . . . . . . . . . . 76
4.6 Data sets for our experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 Stanford NER in xlog. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 Runtime of No-reuse, Cyclex, Delex, and CRFlex. . . . . . . . . . . . . . . . . . . 83
4.9 Runtime decomposition of No-reuse, Cyclex, Delex, and CRFlex. . . . . . . . . . . 84
A.1 xlog Programs for 6 IE tasks in Figure 3.7.(b). IE blackboxes are in bold. . . . . . . . 105
A.2 The xlog program of “actor”. IE blackboxes are in bold. . . . . . . . . . . . . . . . . 105
ABSTRACT
OPTIMIZING INFORMATION EXTRACTION PROGRAMS OVER EVOLVING TEXT
Fei Chen
Under the supervision of Associate Professor AnHai Doan and Dr. Raghu Ramakrishnan
At the University of Wisconsin-Madison
Information extraction (IE) is the problem of extracting structured data from unstructured text.
Examples of structured data are entities such as organizations and relationships such as “company
X is acquired by company Y.” Examples of unstructured text are emails, Web pages, and blogs.
Most current IE approaches have considered only static text corpora, over which we typically
have to apply IE only once. Many real-world text corpora however are evolving, in that documents
can be added, deleted and modified. An example of evolving text is Wikipedia. Therefore, to
keep extracted information up to date, we often must apply IE repeatedly, to consecutive corpus
snapshots. How can we efficiently execute such repeated IE?
In this dissertation I describe solutions that efficiently execute such repeated IE by recycling
previous IE efforts. Specifically, given a current corpus snapshot U, these solutions first identify
text portions of U that also appear in the previous corpus snapshot V. Since these solutions have
already executed the IE program over V, they can now recycle the IE results of these parts, by
combining these results with the results of executing IE over the remaining parts of U, to produce
the complete IE results for U. I describe three systems that deal with successively more complex
IE programs. The first system, Cyclex, recycles for IE programs that contain a single IE blackbox.
The second system, Delex, recycles for IE programs that consist of multiple IE blackboxes. The
third system, CRFlex, also considers multi-blackbox IE programs, but some of these blackboxes
are based on a leading statistical learning model: Conditional Random Fields. I present
experiments on real-world data that validate the proposed solutions.
Chapter 1
Introduction
Information extraction (IE) is the problem of extracting structured data from unstructured text.
Examples of structured data are entities such as persons, locations, organizations, and relationships
such as “company X is acquired by company Y.” Examples of unstructured text are emails, Web
pages, and blogs.
This dissertation studies optimizing IE over evolving text: the problem of how to execute IE
programs efficiently over text corpora that are evolving, in that documents can be added, deleted,
and modified. An example of evolving text is Wikipedia.
We begin in this chapter by reviewing state-of-the-art IE solutions, and showing that these IE
solutions only consider static text. Then we show that evolving text is pervasive and that many IE
applications consider IE over evolving text (Section 1.2). Next, we show that the current solution for
IE over evolving text is unsatisfactory (Section 1.3). We then outline our solutions (Section 1.4).
Finally, we list our contributions (Section 1.5) and outline the rest of this dissertation (Section 1.6).
1.1 State of the Art
Information extraction has received much attention in the database, AI, Web, and KDD
communities (see [22, 3, 33, 18] for recent tutorials). The vast majority of works consider how to
improve extraction accuracy (e.g., with novel techniques such as CRFs [22]). But recent works
also consider how to improve extraction time. They fall roughly into three groups:
• The first group (e.g., [4, 50]) efficiently selects a subset of documents that are likely to contain
the structured data of interest. Then it only applies IE programs to the selected subset of
documents, instead of to the entire text corpus.
• The second group (e.g., [20, 15, 67]) considers the problem of efficiently matching patterns
against documents, which is a common problem in IE tasks. It builds an inverted index over
the documents to reduce the number of documents considered for each pattern. Alternatively,
when there are many patterns to be matched, it builds an index over the patterns to reduce
the number of patterns considered for each document.
• The third group (e.g., [62, 67]) considers IE programs as workflows that consist of multiple
operators. Then it exploits relational style optimization to change the order of evaluating
these operators to reduce extraction time.
These proposed solutions have made significant progress in deploying IE programs efficiently
over large text corpora. However, these solutions have considered only static text corpora, over
which we typically have to apply IE only once. In practice, text corpora are often evolving.
Therefore, to keep extracted information up to date, we often must apply IE repeatedly to consecutive
corpus snapshots. We now list a few examples of IE applications over evolving text.
1.2 IE over Evolving Text
Community Information Management (CIM): CIM systems [32] extract, manage, and keep
track of structured information related to a community on the Web. For example, DBLife [31] is a
structured portal for the database community that we have been developing. It extracts and tracks
information about researchers, organizations, papers, conferences, and talks. To this end, DBLife
operates over a text corpus of 10,000+ URLs. Each day it re-crawls these URLs to generate a 120+
MB corpus snapshot, and then applies IE to this snapshot to extract the aforementioned structured
data. In order to monitor the latest community information (e.g., which database researchers have
been mentioned where in the past 24 hours), it must re-crawl all the URLs to generate a new corpus
snapshot and then re-apply IE.
Enterprise Information Management: As another example, Impliance is a system built at IBM
Almaden that aims to manage all information within an enterprise [7]. It crawls the enterprise
intranet, and applies IE programs to each document obtained to extract information such as, “who
is mentioned in this document.” In order to infer the latest information over the intranet, Impliance
must regularly re-crawl the intranet and then re-apply IE.
Social Media Monitoring: Recently, there has been growing interest in monitoring social media, such
as blogs, Wikipedia, and Twitter. For example, YAGO [70] is a system that extracts structures from
Wikipedia and stores these extracted structures into a database. In order to keep the database up
to date as Wikipedia evolves, it must regularly re-crawl Wikipedia and re-extract structures. See
[6, 19, 34, 13, 51] for other examples of evolving text corpora.
1.3 Limitations of the Current Solutions
Despite the pervasiveness of evolving text corpora, no satisfactory solution has yet been proposed
for IE over them. Given such a corpus, the common solution is to apply IE to each corpus snapshot
in isolation, from scratch. This solution is simple, but highly inefficient, with limited applicability.
For example, in DBLife reapplying IE from scratch takes 8+ hours each day, leaving little time
for higher-level data analysis. As another example, time-sensitive applications (e.g., stock, auction,
intelligence analysis) often want to refresh information quickly, by re-crawling and re-extracting,
say, every 30 minutes. In such cases, applying IE from scratch is infeasible if it already takes
more than 30 minutes. Finally, this solution is ill-suited for interactive debugging of IE applications
over evolving corpora, because such debugging often requires applying IE repeatedly to multiple
corpus snapshots. Thus, given the growing need for IE over evolving text corpora, it has now
become crucial to develop efficient IE solutions for these settings.
1.4 Overview of Our Solutions
The key idea behind our solutions is to exploit IE efforts spent on previous corpus snapshots to
reduce the extraction time on the current corpus snapshot. We now outline our solutions.
[Figure 1.1: two pages of the same URL, both titled “Cimple Project Meetings.” Page p contains “Will meet in CS 105 at 2pm this Thursday” (regions u1 and u2). Page q contains “CS 310 at 4pm on Jun 20, to discuss CIM and IR” followed by “Will meet in CS 105 at 2pm this Thursday” (regions v1, v2, and v3).]

Figure 1.1 Two pages of the same URL, retrieved at different times
1.4.1 Recycling for Single-IE-Blackbox Programs
We start by considering IE programs that have a single blackbox or an extractor. We consider
how to execute extractors over evolving text efficiently. We have developed Cyclex as a solution
to this problem. The key idea underlying Cyclex is to recycle previous IE results, given that
consecutive snapshots of a text corpus often contain much overlapping data. The following example
illustrates this idea:
Example 1.1. Consider a tiny corpus of a single URL that lists project meetings. Figure 1.1 shows a
snapshot of this corpus, which is just a single data page p (of the above URL), crawled today. Suppose that
we have applied an extractor E to this snapshot, to extract the tuple (CS 105,2pm) which is a mention of a
meeting. Suppose tomorrow we crawl the above URL to obtain another corpus snapshot, which is the page
q shown in Figure 1.1. Then to extract meetings from q, current solutions would apply extractor E to q from
scratch, and produce tuples (CS 105,2pm) and (CS 310,4pm).
In contrast, Cyclex tries to recycle the IE results of p. Specifically, it starts by “matching” q with p, to
find text regions of q that also appear in p. Suppose it finds two regions v1 and v2 of q that also appear as
u1 and u2 of p, respectively (see Figure 1.1). Cyclex then does not apply E to v1 and v2, but instead copies
over the mentions of u1 and u2. Cyclex then applies E only to v3, the sole region of q that does not appear
in p. The savings come from not having to apply E to the entire page q.
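The recycling step in this example can be sketched as follows. This is a minimal illustration only, not the actual Cyclex implementation: the matcher uses Python's difflib as a stand-in, the mention format is made up for the sketch, and it ignores the extractor properties (discussed next) that Cyclex needs before copying is guaranteed safe.

```python
from difflib import SequenceMatcher

def match_regions(old_page, new_page, min_len=20):
    """Return (old_start, new_start, length) triples of shared text."""
    sm = SequenceMatcher(None, old_page, new_page, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in sm.get_matching_blocks() if m.size >= min_len]

def recycle(old_page, old_mentions, new_page, extract):
    """Copy mentions that lie inside shared regions; re-extract the rest.
    Mentions are (start, end, text) spans over a page; extract(page, lo, hi)
    runs the extractor E over page[lo:hi] and returns absolute spans."""
    mentions, covered = [], []
    for a, b, size in match_regions(old_page, new_page):
        covered.append((b, b + size))
        for start, end, text in old_mentions:
            if a <= start and end <= a + size:   # mention fully inside region
                mentions.append((start + b - a, end + b - a, text))
    pos = 0                                      # re-extract uncovered text
    for lo, hi in sorted(covered):
        if pos < lo:
            mentions.extend(extract(new_page, pos, lo))
        pos = max(pos, hi)
    if pos < len(new_page):
        mentions.extend(extract(new_page, pos, len(new_page)))
    return mentions
```

On the two pages of Example 1.1, only the new region (v3) would reach the extractor; the (CS 105, 2pm) mention is copied over with its offsets shifted.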
While promising, realizing the above idea raises difficult challenges. The first challenge is that
we cannot simply copy mentions over, e.g., from regions u1 and u2 of page p to v1 and v2 of
page q, as discussed in Example 1.1. To see why, suppose a particular extractor E is such that
it only extracts meetings if a page has fewer than five lines (otherwise it produces no meetings).
Then none of the mentions of page p can be copied over to page q, which has more than five lines.
In general, which mentions can be copied “safely” depends on certain properties of extractor E.
Thus, we must model certain properties of extractor E, so that we can (a) exploit these properties
to reuse certain mentions, and (b) prove that reusing will produce the same set of mentions as
applying IE from scratch. In Cyclex, we define a small set of such properties, show that many
practical extractors exhibit these properties (see Section 2.1), and develop incremental re-extraction
techniques by exploiting these properties.
Our second challenge is how to “match” two pages, e.g., p and q in Example 1.1, to find
overlapping text regions. We first develop ST, a powerful suffix-tree based matcher, and prove
that this matcher achieves the most complete result, i.e., finds all largest possible overlapping
regions. We then show that an entire spectrum of matchers exists, with matchers trading off the
completeness of the result for runtime efficiency (see Section 2.3). Since no matcher is always
optimal, we provide Cyclex with a set of alternative matchers (more can be added easily), and a
way to select a good one, as discussed below.
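To make the tradeoff concrete, here is a sketch of one point on that spectrum: a hash-based w-gram matcher (a hypothetical illustration, not one of the actual Cyclex matchers). It runs in roughly linear time but, unlike ST, misses overlaps shorter than w and does not guarantee finding all maximal overlapping regions.

```python
def window_matcher(old, new, w=8):
    """Index every w-gram of `old`; scan `new` left to right; on a hit,
    greedily extend the longest match and skip past it.  Fast, but may
    miss overlaps shorter than w and is not guaranteed to be complete."""
    index = {}
    for i in range(len(old) - w + 1):
        index.setdefault(old[i:i + w], []).append(i)
    regions, j = [], 0
    while j <= len(new) - w:
        hits = index.get(new[j:j + w])
        if not hits:
            j += 1
            continue
        best_len, best_i = 0, hits[0]
        for i in hits:                    # extend each candidate match
            k = w
            while i + k < len(old) and j + k < len(new) and old[i + k] == new[j + k]:
                k += 1
            if k > best_len:
                best_len, best_i = k, i
        regions.append((best_i, j, best_len))   # (old_start, new_start, length)
        j += best_len
    return regions
```

Any overlap of length at least w is found, but at a cost in completeness: a 5-character overlap is invisible to this matcher at w=8, while ST would report it.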
Since dynamic text corpora can easily contain tens of thousands or millions of data pages,
we must also develop efficient solutions for reusing mentions and applying extractor E to non-
overlapping text, in the presence of a large amount of disk-resident data. We must also consider
how to efficiently interleave these steps with the step of matching data pages (see Section 2.4).
Finally, addressing the above challenges results in a space of execution plans, where the plans
differ mainly on the page matcher employed. Thus, in the final challenge we must develop a cost
model and use it to select the optimal plan. Unlike RDBMS settings, our cost model is extraction-
specific. In particular, it tries to model the rate of change of the text corpus, and the run time and
result size of extractors and matchers, among others (see Section 2.5).
We conduct extensive experiments over two real-world data sets that demonstrate that Cyclex
can dramatically cut the runtime of re-applying IE from scratch by 50-90%. This suggests that
recycling past IE efforts can be highly beneficial.
1.4.2 Recycling for Complex IE Programs
The Cyclex work clearly established that recycling IE results for evolving text corpora is highly
promising. The work itself however suffers from a major limitation: it considers only IE programs
that contain a single IE blackbox. Real-world IE programs, in contrast, often contain multiple IE
blackboxes connected in a compositional “workflow.” As a simple example, a program to extract
meetings may employ an IE blackbox to extract locations (e.g., “CS 105”), another IE blackbox to
extract times (e.g., “3 pm”), then pairs locations and times and keeps only those that are within 20
tokens of each other (thus producing (“CS 105”, “3 pm”) as a meeting instance in this case).
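Such a compositional workflow can be sketched as follows, with toy blackboxes standing in for the real extractors (the actual programs stitch together far more complex, often learning-based, blackboxes):

```python
import re

def extract_locations(tokens):
    """Toy location blackbox: 'CS' followed by a room number."""
    return [(i, i + 2) for i in range(len(tokens) - 1)
            if tokens[i] == "CS" and tokens[i + 1].isdigit()]

def extract_times(tokens):
    """Toy time blackbox: tokens such as '2pm' or '10am'."""
    return [(i, i + 1) for i, t in enumerate(tokens)
            if re.fullmatch(r"\d{1,2}(am|pm)", t)]

def pair_meetings(tokens, window=20):
    """Pair each location with each time occurring within `window` tokens."""
    return [(tokens[l[0]:l[1]], tokens[t[0]:t[1]])
            for l in extract_locations(tokens)
            for t in extract_times(tokens)
            if abs(t[0] - l[1]) <= window]

tokens = "Will meet in CS 105 at 2pm this Thursday".split()
meetings = pair_meetings(tokens)   # [(['CS', '105'], ['2pm'])]
```

Even this two-blackbox program shows why treating the whole workflow as one blackbox (as Cyclex does) forfeits reuse opportunities: the location and time extractors could each recycle their own results independently.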
The IE blackboxes are either off-the-shelf (e.g., downloaded from public domains or purchased
commercially) or hand-coded (e.g., in Perl or Java), and they are typically “stitched together” using
a procedural (e.g., Perl) or declarative language (e.g., UIMA, Gate, xlog [37, 28, 67]). Such multi-
blackbox IE programs could be quite complex, for example, 45+ blackboxes stacked in five levels
in DBLife, and 25+ blackboxes stacked in seven levels in Avatar [33]. Since Cyclex is not aware
of the compositional nature of such IE programs (effectively treating the whole program as a large
blackbox), its utility is severely limited in such settings.
To remove this limitation, we develop Delex, a solution for effectively executing multi-blackbox
IE programs over evolving text data. Like Cyclex, Delex aims at recycling IE results. However,
compared with Cyclex, developing Delex is fundamentally much harder, for three reasons.
First, since the target IE programs for Delex are multi-blackbox and compositional, we face
many new and difficult problems. For example, how should we represent multi-blackbox IE
programs, e.g., how to stitch together IE blackboxes? How to translate such programs into execution
plans? At which level should we reuse such plans? We show for instance that reusing at the level
of each IE blackbox (i.e., storing its input/output for subsequent reuse), like Cyclex does, is sub-
optimal in the compositional setting. Once we have decided on the level of reuse, what kind of
data should we capture and store for subsequent reuse? Can we reuse across IE blackboxes? These
are examples of problems that Cyclex did not face.
Second, since a target IE program now consists of many blackboxes, all attempting reuse at the
same time, Delex faces a far harder challenge of coordinating their execution and reuse to ensure
efficient movement of large quantities of data between disk and memory. In contrast, Cyclex only
had to worry about the efficient execution of a single IE blackbox.
Finally, the main optimization challenge in Cyclex is to decide which matcher to assign to the
sole IE blackbox. A matcher encodes a way to find overlapping text regions between the current
corpus snapshot and the past ones, for the purpose of recycling IE results. Thus, the Cyclex plan
space is bounded by the (relatively small) number of matchers. In contrast, Delex can assign to
each IE blackbox in the program a different matcher. Hence, it must search a blown-up plan space
(exponential in the number of blackboxes). To exacerbate the search problem, optimization in
this case is “non-decomposable;” i.e., we cannot just optimize parts of a plan, then glue the parts
together to obtain an optimized whole.
We conduct extensive experiments with both rule-based and learning-based IE programs over
two real-world data sets to demonstrate the utility of our approach. We show in particular that
Delex can cut the runtime of Cyclex by as much as 71%.
1.4.3 Recycling for CRF-Based IE Programs
So far, we have developed the efficient recycling algorithm Delex for IE programs that consist
of multiple IE blackboxes. If we can open up some of these blackboxes and understand more
about them, can we develop a more efficient recycling algorithm? We study this problem in this
section. In particular, we focus on IE programs that contain IE blackboxes based on a statistical
learning model: Conditional Random Fields (CRFs). We open up these CRF-based IE blackboxes
and explore whether we can develop a more efficient recycling algorithm. CRF-based IE is a
state-of-the-art IE solution that has been successfully applied to many IE tasks, including named
entity extraction [38, 54], table extraction [61], and citation extraction [60]. Therefore, a recycling
solution for CRF-based IE is a practical extension of Delex.
CRF-based IE reduces information extraction to a sequence labeling problem. Given a
document d, an IE program P that contains a CRF-based IE blackbox F first converts d into a sequence
of tokens x1...xT . Then F takes x1...xT as input and outputs a label from a set Y of labels for each
token. Y consists of the set of entity types to be extracted and a special label “other” for tokens
that do not belong to any of the entity types. The output of F is a label sequence y1...yT , where yi
is the label of xi.
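The label sequence y1...yT is typically computed with the Viterbi algorithm. A minimal sketch (the `score` function is a toy stand-in for the CRF's learned transition and emission feature weights, not a trained model):

```python
def viterbi(tokens, labels, score):
    """Return the highest-scoring label sequence y1...yT, where
    score(prev_label, label, token) stands in for the CRF's summed
    transition + emission feature weights (prev_label is None at
    the first position)."""
    # best[y] = (score of best sequence ending in label y, that sequence)
    best = {y: (score(None, y, tokens[0]), [y]) for y in labels}
    for tok in tokens[1:]:
        best = {y: max((s + score(yp, y, tok), path + [y])
                       for yp, (s, path) in best.items())
                for y in labels}
    return max(best.values())[1]
```

It is exactly this dependence of each label on its predecessor that makes naive recycling of CRF results unsafe, as discussed next.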
We consider how to execute P efficiently over evolving text. To address this problem, a simple
solution is to treat all CRF-based IE blackboxes as general IE blackboxes and then apply Delex
to P. However, we found that this solution does not work well when the text corpus changes
frequently. The main reason is that the properties we use to guarantee safe reuse for general IE
blackboxes are not very effective for CRF-based IE blackboxes. Although they guarantee
correctness, they are so strict that they leave very limited reuse opportunities.
This suggests that we should exploit properties that are specific to CRFs. We develop the
solution CRFlex, which captures this intuition. We now discuss the challenges in designing this
solution.
The first challenge is what properties of CRF models we can exploit for reuse. Compared to
the properties of general IE blackboxes, identifying properties of CRFs that could be exploited for
reuse is fundamentally much harder, mainly because CRFs exploit dependencies between the
labels of adjacent tokens. Nevertheless, we show that under certain conditions, a token’s
label does not depend on the labels of its adjacent tokens. This allows us to break a token sequence
into several independent pieces and recycle the IE results of each piece independently.
The second challenge is what results to capture for each CRF-based IE blackbox and how to
capture these results while executing P . The CRF properties we identify define small windows
surrounding each token such that the label of the token output by the CRF-based IE blackbox only
depends on tokens in those windows. However, the length of these windows may vary from one
token to another. Therefore, we must identify and capture these windows so that we can exploit
them for safe reuse in the subsequent snapshots. We show how to exploit the intermediate results
of CRF-based blackboxes to identify these windows efficiently. Our theoretical and experimental
results both show that the overhead of capturing is insignificant.
Finally, how can we efficiently reuse the captured results? Similar to Cyclex and Delex,
CRFlex first finds overlapping regions and then exploits the CRF properties to identify copy regions,
which are overlapping regions where we can safely copy results. As we will show later
(Section 4.2), in order to properly exploit the CRF properties, CRFlex must interleave re-applying the
CRF-based IE blackbox with exploiting the CRF properties to identify the copy regions. The chal-
lenge is that these two steps are dependent upon each other. Without re-applying the CRF-based IE
blackbox, we cannot exploit the CRF properties, and thus cannot identify the copy regions. At the
same time, without identifying the copy regions, we also do not know which regions are non-copy
regions to which we should re-apply the CRF-based IE blackbox. We develop an approach that
exploits this dependency constraint to carefully interleave the two steps.
Our experiments over real-world datasets and a CRF-based IE program show that CRFlex cuts
the runtime of Delex by as much as 52%.
1.5 Contributions
In summary, we have made the following contributions:
• The most important contribution of this dissertation is a framework that provides efficient
solutions for IE over evolving text. In particular, the framework advocates the idea of
recycling the IE results over previous corpus snapshots. As far as we know, this dissertation is
the first in-depth solution to the problem of IE over evolving text.
• We show how to model common properties of general IE blackboxes and CRF-based IE
blackboxes, and how to exploit these properties for safely reusing previous IE results.
• We show that a natural tradeoff exists in finding overlapping text regions from which we can
recycle past IE results. An approach to finding overlapping regions is called a matcher. We
show that an entire spectrum of matchers exists, with matchers trading off the completeness
of the results for runtime efficiency. Since no matcher is always optimal, our solutions
provide a set of alternative matchers (more can be added easily), and employ a cost model to
make an informed decision in selecting a good matcher.
• Our approaches can deal with large text corpora by exploiting many database techniques,
such as cost-based optimization and hash joins.
• Our approaches can deal with complex IE programs that consist of multiple IE blackboxes
by exploiting the compositional nature of these IE programs. We show how to model these
complex IE programs for recycling, how to implement the recycling process efficiently, and
how to find a good execution plan in a vast plan space with different recycling alternatives.
• We have developed a powerful suffix-tree-based matcher that detects all overlapping regions
between two documents. This matcher can be exploited by many other applications that need
to compare two documents.
1.6 Outline
Chapters 2-4 describe Cyclex, Delex, and CRFlex, respectively. They elaborate on the ideas
outlined in Section 1.4. Chapter 5 reviews existing solutions and discusses how they relate to ours.
Finally, Chapter 6 concludes the dissertation and discusses directions for future research.
Parts of this dissertation have been published in conferences. In particular, Cyclex is described
in an ICDE-08 paper [16], and Delex is described in a SIGMOD-09 paper [17].
Chapter 2
Recycling for Single-IE-Blackbox Programs
We begin our study by developing an efficient recycling solution for single-IE-blackbox pro-
grams. IE blackboxes are fundamental building blocks of IE programs. We will consider how to
recycle for complex IE programs that consist of multiple IE blackboxes in Chapter 3, and how to
recycle for IE blackboxes that are based on specific statistical learning models in Chapter 4.
This chapter is organized as follows. We first formally define our problem in Section 2.1. Then
we provide an overview of our solution, Cyclex, in Section 2.2. Sections 2.3–2.5 describe our
solution. Section 2.6 presents an empirical evaluation. Finally, Section 2.7 concludes this chapter.
2.1 Problem Definition
Data Sources, Pages, & Corpus Snapshots: Let S = {S1, . . . , Sn} be a set of data sources
considered by an application A. We assume that A crawls these sources at regular intervals to
retrieve sets of data pages. For example, DBLife considers 10,000+ data sources, each specified
with a URL, and crawls these URLs (each to a pre-specified depth) each day to retrieve a set of
14,000+ Web pages. We will refer to Pi — the set of data pages retrieved at time i — as the i-th
snapshot of the evolving text corpus S.
Entities, Attributes, & Mentions: Data pages often mention entities, which are real-world con-
cepts, such as person, paper, and meeting. We represent each entity type e with a set of attributes
a1, . . . , ak, which can be atomic (e.g., meeting room) or set-valued (e.g., topics).
Given a data page p, we refer to a consecutive sequence of characters in p as a string, or a
text fragment, or a region (we will use these notions interchangeably). We use p[i..j] to denote the
string s that starts with the i-th character and ends with the j-th character of p. In this case, we
will also say s.start = i and s.end = j.
A mention of an atomic (set-valued) attribute a is then a string in p (a set of strings in p) that
refers to a. We can now define an entity mention as follows:
Definition 2.1 (Entity mention). Let p be a data page, and a1, . . . , ak be the attributes of an entity
type e. Then a mention of an instance of entity type e is a tuple m = (m1, . . . , mk), where
each mi, i ∈ [1, k], is either a mention of ai in page p, or the special value “nil,” indicating
that a mention of ai cannot be extracted from p. We also define m.start = min_{i=1..k} mi.start and
m.end = max_{i=1..k} mi.end.
Example 2.1. Suppose the entity type “meeting” has three attributes: room, time, and topics. Then
tuple (CS 310, 4pm, {CIM,IR}) is a mention of “meeting” in page q of Figure 1.1. String s = “CS
310” (where s.start = 25 and s.end = 30) is a mention of attribute “room.” “4pm” is a mention
of “time,” and the set of strings {“CIM,” “IR”} is a mention of “topics.”
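As an illustrative sketch (not part of the dissertation's implementation), Definition 2.1 can be mirrored in Python; the position of the "4pm" mention below is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AttrMention:
    """A mention of an atomic attribute: its text and character span in page p."""
    text: str
    start: int
    end: int

def mention_span(attrs: Tuple[Optional[AttrMention], ...]) -> Tuple[int, int]:
    """m.start = min over non-nil mi of mi.start; m.end = max of mi.end (Definition 2.1)."""
    present = [a for a in attrs if a is not None]  # None plays the role of "nil"
    return (min(a.start for a in present), max(a.end for a in present))

# Example 2.1-style mention of "meeting": room at [25..30], time at a hypothetical [40..42].
room = AttrMention("CS 310", 25, 30)
time = AttrMention("4pm", 40, 42)
span = mention_span((room, time, None))  # the topics attribute is nil here
```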
Extractors: Real-world IE applications extract mentions of one or multiple entity types from data
pages. As a first step, in this chapter we consider extracting mentions of a single entity type e (e.g.,
meeting). To extract such mentions, current applications usually employ an extractor E, which is
typically a learning-based program, or a set of extraction rules encoded in, say, a Perl script [33].
We assume that E extracts mentions from each data page in isolation, e.g., extracting meetings as
in Figure 1.1. Such per-page extractors are pervasive (e.g., constituting 94% of extractors in the
current DBLife, see [33, 67] for many examples). Hence, we start with such extractors, leaving
more complex extractors (e.g., those that extract mentions that span multiple pages) for future
work. We can now define extractors considered in this chapter as follows:
Definition 2.2 (Extractors). Let a1, . . . , ak be the attributes of an entity type e. Then an extractor
E : p → M takes as input a data page p and produces as output a set M of mentions of e in page
p, where each mention is of the form (m1, . . . , mk) as described in Definition 2.1.
Modeling Properties of Extractors: Recall from the introduction that we must model certain
properties of extractors, so that we can reuse mentions and prove the correctness of our algorithm.
We now describe two such properties: scope and context. To motivate scope, we observe that
attribute mentions of an entity often appear in close proximity in text pages. Consequently, an
extractor often starts by extracting attribute mentions, then combines the mentions and prunes
those combinations that span more than a maximal length α.
Example 2.2. Suppose we apply E to page q in Figure 1.1 to extract (room,time). E may start
by extracting all room mentions: “CS 310,” “CS 105,” then all time mentions: “4pm,” “2pm.” E
then pairs room and time mentions, and prunes pairs that are not found within, say, a length of
100 characters. Thus, E returns only the pairs (CS 310,4pm) and (CS 105,2pm).
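This pair-and-prune behavior can be sketched as follows (mention positions beyond those given in Example 2.1 are hypothetical):

```python
def pair_and_prune(rooms, times, alpha):
    """Pair each room mention with each time mention, keeping only pairs whose
    combined span satisfies (end - start) < alpha, i.e., the scope of Definition 2.3.
    Mentions are (text, start, end) triples."""
    pairs = []
    for r in rooms:
        for t in times:
            start = min(r[1], t[1])
            end = max(r[2], t[2])
            if end - start < alpha:   # prune pairs spanning more than the scope
                pairs.append((r[0], t[0]))
    return pairs

# Hypothetical positions loosely following Example 2.2
rooms = [("CS 310", 25, 30), ("CS 105", 225, 230)]
times = [("4pm", 40, 42), ("2pm", 240, 242)]
pairs = pair_and_prune(rooms, times, alpha=100)
```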
Thus, we can formalize the notion of scope as follows:
Definition 2.3 (Extractor scope). An extractor E has scope α iff for any mention m produced by
E we have (m.end−m.start) < α.
To motivate context, we observe that when extracting mentions, many extractors examine only
small “context windows” to both sides of a mention, as the following example illustrates:
Example 2.3. Let E be an extractor for (room,time,topics). Suppose E produces string X as a
topic if (a) X matches a pre-defined word (e.g., “IR”), and (b) the word “discuss” or “topic”
occurs within a 30-character distance, either to the left or to the right of X . Then we say that the
context of topic mentions is 30 characters. That is, once E has extracted X as a topic, then no
matter how we perturb the text outside a 30-character window of X (on both sides), E would still
recognize X as a valid topic mention.
Let m be a mention produced by an extractor E in page p. Then we formalize the notion of
context as follows:
Definition 2.4 (β-context of mention & extractor context). The β-context of m (or context for
short when there is no ambiguity) is the string p[(m.start − β)..(m.end + β)], i.e., the string m
extended on both sides by β characters. Extractor E has context β iff for any m and any p′
obtained by perturbing the text of p outside the β-context of m, applying E to p′ still produces m
as a mention.
We assume that each extractor E comes with a scope α and a context β. These values can be
supplied by whoever implements E or knows how E works (e.g., the application builder, after
examining E’s description or code). As we show in the experiments, α and β do not have to be
“tight” in order for us to benefit from recycling IE results. However, the “tighter” (i.e., smaller)
these values are, the larger the benefits.
The Generality of Our IE Model: So far we have defined extractor scope and context at the
character level (see Definitions 2.3-2.4), and in this chapter, for ease of exposition, we will limit
our discussion to only the character level. However, Cyclex can be easily generalized to work with
scope/context at higher-granularity levels (e.g., word, sentence, paragraph), should that be more
appropriate for the target extractors.
Problem Definition: We can now describe our problem as follows. Let P1, . . . , Pn be consecutive
snapshots of a text corpus, E be an extractor with scope α and context β, and M1, . . . ,Mn be the
set of mentions extracted by E from P1, . . . , Pn, respectively. Let Pn+1 be the corpus snapshot
immediately following Pn. Then develop a solution to extract the set of mentions Mn+1 from Pn+1
in a minimal amount of time, by utilizing P1, . . . , Pn, α, β, and M1, . . . , Mn. In the rest of the
chapter we describe Cyclex, our solution to this problem.
2.2 The Cyclex Solution Approach
To describe Cyclex, we begin with two notions:
Definition 2.5 (Old region & maximally old region). A region r in a data page p of snapshot Pn+1
is an old region if it occurs in a page q of snapshot Pn. r is a maximally old region if it cannot be
extended on either side and still remains an old region.
To extract mentions from Pn+1, Cyclex then considers each page p in Pn+1 and “matches,” i.e.,
compares p with pages in Pn, to find old regions of p. Next, it uses the old regions to identify copy
regions and extraction regions of p (see Section 2.4). Cyclex then applies extractor E only to the
extraction regions, and copies over the mentions of the copy regions.
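The per-page control flow just described can be sketched as follows; the matcher, region computation, extractor, and reuser are passed in as functions, and all stubs below are hypothetical stand-ins rather than Cyclex's actual components:

```python
def cyclex_extract_page(p, q, matcher, find_regions, extractor, reuse):
    """One page of the Cyclex loop (illustrative sketch): match p against its
    same-URL predecessor q, split p into copy and extraction regions, apply the
    extractor only to extraction regions, and copy mentions for copy regions."""
    if q is None:                      # no same-URL page in the previous snapshot
        return extractor(p, [(0, len(p))])
    old_regions = matcher(p, q)
    copy_regions, extraction_regions = find_regions(p, old_regions)
    return reuse(copy_regions) + extractor(p, extraction_regions)

# Trivial stand-ins just to exercise the control flow:
mentions = cyclex_extract_page(
    "abcXYZ", "abcOLD",
    matcher=lambda p, q: [(0, 3)],                       # "abc" is old
    find_regions=lambda p, old: (old, [(3, len(p))]),    # copy "abc", extract the rest
    extractor=lambda p, regs: [p[s:e] for s, e in regs],
    reuse=lambda regs: ["abc-mention"],
)
```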
Figure 2.1 The Cyclex architecture (diagram omitted): the matcher selector, driven by a cost model over the past snapshots Pn−w, . . . , Pn and their mentions Mn−w, . . . , Mn, chooses a page matcher from a library of matchers; the selected matcher, the extraction module, and the reuser then process Pn and Pn+1 to produce Mn+1.
Since pages retrieved (in consecutive snapshots) from the same URL often share much over-
lapping data, to find old regions of p, Cyclex currently matches p only with q, the page in Pn that
shares the same URL with p. (If q does not exist, then Cyclex declares that p has no old regions.)
Section 2.6 shows that the choice of matching pages with the same URL already significantly
reduces IE time. Considering more complex choices (e.g., matching p with all pages in Pn) is
ongoing research.
We call algorithms that match p and q to find old regions in p page matchers. Section 2.3
shows that such matchers span an entire spectrum, trading off result completeness for runtime,
and that no matcher is always optimal. For example, the ST matcher described below returns all
maximally old regions, thus providing the most opportunities for recycling past IE results. But it
may also incur more runtime than matchers that return only some old regions. So, a priori we do
not know if it would be better than these other matchers.
The above result leads to the Cyclex architecture in Figure 2.1. Given snapshot Pn+1, the
matcher selector employs a cost model (that utilizes statistics computed over the past w snapshots)
to select a page matcher from a library of matchers. The page matcher then finds old regions of
pages in Pn+1. Next, the extraction module applies extractor E to extraction regions of pages in
Pn+1, and the reuser copies over mentions of the copy regions. Cyclex then combines the results
of both the extraction module and the reuser to produce the final IE result for Pn+1. The next three
sections describe the matchers (Section 2.3), the reuser and extraction module (Section 2.4), and
the matcher selector (Section 2.5) in detail.
2.3 The Page Matchers
Recall from Section 2.2 that a page matcher compares pages p and q to find old regions of p.
We have provided the current Cyclex with three page matchers: DN, UD, and ST (more matchers
can be easily plugged in as they become available). DN incurs zero runtime, as it immediately
declares that page p has no old region. Cyclex with DN thus is equivalent to applying IE from
scratch to Pn+1.
UD employs an Unix-diff-command like algorithm [58], which splits pages p and q into lines,
then employs a heuristic to find common lines. Thus, UD is relatively fast (takes time linear in
|p| + |q|), but finds only some old regions. We omit further description for space reasons, but refer
the reader to [58].
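For intuition, Python's difflib behaves much like such a diff-based matcher: fast and heuristic, with no guarantee of reporting every old region. This sketch stands in for the algorithm of [58], which is not reproduced here:

```python
import difflib

def ud_style_matcher(p: str, q: str):
    """Return (start_in_p, start_in_q, length) triples for regions of p that also
    occur in q. Like UD, this is heuristic: it may miss some old regions that a
    complete matcher such as ST would find."""
    sm = difflib.SequenceMatcher(a=q, b=p, autojunk=False)
    # get_matching_blocks yields (i, j, n) with q[i:i+n] == p[j:j+n]; drop the sentinel.
    return [(j, i, n) for i, j, n in sm.get_matching_blocks() if n > 0]

regions = ud_style_matcher("Dr. John Doe, CS professor", "Dr. John Doe is a CS prof")
```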
ST is a novel suffix-tree based matcher that we have developed, which finds all maximal old
regions of p using time linear in |p| + |q|. ST and DN thus represent the two ends of a spectrum
of matchers that trade off the result completeness for runtime efficiency, while UD represents an
intermediate point on this spectrum.
In the rest of this section we describe ST in detail. Roughly speaking, ST inserts all suffixes
of q and p into one suffix tree T [40]. As we insert each suffix of p, T helps us identify the longest
prefix of this suffix that also appears in q. To realize this intuition, however, we must handle a
number of intricacies, so that we can locate all maximal old regions without slowing down ST to
quadratic time.
2.3.1 Suffix Tree Basics
The suffix tree for a string q is a tree T with |q| leaves, each describing a suffix of q. T must
satisfy the following: (1) Each non-root internal node has at least two children. (2) Each edge is
labeled with a nonempty substring of q, and no two edges out of a node can have labels beginning
with the same character. (3) The path label of a node is the concatenation of all edge labels on the
path from the root to this node; each suffix of q corresponds to the path label of a leaf. (4) Each
non-root internal node with path label λu (where λ is a single character and u is a string) has a
suffix link to the node with path label u; the root has a suffix link to itself. Figure 2.2.a shows the
suffix tree for “ababbabaab$,” where symbol $ terminates the string. Suffix links are shown as
dotted lines.
To construct a suffix tree for q, we insert all suffixes of q one by one into an initially empty
tree. For example, the suffixes of “ababbabaab$” are “ababbabaab$,” “babbabaab$,” “abbabaab$,”
. . ., “b$.” Let si denote q[i..|q|], the i-th suffix of q. Conceptually, to insert si, we first look up si,
matching si against edge labels as we go down the tree until no more characters can be matched.
If lookup stops at a node, we insert si as a leaf below that node; if lookup stops in the middle of
an edge, we add a new node to split the edge right before the point where it diverges from si, and
then insert si as a leaf of the new node.
Unfortunately, if we insert every si by starting the lookup from the root, we would end up with
a quadratic-time algorithm. The secret to more efficient suffix-tree construction is to exploit the
suffix links, which allow us to leverage the matching work we have already done when inserting
si−1. We now sketch the construction algorithm below.
Suppose we have just inserted si−1 as a leaf child of node αi−1; note that αi−1 is the only
possibly new internal node created during the insertion of si−1. Next, we want to insert si into the
suffix tree, and ensure that αi−1’s suffix link is properly set up. To this end, we follow a series of
up, across, and down moves in the suffix tree. Suppose αi−1’s path label is λu, where λ is a single
character; note that u is a prefix of si. First, we go up from αi−1 to its parent θ, whose path label
is λu′, where u′ is a prefix of u. Then, following the suffix link of θ, we go across to θ′, whose
path label is u′. Next, starting from θ′, we go down the tree, matching u − u′, the substring of u
that follows u′. We end up with node β with path label u, to which we set the suffix link of αi−1.
If β does not currently exist in the tree, we create β by splitting the edge right where the matching
of u − u′ stops; we then add si (which, as we recall, begins with u) as a child of β. On the other
hand, if β already exists in the tree, we continue to go down the tree from β, matching si − u, the
substring of si that follows u, and insert si at the point where matching stops; this process may
create a new internal node. It can be shown that this construction algorithm is linear in the size of
the string [40].
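To make the tree structure concrete, here is a deliberately naive construction that inserts every suffix from the root. It is quadratic time and omits suffix links; the linear-time algorithm sketched above improves on exactly this scheme:

```python
class Node:
    def __init__(self):
        self.children = {}        # first char of edge label -> (edge_label, child)
        self.suffix_index = None  # set on leaves only

def insert_suffix(root, s):
    """Walk down from the root matching edge labels against s; split an edge at
    the point of divergence and hang s's unmatched remainder off a new leaf."""
    node, i = root, 0
    while True:
        c = s[i]
        if c not in node.children:            # lookup stopped at a node
            leaf = Node()
            node.children[c] = (s[i:], leaf)
            return leaf
        label, child = node.children[c]
        j = 0
        while j < len(label) and label[j] == s[i + j]:
            j += 1
        if j == len(label):                   # consumed the whole edge; descend
            node, i = child, i + j
        else:                                 # lookup stopped mid-edge: split it
            mid = Node()
            node.children[c] = (label[:j], mid)
            mid.children[label[j]] = (label[j:], child)
            leaf = Node()
            mid.children[s[i + j]] = (s[i + j:], leaf)
            return leaf

def build_suffix_tree(q):
    """Suffix tree for q, which must end with a unique terminator such as '$'."""
    root = Node()
    for i in range(len(q)):
        insert_suffix(root, q[i:]).suffix_index = i
    return root

def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())

tree = build_suffix_tree("ababbabaab$")   # the string from Figure 2.2
```

As expected, the tree has one leaf per suffix of the 11-character string.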
[Suffix-tree diagrams (a), (b), and (c) for “ababbabaab$” omitted.]
Figure 2.2 An example of inserting a suffix
Figure 2.2.b shows the suffix tree before inserting s7 of “ababbabaab$.” The only new internal
node in the tree now is α6 (the dark node). The path label for the dark node is “aba” and u is “ba.”
First, we go up from the dark node to its parent θ. Then we follow the suffix link of θ and go across
to θ′ (the dotted node). Notice that we skip looking up the first “b” in s7 by following the suffix
link. Next, from the dotted node, we go down the tree, matching the substring of u that follows
“b.” The matching stops in the middle of the edge with label “ab” out from the dotted node, which
leads to splitting the edge and creating a new node β. In Figure 2.2.c, β is the dark node. We then
insert the leaf corresponding to s7 as the child of β. Finally, we set up the suffix link from α6 to β.
2.3.2 ST: The Suffix-Tree Matcher
ST starts by building a suffix tree T for q, the old page, as described in Section 2.3.1. Next,
it inserts the suffixes of p, the new page, one by one, into T , and reports each maximal old region
as soon as it is detected. To carry out this second step, we make important extensions to both the
insertion procedure and the suffix tree structure. First, we augment suffix-tree nodes with prefix
links, which are crucial to finding old regions efficiently. We also show how to set up these links
during construction. Second, we show how to detect maximal old regions without introducing
additional performance overhead. We describe these two extensions next.
Finding Old Regions Using Prefix Links: By inserting s′i, the i-th suffix of p, into T , we can
easily find the longest common prefix between s′i and any suffixes that have been already inserted.
[Suffix-tree diagrams (a), (b), and (c) omitted.]
Figure 2.3 An example of prefix links
Let hi denote this string, which corresponds to node α′i, the parent of the leaf corresponding to s′i.
On the other hand, what we are looking for, ri, is the longest common prefix between s′i and any
suffix of q, the old page. Unfortunately, ri may not be the same as hi, because the suffix tree at this
time additionally contains suffixes s′1, . . . , s′i−1 of p, inserted earlier than s′i.
However, it is not difficult to see that ri must be a prefix of hi, because hi by definition cannot
be shorter than any common prefix between s′i and suffixes of q. To find ri, we need to locate
the last node on the path from the root to α′i with at least one descendant leaf corresponding to
a suffix of q. Efficiently finding this node, which we denote by δi, turns out to be quite tricky.
One might think that we should encounter δi as we go down T when inserting s′i. However, recall
from Section 2.3.1 that we use suffix links to avoid quadratic-time construction; thus, we reach α′i
without starting from the root, and possibly without passing through δi.
To ensure the efficiency of locating δi, we add a prefix link for each node of T . The prefix link of
node γ, denoted Lp(γ), points to its lowest ancestor with at least one descendant leaf corresponding
to a suffix of q. If γ itself has at least one descendant leaf corresponding to a suffix of q, we do not
explicitly store a prefix link, but we implicitly understand that Lp(γ) points to γ itself.
We construct prefix links as follows. Suppose we have created the suffix tree T for q. Then
there are no explicit prefix links yet (i.e., every node’s prefix link implicitly points to itself) because
every node leads to a suffix of q. Now, for every new leaf γ we create (for a suffix of p), we let Lp(γ)
point to the same node as γ’s parent’s prefix link. For an internal node γ created by splitting an
edge pointing to node γ′, if Lp(γ′) points to γ′ itself, we let Lp(γ) point to γ itself; otherwise, we set
Lp(γ) = Lp(γ′). For example, Figure 2.3.a shows the suffix tree for q = “ac$.” Figure 2.3.b shows
the prefix links (in solid arrows) after we insert the first two suffixes of p = “baabaaabaaaa$.” The
black leaves correspond to the suffixes of q. For nodes whose prefix link points to the node itself,
we do not show the link.
With prefix links, we now show how to find the longest common prefix between a suffix s′i of
p and any suffix of q, while inserting s′i into the suffix tree. After a leaf has been created for s′i, we
check the node δi pointed to by the prefix link of the leaf’s parent. The path label of δi gives us
the largest possible old region matching a prefix of s′i. For example, Figure 2.3.c shows the state
of the suffix tree before inserting s′9, the ninth suffix of p, “aaaa$.” We omit the irrelevant part
of the tree (in triangle) and links from the figure. Following the standard suffix-tree construction
algorithm, we first use the suffix link (in dotted arrow) of the parent node of α8 to go across to θ′.
Then we go down the tree and match the substring of u = “aaa” that follows “aa.” The matching
stops in the middle of the edge with label “abaaaa$,” which leads to splitting the edge and creating
a new internal node α′9 with path label “aaa.” The leaf for s′9 is then inserted below α′9. The prefix
links of α′9 and the leaf point to the same node pointed to by the prefix link (in solid arrow) of leaf
5. We then use the prefix link of α′9 to find “a,” the longest common prefix between s′9 and any
suffix of q.
Detecting Maximally Old Regions: So far, we have seen how to find, for each suffix of p, the
longest common prefix between it and all suffixes of q. However, these prefix matches are not
necessarily maximally old regions (cf. Definition 2.5). Although such matches cannot be extended
any further to the right, it may be possible to extend them to the left. How do we then find the
globally maximally old regions?
We make two observations. First, any maximally old region must be the longest common prefix
between some suffix of p and suffixes of q. The second observation is captured by the following
lemma:
Lemma 2.1. Let p[i − 1..j] be the longest common prefix between s′i−1, the (i − 1)-th suffix of p,
and any suffix of q. Let p[i..k] be the longest common prefix between s′i and any suffix of q. Then,
p[i..k] is a maximally old region if and only if k > j.
Proof. If k > j, p[i− 1..k] cannot be a substring of q, as p[i− 1..j] is already the longest common
prefix between s′i−1 and any suffix of q. Hence p[i..k] cannot be extended further to the left.
Furthermore, p[i..k] cannot be extended further to the right either because it is already the longest
common prefix between s′i and any suffix of q. Therefore, p[i..k] is a maximally old region.
If p[i..k] is a maximally old region, p[i − 1..k] cannot be a substring of q, which implies that
j < k.
The above observations lead to a simple, efficient method for identifying all maximally old
regions in a streaming fashion while we process suffixes of p one by one. After processing the i-th
suffix of p and finding the longest common prefix ri between it and q’s suffixes, we compare the
end position of ri with that of ri−1 (identified while processing the (i− 1)-th suffix of p). As long
as the end position has advanced, we output ri as a maximally old region.
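The streaming test of Lemma 2.1 can be sketched as follows; for clarity, each r_i is computed by brute force, whereas ST obtains the same r_i in linear time via the suffix tree:

```python
def maximal_old_regions(p: str, q: str):
    """Report (start, end) of every maximally old region of p with respect to q.
    r_i is the longest prefix of p[i:] occurring in q; by Lemma 2.1, r_i is a
    maximally old region exactly when its end position advances past r_{i-1}'s."""
    regions = []
    prev_end = -1
    for i in range(len(p)):
        k = 0
        while i + k < len(p) and p[i:i + k + 1] in q:   # brute-force r_i
            k += 1
        end = i + k - 1                                  # end of r_i (empty if k == 0)
        if k > 0 and end > prev_end:
            regions.append((i, end))
        prev_end = end
    return regions

# "ab", "abc", and "ab" are the maximally old regions of "ababcab" w.r.t. "abcd"
mor = maximal_old_regions("ababcab", "abcd")
```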
The complete pseudocode for ST is listed in Algorithm 2.1.
Runtime Complexity: We conclude this section by stating the complexity of our suffix-tree
matching algorithm in the following theorem. The dominating cost, in terms of both time and
space, comes from standard suffix tree construction. Our implementation uses balanced search
trees to manage parent-child relationships in the suffix tree, which introduces an additional time
cost factor c = O(log A), where A is the size of the alphabet. Other alternatives with c = O(1) also
exist, but we have found our implementation to work well when A is very large. This is probably
because suffix trees that use balanced search trees to manage parent-child relationships take less
space and thus lead to fewer cache misses.
Theorem 2.1. ST takes O((|p|+ |q|)c) time and O(|p|+ |q|) space, where c is the cost of looking
up a child of a node in the suffix tree.
Proof. First, we prove that ST takes O((|p| + |q|)c) time. ST proceeds in two phases. In the first
phase, it builds a suffix tree T (line 3) for q using O(|q|c) time [40]. In the second phase, ST finds
the maximally old regions while inserting each suffix of p into T (lines 4-26). Except the step of
locating α′i (line 9), each of the other steps takes O(1) time. Therefore, lines 2-8 and 10-26 take
O(|p|) time. [40] shows that the total time of locating all α′i (line 9) is dominated by the total time
Algorithm 2.1 ST
1: Input: old data page q, new data page p
2: Output: all maximal old regions R in p
3: T ⇐ buildSuffixTree(q)
4: //initialization
5: R ⇐ ∅
6: α′0 ⇐ T.root
7: for each suffix s′i of p do
8:   //locate the node corresponding to the longest common prefix of s′i and any suffixes in T and set up the suffix link of α′i−1
9:   α′i ⇐ longestCommonPrefix(s′i, T , α′i−1)
10:  if α′i is a new node created by splitting an edge pointing to γ then
11:    //set up the prefix link of α′i
12:    if Lp(γ) = γ then
13:      Lp(α′i) ⇐ α′i
14:    else
15:      Lp(α′i) ⇐ Lp(γ)
16:    end if
17:  end if
18:  Insert leaf η′i as a child of α′i
19:  Lp(η′i) ⇐ Lp(α′i)
20:  //find ri, the longest common prefix of s′i and any suffix of q, using the prefix link of α′i
21:  ri ⇐ p[i..i + pathLength(T.root, Lp(α′i)) − 1]
22:  //compare the ending positions of ri and ri−1 to check if ri is a maximal old region
23:  if ri.end > ri−1.end or i = 1 then
24:    R ⇐ R ∪ {ri}
25:  end if
26: end for
of locating the children of all nodes visited in T. The total number of nodes visited is O(|p| + |q|),
and the cost of locating the children of each node is c. Therefore, line 9 takes O((|p| + |q|)c) time.
Hence, the total time of the second phase is O((|p| + |q|)c), and the overall runtime of ST is
O((|p| + |q|)c).
Now, we prove that ST takes O(|p| + |q|) space. The space taken by ST is used to store the
suffix tree and the ending position of the longest common prefix between the most recently inserted
suffix of p and all suffixes of q. The latter only needs O(1) space. A standard suffix tree for a string
of length l has at most 2l nodes and takes O(l) space [40]. A suffix tree augmented with
prefix links has one prefix link per node. Therefore, the augmented tree still takes O(l) space. ST
builds a suffix tree T with prefix links to store all suffixes of p and q. Therefore, T has at most
2(|p| + |q|) nodes and takes O(|p| + |q|) space. Hence, the overall space taken by ST is
O(|p|+ |q|).
2.4 The Reuser + Extraction Module
Suppose Cyclex has selected a page matcher M (see Section 2.2). We now describe how M
works in conjunction with the reuser and the extraction module to recycle mentions and extract
new ones. We face two key challenges. First, since corpus snapshots often are large, we must
handle disk-resident data efficiently. Second, we must employ scope α and context β to identify
precise text regions from which it is “safe” to copy mentions or to apply extractor E. To address
these challenges, we proceed in the following three steps.
1. Find Copy Regions: We begin by reading pages from disk-resident Pn+1 in a sequential
manner. For each page p, we find q ∈ Pn which shares the same URL with p. (If no such q exists,
we simply apply extractor E to p.) Next, we apply M to p and q (in memory) to find old regions
(see Section 2.3).
Not all mentions in old regions (if we find any) are safe to be copied. This is illustrated by the
following example.
Example 2.4. Let q = “Dr. John Doe is a CS prof.”. Suppose extractor E declares string n to
be a person name if it is two capitalized words preceded by “Dr. ”. Then E has context β = 3,
and produces “John Doe” as a mention of q. Now consider p = “John Doe is a CS professor”.
Suppose M declares o = “John Doe is a CS prof” to be an old region of p. Then since “John Doe”
is a mention (of q) in o, we may think that it will also be a mention of p. However, this is incorrect
because applying E to p would produce no mention.
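Example 2.4 can be reproduced with a toy regex extractor (a hypothetical stand-in whose exact context width is immaterial to the point):

```python
import re

def person_extractor(page: str):
    """Toy extractor: two capitalized words immediately preceded by "Dr. ".
    The fixed-width look-behind is the mention's left context window."""
    return [(m.start(1), m.end(1) - 1, m.group(1))
            for m in re.finditer(r"(?<=Dr\. )([A-Z][a-z]+ [A-Z][a-z]+)", page)]

q = "Dr. John Doe is a CS prof."
p = "John Doe is a CS professor"
old_mentions = person_extractor(q)   # "John Doe" is extracted from q
new_mentions = person_extractor(p)   # but NOT from p: its left context was perturbed
```

Even though "John Doe" lies inside an old region of p, blindly copying it from q would be incorrect, which is why the context must also be contained in the old region.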
In general, we can copy a mention only if both the mention (e.g., “John Doe”) and its context
(e.g., “Dr.”) are contained in an old region. Specifically, if p[c..c + k] is an old region because it
matches q[c′..c′ + k], then we copy a mention m only if it is contained in the region q[c′ + β..c′ +
k − β]. We refer to such regions, from which it is safe to copy mentions, as copy regions. We now
describe finding copy regions, distinguishing two cases: disjoint old regions, and overlapping old
regions.
• Old regions are disjoint: Let r1, . . . , rk be old regions of p (discovered by matcher M ). We
represent each ri as a tuple (idp, idq, sp, sq, l), where idp and idq are IDs of p and q, sp and sq are
the start positions of the old region in p and q, respectively, and l is the length of the old region.
Suppose old regions represented by r1, . . . , rk are disjoint. Then we simply construct for each
ri a copy region hi, which is a tuple (idp, idq, s′p, s′q, l′), where s′p = sp + β, s′q = sq + β, and
l′ = l − 2β. Next, we insert hi into a memory-resident table H.
• Old regions are overlapping: In this case we extend the above algorithm so that we copy each
mention in the overlapping regions only once. First, we construct a set of copy region candidates
by chopping β characters at both ends of each old region, as we described in the disjoint case. Let
the resulting set of regions be r′1, . . . , r′k. This step gives us a set of regions where we are sure that
if a mention is contained in one of those regions, it will be extracted by E from p, and thus it can
be safely copied. However, since regions r′1...r′k can overlap, a mention can be contained in more
than one region and copied more than once. The following two steps ensure that any mentions
contained in at least one of r′1...r′k will be copied exactly once.
Let a and b be two overlapping regions from r′1, . . . , r′k. Then a corresponds to a copy region
candidate p[i..j] and b corresponds to another copy region candidate p[k..l] such that i < k <
j < l. Then we discard a and b and instead generate the following regions: (1) regions c, d, e that
correspond to p[i..k − 1], p[k..j], p[j + 1..l], respectively. These regions are created so that we
can avoid copying mentions in region d twice. (2) regions f, g that correspond to p[k − α..k + α]
and p[j − α..j + α], respectively. These regions are created to catch any mention that may cross the
splitting points k and j and thus is not contained in any of the above regions.
We insert the tuples corresponding to these regions into table H. Figure 2.4 shows the data
flow of Cyclex for the step of finding copy regions in Phase I.
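The two cases above can be sketched as follows; the handling of old regions shorter than 2β, and the exact tuple layout, are assumptions for illustration:

```python
def copy_regions_disjoint(old_regions, beta):
    """Disjoint case: turn old-region tuples (id_p, id_q, s_p, s_q, l) into copy
    regions by trimming beta characters from each end (s'_p = s_p + beta,
    s'_q = s_q + beta, l' = l - 2*beta). Regions too short to survive the trim
    are dropped (an assumption; the text leaves this case implicit)."""
    H = []
    for idp, idq, sp, sq, l in old_regions:
        if l > 2 * beta:
            H.append((idp, idq, sp + beta, sq + beta, l - 2 * beta))
    return H

def split_overlapping_pair(a, b, alpha):
    """Overlapping case: replace candidates a = p[i..j] and b = p[k..l] with
    i < k < j < l by regions c, d, e plus boundary regions f, g, so that any
    mention in the overlap d is copied exactly once."""
    (i, j), (k, l) = a, b
    c, d, e = (i, k - 1), (k, j), (j + 1, l)
    f = (k - alpha, k + alpha)   # catches mentions crossing split point k
    g = (j - alpha, j + alpha)   # catches mentions crossing split point j
    return [c, d, e, f, g]

H = copy_regions_disjoint([("p1", "q1", 10, 40, 30), ("p1", "q1", 100, 90, 5)], beta=3)
split = split_overlapping_pair((0, 50), (30, 80), alpha=10)
```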
2. Find Extraction Regions & Apply Extractor E: Let c1, . . . , ct be the copy regions of p,
identified as in Step 1. We now find extraction regions, those regions of p on which we must apply
extractor E, to ensure the correctness of Cyclex.
[Data-flow diagram omitted.]
Figure 2.4 Data flow of Cyclex: in Phase I, the page matcher M compares each page p ∈ Pn+1 with its same-URL page q ∈ Pn to find old regions, from which copy regions (table H) and extraction regions are derived; in Phase II, extractor E processes the extraction regions while the reuser copies mentions from Mn, together producing Mn+1.
To obtain extraction regions, at first glance it appears that we can simply remove copy regions
from p. However, it is not difficult to construct examples where this would “remove too much,” thus
dropping mentions that we should have found for p. In general, we can prove that if p[c..c + k] is
an old region, then it is safe to remove only region p[c + γ..c + k − γ], where γ = 2β + α − 1.
We now describe finding extraction regions for two cases: disjoint old regions, and overlapping
old regions.
• Old regions are disjoint: Let R be the set of disjoint old regions of p. We begin by initializing
c, the start position of the next extraction region, to 1. Then we scan regions of R sequentially,
in increasing value of their start positions. For each r ∈ R, we create p[c..(r.sp − 1 + γ)] as an
extraction region. Then we update c = r.sp + r.l − γ. The last extraction region ends at position |p|.
• Old regions are overlapping: In this case, the extraction regions identified by the above algorithm
might not be minimal, in the sense that we could remove some parts of the extraction regions
and still guarantee the correctness of Cyclex. Hence, we would waste time applying E over these
additional regions.
To ensure that an identified extraction region is not contained in any old region, we extend
the algorithm for the disjoint case as follows. First, we repeatedly concatenate any two
overlapping old regions p[i..j] and p[k..l] if the length of the overlapping part is larger than γ.
Without loss of generality, suppose i < k < j < l. Since j− k ≥ γ + 1, the maximal length of the
β-context of any mention extracted by E, the β-context of any mention across the two old regions
p[i..j] and p[k..l] is either contained in p[i..j] or p[k..l], and thus the mention will be copied. Hence,
we can ignore the adjacent boundaries of p[i..j] and p[k..l] when identifying extraction regions. We
refer to the concatenated regions as super old regions. Let the set of super old regions be R′. Any
mention such that both itself and its context is contained in a region r′ ∈ R′ will be copied.
Next, we create a set of extraction regions to catch any mention that will not be copied. For
each r′ corresponding to p[i..j] in R′, we create a removal region p[i+γ..j−γ]. Since the length of
the overlapping part of any two regions in R′ is at most γ, the removal regions created at this step
are disjoint. Let the set of removal regions be D. Finally, we remove D from p, and the remaining
regions are the extraction regions.
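As a concrete illustration, the disjoint-case scan described above can be sketched as follows (the (start, length) encoding of old regions, the clamping to page bounds, and the skipping of empty regions are our additions for a runnable sketch):

```python
def extraction_regions_disjoint(page_len, old_regions, gamma):
    """Compute the extraction regions of a page p (1-based, inclusive
    positions) from disjoint old regions given as (start, length) pairs,
    following the sequential scan of the disjoint case."""
    regions = []
    c = 1  # start position of the next extraction region
    for sp, l in sorted(old_regions):
        end = sp - 1 + gamma            # extraction region p[c .. sp-1+gamma]
        if end >= c:
            regions.append((c, min(end, page_len)))
        c = sp + l - gamma              # skip the safely removable middle
    if c <= page_len:
        regions.append((c, page_len))   # the last extraction region ends at |p|
    return regions
```

For an old region p[20..59] with γ = 5, only p[25..54] is removed, exactly the safe removal region p[c + γ..c + k − γ] from the text.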
Once we have identified all extraction regions of a page p, we apply extractor E to these
regions. To guarantee correctness of Cyclex, among all extracted mentions, we only retain those
such that both the mentions and their contexts are contained in an extraction region. We then insert
the retained mentions into a memory-resident table N . N is flushed to the disk-resident table Mn+1
(which stores all mentions extracted from Pn+1) whenever it is full. Figure 2.4 shows the data flow
of Cyclex for the step of finding extraction regions and applying extractor E in phase I.
3. Copy Mentions from Copy Regions: We repeat step 1 and step 2 until we have processed all
pages p in Pn+1. At this point, we have extracted mentions from all extraction regions. We have
also stored all copy regions (actually, only the start- and end-positions of these regions, not the
regions themselves) in table H . Now we must copy to N any mention that (a) exists in Mn (the IE
result over the previous snapshot Pn) and (b) can be found in a region stored in H .
Since Mn can be large, we assume it is on disk. Furthermore, since each application may want
to store the mentions in a particular order (for further processing, e.g., mention disambiguation),
we do not assume any particular order for mentions in Mn. Rather, we proceed as follows. We
perform a sequential scan of Mn. For each mention m of Mn, we immediately probe m against
regions of table H (implemented as a hash table, with key idq, sq and l). In case of a hit, m appears
in one of the copy regions; thus, we construct an appropriate mention m′ of p (that corresponds to
m), then insert m′ into table N . Figure 2.4 shows the data flow of Cyclex for the step of copying
mentions in phase II.
The following theorem states the correctness of Cyclex:
Theorem 2.2 (Correctness of Cyclex). Let Mn+1 be the set of mentions obtained by applying
extractor E from scratch to snapshot Pn+1. Then Cyclex is correct in that when applied to Pn+1 it
produces exactly Mn+1.
Proof. Let M′n+1 be the set of mentions produced by applying Cyclex to Pn+1. Let Mn be the set
of mentions produced by applying E to Pn.
We first prove that M′n+1 ⊆ Mn+1. Let m be a mention in M′n+1 and p be the page that contains
m. Cyclex produces m in one of the following ways:
Case 1: If Cyclex produces m by copying mentions from a copy region r, there must exist a
mention m′ in Mn and a region r′ in a data page q ∈ Pn such that m = m′, r contains the β-context
of m, r′ contains the β-context of m′, and r matches r′. Therefore the β-context of m matches the
β-context of m′. This implies the β-context of m′ is contained in r, and thus in p. Hence, p can be
obtained by perturbing the text of q outside the β-context of m′. From the definition of β-context,
it follows that applying E to p from scratch also produces m′. Since m′ = m, producing m′ is
equivalent to producing m. Hence, applying E to p from scratch produces m.
Case 2: If Cyclex produces m by applying E to an extraction region r in page p, r must contain
the β-context of m. Since p can be generated by perturbing the text of r outside the β-context of
m, from the definition of β-context, it follows that applying E to p from scratch also produces m.
Case 3: If Cyclex produces m by applying E to the entire data page p (i.e., there does not exist
q ∈ Pn such that q shares the same URL with p), then obviously applying E to p produces m.
In summary, no matter how Cyclex produces m, applying E to p ∈ Pn+1 from scratch also
produces m. Therefore M′n+1 ⊆ Mn+1.
Similarly, we can prove Mn+1 ⊆ M′n+1. Given that M′n+1 ⊆ Mn+1 and Mn+1 ⊆ M′n+1, it
follows that M′n+1 = Mn+1.
2.5 The Cost-Based Matcher Selector
We now describe how the matcher selector employs a cost model to select the best matcher
(one that minimizes Cyclex’s runtime).
Our cost model captures the three execution steps of Section 2.4. We model the elapsed time of
each step as a weighted sum of I/O and CPU costs. The weights are measured empirically, allowing
us to account for varying execution characteristics across steps, implementations, and platforms.
With the weights, we can reasonably capture completion times of highly tuned implementations
that overlap I/O with CPU computation (in this case, the dominated cost component will be
completely masked and therefore have weight 0) as well as simple implementations that do not exploit
parallelism.
Let m be the number of pages in Pn+1, mb be the total size of Pn+1 on disk (in blocks), and
l be the average page size (in bytes). Let n be the number of mentions in the previous mention
table Mn, and nb be the total size of Mn on disk (in blocks). Let b be the number of buckets in
the in-memory hash table H (cf. Section 2.4). We model the completion time of a Cyclex plan on
Pn+1 as:
w1,IO · mb · f + w1,mat · m · l · f + w1,ex · m · l · f · g (2.1)
+ w2,IO · nb + w2,find · n · (m · f · h / b) (2.2)
+ w3,IO · mb · (1 − f) + w3,ex · m · l · (1 − f), (2.3)
where f is the fraction of pages in Pn+1 with a match in Pn; g measures, on average, what fraction
of the text within a matched page still needs re-extraction; and h is the average number of tuples
inserted into hash table H per matched page. The w’s are weights, whose numeric subscripts
reflect which phases incur the associated costs.
Line (2.1) models the completion time of the first execution step. This includes I/O cost of
reading in matching pages from Pn+1 and Pn, CPU cost of matching the pairs of pages to identify
copy regions, and the CPU cost of applying E to extraction regions. Line (2.2) models the second
Data Sets                     DBLife    Wikipedia
# Data Sources                980       925
Time Interval                 1 day     21 days
# Snapshots                   30        20
Avg # Pages per Snapshot      10155     3038
Avg Size per Snapshot         180M      35M

Extractors for DBLife                            α     β
researcher (first name, mid name, last name)     32    3
affiliation (researcher name, organization)      93    7
talk (speaker, time, location, topics)           400   10

Extractors for Wikipedia                         α     β
actor (first name, mid name, last name)          35    3
play (actor name, movie)                         96    4
award (actor name, award, movie, role)           250   10

Figure 2.5 Data sets and extractors for our experiments
step. This includes I/O cost of reading in Mn, and CPU cost of probing H to determine whether to
copy each mention. The term m · f · h / b estimates the number of hash table entries per bucket. Finally,
Line (2.3) models I/O cost of reading in unmatched pages in Pn+1, and CPU cost of applying E to
them. In all three steps, we ignore the cost of writing out mentions in Pn+1, since this cost is the
same for all matcher choices.
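Written out as code, the model reads as follows (a direct transcription of Lines (2.1)–(2.3); the weight-dictionary keys are our naming convention):

```python
def cyclex_cost(w, m, mb, l, n, nb, b, f, g, h):
    """Evaluate the plan-cost model, Lines (2.1)-(2.3), for one matcher
    choice. w maps weight names to their empirically measured values;
    the remaining arguments are the corpus and estimated parameters."""
    step1 = w["1,IO"] * mb * f + w["1,mat"] * m * l * f + w["1,ex"] * m * l * f * g
    step2 = w["2,IO"] * nb + w["2,find"] * n * (m * f * h / b)
    step3 = w["3,IO"] * mb * (1 - f) + w["3,ex"] * m * l * (1 - f)
    return step1 + step2 + step3
```

The matcher selector would evaluate this function once per candidate plan (DN, ST, UD), with f, g, and h estimated per plan, and execute the plan with the smallest value.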
As a special case for DN, which simply runs E over the entire Pn+1, Lines (2.1) and (2.2) are
always 0, and f = 0 on Line (2.3). For UD and ST, f is the same. In general, however, the
parameters f , g, h, and the w’s need to be estimated, and their values may differ across alternatives. On
the other hand, the remaining parameters do not need to be estimated, because their exact values are
directly available from either the corpus metadata (for m, mb, l, n, and nb) or the execution context
(for b).
We estimate the parameters using a small sample S of Pn as well as the past k snapshots, for
a pre-specified k. Section 2.6 demonstrates empirically that small |S| and k are sufficient for our
applications of Cyclex, meaning that parameter estimation and cost-based plan selection adds very
little overhead to the overall cost.
2.6 Empirical Evaluation
We now empirically evaluate the utility of Cyclex. Figure 2.5 describes the two real-world data
sets and six extractors used in our experiments. DBLife consists of 30 consecutive one-day snapshots
from the DBLife system [31], and the Wikipedia data set consists of 20 consecutive snapshots obtained
from Wikipedia.com. The DBLife extractors extract mentions of academic entities and their
relationships, and the three Wikipedia extractors extract mentions of entertainment entities and relationships (see the figure).
tionships (see the figure). Although these extractors may not be the state-of-the-art IE solutions to
extract these entities and relationships, there are real-world IE systems (e.g. DBLife) that employ
such extractors. Our goal is to evaluate the utility of Cyclex for the extractors used in those IE
systems.
We obtained extractor scopes and contexts by analyzing the extractors. For example, the “talk”
extractor detects speakers, times, and topics by matching a set of regular expressions. The length
of the extraction context for these attributes is 0. Then “talk” detects the location attribute by (a) detecting
a set of keywords such as “Location: ,” “Room: ,” etc., and (b) extracting 1-2 capitalized words
immediately following the detected keyword as the location. We thus set the context β of “talk” to
be the maximal length of all keywords.
Runtime Comparison: For each of the above six extraction tasks, Figure 2.6 shows the run-
time of Cyclex vs. DNplan, STplan, and UDplan, three plans that employ matchers DN, ST, and
UD, respectively, over all consecutive snapshots (the X axis). The runtimes of DNplan are signif-
icantly higher than those of the other three plans. Hence, to clearly show the differences in the
runtimes among all plans in one figure, we only plot the curves of STplan, UDplan, and Cyclex,
and summarize the trends of the curves of DNplan. Note that for each snapshot, Cyclex employs
a cost model to pick and execute the best among the above three plans. Cyclex’s runtime includes
statistic collection, optimization, and execution times.
The results show that in all cases except “actor,” UDplan, STplan, and Cyclex drastically cut
runtime of DNplan (which always applies extraction from scratch to the current snapshot), by
50-90%. This suggests that recycling past IE efforts can be highly beneficial.
[Figure 2.6 charts omitted: per-snapshot runtime (s) of DNplan, STplan, UDplan, and Cyclex for the researcher, affiliation, talk, play, award, and actor tasks.]
Figure 2.6 Runtime of Cyclex versus the three algorithms that use different page matchers
Next, the results show that none of DNplan, STplan, and UDplan is uniformly better than
the others. For example, for “actor,” where the changes between two consecutive snapshots are
substantial and the extraction cost is fairly low, DNplan outperforms UDplan and STplan. In
contrast, for “play” and “award,” where the change of data is still substantial but extraction is very
expensive, STplan is the winner. For DBLife cases, where the consecutive snapshots change little
and matching regions detected by UD and ST are quite similar, UDplan is the winner.
[Figure 2.7 charts omitted: average runtime (s) of DNplan, UDplan, STplan, and Cyclex, broken into Match, Extraction, Copy, Opt, and Others, for the researcher, affiliation, talk, play, award, and actor tasks; off-scale DNplan averages: researcher 2261, affiliation 3027, talk 11198, award 3885.]
Figure 2.7 Runtime decomposition of different plans
The above results underscore the importance of optimization to select the best plan for a partic-
ular extraction situation. They also show that Cyclex handles this optimization well. It successfully
picks the fastest plan in all six cases, while incurring only a modest overhead of 4-13% of the runtime
of the fastest plan.
Contributions of Components: Figure 2.7 shows the decomposition of runtime of various plans
(numbers in the figure are averaged over five random snapshots per IE task). “Match” is time to
match pages, “Extraction” is time to apply IE, “Copy” is time to copy mentions, “Opt” is
optimization time of Cyclex, and “Others” is the remaining time (reading file indices, doing scoping,
etc.).
The results show that matching and extracting dominate runtimes, hence we should focus on
optimizing these components. The suffix-tree matcher ST clearly spends more time finding old
regions than the Unix-diff matcher UD. However, the figure shows that this effort clearly pays off
in certain cases, such as “play” and “award,” where IE is expensive and the consecutive snapshots
change substantially. Here, STplan saves significant time avoiding IE. Finally, the results show
that the overhead of Cyclex (statistic collection, etc.) remains insignificant compared to the overall
runtime.
We also found that DNplan (i.e., applying IE from scratch) incurs very little I/O time in most
tasks (less than 3% of total runtime; numbers not shown due to space reasons). Thus, it is important
to optimize CPU time, as we do in this work.
[Figure 2.8 charts omitted: accuracy (%) of the cost model for the “affiliation” and “play” tasks, plotted against each parameter.]
Figure 2.8 Accuracy of cost models as a function of (a) number of snapshots k, (b) sample size |S|, (c) α, (d) β
Sensitivity Analysis: Finally, we examined the sensitivity of Cyclex with respect to the main input
parameters: k and |S|, the number of snapshots and the size of the sample used in statistic estimation, and the
scope and context values.
Figure 2.8.a plots the “accuracy” of Cyclex as a function of k, where “accuracy” is the fraction
of snapshots on which Cyclex picks the correct (i.e., fastest) plan. We show results for “affiliation”
and “play” only; results for the other IE tasks show similar phenomena.
Figure 2.8.b-d plots the “accuracy” of Cyclex in a similar fashion against changes in the sample
size |S|, scope α, and context β, respectively.
The results show that Cyclex needs only a few recent snapshots (3) and a small sample size
(30 pages) to do well. Regarding scope and context, the results show that for “affiliation,”
Cyclex performs well even when we increased α and β significantly, by 5 and 100 times,
respectively. For “play,” Cyclex performs well until α was increased by 4 times. As α increases,
the difference between the fastest plan, STplan, and the second fastest plan, UDplan, becomes
smaller and smaller, thus causing the optimizer to mistakenly select the second fastest plan on
certain snapshots.
[Figure 2.9 charts omitted: runtime ratio (%) of STplan and UDplan, relative to DNplan, for the “affiliation” and “play” tasks, plotted against α and β.]
Figure 2.9 Ratio of runtimes as a function of α and β
In the final experiment, Figure 2.9 shows the runtime ratio of STplan and UDplan as a function
of α and β. The runtime ratio is the ratio of the runtime of these plans over the runtime of DNplan.
The results show that this ratio changes only slowly, as we increase α and β. This suggests that a
rough estimation of α and β does increase the runtime of the various plans, but only in a graceful
fashion.
2.7 Summary
A growing number of real-world applications must deal with IE over dynamic text corpora.
We have shown that executing such IE in a straightforward manner is very expensive, and have
developed Cyclex, an efficient solution that recycles past IE results. As far as we know, Cyclex
is the first in-depth solution in this direction. Our extensive experiments over two real-world data
sets demonstrate that Cyclex can dramatically cut the runtime of re-applying IE from scratch by
50-90%. This suggests that recycling past IE results can be highly beneficial.
Chapter 3
Recycling for Complex IE Programs
The Cyclex work clearly established that recycling IE results for evolving text corpora is highly
promising. The work itself however suffers from a major limitation: it considers only IE programs
that contain a single IE “blackbox.” Real-world IE programs, in contrast, often contain multiple
IE blackboxes connected in a compositional “workflow.” Since Cyclex is not aware of the compo-
sitional nature of such IE programs (effectively treating the whole program as a large blackbox),
its utility is severely limited in such settings.
To remove this limitation, in this chapter we describe Delex, a solution for effectively executing
multi-blackbox IE programs over evolving text data.
We first formally define our problem in Section 3.1. Sections 3.2–3.5 describe Delex. Sec-
tion 3.6 presents an empirical evaluation. Finally, Section 3.7 concludes this chapter.
3.1 Problem Definition
We now briefly describe xlog (see [67] for a detailed discussion), then build on it to define the
problem considered in this chapter.
Compositional, Multi-Blackbox IE Programs: As discussed in Section 4.1, Cyclex has clearly
demonstrated the potential of recycling IE. However, it handles only single-blackbox IE programs,
which severely limits its applicability. Thus, in this chapter, we build on Cyclex to develop an
efficient solution for multi-blackbox IE programs.
To do so, we must first decide how to represent such programs. Many possible representations
exist (e.g., [37, 28, 67]). As a first step, in this chapter we will use xlog [67], a recently developed
(a)
R1: titles(d,title) :- docs(d), extractTitle(d,title).
R2: abstracts(d,abstract) :- docs(d), extractAbstract(d,abstract).
R3: talks(d,title,abstract) :- titles(d,title), abstracts(d,abstract),
immBefore(title,abstract), approxMatch(abstract,“relevance feedback”).
(b) [plan tree omitted: extractTitle(d,title) and extractAbstract(d,abstract) are applied to docs(d), their outputs joined, then filtered by σimmBefore(title,abstract) and σapproxMatch(abstract,“relevance feedback”)]
Figure 3.1 (a) A multi-blackbox IE program P in xlog, and (b) an execution plan for P .
declarative IE representation. Extending our work to other IE representations is a subject for future
research.
xlog is a Datalog variant with embedded procedural predicates. Like Datalog, each xlog pro-
gram consists of multiple rules p :− q1, . . . , qn, where the p and qi are predicates. For example,
Figure 3.1.a shows an xlog program P with three rules R1, R2, and R3, which extract talk titles
and abstracts from seminar announcement pages. Currently xlog does not yet support negation or
recursion.
xlog predicates can be intensional or extensional, as in Datalog, but can also be procedural. A
procedural predicate, or p-predicate for short, q(a1, . . . , an, b1, . . . , bm) is associated with a proce-
dure g (e.g., written in Java or Perl) that takes as input a tuple (a1, . . . , an) and produces as output
tuples of the form (a1, . . . , an, b1, . . . , bm). For example, extractTitle(d, title) is a p-predicate in
P that takes a document d and returns a set of tuples (d, title), where title is a talk title appearing
in d. We define p-functions similarly. We single out a special type of p-predicate that we call an IE
predicate, defined as:
Definition 3.1 (IE predicate). An IE predicate q extracts one or more output text spans from a
single input span. Formally, q is a p-predicate q(a1, . . . , an, b1, . . . , bm), where there exist i and j
such that (a) ai is either a document or a text span variable, (b) bj is a span variable, and (c) for
any output tuple (u1, . . . , un, v1, . . . , vm), ui contains vj (i.e., q extracts span vj from span ui).
In Figure 3.1.a, extractTitle(d, title) is an IE predicate that extracts a title span from document
d. The p-predicate extractAbstract(d, abstract) is another IE predicate, whereas immBefore
(title, abstract) (a p-predicate that evaluates to true if title occurs immediately before abstract)
is not.
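A toy IE predicate in the sense of Definition 3.1 might look as follows (the dissertation notes such blackboxes are typically written in Java or Perl; the “Title:” pattern here is purely illustrative, not the actual extractTitle implementation):

```python
import re

def extract_title(d):
    """A toy IE predicate extractTitle(d, title): from the input span d
    (here a whole document string), yield output tuples (d, title) where
    each extracted title is a text span contained in d, as Definition 3.1
    requires."""
    for m in re.finditer(r"Title:\s*(.+)", d):
        yield (d, m.group(1))
```

Note the defining property: every output span (the title) is contained in the input span (the document), which is what distinguishes IE predicates from p-predicates like immBefore.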
Thus, an xlog program cleanly encapsulates multiple IE blackboxes using IE predicates, and
then stitches them together using Datalog. To execute such a program, we must translate (and
possibly optimize) it to obtain an execution plan that mixes relational operators with blackbox
procedures. Figure 3.1.b shows a possible execution plan T for program P in Figure 3.1.a. T
extracts all titles and abstracts from d, and keeps only those (title, abstract) pairs where the title
occurs immediately before the abstract. Finally, T retains only talks whose abstracts contain the
phrase “relevance feedback” (allowing for misspelling and synonym matching).
Problem Definition: We are now in a position to define the problem considered in this chapter.
PROBLEM DEFINITION Let P1, . . . , Pn be consecutive snapshots of a text corpus, P be
an IE program written in xlog, E1, . . . , Em be the IE blackboxes (i.e., IE predicates) in P , and
(α1, β1), . . . , (αm, βm) be the estimated scopes and contexts for the blackboxes, respectively. De-
velop a solution to execute P over corpus snapshot Pn+1 with minimal cost, by reusing extraction
results over P1, . . . , Pn.
To address this problem, a simple solution is to detect identical pages, then reuse IE results on
those. This reuse-at-page-level solution however provides only limited reuse opportunities, and
does not work well when the text corpus changes frequently.
Another solution is to apply Cyclex to the whole program P , effectively treating it as a single
IE blackbox. We however found that this reuse-at-whole-program-level solution does not work
well either (see Section 3.6). The main reason is that estimating “tight” α and β for the whole IE
program P is very difficult. Whether we do so directly, by analyzing the behavior of P (which
tends to be a large and complex program), or indirectly, by using the (αi, βi) of its component
blackboxes, we often end up with large α and β, which limits reuse opportunities.
[Figure 3.2 diagram omitted: (a) an execution tree T combining blackboxes E1, E2, E3, E4 with σ, π, and join operators; (b) the same tree grouped into IE units U, V, Y, and Z.]
Figure 3.2 (a) An execution tree T , and (b) IE units of T .
These problems suggest that we should try to reuse at a finer granularity: regions in a page
instead of whole page, and at a finer level: program components instead of whole program. The
Delex solution captures this intuition. In the rest of the chapter we describe Delex in detail.
3.2 Capturing IE Results
We will explain Delex in a bottom-up fashion. Let T be an execution plan of the target IE
program P (see Problem Definition). In this section we consider what to capture for reuse, when
executing T on a corpus snapshot Pn.
Section 3.3 then discusses how to reuse the captured result when executing T on Pn+1. Section
3.4 describes how to select such a plan T in a cost-based fashion. Section 3.5 puts all of these
together and describes the end-to-end Delex solution.
In what follows we describe how to decide on the level of reuse, what to capture, and how to
store the captured results, when executing T on snapshot Pn.
Level of Reuse: Recall that we want to reuse at the granularity of program components, instead of
the whole program. The question is which components. A natural choice would be the individual
IE blackboxes. For example, given the execution tree T in Figure 3.2.a (in the rest of the chapter
we use “tree,” “execution tree,” and “execution plan” interchangeably), the four IE blackboxes
E1, . . . , E4 would become “reuse units,” whose input and output would be captured for subsequent
reuse.
Reusing at the IE-blackbox level however turns out to be suboptimal. To explain, consider for
instance blackbox E1 (Figure 3.2.a), and let σ(E1) denote the edge of T that applies the selection
operator σ to the output of E1. Instead of storing the output of E1, we can store that of σ(E1).
Doing so does not affect reuse (as we will see below), but is better in two ways. First, it would
incur less storage space, because σ(E1) often produces far fewer output tuples than E1. Second,
less storage space in turn reduces the time of writing to disk (while executing T on Pn) and reading
from disk (for reuse, while executing T on Pn+1). Consequently, we reuse at the level of IE units,
defined as:
Definition 3.2 (IE Unit). Let X = N1 ← N2 ← · · · ← Nk denote a path on tree T that applies
Nk−1 to Nk, Nk−2 to Nk−1, and so on. We say X is an IE unit of T iff (a) Nk is an IE blackbox,
(b) N1, . . . , Nk−1 are relational operators σ and π, and (c) X is maximal in that no other path
satisfying (a) and (b) contains X .
For example, tree T in Figure 3.2.a consists of four IE units U, V, Y , and Z, as shown in
Figure 3.2.b.
In essence, each IE unit can be viewed as a generalized IE blackbox, with similar notions of
scope α and context β. In this setting, it is easy to prove that we can set the (α, β) of an IE unit
N1 ← N2 ← · · · ← Nk to be exactly those of the IE blackbox Nk. This property is desirable and
explains why we do not include join operator ./ in the definition of IE unit: doing so would prevent
us from guaranteeing the above “wholesale transfer” of (α, β) values.
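Identifying the IE units of a plan tree per Definition 3.2 can be sketched as follows; the nested-tuple encoding of plan nodes and the operator names are our assumptions for illustration:

```python
RELATIONAL = {"sigma", "pi"}

def ie_units(node, chain=None, units=None):
    """Collect the IE units of a plan tree given as nested tuples
    (op, *children); leaves are blackbox names like 'E1'. Each unit is
    returned as a list [op1, ..., opk-1, blackbox]: a maximal chain of
    sigma/pi operators ending in an IE blackbox."""
    if units is None:
        units = []
    chain = chain or []
    if isinstance(node, str):              # an IE blackbox leaf
        units.append(chain + [node])
        return units
    op, *children = node
    # A unary sigma/pi extends the current chain; anything else
    # (e.g., a join) breaks it, per Definition 3.2.
    next_chain = chain + [op] if op in RELATIONAL and len(children) == 1 else []
    for child in children:
        ie_units(child, next_chain, units)
    return units
```

Because a join resets the chain, a σ sitting above a join never joins an IE unit, matching the definition's exclusion of ./ from units.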
IE Results to Capture: Next we consider what to capture for each IE unit U of tree T . Concep-
tually, each such unit U (which is an IE blackbox E augmented with σ and π operators, whenever
possible) can be viewed as extracting a set of mentions from a text region of a document. Formally,
we can write U : (did, s, e, c) → {(did,m, c′)}, where
• did is the ID of a document d,
• s and e are the start and end positions of a text region S in d,
• c denotes the rest of the input parameter values (see the example below),
• m denotes a mention (of a target relation) extracted from text region S, and
• c′ denotes the rest of the output values.
Example 3.1. Consider a hypothetical IE unit σallcap(title)(extractTitle(d, maxlength, title,
numtitles)), which extracts all titles not exceeding maxlength from document d, selects only
those in all capital letters, and outputs them as well as the number of such titles.
Here, for the input tuple, did is the ID of document d, s and e are the positions of the first and
last characters of d (because text region S is the entire document d), and c denotes maxlength.
For the output tuple, m is an extracted title, and c′ denotes numtitles.
In order to reuse the results of U later, at the minimum we should record all mentions m
produced by U (recall that given an input tuple (did, s, e, c), U produces as output a set of tuples
(did,m, c′)). Then, whenever we want to apply U to a region S in a page p, we can just copy
over all mentions of a region S ′ in some page q in a past snapshot, which we have recorded
when applying U to S ′, provided that S matches S ′ and that it is safe to copy the mentions (see
Section 3.1).
This is indeed what Cyclex does. In the Delex context, however, it turns out that since we
employ multiple IE blackboxes that can be “stacked” on top of one another, we must record more
information to guarantee correct reuse, as the following example illustrates.
Example 3.2. Consider a page p = “Midwest DB Courses: CS764 (Wisc), CS511 (Illinois)”.
Suppose we have applied an IE unit V to p to remove the headline (by ignoring all text before
“:”), and then applied another IE unit U to the rest of the page to extract locations “Wisc” and
“Illinois”.
Suppose the next day the page is modified into p′ = “Midwest DB Courses This Year CS764
(Wisc), CS511 (Illinois)”, where character “:” has been omitted (and some new text has been
added). Consequently, V does not remove anything from p′, and p′ ends up sharing the region
S = “Midwest DB Courses” with p. Thus, when applying U to p′, we will attempt to copy over
mentions found in this region. Since no such mention was recorded, however, we will conclude that
applying U to region S in p′ produces no mention. This conclusion is incorrect, since “Midwest”
is a valid location mention in S.
The problem is that no mention has been recorded in region S for U and p, not because U
failed to extract any such mentions from S, but rather because U has never been applied to S. U
can only take as input whichever regions V outputs, and V did not output S when it operated on p.
Thus, we must record not only the previously extracted mentions, but also the text regions that
an IE unit has operated over. Specifically, for an IE unit U : (did, s, e, c) → {(did,m, c′)}, we
record all pairs (s, e) and the mentions m associated with those. It is easy to see that we must
record c as well, for otherwise we do not know the exact conditions under which a mention m was
produced, and hence cannot recycle it appropriately.
Storing Captured IE Results: We now describe how to store the above intermediate results while executing tree T on a corpus snapshot Pn. Our goal is to produce, at the end of the run on Pn, two reuse files I^n_U and O^n_U for each IE unit U in tree T.
During the run, whenever U takes as input a tuple (did, s, e, c), we append a tuple (tid, did, s, e, c), where tid is a tuple ID (unique within I^n_U), to I^n_U, to capture the region that U operates on. Whenever U produces as output a tuple (did, m, c′), we append a tuple (tid, itid, m, c′) to O^n_U, to capture the mentions extracted by U. Here, tid is a tuple ID (unique within O^n_U), and itid is the ID of the tuple in I^n_U that specifies the text region from which m is extracted. Hence, tuples are appended to I^n_U and O^n_U in the order they are generated. After executing T over Pn, each IE unit U is associated with two reuse files I^n_U and O^n_U that store intermediate IE results for U for later reuse.
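The bookkeeping above can be sketched in a few lines (a Python sketch of ours; names such as `ReuseCapture` are illustrative, not Delex's actual code, but the tuple layouts follow the schemas just described):

```python
from typing import NamedTuple, List

class InTuple(NamedTuple):
    tid: int    # tuple ID, unique within the input reuse file
    did: int    # document ID
    s: int      # region start position
    e: int      # region end position
    c: tuple    # additional input parameters

class OutTuple(NamedTuple):
    tid: int    # tuple ID, unique within the output reuse file
    itid: int   # ID of the input tuple whose region produced this mention
    m: str      # extracted mention
    c2: tuple   # additional output parameters

class ReuseCapture:
    """In-memory stand-in for the reuse files I^n_U and O^n_U of one IE unit."""

    def __init__(self):
        self.I: List[InTuple] = []
        self.O: List[OutTuple] = []

    def record_input(self, did: int, s: int, e: int, c: tuple) -> int:
        """Capture a region that U operates on; tuples keep generation order."""
        t = InTuple(len(self.I), did, s, e, c)
        self.I.append(t)
        return t.tid

    def record_output(self, itid: int, m: str, c2: tuple) -> None:
        """Capture a mention extracted by U, linked back to its input region."""
        self.O.append(OutTuple(len(self.O), itid, m, c2))
```

The `itid` link is what ties a mention back to the exact region, and parameter values, under which it was produced, so it can later be recycled appropriately.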
To avoid excessive disk writes caused by individual append operations, we use one block of
memory per reuse file to buffer the writes. Whenever a block fills up, we flush the buffered tuples
to the end of the corresponding reuse file. The memory overhead during execution is 2|T| blocks (one per file), where |T| is the number of IE units in T. The I/O overhead, which is the same as the total storage requirement for reuse files, is exactly ∑_{U∈T} (B(I^n_U) + B(O^n_U)) blocks, where B(I^n_U) and B(O^n_U) represent the number of blocks occupied by I^n_U and O^n_U, respectively. Although it is conceivable for an IE unit to produce more mentions than the size of the input document, in practice the number of mentions is usually no larger (and often far smaller) than the input size. Therefore, both the total storage and the I/O overhead are usually bounded by O(|T|·B(Pn)), where B(Pn) denotes the size of Pn in blocks.
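The one-block-per-file buffering scheme can be sketched as follows (a Python sketch of ours; the block size in tuples and the in-memory "file" are stand-ins for the real byte-level block I/O):

```python
class BufferedAppender:
    """Buffers appends to one reuse file, flushing one block at a time.

    BLOCK_TUPLES is a stand-in for the number of tuples that fit in one
    disk block; the real system buffers one block of bytes per file.
    """
    BLOCK_TUPLES = 4

    def __init__(self):
        self.buffer = []
        self.file = []    # stand-in for the on-disk reuse file
        self.flushes = 0  # number of block writes issued

    def append(self, tup):
        self.buffer.append(tup)
        if len(self.buffer) == self.BLOCK_TUPLES:
            self.flush()

    def flush(self):
        if self.buffer:
            self.file.extend(self.buffer)  # one sequential block write
            self.buffer.clear()
            self.flushes += 1

w = BufferedAppender()
for i in range(10):
    w.append(("tuple", i))
w.flush()  # flush the last, possibly partial, block at end of run
```

With a block size of 4 tuples, the 10 appends above trigger only 3 sequential writes instead of 10 individual ones, while preserving the append order of the tuples.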
3.3 Reusing Captured IE Results
We have described how to capture IE results in reuse files while executing a tree T on snapshot
Pn. We now describe how to use these results to speed up executing T over the subsequent snapshot
Pn+1.
3.3.1 Scope of Mention Reuse
As discussed earlier, to reuse, we must match each page p ∈ Pn+1 with pages in the past
snapshots, to find overlapping regions. Many such matching schemes exist. Currently, we match
each page p only with the page q in Pn at the same URL as p. (If q does not exist then we declare
p to have no overlapping regions.) This simplification is based on the observation that pages
with the same URL often change relatively slowly across consecutive snapshots, and hence often
share much overlapping data. Extending Delex to handle more general matching schemes, such as matching within the same Web site, or matching over all pages of all past snapshots, is ongoing work.
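This same-URL matching scope amounts to a dictionary lookup (a Python sketch of ours; representing a snapshot as a list of (url, text) pairs is an assumption for illustration):

```python
def pair_by_url(prev_snapshot, curr_snapshot):
    """Match each current page with the previous-snapshot page at the same URL.

    Snapshots are lists of (url, text) pairs; returns a list of
    (url, curr_text, prev_text_or_None). A page with no same-URL
    predecessor is declared to have no overlapping regions.
    """
    prev_by_url = {url: text for url, text in prev_snapshot}
    return [(url, text, prev_by_url.get(url)) for url, text in curr_snapshot]

p_n = [("a.edu/db", "Midwest DB Courses: CS764 (Wisc)")]
p_n1 = [("a.edu/db", "Midwest DB Courses This Year CS764 (Wisc)"),
        ("b.edu/new", "fresh page")]
pairs = pair_by_url(p_n, p_n1)
```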
3.3.2 Overall Processing Algorithm
Within the above reuse scope, we now discuss how to process Pn+1. Since Pn+1 can be quite
large (tens of thousands or millions of pages), we will scan it only once, and process each page in
turn in memory in a streaming fashion.
In particular, to process tree T on a page p ∈ Pn+1 (once it has been brought into memory), we
need page q ∈ Pn (the previous snapshot) with the same URL, as well as all intermediate IE results
that we have recorded while executing tree T on q. These IE results are scattered in various reuse
files (Section 3.2), which can be large and often do not fit into memory. Consequently, we must
ensure that in accessing intermediate IE results, we do not probe the reuse files randomly. Rather,
we want to read them sequentially and access IE results in that fashion.
The above observation led us to the following algorithm. Let q1, q2, . . . , qk be the order in
which we processed pages in Pn. That is, we first executed T on q1, then on q2, and so on. The way
Figure 3.3 Movement of data between disk and memory during the execution of IE unit U on page p1.
we wrote reuse files, as described earlier in Section 3.2, ensures that the IE results in each reuse
file are stored in the same order. For example, I^n_U stores all input tuples (for U) on page q1 first, then all input tuples on page q2, and so on.
Consequently, we will process pages in Pn+1 following the same order. That is, let pi be the
page with same URL as qi, i = 1, . . . , k. Then we process p1, then p2, and so on. (If a page
p ∈ Pn+1 does not have a corresponding page in Pn, then we can process it at any time, by simply
running extraction on it.) By processing in the same order, we only need to scan each reuse file
sequentially once.
Figure 3.3 illustrates the above idea. Suppose we are about to process page p1 ∈ Pn+1. First,
we read p1 and q1 into memory (buffers B1 and B2 in the figure).
Next, we execute T on p1 in a bottom-up fashion. Consider the execution tree T in Figure 3.2.b. We start by executing IE unit U. To do so, we bring all intermediate IE results recorded while executing U on q1 (back when we processed Pn) into memory. Specifically, let I^n_U(q1) denote the input tuples for U on page q1. Since q1 is the first page in Pn, I^n_U(q1) must appear at the beginning of file I^n_U, and hence can be immediately brought into memory (buffer B3 in Figure 3.3). Similarly, O^n_U(q1), the tuples output by U on page q1, must occupy the beginning of file O^n_U and can be immediately read into memory (buffer B4 in Figure 3.3).
The details of how to execute IE unit U on p1 are presented next in Section 3.3.3. Roughly speaking, we identify overlapping regions between q1 and p1, and leverage I^n_U(q1) and O^n_U(q1) for reuse. Note that I^n_U(q1) and O^n_U(q1) store only the start and end positions of regions in q1, so we need q1 in memory to access these regions. During the execution of U on p1, we produce the input and output tuples of U, I^{n+1}_U(p1) and O^{n+1}_U(p1), in memory (buffers B5 and B6 in Figure 3.3, respectively). As described in Section 3.2, these tuples are also appended to reuse files I^{n+1}_U and O^{n+1}_U.
Once we are done with U (for p1), the memory reserved for I^n_U(q1), O^n_U(q1), and I^{n+1}_U(p1) can be discarded; however, O^{n+1}_U(p1) will be retained in memory until it is consumed by the parent operator or IE unit of U in T (in this case, the join operator in Figure 3.2.b).
Next, we move on to IE unit V. We read in I^n_V(q1) and O^n_V(q1) from the corresponding reuse files I^n_V and O^n_V, and generate I^{n+1}_V(p1) and O^{n+1}_V(p1) in memory. Again, once V finishes, only O^{n+1}_V(p1) needs to stay in memory to provide input to V's parent in T. This process continues until we have executed the entire T.
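The single sequential scan per reuse file can be sketched as follows (a Python sketch of ours; representing a reuse file as an in-memory list of (did, payload) tuples is a simplification of the on-disk files):

```python
def sequential_reuse_scan(reuse_file, page_order):
    """Yield (did, tuples) for each page in processing order.

    reuse_file is appended in the order pages q1, q2, ... were processed
    on the previous snapshot, so if we replay pages in that same order,
    each page's tuples sit at the start of the unread portion of the
    file and a single forward scan suffices.
    """
    pos = 0
    for did in page_order:
        tuples = []
        while pos < len(reuse_file) and reuse_file[pos][0] == did:
            tuples.append(reuse_file[pos])
            pos += 1
        yield did, tuples

reuse = [(1, "r1a"), (1, "r1b"), (2, "r2a"), (3, "r3a")]
out = list(sequential_reuse_scan(reuse, [1, 2, 3]))
```

The `pos` cursor only ever moves forward, which is exactly why processing pages in the previous snapshot's order lets each reuse file be read sequentially once.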
Once the entire T finishes execution on p1, we move on to process T on page p2, then p3,
and so on. Note that each time we process a page pi, the intermediate IE results of qi will be at
the start of the unread portion of the reuse files, and thus can be read in easily. Consequently,
we only have to scan each reuse file once during the entire run over Pn+1. The total number of I/Os is thus ∑_{U∈T} (B(I^n_U) + B(O^n_U) + B(I^{n+1}_U) + B(O^{n+1}_U)) + B(Pn) + B(Pn+1), i.e., one pass over the current and previous corpus snapshots and all reuse files for the two snapshots. At any point in time (say, when executing IE unit U on page pi), we only need to keep in memory pi, qi, I^n_U(qi), O^n_U(qi), I^{n+1}_U(pi), O^{n+1}_U(pi), as well as O^{n+1}_{U′}(pi) for any child U′ of U. Therefore, the maximum memory requirement for the algorithm (not counting memory needed for buffering writes to reuse files discussed in Section 3.2, or by the IE units and relational operators themselves) is O(max_i(B(pi) + B(qi) + (F(T) + 1)·max_{U∈T}(B(I^n_U(qi)), B(O^n_U(qi)), B(I^{n+1}_U(pi)), B(O^{n+1}_U(pi))))) blocks, where F(T) is the maximum fan-in of T. In practice, under the reasonable assumption that the total size of the extracted mentions is linear in the size of the input page, the memory requirement comes down to O((F(T) + 1)·max_i(B(pi) + B(qi))).
Figure 3.4 An illustration of executing an IE unit.
3.3.3 IE Unit Processing
We now describe in more detail how to execute an IE unit U on a particular page p (in snapshot
Pn+1), whose previous version is q (in snapshot Pn). The overall algorithm is depicted in Figure
3.4.
We start with I^{n+1}_U(p), the set of input tuples to U. Each input tuple (tid, did, s, e, c) ∈ I^{n+1}_U(p) represents a text region [s, e] of page p to which we want to apply U, with additional input parameter values c. There are two cases. If U has a child in T, this set is produced by the execution of the child. If U is a leaf in T, which operates directly on page p, there is only one input tuple (did, s, e, c), where did is the ID of p, s and e are set to 0 and the length of p, respectively, and c denotes all other input parameters.
To identify reuse opportunities, we consult I^n_U(q), which contains the input tuples to U when it executed on q. This set is read in from the reuse file I^n_U as discussed in Section 3.3.2. Each tuple in I^n_U(q) has the form (tid′, did′, s′, e′, c′), where did′ is the ID of q, and c′ records the values of additional input parameters that U took when applied to region [s′, e′] of q. To find results to reuse for an input tuple (did, s, e, c) ∈ I^{n+1}_U(p), we "match" the region [s, e] of p with the regions of q encoded by tuples in I^n_U(q) with c′ = c. This matching is done using one of the matchers described in Section 3.3.4 (Section 3.4 discusses how to select a good matcher).
We repeat the matching step for each input tuple in I^{n+1}_U(p) to find its matching input tuples in I^n_U(q). From the corresponding pairs of matching regions in p and q, as well as the scope and context properties of U (Section 3.1), we derive the extraction regions and copy regions. Because of space constraints, we do not discuss the derivation process further, but instead refer the reader to [16] for details.
Extraction regions require new work: we run U over these regions of p. Copy regions represent reuse. If a copy region is derived from an input tuple (tid′, did′, s′, e′, c′) ∈ I^n_U(q), we find the joining output tuples (with the same tid′) in O^n_U(q). Recall that O^n_U(q) contains the output tuples of U when it executed on q; this set is read in from the reuse file O^n_U as discussed in Section 3.3.2. The O^n_U(q) tuples with tid′ represent the mentions extracted from region [s′, e′] of q, which can be reused by U to produce output tuples for the corresponding copy region.
Regardless of how U produces its output tuples (through reuse or new execution), they are appended to the reuse file O^{n+1}_U (as described in Section 3.2), and kept in memory until consumed by a parent operator or IE unit in T (as described in Section 3.3.2).
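Under simplifying assumptions, the copy-versus-extract logic looks roughly as follows (a Python sketch of ours; it skips the scope/context-based derivation of regions described above and treats every overlap as a copy region):

```python
def execute_ie_unit(extract, page, overlaps, old_mentions):
    """Run an IE unit on `page`, reusing old mentions on overlapping regions.

    extract(text) -> list of (start, end, mention) found from scratch
    overlaps      -> list of (p_start, p_end, q_start): regions of the new
                     page that also appeared verbatim in the old page
    old_mentions  -> list of (q_start, q_end, mention) recorded for q

    Simplified: Delex derives copy and extraction regions from the unit's
    scope/context properties (and enlarges extraction regions by the
    context so boundary mentions are not missed); here we just copy any
    old mention wholly inside an overlap and re-extract the rest.
    """
    results, covered = [], []
    for p_s, p_e, q_s in overlaps:
        shift = p_s - q_s
        q_e = q_s + (p_e - p_s)
        for m_s, m_e, m in old_mentions:
            if q_s <= m_s and m_e <= q_e:                 # copy region: reuse
                results.append((m_s + shift, m_e + shift, m))
        covered.append((p_s, p_e))
    pos = 0                                               # extraction regions
    for p_s, p_e in sorted(covered):
        if pos < p_s:
            results += [(s + pos, e + pos, m)
                        for s, e, m in extract(page[pos:p_s])]
        pos = max(pos, p_e)
    if pos < len(page):
        results += [(s + pos, e + pos, m) for s, e, m in extract(page[pos:])]
    return sorted(results)

def find_wisc(text):
    """Toy extractor: every occurrence of the location string "Wisc"."""
    out, i = [], text.find("Wisc")
    while i != -1:
        out.append((i, i + 4, "Wisc"))
        i = text.find("Wisc", i + 1)
    return out

page = "XXWiscYYWisc"
res = execute_ie_unit(find_wisc, page,
                      overlaps=[(0, 6, 0)],            # page[0:6] matched q[0:6]
                      old_mentions=[(2, 6, "Wisc")])   # recorded on q
```

In this toy run, the first "Wisc" is copied (shifted) from the old results and only the tail of the page is re-extracted.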
3.3.4 Identifying Reuse With Matchers
Delex currently employs four matchers—DN, UD, ST, and RU—for matching regions be-
tween two pages (more matchers can be easily plugged in as they become available). We describe
the first three matchers here only briefly, since they come from Cyclex. Then, we focus on RU, a
novel contribution of Delex that allows sharing the work of matching across IE units.
Given two text regions R (of page p ∈ Pn+1) and S (of page q ∈ Pn) to match, DN immediately declares that the two regions have no matching portions, incurring zero running time. Using DN thus amounts to applying IE from scratch to R. UD employs a Unix-diff-like algorithm [58]. It is relatively fast (taking time linear in |R| + |S|), but finds only some matching regions. ST is a suffix-tree-based matcher, which finds all matching regions of R in time linear in |R| + |S|. We do not discuss these Cyclex matchers further; see [16] for more details.
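For intuition, a UD-style matcher can be approximated with Python's difflib, which likewise runs fast but may miss some matching blocks (this is our stand-in for illustration, not the matcher used in Delex):

```python
from difflib import SequenceMatcher

def ud_match(r, s):
    """Unix-diff-style matcher: return maximal matching blocks between
    region r (new page) and region s (old page) as (r_start, s_start, length)
    triples. Fast, but like UD it may miss some matching regions (e.g.,
    crossing or repeated blocks).
    """
    sm = SequenceMatcher(None, r, s, autojunk=False)
    return [(m.a, m.b, m.size) for m in sm.get_matching_blocks() if m.size > 0]

blocks = ud_match("Midwest DB Courses This Year CS764",
                  "Midwest DB Courses: CS764")
```

Here the shared prefix "Midwest DB Courses" and the shared suffix " CS764" are found as matching blocks; mentions previously extracted inside those regions become candidates for reuse.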
The development of RU is based on the observation that we can often avoid repeating much
of the matching work for different IE units. This opportunity does not arise in Cyclex because
Cyclex considers only a single IE blackbox. To illustrate the idea in a multi-blackbox setting,
consider again executing tree T of Figure 3.2.b on page p ∈ Pn+1, and suppose that we execute
IE units U , V , Y , and Z, in that order. During U ’s execution we would have matched page p with
page q ∈ Pn with the same URL to find overlapping regions on which we can reuse mentions.
Now consider executing V . Here, we would need to match p and q again; clearly, we should
take advantage of the matching work we have already performed on behalf of U . Next, consider
executing Y . Here, we often have to match a region R of p with a set of regions S1, . . . , Sk of
q (as described in Section 3.3.3) to detect overlapping regions (on which we can reuse mentions
produced by Y on page q). However, since we have already matched p with q while executing U ,
we should be able to leverage that result to quickly find all overlapping regions between R of p and
Si of q.
In general, since all regions to be matched by IE units of an execution tree come from two pages
(one from Pn and the other from Pn+1), and since IE units often match successively smaller regions
that are extracted from longer regions (matched by lower IE units), it follows that higher-level IE
units can often reuse matching results of lower ones, as described earlier.
We now briefly describe RU, a novel matcher that draws on this idea. While T executes on
a page p, RU keeps track of a triple (R, S, O) whenever an ST or UD matcher has matched a
region R of p with a region S of q and found overlapping regions O. Now suppose an IE unit X
calls RU to match two regions R′ and S ′. RU computes the intersection of R′ with all recorded
R regions, the intersection of S ′ with all recorded S regions, and then uses these intersections and
the recorded overlapping regions O to quickly compute the set of overlapping regions for R′ and
S ′. We omit further details for space reasons.
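The core of RU's interval arithmetic can be sketched as follows (a Python sketch of ours; the triple representation and the clipping logic are simplified from the actual matcher):

```python
def ru_match(r_query, s_query, recorded):
    """Reuse recorded matching results instead of re-running a matcher.

    r_query, s_query: (start, end) regions of the new and old page.
    recorded: list of (r_region, s_region, overlaps) triples saved when
              ST or UD matched larger regions of the same two pages;
              each overlap is (r_start, s_start, length).
    Returns the recorded overlaps clipped to the queried regions.
    """
    out = []
    for (r_lo, r_hi), (s_lo, s_hi), overlaps in recorded:
        if r_query[0] >= r_hi or r_query[1] <= r_lo:
            continue  # the recorded region does not intersect the query
        for o_r, o_s, n in overlaps:
            # Clip the recorded overlap to both query windows.
            s_shift = o_s - o_r
            lo = max(o_r, r_query[0], s_query[0] - s_shift)
            hi = min(o_r + n, r_query[1], s_query[1] - s_shift)
            if lo < hi:
                out.append((lo, lo + s_shift, hi - lo))
    return out

# Suppose ST matched p[0:100] with q[0:100] and found one 40-char overlap.
recorded = [((0, 100), (0, 100), [(0, 0, 40)])]
hits = ru_match((10, 30), (10, 30), recorded)
```

Because only intersections are computed, RU's running time is negligible compared to re-running ST or UD on the smaller regions.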
The four matchers in Delex make different trade-offs between result completeness and runtime
efficiency. The next section discusses how Delex assigns appropriate matchers to IE units, thereby
selecting a good IE plan.
3.4 Selecting a Good IE Plan
Given an execution tree T , we now discuss how to select appropriate matchers for T using
a cost-based approach. We first describe the space of alternatives, then our cost-driven search
strategy, and finally the cost model itself.
Figure 3.5 IE chains and sharing the work of matching across them.
3.4.1 Space of Alternatives
For each corpus snapshot, we consider assigning a matcher to each IE unit of tree T , and then
use the so-augmented tree to process pages in the snapshot. Let |T | be the number of IE units in
T, and k be the number of matchers available to choose from (Section 3.3.4). We would have a total of up to k^|T| alternatives. For ease of exposition, we will refer to such an alternative as an IE plan
whenever there is no ambiguity.
Note that we could make the choice of matchers at even finer levels, such as whenever we must
match two regions (while executing T on a page p). However, such low-level assignments would
produce a vast plan space that is practically unmanageable. Hence, we assign matchers only at the
IE-unit level. Even at this level, the plan space is already huge, ranging from 1 million plans for
10 IE units and four possible matchers, to 1 billion plans for 15 IE units, and beyond.
Furthermore, for most plans in this space, optimization is not “decomposable,” in that “gluing”
the locally optimized subplans together does not necessarily yield a globally optimized plan. The
following example illustrates this point.
Example 3.3. Consider a plan of two IE units A(B), where we apply A to the output of B. When
optimizing A and B in isolation, we may find that matcher UD works best for both. So the best
global plan appears to be applying UD to both units. However, when optimizing A(B) as a whole, we may find that applying ST to A and RU to B produces a better plan. The reason is that for A, ST may be more expensive (i.e., take longer to run) than UD, but it generates more matching regions, and B can then use RU to recycle these regions at a very low cost.
For the above reasons, we did not look for an exact algorithm that finds the optimal plan.
Rather, as a first step, in this chapter we develop a greedy solution that can quickly find a good
plan in the above huge plan space. We now describe this solution.
3.4.2 Searching for Good Plans
Our solution breaks tree T into smaller pieces, finds a good plan for some initial pieces, and
iteratively builds on them to find a good plan to cover other pieces until the entire T is covered. To
describe the solution, we start with the concept of IE chain:
Definition 3.3 (IE Chain). An IE chain is a path in tree T such that (a) the path contains a sequence
of IE units A1, · · · , Ak, (b) the path begins with A1 and ends with Ak, (c) between each pair of
adjacent IE units Ai and Ai+1, there are no other IE units, and Ai extracts mentions from regions
output by Ai+1, and (d) the chain is maximal in that we cannot add another IE unit to its beginning
or end and obtain another chain satisfying the above properties.
For example, an IE execution tree extractTopics(extractAbstract(d, abstract)) is itself a
chain because the IE unit extractAbstract extracts abstracts from a document d, and then feeds
them to IE unit extractTopics, which in turn extracts topic strings from the abstract.
Note that the above definition allows two adjacent IE units to be connected indirectly by re-
lational operators that do not belong to any IE units. For example, the chain C1 in Figure 3.5.a
consists of the sequence of IE units Z, Y , U , where Y and U are connected by project-join (and Y
extracts mentions from a text region output by U ).
It is relatively straightforward to partition any execution tree T into a set of IE chains. Fig-
ure 3.5.a shows for example a partition of such a tree into two chains C1 and C2. Note that this
is also the only possible partition created by Definition 3.3, given that Y extracts mentions only
from a text region output by U (not from any text region output by V ). In general, given a tree T ,
Definition 3.3 creates a unique partition of T into IE chains.
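The partition can be computed mechanically (a Python sketch of ours; representing the "extracts mentions from" relation as a child map is an assumption, with intervening relational operators already skipped):

```python
def partition_into_chains(extracts_from):
    """Partition IE units into maximal chains.

    extracts_from maps each IE unit to the IE unit it extracts mentions
    from (skipping intervening relational operators), or to None if it
    reads the raw page. Each chain starts at a unit no one extracts from
    and follows the relation downward until it hits the raw page.
    """
    child_of = dict(extracts_from)
    has_parent = set(child_of.values()) - {None}
    chains = []
    for top in child_of:
        if top in has_parent:
            continue  # not the top of a maximal chain
        chain, u = [], top
        while u is not None:
            chain.append(u)
            u = child_of.get(u)
        chains.append(chain)
    return chains

# The tree of Figure 3.5.a: Z extracts from Y, Y from U, U from the raw
# page; V also reads the raw page, and nothing extracts from V.
chains = partition_into_chains({"Z": "Y", "Y": "U", "U": None, "V": None})
```

On the Figure 3.5.a tree this yields exactly the two chains C1 = (Z, Y, U) and C2 = (V) discussed above.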
We define the concept of IE chain because, within each chain, it is relatively easy to find a
good local plan, as we will see later. Unfortunately, we cannot just find these locally optimal plans
independently, and then assemble them together to form a good global plan. The reason is that
chains can reuse results of other chains, and this reuse often leads to a substantially better plan
(than one that does not exploit reuse across chains), as the following example illustrates.
Example 3.4. Suppose we have found a good plan for chain C1 in Figure 3.5.a, and this plan
applies matcher ST for IE unit U . That is, for each page p in snapshot Pn+1, U applies ST to
match p with q, the page with the same URL in Pn. Assuming that the running time of matcher RU
is negligible (which it is in practice), the best local plan for chain C2 is to apply matcher RU in IE
unit V . Since V must also match p and q, RU will enable V to recycle matching results of U , with
negligible cost.
Thus, optimality of IE chains is clearly “interdependent.” To take such interdependency into
account and yet keep the search manageable, we start with one initial chain, find a good plan for it in isolation, then extend this plan to cover the next chain, taking into account cross-chain reuse, and so on, until we have covered all chains. Our concrete algorithm is as follows (Algorithm 3.1 shows the full pseudocode).
1. Sort the IE Chains: Using the cost model (see the next subsection), we estimate the cost of
each IE chain if extraction were to be performed from scratch in all IE units of the chain. We
then sort the chains in decreasing order of this cost. Without loss of generality, let this order be
C1, . . . , Ch.
2. Find a Good Plan g for the First Chain: Since the first chain is the most expensive, we give
it the maximum amount of freedom in choosing matchers. To do so, we enumerate the following
set of plans for the first chain C1 (based on the heuristics that we explain below):
1. a plan that assigns matcher DN to all IE units of C1;
2. all plans that assign ST to an IE unit U of C1, RU to all “ancestor” IE units of U , and DN to
all “descendant” IE units of U ;
3. all plans that assign UD to an IE unit U of C1, RU to all “ancestor” IE units of U , and DN
to all “descendant” IE units of U .
We then use the cost model to select the best plan g from the above set.
Since the cost of RU is negligible in practice (as remarked earlier), it is easy to prove that the
above set of plans dominates the set M of plans where each plan employs matchers ST and UD at
most once, i.e., at most one IE unit in the plan is assigned a matcher that is either ST or UD. Thus,
the plan we select will be the best plan from M.
We do not examine a larger set of plans because any plan outside M would contain at least
either two ST matchers, or two UD matchers, or an ST matcher together with a UD matcher. Since
the costs of these matchers are not negligible, our experiments suggest that plans with two or more
such matchers tend to incur high overhead. In particular, they usually underperform plans where
we apply just one such expensive matcher relatively early in the chain, and then apply only the RU
matcher afterward. For this reason, we currently consider only the plan space M.
3. Extend Plan g to Cover the Second Chain: First, we repeat the above Step 2 (but replacing
C1 with C2), to find a good plan g′ for the second chain C2.
Next, let U be the bottom IE unit of chain C1. Suppose the best plan g for C1 assigns either
matcher ST or UD to U . Then we can potentially reuse the results of this matcher for C2 (if C2
is executed later than C1 in T ). Hence, we consider a reuse-across-chains plan g′′ that assigns
matcher RU to all IE units of C2 (and directing them to reuse from IE unit U of C1).
We then compare the estimated cost of g′ and g′′, and select the cheaper one as the best plan
found for chain C2.
4. Cover the Remaining Chains Similarly: We then repeat Step 3 to cover the remaining
chains. In general, for a chain Ci, we could have as many reuse-across-chains plans as the number
of chains in the set {C1, . . . , Ci−1} that assign matcher ST or UD to their bottom IE units.
Example 3.5. Figure 3.5.b depicts a situation where we have found the best plans for chains
C1, C2, and C3. These plans have assigned matchers UD, DN, and ST to the bottom IE units
U1, U2, and U3, respectively. Then, when considering chain C4, we will create two reuse-across-
chains plans: the first one reuses the results of matcher UD of U1, and the second reuses the results
of matcher ST of U3 (see the figure).
Algorithm 3.1 Searching for Execution Plan
1: Input: IE execution tree T
2: Output: execution plan G
3: C ⇐ partition T   // C is a set of chains
4: C1, · · · , Ch ⇐ sort C in decreasing order of cost estimate
5: g1 ⇐ findBest(C1)
6: G ⇐ {g1}
7: for 2 ≤ i ≤ h do
8:    g′i ⇐ findBest(Ci)
9:    B ⇐ bottom IE units for all chains in G
10:   if (any U ∈ B has the raw data page as input and is assigned ST or UD) then
11:      g′′i ⇐ assign RU to all IE units of Ci, reusing the matching results of U
12:      gi ⇐ select g′i or g′′i with the smaller cost estimate
13:      G ⇐ G ∪ {gi}
14:   else
15:      G ⇐ G ∪ {g′i}
16:   end if
17: end for

Procedure findBest(Ci)
1: Input: chain Ci = A1(A2(· · · (Ak) · · · ))
2: Output: best execution plan for Ci in Mi, where Mi is the set of plans each having at most one IE unit Aj, 1 ≤ j ≤ k, assigned matcher ST or UD
3: M′i ⇐ ∅
4: g ⇐ assign DN to each Aj, 1 ≤ j ≤ k
5: M′i ⇐ M′i ∪ {g}
6: for 1 ≤ j ≤ k do
7:    g ⇐ assign ST to Aj, RU to Am, 1 ≤ m < j, and DN to An, j < n ≤ k
8:    M′i ⇐ M′i ∪ {g}
9:    g ⇐ assign UD to Aj, RU to Am, 1 ≤ m < j, and DN to An, j < n ≤ k
10:   M′i ⇐ M′i ∪ {g}
11: end for
12: for each g ∈ M′i, estimate its cost using the cost model
13: return the g with the smallest cost estimate
Once we have covered all the chains, we have found a reasonable plan for execution tree T .
Our experiments in Section 3.6 show that such plans prove quite effective on our real-world data
sets.
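Procedure findBest's candidate enumeration can be sketched as follows (a Python sketch of ours; the toy per-matcher costs stand in for the cost model of Section 3.4.3):

```python
def find_best(chain, cost):
    """Enumerate the candidate plan space of procedure findBest for one chain.

    `chain` lists IE units A1..Ak from top to bottom; `cost` maps a plan
    (a unit -> matcher dict) to a cost estimate. Candidates: the all-DN
    plan, plus every plan that places ST or UD at one unit Aj, RU at all
    units above it, and DN at all units below it.
    """
    candidates = [{a: "DN" for a in chain}]
    for j in range(len(chain)):
        for expensive in ("ST", "UD"):
            candidates.append(
                {a: ("RU" if m < j else expensive if m == j else "DN")
                 for m, a in enumerate(chain)})
    return min(candidates, key=cost)

# Toy per-matcher costs (illustrative only): DN forces full re-extraction,
# ST/UD do real matching work, RU recycles earlier matching almost for free.
toy_cost = {"DN": 10, "ST": 6, "UD": 4, "RU": 1}
best = find_best(["A1", "A2"],
                 lambda plan: sum(toy_cost[m] for m in plan.values()))
```

Under these toy costs the winner places UD at the bottom unit and RU above it, mirroring the heuristic of applying one expensive matcher early in the chain and recycling its results afterward.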
a   Average number of input tuples in I_U per page
b   Size of I_U on disk (in blocks)
c   Size of O_U on disk (in blocks)
d   Size of all pages on disk (in blocks) in a snapshot
l   Average length of a region encoded by an input tuple
m   Number of pages in a single snapshot
v   Number of buckets in the in-memory hash table of copy regions
(a) Metadata statistics

f   Fraction of pages with an earlier version in the previous snapshot
s   Number of times a matcher is invoked on a region encoded by an input tuple
g   After matching region R, the ratio of resulting extraction regions to R (in length)
h   Number of copy regions generated from matching a region
(b) Selectivity statistics

Figure 3.6 Cost model parameters.
3.4.3 Cost Model
We now describe how to estimate the runtime of an execution plan. Since the difference among
all plans is how they execute the IE units of tree T (Section 3.4.1), we focus on the cost incurred by
executing IE units, and ignore other costs. Therefore, we estimate the cost of a plan to be ∑_{U∈T} t_U, where t_U denotes the elapsed time of executing the IE unit U.
For an IE unit U, we further model t_U as the sum of the elapsed times of the steps involved in
executing U (Section 3.3.3). We model the elapsed time of each step as a weighted sum of I/O
and CPU costs to capture the elapsed times of highly tuned implementations that overlap I/O with
CPU computation (in which case, the dominated cost component will be completely masked and
therefore have weight 0) as well as simple implementations that do not exploit parallelism.
To model t_U, our cost model employs three categories of parameters. The first category (listed in Figure 3.6.a) consists of metadata about data pages and intermediate results. For these parameters, we use subscript n to represent the value of the parameter on snapshot n. For example, a_n denotes the average number of input tuples in I^n_U(q) for a page q ∈ Pn.
The second category (listed in Figure 3.6.b) consists of selectivity statistics of a matcher. The last category consists of the I/O and CPU cost weights w, whose subscripts reflect which step incurs the associated costs. For all parameters, we use hatted variables to represent estimated values.
We now describe t_U, which consists of four cost components incurred in executing U. The first component is the cost of identifying the regions encoded by input tuples (tid, did, s, e, c) ∈ I^{n+1}_U and (tid′, did′, s′, e′, c′) ∈ I^n_U where c = c′. We model this component as:

w_{1,IO} · b_n + w_{1,find} · a_n · a_{n+1} · m_{n+1} · f    (3.1)

The term w_{1,IO} · b_n models the I/O cost of reading I^n_U into the buffer. The term a_n · a_{n+1} · m_{n+1} · f models the total number of comparisons between arguments c and c′ for input tuples in I^n_U and I^{n+1}_U, respectively.
The second cost component is the cost of matching the regions identified in the first step. We model this component as:

w_{2,IO} · d_n · f + w_{2,mat} · a_{n+1} · m_{n+1} · f · s · l    (3.2)

This model accounts for the I/O cost of reading in pages of Pn and the CPU cost of applying matchers. The term d_n · f estimates the size (in disk blocks) of the raw data pages in Pn that share the same URL with pages in Pn+1, since we only match pages with the same URL (see Section 3.3.1). The term a_{n+1} · m_{n+1} · f · s estimates the total number of times we apply the matcher when executing U on Pn+1.
The third cost component is the cost of applying U to all extraction regions. We model this component as:

w_{3,ex} · (a_{n+1} · m_{n+1} · (1 − f) · l + a_{n+1} · m_{n+1} · f · l · g)    (3.3)

We must apply U to those input tuples (in I^{n+1}_U) on pages in Pn+1 that do not have an earlier version in Pn. The term a_{n+1} · m_{n+1} · (1 − f) · l estimates the total length of the regions encoded in those tuples. In addition, we also need to apply U to the extraction regions on pages in Pn+1 that do have an earlier version in Pn. The term a_{n+1} · m_{n+1} · f · l · g estimates the length of these extraction regions. In particular, g measures, on average, the fraction of a region to which we still need to apply U after matching it with a matcher.
The last cost component is the cost of reusing output tuples for copy regions. We model this component as:

w_{4,IO} · c_n + w_{4,copy} · a_n · m_n · (a_{n+1} · m_{n+1} · f · h) / v    (3.4)

The formula models the I/O cost of reading in O^n_U and the CPU cost of probing the copy regions to determine whether to copy each mention. Delex stores the copy regions in a hash table to facilitate fast lookups. The term (a_{n+1} · m_{n+1} · f · h) / v estimates the number of hash table entries per bucket.
Notice that we ignore the costs of reading the raw data pages in Pn+1 and writing out the
intermediate results and the final target relation, since these costs are the same for all plans.
Given the cost model, we then estimate the parameters using a small sample S of Pn+1 as well
as the past k snapshots, for a pre-specified k. Since our parameter estimation techniques are similar
to those in Cyclex, we do not discuss the details any further.
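Putting the four components together (a Python sketch of ours; parameter names follow Figure 3.6, with a_n1 standing for a_{n+1}, the weight and parameter values are illustrative, and we read the last term of (3.4) as divided by v, the number of hash buckets):

```python
def estimate_unit_cost(p, w):
    """Estimate t_U as the sum of the four cost components (3.1)-(3.4)."""
    identify = (w["1,IO"] * p["b_n"]
                + w["1,find"] * p["a_n"] * p["a_n1"] * p["m_n1"] * p["f"])      # (3.1)
    match = (w["2,IO"] * p["d_n"] * p["f"]
             + w["2,mat"] * p["a_n1"] * p["m_n1"] * p["f"] * p["s"] * p["l"])   # (3.2)
    extract = w["3,ex"] * (p["a_n1"] * p["m_n1"] * (1 - p["f"]) * p["l"]
                           + p["a_n1"] * p["m_n1"] * p["f"] * p["l"] * p["g"])  # (3.3)
    copy = (w["4,IO"] * p["c_n"]
            + w["4,copy"] * p["a_n"] * p["m_n"]
            * (p["a_n1"] * p["m_n1"] * p["f"] * p["h"]) / p["v"])               # (3.4)
    return identify + match + extract + copy

# Illustrative values only (not measured statistics or calibrated weights).
params = dict(a_n=1, a_n1=1, m_n=1, m_n1=1, f=1.0, b_n=1, c_n=1, d_n=1,
              l=1, s=1, g=0.5, h=1, v=1)
weights = {k: 1.0 for k in
           ("1,IO", "1,find", "2,IO", "2,mat", "3,ex", "4,IO", "4,copy")}
t_u = estimate_unit_cost(params, weights)
```

Summing `estimate_unit_cost` over all IE units of a plan gives the plan's estimated cost, which is what the greedy search of Section 3.4.2 compares.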
3.5 Putting It All Together
We now describe the end-to-end Delex solution. Given an IE program P written in xlog, we
first employ the techniques described in [67] to translate and optimize P into an execution tree T,
and then pass T to Delex.
Given a corpus snapshot Pn+1, Delex first employs the optimization technique described in
Section 3.4 to assign matchers to the IE units of T . Next, Delex executes the so-augmented tree
T on Pn+1, employing the reuse algorithm described in Section 3.3 and the reuse files it produced
for snapshot Pn. During execution, it captures and stores intermediate IE results (for reuse in the
subsequent snapshot Pn+2), as described in Section 3.2.
Note that Delex executes essentially the same plan tree T on all snapshots. The only aspect of
the plan that changes across snapshots is the matchers assigned to the IE units. Our experiments
in Section 3.6 show that for our real-world data sets this scheme already performs far better than
current solutions (e.g., applying IE from scratch, running Cyclex, reusing IE results on duplicate
Data Sets                   DBLife     Wikipedia
# Data Sources              980        925
Time Between Snapshots      2 days     21 days
# Snapshots                 15         15
Avg # Pages per Snapshot    10155      3038
Avg Size per Snapshot       180M       35M

(a) Data sets for our experiments.
IE Program for DBLife                   # IE "Blackboxes"   α (in char.)   β (in char.)
advise (advisor, advisee, topics)       5                   20539          12
chair (person, chairType, conference)   3                   9458           5
talk (speaker, topics)                  1                   155            9

IE Program for Wikipedia                # IE "Blackboxes"   α (in char.)   β (in char.)
blockbuster (movie)                     2                   10625          15
award (actor, movie, role, award)       6                   30506          7
play (actor, movie)                     4                   22705          7

(b) IE programs for our experiments.

Figure 3.7 Data sets and IE programs for our experiments.
pages). Exploring more complex schemes, such as re-optimizing the IE program P for each snap-
shot or re-assigning the matchers for different pages, is a subject of ongoing work. The following
theorem states the correctness of Delex:
Theorem 3.1 (Correctness of Delex). Let Mn+1 be mentions of the target relation R obtained by
applying IE program P from scratch to snapshot Pn+1. Then Delex is correct in that when applied
to Pn+1 it produces exactly Mn+1.
Proof. Let U be an IE blackbox in P and O^{n+1}_U be the output of U produced by re-applying U from scratch to Pn+1. In a way similar to the Cyclex proof, we can show that Delex produces exactly O^{n+1}_U for U when it is applied to Pn+1. Since Delex produces the correct output for each IE blackbox in P, it is easy to show that Delex produces exactly Mn+1.
3.6 Empirical Evaluation
We now empirically evaluate the utility of Delex. Figure 3.7 describes two real-world data sets
and six IE programs used in our experiments. DBLife consists of 15 snapshots from the DBLife
Figure 3.8 The execution plan used in our experiments for the "award" IE task (IE blackboxes: exBioSection, exActor, exAwardSection, exAwardItem, exAward, exRole).
system [31], and Wikipedia consists of 15 snapshots from Wikipedia.com (Figure 3.7.a). The three
DBLife IE programs extract mentions of academic entities and their relationships, and the three
Wikipedia IE programs extract mentions of entertainment entities and relationships (Figure 3.7.b).
Figure 3.8 shows for example the execution plan used in our experiments for the “award” IE task
(with IE blackboxes shown in bold font). The above IE programs are rule-based. However, we also
experimented with an IE program consisting of multiple learning-based blackboxes, as detailed at
the end of this section.
We obtained the scope α and context β of each IE blackbox and the entire IE program by
analyzing the IE blackboxes and their relationships. The appendix describes this analysis in detail.
Runtime Comparison: For each of the six IE tasks in Figure 3.7.b, Figure 3.9 shows the runtime
of Delex vs. that of other possible baseline solutions over all consecutive snapshots. We consider
three baselines: No-reuse, Shortcut, and Cyclex. No-reuse re-executes the IE program over all
pages in a snapshot; Shortcut detects identical pages, then reuses IE results on those; and Cyclex
treats the whole IE program as a single IE blackbox.
On DBLife, No-reuse incurred much more time than the other solutions. Hence, to clearly
show the differences in the runtimes of all solutions, we only plot the runtime curves of Shortcut,
Cyclex, and Delex on DBLife (the left side of Figure 3.9). Since in each snapshot both Cyclex and
Delex employ a cost model to select and execute a plan, their runtime includes statistics collection,
optimization, and execution times.

[Figure 3.9: per-snapshot runtime plots (snapshots 1-15) for the six IE tasks, with the DBLife tasks (talk, chair, advise) on the left and the Wikipedia tasks (award, blockbuster, play) on the right. On DBLife, No-reuse is omitted from the plots; its runtime varies from 14,685 to 15,231 seconds on talk, 14,526 to 14,702 seconds on chair, and 17,462 to 18,252 seconds on advise.]
Figure 3.9 Runtime of No-reuse, Shortcut, Cyclex, and Delex.
Figure 3.9 shows that, in all cases, No-reuse (i.e., rerunning IE from scratch) incurs large
runtimes, while Shortcut shows mixed performance. On DBLife, where 96-98% of pages re-
main identical on consecutive snapshots, it performs far better than No-reuse. But on Wikipedia,
where many pages tend to change (only 8-20% pages remain identical on consecutive snapshots),
Shortcut is only marginally better than No-reuse. In all cases, Cyclex performs comparably or
significantly better than Shortcut.
Delex, however, outperforms all of the above solutions. For the "talk" task, where the IE program
contains a single IE blackbox, Delex performs as well as Cyclex. For all the remaining tasks,
where the IE program contains multiple IE blackboxes, Delex significantly outperforms Cyclex,
cutting runtime by 50-71%. These results suggest that Delex was able to exploit the compositional
nature of multi-blackbox IE programs to enable more reuse, thereby significantly speeding up
program execution.
Contributions of Components: Figure 3.10 shows the runtime decomposition of the above
solutions (numbers in the figure are averaged over five random snapshots per IE task). "Match"
is the total time of applying all matchers in the execution tree. "Extraction" is the total time to
apply all IE extractors. "Copy" is the total time to copy mentions. "Opt" is the optimization time
of Cyclex and Delex. Finally, "Others" is the remaining time (to apply relational operators, read
file indices, etc.).

[Figure 3.10: per-solution breakdown into Match, Extraction, Copy, Opt, and Others. The total runtimes (in seconds) shown in the figure are:

                 No-reuse  Shortcut  Cyclex  Delex
   talk             11552       904     414    414
   chair            14658       878     755    473
   advise           14805      1151    1068    458
   blockbuster        505       491     386    203
   play               870       865     751      —
   award             1575      1496    1575    312  ]
Figure 3.10 Runtime decomposition of No-reuse, Shortcut, Cyclex, and Delex.
The results show that matching and extracting dominate runtimes. Hence we should focus on
optimizing these components, as we do in Delex. Furthermore, Delex spends more time on match-
ing and copying than Cyclex and Shortcut in complex IE programs (e.g., “play” and “award”).
However, this effort clearly pays off (e.g., reducing the extraction time by 37-85%). Finally, the
results show that Delex incurs insignificant overhead (optimization, copying, etc.) compared to its
overall runtime.
We also found that in certain cases the best plan (one that incurs the least amount of time)
employs RU matchers, and that the optimizer indeed selected such plans (e.g., for “chair” and
“advise” IE tasks), thereby significantly cutting runtime (see the left side of Figure 3.9). This
suggests that reusing across IE units can be highly beneficial in our Delex context.
Effectiveness of the Delex Optimizer: To evaluate the Delex optimizer, we enumerate all
possible plans in the plan space, and then compare the runtimes of the best plan versus the one
selected by the optimizer. To conduct the experiment, we first selected the “play” IE task, whose
plan space contains 256 plans, thereby enabling us to enumerate and run all of them. We then ranked
the plans in increasing order of their actual runtimes. Figure 3.11.a shows the position in this
ranking of the plan selected by the optimizer, over five snapshots. The results show that the
optimizer consistently selected a good plan (ranked third or fifth). Figure 3.11.b shows
the runtime of the actual best plan, the selected plan, and the worst plan, again over the same five
snapshots. The results show that the selected plan performs quite comparably to the best plan, and
that optimization is important, given the significantly varying runtimes of the plans.

[Figure 3.11: for the "play" task, the rank of the plan picked by Delex among all plans, and the runtime of the best plan, the plan picked by Delex, and a bad plan, over snapshots 3, 6, 9, 12, and 15.]
Figure 3.11 Performance of the optimizer.

[Figure 3.12: runtime of the plans selected by the Cyclex and Delex optimizers for "play," as a function of the number of sampled pages (10-50) and of the number of snapshots (2-6).]
Figure 3.12 Sensitivity analysis.
Sensitivity Analysis: Next, we examined the sensitivity of Delex with respect to the main input
parameters: number of snapshots, size of sample used in statistics estimation, and the scope and
context values.
Figure 3.12.a plots the runtime of the plans selected by the optimizers of Delex and Cyclex as
a function of sample size, for "play" only (results for the other IE tasks show similar trends).
Figure 3.12.b plots the same runtimes as a function of the number of snapshots.
The results show that in both cases Delex only needs a few recent snapshots (3) and a small
sample size (30 pages) to do well. Furthermore, even when using statistics over only the last 2
snapshots, and a sample size of 10 pages, Delex can already reduce the runtime of Cyclex by
25%. This suggests that while collecting statistics is crucial for optimization, we can do so with a
relatively small number of samples over very recent snapshots.

[Figure 3.13: runtime of No-reuse, Shortcut, Cyclex, and Delex on "play" as the total number of extracted mentions grows from 20K to 120K.]
Figure 3.13 Runtime comparison with respect to the number of mentions.
We also conducted experiments to examine the sensitivity of Delex with respect to the α and β
of the IE "blackboxes" (figure omitted for space reasons). We found that the runtime of Delex grows
gracefully as the α and β of the IE "blackboxes" increase. Consider, for example, a scenario in our
experiments: randomly selecting an IE blackbox in the "play" task and increasing its α and β to
examine the change in Delex's runtime. When we increased α from 52 to 150, the average runtime
of Delex over five randomly selected snapshots increased by only 15% (from 216 seconds to 248
seconds). When we further increased α to 250 (five times the original α), the average runtime
over the same five snapshots increased by only 38% (from 216 seconds to 298 seconds).
We observed a similar phenomenon for β. These results suggest that a rough estimation of the α and
β of the IE blackboxes does increase the runtime of Delex, but in a graceful fashion.
Impact of Capturing IE Results: We also evaluated the impact of capturing IE results on Delex.
To do so, we varied the number of mentions extracted by the IE blackboxes and then examined
the runtimes of Delex and the baseline solutions. For example, given the IE program “play,” we
changed the code of each IE blackbox in “play” so that a mention extracted by the IE blackbox is
output multiple times. Then we applied Delex and the baseline solutions to this revised IE program
of “play.” Figure 3.13 plots these runtimes on “play” as a function of the total number of mentions
extracted by all IE blackboxes.
The results show Delex continues to outperform the baseline solutions by large margins as the
total number of mentions grows. This suggests that Delex scales well with respect to the number
of extracted mentions (and thus the size of captured IE results). Furthermore, we found that as
the number of mentions grows by 400% (from 22K to 110K), the time Delex spends on capturing
and reusing IE results grows by only 88% (from 17 seconds to 32 seconds). Additionally, the
overhead of capturing and reusing IE results remains an insignificant portion (3-8%) of Delex's
overall runtime. This suggests that the overhead of capturing IE results does increase as the number
of extracted mentions increases, but only in a graceful manner.
Learning-based IE Programs: Finally, we wanted to know how well Delex works on IE programs
that contain learning-based IE blackboxes. To this end, we experimented with an IE program
proposed in recent work [76] to automatically construct infoboxes (tabular summaries of an
object's key attributes) in Wikipedia pages. This IE program extracts the name, birth name, birth
date, and notable roles of each actor. To do so, it employs a maximum entropy (ME) classifier to
segment a raw data page into sentences, then employs four conditional random field (CRF) models,
one per attribute, to extract the appropriate values from each of the sentences.
To apply Delex, we first converted the above IE program into an xlog program that consists
of five IE blackboxes. These blackboxes capture the ME classifier and the four CRF models,
respectively. Then we derived α and β for each of the blackboxes. For example, given a delimiter
character in a raw data page, the ME classifier examines its context (i.e., the surrounding characters)
to determine whether the delimiter is indeed the end of a sentence. Given this, we can set αME to be
the maximal number of characters in a sentence, and βME to be the maximal number of characters
in the contexts examined by the ME classifier (321 and 16 in our experiment, respectively). It is
more difficult to derive tight values of αCRF and βCRF for the four CRF models, as these models
are quite complex. However, we can always set them to the length of the CRF model's longest
input string, i.e., the longest sentence, and this is what we did in the current experiment.
Figure 3.14 shows the runtime of Delex and the three baseline solutions on the above xlog
program running on Wikipedia. The results show that both Shortcut and Cyclex only perform
marginally better than No-reuse, due to the significant change of pages across snapshots and the
large α (17824 characters) of the entire IE program. However, Delex significantly outperforms all three
solutions. In particular, Delex reduces the runtime of Cyclex by 42-53%. This suggests that
Delex can benefit from exploiting the compositional nature of multi-blackbox learning-based IE
programs, even though we are not able to derive tight α and β for some learning-based IE blackboxes
(e.g., the complex CRF models) in these programs.

[Figure 3.14: per-snapshot runtime (snapshots 1-15) of No-reuse, Shortcut, Cyclex, and Delex on the "actor" program.]
Figure 3.14 Runtime comparison on a learning-based IE program.
3.7 Summary
A growing number of real-world applications involve IE over dynamic text corpora. Recent
work on Cyclex has shown that executing such IE in a straightforward manner is very expensive,
and that recycling past IE results can lead to significant performance improvements. Cyclex,
however, is limited in that it handles only IE programs that contain a single IE blackbox. Real-
world IE programs, in contrast, often contain multiple IE blackboxes connected in a workflow.
To address the above problem, we have developed Delex, a solution that effectively executes
multi-blackbox IE programs over evolving text data. As far as we know, Delex is the first in-
depth solution to this important problem. Our extensive experiments over two real-world data sets
demonstrate that Delex can cut the runtime of Cyclex by as much as 71%. This suggests that
exploiting the compositional nature of multi-blackbox IE programs can be highly beneficial.
Chapter 4
Recycling for CRF-Based IE Programs
So far, we have developed the efficient recycling algorithm Delex for IE programs consisting
of multiple IE blackboxes. If we can open up some of these blackboxes and understand more
about them, can we develop a more efficient recycling algorithm? We study this problem in this
chapter. In particular, we study IE programs that contain IE blackboxes based on a statistical
learning model, Conditional Random Fields (CRFs). We open up these CRF-based IE blackboxes
and explore whether we can develop a more efficient recycling algorithm. CRF-based IE is a
state-of-the-art IE solution that has been successfully applied to many IE tasks, including named
entity extraction [38, 54], table extraction [61], and citation extraction [60]. Therefore, a recycling
solution for CRF-based IE is a practical extension of Delex.
We first review CRF-based IE and introduce our problem in Section 4.1. Sections 4.2–4.5
describe our solution CRFlex. Section 4.6 presents an empirical evaluation. Finally, Section 4.7
concludes this chapter.
4.1 Introduction
In this section, we first briefly review CRF-based IE. Then we formally define our problem.
Finally we discuss the challenges in recycling for CRF-based IE and outline our solution.
[Figure 4.1: (a) the document d = "Tom Cruise was born in NY." is converted into the token sequence x = (x1, ..., x6) = (Tom, Cruise, was, born, in, NY) and labeled y = (y1, ..., y6) = (P, P, O, O, O, L), yielding the mentions M = {("Tom Cruise", PERSON), ("NY", LOCATION)}; (b) the V matrix computed by the Viterbi algorithm over positions 1-6:

         1   2   3   4   5   6
     P   4   9  10  16  18  25
     L   1   6   7  14  17  27
     O   1   2  13  16  19  24  ]
Figure 4.1 (a) An example of using CRFs to extract persons and locations, and (b) an example of the Viterbi algorithm.
4.1.1 Conditional Random Fields for Information Extraction
CRF-based IE reduces information extraction to a sequence labeling problem. Given a doc-
ument d, a CRF-based IE program P first converts d into a sequence of tokens x1...xT (see footnote 1). Then
P employs a CRF model F that takes x1...xT as input and outputs a label from a set Y of labels
for each token. Y consists of the set of entity types to be extracted and a special label “other” for
tokens that do not belong to any of the entity types. The output of F is a label sequence y1...yT ,
where yi is the label of xi. Finally, P considers the labels of consecutive tokens to extract mentions.
Example 4.1. Figure 4.1.a illustrates an example of using CRFs to extract PERSON and LOCA-
TION entities from a document d. A CRF-based IE program P first converts d into a sequence x
of tokens. Then it tags each token with one of the labels in Y = {PERSON(P), LOCATION(L),
OTHER(O)}, and outputs the label sequence y. Finally, P outputs a set of name mentions M ,
where each mention consists of the longest sequence of tokens with the same labels P or L.
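The final grouping step of Example 4.1 can be sketched as follows (a minimal illustration in Python; the function name and output format are ours, not the dissertation's):

```python
def labels_to_mentions(tokens, labels, other="O"):
    # Group each maximal run of tokens sharing the same non-`other`
    # label into a single mention (text, label).
    mentions, i = [], 0
    while i < len(tokens):
        if labels[i] == other:
            i += 1
            continue
        j = i
        # extend the run while the label stays the same
        while j + 1 < len(tokens) and labels[j + 1] == labels[i]:
            j += 1
        mentions.append((" ".join(tokens[i:j + 1]), labels[i]))
        i = j + 1
    return mentions

tokens = ["Tom", "Cruise", "was", "born", "in", "NY"]
labels = ["P", "P", "O", "O", "O", "L"]
# labels_to_mentions(tokens, labels) → [("Tom Cruise", "P"), ("NY", "L")]
```

On the sequence of Figure 4.1.a, this grouping yields exactly the two mentions "Tom Cruise" (P) and "NY" (L).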
To label token sequences accurately, CRFs capture the dependency in the labels. For example,
while New York is a location, New York Times is an organization. In particular, the most popular
CRF models used for IE assume a linear-chain dependency in labels. This implies that the label yi
of xi is directly influenced only by the labels yi−1 and yi+1 (besides by xi itself). Once yi−1 is fixed, yi−2 has
no influence on yi. In this chapter, we focus on linear-chain CRFs. Extending our work to other
CRF models is a subject for future research.

1 Besides token sequences, there are many other types of sequences (e.g., line sequences). Our solution can be generally applied to all types of sequences. For simplicity of discussion, we will focus on token sequences.
To capture such dependency in the labels of adjacent tokens, CRFs employ a set of feature
functions {fk(yi−1, yi, xi)}Kk=1. These feature functions indicate the properties of xi, given its label
yi and the previous label yi−1. For example, two possible feature functions are:
f1(yi−1, yi, xi) = [xi matches a state name] · [yi = LOCATION ],
f2(yi−1, yi, xi) = [xi starts with a capitalized character] · [yi−1 =PERSON] · [yi =PERSON],
where [p] = 1 if the predicate p is true and 0 otherwise.
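As a concrete sketch, such indicator feature functions can be coded directly (our own illustration; the gazetteer STATE_NAMES is hypothetical):

```python
STATE_NAMES = {"NY", "CA", "WI"}  # hypothetical gazetteer of state names

def f1(y_prev, y, x):
    # [x matches a state name] * [y = LOCATION]
    return int(x in STATE_NAMES and y == "LOCATION")

def f2(y_prev, y, x):
    # [x starts with a capitalized character] * [y_prev = PERSON] * [y = PERSON]
    return int(x[:1].isupper() and y_prev == "PERSON" and y == "PERSON")
```

Each function returns 1 only when all of its indicator predicates hold, mirroring the product-of-indicators form of f1 and f2 above.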
Each function fk is associated with a weight λk, which is obtained during the training phase.
With these feature functions and their weights, CRFs model the conditional distribution of the
label sequence y = y1...yT given the token sequence x = x1...xT as
    p(y|x) = (1/Z(x)) · exp{ Σ_{i=1..T} Σ_{k=1..K} λk · fk(yi−1, yi, xi) }        (4.1)

where Z(x) is a normalizing constant, equal to

    Z(x) = Σ_y exp{ Σ_{i=1..T} Σ_{k=1..K} λk · fk(yi−1, yi, xi) }.        (4.2)
During the inference phase, to label the input token sequence x, we compute the most likely
labeling

    y* = argmax_y p(y|x).        (4.3)
A brute force approach to find y∗ is to enumerate all possible y, which requires time exponential
in the sequence length. Fortunately, for linear-chain CRFs, the Viterbi algorithm can find y∗ in a
more efficient way. We now briefly describe this algorithm.
Viterbi Algorithm: The Viterbi algorithm is a dynamic programming algorithm for finding the
most likely label sequence. It operates in two phases: a forward phase and a backward phase. In
the forward phase, it computes a two dimensional V matrix. Each cell (y, i) of V stores the best
labeling score of the sequence from 1 to i with the ith position labeled y. The Viterbi algorithm
computes score V (y, i) recursively as follows:
    V(y, i) = max_{y′} { V(y′, i−1) + Σ_{k=1..K} λk · fk(y′, y, xi) }    if i > 0
    V(y, i) = 0                                                          if i = 0
While computing score V(y, i), the algorithm also keeps track of which y′ was used to compute
V(y, i) by adding an edge from cell (y′, i−1) to cell (y, i). At the end of the forward phase, it has
filled in all cells of the V matrix and added all the edges that indicate which previous labels were
used to compute the V scores. Then y* corresponds to the path traced from the cell that stores
max_y V(y, T). In
the backward phase, the Viterbi algorithm backtracks by following the edges added in the forward
phase to restore y∗.
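The two phases can be sketched in Python as follows (a minimal illustration under our own toy interface: score(y_prev, y, i) stands for Σ_k λk · fk(y_prev, y, xi), with y_prev = None at position 1; this is not code from the dissertation):

```python
def viterbi(T, labels, score):
    # Forward phase: fill the V matrix and record, for each cell (y, i),
    # the previous label used to compute V(y, i) (the "edge" into the cell).
    V = [{y: score(None, y, 1) for y in labels}]      # column for position 1
    back = [{}]                                       # no edges into position 1
    for i in range(2, T + 1):
        col, edges = {}, {}
        for y in labels:
            best = max(labels, key=lambda yp: V[-1][yp] + score(yp, y, i))
            col[y] = V[-1][best] + score(best, y, i)
            edges[y] = best                           # edge (best, i-1) -> (y, i)
        V.append(col)
        back.append(edges)
    # Backward phase: start from the best cell in the last column and
    # follow the recorded edges to restore y*.
    y = max(labels, key=lambda yl: V[-1][yl])
    path = [y]
    for i in range(T - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return list(reversed(path))
```

Each position considers every candidate previous label for every label, so the sketch runs in time proportional to T·|Y|².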
Example 4.2. Continuing Example 4.1, Figure 4.1.b illustrates the V matrix computed over the
token sequence x in Figure 4.1.a. The first row, second row, and third row contain the scores for
label P, L, and O respectively. Each column contains the scores for a given position. We also
plot all the edges that keep track of which previous labels are used to compute the V scores. For
example, the edge cell(P, 1) → cell(L, 2) indicates V (P, 1) is used to compute V (L, 2). Finally,
the path of the best labeling is highlighted in bold.
The running time of the Viterbi algorithm is O(T·|Y|²), where T is the length of the input token
sequence and |Y| is the size of Y.
4.1.2 Problem Definition
CRF-Based IE Programs: We consider how to execute a CRF-based IE program P efficiently
over evolving text. Like the IE programs considered by Delex (Chapter 3), P is a multi-blackbox
IE program, represented in xlog. Some of these blackboxes employ CRF models for extraction.
[Figure 4.2: the path matrix for the token sequence of Figure 4.1, with rows P, L, O and columns 1-6 storing all the edges created by the Viterbi algorithm.]
Figure 4.2 An example of a path matrix.
We call such IE blackboxes CRF-based IE predicates. A CRF-based IE predicate takes as input a
text span and the sequence of tokens contained in the text span, and outputs the labels of those tokens.
In order to reuse results efficiently for CRF predicates (see footnote 2), we assume that these predicates expose some
intermediate results to CRFlex. In particular, we assume that CRF predicates output the paths
created by the Viterbi algorithm. These paths are stored in a path matrix. For example, Figure 4.2
illustrates the path matrix that stores all the paths created in Example 4.2. Furthermore, CRF
predicates also take an additional input argument: a vector that stores the scores of position 0 in the
V matrix. The CRF predicate uses this vector to initialize the V matrix in the Viterbi recurrence of Section 4.1.1.
We now formally define CRF predicates as follows:
Definition 4.1 (CRF-based IE predicate). A CRF-based IE predicate F labels a sequence of tokens
using a CRF model. Formally, F is a p-predicate q(a1, a2, a3, . . . , an, b1, b2, . . . , bm), where (a) a1
is either a document or a text span variable, (b) a2 is a token sequence variable, (c) a3 is a score
vector variable, (d) b1 is a label sequence variable, (e) b2 is a path matrix variable, and (f) for any
output tuple (u1, u2, u3, . . . , un, v1, v2, . . . , vm), u2 is a token sequence contained by u1, v1 is the
sequence of labels of u2 output by F with initializing the scores of position 0 to u3, and v2 is the
path matrix output by F during its execution over u1.
Example 4.3. Consider a CRF predicate
applyCRFs(textSpan, tokenSequence, initialScores, labelSequence, paths)
that represents the CRF predicate used in Example 4.1. An example input to applyCRFs is (d, x, [0, 0,
0]), where d and x are illustrated in Figure 4.1.a, and [0, 0, 0] represents the scores for position 0 in
the V matrix illustrated in Figure 4.1.b. This input tuple results in an output tuple (y, G), where y
is illustrated in Figure 4.1.a, and G is the matrix illustrated in Figure 4.2.

2 In the rest of the chapter, we use "CRF-based IE predicate" and "CRF predicate" interchangeably.
A CRF-based IE program is an IE program that contains one or more CRF predicates. We can
now define our problem formally as follows:
PROBLEM DEFINITION Let P1, ..., Pn be consecutive snapshots of a text corpus, and let P be an
IE program written in xlog. Let F1, ..., Fl be the CRF-based IE predicates in P, with the estimated
scopes and contexts (α1, β1), ..., (αl, βl), respectively. Furthermore, let E1, ..., Em be the
non-CRF-based IE predicates in P, with the estimated scopes and contexts (α′1, β′1), ..., (α′m, β′m),
respectively. Develop a solution to execute P over corpus snapshot Pn+1 with minimal cost, by
reusing extraction results over P1, ..., Pn.
To address this problem, a simple solution is to treat all F1, . . . , Fl, together with E1, . . . , Em
as general IE blackboxes, and then apply Delex to P . We found, however, that this solution does
not work well when the text corpus changes frequently. The main reason is that estimating “tight”
αi and βi for Fi is very difficult. As discussed before (Section 4.1.1), the Viterbi algorithm used
for CRF inference considers the entire sequence together to output the best labels. Therefore, we
have to set αi and βi to the maximal length of the text spans covering the entire token sequences.
These values are often very large and limit reuse opportunities.
This suggests that we should exploit properties that are specific to CRFs. In this chapter,
we present CRFlex as a solution that captures this intuition. We now discuss the challenges in
designing this solution.
4.1.3 Challenges and Solution Outlines
The first challenge is what properties of CRFs we can exploit for reuse. As we discussed
before, the general scope and context provide limited reuse opportunities for CRF predicates. To
address this problem, we identify an important property of CRFs: the CRF context. Similar to the mention
contexts identified in Cyclex, the CRF context of a token x specifies small windows surrounding
x, such that no matter how we perturb tokens outside these windows, the label of x remains the
same. Compared to mention contexts, however, identifying CRF contexts is fundamentally much
harder. The main reason is that CRFs operate based on the dependency in the labels of adjacent
tokens. To address this, we show that, under certain conditions, a token's label does not depend on the
labels of its adjacent tokens. This allows us to break a sequence into several independent pieces
and recycle the results of each piece independently.
The second challenge is what results to capture for each CRF predicate and how to capture
these results. As we will show later (Section 4.2), CRF contexts vary from one token to another.
Therefore, we must identify, capture, and store these contexts in Pn, so that we can exploit them in
Pn+1 for safe reuse. To this end, we develop a solution that efficiently infers the CRF contexts from
the path matrix output by the CRF predicate. In addition, we show how to store these contexts to
reduce the I/O overhead.
Finally, how can we efficiently reuse the captured results? Similar to Cyclex and Delex, CR-
Flex first finds overlapping regions and then exploits the CRF contexts to identify copy regions.
As we will show later (Section 4.4), in order to exploit CRF contexts properly, CRFlex must inter-
leave re-applying the CRF predicate with exploiting the CRF contexts to identify the copy regions.
The challenge is that these two steps are dependent upon each other. Without re-applying the CRF
predicate, we cannot exploit the CRF contexts, and thus cannot identify the copy regions. At the
same time, without identifying the copy regions, we also do not know the extraction regions, and
thus do not know to which regions we should re-apply the CRF predicate. We develop a solution
that explores this dependency constraint to interleave the two steps carefully.
In the rest of this chapter, we describe our solution CRFlex in detail. We first present prop-
erties of CRFs we exploit for safe reuse in Section 4.2. Then we describe how to capture results
efficiently for future reuse in Section 4.3, and reuse the captured results in Section 4.4.
4.2 Modeling CRFs for Reusing
We now discuss how to model CRF predicates for safe reuse. Our goal is to model some
properties so that we can safely recycle the labels output by CRF predicates. In CRFlex, we
identify such a property of CRF predicates: CRF context. Like mention contexts, the CRF context
of a token xi specifies windows surrounding xi, such that, given these windows, the tokens outside
those windows are irrelevant to the label of xi. Unlike mention contexts, the CRF context of xi
also specifies the labels of certain tokens in those windows. We first introduce the right context, which
specifies a window after a token. Then we introduce the left context, which specifies both a window
before a token and the label of a certain token in that window. Finally, we introduce the CRF context,
based on the right and left contexts.

[Figure 4.3: a path matrix over an eight-token sequence and three labels L1, L2, L3, with the paths between columns 1 and 3, and between columns 4 and 6, highlighted.]
Figure 4.3 An illustration of right contexts.

[Figure 4.4: the same path matrix, with the paths between columns 2 and 4 highlighted.]
Figure 4.4 An illustration of left contexts.
To motivate right contexts, we observe that the tokens that are far away after a token xi have
little influence on xi’s label, as illustrated by the following example:
Example 4.4. Figure 4.3 illustrates the path matrix G of an eight-token sequence and three possible
labels L1, L2, and L3. Notice that G(L1, 1) can reach all cells in column 3 by following the
highlighted paths between column 1 and 3. Since the best labeling path must contain one of those
3 cells in column 3, no matter what follows the third token, the best labeling path must contain
G(L1, 1). Therefore, L1 is the best label of x1 no matter how we perturb the tokens after x3. We
call token sequence x2...x3 the right context of x1. Similarly, we can find the right context for each
token in the token sequence. As another example, Figure 4.3 illustrates that G(L3, 4) can reach all
cells in column 6 by following the highlighted paths between column 4 and 6. Therefore, the right
context of x4 is x5...x6.
Let x = x1...xT be a token sequence, xi be a token in x, and y be the label of xi produced by
applying a CRF predicate F to x. Furthermore, let G denote the path matrix output by F on x.
Then we formalize the notion of right context as follows:
Definition 4.2 (Right context). The right context of a token xi is the token sequence xi+1...xi+ν ,
i.e., the consecutive ν tokens after xi in x, such that i + ν is the first column of G where all cells
can be reached by G(y, i) through the paths stored in G.
The nice property of right context is that the tokens outside the right context of xi are irrelevant
to the label of xi. That is, for any token sequence x′ obtained by perturbing the tokens of x after
the right context of xi, applying F to x′ still produces the same label y of xi.
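Concretely, the right-context width ν can be found by a forward reachability sweep over the edges of the path matrix. Below is a minimal sketch (our own illustration; the encoding back[(yy, j)] — the previous label on the Viterbi edge into cell (yy, j), for columns j = 2..T — is an assumption, not the dissertation's representation):

```python
def right_context_width(back, labels, T, y, i):
    # Smallest v such that every cell in column i+v of the path matrix is
    # reachable from cell (y, i) via the stored edges; None if no such column.
    reachable = {y}
    for j in range(i + 1, T + 1):
        # (yy, j) is reachable iff the cell its edge comes from was reachable
        reachable = {yy for yy in labels if back[(yy, j)] in reachable}
        if len(reachable) == len(labels):
            return j - i
    return None
```

In the scenario of Figure 4.3, such a sweep from G(L1, 1) would return ν = 2, i.e., the right context x2...x3.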
To motivate left contexts, we observe that the tokens that are far away before a token also
have little influence on its label. In particular, we observe that if a cell G(y, i) can reach all cells
in some column j of G, then no matter how we perturb the tokens of x before xi, under certain
conditions, all labels of tokens after xj remain the same. We formulate this observation in the
following lemma.
Lemma 4.1. Let xj be the first token after xi such that all cells in column j of G can be reached
by G(y, i). Let x′ be any token sequence obtained by perturbing tokens of x before xi. Let G ′ be
the path matrix output by F over x′. Suppose the positions of xi and xj become i′ and j′ in x′. If
G ′(y, i′) can reach all cells in column j′ of G ′, then the labels of tokens after xj in x remain the
same.
Proof. Let W(u, v, a, b) denote the maximum score of a path starting at position a with label u and
ending at position b with label v. Let V denote the score matrix computed by F over x. Then it is
easy to show that, given that G(y, i) can reach any cell in and after column j of G, for any k ≥ j
and any y′ ∈ Y, V(y′, k) = V(y, i) + W(y, y′, i, k).
Let V′ denote the score matrix computed by F over x′. In a similar way, we can show
that V′(y′, k′) = V′(y, i′) + W(y, y′, i′, k′), where i′ and k′ are the positions of xi and xk in x′,
respectively.
We can also show that W (u, v, a, b) and the path traced by it remain the same no matter how
we perturb tokens outside token sequence xa...xb. Therefore, W (y, y′, i, k) = W (y, y′, i′, k′) and
the paths traced by them are exactly the same.
Hence, V′(y′, k′) − V(y′, k) = V′(y, i′) − V(y, i) = δ. This indicates that the V scores of the tokens
after xj are increased by only a constant δ. Therefore, the ranking of the scores of different labels for
the same token remains the same, and thus the best label for the last token xT remains the same.
Furthermore, the paths traced by V (y′, k) and V ′(y′, k′) after xi are exactly the same path as
the path traced by W (y′, y, i, k), and thus contain the same edges. Since the best label of the last
token xT remains the same and all the paths between xj and xT remain the same, the best labeling
path also remains the same at and after xj . Hence, the labels of all tokens after xj (including the
label of xj) remain the same.
We can now define the left context of a token xi as follows:
Definition 4.3 (Left context). The left context of a token xi is the token sequence xi−µ...xi−1 and
the label λ of xi−µ, such that G(λ, i − µ) can reach all cells in column i through the paths in G,
and no cell in a column after i− µ can reach all cells in column i. We represent the left context as
a tuple (xi−µ...xi−1, λ). Furthermore, we call xi−µ...xi−1 the left context window of xi.
By Lemma 4.1, we can show that for any token sequence x′ obtained by perturbing the tokens
of x before the left context window xi−µ...xi−1 of xi, as long as the cell of the path matrix for label
λ and token xi−µ can still reach all cells for xi, applying F to x′ still produces the same label y of
xi.
Example 4.5. Figure 4.4 illustrates the same path matrix as the one illustrated in Figure 4.3.
Notice that G(L1, 2) can reach all cells in column 4 by following the highlighted paths between
column 2 and 4. Therefore, the left context of x4 is x2...x3 and label L1. Let x′ be a token sequence
resulting from perturbing the tokens of x before x2, and G ′ be the path matrix created by F over x′.
Then if the cell of G ′ for label L1 and token x2 can still reach all cells for token x4, the label of x4
remains the same.
We now define CRF context. Intuitively, the CRF context of a token xi consists of its left
context and its right context. Formally:
Definition 4.4 (CRF context). The CRF context of a token xi is the token sequence xi−µ...xi+ν
and the label λ of xi−µ, such that xi−µ...xi−1 is its left context window and xi+1...xi+ν is the right
context window. We represent the CRF context as a tuple (xi−µ...xi+ν , λ).
The nice property of CRF context is that no matter how we perturb the tokens of x outside
xi−µ...xi+ν , as long as the label λ of xi−µ can still reach all possible labels of xi, applying F to the
perturbed token sequence still produces the same label y of xi.
Example 4.6. From Example 4.4 and Example 4.5, we know that the left context of x4 is (x2...x3, L1)
and the right context is x5...x6. Therefore, the CRF context of x4 is (x2...x6, L1).
4.3 Capturing CRF IE Results
In this section, we discuss what results to capture for a CRF-based IE program P and how to
capture them while running P on the current snapshot Pn.
Like Delex, CRFlex captures both input tuples and output tuples for each non-CRF IE pred-
icate in P . Additionally, CRFlex also captures results for each CRF predicate. We first discuss
what results to capture for a CRF predicate F , and then discuss how to capture and store them.
Capturing IE Results: In order to reuse the results of F safely, we need to capture: (a) the token
sequences F has operated over, (b) the CRF contexts of tokens in these token sequences, and (c)
labels output by F .
We can capture the token sequences and the labels from the input and output tuples of F .
Capturing the CRF contexts raises a challenge since F does not output CRF contexts directly. Our
solution is to exploit the path matrices output by F and infer the CRF contexts from the paths
stored in these matrices. We now describe this solution in detail.
Given the path matrix G output by F over a token sequence x, we scan G once and identify the
CRF contexts of all tokens in x.
The key step in identifying the CRF contexts is to identify a cell in each column i of G that
can reach all cells in a column after i. To do so, we use a matrix R of the same size as G to keep
track of the reachability of each cell of G. Initially, R is empty. Then we update R as we scan
G column by column. When we scan column j of G, each cell R(y, i) is either empty or stores a
label y′ if G(y′, i) can reach G(y, j) by following the paths stored in G. If all cells in column i of R
contain the same label y′, this indicates G(y′, i) can reach all cells in column j of G. The concrete
algorithm is as follows:
1. Initialize R: Initially, each cell of R is empty.
2. Scan Column 1 of G and Update R: Since no edge points to any cell in column 1 of G, we
set R(y, 1) = y for each y ∈ Y , indicating G(y, 1) can reach itself.
3. Scan Column 2 of G and Update R: We first make a copy R′ of R. Then we scan column
2 of G. For each cell G(y, 2) in column 2, if there is an edge from G(y′, 1) to G(y, 2), then we set
R(y, 1) = R′(y′, 1). This indicates that G(R(y, 1), 1) can reach G(y, 2). Finally, we set R(y, 2) = y
for each possible label y ∈ Y .
4. Check R to Identify Left and Right Contexts: Now we check if all cells in any column
before column 2 of R store the same label. In this case, there is only one column before 2, which
is column 1. So if all cells in column 1 of R contain the same label y, this indicates G(y, 1) can
reach all cells in column 2. Hence we identify x2 as the right context of x1. Furthermore, x1 and
its label y form the left context of x2.
5. Scan the Rest of the Columns of G and Update R Similarly: We repeat step 3-4 for the rest
of the columns of G. In general, before we begin to scan a column j, we first make a copy R′ of
R. Then while we are scanning column j, if there is an edge from G(y′, j − 1) to G(y, j), then we
set R(y, k) = R′(y′, k) for each k < j. Next, we set R(y, j) = y for each possible label y ∈ Y .
Finally, we check if there is any column k of R such that all cells in column k store the same label
y. If so, xk+1...xj is the right context of xk. Furthermore, xk...xj−1 and y form the left context of
xj . After we finish scanning G, we can combine the left contexts and the right contexts to find the
CRF contexts.
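The five-step scan above can be sketched compactly. This is an illustrative 0-based implementation, assuming the path matrix is given as backpointers G[y][j] (the previous label on the best path into cell (y, j), as produced by Viterbi). From each returned triple (k, j, lab), xk+1...xj is the right context of xk and (xk...xj−1, lab) is the left context of xj:

```python
# One-pass reachability scan over the path matrix. R[y][k] stores the label
# lab such that cell (lab, k) can reach cell (y, j) for the column j scanned
# so far; a column k whose cells all store the same label reaches all of
# column j.
def find_contexts(G, labels, T):
    R = {y: [None] * T for y in labels}
    contexts = []
    for y in labels:
        R[y][0] = y                        # column 1: each cell reaches itself
    for j in range(1, T):
        Rprev = {y: list(R[y]) for y in labels}   # copy R before scanning j
        for y in labels:
            prev = G[y][j]                 # edge from (prev, j-1) to (y, j)
            for k in range(j):
                R[y][k] = Rprev[prev][k]
        for y in labels:
            R[y][j] = y
        for k in range(j):                 # check for a uniformly labeled column
            labs = {R[y][k] for y in labels}
            if len(labs) == 1:
                contexts.append((k, j, labs.pop()))
    return contexts
```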
Example 4.7. Figure 4.5 illustrates an example of capturing CRF contexts. The matrix in the
first row is the path matrix over token sequence x1...x5 with 3 possible labels L1, L2, and L3.
The matrices in the second and third row are the reachability matrices when we scan column 1 to
column 5 of the path matrix respectively. First, we scan the first column of the path matrix and set
the reachability matrix to R1. Then we scan the second column of the path matrix and update the
reachability matrix. This results in matrix R2. After we scan the third column of the path matrix
and update the reachability matrix, all cells in the first column of the reachability matrix contain
the same label L1. This indicates that the right context of x1 is x2...x3, and the left context of x3 is
[Figure content: the path matrix over x1...x5 with labels L1, L2, L3, and the reachability matrices R1–R5 obtained after scanning columns 1–5. Annotations: x1’s right context = x2...x3; x3’s left context = (x1...x2, L1); x2’s right context = x3...x4; x4’s left context = (x2...x3, L1).]
Figure 4.5 An illustration of capturing CRF contexts from a path matrix.
x1...x2 and L1, which is the label of x1. Then we process the rest of the columns of the path matrix
similarly.
Capturing CRF contexts incurs overhead of O(TD|Y|) in time and O(T |Y|) in memory space,
where T is the length of x and D is the length of the longest right context. In our experiments, we
found that D is generally 2-3 tokens.
Storing the Captured IE Results: We now discuss how to store the above results while running
F over Pn.
Our goal is to generate three files at the end of the run on Pn: InF that stores the input token
sequences to F , OnF that stores the labels output by F , and CnF that stores all CRF contexts.
Formally, we can write each CRF predicate F : (did, s, e, x, S) → (y,G), where
• did is the ID of a document d,
• s and e are the start and end positions of a text span t in d,
• x is the token sequence contained in t,
• S is the initialization score vector,
• y is the resulting label sequence, and
• G is the resulting path matrix.
Then for each input tuple (did, s, e, x, S), we append a tuple (tid, did, s, e, p) to InF , where
• tid is the tuple ID unique in InF , and
• p is a sequence of tuples (si, ei), where si and ei are the start and end positions of xi in d
respectively.
For each output tuple (y,G), we append a set of tuples {(otid, itid, i, y)} to OnF and a set of
tuples {(ctid, itid, i, µ, ν)} to CnF , where
• otid is the tuple ID unique in OnF ,
• ctid is the tuple ID unique in CnF ,
• itid is the ID of the input tuple that results in the output tuple (y,G),
• i is the position of token xi in x,
• y is the label of xi, and
• µ and ν are the lengths of the left and right context window of xi respectively.
The overall process is the same as in Delex: we process pages in Pn, append the results gener-
ated from each page to the three files, and store these files I/O efficiently on disk while executing
F . Please refer to Chapter 3 for a detailed discussion.
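The append step for one input/output tuple pair of F can be sketched as follows. The tab-separated layout and the ID scheme (`o{tid}_{i}`, `c{tid}_{i}`) are hypothetical illustrations; the actual CRFlex/Delex on-disk format is not specified here:

```python
# Append capture records for one input tuple of F and its output to the three
# reuse files InF, OnF, and CnF.
import csv

def capture(in_f, out_f, ctx_f, tid, did, s, e, positions, labels, contexts):
    # InF record: (tid, did, s, e, p), with p the token position pairs (si, ei)
    csv.writer(in_f, delimiter='\t').writerow(
        [tid, did, s, e, ';'.join(f'{si},{ei}' for si, ei in positions)])
    ow = csv.writer(out_f, delimiter='\t')
    cw = csv.writer(ctx_f, delimiter='\t')
    for i, (y, (mu, nu)) in enumerate(zip(labels, contexts), start=1):
        ow.writerow([f'o{tid}_{i}', tid, i, y])        # OnF: (otid, itid, i, y)
        cw.writerow([f'c{tid}_{i}', tid, i, mu, nu])   # CnF: (ctid, itid, i, mu, nu)
```

Because records for a page are appended in processing order, each file can later be read back sequentially during reuse, as discussed above.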
4.4 Reusing Captured Results
We now describe how to use the captured results to speed up executing P over snapshot Pn+1.
The overall processing algorithm is the same as the one used in Delex, which we summarize
as follows. Please refer to Section 3.3.2 for a detailed discussion. We assume that we match each
page p ∈ Pn+1 with pages in Pn, to find overlapping regions, from which we can reuse previous
IE results. To reuse, we need Pn+1, Pn and all intermediate IE results we captured over Pn. These
intermediate IE results are stored in various reuse files (Section 4.3 and Section 3.2). To ensure
sequential access to these results during reuse, the IE results in each reuse file are stored in the same
order. Particularly, let q1, q2, . . . , qk be the order in which we processed pages in Pn. Then in each
reuse file, we stored all tuples on page q1 first, then all tuples on page q2, and so on. Consequently,
we will process pages in Pn+1 following the same order. That is, let pi be the page with the same
URL as qi, i = 1, . . . , k. Then we process p1, then p2, and so on. When we execute P over a page
pi, we execute the predicates of P in a bottom-up fashion over the execution plan tree (Section 3.1).
Please refer to Section 3.3.3 for a detailed discussion on how to execute a non-CRF-based IE
predicate. In what follows, we discuss how to execute the CRF predicates.
Suppose we are going to execute a CRF predicate F on a particular page p (in snapshot Pn+1),
whose previous version is q (in snapshot Pn). We first read in InF (q), OnF (q) and CnF (q) from
the corresponding reuse files InF , OnF and CnF . Then we execute F in three steps as follows:
1. Match Input Sequences: We start with In+1F (p), the set of input tuples to F . Each input tuple
(tid, did, s, e, p) ∈ In+1F (p) represents a text region [s, e] of page p that contains token sequence x
(whose positions are encoded by p). Then, we consult InF (q), which contains the input tuples to F
when it executed on q. This set is read in from the reuse file InF as discussed above. Each tuple in
InF (q) has the form (tid′, did′, s′, e′, p′), where did′ is the ID of q, and p′ records the positions of
tokens in x′ (contained in region [s′, e′] of q), to which we applied F .
Our goal is to find matching token sequences between x and x′. We call such matching token
sequences matching regions, in a similar way as we defined in Delex and Cyclex.
There are two ways to find matching regions between x and x′. One way is to match region
p[s, e] of p with region q[s′, e′] of q. The matching is done using one of the matchers employed
by Delex (Section 3.3.4). Then we join the resulting matching regions with p and p′ to identify
matching token sequences. Another way is to match the sequence x directly with sequence x′ of q.
We call this matching algorithm a token matcher. As we have shown in Cyclex and Delex, none
of these matchers is always optimal. So CRFlex considers all matchers employed by Delex and
the token matcher. Then it uses a cost model to select matchers, as Delex does. Please refer to
Section 3.4 for how to select a matcher.
We repeat the matching step for each input tuple in In+1F (p) to find its matching regions. For
each matching region between x on p and x′ on q, we store in memory a tuple (tid, tid′, s, s′, l),
where tid and tid′ are the tuple IDs of the tuples that encode x and x′ respectively, s and s′ are the
start positions of the matching regions in x and x′, respectively, and l is the length of the matching
region. We store these tuples in buffer Rn+1F (p).
2. Apply F & Identify Copy Regions: Given the set of matching regions, we then identify copy
regions and apply F to find the labels of extraction regions, which are regions that are not copy
regions.
To identify copy regions, we must check the CRF context of each token in the matching region.
Given that token x in x on page p matches token x′ in x′ on page q, we must check if the CRF
context of x is the same as the CRF context of x′. Recall that the CRF context of x′ also includes
the label of the first token in the left context window of x′ (see Section 4.2). This implies that
we must check if that token’s match in x also has the same label. This suggests that we must first
re-apply F to x to output the labels of some tokens. Then we can check the CRF contexts of tokens
in the matching regions and determine the copy regions. We proceed in the following steps:
• a. Determine the First Extraction Region: Let r be the first matching region in x, and r′ be r’s
match in x′. We consult CnF (q) to locate the first token x′i in x′ such that its left context window
x′i−µ...x′i−1 is totally contained in r′. Let xj be the match of x′i. Then the first extraction region is
x1...xj .
• b. Apply F to the Extraction Region: We apply F to token sequence x1...xj with score vector
S, where all elements of S are set to 0.
• c. Output the Labels and Identify the CRF Contexts: From the path matrix G output in step b.,
we apply the same approach described in Section 4.3 to identify the left and right contexts of tokens
in the extraction region x1...xj . Let xk be the last token with its right context window xk+1...xk+ν
totally contained in the extraction region. Then we can output the labels of x1...xk. Furthermore,
we check if the left context of xj is the same as the left context of x′i. If so, we go to step d. to
determine the copy region. Otherwise, we go to step f. to continue applying F .
• d. Determine the First Copy Region: We first locate the last token x′g in x′ such that its right
context window x′g+1...x′g+ν′ is totally contained in region r′. Let xh be the match of x′g in x. Then
xj...xh is the first copy region. We output a tuple (tid, tid′, s, s′, l) that encodes this copy region,
where tid and tid′ are the tuple IDs of the input tuples that encode x and x′ respectively, s and s′
are the start positions of the copy region in x and x′, respectively, and l is the length of the copy
region. We then go to step g.
• f. Continue Applying F After an Extraction Region: Let xk be the last token whose right
context is contained in the last extraction region. Let y be its label. We then use the same approach
described in step a. to determine the end of the next extraction region. Let this extraction region
be xk+1...xl. We apply F to xk+1...xl with score vector S, where S is set such that except for the
score of label y, the initial scores of all other labels are 0. In this way, we enforce F to start with
label y for the next extraction region. Then we go to step d. and continue.
• g. Apply F After a Copy Region: Suppose we have found a copy region xj...xh. We then follow
the similar approach in step f. to determine the next extraction region. The only difference here is
that the extraction region starts at xh+1, and initial score vector S is set such that, except for the
label of xh, all other labels’ scores are 0.
• e. Cover the Rest of x: We repeat step a. to g. for the rest of the token sequence.
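The alternation of steps a through g can be summarized by the following skeleton. This is a simplified sketch, not the full algorithm: it omits the left-context equality check of step c, and the helpers `first_reusable_token`, `last_reusable_token`, `apply_crf`, and `copy_labels` are hypothetical stand-ins for the context lookups, the CRF predicate F (with its initial score vector fixed to the preceding label), and the copy step:

```python
# Skeleton of step 2: alternate between extraction regions (re-apply F) and
# copy regions (reuse previous labels) over token sequence x.
def label_with_reuse(x, matching_regions, first_reusable_token,
                     last_reusable_token, apply_crf, copy_labels):
    labels, pos = [None] * len(x), 0
    for region in matching_regions:
        j = first_reusable_token(region)    # first token whose left context
                                            # window lies inside the region
        if j is None:
            continue
        # extraction region: tokens not covered by reuse, seeded with the
        # label of the preceding token (if any)
        labels[pos:j] = apply_crf(x, pos, j, labels[pos - 1] if pos else None)
        h = last_reusable_token(region)     # last token whose right context
                                            # window lies inside the region
        labels[j:h + 1] = copy_labels(region, j, h)   # copy region
        pos = h + 1
    if pos < len(x):                        # trailing extraction region
        labels[pos:] = apply_crf(x, pos, len(x), labels[pos - 1] if pos else None)
    return labels
```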
3. Copy Labels and CRF Contexts: We now have obtained a set of copy regions and labels
of non-copy regions. In the last step, we copy the labels and CRF contexts of the tokens in the
copy regions. Specifically, for each tuple (tid, tid′, s, s′, l) that encodes a copy region, we consult
OnF (q) to find the joining output tuples (with the same tid′), and consult CnF (q) to find the joining
CRF contexts (with the same tid′). This step is similar to the copy step of Delex. Please refer to
Section 3.3.3 for a detailed discussion.
We conclude this section by showing the correctness of CRFlex.
Theorem 4.1 (Correctness of CRFlex). Let Mn+1 be the set of mentions obtained by applying a
CRF-based IE program P from scratch to snapshot Pn+1. Then CRFlex is correct in that when
applied to Pn+1 it produces exactly Mn+1.
Proof. Let F be a CRF-based IE blackbox in P and On+1F be the output of F produced by
re-applying F from scratch to Pn+1. In a similar way as in Cyclex, we can show that CRFlex
produces exactly On+1F for F when it is applied to Pn+1. Therefore, CRFlex produces the
Data Sets                  DBLife     Wikipedia
# Data Sources             980        925
Time Interval              2 days     21 days
# Snapshots                15         15
Avg # Pages per Snapshot   10155      3038
Avg Size per Snapshot      180M       35M
Figure 4.6 Data sets for our experiments.
correct output for each CRF-based IE blackbox. Since CRFlex and Delex behave in the same way
for all non-CRF-based blackboxes, CRFlex produces the correct output for each non-CRF-based
blackbox as Delex does. Hence, CRFlex produces exactly Mn+1.
4.5 Putting It All Together
We now describe the end-to-end CRFlex solution. Given a CRF-based IE program P written in
xlog, we first employ the techniques described in [67] to translate and optimize P into an execution
tree T , and then pass T to CRFlex.
Given a corpus snapshot Pn+1, CRFlex first employs the optimization technique described in
Section 3.4 to assign matchers to the IE predicates, including the CRF predicates. Next, CRFlex
executes the so-augmented tree T on Pn+1, employing the reuse algorithm described in Section 4.4
and the reuse files it produced for snapshot Pn. During execution, it captures and stores intermedi-
ate IE results (for reuse in the subsequent snapshot Pn+2), as described in Section 4.3.
4.6 Empirical Evaluation
We now empirically evaluate the utility of CRFlex. Figure 4.6 describes two real-world data
sets used in our experiments. DBLife consists of 15 snapshots from the DBLife system [31], and
Wikipedia consists of 15 snapshots from wikipedia.org.
We experimented with an open source real-world CRF-based IE program, the Stanford CRF-
based Named Entity Recognizer (NER) [38]. Given a document, NER labels sequences of tokens
in the document as the names of PERSON, ORGANIZATION, or LOCATION entities. The CRF
stanfordNER(d, token, entityType) :- docs(d), stanfordNERTokenize(d,x),
applyCRFs(d,x,y), outputLabels(x,y,token,entityType)
(a)
[(b) The execution plan tree, bottom-up: docs(d) → stanfordNERTokenize(d,x) → applyCRFs(d,x,y) → outputLabels(x,y,token,entityType).]
Figure 4.7 Stanford NER in xlog.
model employed by NER uses a variety of feature functions, such as the prefixes and suffixes of
a token, as well as conjunctions of these feature functions. To apply Delex and CRFlex to NER,
we first converted NER into an xlog program. The resulting program and its execution plan are
illustrated in Figure 4.7. Given a document d, stanfordNERTokenize converts d into a sequence
x of tokens. Then applyCRFs takes d and x as input, employs a CRF model to label x, and outputs
the label sequence y. We analyzed the program and set α and β of the entire IE program and all IE
blackboxes. In particular, we set αstanfordNERTokenize to 25 characters, and βstanfordNERTokenize
to 2 characters. αapplyCRFs, βapplyCRFs, and the α, β of the entire IE program are all set to the
maximal length of the entire document.
Runtime Comparison: Figure 4.8 shows the runtime of CRFlex vs. that of other possible
baseline solutions over all consecutive snapshots. We consider three baselines: No-reuse, Cyclex,
and Delex. No-reuse re-executes NER over all pages in a snapshot; Cyclex treats the whole NER
program as a single IE blackbox for reuse; and Delex is aware of the blackboxes in the IE program,
but not aware that the blackbox applyCRFs is based on CRFs.
Figure 4.8 shows that, in all cases, No-reuse (i.e., rerunning IE from scratch) incurs large
runtimes, while Cyclex and Delex show mixed performance. On DBLife, where 96-98% of
pages remain identical on consecutive snapshots, they perform far better than No-reuse. But on
Wikipedia, where many pages tend to change (only 8-20% pages remain identical on consecutive
snapshots), they perform only slightly better than No-reuse.
[Figure content: runtime (s) versus snapshot (1–15) for No-reuse, Cyclex, Delex, and CRFlex, plotted for DBLife (y-axis up to 1100 s) and Wikipedia (y-axis up to 700 s).]
Figure 4.8 Runtime of No-reuse, Cyclex, Delex, and CRFlex.
CRFlex, however, performs comparably to or significantly better than all of the above solutions.
On DBLife, where most of the pages remain identical, CRFlex performs as well as Cyclex and
Delex. On Wikipedia, where many pages tend to change, CRFlex significantly outperforms the
other three solutions, cutting runtime by nearly 52%. These results suggest that CRFlex is able
to exploit the properties of CRFs to enable more reuse, thereby significantly speeding up program
execution.
Contributions of Components: Figure 4.9 shows the runtime decomposition of the above solu-
tions (numbers in the figure are averaged over five random snapshots on each data set). “Match”
is the total time of applying all matchers in the execution tree. “CRFs” is the total time to apply
CRF-based IE blackboxes. “Non-CRFs” is the total time to apply all non-CRF-based IE black-
boxes. “Copy” is the total time to copy results. “Opt” is the optimization time of Cyclex, Delex,
and CRFlex. Finally, “Others” is the remaining time (to apply relational operators, read file
indices, etc.).
The results show that matching and extracting dominate runtimes. In particular, 67-90% of
overall runtime is for CRF-based IE in all 4 solutions. Hence, we should focus on optimizing
these components, as we do in CRFlex. Furthermore, CRFlex spends more time on matching
and copying than Cyclex and Delex on Wikipedia, where pages change frequently. However, this
[Figure content: stacked runtime (s) bars for No-reuse, Cyclex, Delex, and CRFlex on DBLife (y-axis up to 1000 s) and Wikipedia (y-axis up to 500 s), decomposed into Match, CRFs, Non-CRFs, Copy, Opt, and Others.]
Figure 4.9 Runtime decomposition of No-reuse, Cyclex, Delex, and CRFlex.
effort clearly pays off (e.g., reducing the extraction time of No-reuse, Cyclex, and Delex by 47-
52%). Finally, the results show CRFlex incurs insignificant overhead (optimization, copying, etc.)
compared to its overall runtime.
4.7 Summary
A growing number of real-world applications involve IE over evolving text corpora. Recent
work on Cyclex and Delex has shown that executing such IE in a straightforward manner is very
expensive, and that recycling past IE results can lead to significant performance improvements.
Cyclex and Delex, however, are limited in that they are not aware that some IE blackboxes are
based on statistical learning models, even though learning-based IE programs have been successfully
applied to many real-world applications.
To address the above problem, we have developed CRFlex, a solution for efficiently executing
IE programs based on CRFs, a state-of-the-art learning model. As far as we know, CRFlex is the
first in-depth solution for this important problem. Our experiments over real-world datasets and
a CRF-based IE program show that CRFlex cuts the runtime of Delex by as much as 52%. This
suggests that exploiting the properties of CRFs can be highly beneficial.
Chapter 5
Related Work
Information Extraction: The problem of information extraction has received much attention
(see [22, 3, 33, 18] for recent tutorials). Numerous rule-based extractors (e.g., those relying on
regular expressions or dictionaries [30, 62, 46, 52, 69]) and learning-based extractors (e.g., those
employing CRFs, SVMs, and Markov Networks [72, 26, 14, 12, 2, 76, 65, 10]) have been
developed. Our work can handle both types of extractors.
Much work has tried to improve the accuracy and runtime of these extractors [35, 63]. But
recent work has also considered how to combine and manage such extractors in large-scale IE
applications [3, 33, 1]. Our work fits into this emerging direction.
Once we have extracted entity mentions, we can perform additional analysis, such as mention
disambiguation (a.k.a. record linkage, e.g., [5, 8, 23, 25, 55, 64, 74, 43]). Such analysis is at a
higher level and orthogonal to our current work.
While we have focused on IE over unstructured text, our work is related to wrapper construc-
tion, the problem of inferring a set of rules (encoded as a wrapper) to extract information from
template-based Web pages [24]. Since wrappers can be viewed as extractors (as defined in Chapter
2), our techniques can potentially also apply to wrapper contexts. In this context, the knowledge
of page templates may help us develop even more efficient IE algorithms.
Finally, optimizing IE programs and developing IE-centric cost models have also been consid-
ered in several recent papers [67, 50, 42, 4]. These efforts however have considered only static
corpus contexts, not dynamic ones as we do in this dissertation.
Evolving Text: Several recent works have also considered evolving text data, but in different
problem contexts. The work [47, 57, 56] considers how to repair a wrapper (so that it continues
to extract semantically correct data) as the underlying page templates change, the work [29] con-
siders how to build robust wrappers over evolving text, the work [77] considers how to efficiently
recrawl evolving text to improve the freshness of extracted data, and the work [48] considers how
to incrementally update an inverted index, as the indexed Web pages change.
Recent work [78, 41] has also exploited overlapping text data, but again in different problem
contexts. These works observe that document collections often contain overlapping text. They then
consider how to exploit such overlap to “compress” the inverted indexes over these documents,
and how to answer queries efficiently over such compressed indexes. In contrast, we exploit the IE
results over the overlapping text regions to reduce the overall extraction time.
Detecting Matching Regions: The problem of finding matching text regions is related to detecting
duplicated Web pages. Many algorithms have been developed in this area (e.g., [36, 68, 11]).
But when applied to our context they do not guarantee to find all largest possible overlapping
regions, in contrast to the suffix-tree based algorithm developed in this work. Several suffix tree
algorithms have been widely used to find matching substrings in a given input string [40]. Here we
have significantly extended these algorithms, to develop one that can efficiently detect all maximal
matching regions (i.e., substrings) between two given strings, in time linear in the total length of
these two strings.
CRFs: CRF-based IE has received much attention recently. Most works [44, 21, 53, 66, 71, 54,
59, 27, 45, 49] have considered how to improve extraction accuracy of CRF-based IE programs.
Recent work [73] has considered how to implement CRF-based IE programs over RDBMS, and
then exploit RDBMS to improve extraction time. However, this work has only considered static
text corpora, not evolving text corpora as we do.
View Maintenance: Our work is also related to incremental view maintenance [39, 79, 9, 75]
– namely, if changes to the input of a dataflow program are small, then incrementally computing
changes to the result can be more efficient than recomputing the dataflow from scratch. But the
works differ in many important ways. First, our inputs are text documents instead of tables. Most
work on view maintenance assumes that changes to the inputs (base tables) are readily available
(e.g., from database logs), while we also face the challenge of how to characterize and efficiently
detect portions of the input texts that remain unchanged. Most importantly, view maintenance only
needs to consider a handful of standard operators with well-defined semantics. In contrast, we
must deal with arbitrary IE blackboxes.
Chapter 6
Conclusions
Evolving text is pervasive, and there are many applications that consider IE over evolving
text. The current solution is to re-apply IE programs to each corpus snapshot from scratch and in
isolation. This approach is inefficient and has limited applicability. To this end, this dissertation
has developed a set of solutions that execute IE programs over evolving text efficiently. In this
chapter, we summarize the key contributions of the dissertation and discuss directions for future
research.
6.1 Contributions
We have made the following contributions:
• The most important contribution of this dissertation is a framework that provides efficient
solutions for IE over evolving text. In particular, the framework advocates the idea of recy-
cling the IE results over previous corpus snapshots. As far as we know, this dissertation is
the first in-depth solution to the problem of IE over evolving text.
• We show how to model common properties of general IE blackboxes and CRF-based IE
blackboxes, and how to exploit these properties for safely reusing previous IE results.
• We show that a natural tradeoff exists in finding overlapping text regions from which we can
recycle past IE results. An approach to finding overlapping regions is called a matcher. We
show that an entire spectrum of matchers exists, with matchers trading off the completeness
of the results for runtime efficiency. Since no matcher is always optimal, our solutions
provide a set of alternative matchers (more can be added easily), and employ a cost model to
make an informed decision in selecting a good matcher.
• Our approaches can deal with large text corpora by exploiting many database techniques,
such as cost-based optimization and hash joins.
• Our approaches can deal with complex IE programs that consist of multiple IE blackboxes
by exploiting the compositional nature of these IE programs. We show how to model these
complex IE programs for recycling, how to implement the recycling process efficiently, and
how to find a good execution plan in a vast plan space with different recycling alternatives.
• We have developed a powerful suffix-tree-based matcher that detects all overlapping regions
between two documents. This matcher can be exploited by many other applications that need
to compare two documents.
6.2 Future Directions
Handling More General Matching Schemes: To recycle IE results, we must match each page
in the current snapshot with pages in the past snapshots to find overlapping regions. Many such
matching schemes exist. Currently, we match each page p in snapshot Pn+1 only with the page
q in snapshot Pn at the same URL as p. However, in some cases, it is desirable to match p with
other pages as well. For example, bloggers and online news editors often quote other articles on a
particular subject, and then make their own comments about the subject. Therefore, news and blog
articles, of different URLs or even from different Web sites, often contain overlapping regions. In
this case, if we allow matching pages across URLs (e.g., matching within the same Web sites or
matching over all pages of all previous snapshots), we can find more overlapping regions, and thus
save more IE efforts. The key challenge is how to match pages across URLs efficiently and how to
access IE results of all previous snapshots efficiently for reuse.
Maintaining the Quality of IE Programs over Evolving Text: In this dissertation, we have
considered the problem of how to execute the same IE programs repeatedly over evolving text.
However, due to the heterogeneous nature of unstructured text, IE programs themselves also need
to evolve continuously over time to adapt to the changes in the incoming text. For instance, when
documents in newer formats come, IE programs need to incorporate new parsers accordingly.
Hence, IE systems must constantly monitor the source text, and detect and deal with any possible
changes. Manually monitoring, detecting, and adapting is very expensive and not scalable. The
key challenge here is to develop techniques to automatically monitor and adapt IE programs.
Optimizing Information Integration over Evolving Text: Another direction is to optimize the
runtime of programs that consist not only of IE blackboxes but also of Information Integration (II)
blackboxes over evolving text. Many applications require II together with IE. For example, II can
be used to decide whether two extracted text fragments “UW-Madison” and “University of Wisconsin, Madison”
refer to the same entity. To optimize the total runtime of such programs, ideally we should
optimize the runtime of II as well as that of IE. The key challenge is to identify the properties of II
blackboxes that we can exploit for efficient and correct reuse.
LIST OF REFERENCES
[1] http://langrid.nict.go.jp.
[2] Eugene Agichtein and Venkatesh Ganti. Mining reference tables for automatic text segmentation. In KDD ’04: Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, pages 20–29, 2004.
[3] Eugene Agichtein and Sunita Sarawagi. Scalable information extraction and integration (tutorial). In KDD ’06: Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining, 2006.
[4] Yevgeny (Eugene) Agichtein. Extracting relations from large text collections. PhD Thesis, 2005. Adviser: Gravano, Luis.
[5] Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB ’02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 586–597, 2002.
[6] Nilesh Bansal, Fei Chiang, Nick Koudas, and Frank Wm. Tompa. BlogScope: a system for online analysis of high volume text streams. In VLDB ’07: Proceedings of the 33rd International Conference on Very Large Data Bases, 2007.
[7] B. Bhattacharjee, V. Ercegovac, J. Glider, R. Golding, G. Lohman, V. Markl, H. Pirahesh, J. Rao, R. Rees, F. Reiss, E. Shekita, and G. Swart. Impliance: a next generation information management appliance. In CIDR ’07: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research, pages 351–362, 2007.
[8] Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, and Stephen Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, 2003.
[9] Jose A. Blakeley, Per-Ake Larson, and Frank Wm. Tompa. Efficiently updating materialized views. SIGMOD Record, 15(2):61–71, 1986.
[10] Vinayak Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic segmentation of text into structured records. SIGMOD Record, 30(2):175–186, 2001.
[11] Andrei Z. Broder. Identifying and filtering near-duplicate documents. In COM ’00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pages 1–10, 2000.
[12] Razvan Bunescu and Raymond J. Mooney. Collective information extraction with relational Markov networks. In ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 438, 2004.
[13] Yuhan Cai, Xin Luna Dong, Alon Halevy, Jing Michelle Liu, and Jayant Madhavan. Personal information management with SEMEX. In SIGMOD ’05: Proceedings of the 31st International Conference on Management of Data, pages 921–923, 2005.
[14] Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-match rules for information extraction. In AAAI ’99/IAAI ’99: Proceedings of the 16th National Conference on Artificial Intelligence and the 11th Innovative Applications of Artificial Intelligence Conference, pages 328–334, 1999.
[15] Amit Chandel, P. C. Nagesh, and Sunita Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE ’06: Proceedings of the 22nd International Conference on Data Engineering, pages 28–38, 2006.
[16] Fei Chen, AnHai Doan, Jun Yang, and Raghu Ramakrishnan. Efficient information extraction over evolving text data. In ICDE ’08: Proceedings of the 24th International Conference on Data Engineering, pages 943–952, 2008.
[17] Fei Chen, Byron J. Gao, AnHai Doan, Jun Yang, and Raghu Ramakrishnan. Optimizing complex extraction programs over evolving text data. In SIGMOD ’09: Proceedings of the 35th International Conference on Management of Data, pages 321–334, 2009.
[18] Laura Chiticariu, Yunyao Li, Sriram Raghavan, and Frederick R. Reiss. Enterprise information extraction: recent developments and open challenges (tutorial). In SIGMOD ’10: Proceedings of the 36th International Conference on Management of Data, pages 1257–1258, 2010.
[19] Junghoo Cho and Hector Garcia-Molina. Effective page refresh policies for web crawlers. ACM Transactions on Database Systems, 28(4), 2003.
[20] Junghoo Cho and Sridhar Rajagopalan. A fast regular expression indexing engine. In ICDE ’02: Proceedings of the 18th International Conference on Data Engineering, page 419, 2002.
[21] Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. Identifying sources of opinions with conditional random fields and extraction patterns. In HLT ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 355–362, 2005.
[22] W. Cohen and A. McCallum. Information extraction from the World Wide Web (tutorial). In KDD ’03: Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining, 2003.
[23] William Cohen and Jacob Richman. Learning to match and cluster entity names. In SIGIR ’01 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.
[24] William W. Cohen, Matthew Hurst, and Lee S. Jensen. A flexible learning system for wrapping tables and lists in HTML documents. In WWW ’02: Proceedings of the 11th International Conference on World Wide Web, pages 232–241, 2002.
[25] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb ’03: Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, pages 73–78, 2003.
[26] William W. Cohen and Sunita Sarawagi. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In KDD ’04: Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, pages 89–98, 2004.
[27] Michael Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In EMNLP ’02: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8, 2002.
[28] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. GATE: an architecture for development of robust HLT applications. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 168–175, 2002.
[29] Nilesh N. Dalvi, Philip Bohannon, and Fei Sha. Robust web extraction: an approach based on a probabilistic tree-edit model. In SIGMOD ’09: Proceedings of the 35th International Conference on Management of Data, pages 335–348, 2009.
[30] Pedro DeRose, Warren Shen, Fei Chen, AnHai Doan, and Raghu Ramakrishnan. Building structured web community portals: a top-down, compositional, and incremental approach. In VLDB ’07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 399–410, 2007.
[31] Pedro DeRose, Warren Shen, Fei Chen, Yoonkyong Lee, Douglas Burdick, AnHai Doan, and Raghu Ramakrishnan. DBLife: a community information management platform for the database research community (demo). In CIDR ’07: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research, pages 169–172, 2007.
[32] AnHai Doan, Raghu Ramakrishnan, Fei Chen, Pedro DeRose, Yoonkyong Lee, Robert McCann, Mayssam Sayyadian, and Warren Shen. Community information management. IEEE Data Engineering Bulletin, 29(1):64–72, 2006.
[33] AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan. Managing information extraction: state of the art and research directions (tutorial). In SIGMOD ’06: Proceedings of the 32nd International Conference on Management of Data, pages 799–800, 2006.
[34] Jenny Edwards, Kevin McCurley, and John Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In WWW ’01: Proceedings of the 10th International Conference on World Wide Web, pages 106–113, 2001.
[35] Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Web-scale information extraction in KnowItAll (preliminary results). In WWW ’04: Proceedings of the 13th International Conference on World Wide Web, pages 100–110, 2004.
[36] Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing iceberg queries efficiently. In VLDB ’98: Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[37] D. Ferrucci and A. Lally. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348, 2004.
[38] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL ’05: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 290–294, 2005.
[39] A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations and Applications. MIT Press, 1999.
[40] D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge: Cambridge University Press, 1997.
[41] Michael Herscovici, Ronny Lempel, and Sivan Yogev. Efficient indexing of versioned document sequences. In ECIR ’07: Proceedings of the 29th European Conference on IR Research, pages 76–87, 2007.
[42] Alpa Jain, AnHai Doan, and Luis Gravano. SQL queries over unstructured text databases. In ICDE ’07: Proceedings of the 23rd International Conference on Data Engineering, pages 1255–1257, 2007.
[43] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms (tutorial). In SIGMOD ’06: Proceedings of the 32nd International Conference on Management of Data, 2006.
[44] Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew McCallum. Interactive information extraction with constrained conditional random fields. In AAAI ’04: Proceedings of the 19th National Conference on Artificial Intelligence, pages 412–418, 2004.
[45] Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. Applying conditional random fields to Japanese morphological analysis. In EMNLP ’04: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237, 2004.
[46] W. Lehnert, J. McCarthy, S. Soderland, E. Riloff, C. Cardie, J. Peterson, F. Feng, C. Dolan, and S. Goldman. UMass/Hughes: description of the CIRCUS system used for TIPSTER text. In Proceedings of a Workshop Held at Fredericksburg, Virginia, pages 241–256, 1993.
[47] Kristina Lerman, Steven N. Minton, and Craig A. Knoblock. Wrapper maintenance: a machine learning approach. Journal of Artificial Intelligence Research, 18, 2003.
[48] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ramesh Agarwal. Dynamic maintenance of web indexes using landmarks. In WWW ’03: Proceedings of the 12th International Conference on World Wide Web, pages 102–111, 2003.
[49] Yan Liu, Jaime Carbonell, Peter Weigele, and Vanathi Gopalakrishnan. Segmentation conditional random fields (SCRFs): a new approach for protein fold recognition. In RECOMB ’05: Proceedings of the 9th International Conference on Computational Biology, pages 14–18, 2005.
[50] Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In SIGMOD ’06: Proceedings of the 32nd International Conference on Management of Data, pages 265–276, 2006.
[51] Michael Mathioudakis and Nick Koudas. TwitterMonitor: trend detection over the Twitter stream. In SIGMOD ’10: Proceedings of the 36th International Conference on Management of Data, pages 1155–1158, 2010.
[52] Diana Maynard, Valentin Tablan, Cristian Ursu, Hamish Cunningham, and Yorick Wilks. Named entity recognition from diverse text types. In Recent Advances in Natural Language Processing 2001 Conference, Tzigov Chark, 2001.
[53] Andrew McCallum and David Jensen. A note on the unification of information extraction and data mining using conditional-probability, relational models. In IJCAI ’03: Proceedings of the 18th International Joint Conference on Artificial Intelligence, 2003.
[54] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CoNLL ’03: Proceedings of the 7th Conference on Natural Language Learning, 2003.
[55] Andrew McCallum and Ben Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In NIPS ’03: Proceedings of the 16th International Conference on Advances in Neural Information Processing Systems, pages 905–912, 2003.
[56] Robert McCann, Bedoor K. AlShebli, Quoc Le, Hoa Nguyen, Long Vu, and AnHai Doan. Maveric: mapping maintenance for data integration systems. In VLDB ’05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 1018–1029, 2005.
[57] Xiaofeng Meng, Dongdong Hu, and Chen Li. Schema-guided wrapper maintenance for web-data extraction. In WIDM ’03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pages 1–8, 2003.
[58] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(1):251–256, 1986.
[59] Fuchun Peng, Fangfang Feng, and Andrew McCallum. Chinese segmentation and new word detection using conditional random fields. In COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics, page 562, 2004.
[60] Fuchun Peng and Andrew McCallum. Accurate information extraction from research papers using conditional random fields. In HLT-NAACL ’04: Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics, 2004.
[61] David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. Table extraction using conditional random fields. In SIGIR ’03: Proceedings of the 26th Annual International Conference on Research and Development in Information Retrieval, pages 235–242, 2003.
[62] Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. An algebraic approach to rule-based information extraction. In ICDE ’08: Proceedings of the 24th International Conference on Data Engineering, pages 933–942, 2008.
[63] Sunita Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008.
[64] Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In KDD ’02: Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, pages 269–278, 2002.
[65] Sandeepkumar Satpal and Sunita Sarawagi. Domain adaptation of conditional probability models via feature subsetting. In PKDD ’07: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 224–235, 2007.
[66] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics, pages 134–141, 2003.
[67] Warren Shen, AnHai Doan, Jeffrey F. Naughton, and Raghu Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB ’07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 1033–1044, 2007.
[68] N. Shivakumar and H. Garcia-Molina. SCAM: a copy detection mechanism for digital documents. In DL ’95: Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
[69] Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
[70] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: a core of semantic knowledge. In WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pages 697–706, 2007.
[71] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine Learning Research, 8:693–723, 2007.
[72] Koichi Takeuchi and Nigel Collier. Use of support vector machines in extended named entity recognition. In COLING ’02: Proceedings of the 6th Conference on Natural Language Learning, pages 1–7, 2002.
[73] Daisy Zhe Wang, Eirinaios Michelakis, Michael J. Franklin, Minos N. Garofalakis, and Joseph M. Hellerstein. Probabilistic declarative information extraction. In ICDE ’10: Proceedings of the 26th International Conference on Data Engineering, pages 173–176, 2010.
[74] Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of information extraction and coreference with application to citation matching. In UAI ’04: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 593–601, 2004.
[75] Jennifer Widom. Research problems in data warehousing. In CIKM ’95: Proceedings of the 4th International Conference on Information and Knowledge Management, pages 25–30, 1995.
[76] Fei Wu and Daniel S. Weld. Autonomously semantifying Wikipedia. In CIKM ’07: Proceedings of the 16th International Conference on Information and Knowledge Management, pages 41–50, 2007.
[77] Mohan Yang, Haixun Wang, Lipyeow Lim, and Min Wang. Optimizing content freshness of relations extracted from the web using keyword search. In SIGMOD ’10: Proceedings of the 36th International Conference on Management of Data, pages 819–830, 2010.
[78] Jiangong Zhang and Torsten Suel. Efficient search in large textual collections with redundancy. In WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pages 411–420, 2007.
[79] Yue Zhuge, Hector Garcia-Molina, Joachim Hammer, and Jennifer Widom. View maintenance in a warehousing environment. SIGMOD Record, 24(2):316–327, 1995.
Appendix A: xlog Programs for Delex Experiments
In this section, we show how to derive α and β for the individual IE blackboxes and entire IE
programs used in our experiments. This includes the three DBLife IE programs and the three Wikipedia IE
programs listed in Figure 3.7.(b), and one learning-based IE program, “actor”. Their xlog programs
are shown in Figure A.1 and Figure A.2.
talk: The IE program “talk” consists of a single IE blackbox, exTalk, which takes as input a data page d,
a name pattern n, and a topic pattern t. It then extracts mentions of the talk relationship as follows.
First it detects speaker mentions by finding occurrences of n in d. Then it detects topic mentions
by finding occurrences of t in d. Next, it detects keywords such as “seminar”, “lecture”, and “talk”.
Finally, it pairs up a speaker mention and a topic mention if they span at most 155 characters and
a detected keyword either immediately precedes or is contained in the text spanned by the mention
pair. Therefore, we set α to 155 and β to 9, the maximal length of a keyword.
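To make the pairing step concrete, here is a simplified, hypothetical re-implementation of exTalk's final step (the actual blackbox is opaque to our optimizer; the function and constant names are ours). Mentions are (start, end) character offsets, and the 9-character window checked before the pair's span mirrors the β = 9 derived above:

```python
KEYWORDS = ("seminar", "lecture", "talk")
ALPHA = 155  # maximal span of a mention pair (talk's scope)
BETA = 9     # maximal keyword length (talk's context)

def pair_talk_mentions(page, speakers, topics):
    """Pair a speaker mention with a topic mention if together they span at
    most ALPHA characters and a keyword occurs in the spanned text or in the
    BETA characters immediately preceding it."""
    pairs = []
    for sp in speakers:
        for tp in topics:
            start = min(sp[0], tp[0])
            end = max(sp[1], tp[1])
            if end - start > ALPHA:
                continue
            # spanned text plus the BETA characters before it
            window = page[max(start - BETA, 0):end].lower()
            if any(k in window for k in KEYWORDS):
                pairs.append((sp, tp))
    return pairs
```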
chair: The IE program “chair” contains 3 IE blackboxes: exPerson, exConference, and exChairType.
We derive their α and β as follows.
exPerson takes as input a data page d and a name pattern n. It then extracts person mentions
by detecting occurrences of n in d. Therefore, we set αexPerson to the maximal length of a person
mention and βexPerson to 0.
exConference operates similarly to exPerson. Accordingly, we set αexConference to the maximal
length of a conference mention and βexConference to 0.
exChairType takes as input a data page d and a chair-type pattern c. It then extracts a chair-type
mention by (a) detecting all occurrences of c, and (b) outputting an occurrence of c if it is
immediately followed by the keyword “chair”. Therefore, we set αexChairType to the maximal length
of a chair-type mention, and βexChairType to the length of the keyword “chair”.
Finally, the IE program “chair” outputs a chair mention by “stitching” a person mention, a
conference mention and a chair-type mention together, if (a) the conference mention precedes the
chair-type mention, (b) the chair-type mention precedes the person mention, and (c) the chair-type
mention and person mention span at most 20 characters. Therefore, we can set α of the entire IE
program to the maximal length of the text spanned by a chair mention. Since any text spanned by
a chair mention begins with the conference mention, and ends with the person mention, we set β
= max(βexConference, βexPerson).
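For illustration, the stitching predicates of "chair" can be sketched as follows (hypothetical helper code, not the actual DBLife implementation), with each mention given as a (start, end) character-offset pair:

```python
def stitch_chair(people, conferences, chair_types, max_span=20):
    """Stitch (conference, chairType, person) triples: the conference must
    precede the chair type, the chair type must precede the person, and the
    chair-type and person mentions together must span fewer than max_span
    characters (spanChar(chairType, person) < 20 in the xlog program)."""
    triples = []
    for conf in conferences:
        for ct in chair_types:
            if conf[1] > ct[0]:              # conference must come first
                continue
            for person in people:
                if ct[1] > person[0]:        # chair type before person
                    continue
                if person[1] - ct[0] >= max_span:  # span constraint
                    continue
                triples.append((conf, ct, person))
    return triples
```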
advise: The IE program “advise” contains 5 IE blackboxes: exAdvisor, exAdvisee, exNameList,
exCoauthors, and exWorkOn. We derive their α and β as follows.
exAdvisor takes as input a data page d and a name pattern n. It then extracts an advisor mention
by (a) detecting all occurrences of n, and (b) outputting an occurrence of n if it is preceded by a
keyword such as “professor” or “prof.”. Therefore, we set αexAdvisor to the maximal length of an
advisor mention, and βexAdvisor to the maximal distance between the beginning of the keyword and
the beginning of the advisor mention.
exAdvisee operates similarly to exAdvisor, except that the keywords used are “student”, “PhD”,
etc. Therefore, αexAdvisee and βexAdvisee are set in a similar manner.
exNameList takes as input a data page d and a name list pattern l. It then extracts a list of names
by detecting occurrences of l in d. Therefore, we set αexNameList to the maximal length of a name
list, and βexNameList to 0.
exCoauthors takes as input a name list and a pair of name patterns n1 and n2. It then extracts
coauthor mentions by (a) detecting occurrences of n1 and n2, and (b) “stitching” occurrences of
n1 and n2 together. Therefore, we set αexCoauthors to the maximal length of the text spanned by a
coauthor mention, and βexCoauthors to 0.
exWorkOn takes as input a data page d, a name pattern n and a topic pattern t. It then extracts
mentions of work-on relationship in three steps. First, it detects person mentions by finding occur-
rences of n in d. Then it detects topic mentions by finding occurrences of t in d. Finally, it pairs
up person and topic mentions if they span at most 60 characters. Therefore, we set αexWorkOn to
60, and βexWorkOn to 0.
Finally, the IE program “advise” outputs a mention of the advise relationship by stitching the advisor
mention, advisee mention, and work-on mention together, if (a) the advisor mention and advisee
mention approximately match the two names in a coauthor mention, and (b) the text spanned by
one of the names in the matched coauthor mention is the same as the text spanned by the name in the
work-on mention. Therefore, we set α of the entire IE program to the maximal length of the text
spanned by an advise mention. Furthermore, we set β of the entire IE program to maxi(βi),
where i ∈ {exAdvisor, exAdvisee, exNameList, exCoauthors, exWorkOn}.
blockbuster: The IE program “blockbuster” extracts famous movies from a data page. It contains
2 IE blackboxes exCareerSection and exMovie. We derive their α and β as follows.
exCareerSection extracts career sections from a Wikipedia page by (a) detecting all sections
delimited by the section heading markups, then (b) outputting a section if the keyword “career” is
present in the section title preceding the section. Therefore we set the context βexCareerSection to
the maximal distance between the beginning of the section title and the beginning of the section.
Then we set αexCareerSection to the maximal number of characters in a career section.
exMovie takes as input a career section and a movie name pattern. It then extracts movie
mentions by detecting occurrences of the name pattern in the career section. Therefore we set the
context βexMovie to 0, and the scope αexMovie to the maximal length of a movie name.
Finally, we set α of the entire IE program to the maximal length of a movie mention, and β to
the max of βexCareerSection and βexMovie.
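As an illustration of exCareerSection's section-detection logic, here is a simplified, assumed sketch (the heading regex handles only '== Title =='-style wiki markup, and the function name is ours, not the actual blackbox):

```python
import re

# Wikipedia-style section headings, e.g. "== Early career ==".
HEADING = re.compile(r"^(={2,})\s*(.*?)\s*\1\s*$", re.MULTILINE)

def career_sections(wikitext, keyword="career"):
    """Split wikitext at heading markup and return the bodies of sections
    whose title contains the keyword (case-insensitive)."""
    headings = list(HEADING.finditer(wikitext))
    sections = []
    for i, h in enumerate(headings):
        body_start = h.end()
        body_end = headings[i + 1].start() if i + 1 < len(headings) else len(wikitext)
        if keyword in h.group(2).lower():
            sections.append(wikitext[body_start:body_end].strip())
    return sections
```

Under this sketch, βexCareerSection corresponds to the heading line that precedes each section body, and αexCareerSection to the longest body returned.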
play: The IE program “play” extracts who plays in which movies relationships from data pages.
It contains 4 IE blackboxes: exIntro, exActor, exCareerSection and exMovie. We derive their α and
β as follows.
exIntro extracts the introduction paragraphs from a Wikipedia page by (a) detecting all para-
graphs, and (b) outputting paragraphs that precede the first section heading markup. Therefore,
we set the context βexIntro to the maximal length of a section markup, and scope αexIntro to the
maximal length of an introduction paragraph.
exActor takes as input introduction paragraphs p and an actor name pattern n. It then outputs
actor mentions by finding occurrences of n in p. Therefore, we set αexActor to the maximal length
of the text spanned by an actor mention, and βexActor to 0.
exCareerSection and exMovie operate exactly as in “blockbuster”. Therefore
their α and β are the same as before.
Finally, the IE program stitches the actor mentions and movie mentions together. Therefore,
we set α of the entire IE program to the length of the longest text spanned by a paired actor and
movie mention, and β to the largest βi.
award: The IE program “award” contains 6 IE blackboxes: exAwardSection, exBioSection,
exAwardItem, exAward, exActor, and exRole. We derive their α and β as follows.
exAwardSection extracts the award section from a Wikipedia page by (a) detecting all sections
delimited by the section heading markups, then (b) outputting a section if the keyword “award” is
present in the section heading preceding the section. Therefore we set the context βexAwardSection
to the maximal distance between the beginning of the section heading and the beginning of the
section. Furthermore, we set αexAwardSection to the maximal number of characters in any award
section.
exBioSection operates similarly to exAwardSection; thus αexBioSection and βexBioSection are
estimated in a similar manner.
exAwardItem takes as input an award section, detects list item markups, and outputs all list
items in the award section. Therefore we set αexAwardItem to the maximal length of a list item and
βexAwardItem to the maximal length of the list item markups.
exAward takes as input an award list item i, a movie pattern m, and an award pattern a. It
then extracts movie and award mention pairs from i by (a) detecting all movie mentions by finding
occurrences of m, (b) detecting all award mentions by finding occurrences of a, and (c) pairing up
all movie mentions and award mentions. Therefore, we set αexAward to the maximal length of the
text spanned by a movie and award mention pair, and βexAward to 0.
exActor and exRole operate similarly to exAward. Thus, their scope α and context β are estimated
in a similar manner.
Finally, we derive the α and β of the entire IE program using the above α and β of the individual
IE blackboxes. Specifically, we set α of “award” to the maximal length of text spanned by an
award mention, and β of “award” to the maximum of all βi.
actor: The IE program “actor” (shown in Figure A.2) is a learning-based IE program that extracts
mentions of actor entities from Wikipedia pages. It captures exactly the extraction workflow of
KYLIN, a machine learning system recently proposed by [76] to automatically construct infoboxes
for Wikipedia pages.
Following the workflow of KYLIN, “actor” operates in three steps. First, given a Wikipedia
page d, rule R1 extracts sentences from d using the IE blackbox exSentence. As in KYLIN, we
implemented exSentence using the sentence detector from the openNLP library
(http://opennlp.sourceforge.net). This sentence detector employs a maximum entropy (ME)
classifier to detect sentence delimiters. Next, each rule from R2 − R5 employs an IE blackbox
(in bold) to extract attribute values of a distinct attribute from a sentence, if the sentence is
predicted to contain some values of that attribute at all. As in KYLIN, we implemented each IE
blackbox in R2 − R5 as a distinct conditional random field (CRF) model, trained for each attribute,
to extract the values of that attribute. Specifically, we used the implementation from
http://crf.sourceforge.net/ for the CRFs. Finally, rule R6 “stitches” the attribute values extracted
by R2 − R5 to produce actor mentions.
We now describe how to derive the α and β of each IE blackbox and of the entire IE program. The IE
blackbox exSentence takes as input a data page d, and extracts sentences from d by (a) identifying
candidate delimiters such as “!”, “.”, and “?”, (b) capturing features from the tokens surrounding those
delimiters, and (c) employing an ME classifier to determine whether candidate delimiters are actual
sentence delimiters based on the captured features. Clearly, as long as the tokens surrounding a
candidate delimiter remain the same, the features captured from the tokens will also remain the same,
and thus the classification of the candidate delimiter remains the same. Hence, we can set βME to the
maximal number of characters in the surrounding tokens. Furthermore, we can set the scope αME to
the maximal number of characters in a sentence. In our experiments, we set βME to 16 and αME to
321.
Each of the four IE blackboxes exName, exBirthName, exBirthDate, and exNotableRoles
employs a CRF model to extract attribute values from a sentence by (a) capturing features of each
token in the sentence, then (b) finding the most likely sequence of labels (indicating whether a token is
part of an attribute value) for the sentence using the trained CRF model. The CRF models are very
complex, and it is thus hard to derive tight values of αCRF and βCRF. However, it is always true that if
a given sentence remains the same, the sequence of labels of this sentence and thus the extracted
attribute values will remain the same. Therefore, we can set αCRF and βCRF to the length of the
CRF model’s longest input string, i.e., the longest sentence.
Finally, we estimate the α and β of the entire IE program using those of the IE blackboxes.
The scope α of the IE program is set to the length of the longest string spanned by an actor
mention. Additionally, for an actor mention m in page p, the string p[(m.start−βCRF )..(m.end+
βCRF )] must contain all sentences from which the attribute values of m are extracted. Therefore,
if p[(m.start − βCRF )..(m.end + βCRF )] remains the same, we can guarantee the same attribute
values will be extracted. Furthermore, if p[(m.start − βCRF − βME)..(m.end + βCRF + βME)]
remains the same, we can guarantee the same sentences spanned by m will also be extracted.
Therefore, β is set to βME + βCRF. In our experiments, we set α to 17824 and β to 337.
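The recycling condition described above can be phrased operationally. The sketch below is illustrative only (the shift argument stands in for whatever offset mapping the page matcher produces between snapshots): a mention can safely be recycled when its span, padded by β characters of context on each side, reappears unchanged in the new page.

```python
def can_reuse(old_page, new_page, mention, beta, shift=0):
    """Return True if the beta-padded region around the mention (given as
    (start, end) offsets into old_page) reappears unchanged in new_page at
    the position indicated by shift. In that case the mention and its
    attribute values can be recycled without re-running the blackbox."""
    start, end = mention
    lo = max(start - beta, 0)
    hi = min(end + beta, len(old_page))
    region = old_page[lo:hi]
    new_lo = lo + shift
    return new_page[new_lo:new_lo + len(region)] == region
```

Note that this check is sufficient but not necessary: a mention whose padded region changed may still, by coincidence, yield the same extraction results when the blackbox is re-run.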
R1: talk(d,speaker,topics) :- docs(d), namePatterns(n), topicPatterns(t),
                              exTalk(d,n,t,speaker,topics).
(a) talk

R1: people(d,person) :- docs(d), namePatterns(n), exPerson(d,n,person).
R2: conferences(d,conference) :- docs(d), conferencePatterns(f),
                                 exConference(d,f,conference).
R3: chairTypes(d,chairType) :- docs(d), chairTypePatterns(c),
                               exChairType(d,c,chairType).
R4: chair(d,person,conference,chairType) :- people(d,person),
                                            conferences(d,conference),
                                            chairTypes(d,chairType),
                                            isBefore(conference,chairType),
                                            isBefore(chairType,person),
                                            spanChar(chairType,person) < 20.
(b) chair

R1: advisors(d,advisor) :- docs(d), namePatterns(n), exAdvisor(d,n,advisor).
R2: advisees(d,advisee) :- docs(d), namePatterns(n), exAdvisee(d,n,advisee).
R3: nameLists(d,nameList) :- docs(d), nameListPatterns(l), exNameList(d,l,nameList).
R4: coauthors(d,author1,author2) :- docs(d), nameLists(d,nameList),
                                    namePatterns(n1), namePatterns(n2),
                                    exCoauthor(nameList,n1,n2,author1,author2).
R5: workOn(d,person,topics) :- docs(d), namePatterns(n), topicPatterns(t),
                               exWorkOn(d,n,t,person,topics).
R6: advise(d,advisor,advisee,topics) :- advisors(d,advisor), advisees(d,advisee),
                                        coauthors(d,author1,author2),
                                        approMatch(advisor,author1),
                                        approMatch(advisee,author2),
                                        distChar(author2,person) = 0.
(c) advise

R1: careerSections(d,careerSection) :- docs(d), exCareerSection(d,careerSection).
R2: blockbuster(d,movie) :- careerSections(careerSection), moviePatterns(m),
                            exMovie(careerSection,m,movie).
(d) blockbuster

R1: introParagraphs(d,intro) :- docs(d), exIntro(d,intro).
R2: actors(d,actor) :- introParagraphs(d,intro), namePatterns(n), exActor(intro,n,actor).
R3: careerSections(d,careerSection) :- docs(d), exCareerSection(d,careerSection).
R4: movies(d,movie) :- careerSections(d,careerSection), moviePatterns(m),
                       exMovie(careerSection,m,movie).
R5: play(d,actor,movie) :- actors(d,actor), movies(d,movie).
(e) play

R1: awardSections(d,awardSection) :- docs(d), exAwardSection(d,awardSection).
R2: bioSections(d,bioSection) :- docs(d), exBioSection(d,bioSection).
R3: awardItems(d,awardItem) :- awardSections(d,awardSection),
                               exAwardItem(awardSection,awardItem).
R4: movieAwards(d,movie,award) :- awardItems(d,awardItem),
                                  moviePatterns(m), awardPatterns(a),
                                  exAward(awardItem,m,a,movie,award).
R5: actors(d,actor) :- bioSections(d,bioSection), namePatterns(n),
                       exActor(bioSection,n,actor).
R6: roles(d,movie,role) :- docs(d), moviePatterns(m), exRole(d,m,movie,role).
R7: award(d,actor,movie,role,award) :- roles(d,movie,role),
                                       movieAwards(d,movie1,award),
                                       match(movie,movie1),
                                       actors(d,actor).
(f) award

Figure A.1 xlog Programs for the 6 IE tasks in Figure 3.7.(b). IE blackboxes are in bold.
R1: sentences(d,sentence) :- docs(d), exSentence(d,sentence).
R2: names(d,name) :- sentences(d,sentence), containingName(sentence),
exName(sentence,name).
R3: birthNames(d,birthName) :- sentences(d,sentence), containingBirthName(sentence),
exBirthName(sentence,birthName).
R4: birthDates(d,birthDate) :- sentences(d,sentence), containingBirthDate(sentence),
exBirthDate(sentence,birthDate).
R5: notableRoles(d,notableRoles) :- sentences(d,sentence), containingNotableRole(sentence),
exNotableRoles(sentence,notableRoles).
R6: actor(d,name,birthName,birthDate,notableRoles) :- names(d,name),
birthNames(d,birthName),
birthDates(d,birthDate),
notableRoles(d,notableRoles).
Figure A.2 The xlog program of “actor”. IE blackboxes are in bold.