
1 Extracting Information from Text

AMIT BAGGA†, JOYCE CHAI‡, and ALAN BIERMANN

Dept. of Computer Science, Duke University, Durham, NC 27708-0129

1.1 INTRODUCTION

Information extraction technologies seek to aid a person in the efficient scanning of large volumes of text and the discovery of significant facts ([12]). The traditional approach has been to specialize on a narrow semantic domain and to construct a processor aimed at retrieving information in that domain only. Researchers have generally believed that the information extraction problem is so difficult that each such system must be hand constructed because of the many peculiarities of the individual domains. Examples are those featured by the Message Understanding Conferences ([13], [14], [15]): terrorist attacks, management transitions in industries, and the launching of space vehicles.

In order to address the problem of creating such systems economically, some researchers have looked at techniques for automatically or semi-automatically constructing lexicons ([23], [24], [26], [30]) or extraction rules for the domain ([7], [28], [20], [16], [17], [25]). Most of these techniques have applied machine learning approaches to learn rules based on texts that have been semantically annotated. Either pre-annotation of text is done by a human expert, or the rules are post-processed by a human expert.

The research reported here proposes that a domain independent information extraction system can be built if it can include a self modifying feature that enables it to automatically adapt to new domains. The theory is that the user should know the peculiarities involved and be able to quickly train the system to gather the specific target facts. This is done by scanning a series of example articles using a special text processing interface and allowing the user to lead the system through the desired retrieval steps.

† Currently at General Electric Corporate Research and Development, Niskayuna, NY
‡ Currently at IBM T. J. Watson Research Center, Yorktown Heights, NY


Then the system acquires the patterns that are to be used in the retrieval and applies them to any large database of text to extract the information of interest.

As an example, suppose a person wishes to collect information about programming jobs offered by banks. They might use the following article to demonstrate to the system a desired retrieval.

Springfield’s largest bank, First Savings and Loan, is seeking high level programmers with five or more years of C++ development experience. This respected institution is offering three-year contracts with competitive salaries and liberal benefits.

Specifically, they can apply the system's internal parser to break the text into significant phrases:

Springfield’s largest bank
First Savings and Loan
is seeking
high level programmers
with five or more years of C++ development experience
This respected institution
...
etc.

Then they can use the system-provided commands to indicate that the pattern "bank - is seeking - programmers - with experience" is going to be useful for the desired retrievals. Finally, they can invoke an automatic generalization routine to loosen the constraints on the pattern so that it can fire on a variety of specific wordings that may vary greatly from the original. For example, the user might want the system to extract some or all of these:

IBM Corporation has a need for computer scientists . . .
The City of Boston is hiring computer analysts with Java experience . . .
Several computer companies in the area need expertise in C++ and Java . . .
Oracle is advertising for experts for the creation of new database products . . .
First Federal Savings is looking for programmers to code security systems . . .

The creation of such a general purpose extraction system requires the solution of many problems. For example, the word "bank" in the example refers to a financial institution and not to the edge of a river. Any pattern designed to select financial institutions runs the risk of firing on river banks unless a mechanism is included to distinguish them. This is called word sense disambiguation and it must be done automatically by the information extraction system. Another problem relates to facts that are separated in the discourse across clauses or different sentences. Suppose the user is interested, in the above example, in the pattern "banks-offering-contracts", which requires that the system discover that the phrases with head nouns "bank" and "institution" refer to the same entity. This is called noun phrase coreference and it is another problem that must be solved automatically in the information extraction system. In the extreme case, the needed information may be spread across several different articles separated in time and space and the system may be expected to gather it together. Finally, the problem of doing the generalization must be solved.


How does one discover that First Federal Savings is a bank, which is an institution that the user might be interested in? How does the system know to fire on words like "computer scientist", "analyst", "expert", "worker", or even "people" when it has seen only the token "programmer"? Furthermore, when multiple documents all have facts about "John Smith", how does the system know whether they are actually talking about the same person or different persons with the same name? These problems and related ones constitute the core areas of study needed to create the domain independent system described above. Their solution is the subject of this chapter.

Specifically, we first describe in this chapter a new trainable information extraction system ([8]). With our approach, any user, based on his or her interests, can train the system on different domains. The system learns from the training and automatically outputs the structured target information. The system uses many powerful mechanisms to create useful generalizations from the user's inputs. For example, if the user specifies one example of a desired extraction, our system automatically tries generalizations of the terms and every permutation of the ordering of significant words. The system gives the user the opportunity to judge the level of the generalizations being made and to tune them to his or her particular needs. Where modifications of the rules are deemed successful by the user, those rules are incorporated into the extraction set. Following the description of the information extraction system, we will describe a system that resolves both noun phrase coreference and event coreference across documents. Although the two systems are currently independent, we plan to use the output of the cross-document coreference system to improve the performance of the information extraction system. The coreference-annotated output can be used to merge entities and events extracted across documents, thereby making the IE system's task easier by providing multiple documents regarding an entity or event of interest. This methodology adds tremendous power to an extraction system by enabling it to gather facts from disparate sources and bring them together to make the retrieval more complete.

1.2 INFORMATION EXTRACTION SYSTEM

1.2.1 System Overview

Our Trainable InforMation Extraction System (TIMES) includes four major sub-processes: Tokenization, Lexical Processing, Partial Parsing, and Rule Learning and Generalization [1]. The general structure of TIMES is shown in Figure 1.1¹. The first stage of processing is carried out by the Tokenizer, which segments the input text into sentences and words. Next, the Lexical Process acquires lexical syntactic/semantic information for the words. To achieve this, a Preprocessor is used for semantic classification. The Preprocessor can identify special semantic categories such as email and web addresses, file and directory names, dates, times, and dollar

¹ Copyright AAAI.


[Figure: block diagram of the three processing phases (Training, Automated Rule Creation, Scanning). Each phase runs Tokenization, Lexical Processing (drawing on Semantic Classification, the CELEX Database, and WordNet), and Partial Parsing; the training phase feeds the Word Sense Training Interface and Rule Learning and Generalization, which produce Extraction Rules and WSD Rules; Rule Application then yields the Target Information.]

Fig. 1.1 System Overview

amounts, telephone numbers, zip codes, cities, states, countries, names of companies, and many others. The syntactic information is retrieved from the CELEX database² and the semantic information is from WordNet [22]. Following the Lexical Processing, a Partial Parser, which is based on a set of finite-state rules, is applied to produce a sequence of non-overlapping phrases as output. It identifies noun phrases (NG), verb phrases (VG), and prepositions (PG). The parser uses 14 finite-state rules to identify noun phrases, 7 rules to identify verb phrases, and one rule to identify prepositions. Verb phrases are further categorized as Verb Active, Verb Passive, Verb Be, Verb Gerund, and Verb Infinitive³. The last word of each phrase is identified as its headword. The phrases, together with their syntactic and semantic information, are used in Rule Learning and Generalization. Based on the training examples, the system automatically acquires and generalizes a set of rules for future use.

The system has three running modes which implement training, automated rule creation, and scanning, respectively. In the training phase, the user is required to train the system on sample documents based on his/her interests. That is, the user specifically points out the desired information in the text. Because the WordNet hierarchy depends on word senses, in the training phase the system also requires some semantic tagging by the user. This phase tends to use minimal linguistic and domain knowledge so that it can be used by the casual user. The rule creation phase builds a set of useful rules, including both extraction rules for information of interest and word sense disambiguation (WSD) rules for sense identification.

² CELEX was developed by several universities and institutions in the Netherlands, and is distributed by the Linguistic Data Consortium.
³ We would like to thank Jerry Hobbs for providing these rules.


The scanning phase applies the learned rules to any new body of text in the domain.

We use Precision, Recall, and F-measure [29] to give a quantitative evaluation of the system. The Recall score measures the ratio of information correctly extracted against all relevant information in the texts. The Precision score measures the ratio of correct information extracted against all information extracted. To define Precision and Recall more precisely: if n is the number of entities identified as relevant by the system, m is the number of entities which are relevant in the text, and a is the number of entities correctly identified by the system, then recall = a/m and precision = a/n. It is difficult to evaluate the system since the measures of recall and precision are often equally important yet negatively correlated. The F-measure combines the measures of precision and recall into a single measure. Recall and precision can be given relative weights in the calculation of the F-measure, giving it the flexibility to be used for different applications. The formula for calculating the F-measure is:

F = \frac{(\beta^2 + 1.0) \cdot P \cdot R}{\beta^2 \cdot P + R}          (1.1)

where P is precision, R is recall, and β is the relative importance given to recall over precision. If recall and precision are of equal weight, β = 1.0.
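As a concrete illustration, the small sketch below computes precision, recall, and the F-measure of Equation (1.1) from the counts n, m, and a defined above; the function name and the example numbers are ours, not part of TIMES.

```python
def evaluate(n, m, a, beta=1.0):
    """Precision, recall, and F-measure as defined in Section 1.2.1.

    n: entities the system identified as relevant
    m: relevant entities actually present in the text
    a: entities correctly identified by the system
    beta: relative importance given to recall over precision
    """
    precision = a / n if n else 0.0
    recall = a / m if m else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = ((beta ** 2 + 1.0) * precision * recall) / (beta ** 2 * precision + recall)
    return precision, recall, f

# Hypothetical counts: 8 entities extracted, 10 relevant, 6 correct.
print(evaluate(n=8, m=10, a=6))  # -> (0.75, 0.6, 0.666...)
```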

1.2.2 WordNet

In contrast with most IE systems, which rely heavily on hand-crafted domain-specific knowledge, TIMES acquires lexical semantic information from WordNet.

WordNet was developed at Princeton University's Cognitive Science Laboratory ([22]). It provides a tool to search dictionaries conceptually, rather than merely alphabetically. Version 1.5 contains approximately 95,600 different word forms organized into some 70,100 word meanings, or sets of synonyms. The most useful feature of WordNet for the NLP community is its attempt to organize lexical information in terms of word meanings (or concepts) and semantic relations. A meaning (or concept) is represented by a synset, which is a set of synonymous words representing a certain meaning. An example of a synset is {girl, miss, missy, gal, young lady, young woman, fille}. A semantic relation is a relation between meanings. For instance, for nouns, there are "part of", "is a", "member of", etc. relationships between concepts.

Consider the following example from [22]: the synsets {board, plank} and {board, committee} each serve as unambiguous designators of two different meanings (or senses) of the noun board. Also associated with each synset is a short English description (gloss) of the meaning expressed by the synset. So the synset {board} (a person's meals, provided regularly for money), consisting only of the noun board itself, can be distinguished from the other senses by the gloss associated with it.

In addition, WordNet attempts to organize the different senses of a word based on the frequency of usage of the senses. For example, a listing of all the synonyms of board yields 8 different synsets, each designating a different sense of the noun. The most commonly used sense (Sense 1) is {board, plank}, while the least commonly used sense (Sense 8) is {board} (a flat portable surface (usually rectangular) designed for board games).



An important relationship present in WordNet, and used extensively in our system, is the hyponymy/hypernymy (the subset/superset, or ISA) relationship. A concept represented by the synset {x, x1, . . . } is said to be a hyponym of the concept represented by the synset {y, y1, . . . } if native speakers of English accept sentences constructed from such frames as "An x is a (kind of) y." If this holds, then we can also say that {y, y1, . . . } is a hypernym of {x, x1, . . . } ([22]). For example, {maple} is a hyponym of {tree}, and {tree} is a hyponym of {plant}. Hyponymy is transitive and asymmetrical, and it generates a hierarchical semantic structure in which a hyponym is said to be below its superordinate. Such hierarchical representations are also called inheritance systems: a hyponym inherits all the features of the more generic concept and adds at least one feature that distinguishes it from its superordinate and from any other hyponyms of that superordinate. So, maple inherits features of its superordinate, tree, but is distinguished from other trees by the hardness of its wood, the shape of its leaves, etc.
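For readers who want to inspect these relations directly, the sketch below walks the hypernym hierarchy with the NLTK WordNet reader; NLTK is our choice of interface here (TIMES used its own access to WordNet 1.5), and the sense index noted in the comment is version-dependent.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# The tree sense of "maple" (in WordNet 3.x this is maple.n.02; sense 1 is the wood).
maple = wn.synset('maple.n.02')
print(maple.lemma_names())           # the synset's synonymous words
for path in maple.hypernym_paths():  # chains of superordinate concepts
    print(' -> '.join(s.name() for s in path))
# The path passes through tree.n.01 and plant.n.02, mirroring the
# {maple} ISA {tree} ISA {plant} example in the text.
```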

In TIMES, WordNet is used to provide the sense information during training. Based on the training, the system automatically generates a set of rules to disambiguate word senses ([10]). We found that, in most cases, in a specific domain, senses tend to remain the same. Furthermore, the most frequently used senses are applied very often. For example, we trained on 24 articles with 1129 headwords from the "triangle.jobs" domain, and found that 91.7% of the headwords were used in sense one of WordNet.

1.2.3 The Training Phase

TIMES provides a convenient training interface for the user. Through this interface, the user identifies the information of interest (i.e., the target information). One issue in training is the tradeoff between the user's effort and the system's learning cost. TIMES requires some minimum training from the user, which is to identify the target information for the training articles and assign the correct sense to words whose intended sense is not sense one in WordNet. (Sense one is the most frequently used sense in WordNet. The training interface provides sense definitions so that the user knows which sense to choose.) By default, the system will assign sense one to the headwords if no other specific sense is given by the user. If the user performs the minimum training (i.e., identifying target information and assigning senses), then the system will learn the rules based on every phrase in the training sentences. Since the system will try all kinds of combinations of those phrases, the resulting learning cost will be relatively high. If, in addition to the minimum training, the user decides to select the important phrases from the training examples, then the rules can be learned based only on the important phrases. Thus the training effort is increased, and the system's learning cost is reduced. Furthermore, if the user has sufficient expertise and decides to create rules for the system (the rules could be generated from the training articles or from the user's own knowledge), then more training effort is required and the system's learning cost is further reduced. In general, to make the system easily and quickly adaptable to a new domain, computational linguists and domain experts can apply their expertise if they prefer; casual users can provide the minimum training and rely on the system's learning ability.



Table 1.1 Internal Structure for a Training Example. NG represents noun phrases; VG represents verb phrases; and PG represents prepositions

Important Phrases    Target      Sem. Type      Headword     Syn. Cat.   Sen.
IBM                  COMPANY     company type   company      NG          1
has                  none        none           has          VG          1
a need               none        none           need         NG          1
for                  none        none           for          PG          1
Java programmers     POSITION    none           specialist   NG          1
at                   none        none           at           PG          1
Raleigh, NC          LOCATION    city type      city         NG          1

For example, suppose the training sentence is "IBM has a need for Java programmers to work at Raleigh, NC for one month." The user of our system will employ the interface to indicate the key target information in this sentence which is to be extracted. The target information is: companies which have job openings (we use COMPANY to represent this target), positions available (POSITION), and locations of those positions (LOCATION). Based on the user's input, the system internally generates a record as shown in Table 1.1. In this table, the first column lists seven important phrases from the sentence; the second column is the target information specified by the user. If a phrase is identified as a type of target information, then this phrase is called a target phrase. For example, "IBM," "Java programmers," and "Raleigh, NC" are three target phrases. The third column lists the specific semantic types classified by the Preprocessor for each important phrase. The fourth column lists the headwords for the important phrases. If a phrase can be identified as a special semantic type, then the headword is the name for that semantic type; otherwise, the headword is the last word in the phrase. The fifth column lists the syntactic categories identified by the Partial Parser. The last column is the sense number for the headwords. From the information in Table 1.1, the system will create an initial template rule, as shown in Figure 1.2, for doing extraction. Furthermore, the system applies a supervised learning algorithm to generate useful rules for disambiguating word senses based on different contexts ([10]).


S(X1, {company}, COMPANY) ∧ S(X2, {has}, none) ∧ S(X3, {need}, none) ∧ S(X4, {for}, none)
∧ S(X5, {programmer}, POSITION) ∧ S(X6, {at}, none) ∧ S(X7, {city}, LOCATION)
→ FS(X1, COMPANY), FS(X5, POSITION), FS(X7, LOCATION)

Fig. 1.2 Initial Template Rule / Most Specific Rule

1.2.4 Specific Rules

In general, rules in the system are pattern-action rules. The pattern, defined by the left hand side (LHS) of a rule, is a conjunction (expressed by ∧) of subsumption functions S(X, C, target(C)). X is instantiated by a new phrase when the rule is applied; C is the concept corresponding to the headword of an important phrase in the training sentence; and target(C) is the type of target information identified for C. The action in the right hand side (RHS) of a rule, FS(X, target(C)), fills a template slot, or in other words, assigns the type of target information target(C) to the phrase X.

The subsumption function S(X, C, target(C)) essentially looks for subsumption of concepts. It returns true if the headword of X is subsumed by the concept C. If all subsumption functions return true, then the RHS actions take place to extract X as a phrase of type target(C). In the following sections, the subsumption function will be referred to as a rule entity.

For example, the initial template rule (i.e., the most specific rule) in Figure 1.2 says that if a pattern of phrases X1, X2, ..., X7 is found in a sentence such that the headword of X1 is subsumed by (or equal to) company, the headword of the second phrase X2 is subsumed by has, and so on, then one has found the useful fact that X1 is a COMPANY, X5 is a POSITION, and X7 is a LOCATION. (In the most specific rule, the headwords of the important phrases are used directly as the concepts in the subsumption functions. They are referred to as specific concepts.) Apparently, the initial template rule is very specific and has tremendous limitations. Since the exact same pattern rarely occurs in unseen data, the initial template rule is not very useful. We need to generalize these specific rules.
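A minimal sketch of how such a pattern-action rule can be represented and applied is shown below. The class and function names are illustrative, the subsumption test uses NLTK's WordNet interface and skips the word-sense disambiguation step, and proper nouns such as "IBM" would additionally need the Preprocessor's virtual links described later.

```python
from typing import List, Optional, Dict
from dataclasses import dataclass
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

@dataclass
class RuleEntity:
    """One subsumption test S(X, concept, target)."""
    concept: str            # WordNet synset name, e.g. 'company.n.01'
    target: Optional[str]   # e.g. 'COMPANY', or None for non-target entities

def subsumes(concept_name: str, headword: str) -> bool:
    """True if some noun sense of `headword` equals, or lies below,
    the concept in the WordNet hypernym hierarchy."""
    concept = wn.synset(concept_name)
    for sense in wn.synsets(headword, pos=wn.NOUN):
        if sense == concept or concept in sense.closure(lambda s: s.hypernyms()):
            return True
    return False

def apply_rule(rule: List[RuleEntity], headwords: List[str]) -> Dict[str, str]:
    """Match the rule entities, in order, against the phrase headwords of one
    sentence; on success, fill one template slot per target entity (the FS action)."""
    if len(headwords) != len(rule):
        return {}
    slots = {}
    for entity, headword in zip(rule, headwords):
        if not subsumes(entity.concept, headword):
            return {}
        if entity.target:
            slots[entity.target] = headword
    return slots

# A three-entity rule in the spirit of Figure 1.3.
rule = [RuleEntity('company.n.01', 'COMPANY'),
        RuleEntity('need.n.01', None),
        RuleEntity('programmer.n.01', 'POSITION')]
print(apply_rule(rule, ['company', 'need', 'programmer']))
```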

1.2.5 Rule Generalization

The rule generalization is a two dimensional model that combines both syntactic (horizontal) and semantic (vertical) generalization ([11]).

1.2.5.1 Syntactic Generalization   From our early experiments, we noticed that the optimum number of entities required in the rules varied for different types of target information. In the job advertisement domain, when a token is classified as the dollar amount semantic type, it is the target salary information 95% of the time. A rule with one rule entity suffices. Rules with more entities are too specific and will lower the performance.


1. S(X1, {company}, COMPANY) ∧ S(X2, {need}, none) ∧ S(X3, {programmer}, POSITION) → FS(X1, COMPANY)
2. S(X1, {company}, COMPANY) ∧ S(X2, {need}, none) ∧ S(X3, {programmer}, POSITION) → FS(X3, POSITION)

Fig. 1.3 An Example of Syntactically Generalized Rules

1. S(X1, {need}, none) ∧ S(X2, {programmer}, POSITION) ∧ S(X3, {company}, COMPANY) → FS(X3, COMPANY)
2. S(X1, {need}, none) ∧ S(X2, {programmer}, POSITION) ∧ S(X3, {company}, COMPANY) → FS(X2, POSITION)

Fig. 1.4 An Example of Permutation Rules

For example, in Figure 1.2, by removing the second, the fourth, the sixth, and the seventh entities, the most specific rule becomes the two rules with three entities shown in Figure 1.3. Since two target phrases remain and, by our convention, each generalized rule corresponds to only one target phrase, two rules are necessary to capture the two types of target information. When these two rules are applied to unseen data, the first one will extract the target information COMPANY and the second one will extract POSITION. On the other hand, since the number of constraints on the LHS is reduced, more target information will be identified. By this observation, removing entities from specific rules results in syntactically generalized rules.

Furthermore, syntactic generalization is also aimed at tackling paraphrase problems. For example, by reordering entities in the rule generated from the training sentence "Fox head Joe Smith ...", it can further process new sentence portions such as "Joe Smith, the head of Fox, ....", "The head of Fox, Joe Smith,...", "Joe Smith, Fox head,...", etc. This type of syntactic generalization is optional in the system. This technique is especially useful when the original specific rules are generated from the user's knowledge.

Therefore, syntactic generalization is designed to learn the appropriate number of entities, and the order of entities, in a rule. By reordering and removing rule entities, syntactic generalization is achieved. More precisely, syntactic generalization is attained by a combination function and a permutation function. The combination function selects a subset of rule entities in the most specific rule to form new rules. The permutation function re-orders rule entities to form new rules.

Combination Function   If a training sentence has n important phrases, then there are n corresponding rule entities, e1, ..., en (each ei is a subsumption function). The combination function selects k rule entities and forms the LHS of the rule as e_i1 ∧ e_i2 ∧ ... ∧ e_ik (the order of the entities is the same as the order of the corresponding phrases in the training sentence). At least one of the k entities corresponds to a target phrase. If i (1 < i ≤ k) entities are created from i target phrases, then i rules will be generated. These rules have the same LHS and different RHS, each corresponding to one type of target information.


1. S(X1, {group, ...}, COMPANY) ∧ S(X2, {need, ...}, none) ∧ S(X3, {professional, ...}, POSITION) → FS(X1, COMPANY)
2. S(X1, {organization, ...}, COMPANY) ∧ S(X2, {need, ...}, none) ∧ S(X3, {engineer, ...}, POSITION) → FS(X3, POSITION)

Fig. 1.5 An Example of Semantically Generalized Rules

Rules created by the combination function are named combination rules. For example, in Figure 1.3, k = 3 and i = 2; therefore, two rules are necessary to identify two different types of target information.

Permutation Function   The permutation function generates new rules by permuting the rule entities in the combination rules. This function creates rules for processing paraphrases. Rules created by the permutation function are named permutation rules. For example, the permutation function can generate the rules in Figure 1.4 by re-ordering the rule entities in Figure 1.3. While the rules in Figure 1.3 are created based on the sentence "IBM has a need for Java programmers ...", the permutation rules can process a paraphrased sentence such as "There is a need for Java programmers at IBM."
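The following sketch folds the combination and permutation functions into one generator using itertools; the entity representation is the simplified (concept, target) pair from the earlier sketch, and the cap on rule length stands in for the system's predefined maximum N.

```python
from itertools import combinations, permutations

def syntactic_generalizations(entities, max_entities=3):
    """Yield (lhs, rhs) rule skeletons from the ordered rule entities of one
    training sentence.  Each entity is a (concept, target) pair, target being
    None for non-target phrases.

    Combination rules keep the original phrase order; permutation rules reorder
    the selected entities to cover paraphrases.  One rule is produced per target
    entity on the LHS, as in Figures 1.3 and 1.4.
    """
    for k in range(1, max_entities + 1):
        for subset in combinations(entities, k):
            # at least one selected entity must correspond to a target phrase
            if not any(target for _, target in subset):
                continue
            for lhs in set(permutations(subset)):  # includes the original order
                for concept, target in subset:
                    if target:
                        yield lhs, (concept, target)

# Rule entities for "IBM has a need for Java programmers ... at Raleigh, NC ..."
entities = [('company', 'COMPANY'), ('has', None), ('need', None), ('for', None),
            ('programmer', 'POSITION'), ('at', None), ('city', 'LOCATION')]
rules = list(syntactic_generalizations(entities))
print(len(rules), rules[0])
```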

1.2.5.2 Semantic Generalization   Syntactic generalization deals with the number and the order of the rule entities. However, each rule entity is still very specific. For a rule entity S(X, C, target(C)), semantic generalization is designed to replace C with a more general concept. Thus it will cover more semantically related instances.

A concept is defined in WordNet as a set of synonyms (a synset). For a word w, the corresponding concept in WordNet is represented by {w, w1, ..., wn}, where each wi is a synonym of w. Given a word w, its part-of-speech, and its sense number, the system can locate a unique corresponding concept in the WordNet hierarchy if w exists in WordNet. For a word, especially a proper noun, if the Preprocessor can identify it as a special semantic type expressed as a concept in WordNet, then a virtual link is created to make the concept of this special noun a hyponym of that semantic type. For example, "IBM" is not in WordNet; however, it is categorized as a kind of {company}. The system first creates the concept of "IBM" as {IBM}, then creates a virtual link between {IBM} and {company}. For any other word w, if it is not in WordNet and it is not identified as any semantic type, then the concept of w is {w} and is virtually put into WordNet, with no hierarchical structure. Therefore, every headword of an important phrase has a corresponding concept in WordNet.

Let C be a concept in WordNet. The hypernym hierarchical structure provides a path for locating the superordinate concept of C, and it eventually leads to the most general concept above C. (WordNet is an acyclic graph rather than a strict tree, so a synset might have more than one hypernym. However, this situation does not happen often. In case it happens, the system selects the first hypernym path.)

Therefore, semantic generalization acquires the generalization for each rule entity by replacing the specific concept with a more general WordNet concept. For example, the rules in Figure 1.3 could be generalized to the rules in Figure 1.5. In our early work ([9]), we applied a statistical Generalization Tree model to automatically determine the optimal semantic generalization degree based on the user's feedback. This approach made it possible for the system to adapt to the user's needs.


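A sketch of the concept-replacement step is given below, again through NLTK's WordNet reader; the choice of the first hypernym path follows the text, while the function name and the singleton-concept fallback for out-of-WordNet words are our simplifications.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def generalized_concepts(word, sense=1, h_max=1):
    """Return the concepts obtained by generalizing `word` (at the given sense
    number) from degree 0 (its own synset, i.e. its synonyms) up to h_max
    levels along the first hypernym path."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return [{word}]          # out-of-WordNet word: keep a singleton concept
    synset = synsets[sense - 1]
    chain = [synset]
    for _ in range(h_max):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]    # follow the first hypernym path only
        chain.append(synset)
    return [set(s.lemma_names()) for s in chain]

# Degree-1 generalization of "programmer": its synonyms plus its direct hypernym.
for level, concept in enumerate(generalized_concepts('programmer', h_max=1)):
    print(level, concept)
```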

1.2.5.3 Two Dimensional Generalization   The two-dimensional generalization model is a combination of semantic (vertical) generalization and syntactic (horizontal) generalization. Our method applies a brute-force algorithm, which performs semantic generalization on top of the syntactically generalized rules generated from the training set, then applies those rules to the training set again. Based on the training examples and a threshold, the system selects the useful rules.

The relevancy rate rel(ri) for rule ri is defined as the percentage of the information extracted by ri that is correct. A threshold is predefined to control the process. If rel(ri) is greater than or equal to the threshold, then ri will be put in the rule base for future use.

The procedure of generating generalized rules is the following:

• Predefine N, the maximum number of entities allowed in a rule.

• For each target phrase in the training sentence, based on the important phrases in the sentence, generate all combination rules with the number of entities ranging from one to N, as well as their permutation rules.

• For every rule, generate all possible semantically generalized rules by replacing each entity with more general entities at different degrees of generalization.

• Apply all rules to the training set. Based on the training examples, compute the relevancy rate for each rule.

• Select rules with relevancy rates above the defined threshold. If two rules r1 and r2 both have n entities, and each entity of r1 corresponds to a more general or the same concept as the corresponding entity of r2, then the system will choose r1 for future use (provided rel(r1) is greater than or equal to the threshold) and eliminate r2.

• Sort the rules with the same number of entities to avoid rule repetition.

By following this procedure, the system generates a set of useful rules that are both syntactically and semantically generalized. Those rules will be applied to unseen documents. First, the system applies the rules with N entities to an unseen sentence. If some matches are found, the system identifies the target information as that extracted by the most rules and then processes the next sentence. If there is no match, the system applies rules with fewer entities until either there are some matches or all the rules (including those with one entity) have been applied. By doing so, the system first achieves the highest precision and then gradually increases recall without too much cost in precision.
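The sketch below mirrors the relevancy-rate filter and the scanning-phase fallback just described. The matcher and the answer key are passed in as callables because they depend on the rule representation; the function names are ours.

```python
from collections import Counter

def select_rules(candidate_rules, training_docs, extract, is_correct, threshold=0.8):
    """Keep a candidate rule only if its relevancy rate -- the fraction of what it
    extracts on the training set that is correct -- meets the threshold.
    `extract(rule, doc)` returns extracted facts; `is_correct(fact)` consults the
    user-supplied answer key."""
    kept = []
    for rule in candidate_rules:
        extracted = [fact for doc in training_docs for fact in extract(rule, doc)]
        if extracted:
            relevancy = sum(is_correct(f) for f in extracted) / len(extracted)
            if relevancy >= threshold:
                kept.append(rule)
    return kept

def scan_sentence(rules_by_size, sentence, extract, max_entities):
    """Apply the rules with the most entities first and fall back to smaller
    rules only when nothing fires; among the matches, return the answer
    extracted by the largest number of rules."""
    for k in range(max_entities, 0, -1):
        votes = Counter()
        for rule in rules_by_size.get(k, []):
            for fact in extract(rule, sentence):
                votes[fact] += 1
        if votes:
            return votes.most_common(1)[0][0]
    return None
```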

1.2.6 Experiments

Our experiments were conducted on the domain of the triangle.jobs newsgroup. Eight types of target information, listed below, were extracted.


[Figure: bar chart of precision, recall, and F-measure (%) for one-entity, two-entity, three-entity, and all rules]

Fig. 1.6 Effect of Rules on Overall Performance in Pure Syntactic Generalization

• COMPANY (COM.): The name of the company which has job openings.

• POSITION (POS.): The name of the available position.

• SALARY (SAL.): The salary, stipend, or compensation information.

• LOCATION (LOC.): The state/city where the job is located.

• EXPERIENCE (EXP.): Years of experience.

• CONTACT (CON.): The phone number or email address for contact.

• SKILLS (SKI.): The specific skills required, such as programming languages, operating systems, etc.

• BENEFITS (BEN.): The benefits provided by the company, such as health and dental insurance, etc.

There were 24 articles for training and 40 articles for testing. These 64 articles were randomly selected. The training articles were grouped into three training sets. The first training set contained 8 articles; the second contained 16 articles, including the ones in the first set; the third training set consisted of all 24 articles. In all of the experiments, the relevancy rate threshold was set to 0.8 unless specified otherwise.

1.2.6.1 Syntactic Generalization   The first experiment tested the impact of pure syntactic generalization. The system only generated combination rules and permutation rules, without semantic generalization. Figure 1.6 shows that rules with pure syntactic generalization (represented by "All") achieve better performance (in terms of F-measure) than rules with a fixed number of one, two, or three entities.

In particular, different types of target information require different numbers of entities in the rules. As shown in Figure 1.7, for both the COMPANY and EXPERIENCE target information, no rule with one entity was learned from the training set.


[Figure: F-measure (%) per target type (COM, POS, SAL, LOC, EXP, CON, SKI, BEN) for one-entity, two-entity, three-entity, and all rules]

Fig. 1.7 F-measure on Single Fact Extraction in Pure Syntactic Generalization

[Figure: precision, recall, and F-measure (%) plotted against the limit on semantic generalization (0 to 6)]

Fig. 1.8 The Effect of Limiting Semantic Generalization on Training Data

For the target information SALARY, the rules with one entity performed much better (in terms of F-measure) than those with two or three entities. For the target information LOCATION, the rules with two entities performed better than those with one or three entities. For the target information COMPANY, the rules with three entities performed the best. Thus, for different types of information, the best extraction requires different numbers of entities in the rules. However, the appropriate number of entities for each type of information is not known in advance. If rules with a fixed number of entities are applied, a less than optimal number of entities will cause a significant loss in performance. By applying our approach, for each type of information, the performance is either better (as in the SALARY case) or only slightly worse (as in the COMPANY case). The overall performance of the application algorithm is better than that achieved by rules with a fixed number of entities.


[Figure: precision, recall, and F-measure (%) plotted against the limit on semantic generalization (0 to 6)]

Fig. 1.9 The Effect of Limiting Semantic Generalization on Testing Data

1.2.6.2 Two Dimensional Generalization   When semantic generalization is added to the model, the evaluation of the effectiveness of WordNet becomes important. In this experiment, the system generated rules both syntactically and semantically. In those rules, some entities were only generalized to include the synonyms of the specific concept from the training sentence; some were generalized to direct hypernyms (one level above the specific concept in the conceptual hierarchy); and some were generalized to various higher degrees. The question is, even though the rules have been learned from the training examples, are they reliable? Do we need to put some upper bound on the generalization degree of the entities in order to achieve good performance? To answer this, we used hmax as the limit on the degree of generalization for each entity. We modified the rules to generalize each entity to at most hmax levels above the specific concept in the WordNet hierarchy. We applied those rules (the threshold was 1.0) with different limits on semantic generalization to the 24 training documents and 40 testing documents.

As shown in Figure 1.8, for the training set, with no upper bound on the semantic generalization degree in the rules, the system had an overall 86.6% F-measure, with very high (92.7%) precision. When the various limits on the generalization degree were applied, the performance remained about the same. This indicated that the rules were indeed learned from the training examples, and that these rules sufficiently represented the training examples.

We then applied the same set of rules to the testing data. As shown in Figure 1.9, without an upper bound on the semantic generalization, the system only achieved an F-measure of 54.2%. However, when the upper bound on the degree of generalization was hmax = 0, which only generalized the entities to include the synonyms of the specific concepts, the overall performance was about 70%. When hmax = 1, the overall performance was also about 70%. These results indicate that restricting the semantic generalization degree can enhance the performance. WordNet hypernyms are useful in achieving high recall, but at a high cost in precision.


[Figure: precision, recall, and F-measure (%) plotted against the number of training documents, for syntactic generalization alone (Syn) and for two-dimensional generalization (Syn-Sem)]

Fig. 1.10 Performance vs. Training Effort in Both Syntactic Generalization and Two Dimensional Generalization

WordNet synonyms and direct hypernyms are particularly useful in balancing the tradeoff between precision and recall and thus improving the overall performance.

We compared the experimental results from the rules with pure syntactic generalization and the rules with two dimensional generalization (with a limit of hmax = 1 on the semantic generalization). As shown in Figure 1.10, when the training set is small, semantic generalization can be particularly helpful: the F-measure increased by about 10% when the training set had only eight articles. The F-measure from training on 16 articles is about the same as that from training on 24 articles, so no more training is necessary. This result implies that, for the two-dimensional rule generalization approach, there is a performance upper bound in this domain. To break this upper bound, generalizing only the concepts and the orders of the rules may not be enough; we should pursue other strategies for generalization.

Semantic generalization can be particularly effective for extracting certain types of information. Comparing Figure 1.7 and Figure 1.11, we can see that semantic generalization is especially useful for extracting both the LOCATION and BENEFITS facts. The performance improved by about 30% in F-measure for those two facts.

1.2.7 Discussion

Automated rule learning from examples can also be found in other systems such as AutoSlog ([23]), PALKA ([21]), RAPIER ([7]), and WHISK ([28]). AutoSlog uses heuristics to create rules from the examples and then requires a human expert to accept or reject the rules. PALKA applies a conceptual hierarchy to control the generalization or specialization of the target slot. RAPIER uses inductive logic programming to learn patterns that characterize slot-fillers and their context. WHISK learns rules by starting with a seed example and then selectively adding terms that appear in the seed example to a rule.


[Figure: F-measure (%) per target type (COM, POS, SAL, LOC, EXP, CON, SKI, BEN) for one-entity, two-entity, three-entity, and all rules]

Fig. 1.11 F-measure on Single Fact Extraction in Two Dimensional Generalization

TIMES differentiates itself in that it learns rules by interleaving syntactic generalization and semantic generalization. It automatically decides the number, the order, and the generalization/specialization of constraints. It learns these three aspects of each rule, while other systems concentrate on one or two aspects. Furthermore, most of those systems focus on improving performance by refining learning techniques in an environment of large databases of examples. TIMES is designed to provide a paradigm where rule learning makes it possible to build an IE system based on minimum training by a casual user. Indeed, TIMES emphasizes usability for the casual user. When a large amount of pre-annotated information is not available and the user is not experienced enough to tag the information, how does one make an IE system effective based on minimum training? In our experiments, we intentionally chose a small training set since, for a casual user, a large amount of training is difficult. The experimental results suggest that the two dimensional generalization obtains reasonable performance while the effort and time involved in the training are dramatically reduced.

Most information extraction systems are created based on hand-crafted domain-specific knowledge. This is because the lexical semantic definitions given by generic resources sometimes cannot meet the actual needs of a specific domain. None of the MUC systems makes use of existing general lexical semantic resources. NYU's MUC-4 system ([18]) made some attempt at using WordNet for semantic classification. However, they ran into the problem of automated sense disambiguation, because the WordNet hierarchy is sense dependent. As a result, they gave up using WordNet. TIMES attempts to integrate WordNet with WSD techniques ([10]). The use of WordNet hypernyms in the two-dimensional generalization could raise the recall performance from 65% to 76% (see Figure 1.9) at the cost of precision. However, we found that the WordNet synonyms and the direct hypernyms are particularly useful in balancing the tradeoff between precision and recall. The use of WordNet enhanced the overall performance by 5%. Despite the typographical errors, incorrect grammar, and rare abbreviations in the free text collection, which make information extraction more difficult, in this domain the two-dimensional generalization model based on both syntactic and semantic generalization achieved an overall F-measure of about 69% (see Figure 1.10).



In the remainder of this chapter, we discuss a different system: a system that resolves cross-document coreference. We believe that the use of such a system will greatly improve the performance of IE systems.

1.3 CROSS-DOCUMENT COREFERENCE

Cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source. Computer recognition of this phenomenon is important because it helps break "the document boundary" by allowing a user to examine information about a particular entity or event from multiple text sources at the same time.

Resolving cross-document coreference has been considered a difficult problem that requires the output of an information extraction system [19]. However, recent research has shown that cross-document coreference for both entities and events can be resolved accurately [3], [4].

1.3.1 Cross-Document Coreference and Information Extraction

Despite the success of the Message Understanding Conferences, the performance of information extraction systems has hit a barrier commonly referred to as the "60% barrier" on MUC-like data sets. The two main causes of this barrier are:

1. There are too many ways of presenting the same fact (information) in text. The tail of the frequency distribution of a fact is too long to justify the effort needed to manually train the system on all possible ways of phrasing a fact.

2. The systems are only able to extract information written explicitly in the text. The extraction of more implicit information is much harder and currently beyond the state of the art.

We feel that IE systems can break the "60% barrier" by using the output of a cross-document coreference system. Using a cross-document coreference system will enable an IE system to work with several articles about an entity or event (as opposed to one, currently). The presence of several such articles increases the chance that one or more of the articles will contain a pattern that the IE system has been trained on. In addition, it also increases the chance that the information contained in one or more of the articles will be presented more explicitly.

1.3.2 Cross-Document Coreference: The Problem

Cross-document coreference is a distinct technology from Named Entity recognizers like IsoQuest's NetOwl and IBM's Textract because it attempts to determine whether name matches are actually the same individual (not all John Smiths are the same).


Neither NetOwl nor Textract has mechanisms that try to keep same-named individuals distinct if they are different people.

Cross-document coreference also differs in substantial ways from within-document coreference. Within a document there is a certain amount of consistency which cannot be expected across documents. In addition, the problems encountered during within-document coreference are compounded when looking for coreferences across documents, because the underlying principles of linguistics and discourse context no longer apply across documents. Because the underlying assumptions in cross-document coreference are so distinct, the problem requires novel approaches.

1.3.3 Architecture and the Methodology

In this section we describe a cross-document coreference resolution system that resolves coreferences for both entities and events using the Vector Space Model. Figure 1.12 shows the architecture of the system, which is built upon the University of Pennsylvania's within-document coreference system, CAMP [5], [6].

Our system takes as input the coreference-processed documents output by CAMP. It then passes these documents through the SentenceExtractor module, which extracts, for each document, all the sentences relevant to a particular entity of interest. The VSM-Disambiguate module then uses a vector space model algorithm to compute similarities between the sentences extracted for each pair of documents.

Details about each of the main steps of the cross-document coreference algorithm are given below.

• First, the University of Pennsylvania's CAMP system is run on each article. It produces coreference chains for all the entities mentioned in the article. For example, consider the two extracts in Figures 1.13 and 1.15. The coreference chains output by CAMP for the two extracts are shown in Figures 1.14 and 1.16.

• Next, for the coreference chain of interest within each article (for example, the coreference chain that contains "John Perry"), the SentenceExtractor module extracts all the sentences that contain the noun phrases which form the coreference chain. In other words, the SentenceExtractor module produces a "summary" of the article with respect to the entity of interest. These summaries are a special case of the query-sensitive techniques being developed at Penn using CAMP. Therefore, for doc.36 (Figure 1.13), since at least one of the three noun phrases ("John Perry," "he," and "Perry") in the coreference chain of interest appears in each of the three sentences in the extract, the summary produced by SentenceExtractor is the extract itself. On the other hand, the summary produced by SentenceExtractor for the coreference chain of interest in doc.38 is only the first sentence of the extract, because the only element of the coreference chain appears in this sentence.


[Figure: the input documents doc.01 ... doc.nn flow through the University of Pennsylvania's CAMP coreference system to per-document coreference chains, through the SentenceExtractor to summary.01 ... summary.nn, and through VSM-Disambiguate to the cross-document coreference chains]

Fig. 1.12 Architecture of the Cross-Document Coreference System

John Perry, of Weston Golf Club, announced his resignation yesterday. He was the President of the Massachusetts Golf Association. During his two years in office, Perry guided the MGA into a closer relationship with the Women's Golf Association of Massachusetts.

Fig. 1.13 Extract from doc.36

[Figure: coreference chains for doc.36 over the noun phrases "John Perry", "he", "Perry", "Weston Golf Club", "Massachusetts Golf Association", "MGA", and "Women's Golf Association"]

Fig. 1.14 Coreference Chains for doc.36

• For each article, the VSM-Disambiguate module takes the summary extracted by the SentenceExtractor and computes its similarity with the summaries extracted from each of the other articles. Summaries having a similarity above a certain threshold are considered to concern the same entity.


Oliver "Biff" Kelly of Weymouth succeeds John Perry as president of the Massachusetts Golf Association. "We will have continued growth in the future," said Kelly, who will serve for two years. "There's been a lot of changes and there will be continued changes as we head into the year 2000."

Fig. 1.15 Extract from doc.38

[Figure: coreference chains for doc.38 over the noun phrases "Oliver "Biff" Kelly", "Kelly", "John Perry", and "Massachusetts Golf Association"]

Fig. 1.16 Coreference Chains for doc.38

1.3.4 University of Pennsylvania’s CAMP System

The coreference chains output by CAMP enable us to gather all the information about the entity of interest in an article. This information about the entity is gathered by the SentenceExtractor module and is used by the VSM-Disambiguate module for disambiguation purposes. Consider the extract from doc.36 shown in Figure 1.13. We are able to include the fact that the John Perry mentioned in this article was the president of the Massachusetts Golf Association only because CAMP recognized that the "he" in the second sentence is coreferent with "John Perry" in the first. And it is this fact that actually helps VSM-Disambiguate decide that the two John Perrys in doc.36 and doc.38 are the same person.
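The sketch below produces such a per-entity summary. A plain string-containment test stands in for CAMP's coreference-chain offsets, and the sentence list and chain are hand-built from the doc.36 extract; both simplifications are ours.

```python
def extract_summary(sentences, chain):
    """Return the 'summary' for one coreference chain: every sentence that
    contains at least one mention from the chain, in document order."""
    return ' '.join(s for s in sentences if any(m in s for m in chain))

doc36 = [
    "John Perry, of Weston Golf Club, announced his resignation yesterday.",
    "He was the President of the Massachusetts Golf Association.",
    "During his two years in office, Perry guided the MGA into a closer "
    "relationship with the Women's Golf Association of Massachusetts.",
]
# The chain of interest: {John Perry, He, Perry}; all three sentences survive.
print(extract_summary(doc36, ["John Perry", "He", "Perry"]))
```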

1.3.5 The Vector Space Model

The vector space model used for disambiguating entities across documents is the standard vector space model used widely in information retrieval [27]. In this model, each summary extracted by the SentenceExtractor module is stored as a vector of terms. The terms in the vector are in their morphological root form and are filtered for stop words (words that have no information content, like a, the, of, an, ...). If S1 and S2 are the vectors for the two summaries extracted from documents D1 and D2, then their similarity is computed as:

Sim(S_1, S_2) = \sum_{\text{common terms } t_j} w_{1j} \times w_{2j}

where t_j is a term present in both S_1 and S_2, w_{1j} is the weight of the term t_j in S_1, and w_{2j} is the weight of t_j in S_2.


The weight of a term t_j in the vector S_i for a summary is given by:

w_{ij} = \frac{tf \times \log\frac{N}{df}}{\sqrt{s_{i1}^2 + s_{i2}^2 + \dots + s_{in}^2}}

where tf is the frequency of the term t_j in the summary, N is the total number of documents in the collection being examined, and df is the number of documents in the collection in which the term t_j occurs. The denominator \sqrt{s_{i1}^2 + s_{i2}^2 + \dots + s_{in}^2} is the cosine normalization factor and equals the Euclidean length of the vector S_i, where s_{in}^2 is the square of the product tf \times \log(N/df) for the term t_n in the vector S_i.

The VSM-Disambiguate module, for each summary S_i, computes the similarity of that summary with each of the other summaries. If the computed similarity is above a pre-defined threshold, then the entities of interest in the two summaries are considered coreferent.
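A compact sketch of this weighting and similarity computation follows. The tokenization, the tiny stop-word list, and the omission of morphological stemming are simplifications on our part; the formulas themselves are those given above.

```python
import math
from collections import Counter

STOP_WORDS = {'a', 'an', 'the', 'of', 'and', 'to', 'in', 'for', 'is', 'was'}

def tokens(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vector(summary, doc_freq, num_docs):
    """Cosine-normalized tf * log(N / df) weights for one summary."""
    tf = Counter(tokens(summary))
    weights = {t: f * math.log(num_docs / doc_freq[t])
               for t, f in tf.items() if doc_freq.get(t)}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def similarity(v1, v2):
    """Sum of the products of weights over the terms the two summaries share."""
    return sum(w * v2[t] for t, w in v1.items() if t in v2)

# Two summaries are declared coreferent when similarity(v1, v2) >= threshold.
```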

1.3.6 Experiments

We tested our cross-document system on two highly ambiguous test sets. The first set contained 197 articles from the 1996 and 1997 editions of the New York Times, while the second set contained 219 articles from the 1997 edition of the New York Times. The sole criterion for including an article in the two sets was the presence of a string matching the "/John.*?Smith/" and "/resign/" regular expressions, respectively.

The goal for the first set was to identify cross-document coreference chains about the same John Smith, and the goal for the second set was to identify cross-document coreference chains about the same "resign" event. The system did not use any New York Times data for training purposes. The answer keys were created manually, but the scoring was completely automated.

1.3.6.1 Analysis of the Data   There were 35 different John Smiths mentioned in the first set of articles. Of these, 24 had only one article which mentioned them. The other 173 articles were about the 11 remaining John Smiths. The backgrounds of these John Smiths, and the number of articles pertaining to each, varied greatly. Descriptions of a few of the John Smiths are: chairman and CEO of General Motors, assistant track coach at UCLA, the legendary explorer and main character in Disney's Pocahontas, and former leader of the Labour Party of Britain. In the second set, there were 97 different "resign" events. Of these, 60 were involved in chains of size 1. The articles concerned the resignations of several different people, including Ted Hobart of ABC Corp., Dick Morris, Speaker Jim Wright, and the possible resignation of Newt Gingrich.

1.3.7 Scoring the Output

In order to score the cross-document coreference chains output by the system, we had to map the cross-document coreference scoring problem to a within-document coreference scoring problem. This was done by creating a meta document consisting of the file names of each of the documents that the system was run on.


\text{Precision}_i = \frac{\#\text{ of correct elements in the output chain containing entity}_i}{\#\text{ of elements in the output chain containing entity}_i}

\text{Recall}_i = \frac{\#\text{ of correct elements in the output chain containing entity}_i}{\#\text{ of elements in the truth chain containing entity}_i}

Fig. 1.17 Definitions of Precision and Recall for an Entity i

Assuming that each of the documents in the two data sets was about a single John Smith, or about a single "resign" event, the cross-document coreference chains produced by the system could now be evaluated by scoring the corresponding within-document coreference chains in the meta document.

1.3.7.1 The B-CUBED Scoring Algorithm   We used the B-CUBED scoring algorithm [2] to score the within-document coreference chains in the meta document. This algorithm looks at the presence/absence of entities in the chains produced. Therefore, we compute the precision and recall numbers for each entity in the document. The numbers computed with respect to each entity in the document are then combined to produce the final precision and recall numbers for the entire output.

For an entity i, we define the precision and recall with respect to that entity in Figure 1.17.

The final precision and recall numbers are computed by the following two formulae:

Final Precision = Σ_{i=1}^{N} w_i · Precision_i

Final Recall    = Σ_{i=1}^{N} w_i · Recall_i

where N is the number of entities in the document, and w_i is the weight assigned to entity i in the document. For all the examples and experiments in this paper we assign equal weights to each entity, i.e. w_i = 1/N. But we also plan to look at the possibility of using other weighting schemes.
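The per-entity definitions of Figure 1.17 and the equal-weight combination above can be sketched as follows (chains are represented as sets of entity identifiers; the helper names are illustrative rather than the scorer's actual interface):

def b_cubed(output_chains, truth_chains):
    # Map every entity to the output chain and the truth chain that contain it.
    out_of = {e: chain for chain in output_chains for e in chain}
    true_of = {e: chain for chain in truth_chains for e in chain}
    entities = list(true_of)
    n = len(entities)
    precision = recall = 0.0
    for e in entities:
        correct = len(out_of[e] & true_of[e])   # correct elements in e's output chain
        precision += correct / len(out_of[e])   # Precision_i
        recall += correct / len(true_of[e])     # Recall_i
    return precision / n, recall / n            # equal weights w_i = 1/N

# Two documents wrongly merged with a third: truth chains {1, 2} and {3}, output {1, 2, 3}.
print(b_cubed([{1, 2, 3}], [{1, 2}, {3}]))      # approximately (0.556, 1.0)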

1.3.8 Results

Figure 1.18 shows the precision, recall, and F-Measure [14] (with equal weights for both precision and recall) using the scoring algorithm for the John Smith data set. The Vector Space Model in this case constructed the space of terms only from the summaries extracted by SentenceExtractor. In comparison, Figure 1.19 shows the results when the vector space model constructed the space of terms from the articles input to the system (it still used the summaries when computing the similarity).
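As a point of reference (a standard identity rather than something stated here), the F-Measure with equal weights for precision and recall reduces to their harmonic mean, F = 2PR / (P + R).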


[Figure omitted: Precision/Recall vs Threshold plot; percentage (0-100) against threshold (0-1) for Our Alg: Precision, Recall, and F-Measure.]
Fig. 1.18 Precision, Recall, and F-Measure With Training On the Summaries for the John Smith Data Set

[Figure omitted: Precision/Recall vs Threshold plot; percentage (0-100) against threshold (0-1) for Our Alg: Precision, Recall, and F-Measure.]
Fig. 1.19 Precision, Recall, and F-Measure With Training On Entire Articles for the John Smith Data Set

The importance of using CAMP to extract summaries is verified by comparing the highest F-Measures achieved by the system for the two cases. The highest F-Measure for the former case is 84.6%, while the highest F-Measure for the latter case is 78.0%. In comparison, for this task, named-entity tools (like NetOwl) would mark all the John Smiths the same. Their performance using the scoring algorithm is 23% precision and 100% recall.

Similarly, Figure 1.20 shows the same three statistics for the "resign" data set.


[Figure omitted: Precision/Recall vs Threshold plot; percentage (0-100) against threshold (0-1) for Our Alg: Precision, Recall, and F-Measure.]
Fig. 1.20 Precision, Recall, and F-Measure for the "resign" Data Set

The best precision and recall achieved by the system, when using the scoring algorithm, were 90% and 87%, respectively. This occurred when the threshold for the vector space model was set to 0.2.

1.4 CONCLUSION

The ubiquity of machine-readable free text on almost every topic creates a need for information extraction software that can access it. Simultaneously, users have specific needs that may not be serviced by software products aimed at particular domains or by products that are based on keyword technologies. New approaches are needed which enable a user to style a search specifically to his or her own needs and which use modern information extraction methods to glean detailed facts from a large corpus. The methodologies of this paper provide a user with a powerful tool for efficiently training a system to do just the search the user wants. The tool gives the user the ability to scan sample articles and demonstrate exactly the extractions that he or she wants. It gives the user the ability to adjust the generalization level as needed, either to broadly gather all items that bear even a weak resemblance to the training or to filter out only a tightly specified few items that strongly resemble the training samples. The methodology makes available to the user all the power that comes from utilizing a parser, WordNet, the coreferencing system, and the combination and permutation mechanisms for rule generalization.

Tremendous additional power comes from such a system when it can do cross-document coreference, because resources from disparate sources can then be combined to answer questions. A theory of cross-document coreference is also described here that can further enhance the information extraction capabilities of a system.


REFERENCES

1. A. Bagga, J. Chai, and A. Biermann. The Role of WordNet in the Creation of a Trainable Message Understanding System. Proceedings of the Ninth Conference on Innovative Applications of Artificial Intelligence (IAAI-97), 1997.

2. Amit Bagga and Breck Baldwin. Algorithms for Scoring Coreference Chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistic Coreference, May 1998.

3. Amit Bagga and Breck Baldwin. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, pages 79–85, August 1998.

4. Amit Bagga and Breck Baldwin. Cross-Document Event Coreference: Annotations, Experiments, and Observations. In ACL'99 Workshop on Coreference and Its Applications, pages 1–8, June 1999.

5. Breck Baldwin et al. University of Pennsylvania: Description of the University of Pennsylvania System Used for MUC-6. In Sixth Message Understanding Conference (MUC-6) [14], pages 177–191.

6. Breck Baldwin et al. Description of the University of Pennsylvania's MUC-7 System. In Seventh Message Understanding Conference (MUC-7) [15].

7. M. Califf and R. Mooney. Relational Learning of Pattern-Match Rules for Information Extraction. Proceedings of Computational Language Learning '97, 1997.

8. J. Chai. Learning and Generalization in the Creation of Information Extraction Systems. PhD thesis, Department of Computer Science, Duke University, 1998.

9. J. Chai and A. Biermann. Corpus Based Statistical Generalization Tree in Rule Optimization. Proceedings of the Fifth Workshop on Very Large Corpora (WVLC-5), 1997.

10. J. Chai and A. Biermann. The Use of Word Sense Disambiguation in an Information Extraction System. Proceedings of the Eleventh Conference on Innovative Applications of Artificial Intelligence (IAAI-99), 1999.

11. J. Chai, A. Biermann, and C. Guinn. Two Dimensional Generalization in Information Extraction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), 1999.

12. J. Cowie and W. Lehnert. Information Extraction. Communications of the ACM, January 1996.

13. DARPA: TIPSTER Text Program. Fourth Message Understanding Conference (MUC-4), June 1992.


14. DARPA: TIPSTER Text Program. Sixth Message Understanding Conference (MUC-6), San Mateo, November 1995. Morgan Kaufmann Publishers, Inc.

15. DARPA: TIPSTER Text Program. Seventh Message Understanding Conference (MUC-7), April 1998.

16. D. Freitag. Toward General-Purpose Learning for Information Extraction. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, 1998.

17. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.

18. R. Grishman, C. Macleod, and J. Sterling. New York University Proteus System: MUC-4 Test Results and Analysis. Proceedings of the Fourth Message Understanding Conference, 1992.

19. Ralph Grishman. Whither Written Language Evaluation? In Human Language Technology Workshop, pages 120–125, San Mateo, March 1994. Morgan Kaufmann.

20. S. Huffman. Learning Information Extraction Patterns from Examples. Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, 1996.

21. J. Kim and D. Moldovan. Acquisition of Semantic Patterns for Information Extraction from Corpora. Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, 1993.

22. G. Miller. WordNet: An On-line Lexical Database. International Journal of Lexicography, 1990.

23. E. Riloff. Automatically Constructing a Dictionary for Information Extraction Tasks. Proceedings of the Eleventh National Conference on Artificial Intelligence, 1993.

24. E. Riloff. An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. AI Journal, August 1996.

25. E. Riloff and R. Jones. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

26. B. Roark and E. Charniak. Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, 1998.

27. Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.


28. S. Soderland. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning Journal, Special Issue on Natural Language Learning, 1999.

29. B. Sundheim. Overview of the Fourth Message Understanding Evaluation and Conference. Proceedings of the Fourth Message Understanding Conference (MUC-4), June 1992.

30. C. Thompson and R. Mooney. Automatic Construction of Semantic Lexicons for Learning Natural Language Interfaces. Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.