ILLINOIS-PROFILER: Knowledge Schemas at Scale
Zhiye Fei, Daniel Khashabi, Haoruo Peng, Hao Wu, Dan Roth
University of Illinois, Urbana-Champaign, Urbana, IL 61801
{zfei2,khashab2,hpeng7,haowu4,danr}@illinois.edu
Abstract
In many natural language processing tasks, contextual information from given documents alone is not sufficient to support the desired textual inference. In such cases, background knowledge about certain entities and concepts can be quite helpful. While many knowledge bases (KBs) focus on combining data from existing databases, including dictionaries and other human-generated knowledge, we observe that in many cases the information needed to support textual inference involves detailed information about entities, entity types and relations among them; e.g., is the verb “fire” more likely to occur with an organization or a location as its Subject? In order to facilitate reliable answers to these types of questions, we propose to collect large-scale graph-based statistics from huge corpora annotated using state-of-the-art NLP tools. In order to systematically design, acquire and access such a knowledge base, we formalize a class of tree-based knowledge schemas based on Feature Description Logic (FDL). We define a range of knowledge schemas, then extract and organize the resulting statistical KB using a new tool, the PROFILER. Our experiments demonstrate that the PROFILER helps in classification and textual inference. Specifically, we demonstrate its application to co-reference resolution, showing considerable improvements by careful use of the PROFILER’s statistical knowledge.
1 Introduction
Knowledge representation and acquisition are two central problems in the design of Natural Language Processing (NLP) systems. While these are separate processes, they are strongly interdependent. A representation needs to be expressive enough, yet it should not compromise efficiency at the acquisition and inference stages [Levesque and Brachman, 1987]. Most importantly, the usage patterns of a KB should dictate how knowledge is represented and how it is acquired.
Beyond relatively “static” and manually curated knowledge bases such as WordNet and Freebase [Bollacker et al., 2008], the last few years have seen a significant body of work attempting to acquire KBs automatically. However, even these efforts typically aim at acquiring sets of rules or graphs; examples include significant work on open-domain information extraction [Banko et al., 2007], automatic acquisition methods for structured knowledge bases such as the Extended WordNet and the YAGO ontology [Hoffart et al., 2013], and projects such as NELL [Carlson et al., 2010], which acquires a large body of “rules” and “facts”. However, almost all of these efforts, while using huge amounts of data, are quite simplistic from the representational perspective: they often use simple relational patterns (e.g., X such as Y, Z) to extract information from text; they rarely facilitate typing of information, do not support disambiguation (which Ford is it?), and do not use the structure of the text in any significant way. These acquisition efforts have seen minimal success in supporting textual inference, mostly, we believe, due to the knowledge representation used: sets of rules or graphs that do not correspond well to the inference patterns they need to support.
Consider the problem of co-reference resolution. To resolve the pronoun “he” in the sentence “Jimbo arrested Robert because he stole an elephant from the zoo”, we need global statistical knowledge indicating that the Obj of arrest is more likely than its Subj to be the Subj of stole. Providing an NLP system with such information requires that we think carefully about the knowledge representation and about the acquisition process that could support it.
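This kind of statistical preference can be sketched as a simple lookup. The role names and counts below are invented for illustration; a real system would query the PROFILER’s aggregated statistics.

```python
# Hedged sketch: resolving "he" in "Jimbo arrested Robert because he
# stole an elephant from the zoo" from corpus statistics.
# The counts are invented for illustration only.
cooccurrence = {
    # (role of candidate w.r.t. "arrest", role of pronoun w.r.t. "steal") -> count
    ("obj_of_arrest", "subj_of_steal"): 1200,
    ("subj_of_arrest", "subj_of_steal"): 150,
}

def resolve(candidates, pronoun_role="subj_of_steal"):
    """Pick the candidate whose role pairs most often with the pronoun's role."""
    return max(candidates, key=lambda c: cooccurrence.get((c[1], pronoun_role), 0))

candidates = [("Jimbo", "subj_of_arrest"), ("Robert", "obj_of_arrest")]
print(resolve(candidates)[0])  # -> Robert
```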
This paper proposes a general framework for relational representation of (statistical) knowledge that is acquired from text and designed to support textual inference. Our representation is driven by the need to facilitate aggregation of knowledge while taking into account the typing of concepts and entities, their disambiguation, and the relational structure of the text. In particular, supporting inferences of the kind shown above requires that the knowledge acquisition be designed appropriately to facilitate them.
Our knowledge representation is based on Feature Description Logic (FDL), a relational (frame-based) language that is expressive yet supports efficient inference [Cumby and Roth, 2003a; 2003b]. Variations of this language have been shown to be useful in multiple applications, from providing principled formulations for feature extraction in machine learning to formalisms for the semantic web [Baader et al., 2008]. In this work, we use FDL to create a formal
Verbs in dependency path of length 2                Nouns in dependency path of length 2
Seattle, WA       Seattle Seahawks                  grow.03: agriculture    grow.04: go from child to adult
base     0.182    play      0.233                   area       0.136        school        0.130
include  0.100    sign      0.100                   plant      0.126        family        0.110
serve    0.059    49er      0.072                   species    0.075        town          0.067
arrive   0.056    join      0.070                   variety    0.067        city          0.065
work     0.055    win       0.067                   region     0.059        son           0.061
depart   0.052    select    0.055                   soil       0.054        father        0.058
know     0.052    lead      0.053                   tree       0.054        child         0.056
form     0.049    make      0.035                   garden     0.051        farm          0.055
win      0.049    release   0.033                   wine       0.048        area          0.051
become   0.040    defeat    0.031                   crop       0.042        village       0.040
bear     0.037    run       0.031                   vegetable  0.042        suburb        0.039
found    0.036    become    0.029                   fruit      0.036        brother       0.036
play     0.035    place     0.027                   flower     0.032        mother        0.035
leave    0.034    announce  0.027                   year       0.030        year          0.031
join     0.031    begin     0.025                   forest     0.029        neighborhood  0.029
make     0.029    go        0.025                   seed       0.027        parent        0.029
locate   0.027    beat      0.022                   elevation  0.026        member        0.028
attend   0.027    draft     0.022                   time       0.023        daughter      0.027
run      0.025    hold      0.022                   field      0.023        county        0.026
hold     0.024    hire      0.022                   group      0.021        university    0.026

Table 1: Visualization of sample profiles. The tokens in the vertical axis are the most frequent co-occurring tokens in the corresponding schema, and the associated numbers are normalized by the total number of co-occurrences. The detailed definition of a schema is presented in §2.
definition for different types of knowledge schemas defined over graphs, acquire information from data using these schemas, and retrieve information from our KB using these schemas to support textual inference decisions.
With this flexible schema definition, we are able to extract many useful occurrence statistics from big corpora. Table 1 shows how we structure the extracted information in the PROFILER. Statistics that share an important common constituent are gathered into the same profile. For example, all of the schema instances that contain the entity “Seattle” (the city) as one of their constituents are gathered in the profile of “Seattle” (the city). Similarly, we have profiles for “Seattle” (Seahawks), “grow” (sense 3, meaning “produce by cultivation”), “grow” (sense 4, meaning “go from child to adult”), and so on. As can be seen, there can be multiple profile types, e.g., (Wikipedia-based) entities, (Propbank-based) Verbsense [Kingsbury and Palmer, 2002], etc.1 Each profile has a set of keys that uniquely identifies it. For example, profiles of Wikipedia entities are uniquely identified by both their surface form and their Wikipedia url. In this way, we are able to disambiguate different entities that have the same surface form, as we show in the visualization.
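As an illustration of the data model just described, the following sketch keys each profile by (surface form, Wikipedia url) so that entities sharing a surface form stay disambiguated. All field and schema names here are invented for illustration, not the PROFILER’s actual storage schema.

```python
from collections import defaultdict

# profiles: (surface form, Wikipedia url) -> schema name -> token -> count
profiles = {}

def add_cooccurrence(surface, url, schema, token):
    """Record one co-occurrence under the profile keyed by (surface, url)."""
    prof = profiles.setdefault((surface, url), defaultdict(lambda: defaultdict(int)))
    prof[schema][token] += 1

def normalized(surface, url, schema):
    """Counts normalized by the total number of co-occurrences (as in Table 1)."""
    counts = profiles[(surface, url)][schema]
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# The same surface form "Seattle" maps to two distinct profiles:
add_cooccurrence("Seattle", "wiki/Seattle", "verb_dep2", "base")
add_cooccurrence("Seattle", "wiki/Seattle", "verb_dep2", "base")
add_cooccurrence("Seattle", "wiki/Seattle_Seahawks", "verb_dep2", "play")
print(normalized("Seattle", "wiki/Seattle", "verb_dep2"))  # {'base': 1.0}
```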
Given an entity, we can look at various statistics gathered for it. Different schemas may be useful for different tasks. Here we give examples for the problems of named entity recognition and co-reference resolution, although the usage of our schemas is not limited to these applications.
Consider the examples in Table 2. For each sentence and task, a graph is provided to represent the useful schemas. In the first sentence we want to solve a co-reference problem, where the goal is to connect “it” to either “the tree” or “axe”. If we have statistics indicating whether it is more likely for the adjective “tall” to apply to “the tree” or to “axe”, we can solve this problem. The schema graph for this knowledge is shown in the first example of Table 2, where w is a word and can be instantiated as “the tree” or “axe”. The statistics from the two graph instances provide information that is essential to addressing this inference task. In the second example sentence, we want to tag the constituents with NER labels. Statistics on the pattern PER “bought” ORG, where PER and ORG are possible NER labels, can be used to model the correlation between the labels of two local constituents based on their mutual context. In the graph, the arc labels R1 and R2 could be substituted with any relations that approximate our goal, such as (R1, R2) = (before, after). Consider the co-reference problem in example 3. The knowledge that the subject of “steal” is more likely to be the object of “arrest” rather than its subject would provide enough information for this instance. This corresponds to the graph for this example, in which the Subj of “steal” is co-referred with the Obj of “arrest”.

1Currently we support only the two profile types mentioned above, although adding more profile types is just a matter of adding new input annotations.
As these examples show, there is a wide range of patterns that can be defined for different problems and applications. We create a unified language for expressing these schemas, and we implement a scalable system that allows concise schema definitions and fast statistics extraction from gigantic corpora. To summarize, the main contributions of this paper are:
1. We propose a formalization for graph-based representation of knowledge schemas, based on FDL.
2. We create the PROFILER, a publicly available tool which contains statistics for various knowledge schemas.
3. Finally, we show the application of our tool to dataless classification of people by occupation. We also address hard co-reference resolution problems and show considerable improvements.
The rest of the paper is organized as follows. We explain the graph-based knowledge schema formulation in §2. The details of the acquisition system are given in §3. We report our preliminary experimental results in §4.
#  Sentence                                                              Schema Graph
1  “I chopped down [the tree] with my [axe] because [it] was tall.”      N1 [word(“tall”), POS(ADJ)] —R— N2 [word(w), POS(N)]
2  “[Larry Robbins], founder of Glenview Capital Management, bought      N1 [NER(PER)] —R1— N2 [word(“bought”)] —R2— N3 [NER(ORG)]
   shares of [Endo International Plc] ...”
3  “[Jimbo] arrested [Robert] because [he] stole an elephant from        N1 [word(“arrest”)] —objOf— N2 [word(“Robert”)];
   the zoo.”                                                             N3 [word(“he”)] —subjOf— N4 [word(“steal”)];
                                                                         N2 —Co-referred— N3

Table 2: Example sentences and schema graphs.
Attributes (A)   Values (V)
Word             Raw text
Lemma            Raw text
POS              Labels from the Penn Treebank
NER              { PER, ORG, LOC, MISC }
Wikifier         Wikipedia urls
Verbsense        Verb senses from Verbnet
Role             { subj, obj }

Table 3: Set of attribute-values in our current system.
2 Generic Knowledge Description
In this section, we introduce a formal framework for characterizing word-tuple occurrences, defined in terms of concept graphs G(V, E) with certain labeling properties. In our formalization, we use FDL as a means to represent relational data as quantified propositions [Khardon et al., 1999; Cumby and Roth, 2003a], with some additional notation. We first give a brief introduction to FDL2 and explain how we can scale our knowledge description to any schema by taking advantage of FDL’s inductive nature.

Definition 1 Given a set of attributes A = {a1, a2, . . .}, a set of values V = {v1, v2, . . .} and a set of role alphabets R = {r1, r2, . . .}, an FDL description is defined inductively as follows:

1. For an attribute a ∈ A and a value v ∈ V, a(v) is a description; it represents the set of x ∈ X for which a(x, v) is True.

2. For a description D and a role r ∈ R, (r D) is a role description. Such a description represents the set of x ∈ X such that r(x, y) is True, where y ∈ Y is described by D.

3. For descriptions D1, . . . , Dk, (AND D1, . . . , Dk) is a description, which represents the conjunction of all values described by each description.
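To make the inductive definition concrete, here is a minimal executable reading of Definition 1, assuming a toy world where attributes and roles are given as explicit sets of pairs. This is an illustration of the semantics, not the PROFILER’s implementation.

```python
# a(v)           -> {x : a(x, v) is True}
# (r D)          -> {x : r(x, y) is True for some y in D}
# (AND D1 .. Dk) -> intersection of the descriptions

def attr_desc(attr_rel, v):
    """a(v): all x carrying value v under the attribute relation {(x, v), ...}."""
    return {x for (x, val) in attr_rel if val == v}

def role_desc(role_rel, inner):
    """(r D): all x related by r to some y described by inner."""
    return {x for (x, y) in role_rel if y in inner}

def and_desc(*descs):
    """(AND D1 ... Dk): conjunction of descriptions."""
    out = set(descs[0])
    for d in descs[1:]:
        out &= d
    return out

# Toy instance of (AND (POS(PRP)) (subjOf word("steal"))):
pos = {("Robert", "N"), ("he", "PRP"), ("steal", "V")}
word = {("steal", "steal"), ("Robert", "Robert"), ("he", "he")}
subj_of = {("he", "steal")}
d = and_desc(attr_desc(pos, "PRP"), role_desc(subj_of, attr_desc(word, "steal")))
print(d)  # {'he'}
```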
By the rules of FDL, the output of each description is a set of elements. We denote the description of node i by Di. Our goal is to describe the whole concept graph or, in other words, to find the tuples of words (c1, c2, . . . , cn) (where n = |V|) which conform to the roles

2More details can be found in [Cumby and Roth, 2003a].
[Figure 1 shows two concept graphs.
Graph 1: N1 —subjOf— N2 —objOf— N3, with node labels POS(N), word(“defeat”), POS(N). Example: Djokovic beats Nadal.
Graph 2: N1 —subjOf— N2 —after— N3 —objOf— N4, with N2 labeled word(“rob”) and N3 labeled POS(V). Example: (subj, rob) → (obj, arrest).]

Figure 1: Concept graphs for examples used here. Each concept graph gives a relational description of words. Each node represents a word. The label on each node is an attribute-value pair for that word. The labels on edges are the relations (or roles, in FDL notation) between words.
Roles (R)
Before              After
NearestBefore       NearestAfter
AdjacentToBefore    AdjacentToAfter
Exclusive           Containing
HasOverlap          DependencyPath(l)
Co-referred         SubjectOf
IsSubjectOf         ObjectOf
IsObjectOf

Table 4: Set of roles in our current system.
and attribute-values defined by our desired graph. To this end, we take the cross product of the descriptions of individual nodes to get the description of the whole graph. For future use, we define D_{i1,...,ik}, a set of k-element tuples, as the description of nodes i1, . . . , ik ∈ {1, . . . , |V|}.

Before stating the complete definition of the schemas, we first give a couple of examples. For each example, a concept graph represents the set of roles on edges and the attribute-values on nodes. The concept graphs corresponding to our examples are depicted in Figure 1.
[Figure 2 shows one level of a hypothetical concept graph: a parent node instantiated as c is connected to children N_{i-1}, N_i, N_{i+1} through roles r_{i-1}, r_i, r_{i+1}, and each child N_i carries the attribute-value label a_i(v_i).]

Figure 2: One level of a hypothetical concept graph.
Example 1: Suppose we want to describe a pair of nouns serving as the subject and object of the verb “defeat”, respectively.

D1 = (AND (POS(N)) (subjectOf word(“defeat”)))
D2 = {word(“defeat”)}
D3 = (AND (POS(N)) (objectOf word(“defeat”)))

In Graph 1, D1 describes N1 and D3 describes N3. Their cross product, D1,2,3 = D1 ⊗ D2 ⊗ D3, represents the set of tuples.

Example 2: Consider Graph 2 in Figure 1. Given the verb “rob”, we want to describe nodes N1, N3, N4.

D1 = (subjectOf word(“rob”))
D2 = {word(“rob”)}
D3 = (AND (POS(V)) (after word(“rob”)))
D4(w) = (objectOf word(w)), ∀w ∈ D3

D3,4 = ⋃_{w∈D3} ({w} ⊗ D4(w))
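The union and cross-product operations in these examples can be sketched directly. The node descriptions below are toy sets standing in for the results of the FDL descriptions; they are invented for illustration.

```python
from itertools import product

# Toy materializations of Example 2's node descriptions:
D1 = {"Jimbo"}                               # (subjectOf word("rob"))
D2 = {"rob"}                                 # {word("rob")}
D3 = {"arrest", "flee"}                      # verbs after "rob"
D4 = {"arrest": {"Robert"}, "flee": set()}   # (objectOf word(w)) for each w in D3

# D3,4 = union over w in D3 of {w} x D4(w)
D34 = {(w, obj) for w in D3 for obj in D4[w]}

# D1,2,3,4 = D1 x D2 x D3,4 -- the set of all quadruples
D1234 = {(a, b) + cd for a, b, cd in product(D1, D2, D34)}
print(D1234)  # {('Jimbo', 'rob', 'arrest', 'Robert')}
```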
The cross product of the definitions gives the set of all possible quadruples: D1,2,3,4 = D1 ⊗ D2 ⊗ D3,4.

A general schema definition: We showed how to formally characterize the set of all elements satisfying a concept graph in our examples. Here we provide general characterization protocols. As mentioned before, in the concept graph the edges are roles r ∈ R, and the nodes carry attributes a ∈ A with values v ∈ V. Figure 2 shows one layer of a hypothetical concept graph. Assuming that the concept graph is a rooted tree (hence has no cycles), the following rule inductively defines the description of each node:

Description of each node: If the parent is fixed (or is a single-element set), the description of each child node is inductively defined based on the description of its parent, given the role/attribute labels. The description of node Ni is:

Di = (AND (ai(vi)) (ri word(c)))

If the parent is described by a set, Dparent, the description of each child node is inductively defined based on each element of the parent description, given the role/attribute labels. The description of node Ni is:

Di(c) = (AND (ai(vi)) (ri word(c))), ∀c ∈ Dparent

This definition can be further adapted to the concept graph at hand. For example, if there is no attribute or role associated with a child, it can be omitted. Similarly, if there are multiple attributes or roles, they can be combined with AND or any other appropriate operators.
Given the description of each node, we want the joint description of all nodes in the concept graph. In the following we formalize how to combine atomic descriptions into a global description.

Combining atomic descriptions: For a fixed parent, the description of the parent-children nodes is the cross product of the child descriptions with the parent. If I represents the set of indices of the children:

Dparent,child = Dparent ⊗ (⊗_{i∈I} Di)

Suppose instead that the parent node is described by a set of elements, Dparent. If I represents the set of indices of the children, the description of the parent-child nodes is:

Dparent,child = ⋃_{c∈Dparent} [{c} ⊗ (⊗_{i∈I} Di(c))]
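The two combination rules can be sketched as one recursive procedure over a rooted concept tree. In this toy reading (not the PROFILER’s internals), each node’s description is a callable from the parent’s instantiation to a set of words, which covers both the fixed-parent and set-valued-parent cases.

```python
def describe(node, parent_word=None):
    """Return the set of tuples satisfying the subtree rooted at node.

    node: {"desc": parent_word -> set of words, "children": [subnodes]}
    For each instantiation c of this node, the children's tuples are
    cross-multiplied onto (c,), and results are unioned over all c.
    """
    own = node["desc"](parent_word)          # D_i, possibly parent-dependent
    tuples = set()
    for c in own:
        combined = [(c,)]
        for child in node.get("children", []):
            combined = [t + s for t in combined for s in describe(child, c)]
        tuples.update(combined)
    return tuples

# Toy tree: root "rob" with one child whose description depends on the parent.
root = {
    "desc": lambda _: {"rob"},
    "children": [{"desc": lambda p: {"Jimbo"} if p == "rob" else set()}],
}
print(describe(root))  # {('rob', 'Jimbo')}
```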
A database of schema scores: So far, given a graph G(V, E), we have created a description D1,...,|V|. We define a scoring function S : G → R which maps a concept graph to a score value. Given a set of tuples of the form (c1, . . . , c|V|), the basic score is the number of distinct tuples. Distinctness can be defined based on the raw text, a normalized form, or any of the attributes (Table 3), which defines the level of abstraction.

In many applications the raw occurrence score may not be very useful; instead, conditional probabilities are handy. The conditional probability over a graph can be defined in different ways. Consider a subgraph of G, defined as G′(V′, E′), such that E′ ⊆ E and V′ ⊆ V. A normalized score of G with respect to its subgraph is defined as

S(G \ G′ | G′) ≜ S(G) / S(G′)

Depending on the application, a different choice of G′ may be appropriate.

Querying the database: Our generic knowledge description gives us a uniform interface for both knowledge acquisition and queries. To get the scores, the user only needs to supply a specific instance of any schema definition G.
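As a minimal sketch, the normalized score is just a ratio of tuple counts; the counts below are illustrative, not values from the PROFILER.

```python
def conditional_score(count_G, count_Gprime):
    """S(G \\ G' | G') = S(G) / S(G'): relative frequency of the full
    graph given that its subgraph G' was observed."""
    return count_G / count_Gprime if count_Gprime else 0.0

# e.g. how often "window" fills the object slot given the verb "repair",
# computed from the raw schema counts (illustrative numbers):
print(conditional_score(count_G=30, count_Gprime=120))  # 0.25
```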
3 Knowledge Acquisition Procedure
In this section we explain the knowledge acquisition procedure, whose flowchart is shown in Figure 3. To support fast data retrieval, we pre-compute the statistics for our schemas instead of querying document annotations in real time.4 First, we use ILLINOISCLOUDNLP [Wu et al., 2014] to process raw documents with the annotations we need in our schemas. ILLINOISCLOUDNLP is built around

4Pre-computing has the benefit of fast queries, although it limits the users at query time. It is possible to combine pre-computation with on-the-fly computation of statistics for arbitrary queries, which is the subject of our future work.
Concept Graph                      Attributes = { Values }                                  Relations                        # of Schemas
N1 —r1— N2                         word = { set of words }; POS = { Noun, Noun-Phrase,      All roles from Table 4           24
                                   Verb, Verb-Phrase, Modifier }; Wikifier = { URLs };      except Co-referred
                                   Verbsense = { all verb senses }
N1 —r1— N2 —r2— N3                 word = { set of words }; POS = { Noun, Noun-Phrase,      Subj, ObjOf                      2
                                   Verb, Verb-Phrase, Modifier };
                                   Verbsense = { all verb senses }
N1 —r1— N2 —r2— N3 —r3— N4         word = { all words };                                    Subj, ObjOf, Co-referred         8
                                   Verbsense = { all verb senses }
N1 —r1— ... —r5— N6 (six nodes)    word = { set of words };                                 Subj, ObjOf, Co-referred         4
                                   Verbsense = { all verb senses }

Table 5: Categorization of knowledge schemas used in the experiments, based on the structure of the concept graph.
[Figure 3 shows the data pipeline: annotated documents stored on S3 are aggregated with Elastic MapReduce (EMR) into profiles.]

Figure 3: Flowchart of PROFILER data processing.
NLPCURATOR [Clarke et al., 2012] and Amazon Web Services’ Elastic Compute Cloud (EC2). It provides a straightforward user interface for annotating large corpora, on EC2, with a variety of NLP tools supported by NLPCURATOR. We use Amazon Elastic MapReduce to aggregate the statistics. It takes input directly from Amazon’s distributed key-value storage S3, where ILLINOISCLOUDNLP stores its annotation output. Finally, we store our profiles in MongoDB, as it supports a flexible data model and scalable indexing.
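The aggregation step can be sketched as a toy map/reduce over annotated sentences: the map phase emits one observation per schema instance, and the reduce phase sums counts per profile. The annotation format and schema name below are invented for illustration.

```python
from collections import Counter

def map_phase(sentences):
    """Emit ((surface, url), schema, token) observations from annotations."""
    for sent in sentences:
        for (entity, url, verb) in sent["entity_verb_pairs"]:
            yield ((entity, url), "nearest_verb", verb)

def reduce_phase(emitted):
    """Sum counts per (profile key, schema, token)."""
    counts = Counter()
    for key in emitted:
        counts[key] += 1
    return counts

sentences = [
    {"entity_verb_pairs": [("Seattle", "wiki/Seattle", "base")]},
    {"entity_verb_pairs": [("Seattle", "wiki/Seattle", "base"),
                           ("Seattle", "wiki/Seattle_Seahawks", "play")]},
]
print(reduce_phase(map_phase(sentences)))
```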
In our experiments we annotate a large part of a Wikipedia dump that contains 4,019,936 documents. The serialized annotation output is 1,455 GB uncompressed, and we use it as the input of our Elastic MapReduce program. 200 mid-end EC2 nodes are used for our MapReduce job, which completes in 3 hours at a cost of $420. The MapReduce result is 198 GB in size and contains 3,636,263 profiles for Wikipedia entities and 313,156 profiles for Verbsense entities.

The categorization shown in Table 5 is based on the structure of the graphs. For each graph, the set of possible attribute-values and roles is provided, although in practice we do not compute all possible combinations. Based on the structure of the problem, we pre-compute the ones that are most likely to yield improvements in the task.
4 Applications
In this section we evaluate the data gathered based on knowledge schemas on two applications. The knowledge schemas are expected to be applicable to any problem and are not limited to the two applications we introduce below.
4.1 Pattern discovery inside profiles
With the knowledge schemas extracted from big corpora, we expect to discover meaningful regularities in the data. For example, names of athletes tend to appear in similar contexts, although they might play on different teams or at different positions. In other words, the profiles of athletes (or at least some of their schemas) are expected to be similar. With this in mind, we expect to be able to distinguish the occupations of people based on some aspects of their schemas. For example, sample statistics for Tom Brady (football player) are shown in Table 6. Many of the context words, such as “pass”, “throw”, etc., reflect the profession component of his profile. Similarly, Nikola Tesla’s profile partly reflects his major activities. Given such information, we can distinguish between people with different occupations without the need for any task-specific training data. This has close connections to the Dataless Classification paradigm, which has recently gained some popularity [Chang et al., 2008; Song and Roth, 2014]. In addition, the aggregated profile of all football players should be closer to individual football players than the profiles of entities with other occupations.

In order to test our hypothesis, we create a dataset of people-professions. First we prepare a list of people, each with his/her name, Wikipedia url, and profession, from Wikipedia5. We use this list as a starting point to extract profiles of entities. We have gathered a list of entities E = {e1, . . . , en} such that for each e we know its profession. For a given entity e ∈ E, v(e) represents the vector of statistics given in the profile of

5Extracted from http://en.wikipedia.org/wiki/Lists_of_people_by_occupation for 2 levels of depth. In all cases, we applied an NER filter to eliminate irrelevant profiles.
Professions                          People
Football Players    Inventors        Tom Brady (football player)    Nikola Tesla (inventor)
say      0.245      invent   0.146   say       0.254                say        0.224
run      0.094      say      0.113   throw     0.160                develop    0.104
play     0.083      develop  0.071   pass      0.060                sell       0.060
throw    0.081      die      0.068   man       0.052                buy        0.060
tackle   0.054      bear     0.066   play      0.048                invent     0.060
leave    0.052      try      0.052   go        0.045                build      0.060
return   0.048      note     0.050   take      0.037                continue   0.045
go       0.037      help     0.046   spend     0.037                upgrade    0.030
man      0.032      pay      0.041   look      0.034                play       0.030
take     0.031      write    0.039   win       0.034                help       0.030
rush     0.031      design   0.037   spot      0.034                magnify    0.030
win      0.029      sell     0.035   complete  0.031                suppress   0.030
beat     0.025      found    0.034   order     0.029                alternate  0.030
write    0.024      lead     0.034   come      0.028                hire       0.030
see      0.023      sew      0.029   start     0.023                remove     0.030
catch    0.023      know     0.029   lead      0.021                intend     0.030
come     0.022      tell     0.029   know      0.021                waste      0.030
lead     0.021      employ   0.028   launch    0.017                seek       0.030
sign     0.021      isolate  0.028   lose      0.017                occur      0.030
miss     0.021      run      0.027   expect    0.017                tune       0.030

Table 6: Sample statistics in the profiles of two people and two occupations, retrieved from the “Verb After” schema. The profiles of the occupations are created by averaging the profiles of many people with that occupation.
Dataset                 Winograd     WinoCoref
Metric                  Precision    AntePre
Rahman and Ng [2012]    73.05        —
Peng et al. [2015]      76.41        89.32
Our paper               77.16        89.77

Table 7: Performance results on the Winograd and WinoCoref datasets. With knowledge from our proposed schemas, we improve performance compared to [Peng et al., 2015].
e6. Consider the set P = {p1, . . . , pm} which contains the set of professions. The vector of statistics for each profession, v(pi), is the average of the v(ej) over entities ej ∈ E with that profession. To make the experiment realistic we do 5-fold cross validation, i.e., when making predictions for some target people, the profiles of the occupations are created using the rest. To predict the occupation of an entity ei, we assign it to the profession with the highest similarity measure. In our experiment, we use Okapi BM25 as our similarity measure, and in 72.1% of the test cases the correct answer is among the top-5 predictions.
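The prediction step above can be sketched as nearest-profile classification. For brevity this sketch uses cosine similarity in place of the Okapi BM25 measure used in the paper, and the profile vectors are toy values, not actual PROFILER statistics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(person_vec, profession_vecs):
    """Assign the person to the profession with the most similar profile."""
    return max(profession_vecs, key=lambda p: cosine(person_vec, profession_vecs[p]))

# Toy profession profiles (averages over many people) and one person:
professions = {
    "football player": {"throw": 0.08, "pass": 0.06, "tackle": 0.05},
    "inventor": {"invent": 0.15, "develop": 0.07, "design": 0.04},
}
brady = {"throw": 0.16, "pass": 0.06, "say": 0.25}
print(predict(brady, professions))  # football player
```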
4.2 Improving Co-reference Resolution
To examine the power of our approach in a real NLP application, we apply our schemas to co-reference resolution problems. Many hard co-reference resolution problems rely heavily on external knowledge [Rahman and Ng, 2011; Ratinov and Roth, 2012; Rahman and Ng, 2012; Peng et al., 2015]. In particular, Winograd [1972] showed that small changes in context can completely change co-reference decisions. In the following examples, minor differences in otherwise identical sentences result in different references for the same pronoun.
Ex.1 The [ball]e1 hit the [window]e2 and Bill repaired [it]pro.
Ex.2 The [ball]e1 hit the [window]e2 and Bill caught [it]pro.

In Ex.1, if we know “repair window” is more likely than “repair ball”, we can decide that “it” refers to “window”. Likewise, in Ex.2, one needs to know “catch” is more likely
6Here we use only 4 schemas from Table 5: Nearest Noun After, Nearest Noun Before, Modifier Before, Nearest Verb After.
to be associated with “ball” than with “window”. These references can be easily resolved by humans, but are hard for today’s computer programs. However, the schemas we proposed are designed to capture exactly this kind of knowledge7.
The Winograd dataset in [Rahman and Ng, 2012] contains 943 pairs of such sentences, with a training set of 606 pairs and a test set of 337 pairs. Each sentence contains two entities and a pronoun, and we model the task as a binary classification problem. Peng et al. [2015] add more pronoun annotations to the dataset, model it as a general co-reference resolution problem, and provide the WinoCoref dataset8. In this dataset, each co-referent cluster contains only 2–4 mentions, all within the same sentence. We cannot use traditional co-reference metrics for this problem. Instead, we view predicted co-reference clusters as binary decisions on each antecedent-pronoun pair (linked or not). Following Peng et al. [2015], we compute the ratio of correct decisions to the total number of decisions made, and call this metric AntePre.
We apply the knowledge extracted with our proposed schemas to the system described in Peng et al. [2015] and test on both the Winograd and WinoCoref datasets. The system uses an Integer Linear Programming (ILP) formulation to solve co-reference problems, with a decision variable for each co-reference link. The knowledge from our proposed schemas is automatically turned into constraints, with tuned thresholds, at decision time. We implement the system the same way as described by Peng et al. [2015]. By using our additional schemas as constraints, we improve the performance reported in [Peng et al., 2015]. The results in Table 7 show that our schemas can be utilized as a reliable knowledge source for solving hard co-reference resolution problems. We expect
7In the PROFILER, we can tell that the number of co-occurrences for “catch” as the nearest verb before “ball” is far larger than that for “catch” as the nearest verb before “window”. Moreover, we also capture that the probability of “ball” being the object of “catch” is far larger than that of “window” being the object of “catch”.

8Available at: http://cogcomp.cs.illinois.edu/page/resource_view/96
the knowledge we gathered can also be applied to other NLPapplications that rely on external knowledge.
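One plausible reading of how a schema statistic becomes a hard constraint at decision time is a thresholded ratio test between two candidate links; the function, threshold, and counts below are illustrative assumptions, not the tuned values from the system described above.

```python
def constraint_from_schema(count_a, count_b, threshold=5.0):
    """Force link "a" or "b" when one candidate's schema count dominates
    the other's by at least `threshold`; otherwise return None and leave
    the decision to the base co-reference model."""
    if count_a == 0 and count_b == 0:
        return None
    if count_b == 0 or count_a / count_b >= threshold:
        return "a"
    if count_a == 0 or count_b / count_a >= threshold:
        return "b"
    return None

# "catch ball" vs "catch window": the statistic forces the pronoun to "ball".
print(constraint_from_schema(count_a=900, count_b=12))  # a
```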
5 Concluding Remarks
We presented the PROFILER, a knowledge base of large-scale graph-based statistics extracted from huge corpora annotated using state-of-the-art NLP tools. We proposed a tree-based formalization based on Feature Description Logic (FDL). The representation takes into account the typing of concepts and entities, along with their disambiguation. We showed the application of our tool to dataless classification of people by occupation and to hard co-reference resolution. More experiments need to be done to gain better insight into this knowledge representation, the effect of disambiguation, and other applications that can benefit from it.

Our current implementation includes a wide range of schemas which are efficiently processed on the Amazon cloud service. This implementation can be generalized to on-demand extraction of desired schemas, which is the subject of our future work.
6 Acknowledgments
The authors would like to thank Eric Horn, Alice Lai and Christos Christodoulopoulos for comments that helped to improve this work. This work is partly supported by DARPA under agreement number FA8750-13-2-0008 and by a grant from the Allen Institute for Artificial Intelligence (AI2). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.
References
[Baader et al., 2008] F. Baader, I. Horrocks, and U. Sattler. Description logics. Foundations of Artificial Intelligence, 3:135–179, 2008.
[Banko et al., 2007] M. Banko, M. Cafarella, M. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2007.
[Bollacker et al., 2008] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2008.
[Carlson et al., 2010] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2010.
[Chang et al., 2008] M. Chang, L. Ratinov, D. Roth, and V. Srikumar. Importance of semantic representation: Dataless classification. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2008.
[Clarke et al., 2012] J. Clarke, V. Srikumar, M. Sammons, and D. Roth. An NLP Curator (or: How I learned to stop worrying and love NLP pipelines). In International Conference on Language Resources and Evaluation (LREC), 2012.
[Cumby and Roth, 2003a] C. Cumby and D. Roth. On kernel methods for relational learning. In Proceedings of the International Conference on Machine Learning (ICML), 2003.
[Cumby and Roth, 2003b] C. M. Cumby and D. Roth. Learning with feature description logics. In Inductive Logic Programming, pages 32–47. Springer, 2003.
[Hoffart et al., 2013] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia (extended abstract). In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2013.
[Khardon et al., 1999] R. Khardon, D. Roth, and L. G. Valiant. Relational learning for NLP using linear threshold elements. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 911–917, 1999.
[Kingsbury and Palmer, 2002] P. Kingsbury and M. Palmer. From TreeBank to PropBank. In International Conference on Language Resources and Evaluation (LREC), 2002.
[Levesque and Brachman, 1987] H. J. Levesque and R. J. Brachman. Expressiveness and tractability in knowledge representation and reasoning. Computational Intelligence, 3(1):78–93, 1987.
[Peng et al., 2015] H. Peng, D. Khashabi, and D. Roth. Solving hard coreference problems. In Proceedings of the Conference of the North American Chapter of the ACL (NAACL), 2015.
[Rahman and Ng, 2011] A. Rahman and V. Ng. Coreference resolution with world knowledge. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2011.
[Rahman and Ng, 2012] A. Rahman and V. Ng. Resolving complex cases of definite pronouns: The Winograd schema challenge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 777–789, 2012.
[Ratinov and Roth, 2012] L. Ratinov and D. Roth. Learning-based multi-sieve co-reference resolution with knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012.
[Song and Roth, 2014] Y. Song and D. Roth. On dataless hierarchical text classification. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2014.
[Winograd, 1972] T. Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972.
[Wu et al., 2014] H. Wu, Z. Fei, A. Dai, S. Mayhew, M. Sammons, and D. Roth. IllinoisCloudNLP: Text analytics services in the cloud. In International Conference on Language Resources and Evaluation (LREC), May 2014.