1
OntoNotes: A Unified Relational Semantic Representation
Sameer Pradhan, Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel
http://www.bbn.com/ontonotes
2
Outline
Multiple layers of annotation and modeling capture useful elements of text meaning at 90% inter-annotator agreement (ITA)
– Syntax
– Proposition
– Word sense
– Ontology
– Coreference
– Names
An integrated relational database representation
– Enforces consistency across the different annotations
– Supports integrated models that can combine evidence from different layers
Some practical issues
– Sensitivity to changes in layers
– Adding a new layer to the data
A few lessons learned
3
Problems with Multiple Layers of Annotation
Not previously available
– A number of these layers have not been available in significant quantity before:
• Word Sense
• Coreference
Not previously integrated
– Each layer encoded separately as individual files, requiring supporting documentation for interpretation
Not previously completely consistent
– Mismatches between Treebank and PropBank
Not previously user friendly
– Raw text format
4
Unified Representation
Provide a bare-bones representation independent of the individual layer’s semantics that can
– Efficiently capture intra- and inter-layer semantics
– Maintain component independence (facilitate collaboration)
– Provide mechanism for flexible integration (for an application)
– Integrate information at the required level of granularity
– Data storage as close as possible to an application backend
– Adaptable in face of incremental representational changes
– API extremely accessible (don’t need to be a hacker to use it)
– Ability to easily perform cross-layer queries
– Easily extensible
– Capable of maintaining version information
  – Ideally at different possible levels
– …
– …
Relational Database
+
Object Oriented API
5
Relational Representation
[Figure: diagram of the relational representation, with boxes for Corpus, Trees, Coreference, Names, Propositions and Senses]
6
Example: Database Representation of Syntax
• Treebank tokens (stored in the Token table) provide the common base
• The Tree table stores the recursive tree nodes, each with its span
• Subsidiary tables define the sets of function tags, phrase types, etc.
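The layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (token, tree, phrase_type, start_token, end_token and so on) are assumptions and not the actual OntoNotes schema.

```python
import sqlite3

# Hypothetical, simplified schema: all table and column names are illustrative.
SCHEMA = """
CREATE TABLE phrase_type (id TEXT PRIMARY KEY);           -- subsidiary type table
CREATE TABLE token (                                      -- common base
    sentence_id TEXT    NOT NULL,
    token_index INTEGER NOT NULL,
    word        TEXT    NOT NULL,
    PRIMARY KEY (sentence_id, token_index)
);
CREATE TABLE tree (                                       -- recursive tree nodes
    id          TEXT PRIMARY KEY,
    parent_id   TEXT REFERENCES tree(id),                 -- NULL for the root
    sentence_id TEXT NOT NULL,
    phrase_type TEXT REFERENCES phrase_type(id),
    start_token INTEGER NOT NULL,                         -- span, inclusive
    end_token   INTEGER NOT NULL
);
"""


def build_example():
    """Load the phrase 'in central Europe' as a PP containing an NP."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO phrase_type VALUES (?)", [("PP",), ("NP",)])
    words = ["in", "central", "Europe"]
    conn.executemany("INSERT INTO token VALUES ('s0', ?, ?)", list(enumerate(words)))
    conn.executemany(
        "INSERT INTO tree VALUES (?, ?, 's0', ?, ?, ?)",
        [("pp0", None, "PP", 0, 2), ("np0", "pp0", "NP", 1, 2)],
    )
    return conn


def span_text(conn, node_id):
    """Return the words covered by a tree node, resolved through the token table."""
    rows = conn.execute(
        """
        SELECT t.word
        FROM tree AS n
        JOIN token AS t
          ON t.sentence_id = n.sentence_id
         AND t.token_index BETWEEN n.start_token AND n.end_token
        WHERE n.id = ?
        ORDER BY t.token_index
        """,
        (node_id,),
    ).fetchall()
    return " ".join(word for (word,) in rows)


conn = build_example()
print(span_text(conn, "np0"))  # central Europe
```

Because every node stores only a span over the shared tokens, any other layer that anchors on tokens or tree nodes can be joined against the same tables.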
7
Object Oriented API
8
Using the API: Importing the modules
9
Using the API: Creating Skeleton Objects
10
Using the API: Creating Full-fledged Objects (I)
11
Using the API: Creating Full-fledged Objects (II)
12
Using the API: Writing to the database
13
Using the API: Reading from the Database
14
Data Loading Life-cycle
Database
15
OntoNotes Data: Current and Future
OntoNotes 1.0
        NW    BN    BC
Eng     300
Chi     250
Ara

OntoNotes 2.0
        NW    BN    BC
Eng     300   200
Chi     250   300
Ara     100

OntoNotes 3.0
        NW    BN    BC
Eng     500   200   200
Chi     250   300   150
Ara     200
16
Advantages of an Integrated Representation
Clean, consistent layers
– Resolve the inconsistencies and problems that this reveals
Well defined relationships
– Database schema defines the merged structure efficiently
Extract individual views
– Treebank, PropBank, etc.
SQL queries can extract examples based on multiple layers or define new views
Python Object-oriented API allows for programmatic access to tables and queries
17
Example of Database Query Function
# Count the named-entity types found inside ARG0 arguments of the predicate "say".
# Assumes a_proposition_bank, a_tree_document, a_tree_id and a dictionary-style
# a_cursor are already defined.
from collections import defaultdict

ne_hash = defaultdict(int)  # entity type -> count; missing keys start at 0

for a_proposition in a_proposition_bank:
    if a_proposition.lemma == "say":
        arg_in_p_query = "select * from argument where proposition_id = '%s';" % (a_proposition.id)
        a_cursor.execute(arg_in_p_query)
        argument_rows = a_cursor.fetchall()

        for a_argument_row in argument_rows:
            a_argument_id = a_argument_row["id"]
            a_argument_type = a_argument_row["type"]

            if a_argument_type == "ARG0":
                n_in_arg_q = "select * from argument_node where argument_id = '%s';" % (a_argument_id)
                a_cursor.execute(n_in_arg_q)
                argument_node_rows = a_cursor.fetchall()

                for a_argument_node_row in argument_node_rows:
                    a_node_id = a_argument_node_row["node_id"]

                    # Named entities anchored on the argument node itself
                    a_ne_node_query = "select * from name_entity where subtree_id = '%s';" % (a_node_id)
                    a_cursor.execute(a_ne_node_query)
                    ne_rows = a_cursor.fetchall()

                    for a_ne_row in ne_rows:
                        a_ne_type = a_ne_row["type"]
                        ne_hash[a_ne_type] = ne_hash[a_ne_type] + 1

                    # Named entities anchored on any subtree of the argument node
                    a_tree = a_tree_document.get_tree(a_tree_id)
                    a_node = a_tree.get_subtree(a_node_id)

                    for a_child in a_node.subtrees():
                        a_ne_subtree_query = "select * from name_entity where subtree_id = '%s';" % (a_child.id)
                        a_cursor.execute(a_ne_subtree_query)
                        ne_subtree_rows = a_cursor.fetchall()

                        for a_ne_subtree_row in ne_subtree_rows:
                            a_subtree_ne_type = a_ne_subtree_row["type"]
                            ne_hash[a_subtree_ne_type] = ne_hash[a_subtree_ne_type] + 1
Question: What is the distribution of named entities that are ARG0s of the predicate "say"?
Key steps:
if (proposition.lemma == "say"):
    query = "select * from argument where proposition_id = '%s';" ..
    if (argument_type == "ARG0"):
        for child in node.subtrees():
            ......
Name Entity     Frequency
Person          84
GPE             34
Organization    29
NORP            15
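The same question can also be posed as a single SQL join. The sketch below runs against a toy in-memory SQLite database. The argument, argument_node and name_entity tables and their columns follow the names used in the loop above, while the proposition table with a lemma column and all the sample rows are assumptions. Unlike the loop, this version only counts entities anchored exactly on the argument node and does not descend into its subtrees.

```python
import sqlite3


def build_toy_db():
    """Create a tiny stand-in database; every row here is made up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE proposition   (id TEXT PRIMARY KEY, lemma TEXT);
        CREATE TABLE argument      (id TEXT PRIMARY KEY, type TEXT, proposition_id TEXT);
        CREATE TABLE argument_node (argument_id TEXT, node_id TEXT);
        CREATE TABLE name_entity   (subtree_id TEXT, type TEXT);
    """)
    conn.executemany("INSERT INTO proposition VALUES (?, ?)",
                     [("p1", "say"), ("p2", "say"), ("p3", "buy"), ("p4", "say")])
    conn.executemany("INSERT INTO argument VALUES (?, ?, ?)",
                     [("a1", "ARG0", "p1"), ("a2", "ARG1", "p1"), ("a3", "ARG0", "p2"),
                      ("a4", "ARG0", "p3"), ("a5", "ARG0", "p4")])
    conn.executemany("INSERT INTO argument_node VALUES (?, ?)",
                     [("a1", "n1"), ("a2", "n2"), ("a3", "n3"), ("a4", "n4"), ("a5", "n5")])
    conn.executemany("INSERT INTO name_entity VALUES (?, ?)",
                     [("n1", "Person"), ("n2", "GPE"), ("n3", "Person"),
                      ("n4", "Organization"), ("n5", "GPE")])
    return conn


def ne_distribution(conn, lemma, arg_type):
    """Count entity types on nodes filling arg_type of predicates with this lemma."""
    return conn.execute(
        """
        SELECT ne.type, COUNT(*) AS frequency
        FROM proposition   AS p
        JOIN argument      AS a  ON a.proposition_id = p.id
        JOIN argument_node AS an ON an.argument_id   = a.id
        JOIN name_entity   AS ne ON ne.subtree_id    = an.node_id
        WHERE p.lemma = ? AND a.type = ?
        GROUP BY ne.type
        ORDER BY frequency DESC, ne.type
        """,
        (lemma, arg_type),
    ).fetchall()


conn = build_toy_db()
print(ne_distribution(conn, "say", "ARG0"))  # [('Person', 2), ('GPE', 1)]
```

Passing the lemma and argument type as query parameters avoids building SQL strings by hand and lets the database do the filtering and counting in one pass.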
18
Reconciling Treebank and PropBank
We found several mismatches between syntax and propositions
– Sometimes PropBank was right
– Sometimes Treebank was right
Guidelines modified to bring the two in line
Now each argument points to a single node in the tree
– Secondary connections are made using Treebank trace chains
– Almost no discontinuous arguments
– Non-trace connections are explicitly identified
This greater consistency will make it easier to train models that predict argument structure
19
Sensitivity to Changes – PropBank changes
... major reductions and realignments of troops in central Europe – ...
[Figure: parse tree of the phrase above (node labels S, NP, PP, JJ, NNS, CC, IN, NNP) with the PropBank arguments ARG1, ARG2 and ARGM-LOC marked on its nodes]
20
Sensitivity to Changes – Treebank changes
... major reductions and realignments of troops in central Europe – ...
[Figure: parse tree of the phrase above (node labels S, NP, PP, JJ, NNS, CC, IN, NNP)]
• If a node is deleted, remove the associated annotation
• If any node has a change in its children or parent node, update the associated annotation
• Print the new PropBank
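A minimal sketch of the rules above, assuming each tree version is given as a dictionary from node id to a (parent id, child ids) pair and each annotation is anchored on one node id. The function name, the data layout and the node ids are all hypothetical, and the toy change below is not the actual Treebank revision shown on the slide.

```python
def reconcile_annotations(old_nodes, new_nodes, annotations):
    """Split annotations into kept, needing update, and removed.

    old_nodes / new_nodes: {node_id: (parent_id, tuple_of_child_ids)}
    annotations:           {node_id: label}
    """
    kept, needs_update, removed = {}, {}, {}
    for node_id, label in annotations.items():
        if node_id not in new_nodes:
            # The anchor node was deleted: drop the annotation.
            removed[node_id] = label
        elif old_nodes.get(node_id) != new_nodes[node_id]:
            # Parent or children changed: the annotation must be revisited.
            needs_update[node_id] = label
        else:
            kept[node_id] = label
    return kept, needs_update, removed


# Toy revision: pp_in is re-attached under np_troops and jj_major is deleted.
old_nodes = {
    "np_top": ("s", ("np_head", "pp_of", "pp_in")),
    "pp_of": ("np_top", ("in_of", "np_troops")),
    "pp_in": ("np_top", ("in_in", "np_europe")),
    "np_troops": ("pp_of", ("nns_troops",)),
    "jj_major": ("np_head", ()),
}
new_nodes = {
    "np_top": ("s", ("np_head", "pp_of")),
    "pp_of": ("np_top", ("in_of", "np_troops")),
    "pp_in": ("np_troops", ("in_in", "np_europe")),
    "np_troops": ("pp_of", ("nns_troops", "pp_in")),
}
annotations = {"pp_of": "ARG1", "pp_in": "ARGM-LOC", "jj_major": "ARG2"}

kept, needs_update, removed = reconcile_annotations(old_nodes, new_nodes, annotations)
print(kept)          # {'pp_of': 'ARG1'}
print(needs_update)  # {'pp_in': 'ARGM-LOC'}
print(removed)       # {'jj_major': 'ARG2'}
```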
21
Adding a new layer
1. What information do you want to capture?
2. Define relationship with the required layer
3. Design tables
4. Superimpose on existing machinery with respect to the anchor
5. Create a class in the corpora package
a. Define a few specific functions
• Create object from original annotation (Text Reader)
• Write object to database (DB Writer)
• Create object from database (DB Reader)
• Write database to original format (Text Writer)
• Pretty print function (Pretty Printer)
b. Write at least one alignment function at the level where the enrichment is required, or even multiple levels
• Enrich Treebank/Document/…
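Step 5 can be sketched as a small class with the five functions plus one alignment function. Everything here is hypothetical: the NodeLabel class, the node_label table, the one-line text format and the enrich_tree helper are invented for illustration and are not part of the actual corpora package.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class NodeLabel:
    """One annotation of a hypothetical new layer: a label anchored on a tree node."""
    node_id: str
    label: str

    @classmethod
    def from_text(cls, line):
        """Text Reader: parse one 'node_id label' line of the original annotation."""
        node_id, label = line.split()
        return cls(node_id, label)

    def write_to_db(self, conn):
        """DB Writer: store the object in the layer's table."""
        conn.execute(
            "INSERT INTO node_label (node_id, label) VALUES (?, ?)",
            (self.node_id, self.label),
        )

    @classmethod
    def from_db(cls, conn, node_id):
        """DB Reader: rebuild the object from the database, or None if absent."""
        row = conn.execute(
            "SELECT node_id, label FROM node_label WHERE node_id = ?", (node_id,)
        ).fetchone()
        return cls(*row) if row else None

    def to_text(self):
        """Text Writer: serialize back to the original one-line format."""
        return f"{self.node_id} {self.label}"

    def pretty(self):
        """Pretty Printer: human-readable rendering."""
        return f"[{self.label}] on node {self.node_id}"


def enrich_tree(tree_nodes, labels):
    """Alignment function: attach each label to its anchor node (a dict per node)."""
    for item in labels:
        if item.node_id in tree_nodes:
            tree_nodes[item.node_id]["label"] = item.label
    return tree_nodes


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_label (node_id TEXT PRIMARY KEY, label TEXT)")

original = NodeLabel.from_text("np0 LABEL_A")
original.write_to_db(conn)
restored = NodeLabel.from_db(conn, "np0")
print(restored.to_text())  # np0 LABEL_A
print(restored.pretty())   # [LABEL_A] on node np0
```

Keeping the reader and writer pairs symmetric makes a round trip (text, database, text) a cheap regression check whenever the layer changes.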
22
A Few Errors Found
Missing co-indices in Trees (found during loading)
Invalid sense numbers (while checking against repository)
Multiple sense definitions (in the repository)
Validation errors in schemas
Dead pointers in ontology
Multiple coreference chain memberships
Missing/Invalid predicate/argument pointers
Invalid PB/TB merges
Filename/Content mismatches
Pinyin/Unicode inconsistencies
Varying sentence breaks
SLINK Errors
Inconsistent TB
Empty specifications in the merge process
Typos (found through Type Tables)
.. And, a few annotation Errors
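Two of the checks listed above, multiple coreference chain memberships and dead pointers, are easy to express once the data sits in one place. The sketch below works on plain Python dictionaries. The function names, the (document, start, end) mention format and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def multiple_chain_memberships(chains):
    """Return mentions that belong to more than one coreference chain.

    chains: {chain_id: [mention, ...]}, where a mention is any hashable span,
    here a (document_id, start_token, end_token) tuple.
    """
    membership = defaultdict(set)
    for chain_id, mentions in chains.items():
        for mention in mentions:
            membership[mention].add(chain_id)
    return {m: sorted(ids) for m, ids in membership.items() if len(ids) > 1}


def dead_pointers(pointers, known_ids):
    """Return the {source: target} pointers whose target id does not exist."""
    return {src: dst for src, dst in pointers.items() if dst not in known_ids}


# Made-up sample data
chains = {
    "e0": [("doc1", 3, 5), ("doc1", 20, 20)],
    "e1": [("doc1", 3, 5), ("doc1", 30, 31)],
}
pointers = {"sense_a": "concept_1", "sense_b": "concept_9"}
known_concepts = {"concept_1", "concept_2"}

print(multiple_chain_memberships(chains))       # {('doc1', 3, 5): ['e0', 'e1']}
print(dead_pointers(pointers, known_concepts))  # {'sense_b': 'concept_9'}
```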
23
Some Interesting Problems Addressed
Word sense annotation transferred from old Treebank to new Treebank
Coreference annotation transferred to new Treebank
Treebank/PropBank with or without NMLs reside in harmony
Various levels of data quality identified in the database
Varying styles of marking traces normalized
Language specific idiosyncrasies in inventories and frames normalized
Data generated for annotation
– Eventive nouns
– Coreference
24
A Few Lessons Learned
Each layer should
– abide by a minimum dependency principle
– adhere to a well defined schema
Try to maintain consistency across representation of similar components
Use a centralized, version controlled repository
Need for single-point, push-button loading philosophy
25
Conclusion
Many annotation layers are available, integrated using a relational schema
An extensible, relational/object-oriented architecture is available to the community
Easily accessible
– Through Python API
– SQL queries
OntoNotes Release 2.0 available from LDC
unencumbered, open source!!
26
Backup
27
Syntax Layer
Identifies meaningful phrases in the text
Lays out the structure of how they are related
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
[Figure: SYNTAX layer, showing the parse tree (node labels S, NP, PP, JJ, NNS, CC, IN, NNP) of "... major reductions and realignments of troops in central Europe – ..."]
28
Propositional Structure
Tells who did what to whom
For both verbs and nouns
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
[Figure: parse tree (node labels S, NP, PP, JJ, NNS, CC, IN, NNP) of "... major reductions and realignments of troops in central Europe – ..." with the arguments ARG1, ARG2 and ARGM-LOC marked]
29
Predicate Frames
Predicate frames define the meanings of the numbered arguments
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
reduction: reduce.01 – Make less
Argument                  Filler in the example
ARG0 – Agent              -
ARG1 – Thing falling      the troops
ARG2 – Amount fallen      major
ARG3 – Starting point     -
ARG4 – Ending point       -
30
Word Sense and Ontology
Meanings of nouns and verbs are specified using a catalog of possible senses
All the senses are annotatable at 90% ITA
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
Word Sense: aim
1. Point or direct object, weapon, at something ...
2. Wish, purpose or intend to achieve something
Word Sense: register
1. Enter into an official record
2. Be aware of, enter into someone’s consciousness
3. Indicate a measurement
4. Show in one’s face
Senses selected in the example sentence
– aim: 2. Wish, purpose or intend to achieve something
– register: 1. Enter into an official record
Ontology links (currently being added) capture similarities between related senses of different words
31
Coreference
Identifies different mentions of the same entity within a document – especially links definite, referring noun phrases, and pronouns to their antecedents
Two types tagged – Identity and Attributive
Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .
[Figure: entity chains e0, e1 and e2. Mentions shown: "President Bush", "He", "conventional arms talk", "Pentagon". Spans highlighted in the sentence: "Vienna talks – which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe" and "the Pentagon"]