Attribute & Relationship Extraction for Co-reference Chains
Project Status Report, 10-707
Bo Lin, Kevin Dela Rosa, Rushin Shah
As part of our research, we are working on a cross-document co-reference resolution system.
Co-reference Resolution: Extract all noun phrases from a document (names, descriptions, pronouns), and cluster them according to the real-world entity they describe. Each such cluster is a chain.
Within-doc: Cluster NPs from a single document
Cross-doc: Cluster NPs from different documents
Background – Co-reference Resolution
Run a within-document co-reference (WDC) system on a document and extract chains corresponding to real-world entities. For each chain, track all sentences from which its mentions are obtained.
Features over pairs of such chains:
◦ SoftTFIDF similarity between names
◦ All-words TFIDF cosine similarity (over sentences)
◦ Named Entity (NE) TFIDF cosine similarity (over sentences)
◦ NE SoftTFIDF cosine similarity (over sentences)
◦ Semantic similarity between the NPs of each chain
Train an SVM that classifies pairs of chains as co-referent or not.
Use the SVM to cluster all chains from all documents in the corpus.
Store this clustering in a database. Each entity is a list of chains.
Background – Architecture of our CDC system
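The TFIDF cosine features listed for pairs of chains can be illustrated with a minimal sketch. It assumes each chain is represented by the bag of tokens from its tracked sentences; the function names and the smoothed IDF formula are illustrative choices, not details taken from the system described here.

```python
import math
from collections import Counter

def tfidf_vectors(chains):
    """Build one sparse TF-IDF vector (a dict) per chain.

    `chains` is a list of token lists, one per chain. The IDF is a
    smoothed variant, log((1 + N) / (1 + df)) + 1, chosen for illustration.
    """
    n = len(chains)
    df = Counter(term for tokens in chains for term in set(tokens))
    vectors = []
    for tokens in chains:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The resulting similarity for a pair of chains would be one entry in that pair's feature vector for the SVM.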
Augment our CDC system with the following:
Attribute Extraction
◦ For each chain, extract attributes such as gender, occupation, nationality, and birthdate, if they exist
◦ Use attributes to enable the SVM to do better co-reference
Relationship Extraction
◦ For each pair of chains, extract a relationship (e.g. part of, role, etc.), if it exists
◦ Use relationships for better visualization of clusters
Goals
Patterns: Take seed examples of (entity, attribute) and learn extraction patterns
Position: Use typical document positions of attributes
Transitive: Use attributes of neighboring entities
Latent: Use document-level topic models to infer attributes
References: Ravichandran & Hovy ‘02, Garera & Yarowsky ‘09
Proposed Techniques – Attribute Extraction
Attribute Extraction: Context/Pattern-Based Model
Idea: An attribute value should appear with linguistic clues around it, i.e. extraction can be framed as a probabilistic language model giving the chance that a word is an attribute value given its context. This idea is essentially the same as in KnowItAll or Brin ‘98.
Example:
◦ Marked Text: “<name> (born <dob>) is a former American <occupation>, active <occupation>…”
◦ Generated Context: “(born X)” for X=<dob>, “a former American X” for X=<occupation> …
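A minimal sketch of how contexts might be generated from marked text like the example above. The tokenizer, the fixed window sizes, and the name `context_patterns` are assumptions made for illustration; the actual model may choose context boundaries differently.

```python
import re
from collections import defaultdict

# Tokens are attribute tags like <dob>, words, or single punctuation marks.
TOKEN = re.compile(r"<\w+>|\w+|[^\w\s]")
TAG = re.compile(r"<(\w+)>$")

def context_patterns(marked_text, left=2, right=1, entity_tag="name"):
    """Collect the context window around each attribute tag.

    Each tag is replaced by the placeholder X and surrounded by up to
    `left` tokens before it and `right` tokens after it.
    """
    tokens = TOKEN.findall(marked_text)
    patterns = defaultdict(list)
    for i, token in enumerate(tokens):
        match = TAG.match(token)
        if not match or match.group(1) == entity_tag:
            continue  # skip plain tokens and the entity's own tag
        before = tokens[max(0, i - left):i]
        after = tokens[i + 1:i + 1 + right]
        patterns[match.group(1)].append(" ".join(before + ["X"] + after))
    return dict(patterns)
```

Counting how often each pattern co-occurs with known seed values would then give the scores needed to rank candidate extractions.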
Idea: Certain biographical attributes tend to appear in characteristic positions, often near the top of the article. The relative position/rank between attributes can be helpful information as well.
Example: “Birthdate” is often the first date in a biography text, or at least near the beginning of the article, and “Deathdate” almost always occurs sometime afterwards.
Attribute Extraction: Position-Based Model
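The position heuristic can be sketched as follows. As a simplifying assumption, dates are reduced to four-digit years, the first year in the text is taken as the birth candidate, and the first later year as the death candidate. This is an illustrative baseline, not the implemented model.

```python
import re

# Four-digit years from 1000 to 2099 (an assumption made for this sketch).
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")

def guess_birth_death(text):
    """Return (birth_year, death_year) candidates based only on position."""
    years = [int(y) for y in YEAR.findall(text)]
    if not years:
        return None, None
    birth = years[0]
    # Relative position: the death date should come after the birth date
    # in the text and be a later year.
    death = next((y for y in years[1:] if y > birth), None)
    return birth, death
```

A living person's biography would make the death candidate unreliable, so a real model would need additional evidence before accepting it.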
Attribute Extraction: Transitivity-Based Models
Idea: A named entity is more likely to be mentioned together with other entities that have similar attribute values (the most applicable attributes seem to be “occupation” and “religion”).
Example: “Michael Jordan” (the player) is mostly mentioned
together with “Wilt Chamberlain”, “Dennis Rodman” and fellow players.
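One simple way to act on this intuition is a weighted vote over the known attribute values of co-mentioned entities. The sketch below assumes co-mention counts and a table of known values are already available; the data structures and the name `infer_by_neighbors` are hypothetical.

```python
from collections import Counter

def infer_by_neighbors(target, co_mentions, known_values):
    """Guess an attribute value for `target` from its neighbors.

    co_mentions: dict mapping an entity to {neighbor: co-mention count}.
    known_values: dict mapping an entity to its known attribute value.
    Returns the value with the highest total co-mention weight, or None.
    """
    votes = Counter()
    for neighbor, count in co_mentions.get(target, {}).items():
        value = known_values.get(neighbor)
        if value is not None:
            votes[value] += count
    return votes.most_common(1)[0][0] if votes else None
```

With made-up counts, an entity co-mentioned mostly with known basketball players would be assigned the occupation "basketball player".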
Idea: Use “latent wide-document-context” models to detect
attributes that may not have been mentioned directly in article
Example: Words such as “songs, album, recorded” can all
collectively indicate an occupation of singer or musician
Attribute Extraction: Latent Models
Three options:
◦ Extract a variety of features and train YFCL
◦ Define Kernels to measure similarity between instances, plug into SVM
◦ Semi-supervised approach. Start with seed examples, learn patterns, iterate. Grow KB (e.g. KnowItAll, TextRunner)
Ideally would prefer semi-supervised, especially since it allows open-domain IE, but very time- and labor-intensive
Kernel approach more elegant, works better than YFCL
Therefore, we proposed to use Kernels
Proposed Techniques – Relation Extraction
Attributes: NNDB seed pairs, Wikipedia pages
Relations: ACE Phase 2 top level relations
For our CDC system:
◦ John Smith corpus
◦WePS (person name disambiguation) corpora
Proposed Datasets
Data Preparation
◦ Collected attribute extraction data set: currently our seed pairs have occupation, date-of-birth, and date-of-death values; we plan on collecting birthplace, gender, and nationality
◦ Implemented the program to parse simple Wikipedia pages
Framework
◦ Implemented a pipeline process which consists of: loading the list of names and pages; parsing Wikipedia pages; sentence segmentation; POS tagging / NE tagging (if needed); modeling and extraction
◦ Provides the capability to support different models for different attributes
Progress Report – Attribute Extraction
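The pipeline with a pluggable model per attribute might look roughly like the sketch below. It covers only sentence segmentation and the modeling step; the regex segmenter, the toy occupation model, and all names are assumptions for illustration, and page parsing and tagging are omitted.

```python
import re
from typing import Callable, Dict, List

# An attribute model maps a page's sentences to candidate values.
AttributeModel = Callable[[List[str]], List[str]]

def split_sentences(text: str) -> List[str]:
    """Naive segmentation on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def keyword_occupation_model(sentences: List[str]) -> List[str]:
    """Toy occupation model: report words from a small hypothetical list."""
    occupations = {"painter", "singer", "senator"}
    found = {
        word
        for sentence in sentences
        for word in re.findall(r"\w+", sentence.lower())
        if word in occupations
    }
    return sorted(found)

def run_pipeline(pages: Dict[str, str],
                 models: Dict[str, AttributeModel]) -> Dict[str, Dict[str, List[str]]]:
    """Apply every registered attribute model to every page."""
    results = {}
    for name, text in pages.items():
        sentences = split_sentences(text)
        results[name] = {attr: model(sentences) for attr, model in models.items()}
    return results
```

Registering models in a dictionary keyed by attribute name is one way to let each attribute use a different model without changing the pipeline itself.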
Implementations
◦ Implemented context/pattern-based model, currently focusing on occupations
◦ Implemented position-based models (absolute & relative), on occupation and date of birth attribute extraction
◦ In the process of implementing transitivity-based model for occupation extraction
◦ In the process of implementing latent wide-document-context models for extracting implicit attributes
Progress Report – Attribute Extraction
Issues
◦ Actual implementation raises a lot of problems not addressed in the paper, such as whether POS tagging should be included in the model; the pipeline framework is intended to solve this issue.
Plan
◦ Finish the implementations and compare with two baseline models for each attribute
◦ Extend code to work on chains and consolidate the attributes
◦ Use extracted attributes as features in the CDC system’s SVM classifier to help co-reference resolution
Progress Report – Attribute Extraction
Evaluated different kinds of Kernels (subsequence, dependency trees, etc).
Chose subsequence kernels as these perform well and are robust to ill-formed documents
As a template, decided to follow the Bunescu and Mooney paper discussed in class
Observation: Although the idea in the paper is easy to understand, implementation is quite complex
Progress Report – Relation Extraction
For SVM, decided to use LibSVM. Bunescu has a modified version that can take custom Kernels, but it is based on an older release
Made updates to LibSVM to reconcile different versions.
◦ Now accepts custom/pre-computed Kernels
◦ Allows richer representation of instances than numerical values (e.g. sentence fragments, named entities, etc.)
Progress Report – Relation Extraction
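For reference, stock LibSVM reads pre-computed kernels (kernel type 4) from lines of the form `<label> 0:<id> 1:K(x_i,x_1) ... L:K(x_i,x_L)`. The sketch below writes a Gram matrix in that layout; whether the modified version described here uses the same layout is an assumption, and the helper name is hypothetical.

```python
def to_libsvm_precomputed(labels, gram):
    """Format a Gram matrix as LibSVM pre-computed kernel training lines.

    labels: class labels, one per instance.
    gram:   square matrix (list of rows) with gram[i][j] = K(x_i, x_j).
    Column 0 carries the 1-based instance ID, as stock LibSVM expects.
    """
    lines = []
    for i, (label, row) in enumerate(zip(labels, gram), start=1):
        cells = " ".join(f"{j}:{value}" for j, value in enumerate(row, start=1))
        lines.append(f"{label} 0:{i} {cells}")
    return lines
```

Because the classifier only ever sees kernel values, the instances themselves can be arbitrary structures such as sentence fragments or named entities.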
As we know, the hardest part of coding Machine Learning applications isn’t the classifier (plenty of libraries), but feature extraction
Wrote feature extraction code that extracts sentences from XML and produces instances for SVM, where each instance consists of:
◦ The 2 Named Entities of interest
◦ Sentence fragments before, between and after NEs
If a sentence contains N > 2 NEs, make one copy per pair of NEs (C(N,2) copies)
Progress Report – Relation Extraction
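The instance generation described above can be sketched as follows, assuming the sentence is already tokenized and each named entity is given as a token span. The field names and the function `make_instances` are illustrative, and XML parsing is left out.

```python
from itertools import combinations

def make_instances(tokens, entities):
    """Create one instance per unordered pair of named entities.

    tokens:   list of sentence tokens.
    entities: list of (start, end) token spans, end exclusive, assumed
              non-overlapping.
    A sentence with N entities yields C(N, 2) instances.
    """
    instances = []
    for (s1, e1), (s2, e2) in combinations(sorted(entities), 2):
        instances.append({
            "e1": tokens[s1:e1],
            "e2": tokens[s2:e2],
            "before": tokens[:s1],      # fragment before the first NE
            "between": tokens[e1:s2],   # fragment between the two NEs
            "after": tokens[e2:],       # fragment after the second NE
        })
    return instances
```

For a sentence with four entities this produces six instances, each differing only in which pair of entities is treated as the pair of interest.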
Implemented general subsequence Kernel algorithm
Currently working on adapting this for the specific case of relationship Kernels (as explained in Bunescu & Mooney)
Once done, will extend this code (which works on sentences), to chains, so it can be plugged into CDC system
Progress Report – Relation Extraction
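As a point of reference for the general subsequence kernel, here is a minimal sketch of the gap-weighted subsequence kernel recursion from Lodhi et al. (2002), applied to token sequences. It is a plain-token version under assumed parameter names (`n`, `lam`), not the relation-specific kernel of Bunescu & Mooney and not necessarily the implementation described here.

```python
def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel K_n(s, t).

    Counts common subsequences of length n, where each occurrence is
    weighted by lam raised to the length of the span it covers.
    Runs in O(n * len(s) * len(t)) time.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(1, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b])
            for b in range(1, lt + 1):
                kpp *= lam
                if s[a - 1] == t[b - 1]:
                    kpp += lam * lam * kp[i - 1][a - 1][b - 1]
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Close each subsequence with a matching final symbol in s and t.
    total = 0.0
    for a in range(1, ls + 1):
        for b in range(1, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total
```

In practice the value is usually normalized by the square root of K_n(s, s) times K_n(t, t), so that longer sentences do not dominate the similarity.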
Questions?