CS 652, Peter Lindes 1
4. Relationship Extraction
Part 4 of Information Extraction
Sunita Sarawagi
9/7/2012
The Problem
• Relate extracted entities – unstructured text not partitioned into records
• Various competitions
– MUC
– ACE
– BioCreAtIvE II Protein-Protein Interaction
Groups of Relationships
• ACE:
– located at, near, part, role, social for entities:
– person, organization, facility, location, and geo-political entity
• Biomedical: gene-disease, protein-protein, subcellular regularizations
• NAGA knowledge base: 26 relationships such as: isA, bornInYear, establishedInYear, hasWonPrize, locatedIn, politicianOf, …
Three Problem Levels
• First case:
– Entities preidentified in unstructured text
– Given a pair of entities, find type of relationship
• Second case:
– Given relationship type r, entity name e
– Extract entities with which e has relationship r
• Third case:
– Open-ended corpus – the web
– Given relationship type r, find entity pairs
Given Entity Pair, Find Relationship
• R: set of relationship types
• R′: R plus a special member for “other”
• x: a “snippet” of text (might be a sentence)
• E1 and E2: entities in x
• Identify the relationship in R′ between E1 and E2
• Resources available:
– Surface Tokens
– Part of Speech tags
– Syntactic Parse Tree Structure
– Dependency Graph
• Use these clues to classify (x, E1, E2) into one member of R′
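
As an illustration of how surface tokens can become clues for classifying (x, E1, E2), here is a minimal sketch. The function name, the feature names, and the example sentence are assumptions made for illustration. It uses only the tokens between the two mentions and ignores POS tags, parse trees, and dependency graphs.

```python
def surface_features(tokens, e1_span, e2_span):
    """Surface-token features for one (x, E1, E2) instance.

    tokens: list of str (the snippet x).
    e1_span, e2_span: (start, end) token offsets, end exclusive.
    All names here are illustrative, not from the slides.
    """
    # Order the two mentions by position so the "between" slice is well defined.
    (s1, t1), (s2, t2) = sorted([e1_span, e2_span])
    between = tokens[t1:s2]
    feats = {f"between={w.lower()}" for w in between}
    feats.add(f"num_between={len(between)}")
    # Record which entity comes first, since many relationships are directed.
    feats.add("e1_first" if e1_span[0] <= e2_span[0] else "e2_first")
    return feats


tokens = "Google acquired YouTube in 2006".split()
feats = surface_features(tokens, (0, 1), (2, 3))
print(sorted(feats))
```

A feature set like this would be passed to any standard classifier that assigns the instance to a relationship type or to “other”.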
Parse Tree
Dependency Graph
Methods to Extract Relationships
• Feature-based methods
– String form, orthographic type, POS tag, etc.
– Features from Dependency Graph
– Features from Word Sequence
– Features from Parse Trees
• Kernel-based methods
– Kernel function K(X, X’) captures similarity
– Support Vector Machine (SVM) classifier
• Rule-based methods
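
To make the idea of a kernel function K(X, X’) concrete, the sketch below uses a deliberately simple stand-in: the dot product of token-count vectors for the words between the two entities. This is an assumption chosen for brevity. The kernels used in practice for relationship extraction (over word subsequences, parse trees, or dependency paths) are richer. The function name and example phrases are hypothetical.

```python
from collections import Counter


def bow_kernel(x, x_prime):
    """K(X, X'): dot product of token-count vectors.

    A very simple valid kernel, used here only as a stand-in for the
    sequence and tree kernels a real system would use.
    """
    cx, cxp = Counter(x), Counter(x_prime)
    return sum(cx[w] * cxp[w] for w in cx.keys() & cxp.keys())


# Tokens between the two entity mentions in three hypothetical snippets.
a = "was acquired by".split()
b = "acquired".split()
c = "is located in".split()

# Gram matrix of pairwise similarities; an SVM that accepts a precomputed
# kernel could be trained on a matrix like this.
gram = [[bow_kernel(p, q) for q in (a, b, c)] for p in (a, b, c)]
print(gram)
```

The point of the kernel view is that the classifier only ever needs these pairwise similarities, so structured inputs never have to be flattened into explicit feature vectors.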
Given Relationship, Find Entity Pairs
• Given one or more relationship types
• Find all occurrences in a corpus
• Open document collection
• No labeled unstructured training data
• Instead, seeding for each relationship type is used
Seed Data for Relationship Type r
• The types of entities that are arguments of r
– Often specified at a high level, e.g. proper noun, common noun, numeric, etc.
– Types such as “Person” or “Company” require patterns to recognize them
• A seed database S of entities that have r
– May include negative examples
• A seed set or manually coded patterns
– Easy for generic relationships, e.g. hypernym or meronym (part-of)
3 Steps for Relationship Extraction
• Start with above seeding data
– A corpus D
– Relationship types r1,…,rk
– Entity types Tr1, Tr2 for each r
– A set S of examples (Ei1,Ei2,ri) 1 ≤ i ≤ N
• 1: Use S to learn extraction patterns M
• 2: Use a subset of patterns to create candidates
• 3: Validation: select a subset based on statistical tests
Example Data
• Relationships: “IsPhDAdvisorOf”, “Acquired”
• Entity types: “(Person, Person)”, “(Company, Company)”
Learn Patterns from Seed Triples
• Assume only one relationship for each pair
• Thus each example for r is negative for r’
• 1: Find sentences with entity pairs
– For (E1,E2,r) query for “E1 NEAR E2”
– Filter out where E1, E2 don’t match Tr1, Tr2
• 2: Filter sentences for the relationship
• 3: Learn patterns from sentences
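
Step 1 and the one-relationship-per-pair assumption can be sketched as follows. This is a rough illustration only: plain substring matching stands in for the “E1 NEAR E2” query, the entity-type filter is omitted, and the function name, seed triple, and sentences are hypothetical.

```python
def label_sentences(sentences, seeds, relations):
    """Attach labels to sentences that mention both entities of a seed triple.

    Under the assumption that each pair has only one relationship, a
    sentence found for (E1, E2, r) is a positive example for r and a
    negative example for every other relationship type.
    """
    examples = []
    for sent in sentences:
        for e1, e2, r in seeds:
            # Substring test: a crude stand-in for an "E1 NEAR E2" query.
            if e1 in sent and e2 in sent:
                labels = {rel: rel == r for rel in relations}
                examples.append((sent, e1, e2, labels))
    return examples


relations = ["Acquired", "IsPhDAdvisorOf"]
seeds = [("Google", "YouTube", "Acquired")]
sentences = ["Google bought YouTube in 2006.", "YouTube hosts videos."]
examples = label_sentences(sentences, seeds, relations)
print(examples)
```

Sentences collected this way are still noisy, since a sentence can mention both entities without expressing the relationship, which is why a filtering step follows.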
Filtering Sentences
• Example:
• Banko: a simple heuristic using the length of dependency links
• This fails for the above example
Learn Patterns from Sentences
• Formulate as a standard classification problem
• Two practical problems:
– No guarantee of positive examples
    • Bunescu and Mooney: use SVM
– Many sentences for each pair
    • Bunescu and Mooney: down-weight correlated terms
Extract Candidate Entity Pairs
• Learned model M: (x,E1,E2) -> r
• Simple method: sequential scan over D
– Look for Tr1, Tr2, then apply M
• Large, indexed corpus: retrieve relevant sentences
– Use keyword search
    • Pattern-based
    • Keyword-based
    • Agichtein and Gravano: iterative solution
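
The simple sequential-scan method might look like the sketch below. The entity recognizer and the learned model M are passed in as functions. The toy versions shown here (a small gazetteer and a fixed phrase test) are hypothetical placeholders, not real components.

```python
from itertools import product


def scan_corpus(corpus, find_entities, model, type1, type2):
    """Scan every sentence, pair up entities of the required types, apply M."""
    candidates = []
    for sent in corpus:
        ents = find_entities(sent)  # list of (mention, entity_type)
        firsts = [m for m, t in ents if t == type1]
        seconds = [m for m, t in ents if t == type2]
        for e1, e2 in product(firsts, seconds):
            if e1 == e2:
                continue  # an entity cannot be paired with itself
            r = model(sent, e1, e2)
            if r != "other":
                candidates.append((e1, e2, r))
    return candidates


# Toy stand-ins for the entity recognizer and the learned model M.
GAZETTEER = {"Google": "Company", "YouTube": "Company"}


def toy_find_entities(sent):
    words = sent.replace(".", "").split()
    return [(w, GAZETTEER[w]) for w in words if w in GAZETTEER]


def toy_model(sent, e1, e2):
    return "Acquired" if f"{e1} acquired {e2}" in sent else "other"


corpus = ["Google acquired YouTube.", "YouTube hosts videos."]
found = scan_corpus(corpus, toy_find_entities, toy_model, "Company", "Company")
print(found)
```

A full scan touches every sentence in D, which is why a large indexed corpus calls for retrieving only the relevant sentences first.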
Validate Extracted Relationships
• Extraction has high error rates
• Validation based on corpus-wide statistics
• Probabilities based on count of occurrences
– Extract only high-confidence relationships
• Rare relationships:
– Use contextual pattern
– Alternative: correct entity boundary errors
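
One simple way to turn occurrence counts into a confidence score is a noisy-or combination: if each individual extraction is assumed to be correct with probability p, independently, then a triple seen n times is wrong only if all n extractions are wrong. This formula, the value of p, and the threshold are assumptions chosen for illustration, not taken from the slides.

```python
from collections import Counter


def validate(extractions, per_mention_precision=0.5, threshold=0.8):
    """Keep triples whose noisy-or confidence reaches the threshold.

    confidence = 1 - (1 - p)^n, where n is the number of times the
    triple was extracted and p is an assumed per-extraction precision.
    """
    counts = Counter(extractions)
    confident = {}
    for triple, n in counts.items():
        conf = 1.0 - (1.0 - per_mention_precision) ** n
        if conf >= threshold:
            confident[triple] = conf
    return confident


# Hypothetical extractor output: one triple seen three times, one seen once.
extractions = [("Google", "YouTube", "Acquired")] * 3 + [
    ("YouTube", "Google", "Acquired")
]
kept = validate(extractions)
print(kept)
```

Count-based scores like this favor frequently mentioned relationships, so rare ones need other evidence such as contextual patterns.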
Summary
• Setting 1: entities already marked
– Feature-based and kernel-based methods
– Clues from word sequence, parse trees, and dependency graphs
– Training data with labeled relationships
• Setting 2: open corpus, given relationship types
– No labeled unstructured data
– Seed database of (E1,E2,r) examples
– Bootstrapping from seed data
– Filter based on relevancy
• Accuracy:
– 50%-70% for closed benchmark datasets
– Lots of special case handling for the web
Further Readings
• Concentrated here on binary relationships
• Natural extension: records with multi-way relationships
• Requires cross-sentence analysis:
– Co-reference resolution
– Discourse analysis
• Much literature on this topic
• Future research: discovering relevant relationship types