May 11, 2005 WWW 2005 -- Chiba, Japan 1
Thresher: Automating the Unwrapping of
Semantic Content from the World Wide Web
Andrew Hogue
Google
MIT CSAIL
Acknowledgments
• David Karger
• Haystack Group
(http://haystack.csail.mit.edu)
Agenda
• Overview
• Demo
• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics
Unwrapping the Web
• Majority of semantic content in “deep web”
• Transformed into human-readable HTML by scripts
• HTML is difficult for automated agents to understand
• Little incentive for content providers to provide RDF markup
• How to “unwrap” this content?
Thresher
• Simple UI for wrapper induction on structured web content
• “Demonstrate” examples of objects
• Induce wrapper, or pattern, based on DOM
• User may also label properties with RDF
Thresher
• Built on Haystack Semantic Web client
• Everything is RDF
• Everything has context menus
• Thresher brings RDF into the web browser
• Wrappers reify web objects for full interaction
Thresher
• Underlying wrapper algorithm based on tree edit distance
• Align user’s examples
• Keep aligned nodes (layout elements)
• Wildcard non-aligned nodes (content)
• Pattern matching is also alignment
Wrapper Induction
• Wrapper: pattern created from examples
• User provides positive examples
• Generalize examples into reusable pattern
• Existing techniques:
  – Head-left-right-tail (HLRT) descriptors
  – Hidden Markov models
  – Support vector machines
  – Other machine learning
Wrapper Induction
• Our approach: take advantage of hierarchical structure of HTML
• Each example picks out a subtree of DOM
• Calculate tree edit distance between examples
• Least-cost edit distance gives best mapping
• Remove unmapped nodes to make pattern
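The generalization step above can be sketched in a few lines. This is a minimal illustration, not the Thresher implementation: it assumes trees are `(label, [children])` tuples, that text nodes carry their text as the label, and it swaps in a longest-common-subsequence alignment on child labels in place of the least-cost tree edit mapping. The names `generalize` and `WILDCARD` are hypothetical.

```python
WILDCARD = "*"  # hypothetical marker for "content goes here"


def generalize(a, b):
    """Merge two example trees into one pattern.

    Nodes that align (same label, same relative order) are kept;
    runs of children that do not align collapse into one wildcard.
    """
    if a[0] != b[0]:
        return (WILDCARD, [])
    xs, ys = a[1], b[1]
    n, m = len(xs), len(ys)
    # lcs[i][j] = length of the longest common label subsequence
    # of xs[i:] and ys[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i][0] == ys[j][0]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    children, gap = [], False
    i = j = 0
    while i < n and j < m:
        if xs[i][0] == ys[j][0]:
            if gap:  # flush the unaligned run seen so far
                children.append((WILDCARD, []))
                gap = False
            children.append(generalize(xs[i], ys[j]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i, gap = i + 1, True
        else:
            j, gap = j + 1, True
    if gap or i < n or j < m:  # trailing unaligned children
        children.append((WILDCARD, []))
    return (a[0], children)


# Two table rows with the same layout but different text content.
ex1 = ("tr", [("td", [("Talk A", [])]), ("td", [("3:30 PM", [])])])
ex2 = ("tr", [("td", [("Talk B", [])]), ("td", [("4:00 PM", [])])])
pattern = generalize(ex1, ex2)
```

In this sketch the layout elements (`tr`, `td`) survive because they align across both examples, while the differing text nodes become wildcards.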
Tree Edit Distance
• Calculate the cost of the sequence of operations that transforms one tree into the other
• Operations: insert, delete, change a node
• Cost of an operation = size of subtree it affects
• Least-cost set of operations gives best mapping between elements
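As a rough illustration of the cost model above, the sketch below computes a simplified, top-down edit cost between two ordered trees. It is not the algorithm used in Thresher. The assumptions are: inserting or deleting a child costs the size of that child's subtree, changing a node whose label differs costs the larger of the two subtree sizes, and only children of already-mapped parents can be mapped. `Node`, `size`, and `tree_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_cost(a, b):
    """Simplified top-down edit cost between ordered trees a and b."""
    if a.label != b.label:
        # "Change": assumed to affect the larger of the two subtrees.
        return max(size(a), size(b))
    xs, ys = a.children, b.children
    # dp[i][j] = least cost to turn xs[:i] into ys[:j]
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])      # delete
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])      # insert
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                      # delete
                dp[i][j - 1] + size(ys[j - 1]),                      # insert
                dp[i - 1][j - 1] + tree_cost(xs[i - 1], ys[j - 1]),  # map
            )
    return dp[-1][-1]


row_a = Node("tr", (Node("td", (Node("b"),)), Node("td")))
row_b = Node("tr", (Node("td"), Node("td")))
print(tree_cost(row_a, row_b))  # 1: only the <b> node has to be deleted
```

The pairs chosen on the least-cost path through `dp` are the node mapping; everything deleted or inserted along that path is unmapped.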
Mapping Examples
[Figures: mapping between examples, shown over three slides]
Pattern Matching
• Look for document subtrees with similar structure
• Find alignments of wrapper in tree
• Require every node in wrapper be mapped to some node in document subtree
• Wildcards match zero or more times
• Each valid alignment is a match
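The matching rules above can be sketched as a small recursive matcher. This is a simplified illustration under stated assumptions, not the Thresher matcher: trees are `(label, [children])` tuples, a wildcard is a node labelled `"*"` that absorbs zero or more consecutive children, and every non-wildcard wrapper node must map, in order, to exactly one document node. The function names are hypothetical.

```python
WILDCARD = "*"  # hypothetical wildcard label


def matches(pattern, node):
    """True if the wrapper pattern aligns with this document subtree."""
    return pattern[0] == node[0] and match_seq(pattern[1], node[1])


def match_seq(pats, kids):
    """Align a list of pattern children with a list of document children."""
    if not pats:
        return not kids
    head, rest = pats[0], pats[1:]
    if head[0] == WILDCARD:
        # A wildcard may consume zero or more document children.
        return any(match_seq(rest, kids[k:]) for k in range(len(kids) + 1))
    # Any other wrapper node must map to exactly one document node.
    return bool(kids) and matches(head, kids[0]) and match_seq(rest, kids[1:])


def find_matches(pattern, doc):
    """Collect every document subtree the pattern aligns with."""
    found = [doc] if matches(pattern, doc) else []
    for child in doc[1]:
        found.extend(find_matches(pattern, child))
    return found


pattern = ("tr", [("td", [("*", [])]), ("td", [("*", [])])])
doc = ("table", [
    ("tr", [("th", [])]),                                   # header row
    ("tr", [("td", [("Talk A", [])]), ("td", [("3:30 PM", [])])]),
    ("tr", [("td", [("Talk B", [])]), ("td", [])]),         # empty cell
])
rows = find_matches(pattern, doc)  # the two data rows, not the header
```

The third row still matches because the wildcard in its second cell matches zero nodes.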
Matching Example
Adding Semantics
• How to tie wrappers to semantic content?
• Assert RDF statements about unwrapped objects
• Tied to wrapper structure
• Classes bound to wrappers
• Properties bound to wildcards
Semantic Labels
Semantic Matching
[
  <rdf:type> <TalkAnnouncement> ;
  <series> "Dertouzos Lect…" ;
  <dc:title> "Distributed Hash…" ;
  <time> "3:30 PM"
]
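One way to picture how a match turns into statements like the ones above: the wrapper carries a class, each wildcard carries a property, and a successful match binds each property to the text found under its wildcard. The sketch below assumes a deliberately simple model (dictionaries for nodes, each wildcard bound to exactly one document node, a fixed blank-node name); none of these names come from Thresher or Haystack.

```python
def text_of(node):
    """Concatenate all text under a document node."""
    if "text" in node:
        return node["text"]
    return "".join(text_of(child) for child in node["children"])


def bind(pattern, node):
    """Return {property: text} if the pattern matches, else None."""
    if "property" in pattern:            # wildcard labelled with a property
        return {pattern["property"]: text_of(node)}
    if node.get("tag") != pattern["tag"]:
        return None
    if len(pattern["children"]) != len(node["children"]):
        return None
    bound = {}
    for p, n in zip(pattern["children"], node["children"]):
        sub = bind(p, n)
        if sub is None:
            return None
        bound.update(sub)
    return bound


def statements(subject, rdf_class, bound):
    """Triples asserted about one unwrapped object."""
    triples = [(subject, "rdf:type", rdf_class)]
    triples += [(subject, prop, value) for prop, value in bound.items()]
    return triples


def cell(content):
    return {"tag": "td", "children": [content]}


# Wrapper: class bound to the pattern, properties bound to wildcards.
wrapper_class = "TalkAnnouncement"
wrapper = {"tag": "tr", "children": [
    cell({"property": "series"}),
    cell({"property": "dc:title"}),
    cell({"property": "time"}),
]}
row = {"tag": "tr", "children": [
    cell({"text": "Dertouzos Lect…"}),
    cell({"text": "Distributed Hash…"}),
    cell({"text": "3:30 PM"}),
]}
triples = statements("_:talk", wrapper_class, bind(wrapper, row))
```

The truncated strings are copied as they appear on the slide; the blank-node name `_:talk` is only a placeholder for the anonymous resource.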
Automatically Adding Examples
• Find additional examples automatically
• Consider nodes neighboring the example
• Require low normalized cost:
• Often allows us to create wrappers with a single example
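A sketch of the neighbor search, under loudly stated assumptions: the normalization formula from the slide is not preserved here, so this example simply divides a simplified edit cost by the combined size of the two subtrees, and the threshold of 0.25 is arbitrary. Trees are nested tuples `(label, (children...))`, text nodes are all labelled `"#text"` so only structure is compared, and all function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t[1])


@lru_cache(maxsize=None)
def cost(a, b):
    """Simplified top-down edit cost between two ordered trees."""
    if a[0] != b[0]:
        return max(size(a), size(b))
    return forest_cost(a[1], b[1])


@lru_cache(maxsize=None)
def forest_cost(xs, ys):
    """Least cost to turn the child sequence xs into ys."""
    if not xs:
        return sum(size(y) for y in ys)
    if not ys:
        return sum(size(x) for x in xs)
    return min(
        size(xs[0]) + forest_cost(xs[1:], ys),             # delete
        size(ys[0]) + forest_cost(xs, ys[1:]),             # insert
        cost(xs[0], ys[0]) + forest_cost(xs[1:], ys[1:]),  # map
    )


def normalized_cost(a, b):
    # Assumed normalization: cost relative to combined subtree size.
    return cost(a, b) / (size(a) + size(b))


def similar_siblings(parent, example_index, threshold=0.25):
    """Indices of siblings that look like the user's example."""
    example = parent[1][example_index]
    return [i for i, sib in enumerate(parent[1])
            if i != example_index
            and normalized_cost(example, sib) <= threshold]


TEXT = ("#text", ())
header = ("tr", (("th", (("b", (TEXT,)),)),))
row = ("tr", (("td", (TEXT,)), ("td", (TEXT,))))
row_with_link = ("tr", (("td", (("a", (TEXT,)),)), ("td", (TEXT,))))
table = ("table", (header, row, row_with_link, row))

print(similar_siblings(table, 1))  # [2, 3]
```

Here the user points at one row, the other two data rows are close enough to be picked up as extra examples, and the header row is rejected.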
List Collapse
• Current wrappers generalize well for single elements
• Will not recognize variable length lists
• Collapse neighboring nodes with low normalized cost
• For matching, allow nodes to match more than once
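The list-collapse idea can be sketched as follows. To keep the example short it uses exact structural equality as a stand-in for the "low normalized cost" test, so it only merges neighbors that are identical in shape. Trees are nested tuples `(label, (children...))`, and the `"+"` marker and function names are hypothetical.

```python
REPEAT = "+"  # hypothetical marker: "this child may match more than once"


def collapse(node):
    """Merge runs of identical neighboring children into one repeat node."""
    label, kids = node
    out = []
    for kid in (collapse(k) for k in kids):
        if out and (out[-1] == kid or out[-1] == (REPEAT, (kid,))):
            out[-1] = (REPEAT, (kid,))
        else:
            out.append(kid)
    return (label, tuple(out))


def matches(pattern, node):
    return pattern[0] == node[0] and match_seq(pattern[1], node[1])


def match_seq(pats, kids):
    if not pats:
        return not kids
    head, rest = pats[0], pats[1:]
    if head[0] == REPEAT:
        inner = head[1][0]
        k = 0
        # Consume one or more consecutive children matching `inner`.
        while k < len(kids) and matches(inner, kids[k]):
            k += 1
            if match_seq(rest, kids[k:]):
                return True
        return False
    return bool(kids) and matches(head, kids[0]) and match_seq(rest, kids[1:])


li = ("li", ())
pattern = collapse(("ul", (li, li, li)))
# pattern is ("ul", (("+", (("li", ()),)),)): one repeatable <li>
```

After the collapse, the same wrapper matches a list of any length from one item upward, for example `matches(pattern, ("ul", (li,) * 5))` holds.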
Wrapper Wrap-up
• Gather user example(s)
• Automatically find additional examples
• Generalize examples using best mapping
• Add semantic labels
• Match by finding alignments
• Overlay objects on the page for interaction
Additional Tools
• Wrapper Sharing
• RSS
• Web Operations
Our Contributions
• End-user wrapper induction
• Few examples required
• Bring object interaction into the browser
• Wrappers bridge syntactic-semantic gap
Future Work and Applications
• Document-level classes
• Page reformatting
• Autonomous agent interaction
• Negative examples
• Automatic wrapper induction
http://haystack.csail.mit.edu