Upload
cuthbert-fowler
View
214
Download
0
Tags:
Embed Size (px)
Citation preview
1
A Hierarchical Approach to Wrapper Induction
Presentation by Tim Chartrand of
A paper by
Ion Muslea, Steve Minton and
Craig Knoblock
2
Introduction
IE from web pages is important because of the amount of semistructured information on the web
IE depends on the construction of wrappers
Manual wrapper construction is tedious and hard
Previous wrapper learning systems require a lot of hand-marked training data
STALKER is a supervised learning algorithm for inducing wrappers
STALKER requires fewer triaining examples than other approaches and is able to wrap more pages
3
Structure of a document
Semistructured documents follow a formal grammar
An Embedded Catalog (EC) represents the structure of a document as a tree Leaves are items of interest Internal nodes are lists of k-tuples Each item of a k-tuple can be a leaf or another
list
5
Extraction Rules
Each tree node represents a sequence of tokens
The root node represents the entire document
Each node’s sequence is a subsequence of its parent’s
An extraction rule is associated with each edge of the tree Specifies how to extract the child content x from the parent content p In other words describes how to match the prefix of x w.r.t p --
Prefixx(p) Each child’s extraction rule is independent of its siblings
Use Landmarks – either tokens or wildcards (token classes)
Can be disjunctive – apply rule R1 or rule R2
6
Extraction Rules – examples
SkipTo(<b>)
SkipTo(Name)SkipTo(<b>)
SkipTo(Name Symbol HtmlTag)
SkipTo(</b>)
SkipTo(,)SkipUntil(Num)
SkipTo(AllCaps)NextLMark(Num)
7
Extraction Rules as Finite Automata
An extraction rule is equivalent to an FSA
Transition conditions correspond to the landmarks used in the extraction rules
Empty looping transitions are taken when a landmark has not been reached
R1 = SkipTo(()
R2 = SkipTo(Phone)SkipTo(<B>)
8
Learning Algorithm
Sequential Covering Algorithm
Choose the rule that covers most examples and remove the examples it covers
Return the disjunction of all rules found
11
Related Work
Manual wrapper construction TSIMMIS, procedural languages, etc Hard and error prone
Automatic wrapper construction WEIN
Less expressive – only uses the equivalent of SkipTo() without wildcards Not able to express arbitrarily deep tuples
SoftMealy Generates rules as finite transducers More expressive than WEIN but strictly less expressive than STALKER Must see all possible item orderings
WHISK, RAPIER, and SRV Use NLP Techniques Use landmarks similar to STALKER
Ontology approach – DEG Can handle lists with multiplicity constraints Character based rather than token based landmarks
12
Conclusions
STALKER uses EC formalism to turns a hard problem into several smaller ones Unseen permutations of data items can be recognized Arbitrarily long lists can be recognized The entire document can be interpreted as a list of tuples
STALKER rules use an expressive landmark based format
High accuracy wrappers can be induced automatically based on very few training examples compared with other systems
13
Related Work – BYU DEG
RAPIER rules correspond closely to DEG data frames. Data frames are finer-grained, based on character patterns,
whereas rules are based on word patterns Pre-filler and Post-filler patterns correspond closely to data frame
contexts and key words Semantic categories correspond closely with lexicons
Not mentioned how RAPIER handles multiple record documentsRapier data structure is given by the template (slots) defined in the input dataRAPIER is very similar in purpose to what Joe is trying to do – learn extraction rules based on a filled in form
14
Conclusions
Extracting desired pieces of information from NL text is importantManually constructing IE systems too hardRAPIER uses relational learning to build a set of pattern-match rules given a database of texts and filled templatesLearned patterns employ syntactic and semantic information to match slot fillers and contextFairly accurate results can be obtained for a real-world problem with relatively small datasetsRAPIER compares favorably with other IE learning systems