Machine-learning based Semi-structured IE Chia-Hui Chang Department of Computer Science & Information Engineering National Central University [email protected]

IE for Semi-structured Document: Supervised Approach

Page 1: IE for Semi-structured Document: Supervised Approach

Machine-learning based Semi-structured IE

Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
[email protected]

Page 2: IE for Semi-structured Document: Supervised Approach

Wrapper Induction

Wrapper: an extraction program that pulls the desired information from Web pages.
  Semi-structured document → (wrapper) → structured information

Web wrappers wrap:
  "Query-able" or "search-able" Web sites
  Web pages with large itemized lists

The primary issue is: how to build the extractor quickly?

Page 3: IE for Semi-structured Document: Supervised Approach

Semi-structured IE

Developed independently of traditional IE
Driven by the necessity of extracting and integrating data from multiple Web-based sources

Page 4: IE for Semi-structured Document: Supervised Approach

Machine-Learning Based Approach

A key component of IE systems is a set of extraction patterns that can be generated by machine learning algorithms.

Page 5: IE for Semi-structured Document: Supervised Approach

Related Work

Shopbot: Doorenbos, Etzioni, Weld, AA-97

Ariadne: Ashish, Knoblock, Coopis-97

WIEN: Kushmerick, Weld, Doorenbos, IJCAI-97

SoftMealy wrapper representation: Hsu, IJCAI-99

STALKER: Muslea, Minton, Knoblock, AA-99
  A hierarchical FST

Page 6: IE for Semi-structured Document: Supervised Approach

WIEN

N. Kushmerick, D. S. Weld, R. Doorenbos
University of Washington, 1997
http://www.cs.ucd.ie/staff/nick/

Page 7: IE for Semi-structured Document: Supervised Approach

Example 1

Page 8: IE for Semi-structured Document: Supervised Approach

Extractor for Example 1

Page 9: IE for Semi-structured Document: Supervised Approach

HLRT

Page 10: IE for Semi-structured Document: Supervised Approach

Wrapper Induction

Induction: the task of generalizing from labeled examples to a hypothesis

Instances: pages
Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
Hypotheses: e.g. (<p>, <HR>, <B>, </B>, <I>, </I>)
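To make the hypothesis concrete, here is a minimal sketch of how an HLRT wrapper (head, tail, and one left/right delimiter pair per attribute) could be executed. The sample page is a hypothetical reconstruction that is merely consistent with the labels above, the function name `exec_hlrt` is illustrative, and the head delimiter is written as `<P>` to match that sample page.

```python
def exec_hlrt(page, head, tail, delims):
    """Run an HLRT wrapper: skip the head, then repeatedly extract one
    value per (left, right) pair until the tail delimiter comes first."""
    pos = page.find(head)
    if pos < 0:
        return []
    pos += len(head)
    first_left = delims[0][0]
    rows = []
    while True:
        next_row = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no tuple starts before the tail delimiter.
        if next_row < 0 or 0 <= next_tail < next_row:
            break
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start < 0:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Hypothetical page consistent with the labels on the slide.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "<HR><B>End</B></BODY></HTML>"
)

RESULT = exec_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

The head delimiter keeps the bold page title from being mistaken for a country name, and the tail delimiter keeps the bold "End" marker from starting a fifth tuple.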

Page 11: IE for Semi-structured Document: Supervised Approach

BuildHLRT

Page 12: IE for Semi-structured Document: Supervised Approach

Other Family

OCLR (Open-Close-Left-Right): use Open and Close as delimiters for each tuple

HOCLRT: combine OCLR with Head and Tail

N-LR and N-HLRT: Nested LR and Nested HLRT

Page 13: IE for Semi-structured Document: Supervised Approach

Terminology

Oracles: Page Oracle, Label Oracle

PAC analysis determines how many examples are necessary to build a wrapper, given two parameters, accuracy $\epsilon$ and confidence $\delta$:

$\Pr[E(w) < \epsilon] > 1 - \delta$, or $\Pr[E(w) > \epsilon] < \delta$
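To give a feel for how the accuracy and confidence parameters drive the number of examples, here is a minimal sketch using the textbook bound for a consistent learner over a finite hypothesis class, m ≥ (1/ε)(ln|H| + ln(1/δ)). This is a generic PAC result and not the HLRT-specific bound; the hypothesis-space size used below is an arbitrary assumption.

```python
import math


def pac_sample_size(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite
    hypothesis class to have error < epsilon with prob. > 1 - delta."""
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


# Assumed hypothesis-space size of 1000, chosen only for illustration.
EXAMPLES_NEEDED = pac_sample_size(1000, epsilon=0.1, delta=0.1)
```

Tightening either parameter raises the bound: the dependence on 1/ε is linear, while the dependence on 1/δ is only logarithmic.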

Page 14: IE for Semi-structured Document: Supervised Approach

Probably Approximately Correct (PAC) Analysis

With $\epsilon = 0.1$, $\delta = 0.1$, $K = 4$, and an average of 5 tuples per page, BuildHLRT must examine at least 72 examples

[Equation: PAC bound for BuildHLRT]

Page 15: IE for Semi-structured Document: Supervised Approach

Empirical Evaluation

Extracts 48% of Web pages successfully.

Weaknesses: missing attributes, attributes not in order, tabular data, etc.

Page 16: IE for Semi-structured Document: Supervised Approach

Softmealy
Chun-Nan Hsu, Ming-Tzung Dung, 1998
Arizona State University
http://kaukoai.iis.sinica.edu.tw/~chunnan/mypublications.html

Page 17: IE for Semi-structured Document: Supervised Approach

Softmealy Architecture

Finite-State Transducers for Semi-Structured Text Mining
  Labeling: use an interface to label examples manually
  Learner: FST (Finite-State Transducer)
  Extractor
  Demonstration: http://kaukoai.iis.sinica.edu.tw/video.html

Page 18: IE for Semi-structured Document: Supervised Approach

Softmealy Wrapper

SoftMealy wrapper representation
  Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path
  Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes
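The idea that each attribute permutation corresponds to a successful path can be sketched as a small transition table. The states and edges below are hypothetical (attribute names U, N, A, M plus begin and end states b and e) and omit the contextual rules and skip states that a real transducer would carry.

```python
# Hypothetical transition table: state -> states reachable next.
EDGES = {
    "b": {"U", "N"},
    "U": {"N"},
    "N": {"A", "M", "e"},
    "A": {"M", "e"},
    "M": {"e"},
}


def accepts(order):
    """Return True if the attribute order is a path from b to e."""
    state = "b"
    for attr in list(order) + ["e"]:
        if attr not in EDGES.get(state, set()):
            return False
        state = attr
    return True


FULL_TUPLE_OK = accepts(["U", "N", "A", "M"])
MISSING_ATTRS_OK = accepts(["N", "M"])
BAD_ORDER_OK = accepts(["M", "N"])
```

Adding one edge is enough to admit a new ordering or a missing attribute, which is why the representation handles irregular tuples without a separate wrapper per layout.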

Page 19: IE for Semi-structured Document: Supervised Approach

Example

Page 20: IE for Semi-structured Document: Supervised Approach

Four cases

Label the Answer Key

Page 21: IE for Semi-structured Document: Supervised Approach

Finite State Transducer

[Figure: finite-state transducer with states b, U, -U, N, -N, A, -A, M, and e; transitions are labeled extract or skip]

Additionally handles two more cases: (N, M) and (N, A, M)

Page 22: IE for Semi-structured Document: Supervised Approach

Find the starting position -- Single Pass

Newly added definition

Page 23: IE for Semi-structured Document: Supervised Approach

Contextual based Rule Learning

Tokens

Separators
  SL ::= … Punc(,) Spc(1) Html(<I>)
  SR ::= C1Alph(Professor) Spc(1) OAlph(of) …

Rule generalization
  Taxonomy Tree

Page 24: IE for Semi-structured Document: Supervised Approach

Tokens

All-uppercase string: CAlph
An uppercase letter followed by at least one lowercase letter: C1Alph
A lowercase letter followed by zero or more characters: OAlph
HTML tag: Html
Punctuation symbol: Punc
Control characters: NL(1), Tab(4), Spc(3)
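A minimal sketch of a tokenizer that assigns these classes is shown below. The regular expression, the function names, and the extra `Num` class for digit runs are assumptions for illustration, and any alphabetic token that starts with an uppercase letter but is not all uppercase is simply treated as C1Alph.

```python
import re

# One token per match: HTML tag, letters, digits, runs of one kind of
# whitespace, or a single punctuation character.
_TOKEN_RE = re.compile(r"<[^>]*>|[A-Za-z]+|[0-9]+| +|\t+|\n+|[^\sA-Za-z0-9]")


def classify(tok):
    """Map a raw token to a class label such as C1Alph(Professor)."""
    if len(tok) >= 2 and tok[0] == "<" and tok[-1] == ">":
        return f"Html({tok})"
    if tok.isalpha():
        if tok.isupper():
            return f"CAlph({tok})"
        if tok[0].isupper():
            return f"C1Alph({tok})"
        return f"OAlph({tok})"
    if tok.isdigit():
        return f"Num({tok})"  # assumed class, not listed on the slide
    if tok[0] == " ":
        return f"Spc({len(tok)})"
    if tok[0] == "\t":
        return f"Tab({len(tok)})"
    if tok[0] == "\n":
        return f"NL({len(tok)})"
    return f"Punc({tok})"


def tokenize(text):
    return [classify(tok) for tok in _TOKEN_RE.findall(text)]


SEPARATOR_TOKENS = tokenize(", <I>Professor of")
```

Whitespace tokens carry their run length rather than their text, matching the NL(1), Tab(4), Spc(3) notation.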

Page 25: IE for Semi-structured Document: Supervised Approach

Rule Generalization

Page 26: IE for Semi-structured Document: Supervised Approach

Learning Algorithm

Generalize each column by replacing its tokens with their least common ancestor in the taxonomy tree
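The column-wise generalization step can be sketched as a least-common-ancestor lookup. The taxonomy below is hypothetical (the node names `Alph`, `Word`, `NonWord`, and `Any` are assumptions, since the actual tree appears only as a figure), and token arguments such as the matched string are ignored for simplicity.

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "CAlph": "Alph", "C1Alph": "Alph", "OAlph": "Alph",
    "Alph": "Word", "Num": "Word",
    "Html": "NonWord", "Punc": "NonWord",
    "Word": "Any", "NonWord": "Any",
}


def ancestors(cls):
    """Return the chain from a class up to the root, inclusive."""
    chain = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def lca(classes):
    """Least common ancestor of several token classes."""
    common = ancestors(classes[0])
    for cls in classes[1:]:
        others = set(ancestors(cls))
        common = [a for a in common if a in others]
    return common[0]


def generalize(rows):
    """Align example token sequences by column and generalize each."""
    return [lca(column) for column in zip(*rows)]


GENERALIZED = generalize([["C1Alph", "Punc"], ["OAlph", "Punc"]])
```

Columns whose tokens already agree stay specific, so the learned rule is only as general as the examples force it to be.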

Page 27: IE for Semi-structured Document: Supervised Approach

Taxonomy Tree

Page 28: IE for Semi-structured Document: Supervised Approach

Generating to Extract the Body

The contextual rules for the head and tail separators are:
hL ::= C1Alph(Staff) Html(</H2>) NL(1) Html(<HR>) NL(1) Html(<UL>)
tR ::= Html(</UL>) NL(1) Html(<HR>) NL(1) Html(<ADDRESS>) NL(1) Html(<I>) C1Alph(Please)

Page 29: IE for Semi-structured Document: Supervised Approach

More Expressive Power

Softmealy allows:
  Disjunction
  Multiple attribute orders within tuples
  Missing attributes
  Features of candidate strings

Page 30: IE for Semi-structured Document: Supervised Approach

Stalker

I. Muslea, S. Minton, C. Knoblock, University of Southern California

http://www.isi.edu/~muslea/

Page 31: IE for Semi-structured Document: Supervised Approach

STALKER

Embedded Catalog Tree
  Leaves (primitive items): the data to be extracted
  Internal nodes (items): homogeneous list, or heterogeneous tuple

Page 32: IE for Semi-structured Document: Supervised Approach

EC Tree of a page

Page 33: IE for Semi-structured Document: Supervised Approach

Extracting Data from a Document

For each node in the EC tree, the wrapper needs a rule that extracts that particular node from its parent.

Additionally, for each list node, the wrapper requires a list iteration rule that decomposes the list into individual tuples.

Advantages:

First, the hierarchical extraction based on the EC tree allows us to wrap information sources that have arbitrarily many levels of embedded data.

Second, as each node is extracted independently of its siblings, our approach does not rely on there being a fixed ordering of the items, and we can easily handle extraction tasks from documents that may have missing items or items that appear in various orders.
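The hierarchical scheme above can be sketched as follows. Each node carries a rule that locates it inside its parent, and list nodes carry an additional iteration rule. For brevity the rules here are plain start/end strings rather than learned landmark automata, and the document, the `Node` class, and all field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    start: str                        # landmark preceding the item
    end: str                          # landmark following the item
    item_start: Optional[str] = None  # list iteration rule (list nodes)
    item_end: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def between(text, start, end):
    """Text between the first start landmark and the next end landmark."""
    i = text.find(start)
    if i < 0:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j < 0 else text[i:j]


def iterate(text, start, end):
    """Decompose a list into its individual items."""
    items, pos = [], 0
    while True:
        i = text.find(start, pos)
        if i < 0:
            return items
        i += len(start)
        j = text.find(end, i)
        if j < 0:
            return items
        items.append(text[i:j])
        pos = j + len(end)


def extract(node, parent_text):
    """Extract a node from its parent, independently of its siblings."""
    text = between(parent_text, node.start, node.end)
    if text is None:
        return None  # missing item
    if node.item_start is not None:
        return [{c.name: extract(c, item) for c in node.children}
                for item in iterate(text, node.item_start, node.item_end)]
    if node.children:
        return {c.name: extract(c, text) for c in node.children}
    return text


DOC = ("<b>Yala</b><ul>"
       "<li><i>4000 Colfax</i> <u>555-1234</u></li>"
       "<li><u>555-9999</u> <i>523 Vernon</i></li>"
       "<li><i>12 Pico</i></li></ul>")

TREE = [
    Node("name", "<b>", "</b>"),
    Node("addresses", "<ul>", "</ul>", "<li>", "</li>",
         [Node("street", "<i>", "</i>"), Node("phone", "<u>", "</u>")]),
]

RECORD = {node.name: extract(node, DOC) for node in TREE}
```

The second list item has its fields in the opposite order and the third lacks a phone number, yet both are handled because each child is located on its own within the parent's text.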

Page 34: IE for Semi-structured Document: Supervised Approach

Extraction Rules as Finite Automata

• Landmarks
  • A sequence of tokens and wildcards

• Landmark automata
  • A non-deterministic finite automaton

Page 35: IE for Semi-structured Document: Supervised Approach

Landmark Automata

• A linear LA has one accepting state

• from each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state;

• each non-looping transition is labeled by a landmark;

• all looping transitions have the meaning “consume all tokens until you encounter the landmark that leads to the next state”.
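A linear landmark automaton as described above can be sketched as a loop that consumes tokens until the next landmark matches. The tokenizer, the wildcard definitions, and the sample text are assumptions for illustration; only the wildcard names `_HtmlTag_`, `_Word_`, and `_Symbol_` come from the slides.

```python
import re

# Assumed wildcard semantics, for illustration only.
WILDCARDS = {
    "_HtmlTag_": lambda t: t.startswith("<") and t.endswith(">"),
    "_Word_": lambda t: t.isalpha(),
    "_Symbol_": lambda t: len(t) == 1 and not t.isalnum(),
}


def tokenize(text):
    return re.findall(r"<[^>]*>|\w+|[^\w\s]", text)


def matches(landmark, tokens, pos):
    """True if the landmark (tokens and wildcards) matches at pos."""
    if pos + len(landmark) > len(tokens):
        return False
    for pattern, tok in zip(landmark, tokens[pos:]):
        test = WILDCARDS.get(pattern)
        if test is not None:
            if not test(tok):
                return False
        elif pattern != tok:
            return False
    return True


def run_linear_la(landmarks, tokens):
    """Return the token index just past the last landmark, or None."""
    pos = 0
    for landmark in landmarks:
        # Looping transition: consume tokens until the landmark appears.
        while pos < len(tokens) and not matches(landmark, tokens, pos):
            pos += 1
        if pos >= len(tokens):
            return None
        pos += len(landmark)  # non-looping transition to the next state
    return pos


TOKENS = tokenize("Cuisine: Thai <p><i>4000 Colfax</i>")
START = run_linear_la([[":"], ["_HtmlTag_", "<i>"]], TOKENS)
```

Here the automaton has two non-accepting states: the first skips to the colon, the second skips to an HTML tag immediately followed by `<i>`, and the returned index marks where the extracted item begins.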

Page 36: IE for Semi-structured Document: Supervised Approach

Rule Generating

1st iteration:
  Terminals: {; reservation _Symbol_ _Word_}
  Candidates: {; <i> _Symbol_ _HtmlTag_}
  Perfect disjunct: {<i> _HtmlTag_}
  Positive examples: D3, D4
2nd iteration:
  Uncovered: {D1, D2}
  Candidates: {; _Symbol_}

Extract Credit info.
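The iteration pattern above (learn one perfect disjunct, remove the examples it covers, repeat on the rest) is a form of sequential covering. Below is a heavily simplified sketch restricted to single-token landmarks with no wildcards or refinement steps; the four examples are hypothetical and are not the D1 to D4 of the slide.

```python
def apply_landmark(landmark, tokens):
    """Index just past the first occurrence of the landmark, or None."""
    for i, tok in enumerate(tokens):
        if tok == landmark:
            return i + 1
    return None


def learn_disjuncts(examples):
    """Sequential covering over (tokens, target_index) examples."""
    remaining = list(examples)
    rule = []
    while remaining:
        seed_tokens, seed_target = remaining[0]
        best, best_cover = None, []
        # Candidates are the distinct tokens preceding the seed's target.
        for cand in dict.fromkeys(seed_tokens[:seed_target]):
            results = [apply_landmark(cand, toks) for toks, _ in remaining]
            correct = [ex for ex, r in zip(remaining, results) if r == ex[1]]
            wrong = [r for ex, r in zip(remaining, results)
                     if r is not None and r != ex[1]]
            # A perfect disjunct is right wherever it matches at all.
            if not wrong and len(correct) > len(best_cover):
                best, best_cover = cand, correct
        if best is None:
            raise ValueError("no perfect single-token disjunct found")
        rule.append(best)
        remaining = [ex for ex in remaining if ex not in best_cover]
    return rule


def apply_rule(rule, tokens):
    """Try each disjunct in order; the first one that matches wins."""
    for landmark in rule:
        pos = apply_landmark(landmark, tokens)
        if pos is not None:
            return pos
    return None


# Hypothetical examples: the target is the start of the area code.
EXAMPLES = [
    (["Phone", ":", "(", "310", ")"], 3),
    (["Phone", "(", "800", ")"], 2),
    (["Phone", "1", "-", "<b>", "800", "</b>"], 4),
    (["Tel", "1", "-", "<b>", "888", "</b>"], 4),
]

RULE = learn_disjuncts(EXAMPLES)
```

Each disjunct is only checked against examples that earlier disjuncts left uncovered, since at extraction time the earlier disjuncts are tried first.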

Page 37: IE for Semi-structured Document: Supervised Approach

Possible Rules

Page 38: IE for Semi-structured Document: Supervised Approach
Page 39: IE for Semi-structured Document: Supervised Approach
Page 40: IE for Semi-structured Document: Supervised Approach

The STALKER Algorithm

Page 41: IE for Semi-structured Document: Supervised Approach
Page 42: IE for Semi-structured Document: Supervised Approach
Page 43: IE for Semi-structured Document: Supervised Approach

Features

The process is performed in a hierarchical manner.
Attributes that are not in order pose no problem.
Disjunctive rules solve the problem of missing attributes.

Page 44: IE for Semi-structured Document: Supervised Approach

Multi-pass Softmealy
Chun-Nan Hsu and Chian-Chi Chang
Institute of Information Science
Academia Sinica
Taipei, Taiwan

Page 45: IE for Semi-structured Document: Supervised Approach

Multi-pass

Page 46: IE for Semi-structured Document: Supervised Approach

Tabular style document

(Quote Server)

Page 47: IE for Semi-structured Document: Supervised Approach

Tagged-list style document

(Internet Address Finder)

Page 48: IE for Semi-structured Document: Supervised Approach

Layout styles and learnability

Tabular style: missing attributes, ordering as hints

Tagged-list style: variant ordering, tags as hints

Prediction: single-pass for tabular style, multi-pass for tagged-list style

Page 49: IE for Semi-structured Document: Supervised Approach

Tabular result (Quote Server)

Page 50: IE for Semi-structured Document: Supervised Approach

Tagged-list result (Internet Address Finder)

Page 51: IE for Semi-structured Document: Supervised Approach

Comparison

Both: can handle irregular missing attributes; attributes not seen before require training

Single-pass: only a limited set of attribute permutations is allowed; good for tabular pages; faster

Multi-pass: unaffected by attribute permutations; good for tagged-list pages; slower

Page 52: IE for Semi-structured Document: Supervised Approach

Comparison

Quote Server
  Stalker: 10 example tuples, 79%, 500 test
  WIEN: the collection is beyond the learner's capability
  SoftMealy: multi-pass 85%, single-pass 97%

Internet Address Finder
  Stalker: 80% ~ 100%, 500 test
  WIEN: the collection is beyond the learner's capability
  SoftMealy: multi-pass 68%, single-pass 41%

Page 53: IE for Semi-structured Document: Supervised Approach

Comparison

Okra (tabular pages)
  Stalker: 97%, 1 example tuple
  WIEN: 100%, 13 example tuples, 30 test
  SoftMealy: single-pass 100%, 1 example tuple, 30 test

Big-book (tagged-list pages)
  Stalker: 97%, 8 example tuples
  WIEN: perfect, 18 example tuples, 30 test
  SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test

Page 54: IE for Semi-structured Document: Supervised Approach

References

Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68 (special issue on Intelligent Internet Systems), 2000.

Hsu, C.-N. and Dung, M.-T. Generating finite-state transducers for semistructured data extraction from the Web. Information Systems, 23(8):521-538 (special issue on Semistructured Data), 1998.

Muslea, I., Minton, S., and Knoblock, C. Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.