Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning
Chia-Hui Chang
Dept. of Computer Science and Information Engineering, National Central University, Taiwan
Outline
Problem Definition of Information Extraction
  Semi-structured IE
  Plain Text Information Extraction
Methods
  Special designed programming language: W4F, Xwrap, Lixto
  Supervised learning approach: WIEN, Softmealy, Stalker
  Unsupervised learning approach: IEPAD
  Semi-supervised learning approach: OLERA
Summary and Future Work
Introduction
Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.
The output template of the IE task
  Several fields (slots)
  Several instances of a field
Problem Definition
Plain Text Information Extraction
  The task of locating specific pieces of data from a natural language document
  To obtain useful structured information from unstructured text
  DARPA’s MUC program
Semi-structured IE
  Different from traditional IE
  The necessity of extracting and integrating data from multiple Web-based sources, e.g. generating 1000 wrappers/extractors
Types of IE from MUC
Named Entity recognition (NE): finds and classifies names, places, etc.
Coreference Resolution (CO): identifies identity relations between entities in texts.
Template Element construction (TE): adds descriptive information to NE results.
Scenario Template production (ST): fits TE results into specified event scenarios.
IE from Semi-structured Documents
Output Template: k-tuple
  Multiple instances of a field
  Missing data
  Several permutations of attributes
Special-designed Programming Language
Programming by users
  General programming language
  Special-designed programming language: W4F, Xwrap, Lixto
How?
  Observing common delimiters as landmarks
  Writing extraction rules
Supervised Learning Approach
Wrapper induction
  WIEN, IJCAI-97: Kushmerick, Weld, Doorenbos
  SoftMealy, IJCAI-99: Hsu
  STALKER, AA-99: Muslea, Minton, Knoblock
Key component of IE systems
  Interface for labeling
  Learning algorithm
  Extraction rules: Rule format
  Extractor
Example
Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
Labeling
Start and end positions for
  Scope
  Record
  Attribute
Example
Learning Algorithm
Token hierarchy for generalization
Background knowledge
Learning Algorithms
Rule expression
  Delimiter-based
  Consecutive landmark
  Sequential landmark
  Context rule
Extractor Architecture
WIEN
  Single-pass
  Single-loop, no branch
STALKER
  Multi-pass
  Bi-directional scanning
Softmealy
  Single-pass or multi-pass
  Finite-state transducer
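To make the single-pass, single-loop idea concrete, here is a minimal sketch of a delimiter-based extractor. It is not taken from WIEN itself: the function name `lr_extract`, the rule format (one left/right delimiter pair per attribute), and the sample page are assumptions chosen to match the country/code example used in these slides.

```python
def lr_extract(page, delimiters):
    """Scan the page once, left to right, using one (left, right)
    delimiter pair per attribute. Hypothetical helper for illustration."""
    records = []
    if not delimiters:
        return records
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records          # unterminated attribute: stop
            record.append(page[start:end])
            pos = end + len(right)      # never move backwards (single pass)
        records.append(tuple(record))


page = "<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"
rules = [("<B>", "</B>"), ("<I>", "</I>")]
print(lr_extract(page, rules))
```

The loop has no branching between attributes, so it cannot handle missing attributes or permuted attribute order; that limitation is what the multi-pass and finite-state designs listed above address.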
Pattern-discovery based IE (Unsupervised Learning Approach)
Motivation
  Display of multiple records often forms a repeated pattern
  The occurrences of the pattern are spaced regularly and adjacently
Now the problem becomes ...
  Find regular and adjacent repeats in a string
IEPAD Architecture
[Diagram: components Pattern Discoverer, Pattern Viewer, Extractor; labels Html Page, Patterns, Extraction Rule, Users, Html Pages, Extraction Results]
The Pattern Generator
  Translator
  PAT tree construction
  Pattern validator
  Rule Composer
[Diagram: HTML Page -> Token Translator -> A Token String -> PAT Tree Constructor -> PAT Trees and Maximal Repeats -> Validator -> Advanced Patterns -> Rule Composer -> Extraction Rules]
1. Web Page Translation
Encoding of HTML source
  Rule 1: Each tag is encoded as a token
  Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
HTML Example:
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
Encoded token string:
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
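The two encoding rules can be sketched in a few lines of Python. This is an illustration only, not IEPAD's translator: the name `encode_html`, the regular expression, and the choice to ignore whitespace-only text between tags are assumptions.

```python
import re

# A tag is "<...>"; anything else up to the next "<" is text.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")


def encode_html(html):
    """Rule 1: each tag becomes its own token.
    Rule 2: text between two tags becomes the TEXT token "_".
    Whitespace-only text is skipped (an assumption of this sketch)."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            tokens.append(piece)
        elif piece.strip():
            tokens.append("_")
    return tokens


html = "<B>Congo</B><I>242</I><BR>\n<B>Egypt</B><I>20</I><BR>"
tokens = encode_html(html)
print("".join(f"T({t})" for t in tokens))
```

Both records map to the same seven-token sequence, which is what makes the repeated pattern visible to the later steps.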
2. PAT Tree Construction
PAT tree: binary suffix tree
  A Patricia tree constructed over all possible suffix strings of a text
Example
Token codes:
  T(<B>)   000
  T(</B>)  001
  T(<I>)   010
  T(</I>)  011
  T(<BR>)  100
  T(_)     110
Token string:
  T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
Binary encoding:
  000110001010110011100000110001010110011100
Indexing position:
  suffix 1   000110001010110011100000110001010110011100$
  suffix 2   110001010110011100000110001010110011100$
  suffix 3   001010110011100000110001010110011100$
  suffix 4   010110011100000110001010110011100$
  suffix 5   110011100000110001010110011100$
  suffix 6   011100000110001010110011100$
  suffix 7   100000110001010110011100$
  suffix 8   000110001010110011100$
  suffix 9   110001010110011100$
  suffix 10  001010110011100$
  suffix 11  010110011100$
  suffix 12  110011100$
  suffix 13  011100$
  suffix 14  100$
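The suffix list above can be reproduced with a short sketch. It only generates the binary suffixes that a PAT tree would index; it does not build the tree. The names `CODES` and `binary_suffixes` are hypothetical, and the code table is copied from the example.

```python
# 3-bit codes from the example above.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
tokens = record * 2   # two encoded records, as in the example


def binary_suffixes(tokens, codes):
    """One suffix per token position, terminated by "$"."""
    bits = [codes[t] for t in tokens]
    return ["".join(bits[i:]) + "$" for i in range(len(bits))]


suffixes = binary_suffixes(tokens, CODES)
for number, suffix in enumerate(suffixes, start=1):
    print(f"suffix {number:2d}  {suffix}")
```

Suffixes start only at token boundaries, so a 14-token string yields 14 suffixes rather than one per bit.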
The Constructed PAT Tree
[Figure 3: The PAT tree for the Congo Code]
Definition of Maximal Repeats
Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \ldots, p_k$.
  $\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$
  $\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$
  $\alpha$ is a maximal repeat if it is both left maximal and right maximal
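The definition can be checked directly with a brute-force sketch. This is deliberately not the PAT-tree method used by IEPAD; it enumerates every substring, so it is only suitable for small examples. The name `maximal_repeats` and the treatment of the string boundary as a distinct symbol are assumptions.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return {repeat: [0-based positions]} for every maximal repeat in s.
    Works on strings or token lists; repeats are reported as tuples."""
    n = len(s)
    found = {}
    for length in range(1, n):
        occurrences = defaultdict(list)
        for i in range(n - length + 1):
            occurrences[tuple(s[i:i + length])].append(i)
        for sub, positions in occurrences.items():
            if len(positions) < 2:
                continue                      # not a repeat
            # Symbols just before / after each occurrence
            # (None stands for the boundary of the string).
            left = {s[p - 1] if p > 0 else None for p in positions}
            right = {s[p + length] if p + length < n else None
                     for p in positions}
            # Left/right maximal: at least two occurrences differ there.
            if len(left) > 1 and len(right) > 1:
                found[sub] = positions
    return found


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[tuple("adc")])
```

In this string "adc" is a maximal repeat, while "ad" is not, because every occurrence of "ad" is followed by the same symbol "c".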
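3. Pattern Validator
Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \ldots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence.
Characteristics of a Pattern
Regularity: variance coefficient
$$V(\alpha) = \frac{\mathrm{StdDev}\{\, p_{i+1} - p_i \mid 1 \le i < k \,\}}{\mathrm{Mean}\{\, p_{i+1} - p_i \mid 1 \le i < k \,\}}$$
Adjacency: density
$$D(\alpha) = \frac{k \cdot |\alpha|}{p_k - p_1 + |\alpha|}$$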
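As a worked example, the two measures can be computed from a list of occurrence positions. The sketch assumes regularity is the standard deviation of the gaps between adjacent occurrences divided by their mean, and density is k·|α| / (p_k − p_1 + |α|). Using the population standard deviation (`pstdev`) is an assumption; the function names are hypothetical.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Coefficient of variation of the gaps between adjacent occurrences.
    0 means perfectly regular spacing. Needs at least two positions."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, length):
    """Fraction of the span from the first to the last occurrence that is
    covered by the pattern. 1 means the occurrences are adjacent."""
    k = len(positions)
    return k * length / (positions[-1] - positions[0] + length)


# The 7-token record pattern from the Congo example occurs at tokens 0 and 7.
print(regularity([0, 7]), density([0, 7], 7))

# "adc" in "adcwbdadcxbadcxbdadcb" occurs at 0, 6, 11, 17.
print(regularity([0, 6, 11, 17]), density([0, 6, 11, 17], 3))
```

The second pattern has a density below 1 because other tokens sit between its occurrences, so matching it alone would miss part of each record.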
4. Rule Composer
Problem
  Patterns with density less than 1 can extract only part of the information
Solution
  Align k-1 substrings among the k occurrences
  Multiple string alignment is a natural generalization of the alignment of two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
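The two-string case mentioned here can be sketched as a standard dynamic program. The unit costs for mismatch and gap are an assumption of this sketch (the slides do not specify a scoring scheme), and `align` is a hypothetical name.

```python
def align(a, b, gap="-"):
    """Global alignment of two sequences in O(n*m) time.
    Returns (aligned_a, aligned_b, cost) with unit mismatch/gap cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match/mismatch
                cost[i - 1][j] + 1,                           # gap in b
                cost[i][j - 1] + 1,                           # gap in a
            )
    # Trace back from the bottom-right corner to recover one alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1])
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)
            row_b.append(b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], cost[n][m]


top, bottom, total = align("adcwbd", "adcxb")
print("".join(top))
print("".join(bottom))
print(total)
```

Aligning more than two strings is usually approximated by repeated pairwise alignment, since the exact multi-string dynamic program grows exponentially with the number of strings.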
Multiple String Alignment
Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”
If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
The extraction pattern can be generalized as “adc[w|x]b[d|-]”
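Turning an alignment into a generalized pattern is a column-by-column operation, sketched below. The function name `compose_pattern` is hypothetical; the input rows are the three aligned strings from the example, with "-" marking a gap.

```python
def compose_pattern(rows):
    """Collapse each alignment column into a single symbol if all rows
    agree, or into a bracketed list of alternatives otherwise."""
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))   # distinct, first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


rows = ["adcwbd",
        "adcxb-",
        "adcxbd"]
print(compose_pattern(rows))   # adc[w|x]b[d|-]
```

A "-" among the alternatives marks an optional token, which is how the rule tolerates missing data.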
Pattern Viewer / User Interface
  Java-application based GUI
  Web based GUI: http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
The Extractor
Matching the pattern against the encoding token string
  Knuth-Morris-Pratt’s algorithm
  Boyer-Moore’s algorithm
Alternatives in a rule
  Matching the longest pattern
What are extracted?
  The whole record
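For a fixed pattern without alternatives, the Knuth-Morris-Pratt matching step can be sketched over token sequences as follows. This is a textbook KMP, not IEPAD's extractor; handling alternatives such as [w|x] would need an extension that is not shown. The function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(tokens, pattern):
    """Return the start index of every occurrence of pattern in tokens."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, token in enumerate(tokens):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]       # keep going; allows overlapping matches
    return matches


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
page_tokens = record * 2
print(kmp_find_all(page_tokens, record))
```

Each match position marks the start of one record in the encoded page; the record text is then recovered from the corresponding span of the original HTML.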
Problem
  Deals only with multi-record pages
  Many patterns are composed due to
    Multiple string alignment
    Unknown start position
  Alignment error due to ignored text strings
Semi-supervised approach: OLERA
A universal method for wrapping both single-record pages and multi-record pages
OnLine Extraction Rule Analysis
  Drill-down/Roll-up operations
  Encoding hierarchy
(What would you do?)
OLERA’s Framework
[Diagram labels: doc, Block Enclosing, Attribute Designation, Drill down/Roll up, Extraction Patterns, Page Encoder, Approximate Matching, Multiple String Alignment]
Three simple operations
  Block enclosing
  Drill-down/Roll-up
  Attribute Designation
Block Enclosing
Multiple single-record pages
Enclosing (Cont.)
Different from labeling
  The number of enclosing operations is far smaller than the number of training pages
Encoding
Approximate Matching
  Extension of global string alignment
String Alignment
  Enhanced matching function
Attribute Designation
Drill-down/Roll-up
Drill-down
  Encoding
  Multiple String Alignment
  Each column is given an identifier: 8_0, 8_1, 8_2 for a drill-down operation on column 8
Roll-up
  Several columns can be concatenated together
  The corresponding identifiers are recorded
Extractors
Grammar
  Signature representation for the alignment result
  Each drill-down and roll-up operation
  The columns to be extracted for each attribute
Matching the signature pattern in testing pages
  Variation of approximate matching
  Insertion and mismatch are not allowed
  Deletion is allowed only if indicated in the signature pattern
Conclusion
The input of training pages
  Annotated or unlabeled
The format of extraction rules
  Delimiter-based, content-based, contextual rule
The background knowledge
  Implicitly or explicitly
Problems
For different problems, a different encoding scheme is needed
Designing an unsupervised approach for both single-record and multi-record documents
References
Semi-structured IE
C.H. Chang and S.C. Kuo. OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents, submitted for publication.
C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.