Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning

Chia-Hui Chang

Dept. of Computer Science and Information Engineering, National Central University

Outline

Problem Definition of Information Extraction
  Semi-structured IE
  Plain Text Information Extraction
Methods
  Special-designed programming language: W4F, XWrap, Lixto
  Supervised learning approach: WIEN, SoftMealy, STALKER
  Unsupervised learning approach: IEPAD
  Semi-supervised learning approach: OLERA
Summary and Future Work

Introduction

Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.

The output template of the IE task
  Several fields (slots)
  Several instances of a field

Problem Definition

Plain Text Information Extraction
  The task of locating specific pieces of data in a natural language document
  To obtain useful structured information from unstructured text
  DARPA’s MUC program
Semi-structured IE
  Different from traditional IE
  The necessity of extracting and integrating data from multiple Web-based sources
  e.g. generating 1000 wrappers/extractors

Types of IE from MUC

Named Entity recognition (NE)
  Finds and classifies names, places, etc.
Coreference Resolution (CO)
  Identifies identity relations between entities in texts.
Template Element construction (TE)
  Adds descriptive information to NE results.
Scenario Template production (ST)
  Fits TE results into specified event scenarios.

IE from Semi-structured Documents

Output Template: k-tuple
  Multiple instances of a field
  Missing data
  Several permutations of attributes

Special-designed Programming Language

Programming by users
  General programming language
  Special-designed programming language: W4F, XWrap, Lixto
How?
  Observing common delimiters as landmarks
  Writing extraction rules

Supervised Learning Approach

Wrapper induction
  WIEN, IJCAI-97: Kushmerick, Weld, Doorenbos
  SoftMealy, IJCAI-99: Hsu
  STALKER, AA-99: Muslea, Minton, Knoblock
Key components of IE systems
  Interface for labeling
  Learning algorithm
  Extraction rules: rule format
  Extractor

Example

Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
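To make the labels above concrete, here is a minimal sketch of how a delimiter-based rule of the kind a wrapper-induction system learns could be executed. The page markup and the delimiter pairs are assumptions chosen for illustration (the slide shows only the labels), and `lr_extract` is a hypothetical helper, not part of WIEN, SoftMealy, or STALKER.

```python
def lr_extract(page, delimiters):
    """Extract tuples with one (left, right) delimiter pair per attribute.

    Scans the page once, left to right, in a single loop with no branches
    in the rule itself. Stops when a delimiter can no longer be found.
    """
    if not delimiters:
        return []
    records, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            row.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(row))


# Assumed markup for the country-code page; only the labels come from the slide.
page = (
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
)
rule = [("<B>", "</B>"), ("<I>", "</I>")]  # hypothetical learned delimiters
records = lr_extract(page, rule)
print(records)
```

Under these assumptions the extracted tuples coincide with the labeled set, which is the condition a learning algorithm would check when searching for delimiters.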

Labeling

Start and end positions for
  Scope
  Record
  Attribute

Example

Learning Algorithm

Token hierarchy for generalization
Background knowledge

Learning Algorithms

Rule expression
  Delimiter-based
    Consecutive landmark
    Sequential landmark
  Context rule

Extractor Architecture

WIEN
  Single-pass
  Single-loop, no branch
STALKER
  Multi-pass
  Bi-directional scanning
SoftMealy
  Single-pass or multi-pass
  Finite-state transducer

Pattern-discovery based IE (Unsupervised Learning Approach)

Motivation
  Display of multiple records often forms a repeated pattern
  The occurrences of the pattern are spaced regularly and adjacently
Now the problem becomes ...
  Find regular and adjacent repeats in a string

IEPAD Architecture

[Figure: IEPAD architecture. Components: Pattern Discoverer, Pattern Viewer, Extractor. Other labels in the diagram: HTML pages, patterns, users, extraction rule, extraction results.]

The Pattern Generator

Translator
PAT tree construction
Pattern validator
Rule Composer

[Figure: HTML page, Token Translator, a token string, PAT Tree Constructor, PAT trees and maximal repeats, Validator, advanced patterns, Rule Composer, extraction rules.]

1. Web Page Translation

Encoding of HTML source
  Rule 1: Each tag is encoded as a token
  Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)

HTML Example:

Congo242 

Egypt20 

Encoded token string:

T()T(_)T()T()T(_)T()T( )

T()T(_)T()T()T(_)T()T( )
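The two translation rules can be sketched as a small tokenizer. The markup of the two records below (`<B>`, `<I>`, `<BR>`) is an assumption made for illustration, as is the choice to skip whitespace-only text between tags; `encode` is a hypothetical helper name.

```python
import re


def encode(html):
    """Translate an HTML string into a list of tokens.

    Rule 1: each tag becomes its own token.
    Rule 2: any text between two tags becomes the TEXT token "_".
    Whitespace-only text is skipped (an assumption of this sketch).
    """
    tokens = []
    for tag, text in re.findall(r"(<[^>]*>)|([^<]+)", html):
        if tag:
            tokens.append(tag.upper())
        elif text.strip():
            tokens.append("_")
    return tokens


# Assumed markup for the two records shown in the HTML example.
html = "<B>Congo</B><I>242</I><BR>\n<B>Egypt</B><I>20</I><BR>"
tokens = encode(html)
print("".join(f"T({t})" for t in tokens))
```

With this markup each record yields seven tokens, two of which are TEXT tokens, so the two records produce the same token sequence twice in a row.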

2. PAT Tree Construction

PAT tree: binary suffix tree
  A Patricia tree constructed over all possible suffix strings of a text

Example

Token codes:
T() 000
T() 001
T() 010
T() 011
T( ) 100
T(_) 110

Encoded token string and its binary form:
T()T(_)T()T()T(_)T()T( )T()T(_)T()T()T(_)T()T( )
000110001010110011100000110001010110011100

Indexing positions:
suffix 1  000110001010110011100000110001010110011100$
suffix 2  110001010110011100000110001010110011100$
suffix 3  001010110011100000110001010110011100$
suffix 4  010110011100000110001010110011100$
suffix 5  110011100000110001010110011100$
suffix 6  011100000110001010110011100$
suffix 7  100000110001010110011100$
suffix 8  000110001010110011100$
suffix 9  110001010110011100$
suffix 10 001010110011100$
suffix 11 010110011100$
suffix 12 110011100$
suffix 13 011100$
suffix 14 100$
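The suffix list can be reproduced from the 3-bit token codes alone. The sketch below takes the code sequence from the binary string in the example, builds one suffix per token boundary, and measures the prefix shared by two suffixes, which is the information a PAT tree stores at its internal nodes. It uses plain lists rather than an actual Patricia tree, and the function names are hypothetical.

```python
def token_suffixes(codes, end="$"):
    """One suffix per token boundary, each terminated by the end marker."""
    return ["".join(codes[i:]) + end for i in range(len(codes))]


def common_prefix_length(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Seven 3-bit codes per record, two records, as in the binary string above.
codes = ["000", "110", "001", "010", "110", "011", "100"] * 2
suffixes = token_suffixes(codes)

for number, suffix in enumerate(suffixes, start=1):
    print(f"suffix {number:<2} {suffix}")

# Suffix 1 and suffix 8 share 21 bits, i.e. one full record of 7 tokens.
shared = common_prefix_length(suffixes[0], suffixes[7])
```

A long shared prefix between two suffixes signals a repeated token sequence, which is why repeats can be read off the tree once it is built.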

The Constructed PAT Tree

Figure 3. The PAT tree for the Congo Code

Definition of Maximal Repeats

Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \ldots, p_k$.
  $\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$
  $\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$
  $\alpha$ is a maximal repeat if it is both left maximal and right maximal
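The definition can be checked directly with a brute-force enumeration. This is a naive sketch for small inputs, not the PAT-tree traversal used by IEPAD; it treats the positions before the start and after the end of the string as a distinct boundary symbol, which is an assumption of this sketch. Each letter of the sample string stands for one token.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=1):
    """Return {substring: sorted occurrence positions} for all maximal repeats."""
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < 2:
            continue  # not a repeat
        # Symbols immediately to the left and right of every occurrence;
        # None marks the string boundary.
        left = {s[p - 1] if p > 0 else None for p in positions}
        right = {s[p + len(sub)] if p + len(sub) < n else None for p in positions}
        # A pair of occurrences with differing neighbors exists exactly when
        # the neighbor set has more than one element.
        if len(left) > 1 and len(right) > 1:
            result[sub] = positions
    return result


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats["adc"])
```

In this string "adc" is a maximal repeat, while "dc" is not left maximal (it is always preceded by "a") and "ad" is not right maximal (it is always followed by "c").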

3. Pattern Validator

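Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \ldots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence.

Characteristics of a Pattern

Regularity: Variance coefficient

$$V(\alpha) = \frac{\mathrm{StdDev}(\{p_{i+1} - p_i \mid 1 \le i < k\})}{\mathrm{Mean}(\{p_{i+1} - p_i \mid 1 \le i < k\})}$$

Adjacency: Density

$$D(\alpha) = \frac{(k - 1) \cdot |\alpha|}{p_k - p_1}$$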
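Both measures are easy to compute from the occurrence positions of a repeat. The sketch below assumes the variance coefficient is the standard deviation of the gaps between consecutive occurrences divided by their mean, and the density is (k-1) times the pattern length divided by the span from the first to the last occurrence. Using the population standard deviation is an assumption; the slide does not say which estimator is intended.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Variance coefficient of the gaps between consecutive occurrences."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_length):
    """Fraction of the span between first and last occurrence covered by the pattern."""
    k = len(positions)
    return (k - 1) * pattern_length / (positions[-1] - positions[0])


# A 7-token record pattern occurring at token positions 1 and 8:
# perfectly regular and fully adjacent.
print(regularity([1, 8]), density([1, 8], 7))

# The 3-token pattern "adc" at positions 0, 6, 11, 17 in "adcwbdadcxbadcxbdadcb":
# fairly regular, but it covers only part of each record.
print(regularity([0, 6, 11, 17]), density([0, 6, 11, 17], 3))
```

A low variance coefficient together with a density close to 1 indicates a pattern whose occurrences tile the page, which is what a record pattern should look like.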

4. Rule Composer

Problem
  Patterns with density less than 1 can extract only part of the information
Solution
  Align k-1 substrings among the k occurrences
  Multiple string alignment is a natural generalization of the alignment of two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
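For reference, the two-string case can be sketched with the textbook dynamic program. The unit costs for mismatch and gap are assumptions, and this is plain global alignment by edit distance rather than the multiple-alignment procedure IEPAD itself uses.

```python
def align(a, b, gap="-"):
    """Globally align a and b; return (aligned_a, aligned_b, cost).

    cost[i][j] is the minimum number of mismatches and gaps needed to
    align a[:i] with b[:j]. Filling the table takes O(n*m) time.
    """
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1])
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), cost[n][m]


aligned_a, aligned_b, distance = align("adcwbd", "adcxb")
print(aligned_a)
print(aligned_b)
```

Aligning more than two strings optimally is considerably more expensive, so practical systems typically align the strings against a chosen center or merge them progressively.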

Multiple String Alignment

Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”

If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:

a d c w b d

a d c x b -

a d c x b d

The extraction pattern can be generalized as “adc[w|x]b[d|-]”
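The generalization step can be sketched in a few lines: split the token string at the occurrences of the discovered pattern, then collapse each column of the alignment into either a single token or a set of alternatives. The alignment rows are taken as given here, exactly as in the example above; the function names are hypothetical.

```python
def split_at_occurrences(s, pattern):
    """Return the k-1 substrings that start at consecutive occurrences of pattern."""
    starts = []
    i = s.find(pattern)
    while i != -1:
        starts.append(i)
        i = s.find(pattern, i + 1)
    return [s[a:b] for a, b in zip(starts, starts[1:])]


def generalize(rows):
    """Collapse aligned rows of equal length into one extraction pattern."""
    out = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))  # distinct symbols, first-seen order
        if len(symbols) == 1:
            out.append(symbols[0])
        else:
            out.append("[" + "|".join(symbols) + "]")
    return "".join(out)


segments = split_at_occurrences("adcwbdadcxbadcxbdadcb", "adc")
print(segments)

rows = ["adcwbd", "adcxb-", "adcxbd"]  # the alignment shown above
pattern = generalize(rows)
print(pattern)
```

A column that contains the gap symbol "-" marks a token that may be absent from a record, which is how the rule accommodates missing data.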

Pattern Viewer / User Interface

Java-application based GUI
Web based GUI: http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

The Extractor

Matching the pattern against the encoded token string
  Knuth-Morris-Pratt’s algorithm
  Boyer-Moore’s algorithm
Alternatives in a rule
  Matching the longest pattern
What is extracted?
  The whole record
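As an illustration of the matching step, the sketch below compiles a rule written as literal tokens plus bracketed alternatives into a regular expression and scans a token string with it. Python's `re` engine is swapped in here for the Knuth-Morris-Pratt or Boyer-Moore matchers named above, and `rule_to_regex` is a hypothetical helper; each letter stands for one token.

```python
import re


def rule_to_regex(rule):
    """Compile a rule such as "adc[w|x]b[d|-]" into a regular expression.

    A bracket group lists alternatives; the alternative "-" means the
    token may be absent, so the whole group becomes optional.
    """
    parts = []
    for group, literal in re.findall(r"\[([^\]]*)\]|([^\[\]]+)", rule):
        if literal:
            parts.append(re.escape(literal))
        else:
            alternatives = group.split("|")
            body = "|".join(re.escape(a) for a in alternatives if a != "-")
            parts.append(f"(?:{body})" + ("?" if "-" in alternatives else ""))
    return re.compile("".join(parts))


extractor = rule_to_regex("adc[w|x]b[d|-]")
records = [m.group() for m in extractor.finditer("adcwbdadcxbadcxbdadcb")]
print(records)
```

The optional group is greedy, so when both the shorter and the longer alternative fit, the longer one is taken, and each match covers a whole record. The trailing "adcb" is not matched because it lacks the "w" or "x" token the rule requires.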

Problem
  Deals only with multi-record pages
  Many patterns are composed due to
    Multiple string alignment
    Unknown start position
  Alignment error due to ignored text strings

Semi-supervised approach: OLERA

A universal method for wrapping both single-record and multi-record pages

OnLine Extraction Rule Analysis
  Drill-down/Roll-up operations
  Encoding hierarchy

(What would you do?)

OLERA’s Framework

[Figure: OLERA's framework. Labels in the diagram: doc, Page Encoder, Approximate Matching, Block Enclosing, Attribute Designation, Drill down/Roll up, Extraction Patterns.]

Three simple operations
  Block enclosing
  Drill-down/Roll-up
  Attribute designation

Block Enclosing

Multiple single-record pages

Enclosing (Cont.)

Different from labeling
  The number of enclosing operations is far smaller than the number of training pages

Encoding

Approximate Matching

Extension of global string alignment
String Alignment
Enhanced matching function

Attribute Designation

Drill-down/Roll-up

Drill-down
  Encoding
  Multiple String Alignment
  Each column is given an identifier: 8_0, 8_1, 8_2 for a drill-down operation on column 8
Roll-up
  Several columns can be concatenated together
  The corresponding identifiers are recorded

Extractors

Grammar
  Signature representation for alignment result
  Each drill-down and roll-up operation
  The columns to be extracted for each attribute
Matching signature pattern in testing pages
  Variation of approximate matching
  Insertion and mismatch are not allowed
  Deletion is allowed only if indicated in the signature pattern
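The restricted matching described above can be sketched as follows. The signature representation used here, a list of (token, optional) pairs where `optional` marks a column that the signature allows to be deleted, is an assumption made for illustration and not OLERA's actual grammar.

```python
def matches(signature, tokens):
    """Check whether tokens fit the signature.

    No insertion and no mismatch is tolerated; a signature column may be
    skipped (deleted) only if it is flagged as optional.
    """
    def go(i, j):
        if i == len(signature):
            return j == len(tokens)  # leftover tokens would be insertions
        token, optional = signature[i]
        if j < len(tokens) and tokens[j] == token and go(i + 1, j + 1):
            return True
        return optional and go(i + 1, j)

    return go(0, 0)


# Hypothetical signature: five required columns and one optional trailing column.
signature = [("a", False), ("d", False), ("c", False),
             ("x", False), ("b", False), ("d", True)]

print(matches(signature, list("adcxbd")))   # all columns present
print(matches(signature, list("adcxb")))    # optional column deleted
print(matches(signature, list("adcwb")))    # mismatch in a required column
print(matches(signature, list("adcxxb")))   # extra token inserted
```

Only the first two inputs are accepted, since the third mismatches a required column and the fourth contains an inserted token.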

Conclusion

The input of training page
  Annotated or unlabeled
The format of extraction rule
  Delimiter-based, content-based, contextual rule
The background knowledge
  Implicitly or explicitly

Problems

For different problems, a different encoding scheme is needed

Designing unsupervised approach for both single-record and multi-record documents

References

Semi-structured IE

C.H. Chang and S.C. Kuo. OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents. Submitted for publication.

C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.