04/18/23 1
Table Structure Understanding
by Sibling Page Comparison
Cui Tao
Data Extraction Group
Department of Computer Science
Brigham Young University
Supported by NSF
04/18/23 2
Table Structure Understanding
Motivation
  Many documents contain tables
  Data extraction
  Data integration
  Ontology evolution
Solution
  Locate tables
  Locate table labels
  Locate table values
  Find label/value associations
04/18/23 3
Table Structure Understanding
04/18/23 4
Table Structure Understanding
(Gene Model, 1) = F18H3.5a
(Gene Model, 2) = F18H3.5b
...
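The label/value associations above can be held in a small lookup structure. The sketch below is only an illustration: the `associate` helper and the `(label, record index)` key layout are assumptions, not something the slides define.

```python
from typing import Dict, List, Tuple

# Hypothetical container: maps (label, record index) -> value.
Associations = Dict[Tuple[str, int], str]


def associate(label: str, values: List[str]) -> Associations:
    """Pair one column label with each value under it, numbering records from 1."""
    return {(label, index): value for index, value in enumerate(values, start=1)}


# The two associations shown on the slide.
gene_models = associate("Gene Model", ["F18H3.5a", "F18H3.5b"])
print(gene_models[("Gene Model", 2)])
```

Keying on the pair keeps each value addressable by the label it belongs to and the record it came from.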
04/18/23 5
04/18/23 6
04/18/23 7
Sibling Pages
Output pages generated in response to user queries
Results placed in a predefined page structure
Same web site ~ same structure
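One plausible reading of the sibling-page idea is that text shared by two sibling pages is template text (likely labels), while text that differs is query-specific (likely values). The sketch below assumes this reading and also assumes the cells of both pages are already aligned position by position. The function name and the car-ad cell texts are made up for illustration.

```python
from typing import List, Tuple


def split_labels_values(
    page_a: List[str], page_b: List[str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Compare aligned cell texts from two sibling pages.

    Cells with identical text are collected as label candidates;
    cells whose text differs are collected as value candidates.
    """
    labels: List[str] = []
    values: List[Tuple[str, str]] = []
    for cell_a, cell_b in zip(page_a, page_b):
        if cell_a == cell_b:
            labels.append(cell_a)
        else:
            values.append((cell_a, cell_b))
    return labels, values


# Made-up cell texts from two hypothetical car-ad pages on the same site.
labels, values = split_labels_values(
    ["Make", "Honda", "Year", "2001"],
    ["Make", "Ford", "Year", "1998"],
)
```

A positional comparison like this breaks down as soon as the pages differ in layout, which is where tree matching becomes useful.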
04/18/23 8
Problems
Data rich area: discard the irrelevant parts
Find table correspondences
Find mappings between table cells
Find structure patterns
04/18/23 9
HTML Table Components
04/18/23 10
Data Rich Area
04/18/23 11
Table Unnesting
04/18/23 12
DOM Tree
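As a rough illustration of turning table markup into a tree, the sketch below uses Python's standard `html.parser`. The `Node` and `TreeBuilder` classes are hypothetical helpers, not part of the presented system. The sketch assumes every element has an explicit closing tag, so unclosed void elements such as a bare `<br>` would be nested incorrectly.

```python
from html.parser import HTMLParser


class Node:
    """Minimal tree node: tag name, accumulated text, children, parent link."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.text = ""
        self.children = []
        self.parent = parent


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML that uses explicit closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


builder = TreeBuilder()
builder.feed(
    "<table><tr><th>Gene Model</th></tr><tr><td>F18H3.5a</td></tr></table>"
)
table = builder.root.children[0]
```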
04/18/23 13
Simple Tree Matching
Simple Tree Matching (STM) [Yang91]
  Maximum matching pairs of nodes
  O(mn)
[Figure annotations: label, value]
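A minimal sketch of the commonly described STM recurrence, for illustration only. Trees are written as `(tag, children)` tuples, a representation chosen here rather than taken from the slides. Two roots match only if their tags are equal, and their children are then aligned in order by dynamic programming.

```python
def simple_tree_matching(a, b):
    """Return the number of node pairs in a maximum order-preserving matching.

    Each tree is a (tag, children) tuple, where children is a list of trees.
    """
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0  # Different roots: nothing below them may be matched.

    m, n = len(kids_a), len(kids_b)
    # best[i][j] = best score using the first i children of a and first j of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 counts the matched roots themselves.


td = ("td", [])
page_a = ("table", [("tr", [td, td]), ("tr", [td])])
page_b = ("table", [("tr", [td]), ("tr", [td, td])])
print(simple_tree_matching(page_a, page_b))  # 5
```

In the example, the roots match (1), each pair of rows matches in order (2), and one cell per row pair matches (2), giving 5.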
04/18/23 14
Table Structure Pattern
04/18/23 15
Table Structure Pattern
04/18/23 16
Experimental Results
Initial test
  General pattern extraction
    Molecular biology: 95.6%
    Car ads: 100%
Dynamic adjustment
  Unseen structure
  Structure variations