TARTAR Information Extraction
Transforming Arbitrary Tables into F-Logic Frames with TARTAR
Aleksander Pivk, York Sure, Philipp Cimiano, Matjaz Gams, Vladislav Rajkovic, Rudi Studer
Presented By Stephen Lynn
Information Extraction
Free-form Text
Linguistic/NLP approaches
Tabular Structures
Table comprehension task
html, excel, pdf, text, etc.
Semantic interpretation task
More effort???
TARTAR Architecture
Semantic Representation
Frame Logic (F-Logic)
Model-theoretic semantics
Complete resolution-based proof theory
Expressive power of logic
Availability of efficient reasoning tools
F-Logic Frame
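As a rough illustration of what such a frame looks like in text form, the sketch below renders a class signature in F-Logic-style notation. The class name `Tour`, its attributes, and the `render_frame` helper are hypothetical and not taken from the slides. The exact surface syntax also varies between F-Logic dialects, so treat the output format as an assumption.

```python
def render_frame(class_name, attributes, methods):
    """Render a frame as an F-Logic-style signature string.

    attributes: dict mapping attribute name -> range type
    methods:    dict mapping method name -> (list of parameter types, range type)
    All names passed in are illustrative; nothing here comes from TARTAR itself.
    """
    parts = [f"{name} => {rng}" for name, rng in attributes.items()]
    for name, (params, rng) in methods.items():
        parts.append(f"{name}({', '.join(params)}) => {rng}")
    return f"{class_name}[{'; '.join(parts)}]"


# Hypothetical frame for a table of tour prices.
frame = render_frame(
    "Tour",
    {"Code": "ALPHANUMERIC", "DateValid": "DATE"},
    {"Price": (["PersonType", "RoomType"], "NUMBER")},
)
print(frame)
```

Plain attributes become `name => type` entries, while a value indexed by several table dimensions becomes a method with one parameter per dimension.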
Table Comprehension
Dimensions: a grouping of cells representing similar entities
Table Comprehension
Stub: dimension with headers used to index elements in body
Table Comprehension
Box head: column headers (often nested)
Table Comprehension
Body: data values
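The terms stub, box head, and body can be made concrete with a small sketch that slices a rectangular grid of cell strings into those three parts. It assumes the number of header rows and stub columns is already known, which is a simplification: the point of table comprehension is to discover them. The `split_table` helper and the sample data are hypothetical.

```python
def split_table(grid, header_rows=1, stub_cols=1):
    """Split a rectangular grid into (stub, box_head, body).

    header_rows and stub_cols are assumed to be given; a real system
    would have to detect them.
    """
    box_head = [row[stub_cols:] for row in grid[:header_rows]]  # column headers
    stub = [row[:stub_cols] for row in grid[header_rows:]]      # row headers
    body = [row[stub_cols:] for row in grid[header_rows:]]      # data values
    return stub, box_head, body


# Made-up example table.
grid = [
    ["",      "Single", "Double"],
    ["Adult", "100",    "150"],
    ["Child", "50",     "75"],
]
stub, box_head, body = split_table(grid)
print(stub, box_head, body)
```

Each body value is then indexed by one stub entry and one box head entry, for example `"150"` by `("Adult", "Double")`.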
Table Classes
1D, 2D, Complex
Methodology
Cleaning & Canonicalization
Clean DOM tree
CyberNeko HTML Parser
Rowspan/Colspan expansion
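Rowspan/colspan expansion can be sketched as follows. This is a minimal illustration, not the TARTAR implementation: it skips HTML parsing and assumes each cell has already been extracted as a `(text, rowspan, colspan)` tuple, and the `expand_spans` name and sample data are hypothetical. Spanning cells are copied into every grid position they cover, which yields a rectangular matrix.

```python
def expand_spans(rows):
    """Expand spanning cells into a rectangular grid of strings.

    rows: list of table rows, each a list of (text, rowspan, colspan).
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell into every position it spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Made-up table: "Tour" spans two rows, "Price" spans two columns.
rows = [
    [("Tour", 2, 1), ("Price", 1, 2)],
    [("Single", 1, 1), ("Double", 1, 1)],
    [("A1", 1, 1), ("100", 1, 1), ("150", 1, 1)],
]
expanded = expand_spans(rows)
print(expanded)
```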
Structure Detection
Token Type Hierarchy
Assign Functional Types and Probabilities
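The idea of typing cell tokens and then assigning a functional type with a probability might look like the sketch below. Everything specific here is an assumption: the four token types are a flat stand-in for the hierarchy, the regular expressions are simplistic, the probabilities are made up, and the labels `"data"` and `"header"` are simplified names for the two functional roles.

```python
import re

# Hypothetical token types, checked from most to least specific.
TOKEN_PATTERNS = [
    ("DATE", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("CURRENCY", re.compile(r"^[$€£]\s?\d+(\.\d+)?$")),
    ("NUMERIC", re.compile(r"^\d+(\.\d+)?$")),
    ("ALPHABETIC", re.compile(r"^[A-Za-z ]+$")),
]

# Made-up probabilities that a cell of a given type holds data, not a header.
DATA_PROBABILITY = {
    "DATE": 0.8,
    "CURRENCY": 0.9,
    "NUMERIC": 0.9,
    "ALPHABETIC": 0.3,
    "OTHER": 0.5,
}


def token_type(text):
    """Return the first token type whose pattern matches the cell text."""
    text = text.strip()
    for name, pattern in TOKEN_PATTERNS:
        if pattern.match(text):
            return name
    return "OTHER"


def functional_type(text):
    """Return (label, probability that the cell is a data cell)."""
    p = DATA_PROBABILITY[token_type(text)]
    return ("data" if p >= 0.5 else "header", p)


print(token_type("2004-05-01"), functional_type("Room type"))
```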
Structure Detection
Detect Logical Table Orientation
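One simple way to approximate orientation detection is to check whether cells are more alike when read down columns or across rows. The sketch below uses that heuristic with a two-value coarse type. It is an assumed stand-in, not the measure used by TARTAR, and all function names are hypothetical. Here "vertical" means records are read top to bottom, with attribute names in the first row.

```python
def coarse_type(text):
    """Very coarse cell type: 'num' for plain decimal numbers, else 'text'."""
    return "num" if text.replace(".", "", 1).isdigit() else "text"


def homogeneity(lines):
    """Fraction of adjacent cell pairs in each line that share a coarse type."""
    same = total = 0
    for line in lines:
        for a, b in zip(line, line[1:]):
            same += coarse_type(a) == coarse_type(b)
            total += 1
    return same / total if total else 0.0


def detect_orientation(grid):
    """Guess the orientation of a rectangular grid of cell strings."""
    columns = [list(col) for col in zip(*grid)]
    # If columns are at least as homogeneous as rows, read the table downward.
    return "vertical" if homogeneity(columns) >= homogeneity(grid) else "horizontal"


# Made-up example: attribute names in the first row.
grid = [
    ["Name", "Age", "City"],
    ["Ann",  "31",  "Oslo"],
    ["Bob",  "45",  "Rome"],
]
print(detect_orientation(grid))
```

Transposing the same grid flips the result, since the homogeneous direction then runs along the rows.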
Structure Detection
Discover and Level Regions
Logical Units
FTM Building
Functional Table Model (FTM)
Arrange regions into a tree
Leaf nodes are data
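A tree whose leaf nodes are data can be sketched with a small node class. The `Node` class, the `leaf_paths` helper, and the sample labels are hypothetical. The slides do not say how TARTAR builds or stores the tree, so this only illustrates the shape of the structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def leaf_paths(node, prefix=()):
    """Return one tuple of labels per leaf, from the root down to the data value."""
    path = prefix + (node.label,)
    if node.is_leaf():
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# Made-up model: header regions are inner nodes, data values are leaves.
tree = Node("Price", [
    Node("Adult", [Node("Single", [Node("100")]), Node("Double", [Node("150")])]),
    Node("Child", [Node("Single", [Node("50")])]),
])
paths = leaf_paths(tree)
print(paths)
```

Each root-to-leaf path lists the headers that index one data value, which is the information a later mapping to a frame would need.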
Semantic Enriching of FTM
Labeling
WordNet and GoogleSets
Map FTM to a frame
Evaluation
Crawl, extract, filter web tables
135 tables
85.4% success rate
Mostly problems with complex tables
Compare auto-generated frames with human-generated frames
14 people transformed 3 tables each
21 total tables (each done twice)
Syntactic/Semantic correctness (Strict and Soft)
Results
Inter-annotator agreement
System-annotator agreement
Benefits
Fully automated knowledge formalization
Arbitrary tables
Independent of domain knowledge
Independent of document type
Explicit semantics of generated frames
Query answering over heterogeneous tables