An Abstract Framework for Extraction Plans and Heuristics in a Data Extraction System
Alan Wessman
Brigham Young University
Based on research supported by NSF
Data Extraction
• Goal: Find useful information in documents without known formal structure
• Primary tasks:
  – Locate data of interest to the application
  – Map identified data to an ontology
Ontos
• BYU approach to data extraction
• Domain knowledge encoded as ontology
  – Defines target data structure
  – Contains data recognition rules (“data frames”)
• Heuristics map extracted values to ontology
  – Populate sets of objects and relationships
  – Infer nonlexical objects
  – Satisfy ontology constraints
• Ontos algorithm puts it all together
Current Heuristics
--- OBITUARIES ONTOLOGY ---
Marriage Date matches [20] keyword "\bmarried\b";
end;
Funeral Date matches [20] keyword "\bfuneral\b";
end;
-- Deceased Person
Deceased Person [-> object];
Deceased Person [0:1] has Marriage Date [1:*];
Deceased Person [0:1] has Funeral [1];
...
-- Funeral
Funeral [0:1] is on Funeral Date [1:*];
...
-- Generalization/Specializations
Marriage Date, Funeral Date : Date;
Lemar K. Adamson, age 84, of Tucson, died September 30, 1998. He was born June 12, 1914 in Salt Lake City, Utah. He is survived by wife, Cindy; daughters, Elvia, Gloria, Irene, Isabel, Jewel, and Jessica; sons, Paul, John, Jeffery, and Louis; brothers, Kirk, Justin, Ivan, Hubert and Grover. Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward, 1540 E. Linden. Burial in City Cemetery. Friends may call from 9:00 a.m. to 10:00 a.m. Monday, at the church. Arrangements by BRING'S MEMORIAL CHAPEL, 236 S. Scott
• Object sets are processed in order of appearance
• Accept-or-reject: an early bad choice prevents later, better choices
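The accept-or-reject problem can be illustrated with a toy comparison. This is only a sketch, not the Ontos algorithm: the scores are hypothetical numbers invented for the example, and `greedy_assign` and `best_joint_assign` are illustrative names. It shows how an object set processed first can claim a date that a later object set fits much better.

```python
from itertools import permutations

# Hypothetical scores (assumption): how well each candidate date fits each object set.
SCORES = {
    "Marriage Date": {"June 12, 1914": 0.2, "October 5, 1998": 0.3},
    "Funeral Date": {"June 12, 1914": 0.1, "October 5, 1998": 0.9},
}


def greedy_assign(scores):
    """Process object sets in listed order; each keeps its best unclaimed value."""
    taken, result = set(), {}
    for object_set, candidates in scores.items():
        free = {value: s for value, s in candidates.items() if value not in taken}
        if free:
            best = max(free, key=free.get)
            result[object_set] = best
            taken.add(best)
    return result


def best_joint_assign(scores):
    """Try every one-to-one assignment and keep the one with the highest total."""
    object_sets = list(scores)
    values = sorted({value for c in scores.values() for value in c})
    best, best_total = {}, float("-inf")
    for perm in permutations(values, len(object_sets)):
        total = sum(scores[o].get(v, 0.0) for o, v in zip(object_sets, perm))
        if total > best_total:
            best, best_total = dict(zip(object_sets, perm)), total
    return best


greedy = greedy_assign(SCORES)
joint = best_joint_assign(SCORES)
```

With these made-up scores, the in-order pass gives October 5, 1998 to Marriage Date, leaving Funeral Date with a poor match, while the joint search assigns it to Funeral Date. Exhaustive search is used here only because the example is tiny.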
Additional Problems
• Generalization/specialization
• Previously extracted data
• Complex document structure
• Overlapping value domains
• Tunable parameters and extraction algorithm
Generalization/Specialization
Previously Extracted Data
235. Foundations of Computer Science 1. (4:4:1) F, W, Sp, Su Prerequisite: CS 142. Iteration, induction, recursion, lists, trees, sets, relations, functions; mathematical analysis of algorithms and data models; object-oriented implementation of abstract data types.
236. Foundations of Computer Science 2. (4:4:1) F, W, Sp, Su Prerequisite: CS 235. Continuation of CS 235; relations, graphs, automata, grammars, propositional and predicate logic. Implementation of object-oriented algorithms.
Complex Document Structure
• Major sections with varying internal structures
• Nested lists with unstructured text
• Headings interspersed among records
• Icons, hyperlinks, etc.
Overlapping Value Domains
student at Lincoln High School, won the state
thought Lincoln himself was probably rolling over in his grave at the idea
drove all the way to Lincoln, where we ate at
When his history lesson about Abraham Lincoln finally ended, Steve left Lincoln High and drove his Lincoln Continental down to Lincoln, Nebraska.
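The "Lincoln" sentence can be used to sketch how surrounding context separates overlapping value domains. The context patterns below are hypothetical stand-ins written for this example, not actual Ontos data frames, and `label_occurrences` is an illustrative name.

```python
import re

SENTENCE = ("When his history lesson about Abraham Lincoln finally ended, "
            "Steve left Lincoln High and drove his Lincoln Continental "
            "down to Lincoln, Nebraska.")

# Hypothetical context patterns (assumption): one per value domain for "Lincoln".
CONTEXT_PATTERNS = {
    "Person": r"Abraham Lincoln",
    "School": r"Lincoln High(?: School)?",
    "Car": r"Lincoln Continental",
    "City": r"Lincoln, Nebraska",
}


def label_occurrences(text):
    """Return (offset, domains) for each 'Lincoln', based on enclosing context matches."""
    labels = []
    for hit in re.finditer(r"\bLincoln\b", text):
        domains = [
            name
            for name, pattern in CONTEXT_PATTERNS.items()
            for ctx in re.finditer(pattern, text)
            if ctx.start() <= hit.start() and hit.end() <= ctx.end()
        ]
        labels.append((hit.start(), domains))
    return labels


domains_in_sentence = [domains for _, domains in label_occurrences(SENTENCE)]
domains_in_fragment = [domains for _, domains in
                       label_occurrences("thought Lincoln himself was")]
```

Each occurrence in the full sentence falls inside exactly one context pattern, but the fragment "thought Lincoln himself" matches none of them, which is the kind of case where a bare lexicon match alone cannot decide the domain.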
Tunable Parameters & Algorithm
• Confidence values
  – Names: William = 0.9; Rose = 0.6; Spatula = 0.03
• Weighted heuristics
  – Empirically, heuristic A is 2.3 times better than heuristic B
• Acceptance thresholds
  – “If ConfidenceValue(Name) > 0.5, accept”
• Candidate ranking
  – Heuristics vote; combine results; order candidate values and accept top n
• Algorithm
  – When to retrieve, parse, extract, or populate target
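The parameters listed above can be combined in a small sketch. It reuses the slide's example numbers (name confidences, the 2.3 weight, the 0.5 threshold), but the two heuristics, the weighted-average combination rule, and the name `rank_candidates` are assumptions made for illustration, not the system's actual design.

```python
# Confidence values from the slide; anything not listed scores 0.0 (assumption).
NAME_CONFIDENCE = {"William": 0.9, "Rose": 0.6, "Spatula": 0.03}


def lexicon_heuristic(value):
    """Heuristic A (hypothetical): confidence that the value is a name."""
    return NAME_CONFIDENCE.get(value, 0.0)


def capitalized_heuristic(value):
    """Heuristic B (hypothetical): 1.0 if the value starts with a capital letter."""
    return 1.0 if value[:1].isupper() else 0.0


HEURISTICS = {"A": lexicon_heuristic, "B": capitalized_heuristic}
WEIGHTS = {"A": 2.3, "B": 1.0}  # "A is 2.3 times better than B"


def rank_candidates(candidates, threshold=0.5, top_n=1):
    """Combine weighted votes, drop values at or below the threshold, keep the top n."""
    total_weight = sum(WEIGHTS.values())
    scored = []
    for value in candidates:
        votes = sum(WEIGHTS[name] * h(value) for name, h in HEURISTICS.items())
        score = votes / total_weight  # weighted average in [0, 1]
        if score > threshold:
            scored.append((score, value))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [value for _, value in scored[:top_n]]


top_two = rank_candidates(["Spatula", "Rose", "William"], top_n=2)
top_one = rank_candidates(["Spatula", "Rose", "William"], top_n=1)
```

Under these assumptions "Spatula" scores about 0.32 and is rejected by the threshold, while "William" (about 0.93) ranks ahead of "Rose" (about 0.72). Changing a weight, the threshold, or n changes the outcome, which is why these are treated as tunable.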
Our Approach
We can remedy deficiencies in the Ontos heuristics by defining an abstract framework that allows the ontology designer to:
1. Implement more accurate and powerful heuristics (specific to the ontology’s needs), and
2. Control elements of the extraction plan (order in which documents are retrieved and parsed, heuristics are applied, etc.)
Framework Overview
[UML class diagram. Classes and members shown:]
DataExtractionEngine
  #extractionPlan : ExtractionPlan
  +setInitializationParameters()
  +initialize()
  +doExtraction()
Ontos
  +setExtractionPlan(in plan : OntosExtractionPlan)
ExtractionPlan
  +execute()
OntosExtractionPlan
  #documentRetriever : DocumentRetriever
  #documentParser : DocumentStructureParser
  #ontology : Ontology
Ontology
DocumentRetriever
DocumentStructureParser
ObjectSet
RelationshipSet
ValueMappingHeuristic
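The classes named in the diagram might be laid out as follows. This is a minimal sketch under stated assumptions: the inheritance links (Ontos as a kind of DataExtractionEngine, OntosExtractionPlan as a kind of ExtractionPlan) and the retrieve-parse-populate order inside `execute` are guesses, since the diagram's connectors are not available here. Method names are the diagram's names rewritten in Python's snake_case, and the retriever, parser, and ontology are replaced by plain callables.

```python
from abc import ABC, abstractmethod


class ExtractionPlan(ABC):
    """Abstract plan; the diagram shows a single execute() operation."""

    @abstractmethod
    def execute(self):
        ...


class OntosExtractionPlan(ExtractionPlan):
    """Assumed to specialize ExtractionPlan; holds the three parts in the diagram."""

    def __init__(self, document_retriever, document_parser, ontology):
        self.document_retriever = document_retriever
        self.document_parser = document_parser
        self.ontology = ontology

    def execute(self):
        # Assumed order: retrieve, parse, then map values onto the ontology.
        document = self.document_retriever()
        structure = self.document_parser(document)
        return self.ontology(structure)


class DataExtractionEngine(ABC):
    """Generic engine that delegates the work to its extraction plan."""

    def __init__(self):
        self.extraction_plan = None
        self.parameters = {}

    def set_initialization_parameters(self, **parameters):
        self.parameters = dict(parameters)

    @abstractmethod
    def initialize(self):
        ...

    def do_extraction(self):
        if self.extraction_plan is None:
            raise RuntimeError("no extraction plan has been set")
        return self.extraction_plan.execute()


class Ontos(DataExtractionEngine):
    """Assumed to specialize DataExtractionEngine."""

    def initialize(self):
        pass  # nothing to set up in this sketch

    def set_extraction_plan(self, plan):
        self.extraction_plan = plan


# Usage with stand-in callables (hypothetical; real components would be objects).
plan = OntosExtractionPlan(
    document_retriever=lambda: "died September 30, 1998",
    document_parser=lambda text: text.split(),
    ontology=lambda tokens: {"token_count": len(tokens)},
)
engine = Ontos()
engine.initialize()
engine.set_extraction_plan(plan)
result = engine.do_extraction()
```

Keeping the plan separate from the engine is what would let an ontology designer swap in a different retrieval or parsing order without touching the engine itself.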
Progress
• Researched HMM-based heuristics
• Constructed XML Schema for ontologies
• Solidified specialization semantics
• Provided for directly populating ontology with extracted values
• Implementation is proceeding…