An Abstract Framework for Extraction Plans and Heuristics in a Data Extraction System

Alan Wessman
Brigham Young University

Based on research supported by NSF


Data Extraction

• Goal: Find useful information in documents without known formal structure
• Primary tasks:
  – Locate data of interest to application
  – Map identified data to an ontology


Ontos

• BYU approach to data extraction
• Domain knowledge encoded as ontology
  – Defines target data structure
  – Contains data recognition rules (“data frames”)
• Heuristics map extracted values to ontology
  – Populate sets of objects and relationships
  – Infer nonlexical objects
  – Satisfy ontology constraints
• Ontos algorithm puts it all together


Current Heuristics

--- OBITUARIES ONTOLOGY ---
Marriage Date matches [20]
  keyword "\bmarried\b";
end;

Funeral Date matches [20]
  keyword "\bfuneral\b";
end;

-- Deceased Person
Deceased Person [-> object];
Deceased Person [0:1] has Marriage Date [1:*];
Deceased Person [0:1] has Funeral [1];
...
-- Funeral
Funeral [0:1] is on Funeral Date [1:*];
...

-- Generalization/Specializations
Marriage Date, Funeral Date : Date;

Lemar K. Adamson, age 84, of Tucson, died September 30, 1998. He was born June 12, 1914 in Salt Lake City, Utah. He is survived by wife, Cindy; daughters, Elvia, Gloria, Irene, Isabel, Jewel, and Jessica; sons, Paul, John, Jeffery, and Louis; brothers, Kirk, Justin, Ivan, Hubert and Grover. Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward, 1540 E. Linden. Burial in City Cemetery. Friends may call from 9:00 a.m. to 10:00 a.m. Monday, at the church. Arrangements by BRING'S MEMORIAL CHAPEL, 236 S. Scott

• Object sets processed in order of appearance
• Accept-or-reject: Early bad choice prevents later better choices
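
The accept-or-reject problem can be illustrated with a small sketch. This is not the Ontos implementation: the simplified date pattern and the helpers `greedy_assign` and `keyword_anchored_assign` are assumptions made for illustration, and the text is a shortened form of the obituary above. Only the two keyword patterns come from the ontology fragment.

```python
import re

TEXT = ("Lemar K. Adamson, age 84, of Tucson, died September 30, 1998. "
        "He was born June 12, 1914 in Salt Lake City, Utah. "
        "Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward.")

# Hypothetical, simplified date recognizer (not the real Date data frame).
DATE = re.compile(r"(?:January|February|March|April|May|June|July|August|"
                  r"September|October|November|December) \d{1,2}, \d{4}")

# Keyword patterns taken from the ontology fragment on the slide.
KEYWORDS = {"Marriage Date": r"\bmarried\b", "Funeral Date": r"\bfuneral\b"}


def greedy_assign(text):
    """Accept-or-reject: each object set, in order, takes the first unclaimed date."""
    dates = [m.group() for m in DATE.finditer(text)]
    result = {}
    for name in KEYWORDS:
        if dates:
            result[name] = dates.pop(0)
    return result


def keyword_anchored_assign(text):
    """Assumed alternative: take the first date that follows the object set's keyword."""
    dates = list(DATE.finditer(text))
    result = {}
    for name, pattern in KEYWORDS.items():
        kw = re.search(pattern, text, re.IGNORECASE)
        if kw is None:
            continue  # no keyword evidence, so make no assignment
        following = [d for d in dates if d.start() >= kw.end()]
        if following:
            result[name] = following[0].group()
    return result


print(greedy_assign(TEXT))
print(keyword_anchored_assign(TEXT))
```

In this sketch the greedy pass hands the death date to Marriage Date and the birth date to Funeral Date, so the actual funeral date is never considered. The keyword-anchored pass leaves Marriage Date empty and finds October 5, 1998 for Funeral Date.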


Additional Problems

• Generalization/specialization
• Previously extracted data
• Complex document structure
• Overlapping value domains
• Tunable parameters and extraction algorithm


Generalization/Specialization


Previously Extracted Data

235. Foundations of Computer Science 1. (4:4:1) F, W, Sp, Su Prerequisite: CS 142. Iteration, induction, recursion, lists, trees, sets, relations, functions; mathematical analysis of algorithms and data models; object-oriented implementation of abstract data types.

236. Foundations of Computer Science 2. (4:4:1) F, W, Sp, Su Prerequisite: CS 235. Continuation of CS 235; relations, graphs, automata, grammars, propositional and predicate logic. Implementation of object-oriented algorithms.


Complex Document Structure

• Major sections with varying internal structures
• Nested lists with unstructured text
• Headings interspersed among records
• Icons, hyperlinks, etc.


Overlapping Value Domains

student at Lincoln High School, won the state

thought Lincoln himself was probably rolling over in his grave at the idea

drove all the way to Lincoln, where we ate at

When his history lesson about Abraham Lincoln finally ended, Steve left Lincoln High and drove his Lincoln Continental down to Lincoln, Nebraska.
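
One way to picture the problem is a recognizer that must decide which value domain each occurrence of "Lincoln" belongs to from its immediate context. The sketch below is a hypothetical illustration only: the `RULES` table, the labels, and the `classify_mentions` function are assumptions, not part of Ontos.

```python
import re

# Hypothetical context rules, checked in order; the first rule that covers
# an occurrence of "Lincoln" decides its label.
RULES = [
    ("School", re.compile(r"\bLincoln High\b")),
    ("Car", re.compile(r"\bLincoln Continental\b")),
    ("Person", re.compile(r"\bAbraham Lincoln\b|\bLincoln himself\b")),
    ("City", re.compile(r"\bLincoln, Nebraska\b|\bto Lincoln\b")),
]


def classify_mentions(text):
    """Return a (offset, label) pair for each occurrence of 'Lincoln' in text."""
    results = []
    for m in re.finditer(r"\bLincoln\b", text):
        label = "Unknown"
        for name, rule in RULES:
            # A rule applies if one of its matches spans this occurrence.
            if any(r.start() <= m.start() < r.end() for r in rule.finditer(text)):
                label = name
                break
        results.append((m.start(), label))
    return results


SENTENCE = ("When his history lesson about Abraham Lincoln finally ended, "
            "Steve left Lincoln High and drove his Lincoln Continental "
            "down to Lincoln, Nebraska.")

for offset, label in classify_mentions(SENTENCE):
    print(offset, label)
```

Hand-written context rules like these are brittle, which is the point of the slide: the same string is a plausible value for several object sets, so a heuristic needs evidence beyond the matched text itself.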


Tunable Parameters & Algorithm

• Confidence values
  – Names: William = 0.9; Rose = 0.6; Spatula = 0.03
• Weighted heuristics
  – Empirically, heuristic A is 2.3 times better than heuristic B
• Acceptance thresholds
  – “If ConfidenceValue(Name) > 0.5, accept”
• Candidate ranking
  – Heuristics vote; combine results; order candidate values and accept top n
• Algorithm
  – When to retrieve, parse, extract, or populate target
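
The parameters above could interact roughly as in the following sketch. It assumes a weighted average as the way votes are combined, which the slides do not specify. The function names `combine_votes` and `rank_and_accept` are hypothetical, as are the confidence values attributed to heuristic B; the values for heuristic A, the 2.3 weight ratio, and the 0.5 threshold are taken from the bullets.

```python
def combine_votes(votes, weights):
    """Weighted average of per-heuristic confidence values for each candidate.

    Assumes every heuristic votes on every candidate.
    """
    total = sum(weights.values())
    return {
        candidate: sum(weights[h] * conf for h, conf in by_heuristic.items()) / total
        for candidate, by_heuristic in votes.items()
    }


def rank_and_accept(scores, threshold=0.5, top_n=1):
    """Order candidates by combined score; keep at most top_n above the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [candidate for candidate, score in ranked if score > threshold][:top_n]


# Heuristic A is weighted 2.3 times heuristic B, as on the slide.
WEIGHTS = {"A": 2.3, "B": 1.0}

# A's values are the slide's name confidences; B's values are made up.
VOTES = {
    "William": {"A": 0.9, "B": 0.8},
    "Rose": {"A": 0.6, "B": 0.2},
    "Spatula": {"A": 0.03, "B": 0.1},
}

scores = combine_votes(VOTES, WEIGHTS)
accepted = rank_and_accept(scores, threshold=0.5, top_n=2)
print(scores)
print(accepted)
```

With these made-up votes, Rose passes the 0.5 threshold under heuristic A alone but falls just below it once B's lower vote is averaged in, so small changes to weights or thresholds change what is extracted. That sensitivity is one reason to expose these values to the ontology designer.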


Our Approach

We can remedy deficiencies in the Ontos heuristics by defining an abstract framework that allows the ontology designer to:

1. Implement more accurate and powerful heuristics (specific to the ontology’s needs), and

2. Control elements of the extraction plan (order in which documents are retrieved and parsed, heuristics are applied, etc.)


Framework Overview

[Class diagram; members are listed under each class]

DataExtractionEngine
  #extractionPlan : ExtractionPlan
  +setInitializationParameters()
  +initialize()
  +doExtraction()

Ontos
  +setExtractionPlan(in plan : OntosExtractionPlan)

ExtractionPlan
  +execute()

OntosExtractionPlan
  #documentRetriever : DocumentRetriever
  #documentParser : DocumentStructureParser
  #ontology : Ontology

Ontology

DocumentRetriever

DocumentStructureParser

ObjectSet

RelationshipSet

ValueMappingHeuristic
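
The class names and members above suggest a strategy-style design, sketched below in Python. Several points are assumptions, because the diagram's arrows did not survive: that Ontos specializes DataExtractionEngine, that OntosExtractionPlan specializes ExtractionPlan, and that `doExtraction()` delegates to the plan's `execute()`. The retriever, parser, and ontology are stood in for by plain callables, and the retrieve, parse, map order inside `execute` is only one possible plan.

```python
from abc import ABC, abstractmethod


class ExtractionPlan(ABC):
    """Decides the order in which retrieval, parsing and heuristics run."""

    @abstractmethod
    def execute(self):
        ...


class DataExtractionEngine(ABC):
    def __init__(self):
        self._extraction_plan = None  # '#extractionPlan' in the diagram
        self._params = {}

    def set_initialization_parameters(self, **params):
        self._params.update(params)

    @abstractmethod
    def initialize(self):
        ...

    def do_extraction(self):
        # Assumed behaviour: the engine delegates all work to its plan.
        if self._extraction_plan is None:
            raise RuntimeError("no extraction plan set")
        return self._extraction_plan.execute()


class OntosExtractionPlan(ExtractionPlan):
    def __init__(self, document_retriever, document_parser, ontology):
        # Callables stand in for DocumentRetriever, DocumentStructureParser
        # and Ontology; the real interfaces are not shown on the slide.
        self.document_retriever = document_retriever
        self.document_parser = document_parser
        self.ontology = ontology

    def execute(self):
        # Assumed order: retrieve, parse, then map values to the ontology.
        document = self.document_retriever()
        parsed = self.document_parser(document)
        return self.ontology(parsed)


class Ontos(DataExtractionEngine):
    def initialize(self):
        pass  # nothing to set up in this sketch

    def set_extraction_plan(self, plan: OntosExtractionPlan):
        self._extraction_plan = plan


# Toy usage: "retrieve" a string, "parse" it into tokens, "map" by counting.
engine = Ontos()
engine.set_extraction_plan(OntosExtractionPlan(lambda: "born June 12", str.split, len))
engine.initialize()
result = engine.do_extraction()
print(result)
```

Keeping the plan as a separate object is what would let an ontology designer change the order of retrieval, parsing, and heuristic application without touching the engine.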


Progress

• Researched HMM-based heuristics
• Constructed XML Schema for ontologies
• Solidified specialization semantics
• Provided for directly populating ontology with extracted values
• Implementation is proceeding…