David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge

WoK: A Web of Knowledge

Page 1: WoK: A Web of Knowledge

David W. Embley

Brigham Young University

Provo, Utah, USA

WoK: A Web of Knowledge

Page 2: WoK: A Web of Knowledge

A Web of Pages

A Web of Facts

Birthdate of my great-grandpa Orson

Price and mileage of red Nissans, 1990 or newer

Location and size of chromosome 17

US states with property crime rates above 1%

Page 3: WoK: A Web of Knowledge

Toward a Web of Knowledge

• Fundamental questions
  – What is knowledge?
  – What are facts?
  – How does one know?
• Philosophy
  – Ontology
  – Epistemology
  – Logic and reasoning

Page 4: WoK: A Web of Knowledge

Ontology

• Existence – asks: “What exists?”
• Concepts, relationships, and constraints with formal foundation

Page 5: WoK: A Web of Knowledge

Epistemology

• The nature of knowledge – asks: “What is knowledge?” and “How is knowledge acquired?”
• Populated conceptual model

Page 6: WoK: A Web of Knowledge

Logic and Reasoning

• Principles of valid inference – asks: “What is known?” and “What can be inferred?”
• For us, it answers: what can be inferred (in a formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or newer

Page 7: WoK: A Web of Knowledge
Page 8: WoK: A Web of Knowledge

Making This Work: How?

• Distill knowledge from the wealth of digital web data
• Annotate web pages
• Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge

[Figure labels: Fact, Annotation]

Page 9: WoK: A Web of Knowledge

Turning Raw Symbols into Knowledge

• Symbols: $ 11,500 117K Nissan CD AC
• Data: price(11,500) mileage(117K) make(Nissan)
• Conceptualized data:
  – Car(C123) has Price($11,500)
  – Car(C123) has Mileage(117,000)
  – Car(C123) has Make(Nissan)
  – Car(C123) has Feature(AC)
• Knowledge
  – “Correct” facts
  – Provenance
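The progression from symbols to conceptualized data can be sketched in Python. This is only an illustration under stated assumptions: the `RECOGNIZERS` table, the function names, and the car ID handling are hypothetical and are not taken from the WoK system.

```python
import re

# Raw symbols as they appear on the slide.
symbols = "$ 11,500 117K Nissan CD AC"

# Hypothetical recognizers: concept name -> pattern (assumed for illustration).
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"\b(\d+)K\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}

def extract_data(text):
    """Symbols -> data: tag each recognized value with its concept."""
    data = []
    for concept, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            data.append((concept, match.group(1)))
    return data

def conceptualize(car_id, data):
    """Data -> conceptualized data: relate each value to one Car object."""
    facts = []
    for concept, value in data:
        if concept == "Price":
            value = "$" + value
        elif concept == "Mileage":
            value = f"{int(value) * 1000:,}"  # 117K -> 117,000
        facts.append(f"Car({car_id}) has {concept}({value})")
    return facts

facts = conceptualize("C123", extract_data(symbols))
for fact in facts:
    print(fact)
```

Turning these facts into knowledge would additionally require checking correctness and recording provenance, which this sketch does not attempt.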

Page 10: WoK: A Web of Knowledge

Actualization (with Extraction Ontologies)

Find me the price and mileage of all red Nissans – I want a 1990 or newer.

Page 11: WoK: A Web of Knowledge

Data Extraction Demo

Page 12: WoK: A Web of Knowledge

Semantic Annotation Demo

Page 13: WoK: A Web of Knowledge

Free-Form Query Demo

Page 14: WoK: A Web of Knowledge

Explanation: How it Works

• Extraction Ontologies
• Semantic Annotation
• Free-Form Query Interpretation

Page 15: WoK: A Web of Knowledge

Extraction Ontologies

Object sets

Relationship sets

Participation constraints

Lexical

Non-lexical

Primary object set

Aggregation

Generalization/Specialization

Page 16: WoK: A Web of Knowledge

Extraction Ontologies

Data Frame:

Internal Representation: float

Values
  External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  Left Context: $

Key Word Phrase
  Key Words: ([Pp]rice)|([Cc]ost)| …

Operators
  Operator: >
  Key Words: (more\s*than)|(more\s*costly)|…
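A data frame of this kind might be modeled as follows. This is a minimal sketch, not the extraction-ontology implementation: the `DataFrame` class and its method names are assumptions, and the value pattern is a variant of the slide's pattern that also accepts thousands separators.

```python
import re
from dataclasses import dataclass

@dataclass
class DataFrame:
    """Hypothetical container for the recognizers of one object set."""
    name: str
    value_pattern: re.Pattern    # external representation
    keyword_pattern: re.Pattern  # context keywords
    operators: dict              # operator symbol -> keyword pattern

    def find_values(self, text):
        """Return recognized values in the internal representation (float)."""
        return [float(m.group(1).replace(",", ""))
                for m in self.value_pattern.finditer(text)]

    def find_operator(self, text):
        """Return the first operator whose keywords occur in the text."""
        for symbol, pattern in self.operators.items():
            if pattern.search(text):
                return symbol
        return None

price = DataFrame(
    name="Price",
    # Left context "$", then digits with optional commas and optional cents.
    value_pattern=re.compile(r"\$\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)"),
    keyword_pattern=re.compile(r"[Pp]rice|[Cc]ost"),
    operators={">": re.compile(r"more\s*than|more\s*costly")},
)

print(price.find_values("Asking price: $11,500 or best offer"))
print(price.find_operator("anything that costs more than $5,000"))
```

Keeping the recognizers declarative, as data rather than code, is what lets the same ontology apply to new pages in the same domain.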

Page 17: WoK: A Web of Knowledge

Generality & Resiliency of Extraction Ontologies

• Generality: assumptions about web pages
  – Data rich
  – Narrow domain
  – Document types
    • Single-record documents (hard, but doable)
    • Multiple-record documents (harder)
    • Records with scattered components (even harder)
• Resiliency: declarative
  – Still works when web pages change
  – Works for new, unseen pages in the same domain
  – Scalable, but takes work to declare the extraction ontology

Page 18: WoK: A Web of Knowledge

Semantic Annotation

Page 19: WoK: A Web of Knowledge

Free-Form Query Interpretation

• Parse Free-Form Query (with respect to data extraction ontology)
• Select Ontology
• Formulate Query Expression
• Run Query Over Semantically Annotated Data

Page 20: WoK: A Web of Knowledge

Parse Free-Form Query

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Matched terms: price, mileage, red, Nissan, 1996, “or newer” (>= Operator)

Page 21: WoK: A Web of Knowledge

Select Ontology

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
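One plausible way to select an ontology is to score each candidate by how many of its recognizers fire on the query. The slides do not state the selection criterion, so the scoring rule below is an assumption, and both toy ontologies are invented for illustration.

```python
import re

# Hypothetical, heavily simplified ontologies: concept name -> recognizer.
ONTOLOGIES = {
    "CarAds": {
        "Price": re.compile(r"\bprice\b", re.I),
        "Mileage": re.compile(r"\bmileage\b", re.I),
        "Color": re.compile(r"\b(red|blue|white|black)\b", re.I),
        "Make": re.compile(r"\b(Nissan|Toyota|Honda)s?\b", re.I),
        "Year": re.compile(r"\b(19|20)\d{2}\b"),
    },
    "Genealogy": {
        "Birthdate": re.compile(r"\b(birthdate|born)\b", re.I),
        "Relative": re.compile(r"\b(grandpa|grandma|father|mother)\b", re.I),
        "Year": re.compile(r"\b(1[6-9]|20)\d{2}\b"),
    },
}

def select_ontology(query):
    """Return the best-scoring ontology name and all scores."""
    scores = {
        name: sum(1 for pattern in concepts.values() if pattern.search(query))
        for name, concepts in ONTOLOGIES.items()
    }
    return max(scores, key=scores.get), scores

query = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
best, scores = select_ontology(query)
print(best, scores)
```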

Page 22: WoK: A Web of Knowledge

Formulate Query Expression

• Conjunctive queries and aggregate queries
• Mentioned object sets are all of interest.
• Values and operator keywords determine conditions.
  – Color = “red”
  – Make = “Nissan”
  – Year >= 1996 (>= Operator)
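The step from matched terms to conditions can be sketched as follows. The recognizer tables (`MENTIONS`, `VALUES`, `OPERATORS`) and the `formulate` function are hypothetical stand-ins for what an extraction ontology would supply. The sketch handles only simple conjunctive conditions.

```python
import re

# Keywords that mention an object set without giving a value (assumed).
MENTIONS = {
    "Price": re.compile(r"\b(price|cost)\b", re.I),
    "Mileage": re.compile(r"\b(mileage|miles)\b", re.I),
}
# Value recognizers with a cast to the internal representation (assumed).
VALUES = {
    "Color": (re.compile(r"\b(red|blue|white|black)\b", re.I), str),
    "Make": (re.compile(r"\b(Nissan|Toyota|Honda)s?\b", re.I), str),
    "Year": (re.compile(r"\b((?:19|20)\d{2})\b"), int),
}
# Operator keywords looked for immediately after a recognized value (assumed).
OPERATORS = {
    "Year": [(">=", re.compile(r"^\s*or\s+newer", re.I)),
             ("<=", re.compile(r"^\s*or\s+older", re.I))],
}

def formulate(query):
    """Return (requested object sets, list of (concept, operator, value))."""
    requested = [c for c, p in MENTIONS.items() if p.search(query)]
    conditions = []
    for concept, (pattern, cast) in VALUES.items():
        match = pattern.search(query)
        if match is None:
            continue
        operator = "="  # default when no operator keyword follows the value
        rest = query[match.end():]
        for symbol, keyword in OPERATORS.get(concept, []):
            if keyword.search(rest):
                operator = symbol
                break
        conditions.append((concept, operator, cast(match.group(1))))
    return requested, conditions

query = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
requested, conditions = formulate(query)
print(requested)
print(conditions)
```

The requested object sets correspond to what the query returns, and the conditions to its filter.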

Page 23: WoK: A Web of Knowledge

Formulate Query Expression

[Figure: query expression with For, Let, Where, and Return clauses]

Page 24: WoK: A Web of Knowledge

Run Query Over Semantically Annotated Data

Page 25: WoK: A Web of Knowledge

Great! But Problems Still Need Resolution

• How do we create extraction ontologies?
  – Manual creation requires several dozen person-hours
  – Semi-automatic creation
    • TISP (Table Interpretation by Sibling Pages)
    • TANGO (Table ANalysis for Generating Ontologies)
    • Nested Schemas with Regular Expressions
    • Synergistic Bootstrapping
    • Form-based Information Harvesting
• How do we scale up?
  – Practicalities of technology transfer and usage
  – Millions of queries over zillions of facts for thousands of ontologies

Page 26: WoK: A Web of Knowledge

Manual Creation

Page 27: WoK: A Web of Knowledge

Manual Creation

Page 28: WoK: A Web of Knowledge

Manual Creation

- Library of instance recognizers
- Library of lexicons

Page 29: WoK: A Web of Knowledge

Automatic Annotation with TISP (Table Interpretation by Sibling Pages)

• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations

Page 30: WoK: A Web of Knowledge

Recognize Tables

Data Table

Layout Tables (discard)

Nested Data Tables

Page 31: WoK: A Web of Knowledge

Locate Table Labels

Examples:
Identification.Gene model(s).Protein
Identification.Gene model(s).2

Page 32: WoK: A Web of Knowledge

Locate Table Labels

Examples:
Identification.Gene model(s).Gene Model
Identification.Gene model(s).2

12

Page 33: WoK: A Web of Knowledge

Locate Table Values

Value

Page 34: WoK: A Web of Knowledge

Find Label/Value Associations

Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918

12

Page 35: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Page 36: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Same

Page 37: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Almost Same

Page 38: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Different

Same

Page 39: WoK: A Web of Knowledge

Technique Details

• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (layout table: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
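The core intuition of sibling page comparison, that cells identical across sibling pages are likely labels and cells that differ are likely values, can be sketched as below. This is a simplification under stated assumptions: the tables are already unnested and aligned cell by cell, the sample cell contents are made up, and the actual TISP matching and pattern discovery are more involved than this.

```python
def classify_cells(table_a, table_b):
    """Mark each cell 'label' if both siblings agree on its text, else 'value'."""
    kinds = []
    for row_a, row_b in zip(table_a, table_b):
        kinds.append(["label" if a == b else "value"
                      for a, b in zip(row_a, row_b)])
    return kinds

def associate(table, kinds):
    """Pair each value with the nearest label to its left in the same row."""
    pairs = []
    for row, row_kinds in zip(table, kinds):
        label = None
        for cell, kind in zip(row, row_kinds):
            if kind == "label":
                label = cell
            elif label is not None:
                pairs.append((label, cell))
    return pairs

# Two hypothetical sibling pages generated from the same template.
page_1 = [["Name", "geneA"], ["Protein", "proteinA"]]
page_2 = [["Name", "geneB"], ["Protein", "proteinB"]]

kinds = classify_cells(page_1, page_2)
pairs = associate(page_1, kinds)
print(pairs)
```

A value that happens to be identical on both pages would be misread as a label here, which is one reason comparing more than two siblings and adjusting the pattern dynamically is useful.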

Page 40: WoK: A Web of Knowledge

Generated RDF

Page 41: WoK: A Web of Knowledge

WoK Demo (via TISP)

Page 42: WoK: A Web of Knowledge

Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)

• Recognize and normalize table information
• Construct mini-ontologies from tables
• Discover inter-ontology mappings
• Merge mini-ontologies into a growing ontology

Page 43: WoK: A Web of Knowledge

Recognize Table Information

Country      Population         Religion
             (July 2001 est.)   Albanian Orthodox  Muslim  Roman Catholic  Shi’a Muslim  Sunni Muslim  other
Afghanistan  26,813,057                                                    15%           84%           1%
Albania      3,510,484          20%                70%     10%

Page 44: WoK: A Web of Knowledge

Construct Mini-Ontology

Country      Population         Religion
             (July 2001 est.)   Albanian Orthodox  Muslim  Roman Catholic  Shi’a Muslim  Sunni Muslim  other
Afghanistan  26,813,057                                                    15%           84%           1%
Albania      3,510,484          20%                70%     10%

Page 45: WoK: A Web of Knowledge

Discover Mappings

Page 46: WoK: A Web of Knowledge

Merge

Page 47: WoK: A Web of Knowledge

Bootstrapping Cost-effective and Accurate Extraction

• Focus on semi-structured elements first
• Bootstrap synergistically
  – Extract from semi-structured elements
  – Learn extraction ontologies
  – Extract from plain text

Page 48: WoK: A Web of Knowledge

ListReader: Wrapper Induction for Lists

Page 49: WoK: A Web of Knowledge

Part I: Semi-supervised

Page 50: WoK: A Web of Knowledge

OCR

newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

Page 51: WoK: A Web of Knowledge

Hand Form Creation & Labeling

Page 52: WoK: A Web of Knowledge

Hand Form Creation & Labeling

Page 53: WoK: A Web of Knowledge

Hand Form Creation & Labeling

Donald√

Page 54: WoK: A Web of Knowledge

Hand Form Creation & Labeling

Donald Bakken√

Page 55: WoK: A Web of Knowledge

Hand Form Creation & Labeling

Donald Bakken Dude√

Page 56: WoK: A Web of Knowledge

Hand Form Creation & Labeling

Donald Bakken Dude Right Half Back√

Page 57: WoK: A Web of Knowledge

Generate Wrapper for First Record

Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

1. Captain, 2. Given Name, 3. Nickname, 4. Surname, 5. Position

(Captain) (\w{6,6}) "(\w{4,4})" (\w{6,6}) \.{14,14} ((\w{4,5}){3,3})\n

Page 58: WoK: A Web of Knowledge

Update Wrapper & Annotate Records

Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

2. Captain, 3. Given Name, 5. Nickname, 6. Surname, 7. Position

((Captain) )?(\w{5,6})( "(\w{4,5}) ['"] )? (\w{6,7}) [\.,]{14,34} ((\w{4,7} ){2,3})\n

Page 59: WoK: A Web of Knowledge

Final Wrapper and Annotation

Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline

2. Captain, 3. Given Name, 5. Nickname, 7. Surname, 8. Position

((Captain) )?(\w{4,7})( "((\w{4,7}){1,2})['"] )? (\w{5,8} ) [\.,]{14,34} ((\w{4,7} ){1,3})\n
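To show how a generalized wrapper annotates records, here is a hand-written Python pattern loosely modeled on the final wrapper. It is not the output of ListReader's induction: the named groups, the relaxed length bounds, and the `annotate` helper are assumptions chosen so that the example runs on the OCR lines from the slides.

```python
import re

# Hand-written analogue of the final wrapper (not induced by ListReader).
WRAPPER = re.compile(
    r"""^(?:(?P<captain>Captain)\ )?      # optional title
        (?P<given>\w+)                    # given name
        (?:\ "(?P<nickname>[\w ]+)['"])?  # optional nickname, OCR may close with '
        \ (?P<surname>\w+)                # surname
        \ [.,]+                           # leader dots, OCR sometimes reads commas
        \ (?P<position>[\w ]+)$           # position, one or more words
    """,
    re.VERBOSE,
)

def annotate(line):
    """Return the labeled fields of one list record, or None if it does not match."""
    match = WRAPPER.match(line)
    return match.groupdict() if match else None

# Records copied from the OCR text on the slides, including OCR errors.
records = [
    'Captain Donald "Dude" Bakken ............... Right Half Back',
    "LeRoy \"Sonny' Johnson ..................,.... Lcft Half Back",
    "Orley Bakken ...........,...........,.......... Quarter Back",
    'Howard "Little Huby" Megorden ................ Right Guard',
]

for record in records:
    print(annotate(record))
```

Lines outside the list, such as "Other lettermen were-", do not match and are left unannotated, which is also how a wrapper can help delimit where a list ends.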

Page 60: WoK: A Web of Knowledge

Part II: Weakly-supervised

Page 61: WoK: A Web of Knowledge

Apply Extraction Ontologies

Page 62: WoK: A Web of Knowledge

Find List and Generate Wrapper

Base list finding on whether a wrapper can be generated.
Base wrapper generation on best-labeled record.

Page 63: WoK: A Web of Knowledge

Extract Synergistically from Text

Page 64: WoK: A Web of Knowledge

Extract Synergistically from Text

Page 65: WoK: A Web of Knowledge

Form Creation

Basic form-construction facilities:
• single-entry field
• multiple-entry field
• nested form
• …

Page 66: WoK: A Web of Knowledge

Created Sample Form

Page 67: WoK: A Web of Knowledge

Generated Ontology View

Page 68: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 69: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 70: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 71: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 72: WoK: A Web of Knowledge

Almost Ready to Harvest

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Page 73: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name

Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3

Page 74: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name

Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3

Page 75: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Page 76: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Page 77: WoK: A Web of Knowledge

Can Now Harvest

Name

Page 78: WoK: A Web of Knowledge

Can Now Harvest

Name

14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E

Page 79: WoK: A Web of Knowledge

Can Now Harvest

Name

Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3

Page 80: WoK: A Web of Knowledge

Can Now Harvest

Name

Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS

Page 81: WoK: A Web of Knowledge

Harvesting Populates Ontology

Page 82: WoK: A Web of Knowledge

Harvesting Populates Ontology

Also helps adjust ontology constraints

Page 83: WoK: A Web of Knowledge

Can Harvest from Additional Sites

Name

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Page 84: WoK: A Web of Knowledge

Automating Extraction Ontology Creation

Lexicons

Name

14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E

Name

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Name

Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS

…
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
…
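Once names have been harvested, a lexicon can serve directly as an instance recognizer. The sketch below assumes the harvested names are available as a plain list. The function name and the longest-first ordering are illustrative choices, not details taken from the slides.

```python
import re

# A few names harvested from the protein pages shown on the slides.
harvested = [
    "14-3-3 protein epsilon",
    "KCIP-1",
    "T-complex protein 1 subunit theta",
    "TCP-1-theta",
    "VDAC-3",
    "hVDAC3",
]

def build_lexicon_recognizer(names):
    """Compile a lexicon into one pattern, trying longer names first."""
    unique = {name.strip() for name in names if name.strip()}
    ordered = sorted(unique, key=lambda name: (-len(name), name))
    return re.compile("|".join(re.escape(name) for name in ordered))

recognizer = build_lexicon_recognizer(harvested)
found = recognizer.findall("Both TCP-1-theta and VDAC-3 were detected.")
print(found)
```

Ordering the alternatives longest-first keeps a short name from shadowing a longer name that starts with it.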

Page 85: WoK: A Web of Knowledge

Automating Extraction Ontology Creation

Instance Recognizers

Number Patterns

Context Keywords and Phrases

Page 86: WoK: A Web of Knowledge

Automatic Source-to-Form Mapping

Page 87: WoK: A Web of Knowledge

Automatic Semantic Annotation

Recognize and annotate with respect to an ontology

Page 88: WoK: A Web of Knowledge

Practicalities: WoK Query Interfaces (Future Work)

• Advanced free-form queries with disjunction and negation
• Form-based query language
• Table-based query languages
• Graphical query languages

Page 89: WoK: A Web of Knowledge

Practicalities: Bootstrapping the WoK (Future Work)

• Won’t just happen without sufficient content
• Niche applications
  – Historical Data (e.g. Genealogy)
  – Topical Blogs
• Local WoKs
  – Intra-organizational effort
  – Individual interests

Page 90: WoK: A Web of Knowledge

Practicalities: Scalability (Future Work)

• Potential rapid growth
  – Thousands of ontologies
  – Millions of simultaneous queries
  – Billions of annotated pages
  – Trillions of facts
• Search-engine-like caching & query processing

Page 91: WoK: A Web of Knowledge

Key to Success: Simplicity via Automation

• Automatic (or near automatic) creation of extraction ontologies
• Automatic (or near automatic) annotation of web pages
• Simple but accurate query specification without specialized training

www.deg.byu.edu