David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge

Page 1:

David W. Embley

Brigham Young University

Provo, Utah, USA

WoK: A Web of Knowledge

Page 2:

A Web of Pages

A Web of Facts

• Birthdate of my great grandpa Orson
• Price and mileage of red Nissans, 1990 or newer
• Location and size of chromosome 17
• US states with property crime rates above 1%

Page 3:

Find me an image that is red, dark, scary, and beautiful.

Page 4:

Learn rules to recognize names, even in less-than-ideal OCR’d documents.

• Seed models:
  – Prefix: “Mrs”, “Miss”, “Mr”
  – Initials: “A”, “B”, “C”, …
  – Given Name: “Charles”, “Francis”, “Herbert”
  – Surname: “Goodrich”, “Wells”, “White”
  – Stopword: “Jewell”, “Graves”
• Updates:
  – Prefix: first token in line
  – Given Name: between ‘Prefix’ and ‘Initial’
  – Surname: between initial and </S>

OCR’d sample lines:

M RS CHARLES A JEWELL
MRS FRANCIS B COOI EN
MRS P W ELILSWVORT
MRs HERBERT C ADSWVORTH
MRS HENRY E TAINTOR
MR DANIEl H WELLS
MRS ARTHUR L GOODRICH
Miss JOSEPHINE WHITE
Mss JULIA A GRAVES
Ms H B LANGDON
Miss MARY H ADAMS
Miss ELIZA F Mix'
MRs MIARY C ST )NEC
MIRS AI I ERT H PITKIN
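The seed models and update rules above can be read as a small lexicon-plus-position tagger. The following is a minimal sketch under that reading, not the system from the talk: `SEEDS`, `label_line`, and the label names are hypothetical, the stopword seeds are left out, and only exact (case-insensitive) lexicon hits are handled, so OCR-damaged prefixes such as "M RS" would still need fuzzy matching.

```python
# Hypothetical sketch: label the tokens of one OCR'd line with seed lexicons,
# then apply the three positional update rules listed on the slide.
SEEDS = {
    "Prefix": {"mrs", "miss", "mr"},
    "GivenName": {"charles", "francis", "herbert"},
    "Surname": {"goodrich", "wells", "white"},
}


def label_line(line):
    """Return (token, label) pairs; label is None when no rule applies."""
    tokens = line.split()
    labels = []
    for tok in tokens:
        if len(tok) == 1 and tok.isalpha():
            labels.append("Initial")
        else:
            low = tok.lower()
            labels.append(next((k for k, v in SEEDS.items() if low in v), None))
    # Update rule: the first token in a line is a Prefix.
    if tokens and labels[0] is None:
        labels[0] = "Prefix"
    # Update rule: a token between a Prefix and an Initial is a Given Name.
    for i in range(1, len(tokens) - 1):
        if labels[i] is None and labels[i - 1] == "Prefix" and labels[i + 1] == "Initial":
            labels[i] = "GivenName"
    # Update rule: a token between an Initial and the end of line is a Surname.
    if len(tokens) >= 2 and labels[-1] is None and labels[-2] == "Initial":
        labels[-1] = "Surname"
    return list(zip(tokens, labels))
```

For the line "MRS HENRY E TAINTOR", neither "HENRY" nor "TAINTOR" is in a seed lexicon, yet both receive labels from their position. Newly labeled tokens could then be added back to the lexicons, which is presumably what "Updates" refers to.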

Page 5:

Annotating Music and Lyrics

Find something soothing but energetic: good for recovering patients.

How about Mozart’s 40th Symphony?

Page 6:

Build a knowledge bundle for checking the association between tp53 polymorphism and lung cancer.

Page 7:

U.S. President Barack Obama visited Iraq Monday in a stop that was overshadowed by the question of when U.S. troops should go home. Obama made his opposition to the U.S.-led invasion of Iraq five years ago a centerpiece of his campaign and was in Baghdad to assess security in Iraq, where violence has fallen to its lowest level since early 2004.

When has Barack Obama visited Iraq?

Which U.S. Presidents have visited Iraq?

Page 8:

Find names, locations, events, and dates and associations among them for my great grandma Margaret Haines.

I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April 9 1907 to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married Louis Slingerland a former pupil of Connec- Farms High School Mr Slingerland is engaged in building business in St Petersburg JENNIE HAINES Elmwood Ave Union Union Co N J Graduated from State Normal School Trenton N J in 190 5 Principal of Hurden Looker School in Hillside Township formerly a part of Union Town- ship STELLA ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y WALTER BOSCHEN Morris Ave Union N J Completed fourth year at Battin High School in 1900 Attended Rutgers College taking up civil engineering course Has been successful in the business world President of the W G Boschen Sales Co Inc manufacturer general agents for mechanical line GEORGE McQUAIDE Springfield N J Was employed by Morris County Traction Company 1900 No graduates 1901 No graduates 1902 MARGARET HAINES Elmwood Ave Union N J Took up stenography and typewriting and is now employed as private secretary of the Correspondence Department of the Singer Manufacturing Com- pany of Elizabeth N J ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N J CLARENCE GRIGGS Stuyvesant Ave Union N J Graduated from Trenton State Normal School in 1905 having specialized in manual training Taught one half year at Neshanic N J one year at Lin- coin School Roselle N J Teaching manual 'training and mechanical drawing in Newark N J Has taken special courses in Columbia University 34

Page 9:

Who was the first person to land on the Moon?

Page 10:

2002 Jeep Liberty

$7,995 Toll free 1-800-423-0334

Alert! Alert! I found your Jeep Liberty for under $8,000.

Page 11:

Toward a Web of Knowledge

• Fundamental questions
  – What is knowledge?
  – What are facts?
  – How does one know?
• Philosophy
  – Ontology
  – Epistemology
  – Logic and reasoning

Page 12:

Ontology

• Existence: asks “What exists?”
• Concepts, relationships, and constraints with a formal foundation

Page 13:

Epistemology

• The nature of knowledge: asks “What is knowledge?” and “How is knowledge acquired?”
• Populated conceptual model

Page 14:

Logic and Reasoning

• Principles of valid inference: asks “What is known?” and “What can be inferred?”
• For us, it answers: what can be inferred (in a formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or newer

Page 15:

Making this Work: How?

• Distill knowledge from the wealth of digital web data
• Annotate web pages
• Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge

[Diagram labels: Fact, Fact, Fact; Annotation, Annotation]

Page 16:

Turning Raw Symbols into Knowledge

• Symbols: $ 11,500 117K Nissan CD AC
• Data: price(11,500) mileage(117K) make(Nissan)
• Conceptualized data:
  – Car(C123) has Price($11,500)
  – Car(C123) has Mileage(117,000)
  – Car(C123) has Make(Nissan)
  – Car(C123) has Feature(AC)
• Knowledge
  – “Correct” facts
  – Provenance
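The progression from symbols to conceptualized data can be illustrated with a few lines of code. This is only a sketch under assumed conventions: `conceptualize`, the tiny make and feature lexicons, and the triple layout are hypothetical, and provenance (where each fact was found) is omitted.

```python
import re


def conceptualize(symbols, car_id="C123"):
    """Turn a raw symbol string into (object, relationship, value) facts."""
    facts = []
    # Symbols -> data: "$ 11,500" becomes the number 11500.
    price = re.search(r"\$\s*([\d,]+)", symbols)
    if price:
        facts.append((car_id, "Price", int(price.group(1).replace(",", ""))))
    # "117K" is normalized to 117000 miles.
    mileage = re.search(r"\b(\d+)K\b", symbols)
    if mileage:
        facts.append((car_id, "Mileage", int(mileage.group(1)) * 1000))
    for make in ("Nissan", "Jeep"):  # assumed illustrative lexicon
        if re.search(rf"\b{make}\b", symbols):
            facts.append((car_id, "Make", make))
    for feature in ("AC", "CD"):  # assumed illustrative lexicon
        if re.search(rf"\b{feature}\b", symbols):
            facts.append((car_id, "Feature", feature))
    return facts
```

Calling `conceptualize("$ 11,500 117K Nissan CD AC")` yields facts attached to the non-lexical object `C123`, mirroring the "Car(C123) has ..." lines of the slide.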

Page 17:

Actualization (with Extraction Ontologies)

Find me the price and mileage of all red Nissans – I want a 1990 or newer.

Page 18:

Data Extraction Demo

Page 19:

Semantic Annotation Demo

Page 20:

Free-Form Query Demo

Page 21:

Explanation: How it Works

• Extraction Ontologies
• Semantic Annotation
• Free-Form Query Interpretation

Page 22:

Extraction Ontologies

[Diagram labels: Object sets (Lexical, Non-lexical, Primary object set); Relationship sets; Participation constraints; Aggregation; Generalization/Specialization]

Page 23:

Extraction Ontologies

Data Frame:

• Internal Representation: float
• Values
  – External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  – Left Context: $
  – Key Word Phrase
    Key Words: ([Pp]rice)|([Cc]ost)| …
• Operators
  – Operator: >
    Key Words: (more\s*than)|(more\s*costly)|…
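A data frame of this kind bundles a value recognizer, context keywords, and operator keywords for one object set. The sketch below shows one possible encoding for Price. The class `PriceDataFrame` and its method names are hypothetical, and the value pattern is an assumed expression that accepts thousands separators, not the exact expression shown on the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class PriceDataFrame:
    # Assumed external representation: "$", then digits with optional
    # thousands separators and optional cents.
    value_pattern: str = r"\$\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?"
    keyword_pattern: str = r"[Pp]rice|[Cc]ost"
    greater_than_pattern: str = r"more\s*than|more\s*costly"

    def extract(self, text):
        """Return prices found in text in the internal representation (float)."""
        return [
            float((m.group(1) + (m.group(2) or "")).replace(",", ""))
            for m in re.finditer(self.value_pattern, text)
        ]

    def has_keyword(self, text):
        """True if a context keyword for Price appears in text."""
        return re.search(self.keyword_pattern, text) is not None

    def is_greater_than(self, text):
        """True if text contains a keyword phrase for the '>' operator."""
        return re.search(self.greater_than_pattern, text) is not None
```

Keeping the recognizers declarative (plain patterns on a data object) matches the slide's point that the extraction ontology is declared rather than programmed per page.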

Page 24:

Generality & Resiliency of Extraction Ontologies

• Generality: assumptions about web pages
  – Data rich
  – Narrow domain
  – Document types
    • Single-record documents (hard, but doable)
    • Multiple-record documents (harder)
    • Records with scattered components (even harder)
• Resiliency: declarative
  – Still works when web pages change
  – Works for new, unseen pages in the same domain
  – Scalable, but takes work to declare the extraction ontology

Page 25:

Semantic Annotation

Page 26:

Free-Form Query Interpretation

• Parse Free-Form Query (with respect to data extraction ontology)
• Select Ontology
• Formulate Query Expression
• Run Query Over Semantically Annotated Data

Page 27:

Parse Free-Form Query

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Recognized terms: price, mileage, red, Nissan, 1996; “or newer” (>= Operator)

Page 28:

Select Ontology

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Page 29:

Formulate Query Expression

• Conjunctive queries and aggregate queries
• Mentioned object sets are all of interest.
• Values and operator keywords determine conditions.
  – Color = “red”
  – Make = “Nissan”
  – Year >= 1996 (“or newer”: >= Operator)
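The step from a parsed free-form query to conditions can be sketched as follows. Everything here is an illustrative assumption: the recognizer tables, the function `formulate`, and the tuple format are hypothetical stand-ins for what an extraction ontology's data frames would supply.

```python
import re

# Hypothetical recognizers for a tiny car-ad ontology (all names assumed).
VALUE_RECOGNIZERS = {
    "Color": r"\b(red|blue|black|white)\b",
    "Make": r"\b(Nissan|Jeep|Toyota)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OBJECT_SET_KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
OPERATOR_KEYWORDS = {">=": r"\s*or\s+newer\b", "<=": r"\s*or\s+older\b"}


def formulate(query):
    """Return (requested object sets, conjunctive conditions) for a query."""
    # Object sets mentioned by keyword are the ones to return.
    requested = [name for name, kw in OBJECT_SET_KEYWORDS.items()
                 if re.search(kw, query, re.IGNORECASE)]
    conditions = []
    for name, pattern in VALUE_RECOGNIZERS.items():
        m = re.search(pattern, query, re.IGNORECASE)
        if m is None:
            continue
        value, op = m.group(1), "="
        if name == "Year":
            value = int(value)
            # An operator keyword right after the value overrides equality.
            for symbol, kw in OPERATOR_KEYWORDS.items():
                if re.match(kw, query[m.end():], re.IGNORECASE):
                    op = symbol
        conditions.append((name, op, value))
    return requested, conditions
```

For the running example query, this produces the requested object sets Price and Mileage and the three conditions listed above. The conditions are implicitly conjoined, which corresponds to the conjunctive queries mentioned on the slide.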

Page 30:

Formulate Query Expression

[Figure: generated query expression with For, Let, Where, and Return clauses]

Page 31:

Run Query Over Semantically Annotated Data

Page 32:

Great! But Problems Still Need Resolution

• Automating content annotation
  – Extraction-ontology creation: a few dozen person hours
  – Semi-automatic creation
    • FOCIH (Form-based Ontology Creation and Information Harvesting)
    • TISP (Table Interpretation by Sibling Pages)
    • TANGO (Table ANalysis for Generating Ontologies)
• Stepping up to the envisioned Web of Knowledge
  – Current & future work
    • Semi-automatic annotation via synergistic bootstrapping
    • Knowledge bundles for research studies
  – Practicalities

Page 33:

Manual Creation

Page 34:

Manual Creation

Page 35:

Manual Creation

• Library of instance recognizers
• Library of lexicons

Page 36:

Craig’s List Alerter

• Constructed as a “short” class project
  – 10 applications
  – a few dozen hours

• Demo

Page 37:

FOCIH: Form-based Ontology Creation and Information Harvesting

• Forms (general familiarity)
• Information Harvesting
• Semi-automatic extraction ontology creation
  – Form-based generation of conceptual model
  – Instance-recognizer creation
    • Lexicons
    • Some pre-existing instance recognizers

Page 38:

FOCIH Form Creation

Page 39:

FOCIH Ontology Generation

Page 40:

FOCIH Information Harvesting

Page 41:

FOCIH Information-Harvesting Demo

Page 42:

TISP: Table Interpretation with Sibling Pages

Page 43:

Interpretation Technique: Sibling Page Comparison

Same

Page 44:

Interpretation Technique: Sibling Page Comparison

Almost Same

Page 45:

Interpretation Technique: Sibling Page Comparison

Different

Same

Page 46:

Technique Details

• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (table for layout: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment

Page 47:

Table Unnesting

Page 48:

Simple Tree Matching Algorithm

Labels

Values

[Yang91]

Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
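The simple tree matching algorithm cited as [Yang91] computes the size of a maximum order-preserving matching between two labeled trees by dynamic programming over their child lists. Below is a minimal sketch of that recurrence. The `(label, children)` tuple encoding and the normalized `match_score` are assumptions made for illustration, and the thresholds that separate exact/near-exact, sibling-table, and false matches are not given in the slides, so none are shown.

```python
def simple_tree_matching(a, b):
    """Number of node pairs in a maximum matching of two ordered trees.

    A tree is a (label, children) pair, where children is a list of trees.
    """
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # roots differ: nothing below them can be matched
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best matching between the first i and the first j subtrees.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def size(tree):
    """Total number of nodes in a tree."""
    return 1 + sum(size(kid) for kid in tree[1])


def match_score(a, b):
    """Score in [0, 1]; this particular normalization is an assumption."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))
```

Two tables with identical tag structure score 1.0, while a table compared with a sibling that has extra or missing rows scores somewhat lower, which is the kind of signal the match score categorization relies on.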

Page 49:

Table Structure Patterns

Regularity Expectations:

• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+

• …

Pattern combinations are also possible.
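One way to picture the two patterns above is to mark each cell as a label (L) when its text is identical in a sibling page's table and as a value (V) when it differs, and then match the resulting L/V signature against the expected regularities. The sketch below follows that reading. It is an assumption about how the comparison could be coded, not the TISP implementation: tables are taken as already-parsed lists of cell strings, and `classify_cells` and `interpret` are hypothetical names.

```python
import re


def classify_cells(table, sibling):
    """Mark each cell 'L' (same text in the sibling table) or 'V' (varies)."""
    return [
        "".join("L" if cell == other else "V" for cell, other in zip(row, sib_row))
        for row, sib_row in zip(table, sibling)
    ]


def interpret(table, sibling):
    """Return {label: value(s)} if one of the two patterns applies, else None."""
    if not table:
        return None
    signature = "/".join(classify_cells(table, sibling))
    # Pattern 1: n rows, each holding one label followed by one value.
    if re.fullmatch(r"LV(/LV)*", signature):
        return {row[0]: row[1] for row in table}
    # Pattern 2: one row of n labels, then one or more rows of n values.
    n = len(table[0])
    if re.fullmatch(rf"L{{{n}}}(/V{{{n}}})+", signature):
        return {
            label: [row[k] for row in table[1:]]
            for k, label in enumerate(table[0])
        }
    return None
```

A value that happens to be identical on both sibling pages would be misread as a label here, so a fuller treatment would compare against several siblings before settling on a pattern.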

Page 50:

Pattern Usage

(Location.Genetic Position) = X:12.69 +/- 0.000 cM [mapping data]
(Location.Genomic Position) = X:13518823..13515773 bp

Page 51:

Dynamic Pattern Adjustment

<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+

<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+ | <tr>(<(td|th)> {L})^6 (<tr>(<(td|th)> {V})^6)+

Page 52:

TISP Demo

Page 53:

TISP/FOCIH Extraction Ontology Creation

• Reverse engineer with TISP
• Adjust with FOCIH
• Data frames
  – Initialize lexicons with harvested data
  – Library of data frames: select and specialize

Page 54:

TISP/FOCIH Extraction Ontology Creation

Page 55:

TISP/FOCIH Extraction Ontology Creation

Page 56:

TISP/FOCIH Extraction Ontology Creation

Page 57:

TISP/FOCIH Extraction Ontology Creation

Page 58:

TISP/FOCIH Extraction Ontology Creation

Page 59:

TISP/FOCIH Extraction Ontology Creation

Page 60:

TANGO: Table Analysis for Generating Ontologies

• Recognize and normalize table information
• Construct mini-ontologies from tables
• Discover inter-ontology mappings
• Merge mini-ontologies into a growing ontology

Page 61:

Recognize Table Information

Country      Population        Religion
             (July 2001 est.)  Albanian Orthodox  Muslim  Roman Catholic  Shi’a Muslim  Sunni Muslim  other
Afghanistan  26,813,057                                                   15%           84%           1%
Albania      3,510,484         20%                70%     10%

Page 62:

Construct Mini-Ontology

Country      Population        Religion
             (July 2001 est.)  Albanian Orthodox  Muslim  Roman Catholic  Shi’a Muslim  Sunni Muslim  other
Afghanistan  26,813,057                                                   15%           84%           1%
Albania      3,510,484         20%                70%     10%

Page 63:

Discover Mappings

Page 64:

Merge

Page 65:

TANGO Demo

Page 66:

Semi-Automatic Annotation via Synergistic Bootstrapping
(Based on Nested Schemas with Regular Expressions)

• Build a page-layout, pattern-based annotator
• Automate layout recognition based on examples
• Auto-generate examples with extraction ontologies
• Synergistically run pattern-based annotator & extraction-ontology annotator

Page 67:

PatML Editor

Browser-Rendered Page

Page Source Text

Information Structure Tree

Page 68:
Page 69:

Synergistic Execution

[Diagram labels: Extraction Ontology; Document; Conceptual Annotator (ontology-based annotation); Partially Annotated Document; Structural Annotator (layout-driven annotation); Annotated Document; Layout Patterns; Pattern Generation]

Page 70:

Knowledge Bundles for Research Studies

To do a recent study about associations between lung cancer and tp53 polymorphism, researchers needed to: (1) do a keyword-based search on the SNP data repository for “tp53” within organism “homo sapiens”; (2) from the returned records, open each record page one by one and find those coding SNPs that have a minor allele frequency greater than 1%; (3) for each qualifying SNP, record the SNP ID and many properties of the SNP; (4) perform a keyword search in PubMed and skim the hundreds of manuscripts found to determine which manuscripts are related to the SNPs of interest and fit their search criteria; and (5) extract the information of interest (e.g., the statistical information, patient information, and treatment information) and organize it.
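Steps (1) to (3) of this manual procedure amount to a search, a filter, and a projection over SNP records, which is what a knowledge bundle would automate. The sketch below assumes the records have already been extracted into dictionaries. The function `qualifying_snps` and the field names (`gene`, `organism`, `coding`, `minor_allele_freq`, `snp_id`) are hypothetical and do not correspond to any real repository's schema.

```python
def qualifying_snps(records, gene="tp53", organism="homo sapiens", min_maf=0.01):
    """Steps (1)-(3): search by gene and organism, filter, record properties.

    Each record is a dict with the assumed keys 'snp_id', 'gene', 'organism',
    'coding' (bool), and 'minor_allele_freq' (a fraction, so 1% is 0.01).
    """
    # Step (1): keyword search restricted to one organism.
    hits = [
        r for r in records
        if r["gene"].lower() == gene and r["organism"].lower() == organism
    ]
    # Steps (2) and (3): keep coding SNPs above the frequency threshold and
    # index their properties by SNP ID.
    return {
        r["snp_id"]: r
        for r in hits
        if r["coding"] and r["minor_allele_freq"] > min_maf
    }
```

Steps (4) and (5), the literature search and the extraction of study details, are where the extraction ontologies described earlier would come in; they are not covered by this sketch.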

Page 71:

Knowledge Bundles for Research Studies

(1): Search, (2): Filter, (3): Record information

Page 72:

Knowledge Bundles for Research Studies

(4): High precision literature search

Page 73:

Knowledge Bundles for Research Studies

(5): Extract and organize

Page 74:

Knowledge Bundles for Research Studies

Page 75:

Knowledge Bundles for Research Studies

Research Challenge:

“I believe that a good biomedical scenario would be to select a topic which already [has a] large structured database (gene extraction, vitamins, blood), and then search for and find web pages that augment, support or refute specific aspects of that database.”

– GN

Page 76:

Practicalities: Bootstrapping the WoK (Future Work)

• Won’t just happen without sufficient content
• Niche applications
  – Historical Data (e.g. Genealogy)
  – Bio-research studies
• Local WoKs
  – Intra-organizational effort
  – Individual interests

Page 77:

Practicalities: Scalability (Future Work)

• Potential rapid growth
  – Thousands of ontologies
  – Millions of simultaneous queries
  – Billions of annotated pages
  – Trillions of facts
• Search-engine-like caching & query processing

Page 78:

Key to Success: Simplicity via Automation

• Automatic (or near automatic) creation of extraction ontologies
• Automatic (or near automatic) annotation of web pages
• Simple but accurate query specification without specialized training

www.deg.byu.edu
www.tango.byu.edu