David W. Embley
Brigham Young University
Provo, Utah, USA
WoK: A Web of Knowledge
A Web of Pages
A Web of Facts
Birthdate of my great grandpa Orson
Price and mileage of red Nissans, 1990 or newer
Location and size of chromosome 17
US states with property crime rates above 1%
Find me an image that is red, dark, scary, and beautiful.
Learn rules to recognize names, even in less-than-ideal OCR’d documents.
• Seed models:
  – Prefix: “Mrs”, “Miss”, “Mr”
  – Initials: “A”, “B”, “C”, …
  – Given Name: “Charles”, “Francis”, “Herbert”
  – Surname: “Goodrich”, “Wells”, “White”
  – Stopword: “Jewell”, “Graves”
• Updates:
  – Prefix: first token in line
  – Given Name: between ‘Prefix’ and ‘Initial’
  – Surname: between ‘Initial’ and </S>
M RS CHARLES A JEWELL
MRS FRANCIS B COOI EN
MRS P W ELILSWVORT
MRs HERBERT C ADSWVORTH
MRS HENRY E TAINTOR
MR DANIEl H WELLS
MRS ARTHUR L GOODRICH
Miss JOSEPHINE WHITE
Mss JULIA A GRAVES
Ms H B LANGDON
Miss MARY H ADAMS
Miss ELIZA F Mix'
MRs MIARY C ST )NECMIRS AI I ERT H PITKIN
Annotating Music and Lyrics
Find something soothing but energetic: Good for recovering patients.
How about Mozart’s 40th Symphony?
Build a knowledge bundle for checking the association between tp53 polymorphism and lung cancer.
U.S. President Barack Obama visited Iraq Monday in a stop that was overshadowed by the question of when U.S. troops should go
home. Obama made his opposition to the U.S.-led invasion of Iraq five years ago a centerpiece of his campaign and was in Baghdad to assess security in Iraq, where violence has fallen to its lowest
level since early 2004.
When has Barack Obama visited Iraq?
Which U.S. Presidents have visited Iraq?
Find names, locations, events, and dates and associations among them for my great grandma Margaret Haines.
I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April 9 1907 to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married Louis Slingerland a former pupil of Connec- Farms High School Mr Slingerland is engaged in building business in St Petersburg JENNIE HAINES Elmwood Ave Union Union Co N J Graduated from State Normal School Trenton N J in 190 5 Principal of Hurden Looker School in Hillside Township formerly a part of Union Town- ship STELLA ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y WALTER BOSCHEN Morris Ave Union N J Completed fourth year at Battin High School in 1900 Attended Rutgers College taking up civil engineering course Has been successful in the business world President of the W G Boschen Sales Co Inc manufacturer general agents for mechanical line GEORGE McQUAIDE Springfield N J Was employed by Morris County Traction Company 1900 No graduates 1901 No graduates 1902 MARGARET HAINES Elmwood Ave Union N J Took up stenography and typewriting and is now employed as private secretary of the Correspondence Department of the Singer Manufacturing Com- pany of Elizabeth N J ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N J CLARENCE GRIGGS Stuyvesant Ave Union N J Graduated from Trenton State Normal School in 1905 having specialized in manual training Taught one half year at Neshanic N J one year at Lin- coin School Roselle N J Teaching manual 'training and mechanical drawing in Newark N J Has taken special courses in Columbia University 34
Who was the first person to land on the Moon?
2002 Jeep Liberty
$7,995 Toll free 1-800-423-0334
Alert! Alert!I found your Jeep Liberty for under $8,000.
• Fundamental questions
  – What is knowledge?
  – What are facts?
  – How does one know?
• Philosophy
  – Ontology
  – Epistemology
  – Logic and reasoning
Toward a Web of Knowledge
• Existence asks “What exists?”
• Concepts, relationships, and constraints with formal foundation
Ontology
• The nature of knowledge asks: “What is knowledge?” and “How is knowledge acquired?”
• Populated conceptual model
Epistemology
• Principles of valid inference – asks: “What is known?” and “What can be inferred?”
• For us, it answers: what can be inferred (in a formal sense) from conceptualized data.
Logic and Reasoning
Find price and mileage of red Nissans, 1990 or newer
• Distill knowledge from the wealth of digital web data
• Annotate web pages
• Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
Making this Work How?
[Diagram labels: Fact, Fact, Fact, …; Annotation, Annotation, …]
Turning Raw Symbols into Knowledge
• Symbols: $ 11,500 117K Nissan CD AC
• Data: price(11,500) mileage(117K) make(Nissan)
• Conceptualized data:
  – Car(C123) has Price($11,500)
  – Car(C123) has Mileage(117,000)
  – Car(C123) has Make(Nissan)
  – Car(C123) has Feature(AC)
• Knowledge
  – “Correct” facts
  – Provenance
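The progression from symbols to data to conceptualized data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RECOGNIZERS` patterns, the `normalize` helper, and the `example-ad-page` provenance label are hypothetical and are not taken from the actual extraction ontologies.

```python
import re

# Hypothetical recognizers (not from the slides): one pattern per concept.
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"\b(\d+K)\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}


def extract_data(symbols):
    """Symbols -> data: label each recognized string with its concept."""
    return [
        (concept, match.group(1))
        for concept, pattern in RECOGNIZERS.items()
        for match in pattern.finditer(symbols)
    ]


def normalize(concept, value):
    """Put a recognized string into the form used for conceptualized data."""
    if concept == "Price":
        return "$" + value
    if concept == "Mileage" and value.endswith("K"):
        return f"{int(value[:-1]) * 1000:,}"
    return value


def conceptualize(car_id, data, source):
    """Data -> conceptualized data: attach each value to one car, with provenance."""
    return [
        {
            "fact": f"Car({car_id}) has {concept}({normalize(concept, value)})",
            "provenance": source,
        }
        for concept, value in data
    ]


symbols = "$ 11,500 117K Nissan CD AC"
data = extract_data(symbols)
facts = conceptualize("C123", data, source="example-ad-page")
for fact in facts:
    print(fact["fact"], "| from", fact["provenance"])
```

Keeping the provenance next to each fact is what lets a user check a claimed fact against the page it came from.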
Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans – I want a 1990 or newer.
Data Extraction Demo
Semantic Annotation Demo
Free-Form Query Demo
Explanation: How it Works
• Extraction Ontologies
• Semantic Annotation
• Free-Form Query Interpretation
Extraction Ontologies
Object sets
Relationship sets
Participation constraints
Lexical
Non-lexical
Primary object set
Aggregation
Generalization/Specialization
Extraction Ontologies
External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
Key Word Phrase
Left Context: $
Data Frame:
Internal Representation: float
Values
Key Words: ([Pp]rice)|([Cc]ost)| …
Operators
Operator: >
Key Words: (more\s*than)|(more\s*costly)|…
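A data frame of this kind might be modeled as below. This is a minimal sketch, not the actual implementation: the `DataFrame` class, its field names, and the helper functions are hypothetical, and the value pattern is an adaptation that also accepts comma-grouped amounts. The keyword and operator patterns are the ones listed above, with the elided alternatives left out.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DataFrame:
    """Recognizers and operators attached to one lexical object set."""
    name: str
    external_rep: re.Pattern          # how values appear in text
    keywords: re.Pattern              # context words that signal the object set
    to_internal: Callable[[str], float]
    operators: Dict[str, re.Pattern] = field(default_factory=dict)


PRICE = DataFrame(
    name="Price",
    # Assumed variant of the external representation: "$" as left context,
    # then digits with optional comma grouping and optional cents.
    external_rep=re.compile(r"[$]\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)"),
    keywords=re.compile(r"([Pp]rice)|([Cc]ost)"),
    to_internal=lambda text: float(text.replace(",", "")),
    operators={">": re.compile(r"(more\s*than)|(more\s*costly)")},
)


def recognize(frame: DataFrame, text: str) -> List[float]:
    """Return every value in text, converted to the internal representation."""
    return [frame.to_internal(m.group(1)) for m in frame.external_rep.finditer(text)]


def find_operator(frame: DataFrame, text: str) -> Optional[str]:
    """Return the first operator whose keyword phrase occurs in text, if any."""
    for symbol, pattern in frame.operators.items():
        if pattern.search(text):
            return symbol
    return None


print(recognize(PRICE, "2002 Jeep Liberty $7,995 Toll free 1-800-423-0334"))
print(find_operator(PRICE, "cars that cost more than $5,000"))
```

Because the left context `$` is required, the phone number in the example ad is not mistaken for a price.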
Generality & Resiliency of Extraction Ontologies
• Generality: assumptions about web pages
  – Data rich
  – Narrow domain
  – Document types
    • Single-record documents (hard, but doable)
    • Multiple-record documents (harder)
    • Records with scattered components (even harder)
• Resiliency: declarative
  – Still works when web pages change
  – Works for new, unseen pages in the same domain
  – Scalable, but takes work to declare the extraction ontology
Semantic Annotation
Free-Form Query Interpretation
• Parse Free-Form Query (with respect to data extraction ontology)
• Select Ontology
• Formulate Query Expression
• Run Query Over Semantically Annotated Data
Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized terms: price, mileage, red, Nissan, 1996, “or newer” (>= Operator)
Select Ontology
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
• Conjunctive queries and aggregate queries
• Mentioned object sets are all of interest.
• Values and operator keywords determine conditions.
  – Color = “red”
  – Make = “Nissan”
  – Year >= 1996
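How keywords, values, and operator phrases could yield the conditions above is sketched here. The three dictionaries stand in for what an extraction ontology's data frames would supply; their names and contents are hypothetical, and real query interpretation handles far more than this toy matcher.

```python
import re

# Hypothetical stand-ins for the recognizers an extraction ontology provides.
OBJECT_SET_KEYWORDS = {
    "Price": r"\b(price|cost)\b",
    "Mileage": r"\b(mileage|miles)\b",
}
VALUE_PATTERNS = {
    "Color": r"\b(red|blue|green|black|white)\b",
    "Make": r"\b(Nissan|Toyota|Honda|Jeep)",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATOR_KEYWORDS = {
    ">=": r"or\s+newer",
    "<=": r"or\s+older",
}


def parse_query(query):
    """Return (mentioned object sets, conditions) for a free-form query."""
    mentioned = [
        name
        for name, keyword in OBJECT_SET_KEYWORDS.items()
        if re.search(keyword, query, re.IGNORECASE)
    ]
    conditions = []
    for name, pattern in VALUE_PATTERNS.items():
        match = re.search(pattern, query, re.IGNORECASE)
        if match is None:
            continue
        operator = "="  # default: a recognized value is an equality condition
        tail = query[match.end():]
        for symbol, keyword in OPERATOR_KEYWORDS.items():
            # An operator phrase right after the value overrides the default.
            if re.match(r"\s*" + keyword, tail, re.IGNORECASE):
                operator = symbol
        conditions.append((name, operator, match.group(1)))
    return mentioned, conditions


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
mentioned, conditions = parse_query(query)
print(mentioned)
print(conditions)
```

The ontology whose recognizers match the most of the query would be the natural one to select, which is why parsing and ontology selection are closely tied.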
Formulate Query Expression
[Figure: query expression with For, Let, Where, and Return clauses]
Run Query Over Semantically Annotated Data
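Running a conjunctive query over annotated data amounts to keeping the records that satisfy every condition and returning the requested object sets. The sketch below assumes the annotated data has already been collected into simple records; the `CARS` list is made-up illustration data, not output of the actual system.

```python
import operator

OPERATORS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

# Made-up annotated records; "source" plays the role of provenance.
CARS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 6500, "Mileage": 92000, "source": "page-1"},
    {"Make": "Nissan", "Color": "red", "Year": 1989, "Price": 1200, "Mileage": 180000, "source": "page-2"},
    {"Make": "Jeep", "Color": "red", "Year": 2002, "Price": 7995, "Mileage": 80000, "source": "page-3"},
    {"Make": "Nissan", "Color": "blue", "Year": 2001, "Price": 9000, "Mileage": 60000, "source": "page-4"},
]


def run_query(records, wanted, conditions):
    """Return the wanted fields of every record meeting all conditions."""
    results = []
    for record in records:
        satisfied = all(
            attribute in record and OPERATORS[op](record[attribute], value)
            for attribute, op, value in conditions
        )
        if satisfied:
            results.append({name: record.get(name) for name in wanted})
    return results


conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
results = run_query(CARS, ["Price", "Mileage", "source"], conditions)
print(results)
```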
• Automating content annotation
  – Extraction-ontology creation: a few dozen person-hours
  – Semi-automatic creation
    • FOCIH (Form-based Ontology Creation and Information Harvesting)
    • TISP (Table Interpretation by Sibling Pages)
    • TANGO (Table ANalysis for Generating Ontologies)
• Stepping up to the envisioned Web of Knowledge
  – Current & future work
    • Semi-automatic annotation via synergistic bootstrapping
    • Knowledge bundles for research studies
  – Practicalities
Great! But Problems Still Need Resolution
Manual Creation
– Library of instance recognizers
– Library of lexicons
Craig’s List Alerter
• Constructed as a “short” class project
  – 10 applications
  – a few dozen hours
• Demo
FOCIH: Form-based Ontology Creation and Information Harvesting
• Forms (general familiarity)
• Information Harvesting
• Semi-automatic extraction ontology creation
  – Form-based generation of conceptual model
  – Instance-recognizer creation
    • Lexicons
    • Some pre-existing instance recognizers
FOCIH Form Creation
FOCIH Ontology Generation
FOCIH Information Harvesting
FOCIH Information-Harvesting Demo
TISP: Table Interpretation with Sibling Pages
Interpretation Technique: Sibling Page Comparison
[Figure labels across three slides: Same; Almost Same; Different; Same]
Technique Details
• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (table for layout: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
Table Unnesting
Simple Tree Matching Algorithm
Labels
Values
[Yang91]
Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
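The simple tree matching algorithm counts the largest set of matching node pairs between two ordered trees, using dynamic programming over the children of matching nodes. The sketch below follows the commonly published form of that algorithm; the `Node` class and the normalization in `match_score` (matched pairs divided by the average tree size) are assumptions for illustration, and no category thresholds are given because the slides do not state them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A DOM-like node: a tag name and an ordered list of children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum matching of two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matching between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matching roots


def size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(size(child) for child in node.children)


def match_score(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs over the average tree size."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def row(cells: int) -> Node:
    return Node("tr", [Node("td") for _ in range(cells)])


table_a = Node("table", [row(2), row(2)])
table_b = Node("table", [row(2), row(2), row(1)])
print(simple_tree_matching(table_a, table_b), match_score(table_a, table_b))
```

A score of 1.0 means the two table structures are identical; sibling tables would be expected to score high but below that, since only tags are compared here and not cell contents.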
Table Structure Patterns
Regularity Expectations:
• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
• …
Pattern combinations are also possible.
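The two regularity expectations can be checked mechanically once each cell is marked as a label {L} or a value {V}. The sketch below assumes, as the sibling-page comparison suggests, that a cell is a label when the sibling page has the same text in the same position and a value otherwise. It also assumes the two tables have the same shape; the function names and the sibling table's values are made up for illustration.

```python
def mark_cells(table, sibling):
    """Mark each cell 'L' if the sibling has the same text there, else 'V'."""
    return [
        ["L" if cell == sibling_cell else "V"
         for cell, sibling_cell in zip(table_row, sibling_row)]
        for table_row, sibling_row in zip(table, sibling)
    ]


def detect_pattern(marks):
    """Return the name of the first regularity expectation the marks satisfy."""
    if marks and all(mark_row == ["L", "V"] for mark_row in marks):
        return "label-value rows"
    if (
        len(marks) >= 2
        and all(mark == "L" for mark in marks[0])
        and all(
            len(mark_row) == len(marks[0]) and all(mark == "V" for mark in mark_row)
            for mark_row in marks[1:]
        )
    ):
        return "label row then value rows"
    return None


def interpret(table, pattern):
    """Pair labels with values according to the detected pattern."""
    if pattern == "label-value rows":
        return [{label: value for label, value in table}]
    if pattern == "label row then value rows":
        labels = table[0]
        return [dict(zip(labels, values)) for values in table[1:]]
    return []


page = [
    ["Genetic Position", "X:12.69 +/- 0.000 cM"],
    ["Genomic Position", "X:13518823..13515773 bp"],
]
# Made-up sibling page: same labels, different values.
sibling_page = [
    ["Genetic Position", "I:1.50 +/- 0.010 cM"],
    ["Genomic Position", "I:100..200 bp"],
]
pattern = detect_pattern(mark_cells(page, sibling_page))
records = interpret(page, pattern)
print(pattern, records)
```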
Pattern Usage
(Location.Genetic Position) = X:12.69 +/- 0.000 cM [mapping data]
(Location.Genomic Position) = X:13518823..13515773 bp
Dynamic Pattern Adjustment
<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+
<tr>(<(td|th)> {L})^5 (<tr>(<(td|th)> {V})^5)+ | <tr>(<(td|th)> {L})^6 (<tr>(<(td|th)> {V})^6)+
TISP Demo
TISP/FOCIH Extraction Ontology Creation
• Reverse engineer with TISP
• Adjust with FOCIH
• Data frames
  – Initialize lexicons with harvested data
  – Library of data frames: select and specialize
TANGO: Table Analysis for Generating Ontologies
• Recognize and normalize table information
• Construct mini-ontologies from tables
• Discover inter-ontology mappings
• Merge mini-ontologies into a growing ontology
Recognize Table Information
                                Religion
Country      Population         Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)   Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                     15%     84%     1%
Albania      3,510,484          20%       70%     10%
Construct Mini-Ontology
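One way to picture mini-ontology construction is to normalize the recognized table into object sets and facts. The sketch below is only an assumption about what such a normalization might look like: the function name, the fact tuples, and the `Percent` object set are hypothetical, and the cell placement assumes each row's percentages belong to the religion columns in left-to-right order.

```python
HEADER = [
    "Country", "Population (July 2001 est.)",
    "Albanian Orthodox", "Muslim", "Roman Catholic",
    "Shi'a Muslim", "Sunni Muslim", "other",
]
RELIGION_COLUMNS = HEADER[2:]
# Assumed cell placement; empty strings are empty cells.
ROWS = [
    ["Afghanistan", "26,813,057", "", "", "", "15%", "84%", "1%"],
    ["Albania", "3,510,484", "20%", "70%", "10%", "", "", ""],
]


def build_mini_ontology(header, rows, category_name, category_columns):
    """Return (object sets, facts) for a table with one nested header group."""
    key = header[0]
    plain_columns = [h for h in header[1:] if h not in category_columns]
    object_sets = [key] + plain_columns + [category_name, "Percent"]
    facts = []
    for row in rows:
        cells = dict(zip(header, row))
        for column, value in cells.items():
            if column == key or value == "":
                continue
            if column in category_columns:
                # Ternary fact: Country has Religion at Percent.
                facts.append((cells[key], category_name, column, value))
            else:
                # Binary fact: Country has Population.
                facts.append((cells[key], column, value))
    return object_sets, facts


object_sets, facts = build_mini_ontology(HEADER, ROWS, "Religion", RELIGION_COLUMNS)
print(object_sets)
for fact in facts:
    print(fact)
```

Turning the religion column headers into values of a single Religion object set is what makes the result mergeable with other tables that list religions as data rather than as headers.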
Discover Mappings
Merge
TANGO Demo
• Build a page-layout, pattern-based annotator
• Automate layout recognition based on examples
• Auto-generate examples with extraction ontologies
• Synergistically run pattern-based annotator & extraction-ontology annotator
Semi-Automatic Annotation via Synergistic Bootstrapping
(Based on Nested Schemas with Regular Expressions)
PatML Editor
Browser-Rendered Page
Page Source Text
Information Structure Tree
Synergistic Execution
Extraction Ontology
Document
Conceptual Annotator
(ontology-based annotation)
Partially Annotated Document
Structural Annotator
(layout-driven annotation)
Annotated Document
Layout Patterns
Pattern Generation
Knowledge Bundles for Research Studies
To do a recent study about associations between lung cancer and tp53 polymorphism, researchers needed to: (1) do a keyword-based search on the SNP data repository for “tp53” within organism “homo sapiens”; (2) from the returned records, open each record page one by one and find those coding SNPs that have a minor allele frequency greater than 1%; (3) for each qualifying SNP, record the SNP ID and many properties of the SNP; (4) perform a keyword search in PubMed and skim the hundreds of manuscripts found to determine which manuscripts are related to the SNPs of interest and fit their search criteria; and (5) extract the information of interest (e.g., the statistical information, patient information, and treatment information) and organize it.
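Steps (1) through (3) of this workflow are mechanical once the repository records are available as structured data, which is the point of a knowledge bundle. The sketch below uses made-up records and hypothetical field names (`snp_id`, `coding`, `minor_allele_frequency`); it does not reflect the schema of any real SNP repository.

```python
# Made-up records standing in for repository entries; IDs and frequencies
# are placeholders, not real data.
SNP_RECORDS = [
    {"snp_id": "snp-example-1", "gene": "tp53", "organism": "homo sapiens",
     "coding": True, "minor_allele_frequency": 0.12},
    {"snp_id": "snp-example-2", "gene": "tp53", "organism": "homo sapiens",
     "coding": True, "minor_allele_frequency": 0.004},
    {"snp_id": "snp-example-3", "gene": "tp53", "organism": "homo sapiens",
     "coding": False, "minor_allele_frequency": 0.30},
    {"snp_id": "snp-example-4", "gene": "brca1", "organism": "homo sapiens",
     "coding": True, "minor_allele_frequency": 0.20},
]


def search(records, gene, organism):
    """Step 1: keyword search by gene name within one organism."""
    return [
        r for r in records
        if r["gene"].lower() == gene.lower()
        and r["organism"].lower() == organism.lower()
    ]


def qualifies(record, min_frequency=0.01):
    """Step 2: keep coding SNPs with minor allele frequency above 1%."""
    return record["coding"] and record["minor_allele_frequency"] > min_frequency


def record_properties(record):
    """Step 3: record the SNP ID together with its other properties."""
    properties = {k: v for k, v in record.items() if k != "snp_id"}
    return {"snp_id": record["snp_id"], "properties": properties}


bundle = [
    record_properties(r)
    for r in search(SNP_RECORDS, "tp53", "homo sapiens")
    if qualifies(r)
]
print(bundle)
```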
(1): Search, (2): Filter, (3): Record information
(4): High-precision literature search
(5): Extract and organize
Research Challenge:
“I believe that a good biomedical scenario would be to select a topic which already [has a] large structured database (gene extraction, vitamins, blood), and then search for and find web pages that augment, support or refute specific aspects of that database.”
– GN
• Won’t just happen without sufficient content
• Niche applications
  – Historical Data (e.g. Genealogy)
  – Bio-research studies
• Local WoKs
  – Intra-organizational effort
  – Individual interests
Practicalities: Bootstrapping the WoK (Future Work)
• Potential rapid growth
  – Thousands of ontologies
  – Millions of simultaneous queries
  – Billions of annotated pages
  – Trillions of facts
• Search-engine-like caching & query processing
Practicalities: Scalability (Future Work)
• Automatic (or near automatic) creation of extraction ontologies
• Automatic (or near automatic) annotation of web pages
• Simple but accurate query specification without specialized training
Key to Success: Simplicity via Automation
www.deg.byu.edu
www.tango.byu.edu