Upload
arabella-waters
View
215
Download
2
Embed Size (px)
Citation preview
The Semantic Web:there and back again
Tim FininUniversity of Maryland, Baltimore County
Joint work with Lushan Han, Varish Mulwad, Anupam Joshi
http://ebiq.org/r/353
LOD 123: Making the semantic web easier to use
Tim FininUniversity of Maryland, Baltimore County
Joint work with Lushan Han, Varish Mulwad, Anupam Joshi
Semantic Web: then and now
• Ten years ago the we developed complex ontologies used to encode and reason over small datasets of 1000s of facts
• Recently the focus has shifted to using simple ontologies and minimal reasoning over very large datasets of 100s of millions of facts
• Major companies are moving: Google Know-ledge Graph, Facebook Open Graph, Microsoft Satori, Apple Siri KB, IMB Watson KB Linked open data or “Things, not strings”
Linked Open Data (LOD)• Linked data is just RDF data, lots of it,
with a small schema• RDF data is a graph of triples (subject, predicate
– URI URI String: dbr:Barack_Obama dbo:spouse “Michelle Obama”
– URI URI URI: dbr:Barack_Obama dbo:spouse dbpedia:Michelle_Obama
• Best linked data practice prefers 2nd pattern, using nodes rather than strings for “entities”– Things, not strings!
• Linked open data is just linked data freely acces-sible on the Web along with their ontologies
Semantic web technologies allow machines to share data and knowledge using common web language and protocols.
~ 1997
Semantic Web
Semantic Web beginning
Use Semantic Web Technology to publish shared data & knowledge
2007
Semantic Web => Linked Open DataUse Semantic Web Technology to publish shared data & knowledge
Data is inter-linked to support inte-gration and fusion of knowledge
LOD beginning
2008
Semantic Web => Linked Open DataUse Semantic Web Technology to publish shared data & knowledge
Data is inter-linked to support inte-gration and fusion of knowledge
LOD growing
2009
Semantic Web => Linked Open DataUse Semantic Web Technology to publish shared data & knowledge
Data is inter-linked to support inte-gration and fusion of knowledge
… and growing
Linked Open Data
2010
LOD is the new Cyc: a common source of background
knowledge
Use Semantic Web Technology to publish shared data & knowledge
Data is inter-linked to support inte-gration and fusion of knowledge
…growing faster
Linked Open Data
2011: 31B facts in 295 datasets interlinked by 504M assertions on ckan.net
LOD is the new Cyc: a common source of background
knowledge
Use Semantic Web Technology to publish shared data & knowledge
Data is inter-linked to support inte-gration and fusion of knowledge
Exploiting LOD not (yet) Easy
• Publishing or using LOD data hasinherent difficulties for the potential user
– It’s difficult to explore LOD data and to query it for answers
– It’s challenging to publish data using appropriate LOD vocabularies & link it to existing data
• Problem: O(104) schema terms, O(1011) instances
• I’ll describe two ongoing research projects that are addressing these problems
GoRelations:Intuitive Query System
for Linked Data
Research with Lushan Han
http://ebiq.org/j/93
Dbpedia is the Stereotypical LOD•DBpedia is an important example of Linked Open Data–Extracts structured data from Infoboxes in Wikipedia –Stores in RDF using custom ontologies Yago terms
•The major integration point for the entire LOD cloud•Explorable as HTML, but harder to query in SPARQL
DBpedia
Browsing DBpedia’s
Mark Twain
Why it’s hard to query LOD• Querying DBpedia requires a lot of a user– Understand the RDF model– Master SPARQL, a formal query language– Understand ontology terms: 320 classes & 1600 properties !– Know instance URIs (>2M entities !)– Term heterogeneity (Place vs. PopulatedPlace)
• Querying large LODsets overwhelming
• Natural languagequery systems stilla research goal
Goal
• Let users with a basic understanding of RDF query DBpedia and other LOD collections– Explore what data is in the system– Get answers to question– Create SPARQL queries for reuse or adaptation
• Desiderata– Easy to learn and to use– Good accuracy (e.g., precision and recall)– Fast
Key Idea
Structured keyword queries reduce problem complexity:
– User enters a simple graph, and– Annotates the nodes and arcs with
words and phrases
Structured Keyword Queries
• Nodes are entities and links binary relations• Entities described by two unrestricted terms:
name or value and type or concept• Outputs marked with ? • Compromise between a natural language Q&A
system and formal query–Users provide compositional structure of the question–Free to use their own terms to annotate structure
Translation – Step Onefinding semantically similar ontology terms
For each graph concept/relation, generate k most semantically similar ontology classes/properties
Lexical similarity metric based on distributional similarity, LSA, and WordNet
Semantic similarity: http://bit.ly/SEMSIM
Semantic Textual Similarity task• 2013 lexical and computational semantics conference• Do two sentences have same meaning (0…5)
1: “The woman is playing the violin” vs. “The young lady enjoys listening to the guitar”4: "In May 2010, the troops attempted to invade Kabul” vs. "The US army invaded Kabul on May 7th last year, 2010"
• 2012: 35 teams, 88 runs, 2013: 36 teams, 89 runs• 2250 sentence pairs
from four domains• Our three runs
#1, #2 and #4
Dataset PairingWords Galactus Saiyan
Headlines (750 pairs) 0.7642 (3) 0.7428 (7) 0.7838 (1)
OnWN (561 pairs) 0.7529 (5) 0.7053 (12) 0.5593 (36)
FNWN (189 pairs) 0.5818 (1) 0.5444 (3) 0.5815 (2)
SMT (750 pairs) 0.3804 (8) 0.3705 (11) 0.3563 (16)
Weighted mean 0.6181 (1) 0.5927 (2) 0.5683 (4)
• Assemble best interpretation using statistics of the data
• Use pointwise mutual informa-tion (PMI) between RDF terms in the LOD collection
Measures degree to which two RDF terms co-occur in knowledge base
• In a good interpretation, ontology terms associate like their corresponding user terms connect in the query
Translation – Step Twodisambiguation algorithm
Three aspects are combined to derive an overall goodness measure for each candidate interpretation
Translation – Step Twodisambiguation algorithm
Joint disam-biguation
Resolvingdirection
Link reason-ableness
Translation resultConcepts: Place => Place, Author => Writer, Book => BookProperties: born in => birthPlace, wrote => author (inverse direction)
The translation of a semantic graph query to SPARQL is straightforward given the mappings
SPARQL Generation
Concepts•Place => Place•Author => Writer•Book => Book
Relations•born in => birthPlace•wrote => author
Evaluation• 33 test questions from 2011 Workshop on Question
Answering over Linked Data answerable using DBpedia• Three human subjects unfamiliar with DBpedia translated
the test questions into semantic graph queries• Compared with two top natural language QA systems:
PowerAqua and True Knowledge
Current work
• Baseline system works well for Dbpedia; we’re testing a second use case now
• Current work– Better entity matching– Relaxing the need for type information– A better Web interface with user feedback & advice
• See http://ebiq.org/93 for more information & try our alpha version at http://ebiq.org/GOR
Generating Linked Databy Inferring the
Semantics of Tables
Research with Varish Mulwad
http://ebiq.org/j/96
Goal: Table => LOD*
Name Team Position Height
Michael Jordan Chicago Shooting guard 1.98
Allen Iverson Philadelphia Point guard 1.83
Yao Ming Houston Center 2.29
Tim Duncan San Antonio Power forward 2.11
http://dbpedia.org/class/yago/NationalBasketballAssociationTeams
http://dbpedia.org/resource/Allen_Iverson Player height in meters
dbprop:team
* DBpedia
Goal: Table => LOD*
Name Team Position HeightMichael Jordan Chicago Shooting guard 1.98
Allen Iverson Philadelphia Point guard 1.83
Yao Ming Houston Center 2.29
Tim Duncan San Antonio Power forward 2.11
@prefix dbpedia: <http://dbpedia.org/resource/> .@prefix dbo: <http://dbpedia.org/ontology/> .@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbo:BasketballPlayer ."Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael Jordan .dbpedia:Michael Jordan a dbo:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago Bulls .dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams .
RDF Linked Data
All this in a completely automated way* DBpedia
Tables are everywhere !! … yet …
The web – 154 million high quality relational tables
Fewer than 1% of the 400K tables at data.gov have rich semantic schemas
Sampling Acronym detection
Pre-processing modules
Query and generate initial mappings
2 1
Generate Linked RDF Verify (optional) Store in a knowledge base & publish as LOD
Joint Inference/Assignment
A Domain Independent Framework
Query and Rank
Rank(String Similarity,
Popularity)Chicago
Boston
Allen Iverson 1. Chicago_Bulls2. Chicago3. Judy_Chicago
possible entitiesfor Chicago
Can be replaced by Domain Specific / other LOD knowledge bases
Generating candidate ‘types’ for Columns
Team
Chicago
Philadelphia
Houston
San Antonio
Class
Instance
1. Chicago Bulls2. Chicago3. Judy Chicago
{dbpedia-owl:Place,dbpedia-owl:City,yago:WomenArtist,yago:LivingPeople,yago:NationalBasketballAssociationTeams }
{dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film,yago:NationalBasketballAssociationTeams …. ….. ….. }
{……………………………………………………………. }
dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film ….
Joint Inference / Assignment
A graphical model for tablesJoint inference over evidence in a table
C1 C2 C3
R11
R12
R13
R21
R22
R23
R31
R32
R33
Team
Chicago
Philadelphia
Houston
San Antonio
Class
Instance
Parameterized Graphical Model
C1 C2C3
𝝍𝟓
R11 R12 R13 R21 R22 R23 R31 R32 R33
Function capturing affinity between column headers and row values
Row value
Variable Node: Column header
Captures interaction between column headers
Factor Node
Captures interaction between row values
Standard message passingP(C1, C2, C3, R11, R12 ,R13, R21, R22, R23, R31, R32, R33)Joint Assignment :
P(C1, R11, R12 ,R13)
P(C3, R31, R32, R33)
P(C2,R21, R22, R23)
P(R31, R32, R33)
Graphical Models : Exploit Conditional
Independences
Still …
C1 R11 R12 R13 Val
4 Variables; Each having
25 options -- 390,625
entries !
Semantic message passing
Michael_I_Jordan Chicago_BullsShooting_guard
“Change”Related to Chicago_Bulls
“No Change”“No Change”
R11:[Michael_I_Jordan]
R12:[Yao_Ming]
R13:[Allen_Iverson]
R21:[Chicago_Bulls]
R31:[Shooting_Guard]
……
C1:[BasketballPlayer] C2:[NBATeam] C3:[BasketBallPositions]
Yao_Ming Allen_Iverson
BasketballPlayer
NBATeam BasketBallPositions
“Change”BasketBall
Player
“No Change”
“No Change”
“No Change”“No Change”
“No Change”
“No Change”
Inference – ExampleR11:
[Michael_I_Jordan]
R12:[Allen_Iverson] R13:[Yao_Ming]
C1:[Name]
(Michael_I_Jordan, Yao_Ming, Allen_Iverson)
“Change”“No Change”“No Change”
“BasketBallPlayer”
R11
Michael Jordan
1. Michael_I_Jordan (Professor)2. ….. 3. Michael_Jordan (BasketballPlayer)
….
Michael_I_Jordan Allen_Iverson Yao_Ming
“BasketBallPlayer”
Column header – row value agreement[Michael_I_Jordan, Allen_Iverson, Yao_Ming]
WomenArtistCityPopulatedPlaceAthleteFilm
LivingPeopleGeoPopulatedPlaceBasketBallPlayerArtWork
Name
Michael_I_Jordan
Allen_Iverson
Yao_Ming BasketballPlayerAtheleteLivingPeople
LivingPeopleAI_Researchers +1
+1
+1
+1
+1 1. Athlete2. City 3. …
1.LivingPeople2. BasketBallPlayer3.GeoPopulatedPlace….
Yago Tie-breaker/Re-order : Choose more ‘descriptive’ class. E.g. BasketBallPlayer better than LivingPeople
ClassGranularityScore = 1-[]
1: Majority Voting
2: Choose the top Class
Top Yago : BasketBallPlayer TopDBpedia : Athelete
Column header – row value agreementtopClassScore = (numberOfVotes)/(numberofRows)
Compute topScoreYago & topScoreDBpedia
(both)Score < Threshold(topScoreYago || topScoreDBpedia) >= Threshold
Check for AlignmentIs Athelete sub/superClass of BasketBallPlayer ?
Columnn Header Annotation = BasketBallPlayer, Athlete
Name
Michael_I_Jordan
Allen_Iverson
Yao_Ming BasketballPlayerAtheleteLivingPeople
LivingPeopleAI_Researchers Change
No - Change
Update Column Header Annotation = “No-Annotation”
Update row value entity annotations
R11
Michael Jordan
1. Michael_I_Jordan 2. ….. 3. Michael_Jordan ….
LivingPeopleAI_Researchers
BasketBallPlayerAthlete
𝝍𝟑 “CHANGE”
Entity Class : BasketBallPlayer or
Athlete
R11 Michael Jordan Michael_Jordan
Candidate Entities for R11
Evaluation
• Dataset of 80 tables (Wikipedia tables; part of larger dataset released by IIT-Bombay)
• Evaluated Column Header Annotation Accuracy– How good was the mapping Team to
NationalBasketballAssociationTeams• Evaluated Entity Linking Accuracy
– Mapping Michael Jordan to Michael_Jordan
Column Header Annotation Accuracy
• System produced a ranked a list of Yago & DBpedia classes
• Human judges evaluated each class• For precision, judges scored each class
• 1 if the class was accurate• 0.5 if the class ok, but not best (e.g., Place vs. City)• 0 if it was incorrect
• For Recall, score 1 if accurate/correct, 0 for incorrect
522
259
422
Accurate
Okay
Incorrect
Top K classes, F-measure
Yago rank 1
Yago rank 2
Yago rank 3
dbp rank 1
dbp rank 2
dbp rank 3
GOOG IIT-B
F-Measure 0.688360450563
204
0.487568135310
842
0.438263244010
532
0.677154394707
449
0.626750158305
777
0.612644415917
844
0.67 0.56
0.05
0.15
0.25
0.35
0.45
0.55
0.65
0.75
Column Header Annotations
F-Measure
Entity linking accuracy
http://dbpedia.org/resource/Allen_IversonAllen Iverson
Correctly Linked Entities
3022
Incorrectly Linked Entities
959
Total Entities 3981
Accuracy : 75.91 %
Other Challenges• Using table captions and other text is
associated documents to provide context• Size of some data.gov tables (> 400K rows!)
makes using full graphical model impractical– Sample table and run model on the subset
• Achieving acceptable accuracy may require human input– 100% accuracy unattainable automatically– How best to let humans offer advice and/or
correct interpretations?
Final Conclusions
• Linked data great for sharing structured and semi-structured data– Backed by machine-understandable semantics– Uses successful Web languages and protocols
• Generating and exploring linked data resources is challenging– Schemas are too large, too many URIs
• New tools mapping tables to linked data and translating structured natural language queries reduce the barriers
http://ebiq.org/