Type Inference on Noisy RDF Data
Heiko Paulheim, Christian Bizer
The Problem
One promise of the Semantic Web:
You can issue structured queries
e.g., List all presidents that graduated from Harvard Law School
SELECT ?x WHERE {
?x a dbpedia-owl:President .
?x dbpedia-owl:almaMater
dbpedia:Harvard_Law_School }
The Problem
...if we run this against DBpedia, we get one result,
i.e., Elwell Stephen Otis
But...
The Problem
So what is going wrong?
In DBpedia, Barack Obama is not of type President!
How can we add missing types?
Is It a Big Problem?
DBpedia has at least 2.7 million missing type statements
w.r.t. the DBpedia ontology
found using co-occurrence analysis of matching classes in YAGO and DBpedia
a very optimistic lower bound
Highly incomplete classes:
Species: >870,000 missing statements
Person: >510,000 missing statements
Event: >150,000 missing statements
A Naive Approach
Idea: exploit properties with domain and range
Pseudo RDFS Reasoning:
CONSTRUCT {?x a ?t}
WHERE { {?x ?r ?y . ?r rdfs:domain ?t}
UNION {?y ?r ?x . ?r rdfs:range ?t} }
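The same rule can be sketched in Python on toy data. This is only an illustration: the triples and the domain/range tables are assumptions for the example, not actual DBpedia schema or data.

```python
# Toy triples (subject, property, object); illustrative only.
TRIPLES = [
    ("Kurt_H._Debus", "award", "Germany"),  # a single noisy statement
    ("Berlin", "country", "Germany"),
]

# Hypothetical schema for the example: property -> class.
DOMAIN = {"award": "Person", "country": "Place"}   # stands in for rdfs:domain
RANGE = {"award": "Award", "country": "Country"}   # stands in for rdfs:range


def infer_types(triples, domain, range_):
    """Return {resource: set of types} derived only from domain and range."""
    types = {}
    for s, p, o in triples:
        if p in domain:
            types.setdefault(s, set()).add(domain[p])  # subject gets the domain
        if p in range_:
            types.setdefault(o, set()).add(range_[p])  # object gets the range
    return types


inferred = infer_types(TRIPLES, DOMAIN, RANGE)
print(sorted(inferred["Germany"]))
```

Every matching statement contributes a type unconditionally, so the one noisy award statement is enough to add a type to Germany that the correct statement does not support.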
A Naive Approach
Experiment with Barack Obama: Person, PersonFunction, Actor, Organization
Experiment with Germany:Place, Award, Populated Place, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation
A Naive Approach
What is going on here?
DBpedia data is noisy
One wrong statement is enough for a wrong conclusion
e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
Germany example: 69,000 statements
20 wrong types can come from 20 wrong statements
i.e., an error rate of 0.03% is enough for a completely wrong result
...but that would be excellent data quality for a LOD source!
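The error rate quoted above follows directly from the two numbers on the slide; a quick check in Python (variable names are just for illustration):

```python
statements = 69_000   # statements about Germany (from the slide)
wrong_types = 20      # each wrong type needs only one wrong statement

error_rate = wrong_types / statements
print(f"{error_rate:.2%}")
```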
SDType Approach
Idea: outgoing/incoming properties are indicators for a resource's type
e.g.: starring → Movie
e.g.: author^-1 → Writer (^-1 marks an incoming property)
Basic compiled statistics:
P(C|p) := probability of class C in presence of property p
e.g.: P(dbpedia:Film|starring) = 0.79
e.g.: P(dbpedia:Writer|author^-1) = 0.44
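One way such statistics could be compiled is by counting, for each property, how often it co-occurs with each class. The sketch below assumes a toy in-memory dataset; the resources and the function name are hypothetical, and the numbers are not the DBpedia values quoted above.

```python
from collections import Counter, defaultdict

# Toy data, purely illustrative: resource -> (types, properties).
# "author^-1" marks an incoming author link.
RESOURCES = {
    "r1": ({"Film"}, {"starring"}),
    "r2": ({"Film"}, {"starring"}),
    "r3": ({"TelevisionShow"}, {"starring"}),
    "r4": ({"Writer"}, {"author^-1"}),
    "r5": ({"Artist"}, {"author^-1"}),
}


def conditional_type_probabilities(resources):
    """Estimate P(C|p): share of resources with property p that have class C."""
    with_property = Counter()
    with_both = defaultdict(Counter)
    for types, properties in resources.values():
        for p in properties:
            with_property[p] += 1
            for c in types:
                with_both[p][c] += 1
    return {
        p: {c: n / with_property[p] for c, n in classes.items()}
        for p, classes in with_both.items()
    }


P = conditional_type_probabilities(RESOURCES)
print(round(P["starring"]["Film"], 2))
```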
SDType Approach
Based on precompiled statistics:
Find types of instances
Using voting
score(C) = avg(all properties p) P(C|p)
Refinement:
Weight for properties: discriminative power
weight(p) = sum(all classes c) (P(c) - P(c|p))^2
i.e., how strongly this property's class distribution deviates from the overall class distribution
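A minimal sketch of weighted voting, under stated assumptions: the deviations are squared so that positive and negative differences do not cancel, the weights enter the vote as a weighted average, and the score falls back to the prior when no property is discriminative. All probabilities are made-up illustrative values.

```python
# Illustrative numbers only; not taken from DBpedia.
PRIOR = {"Film": 0.25, "Writer": 0.25, "Place": 0.5}   # P(C)
COND = {                                                # P(C|p)
    "starring": {"Film": 0.75, "Writer": 0.25, "Place": 0.0},
    "location^-1": {"Film": 0.25, "Writer": 0.25, "Place": 0.5},
}


def weight(p):
    """Discriminative power of p: squared deviation from the prior (assumed form)."""
    return sum((PRIOR[c] - COND[p][c]) ** 2 for c in PRIOR)


def score(c, properties):
    """Weighted average of P(c|p) over the properties observed for a resource."""
    total = sum(weight(p) for p in properties)
    if total == 0:
        return PRIOR[c]  # assumption: no discriminative evidence, use the prior
    return sum(weight(p) * COND[p][c] for p in properties) / total


observed = ["starring", "location^-1"]
scores = {c: score(c, observed) for c in PRIOR}
print(scores)
```

In this toy setup, location^-1 has the same class distribution as the prior, so its weight is zero and it does not dilute the vote of the more informative starring property.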
Evaluation
Two-fold evaluation:
On DBpedia and OpenCyc as Silver Standard (automatic, 10,000 random instances)
On untyped DBpedia resources (manual, 100 instances)
Using only incoming properties
Using outgoing properties is trivial!
Evaluation Results
On DBpedia
Evaluation Results
On OpenCyc
Evaluation Results
Evaluation on untyped resources:
Random sample of 100 untyped resources
Manual checking of precision
Evaluation Results
DBpedia: works reasonably well (F-measure 0.89)
OpenCyc: harder because of deeper class hierarchy (F-measure 0.60)
General:
having more links increases precision (in contrast to RDFS reasoning)
more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)
Deployment
Heuristic types have been included in DBpedia 3.9
for previously untyped instances
3.4 million type statements at precision ~0.95
Also includes many resources without a Wikipedia page,
i.e., generated from a red link
Runtime:
Complexity O(PT)
P: number of property assertions
T: number of type assertions
~24h for processing DBpedia
Conclusion and Outlook
SDType approach works at high quality
outperforms most state-of-the-art approaches on DBpedia
deployed for DBpedia 3.9
Same approach can be used for validating links
within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
across datasets: to be done