From Data to Knowledge - Profiling & Interlinking Web Datasets

DESCRIPTION

Talk at KEYSTONE/ESWC meeting 2014


1. From data to knowledge: profiling and interlinking Web datasets
Stefan Dietze, L3S Research Center, 31/07/14

2. Introduction
Research areas
  • Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)
  • Application domains: education/TEL, Web archiving
Recent work on
  • Linked Data exploration/discovery/search
  • Entity interlinking & dataset interlinking recommendation
  • Dataset profiling
  • Data consistency & conflicts
Some projects
http://www.l3s.de/
See also: http://purl.org/dietze

3. Linked Data is awesome, but...
  • HTTP-accessibility (SPARQL, URI-dereferencing)
  • Structure & semantics (=> shared/linked vocabularies)
  • Interlinked
  • Persistent
Hm, really? Why are so few datasets actually used?
  • Data reuse and in-links focus on trusted reference graphs such as DBpedia, Freebase, etc.
  • Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone: 300+ datasets, 50 bn triples)
  • Explanations?

4. Linked data is more diverse than we think
Accessibility of datasets?
  • Less than 50% of all SPARQL endpoints are actually responsive at any given point in time
  • THE SPARQL protocol? No, but many variants & subsets
Semantics, links, quality?
  • Data consistency? [? Yuan2014 ?]
  • Data accuracy (e.g. DBpedia)? [Paulheim2013]
  • Vocabulary reuse? [DAquinWebSci13]
  • Schema compliance (RDFS, schemas)? [HoganJWS2012]
[Figure: SPARQL endpoint availability over time [Buil-Aranda et al 2013]]
References:
  • Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., SPARQL Web-Querying Infrastructure: Ready for Action?, International Semantic Web Conference 2013 (ISWC2013).
  • d'Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
  • Paulheim, H., Bizer, C., Type Inference on Noisy RDF Data, The Semantic Web - ISWC 2013, Lecture Notes in Computer Science, Volume 8218, 2013, pp. 510-525.
  • Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., An empirical survey of Linked Data conformance, Journal of Web Semantics 14, 2012.

5. Too many/diverse datasets, too little knowledge
  • Which datasets are useful & trustworthy for case XY (e.g. learning about the solar system)?
  • Which topics are covered?
  • Types: which datasets describe statistics, videos, slides, publications, etc.?
  • Currentness, dynamics, accessibility/reliability, data quantity & quality?

6. Data curation, linking and dataset profiling
  • Schema mappings [WebSci13]
  • Entity disambiguation & linking [ESWC13]
  • Topic profile extraction [WWW13, ESWC14]
[Figure labels: Dataset Catalog/Registry; Dataset Metadata; vocabularies BIBO, AIISO, FOAF; BBC Programme (po:Programme) "Wonders of the Solar System", Brian Cox; Yovisto Video (yov:Video) "Pluto & the Dwarf Planets"; bibo:Film; db:Astronomy; db:Astro. Objects]

7. Schemas/vocabularies on the Web: XKCD 927
https://xkcd.com/927/

8. Schema assessment and mapping
  • Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
  • Co-occurrence after mapping into the most frequent schemas (201 frequent types mapped into 79 classes)
[Figure labels: bibo:Film, bibo:Document, po:Programme, sioc:Item, foaf:Document, yov:Video, typeX]
Reference: d'Aquin et al., WebSci2013 (full entry under slide 4).

9. LinkedUp Data Catalog in a nutshell
  • RDF (VoID) dataset catalog: browse & query distributed datasets
  • Live information about endpoint accessibility
  • Federated queries using type mappings
http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/

10. Dataset interlinking recommendation
Candidate datasets for interlinking?
Approach
  • Given dataset t, rank the datasets from D according to a probability score (d_i, t) of containing linking candidates (entities)
  • Features: vocabulary overlap; existing links (SNA)
  • Linking candidates are likely if datasets share common (a) schema elements or (b) links (friend of a friend)
Conclusions
  • Roughly 60% MAP for both approaches
  • Future work: quantity of links, extraction of experimental data from datasets
[Figure labels: t, Linkset1, Linkset2. Example ranking: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa]
References:
  • Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A., Dietze, S., Recommending Tripleset Interlinking through a Social Network Approach, The 14th International Conference on Web Information System Engineering (WISE 2013), Nanjing, China, 2013.
  • Paes Leme, L.A.P., Lopes, G.R., Nunes, B.P., Casanova, M.A., Dietze, S., Identifying candidate datasets for data interlinking, in Proceedings of the 13th International Conference on Web Engineering, 2013.

11. Dataset & entity linking: semantics of resources/datasets?
  • Topics/categories addressed?
  • Relatedness of resources/entities? (types, semantics)
[Figure labels: Video "Pluto & the Dwarf Planets" (8748720); Programme "Wonders of the Solar System", "Emp. of the Sun", Brian Cox; Slideset "Planetary motion & gravity" (2139393292); "Pluto?"]
References:
  • Nunes, B.P., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure for entity linking, ESWC 2013 - 10th Extended Semantic Web Conference, May 2013.
  • Fetahu, B., Dietze, S., Nunes, B.P., Casanova, M.A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, 2014.

12. Disambiguation/linking using background knowledge
  • Semantic relatedness of resources?
[Figure labels: db:Pluto (Dwarf Planet), db:Astronomical Objects, db:Sun, db:Astronomy, shown with the example resources from slide 11]
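As a rough illustration of the dataset interlinking recommendation on slide 10, the sketch below ranks candidate datasets for a target dataset t by vocabulary overlap alone. The Jaccard score, the dataset names, and the vocabulary sets are illustrative assumptions; the cited papers use a probabilistic ranking and a link-based (SNA) feature that are not reproduced here.

```python
def jaccard(a: set, b: set) -> float:
    """Share of vocabulary terms that two datasets have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(target_vocab: set, candidates: dict) -> list:
    """Return (dataset, score) pairs sorted by vocabulary overlap, best first."""
    scored = [(name, jaccard(target_vocab, vocab)) for name, vocab in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabularies (classes/properties) used by each dataset.
target = {"foaf:Person", "bibo:Article", "dcterms:title", "dcterms:creator"}
candidates = {
    "publications": {"foaf:Person", "bibo:Article", "dcterms:title", "swrc:Proceedings"},
    "videos": {"yov:Video", "dcterms:title", "po:Programme"},
    "places": {"geo:lat", "geo:long"},
}

ranking = rank_candidates(target, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A set-based overlap is cheap to compute from dataset metadata alone, which is why it is a plausible first filter before any entity-level matching is attempted.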
13. Entity linking: semantic relatedness
Computation of connectivity scores between resources/entities
  • Method: combination of (i) a semantic (graph-based) connectivity score (SCS) with (ii) a Web co-occurrence-based measure (CBM), similar to the Normalized Google Distance (NGD)
  • For (i): adaptation of the Katz index from social network analysis (SNA) to (linked) data graphs, considering the number and lengths of paths over transversal properties
[Figure labels: db:Pluto (Dwarf Planet), db:Astronomical Objects, db:Astronomy, db:Sun; SCS = 0.32, CBM = 0.24; shown with the example resources from slide 11]
http://purl.org/vol/doc/
http://purl.org/vol/ns/
(slide dated 19/09/2013)
Reference: Nunes, B.P., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure for entity linking, ESWC 2013 - 10th Extended Semantic Web Conference, May 2013.

14. Entity linking: evaluation
  • Evaluation based on USA Today news items (80,000 entity pairs)
  • Manually created gold standard (1,000 entity pairs)
  • Baseline: Explicit Semantic Analysis (ESA) => CBM/SCS measure relatedness; ESA measures similarity
[Figure: Precision/Recall/F1 for SCS, CBM, ESA]
Reference: Nunes et al., ESWC 2013 (full entry under slide 13).

15. [Figure labels, as on slide 6: db:Astro. Objects; Dataset Metadata; Entity disambiguation & linking [ESWC13]; Topic profile extraction [WWW13, ESWC14]; db:Astronomy]
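The graph-based connectivity score on slide 13 is described as an adaptation of the Katz index. A minimal sketch of a truncated Katz index is given below, assuming an undirected toy graph. It counts walks, as the classic index does, and damps longer ones by `beta`; the slide's restriction to particular properties is not reproduced. The resource names, `beta`, and `max_len` are hypothetical, and the resulting numbers are unrelated to the SCS value shown on the slide.

```python
def katz_score(graph: dict, source: str, target: str,
               beta: float = 0.5, max_len: int = 3) -> float:
    """Truncated Katz index: sum over l of beta**l * (number of walks of length l)."""
    # walks[node] = number of walks of the current length from source to node
    walks = {source: 1}
    score = 0.0
    for length in range(1, max_len + 1):
        next_walks = {}
        for node, count in walks.items():
            for neighbour in graph.get(node, ()):
                next_walks[neighbour] = next_walks.get(neighbour, 0) + count
        walks = next_walks
        score += (beta ** length) * walks.get(target, 0)
    return score


# Hypothetical undirected toy graph loosely based on the slide's example.
edges = [
    ("db:Pluto", "db:Astronomical_Objects"),
    ("db:Sun", "db:Astronomical_Objects"),
    ("db:Pluto", "db:Astronomy"),
    ("db:Sun", "db:Astronomy"),
    ("db:Pluto", "db:Dwarf_Planet"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

pluto_sun = katz_score(graph, "db:Pluto", "db:Sun")
pluto_dwarf = katz_score(graph, "db:Pluto", "db:Dwarf_Planet")
print(pluto_sun, pluto_dwarf)
```

Short connections dominate the score because each extra hop multiplies the contribution by `beta`, so a direct link outweighs several indirect ones.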
Dataset profiling: what's the data about?
  • Extracting representative metadata (topic profile) for datasets
  • Ranking of the most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets
  • Scalability vs representativeness: sampling & ranking for a good scalability/accuracy balance
[Figure labels: Dataset Catalog/Registry; yov:Video; Yovisto Video "Pluto & the Dwarf Planets"]
Reference: Fetahu, B., Dietze, S., Nunes, B.P., Casanova, M.A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, 2014.

16. Dataset profiling: approach
  1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling)
  2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion)
  3. Normalisation and ranking (using graphical models such as PageRank with Priors, HITS with Priors and K-Step Markov)
=> Result: weighted dataset-topic profile graph
Reference: Fetahu et al., ESWC2014 (full entry under slide 15).

17. Dataset profiling: exploring LOD datasets/topics in a nutshell
http://data-observatory.org/lod-profiles/
  • Automatic extraction of dataset topics [ESWC2014] => RDF/VoID dataset profiles
  • Visualisation & exploration of the dataset-topic graph (datasets, topics, relationships)
  • Includes all (responsive) datasets of the LOD Cloud

18. Dataset profiling: results evaluation
Datasets & ground truth
  • Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood
  • Crowd-sourced topic indicators from datasets (keywords, tags)
  • Manual mapping to entities & category extraction (ranking according to frequency)
Baselines
  • 1) LDA, 2) tf/idf (applied to entire datasets)
  • Topic extraction according to our approach, weighting/ranking based on term weight
Measure
  • NDCG @ rank l
[Figure: NDCG (averaged over all datasets)]
Perfo
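The ranking step of the profiling approach (slide 16) and the NDCG measure used in the evaluation (slide 18) can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy dataset-topic graph, the node names, the relevance grades, the restart weight `beta`, and the linear-gain form of DCG are all hypothetical choices.

```python
import math


def pagerank_with_priors(graph: dict, priors: dict,
                         beta: float = 0.3, iterations: int = 50) -> dict:
    """Iterate score = (1 - beta) * random-walk step + beta * prior.

    Assumes an undirected graph in which every node has at least one neighbour.
    """
    scores = {node: priors.get(node, 0.0) for node in graph}
    for _ in range(iterations):
        updated = {node: beta * priors.get(node, 0.0) for node in graph}
        for node, neighbours in graph.items():
            share = (1 - beta) * scores[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        scores = updated
    return scores


def dcg(gains: list) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_topics: list, relevance: dict, k: int) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    gains = [relevance.get(topic, 0.0) for topic in ranked_topics[:k]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical profile graph: dataset -> sampled entities -> DBpedia categories.
edges = [
    ("dataset", "entity:Pluto"),
    ("dataset", "entity:Sun"),
    ("entity:Pluto", "cat:Dwarf_planets"),
    ("entity:Pluto", "cat:Astronomical_objects"),
    ("entity:Sun", "cat:Astronomical_objects"),
    ("entity:Sun", "cat:Stars"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# All prior mass sits on the dataset node, so scores are relative to that dataset.
scores = pagerank_with_priors(graph, {"dataset": 1.0})
categories = [node for node in graph if node.startswith("cat:")]
ranked = sorted(categories, key=lambda c: scores[c], reverse=True)

# Hypothetical graded ground truth for the dataset's topics.
relevance = {"cat:Astronomical_objects": 2, "cat:Dwarf_planets": 1, "cat:Stars": 1}
ndcg = ndcg_at_k(ranked, relevance, k=3)
print(ranked, round(ndcg, 3))
```

Placing the prior on the dataset node biases the walk towards topics close to that dataset, which is the point of using priors rather than plain PageRank for a per-dataset profile.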
