Discovering Related Data Sources in Data Portals
Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm
1st International Workshop on Semantic Statistics
Sydney, Oct 22, 2013
WORLD BANK
Potential of Open (Statistics) Data
fluidOps Open Data Portal
• Data collection
  – Integration of major open data catalogs
  – Automated provisioning of 10,000s of data sets
• Portal for search and exploration of data sets
  – Rich metadata based on open standards
  – Both descriptive and structural metadata
• Integrated querying across interlinked data sets
  – Easy to use queries against multiple data sets
  – Using federation technologies
• Self-service UI
  – Custom queries and visualizations
  – Widgets, dashboarding, etc.
Finding Related Data Sets
• Many information needs require analysis of multiple data sets
• Example: Compare and correlate GDP, population and public debt of countries over time
• Task of finding related data sets
  – Identify data sets that are similar, but complementary
  – To support queries across multiple data sets, e.g. in the form of joins and unions
• Inspiration: Finding related tables
  – Entity complement: same attributes, complementing entities
  – Schema complement: same entities, complementing attributes
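The two kinds of complement map naturally onto unions and joins. The following is a minimal sketch of that idea on toy tables; the table names, attribute names, and values are placeholders invented for illustration, not data from the portal.

```python
# Toy tables keyed by entity (country). All names and values are placeholders.
table_a = {"Germany": {"gdp": 1.0}, "France": {"gdp": 2.0}}

# Entity complement: same attributes, different entities -> combine by union.
table_b = {"Italy": {"gdp": 3.0}}

# Schema complement: same entities, different attributes -> combine by join.
table_c = {"Germany": {"population": 10}, "France": {"population": 20}}


def union_rows(left, right):
    """Union: keep all entities from both tables (rows of `right` win on clashes)."""
    merged = dict(left)
    merged.update(right)
    return merged


def join_rows(left, right):
    """Inner join on the entity key: merge attributes of entities present in both."""
    return {key: {**left[key], **right[key]} for key in left if key in right}


unioned = union_rows(table_a, table_b)  # three countries, one attribute each
joined = join_rows(table_a, table_c)    # two countries, two attributes each
```

The union grows the set of entities, while the join grows the set of attributes per entity.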
Finding Related Data Sources via Related Entities
• Data Model: Data source is a set of multiple RDF graphs
• Intuition: if data sources contain similar entities, they are somehow related
• Approach:
  1. Entity Extraction
  2. Entity Similarity
  3. Entity Clustering
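As a rough illustration of the data model and the extraction step, the sketch below represents a data source as a dictionary of named graphs (sets of triples) and collects an entity's surrounding sub-graph with a bounded breadth-first crawl. The URIs, the `crawl_subgraph` helper, and the choice to follow only outgoing edges up to `max_depth` hops are assumptions made for this example; the slides do not specify the crawling strategy.

```python
from collections import deque

# A data source modelled as a set of named RDF graphs, each a set of
# (subject, predicate, object) triples. All URIs below are made up.
source = {
    "graph1": {
        ("ex:Germany", "ex:capital", "ex:Berlin"),
        ("ex:Germany", "rdfs:label", '"Germany"'),
    },
    "graph2": {
        ("ex:Berlin", "rdfs:label", '"Berlin"'),
        ("ex:France", "ex:capital", "ex:Paris"),
    },
}


def crawl_subgraph(source, entity, max_depth=2):
    """Collect triples reachable from `entity` via outgoing edges within max_depth hops."""
    triples = set().union(*source.values())
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))

    found, seen = set(), {entity}
    queue = deque([(entity, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes at the depth limit
        for triple in by_subject.get(node, []):
            found.add(triple)
            obj = triple[2]
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return found


one_hop = crawl_subgraph(source, "ex:Germany", max_depth=1)
two_hops = crawl_subgraph(source, "ex:Germany", max_depth=2)
```

With one hop only the two triples about `ex:Germany` are collected; with two hops the label of `ex:Berlin` from the second graph is included as well.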
[Figure: Entities from Source 1, Source 2 and Source 3 grouped into Cluster 1 and Cluster 2. Related?!]
Related Entities (2)
1. Entity Extraction
  – Sample over entities in data graphs in D
  – For each entity crawl its surrounding sub-graph [1]
2. Entity Similarity
  – Define dissimilarity measure between two entities based on kernel functions
  – Compare entity structure and literals via different kernels [2,3]
3. Entity Clustering
  – Apply k-means clustering to discover similar entities [4]
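Steps 2 and 3 above can be sketched as follows. This is a simplified illustration, not the method from the slides: it swaps in a plain set-intersection kernel over (predicate, object) pairs for the graph kernels of [2], uses the standard kernel-induced distance d(a, b)² = k(a, a) − 2·k(a, b) + k(b, b), and runs a basic kernel k-means from an explicitly given initial assignment. The entity descriptions and function names are hypothetical.

```python
import math


def intersection_kernel(a, b):
    """Toy kernel: number of (predicate, object) pairs two entities share."""
    return float(len(a & b))


def kernel_distance(a, b, kernel=intersection_kernel):
    """Distance induced by a kernel: sqrt(k(a,a) - 2*k(a,b) + k(b,b))."""
    squared = kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b)
    return math.sqrt(max(squared, 0.0))


def kernel_kmeans(items, labels, k, kernel=intersection_kernel, max_iter=20):
    """Basic kernel k-means starting from the given initial labels."""
    n = len(items)
    gram = [[kernel(items[i], items[j]) for j in range(n)] for i in range(n)]
    labels = list(labels)
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        new_labels = []
        for i in range(n):
            best, best_dist = labels[i], math.inf
            for c, members in enumerate(clusters):
                if not members:
                    continue  # an empty cluster has no centroid
                m = len(members)
                cross = sum(gram[i][j] for j in members) / m
                within = sum(gram[j][l] for j in members for l in members) / (m * m)
                # Squared distance from item i to the implicit cluster centroid.
                dist = gram[i][i] - 2.0 * cross + within
                if dist < best_dist:
                    best, best_dist = c, dist
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Hypothetical entity descriptions as sets of (predicate, object) pairs.
entities = [
    {("type", "Country"), ("currency", "EUR")},
    {("type", "Country"), ("currency", "EUR"), ("continent", "Europe")},
    {("type", "Indicator"), ("unit", "USD")},
    {("type", "Indicator"), ("unit", "USD"), ("freq", "annual")},
]
clusters = kernel_kmeans(entities, labels=[0, 0, 0, 1], k=2)
```

In this toy run the two country-like entities end up in one cluster and the two indicator-like entities in the other. Like ordinary k-means, the procedure only finds a local optimum, so the result depends on the initial assignment.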
Contextualization Score
• Contextualization score for data source D’’ given D’: ec(D’’|D’) and sc(D’’|D’)
• Entity complement score
• Schema complement score
Search for Gross Domestic Product
Querying the Data Set
Visualizing the Results
Queries Across Related Data Sets
• Query for GDP of Germany
• Union of results from
  – Worldbank: GDP (current US$) (up to 2010)
  – Eurostat: GDP at Market Prices (including projected values until 2014)
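A union of this kind could look roughly like the sketch below, which merges two yearly series and records which source each value came from. The `union_series` helper and the numbers are placeholders for illustration only (they are not real GDP figures), and the sketch ignores that the two indicators are labelled differently, so a real union may first need to reconcile units.

```python
def union_series(primary, secondary, primary_name, secondary_name):
    """Union two {year: value} series; on overlapping years the primary source wins."""
    combined = {}
    for year, value in secondary.items():
        combined[year] = (value, secondary_name)
    for year, value in primary.items():
        combined[year] = (value, primary_name)  # overwrite overlapping years
    return dict(sorted(combined.items()))


# Placeholder values only, NOT real GDP figures.
worldbank_gdp = {2009: 1.0, 2010: 2.0}
eurostat_gdp = {2010: 2.5, 2011: 3.0, 2014: 4.0}

combined = union_series(worldbank_gdp, eurostat_gdp, "Worldbank", "Eurostat")
```

Keeping the source name next to each value makes it possible to distinguish the two origins later, for example when plotting the combined series.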
Queries Across Related Data Sets
[Figure: query results with data from Eurostat and data from Worldbank]
Summary and Outlook
• Techniques for finding related data sets
  – Based on finding related entities
• Implementation available in open data portal
• Outlook
  – Finding relevant related data sources for a given information need
  – End user interfaces for formulating queries across data sets (see Optique project)
  – Operators for combining data cubes
  – Interactive visualization and exploration of combined data cubes (see OpenCube project)
References
[1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.
[2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.
[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.
[4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.