Discovering Related Data Sources in Data Portals. Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm. 1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013


DESCRIPTION

Slides from my presentation at the 1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013


Page 1: Discovering Related Data Sources in Data Portals

Discovering Related Data Sources in Data Portals

Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm

1st International Workshop on Semantic Statistics

Sydney, Oct 22, 2013

Page 2: Discovering Related Data Sources in Data Portals

WORLD BANK

Potential of Open (Statistics) Data

Page 3: Discovering Related Data Sources in Data Portals

WORLD BANK

fluidOps Open Data Portal
• Data collection
  – Integration of major open data catalogs
  – Automated provisioning of 10,000s of data sets
• Portal for search and exploration of data sets
  – Rich metadata based on open standards
  – Both descriptive and structural metadata
• Integrated querying across interlinked data sets
  – Easy-to-use queries against multiple data sets
  – Using federation technologies
• Self-service UI
  – Custom queries and visualizations
  – Widgets, dashboarding, etc.

Page 4: Discovering Related Data Sources in Data Portals
Page 5: Discovering Related Data Sources in Data Portals

Finding Related Data Sets
• Many information needs require analysis of multiple data sets
• Example: Compare and correlate GDP, population and public debt of countries over time
• Task of finding related data sets
  – Identify data sets that are similar, but complementary
  – To support queries across multiple data sets, e.g. in the form of joins and unions
• Inspiration: Finding related tables
  – Entity complement: same attributes, complementing entities
  – Schema complement: same entities, complementing attributes
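The two kinds of complement can be illustrated with a minimal Python sketch. The table layout (a dict from entity to attribute dict), the helper names `union` and `join`, and all values are assumptions made for illustration; the numbers are placeholders, not real statistics.

```python
# Toy tables: each maps an entity (country) to its attributes.
# All values are placeholders, NOT real statistics.
gdp_europe = {"Germany": {"gdp": 1.0}, "France": {"gdp": 2.0}}
gdp_asia = {"Japan": {"gdp": 3.0}, "India": {"gdp": 4.0}}
population = {"Germany": {"population": 10}, "France": {"population": 20}}


def union(a, b):
    """Entity complement: same attributes, additional entities."""
    merged = dict(a)
    merged.update(b)
    return merged


def join(a, b):
    """Schema complement: same entities, additional attributes (inner join)."""
    return {e: {**a[e], **b[e]} for e in a.keys() & b.keys()}


all_gdp = union(gdp_europe, gdp_asia)        # more countries, same attribute
gdp_and_pop = join(gdp_europe, population)   # same countries, more attributes
```

An entity complement widens coverage (more rows), while a schema complement widens the description of the same rows (more columns).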

Page 6: Discovering Related Data Sources in Data Portals

Finding Related Data Sources via Related Entities
• Data Model: Data source is a set of multiple RDF graphs
• Intuition: if data sources contain similar entities, they are somehow related
• Approach:
  1. Entity Extraction
  2. Entity Similarity
  3. Entity Clustering

[Figure: entities from Source 1, Source 2, and Source 3 grouped into Cluster 1 and Cluster 2, annotated "Related?!"]
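One plausible reading of the entity extraction step is a bounded traversal of the graph around each sampled entity. The sketch below assumes triples are plain Python tuples, follows outgoing edges only, and uses a hypothetical `max_depth` cutoff; none of these choices is stated on the slides, and the `ex:` identifiers are made up.

```python
from collections import deque

# Hypothetical toy graph as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:Germany", "ex:capital", "ex:Berlin"),
    ("ex:Germany", "ex:name", '"Germany"'),
    ("ex:Berlin", "ex:locatedIn", "ex:Europe"),
    ("ex:Europe", "ex:name", '"Europe"'),
]


def crawl_subgraph(triples, entity, max_depth=2):
    """Collect triples reachable from `entity` via outgoing edges,
    expanding nodes that are fewer than `max_depth` hops away."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))

    seen = {entity}
    queue = deque([(entity, 0)])
    subgraph = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes at the depth limit
        for triple in by_subject.get(node, []):
            subgraph.append(triple)
            obj = triple[2]
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return subgraph


germany = crawl_subgraph(TRIPLES, "ex:Germany", max_depth=2)
```

The resulting triple list is the entity's description; a larger `max_depth` gives a richer but noisier description.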

Page 7: Discovering Related Data Sources in Data Portals

Related Entities (2)
1. Entity Extraction
  – Sample over entities in data graphs in D
  – For each entity crawl its surrounding sub-graph [1]
2. Entity Similarity
  – Define dissimilarity measure between two entities based on kernel functions
  – Compare entity structure and literals via different kernels [2,3]
3. Entity Clustering
  – Apply k-means clustering to discover similar entities [4]
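Steps 2 and 3 can be sketched as a kernel-induced distance followed by textbook kernel k-means. This is a minimal illustration under loud assumptions: the set-intersection kernel is a stand-in for the graph and literal kernels of [2,3], the feature strings are invented, the initial assignment is fixed by hand, and the loop is the plain algorithm rather than the large-scale scheme of [4].

```python
import math


def intersection_kernel(a, b):
    """Stand-in kernel: number of shared features (dot product of indicator vectors)."""
    return len(a & b)


def kernel_distance(kernel, x, y):
    """Distance in feature space: sqrt(k(x,x) - 2 k(x,y) + k(y,y))."""
    return math.sqrt(max(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y), 0.0))


def kernel_kmeans(items, k, kernel, init, max_iter=20):
    """Plain kernel k-means; `init` is the starting cluster label per item."""
    n = len(items)
    gram = [[kernel(items[i], items[j]) for j in range(n)] for i in range(n)]
    assign = list(init)
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if assign[i] == c] for c in range(k)]
        new_assign = []
        for i in range(n):
            best, best_d = assign[i], float("inf")
            for c, members in enumerate(clusters):
                if not members:
                    continue  # skip empty clusters
                m = len(members)
                # Squared distance from item i to the centroid of cluster c.
                d = (gram[i][i]
                     - 2 * sum(gram[i][j] for j in members) / m
                     + sum(gram[a][b] for a in members for b in members) / m ** 2)
                if d < best_d:
                    best, best_d = c, d
            new_assign.append(best)
        if new_assign == assign:
            break  # converged
        assign = new_assign
    return assign


# Hypothetical entity descriptions as feature sets.
e0 = {"type:Country", "has:gdp", "has:population"}
e1 = {"type:Country", "has:gdp", "has:debt"}
e2 = {"type:City", "has:mayor", "has:area"}
e3 = {"type:City", "has:mayor", "has:elevation"}

labels = kernel_kmeans([e0, e1, e2, e3], k=2,
                       kernel=intersection_kernel, init=[0, 0, 0, 1])
```

Because only kernel values are needed, the clustering never has to build explicit feature vectors for RDF sub-graphs; swapping in a different kernel changes the notion of similarity without touching the clustering loop.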

Page 8: Discovering Related Data Sources in Data Portals

Contextualization Score

• Contextualization score for data source D’’ given D’: ec(D’’|D’) and sc(D’’|D’)

• Entity complement score

• Schema complement score

Page 9: Discovering Related Data Sources in Data Portals
Page 10: Discovering Related Data Sources in Data Portals

Search for Gross Domestic Product

Page 11: Discovering Related Data Sources in Data Portals
Page 12: Discovering Related Data Sources in Data Portals

Querying the Data Set

Page 13: Discovering Related Data Sources in Data Portals

Visualizing the Results

Page 14: Discovering Related Data Sources in Data Portals

Queries Across Related Data Sets
• Query for GDP of Germany
• Union of results from
  – Worldbank: GDP (current US$) (up to 2010)
  – Eurostat: GDP at Market Prices (including projected values until 2014)
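A union over two sources can be sketched in Python as follows. The year-to-value dicts and the helper `union_with_provenance` are hypothetical, and the numbers are placeholders rather than real GDP figures. Since the two indicators are defined differently, the sketch keeps the source name next to every observation instead of merging values for the same year.

```python
# Placeholder observations (year -> value); NOT real GDP figures.
worldbank = {2008: 1.0, 2009: 2.0, 2010: 3.0}
eurostat = {2009: 20.0, 2010: 30.0, 2011: 40.0, 2014: 50.0}


def union_with_provenance(**sources):
    """Union of all observations, each tagged with the source it came from."""
    rows = []
    for name, series in sources.items():
        for year, value in series.items():
            rows.append((year, name, value))
    return sorted(rows)


rows = union_with_provenance(worldbank=worldbank, eurostat=eurostat)
years_covered = sorted({year for year, _, _ in rows})
```

The union extends the time range beyond what either source covers alone, while overlapping years remain visible as separate rows per source.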

Page 15: Discovering Related Data Sources in Data Portals

Queries Across Related Data Sets

[Figure labels: Data from Eurostat, Data from Worldbank]

Page 16: Discovering Related Data Sources in Data Portals

Summary and Outlook
• Techniques for finding related data sets
  – Based on finding related entities
• Implementation available in open data portal
• Outlook
  – Finding relevant related data sources for a given information need
  – End user interfaces for formulating queries across data sets (see Optique project)
  – Operators for combining data cubes
  – Interactive visualization and exploration of combined data cubes (see OpenCube project)

Page 17: Discovering Related Data Sources in Data Portals

References

[1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.

[2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.

[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.

[4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.