Sindice.com: A Semantic Web Search Engine Giovanni Tummarello Renaud Delbru Eyal Oren Richard Cyganiak Digital Enterprise Research Institute National University of Ireland, Galway November 23, 2007 Richard Cyganiak Sindice.com 1 of 25


Sindice.com:

A Semantic Web Search Engine

Giovanni Tummarello, Renaud Delbru, Eyal Oren, Richard Cyganiak

Digital Enterprise Research Institute, National University of Ireland, Galway

November 23, 2007


The Semantic Web is a reality

• Many Gigs of RDF dumps
• 30+ public SPARQL endpoints
• Linked Data, 5+ different browsers
• RDFa


The Semantic Web is a reality

[Figure: diagram of published Semantic Web datasets: SW Conference Corpus, DBpedia, RDF Book Mashup, DBLP Berlin, Revyu, Project Gutenberg, FOAF, Geonames, Musicbrainz, Magnatune, Jamendo, World Factbook, DBLP Hannover, SIOC, SemWebCentral, Eurostat, ECS Southampton, BBC Later + TOTP, Freshmeat, OpenGuides, GovTrack, US Census Data, W3C WordNet, flickr wrappr, Wikicompany, OpenCyc, lingvoj, Ontoworld. Four entries carry a "NEW!" label.]


The Semantic Web is a reality

We don’t worry about running out of data


Sindice.com


Sindice.com


Sindice API

• http://sindice.com/query/lookup?uri=...
• http://sindice.com/query/lookup?keyword=...
• http://sindice.com/query/lookup?property=...&object=...
• Ask for HTML, plain text, RDF/XML or JSON via content negotiation
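As a rough illustration of the three lookup forms and of content negotiation, the sketch below builds (but does not send) HTTP requests with Python's standard library. The endpoint and parameter names come from the slide; the MIME types, the helper name `build_lookup`, and the example values are assumptions made for illustration only.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://sindice.com/query/lookup"  # endpoint as given on the slide

# Assumed MIME types; the slide only names the four formats.
FORMATS = {
    "html": "text/html",
    "text": "text/plain",
    "rdfxml": "application/rdf+xml",
    "json": "application/json",
}

def build_lookup(params, fmt="json"):
    """Build a lookup request; the response format is chosen via the Accept header."""
    url = BASE + "?" + urlencode(params)
    return Request(url, headers={"Accept": FORMATS[fmt]})

# One request per lookup form (example values are hypothetical).
by_uri = build_lookup({"uri": "http://dbpedia.org/resource/Busan"})
by_keyword = build_lookup({"keyword": "Busan"}, fmt="rdfxml")
by_pair = build_lookup(
    {"property": "http://xmlns.com/foaf/0.1/mbox", "object": "mailto:tom@example.org"},
    fmt="text",
)

print(by_uri.full_url)
print(by_keyword.get_header("Accept"))
```

Selecting the format through the `Accept` header rather than a query parameter keeps one URL per lookup, which is what content negotiation implies.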


Scenario (1)

• Tom surfs to http://dbpedia.org/resource/Busan
• Tom wants more than just DBpedia’s information
• Tom’s Tabulator has a Sindice plugin
• Tom presses ‘lookup on Sindice’
• Tom gets a top-ten list of Busan sources
• Tom selects his two trustworthy sources
• Tom’s Tabulator downloads this data
• Tom continues his happy data-surfing


Scenario (2)

• Tom goes out to eat in Busan
• Tom likes the food and reviews the restaurant
• Tom’s review site pings Sindice with the update
• Within an hour, others can find this info
• Tom continues his happy fish-eating


Sindice: discover Semantic Web resources


Indexing approach

• IR viewpoint: the SW is a bunch of documents
• DB viewpoint: the SW is a bunch of triples
• We take the IR viewpoint: we index all identifiers and provide simple lookups; no RDF queries
• Clients can browse/download/display RDF data themselves; we tell them where to find it


Sindice functionality (operators)

• index : url → ∅
• lookup : uri → {url}
• lookup : text → {url}
• lookup : ifp × value → {url}

Natural data structure: inverted index over documents
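The four operators map naturally onto a dictionary from terms to sets of document URLs. The following is a minimal in-memory sketch of that idea, not Sindice's actual implementation: the class name `TinyIndex`, its method names, and the choice to pass already-parsed document content to `index` (rather than fetching the URL) are all assumptions for illustration.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: each term maps to the set of document URLs mentioning it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, url, uris=(), literals=(), ifp_pairs=()):
        # index : url -> nothing; content is passed in pre-parsed (an assumption)
        for uri in uris:
            self.postings[("uri", uri)].add(url)
        for literal in literals:
            for word in literal.lower().split():
                self.postings[("text", word)].add(url)
        for prop, value in ifp_pairs:
            self.postings[("ifp", prop, value)].add(url)

    def lookup_uri(self, uri):
        # lookup : uri -> {url}
        return set(self.postings.get(("uri", uri), ()))

    def lookup_text(self, text):
        # lookup : text -> {url}; documents containing every query word
        sets = [self.postings.get(("text", w), set()) for w in text.lower().split()]
        return set.intersection(*sets) if sets else set()

    def lookup_ifp(self, prop, value):
        # lookup : ifp x value -> {url}
        return set(self.postings.get(("ifp", prop, value), ()))

idx = TinyIndex()
busan = "http://dbpedia.org/resource/Busan"
idx.index("http://example.org/a.rdf", uris=[busan], literals=["Busan fish market"],
          ifp_pairs=[("foaf:mbox", "mailto:tom@example.org")])
idx.index("http://example.org/b.rdf", uris=[busan], literals=["Busan harbour"])

print(sorted(idx.lookup_uri(busan)))
print(idx.lookup_text("fish busan"))
print(idx.lookup_ifp("foaf:mbox", "mailto:tom@example.org"))
```

Every lookup returns document URLs only, never triples, which matches the IR viewpoint: the client fetches the data itself.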


Sindice architecture


Index lookup

• Index retrieval
• Ranking phase
• Result generation


Graph processing

1. Fetch RDF data

2. Extract and index full-text literals

3. Extract and index mentioned URIs

4. Extract graph metadata (size and length)

5. Graph expansion and inferencing

6. Extract labels

7. Extract and index mentioned IFP pairs
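The extraction steps in this list can be sketched over plain tuples. This is only an illustration under stated assumptions: triples are `(subject, predicate, object)` tuples that have already been fetched (step 1), the set of IFPs is supplied by the caller instead of being derived by expansion and inferencing (step 5), and "graph metadata" is reduced to a triple count. The names `Literal` and `process_graph` are hypothetical.

```python
from collections import namedtuple

Literal = namedtuple("Literal", "value")  # marks an object as a literal rather than a URI
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def process_graph(triples, ifps):
    """Derive index entries from already-fetched triples (steps 2, 3, 4, 6 and 7)."""
    literals, uris, labels, ifp_pairs = [], set(), [], []
    for s, p, o in triples:
        uris.update((s, p))                  # step 3: mentioned URIs
        if isinstance(o, Literal):
            literals.append(o.value)         # step 2: full-text literals
            if p == RDFS_LABEL:
                labels.append(o.value)       # step 6: labels
        else:
            uris.add(o)
        if p in ifps:                        # step 7: IFP (property, value) pairs
            ifp_pairs.append((p, o.value if isinstance(o, Literal) else o))
    return {"literals": literals, "uris": uris, "labels": labels,
            "ifp_pairs": ifp_pairs, "size": len(triples)}  # step 4 (assumed: triple count)

FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
triples = [
    ("http://example.org/tom", RDFS_LABEL, Literal("Tom")),
    ("http://example.org/tom", FOAF_MBOX, "mailto:tom@example.org"),
]
result = process_graph(triples, ifps={FOAF_MBOX})
print(result["labels"], result["ifp_pairs"], result["size"])
```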


IFP processing


Graph processing: IFP extraction

• OWL reasoning is needed to find IFPs, but is computationally expensive
• Desirable: a reasoning cache to reuse computation
• Undesirable: global trust in all statements


Solution: quarantined reasoning cache

• Recursively fetch all mentioned schemas
• Compute the closure of the union of the schemas
• Query and store all properties that are an IFP
• {foaf:name, dc:title, foaf:homepage, foaf:mbox} → {foaf:mbox}
• For any document that uses the same properties, you know the set of possible IFPs
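The reuse idea can be sketched as a cache keyed by the set of properties a document uses. This is a simplified illustration, not the actual Sindice code: the expensive part (fetching schemas, computing the closure, querying for IFPs) is replaced by a toy stand-in function that merely reproduces the slide's example, and the names `IfpCache` and `toy_reasoner` are hypothetical.

```python
class IfpCache:
    """Cache of IFP sets, keyed by the set of properties a document uses."""

    def __init__(self, find_ifps):
        # find_ifps stands in for: fetch schemas, compute closure, query IFPs
        self.find_ifps = find_ifps
        self.cache = {}
        self.misses = 0  # how often the expensive step actually ran

    def ifps_for(self, properties):
        key = frozenset(properties)  # order and duplicates do not matter
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = frozenset(self.find_ifps(key))
        return self.cache[key]

def toy_reasoner(properties):
    # Toy stand-in for OWL reasoning; mirrors the slide's example result.
    return properties & {"foaf:mbox"}

cache = IfpCache(toy_reasoner)
doc1 = ["foaf:name", "dc:title", "foaf:homepage", "foaf:mbox"]
doc2 = ["foaf:mbox", "foaf:homepage", "dc:title", "foaf:name"]  # same properties

print(set(cache.ifps_for(doc1)))
print(set(cache.ifps_for(doc2)))
print(cache.misses)
```

Because each cache entry depends only on the schemas reachable from one property set, statements in one document's schemas do not leak into the results for documents that use different properties.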


Sindice components

• Hadoop (parallel processing)
• HTable (document cache)
• Solr (document index)
• Sesame & OWLIM (reasoning)
• Ruby on Rails (frontend)
• pingthesemanticweb.com


Tool for data providers: Semantic Sitemap

• Sitemap protocol exposes “deep web” to crawlers
• Semantic sitemap adds Semantic Web data
• http://sw.deri.org/2007/07/sitemapextension/
• Used by: Geonames, DBLP, Uniprot, DBpedia, data.semanticweb.org


Semantic Sitemap: example

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.com</sc:datasetLabel>
    <sc:linkedDataPrefix>http://example.com/products/</sc:linkedDataPrefix>
    <sc:sparqlEndpoint>http://example.com/sparql</sc:sparqlEndpoint>
    <sc:dataDumpLocation>http://example.com/all.rdf</sc:dataDumpLocation>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>
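A crawler could read such a sitemap with a namespace-aware XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a copy of the example; the function name `read_datasets` and the dictionary keys are assumptions made for illustration, and only the elements shown on the slide are handled.

```python
import xml.etree.ElementTree as ET

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.com</sc:datasetLabel>
    <sc:linkedDataPrefix>http://example.com/products/</sc:linkedDataPrefix>
    <sc:sparqlEndpoint>http://example.com/sparql</sc:sparqlEndpoint>
    <sc:dataDumpLocation>http://example.com/all.rdf</sc:dataDumpLocation>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>"""

# Prefixes used in the search paths below; "sm" is the default sitemap namespace.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd",
}

def read_datasets(xml_text):
    """Return one dict per sc:dataset element, with the fields shown on the slide."""
    root = ET.fromstring(xml_text)
    return [
        {
            "label": ds.findtext("sc:datasetLabel", namespaces=NS),
            "linked_data_prefix": ds.findtext("sc:linkedDataPrefix", namespaces=NS),
            "sparql_endpoint": ds.findtext("sc:sparqlEndpoint", namespaces=NS),
            "data_dump": ds.findtext("sc:dataDumpLocation", namespaces=NS),
            "changefreq": ds.findtext("sm:changefreq", namespaces=NS),
        }
        for ds in root.findall("sc:dataset", NS)
    ]

datasets = read_datasets(SITEMAP)
print(datasets[0]["label"], datasets[0]["data_dump"])
```

With the dump location in hand, a crawler can download one file instead of requesting every URL under the linked data prefix.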


What about other search engines?

• We do not answer queries but refer to data sources
• We have IFP lookup using OWL reasoning
• We have semantic sitemaps for data dumps
• We support linked data (input and output)
• We have fully open client APIs
• We have Hadoop infrastructure
• We have live, continuous updates
• Simplicity, efficiency, scalability


Credits

• Giovanni Tummarello
• Eyal Oren
• Michele Catasta
• Renaud Delbru
• Holger Stenzhorn
• Adam Westerski
• OpenLink Software


Upcoming as we speak . . .

• Validator API
• Trust assessment API
• SW Pipes and widgets platform
• Entity-based API (Okkam)
• Growing hardware cluster (possibly 100 nodes)


Summary

• Sindice: lookup service for Semantic Web resources
• Lookup: resources by URI, IFP, or keyword
• Architecture: based on Hadoop, Solr and OWLIM
• Data: DBLP, DBpedia, Uniprot, Geonames and more
• 20M+ documents, 80M+ URIs, 4M+ IFPs, 2B+ triples
