Semantic Search engines
Existing Solutions
Linked Data
How can I get my dataset into the diagram?
• There must be resolvable http:// (or https://) URIs.
• They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples).
• The dataset must contain at least 1000 triples. (Hence, your FOAF file most likely does not qualify.)
• The dataset must be connected via RDF links to a dataset that is already in the diagram. This means, either your dataset must use URIs from the other dataset, or vice versa. We arbitrarily require at least 50 links.
• Access to the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
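The inclusion criteria above can be read as a simple checklist. The following is only a sketch of that checklist: the `DatasetStats` record and the `failed_criteria` function are hypothetical names introduced for illustration, and the numbers are assumed to have been measured beforehand.

```python
from dataclasses import dataclass, field

MIN_TRIPLES = 1000   # "at least 1000 triples"
MIN_LINKS = 50       # "at least 50 links"
ACCESS_METHODS = {"crawl", "dump", "sparql"}


@dataclass
class DatasetStats:
    """Hypothetical summary of a dataset, filled in by hand or by a crawler."""
    uris_resolve_to_rdf: bool          # http(s) URIs resolve to RDF data
    triple_count: int
    links_to_diagram: int              # RDF links to or from a dataset already in the diagram
    access_methods: set = field(default_factory=set)


def failed_criteria(stats):
    """Return the list of criteria the dataset does not meet (empty if it qualifies)."""
    failures = []
    if not stats.uris_resolve_to_rdf:
        failures.append("URIs must resolve to RDF data")
    if stats.triple_count < MIN_TRIPLES:
        failures.append(f"needs at least {MIN_TRIPLES} triples")
    if stats.links_to_diagram < MIN_LINKS:
        failures.append(f"needs at least {MIN_LINKS} RDF links")
    if not (stats.access_methods & ACCESS_METHODS):
        failures.append("needs crawling, dump, or SPARQL access")
    return failures


# A small FOAF file fails on size and link count.
foaf_file = DatasetStats(True, 40, 3, {"crawl"})
print(failed_criteria(foaf_file))
```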
Why Linked Data?
• Easier search for structured documents (use of URIs in RDF triples is similar to the use of URLs in classical links)
• Easier ontology matching – Central authorities providing URIs for other data sources (e.g. DBpedia)
Semantic Search Engines
Document-Centric Semantic Search Engines
Watson
• http://kmi-web05.open.ac.uk/WatsonWUI/
• Parsing: Jena
• Repository: Jena?
• Reasoning: NO
• Keyword based search, SPARQL endpoint
Watson - Schema
Swoogle
• http://swoogle.umbc.edu/
• Crawler: 3 Custom Crawlers
– Google Crawler (.rdf, .owl files)
– Focused Crawler
– Extracted URIs crawler
• Repository: Jena
• Index: Lucene
• Keyword based search
Swoogle Architecture
Data Analysis
• Classification of Semantic Web Documents
– Databases – Makes assertions about individuals
– Ontologies – Defines new terms
• Compute rank of SWDs
• Search ordering: Swoogle PR – analogy to GPR
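To illustrate the analogy, here is a minimal sketch of plain PageRank over a graph of documents that reference each other. It is not Swoogle's actual ranking, whose details are not given here; the function name `page_rank`, the damping factor, and the example file names are assumptions.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps each document to the list of documents it references.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by documents without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical documents: two of them reference foaf.rdf.
graph = {
    "foaf.rdf": [],
    "people.rdf": ["foaf.rdf"],
    "pubs.rdf": ["foaf.rdf", "people.rdf"],
}
for doc, score in sorted(page_rank(graph).items(), key=lambda item: -item[1]):
    print(doc, round(score, 3))
```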
Entity-Centric Semantic Search Engines
Falcons
• http://iws.seu.edu.cn/services/falcons/
• Reasoning/Ontology matching: Falcon-ao
• Search ordering: TF-IDF in combination with popularity of ontologies
• Classes recommendation: Ordering according to their popularity
• Keyword search: Based on the indexed texts extracted from Virtual Documents
Falcon Screenshot
Falcon-ao
• Linguistic Matching for Ontologies
– Virtual Documents (names, labels, comments)
– Levenshtein edit distance
– Vector Space Model + cosine similarity of VDs
• Graph Matching for Ontologies
– Similarity of two entities comes from the accumulation of similarities of involved statements
– Similarity of two statements comes from the accumulation of similarities of involved entities
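The two linguistic measures named above can be sketched in a few lines. This is a generic illustration, not Falcon-AO's code: virtual documents are assumed to be plain strings, terms are weighted by raw counts, and the function names are made up for the example.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two virtual documents as term-count vectors."""
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm = (math.sqrt(sum(c * c for c in vec_a.values()))
            * math.sqrt(sum(c * c for c in vec_b.values())))
    return dot / norm if norm else 0.0


print(levenshtein("Author", "Authors"))
print(cosine_similarity("person name label", "person full name"))
```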
SWSE
• http://swse.deri.org/
• Crawler: MultiCrawler
• Repository: YARS2 – storing quadruples (subject, predicate, object, context)
• Ontology matching: URIs, IFPs
• Reasoning: Future work (Scalable Authoritative OWL Reasoner - SAOR)
• Search ordering: ReConRank (Page Rank for Linked Data)
• Keyword based search: Lucene
SWSE Architecture
• Consolidate – find synonymous identifiers
• Rank – links-based analysis, scores assignment
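One way to picture the consolidation step is to merge identifiers that share a value for an inverse functional property (IFP). The sketch below assumes that approach and uses a union-find structure; it is not taken from SWSE, and the function name, triple format, and example URIs are hypothetical.

```python
def consolidate(triples, ifps):
    """Map every subject to a canonical identifier.

    Two subjects are treated as synonymous when they share the same
    value for a property listed in ifps (assumed to be inverse functional).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller identifier as canonical.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_subject = {}
    for subject, predicate, obj in triples:
        find(subject)
        if predicate in ifps:
            key = (predicate, obj)
            if key in first_subject:
                union(first_subject[key], subject)
            else:
                first_subject[key] = subject
    return {subject: find(subject) for subject in list(parent)}


triples = [
    ("http://a.example/alice", "foaf:mbox", "mailto:alice@example.org"),
    ("http://b.example/asmith", "foaf:mbox", "mailto:alice@example.org"),
    ("http://a.example/bob", "foaf:mbox", "mailto:bob@example.org"),
]
print(consolidate(triples, {"foaf:mbox"}))
```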
Sindice.com
• http://www.sindice.com
• Crawler: SindiceBot
– robots.rdf – semantic site maps
– crawling pingthesemanticweb.com
• 3 Indexes:
– URI index
– IFP index
– Keyword index
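A rough sketch of how three such indexes could be built from crawled documents follows. It is an in-memory illustration only, not Sindice's implementation: the `build_indexes` function, the triple format, and the rule that anything starting with `http://` or `https://` is a URI are all assumptions.

```python
from collections import defaultdict


def is_uri(term):
    """Assumption for this sketch: URIs start with http:// or https://."""
    return term.startswith(("http://", "https://"))


def build_indexes(documents, ifps):
    """Build URI, IFP, and keyword indexes mapping to source document URLs.

    documents maps a document URL to its list of (subject, predicate, object) triples.
    """
    uri_index = defaultdict(set)      # URI -> documents mentioning it
    ifp_index = defaultdict(set)      # (IFP, value) -> documents
    keyword_index = defaultdict(set)  # word in a literal -> documents
    for doc_url, triples in documents.items():
        for subject, predicate, obj in triples:
            uri_index[subject].add(doc_url)
            if is_uri(obj):
                uri_index[obj].add(doc_url)
            else:
                for word in obj.lower().split():
                    keyword_index[word].add(doc_url)
            if predicate in ifps:
                ifp_index[(predicate, obj)].add(doc_url)
    return uri_index, ifp_index, keyword_index
```

A lookup in the IFP index then returns every document that describes the same entity under a different URI, which a URI lookup alone would miss.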
Sindice Architecture
• Crawler:
– Apache Nutch
– Hadoop
– MapReduce
• Reasoner: OWLIM Reasoner
• Keyword based search: Solr
• http://www.sig.ma
Basic structure (diagram components): structured data crawler, unstructured data crawler, documents repository, data extractor, indexer, entity repository, searcher, sorter, other apps using API
Basic structure (diagram components): crawler, ping, scheduler, documents repository (cache), data extractor (parser), indexer, entity repository, searcher, sorter, other apps using API
Basic structure (diagram components): crawler, ping, scheduler, flat files?, indexer, Sesame, SERQL, OWLIM, searcher, sorter
Crawling Problems
• Locating resources (no longer a big problem)
• Re-crawl timing
• Live data sources
• Automatically generated data sources
Storage Problems
• Ontology matching – structural and linguistic methods are not 100 % accurate
• Reasoning
– Tradeoff quality vs. scalability
– Data sources credibility (spamming)
• Indexing – tradeoff quality vs. scalability
– Keyword search vs. SPARQL
Searching Problems
• Extent of some queries, e.g.
    SELECT ?s ?o
    WHERE { ?s rdf:type ?o }
– Stop words
– Top-k results
• Results ordering
– Application of Page Rank – prone to spamming
– Resources credibility
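A query like the `rdf:type` one above can match nearly every resource in the store, so an engine would typically return only the k best results. A minimal sketch of top-k selection, assuming each match already has a rank score (the `top_k` name and the example scores are hypothetical):

```python
import heapq


def top_k(matches, score, k=10):
    """Return the k highest-scoring matches without sorting the full result set."""
    return heapq.nlargest(k, matches, key=score)


# Hypothetical rank scores for resources matched by a very broad query.
scores = {
    "http://example.org/alice": 0.42,
    "http://example.org/bob": 0.17,
    "http://example.org/carol": 0.91,
    "http://example.org/dave": 0.05,
}
print(top_k(scores, score=scores.get, k=2))
```

`heapq.nlargest` also accepts a generator, so results can be streamed from the store instead of being materialised first.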
Semantic Web Crawlers
• Slug
– Simple – starts from a given set of documents and follows extracted URIs
– Bugs
• MultiCrawler
– No downloadable version
– Description in a paper
• Apache Nutch based solution
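The simple strategy described for Slug (start from a given set of documents and follow extracted URIs) amounts to a breadth-first traversal. The sketch below is a generic illustration and not Slug's code: `fetch_links` is a hypothetical callback that would download a document and return the URIs extracted from it, replaced here by an in-memory dictionary.

```python
from collections import deque


def crawl(seeds, fetch_links, max_documents=100):
    """Breadth-first crawl: visit seeds, then the URIs extracted from each document."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# In-memory stand-in for the web: document -> URIs found in it.
web = {
    "a.rdf": ["b.rdf", "c.rdf"],
    "b.rdf": ["c.rdf", "d.rdf"],
    "c.rdf": [],
    "d.rdf": ["a.rdf"],
}
print(crawl(["a.rdf"], lambda url: web.get(url, [])))
```

The `max_documents` limit matters for the automatically generated data sources mentioned under crawling problems, since such sources can expose an unbounded number of URIs.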
Java Triplestores I
• YARS2 – no longer developed (http://sw.deri.org/2004/06/yars/)
• Jena (http://jena.sourceforge.net/)
– TDB storage (access via API)
– SDB storage (SPARQL endpoint)
• Sesame (http://www.openrdf.org/)
– Sesame Server
– SERQL
• Virtuoso (http://virtuoso.openlinksw.com)
– Unified storage engine (XML, SQL, RDF, Free Text)
– Berlin Benchmark
Java Triplestores II
• JRDF
– 2008 triplestore across Hadoop
– Currently no support for OWL
• Mulgara
– SPARQL, TQL
– Connection API