DESCRIPTION
Presentation on Apache Stanbol (incubating) and related projects given by Olivier Grisel during ApacheCon 2011. More information: - http://incubator.apache.org/stanbol/ - http://www.iks-project.eu
11/7/11
Apache Stanbol (Incubating) and the Web of Data
Olivier Grisel, [email protected], 2011-11-11
My Background
Olivier Grisel - R&D Engineer
Nuxeo: Open Source ECM
European project: IKS
Stuff I do:
o Machine Learning
o Natural Language Processing
o All things data
Agenda
The Web of Data: what, why, how?
CMS integration demo
Semantic Components in Stanbol
Building models for Stanbol
The Web of Data
What, Why, How?
“To a computer, then, the web is a flat, boring world devoid of meaning”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
“This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
“The Semantic Web is not a separate Web but an extension of the current one, in which information
is given well-defined meaning, better enabling computers and people to work in cooperation.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
“Adding semantics to the web involves two things: allowing documents which have information
in machine-readable forms, and allowing links to be created with relationship values.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
The Web of Data – What?
Shared description of the real world
o Structured with vocabularies
o Decentralized
o Scoped by namespaces
o Linked
The Web of Data – Why?
• Strings are ambiguous
o New York / The Big Apple / NYC
o Washington (Person, State, City, Sports Team...)
• Structured context helps humans
o Who is this guy?
o Where is this city?
• Conceptual frame helps machines
o Explicit user intent decoding
o Smarter indexing / search?
Decoding User Intents
• Next Generation User Interfaces
o Siri - conversational interface
o IBM DeepQA: Watson for Health Care
• Tell Google about your stuff
o Publish structured descriptions of your products
o "3 bedrooms flat near Montmartre"
• Useful for non-public data as well
o Intranet query: "ApacheCon slides"
o Intranet query: "Xerox invoices"
o Intranet query: "Xerox salesperson email"
The Web of Data - How?
• RDF / Triple Stores / SPARQL
o Graph stores with dynamic schemas
o Strong interoperability
• JSON-LD
o Upgrade your JSON with scoped vocabularies
o Web / Mobile / JS developer friendly
• RDFa + schema.org & rNews
o Publish annotations in structured markup
o Vocabulary understood by Search Engines
HTML example
<p>
My name is Manu Sporny and you can give me a ring via 1-800-555-0155. <img src="http://manu.sporny.org/images/manu.png" /> I have a <a href="http://manu.sporny.org/">blog</a>.
</p>
RDFa example
<p vocab="http://schema.org/" prefix="foaf: http://xmlns.com/foaf/0.1/" about="#manu" typeof="Person">
My name is <span property="name">Manu Sporny</span> and you can give me a ring via <span property="telephone">1-800-555-0155</span>. <img rel="image" src="http://manu.sporny.org/images/manu.png" /> I have a <a rel="foaf:weblog" href="http://manu.sporny.org/">blog</a>.</p>
JSON-LD example
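The content of this slide did not survive in the transcript. As a stand-in, here is a minimal sketch that expresses the same facts as the RDFa example in JSON-LD, built and serialized from Python. It assumes the keyword syntax of the current JSON-LD specification (`@context`, `@vocab`, `@id`, `@type`), which may differ from the 2011 draft shown in the talk; the function name is hypothetical.

```python
import json


def manu_as_jsonld():
    """Return the facts from the RDFa example as a JSON-LD document (a dict)."""
    return {
        "@context": {
            # Default vocabulary, mirroring vocab="http://schema.org/" in the RDFa.
            "@vocab": "http://schema.org/",
            # Scoped prefix, mirroring the RDFa prefix declaration.
            "foaf": "http://xmlns.com/foaf/0.1/",
        },
        "@id": "#manu",
        "@type": "Person",
        "name": "Manu Sporny",
        "telephone": "1-800-555-0155",
        # Links to other resources are written as objects holding an @id.
        "image": {"@id": "http://manu.sporny.org/images/manu.png"},
        "foaf:weblog": {"@id": "http://manu.sporny.org/"},
    }


if __name__ == "__main__":
    print(json.dumps(manu_as_jsonld(), indent=2))
```

The document is still plain JSON: a client that ignores the `@context` can read `name` and `telephone` directly, which is the "developer friendly" point made on the previous slide.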
[Figure: one panel per year, 2007 to 2011]
Bridging the Web of Data and my CMS
Apache Stanbol
• Enhancer
o Text analysis with Apache OpenNLP / Tika
• EntityHub / ContentHub
o Linked Data Indexing with Apache Solr
o Graph Storage with Apache Clerezza / Jena
• Reasoner / Rules
o Inference with Apache Jena & OWL API
• Components / HTTP Services
o OSGi with Apache Felix / JAX-RS with Jersey
RESTful is Beautiful
Minimalist HTTP Client
curl -X POST -H "Accept: text/turtle" \
     -H "Content-type: text/plain" \
     --data "John Smith was born in London." \
     http://stanbol.demo.nuxeo.com/engines
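The same request can be sketched in Python with only the standard library. The URL and headers are the ones from the curl command; the function names are hypothetical, and the demo host may no longer be reachable, so the request is built separately from the call that sends it.

```python
import urllib.request

# Endpoint taken from the curl command on the slide; it may be offline today.
STANBOL_ENGINES_URL = "http://stanbol.demo.nuxeo.com/engines"


def build_enhance_request(text, url=STANBOL_ENGINES_URL):
    """Build the POST request: plain text in, Turtle-serialized RDF out."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": "text/turtle", "Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, url=STANBOL_ENGINES_URL, timeout=10):
    """Send the text to the enhancer and return the response body as a string."""
    request = build_enhance_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `enhance("John Smith was born in London.")` against a running server would return the enhancement graph as Turtle text.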
[Diagram: Local IT infrastructure (LAN) containing Nuxeo DM (with an addon) and Apache Stanbol (Engine 1, Engine 2, Engine 3), with interactions numbered 1 to 3; data sources shown: DBpedia, Freebase, Geonames, LDAP]
Stanbol Enhancer
Chain of Enhancement Engines
Language Detection (Tika)
Named Entity Detection (OpenNLP)
Linked Data dereferencing (Solr)
Refactoring / Translation (Jena)
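The chain idea can be illustrated with a toy pipeline in which each engine receives a content item and adds its own annotations before passing it on. This is only a sketch of the pattern: the engine functions, the stop-word heuristic, the regular expression and the tiny entity index are all hypothetical stand-ins, not the Stanbol, Tika, OpenNLP or Solr APIs.

```python
import re
from functools import partial


def language_engine(item):
    """Toy language detection: guess English from a few common words."""
    words = set(re.findall(r"[a-z]+", item["text"].lower()))
    item["language"] = "en" if words & {"the", "was", "in", "and"} else "unknown"
    return item


def named_entity_engine(item):
    """Toy entity detection: runs of capitalized words count as mentions."""
    item["mentions"] = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", item["text"])
    return item


def dereference_engine(item, index):
    """Toy dereferencing: look each mention up in a local label-to-URI index."""
    item["entities"] = {m: index[m] for m in item["mentions"] if m in index}
    return item


def run_chain(text, engines):
    """Pass one content item through every engine, in order."""
    item = {"text": text}
    for engine in engines:
        item = engine(item)
    return item


# Hypothetical one-entry index standing in for a local DBpedia index.
TOY_INDEX = {"London": "http://dbpedia.org/resource/London"}
TOY_CHAIN = [
    language_engine,
    named_entity_engine,
    partial(dereference_engine, index=TOY_INDEX),
]
```

Because every engine has the same shape (item in, item out), engines can be added, removed or reordered without touching the others.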
Stanbol EntityHub
• Referenced Sites
o DBpedia
o Geonames
o (NY Times, MusicBrainz, ProductDB, UniProt...)
• Fast local offline indices (Solr)
o Batch indexing utilities for RDF dumps
o Multilingual fulltext search in labels & descriptions
• Vocabulary mapping / merging
Stanbol Reasoner
• RDFS / OWL-lite / OWL2
• Consistency checks
o Cardinality checks: each person has 1 birth date
o Range constraints: birth dates are valid dates
• Materializing types / properties
o Types from subclass: Musician > Artist > Person
o Symmetric property: A worked with B
o Transitive property: A is located in B
• Query-time expansion / inference?
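To make "materializing" concrete, here is a small in-memory sketch of the three inferences named above: types inherited through a subclass chain, a symmetric property, and a transitive property. It is plain Python over tuples, not the Jena or OWL API reasoners that Stanbol relies on, and all function names and sample data are illustrative assumptions.

```python
# Subclass chain from the slide: Musician > Artist > Person.
SUBCLASS_OF = {"Musician": "Artist", "Artist": "Person"}


def materialize_types(direct_type, subclass_of):
    """Return the direct type followed by every superclass it implies."""
    types = [direct_type]
    # Stop at the root, or if a malformed hierarchy loops back on itself.
    while types[-1] in subclass_of and subclass_of[types[-1]] not in types:
        types.append(subclass_of[types[-1]])
    return types


def symmetric_closure(pairs):
    """If (a, b) holds for a symmetric property, (b, a) holds as well."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """If (a, b) and (b, c) hold for a transitive property, add (a, c)."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new
```

For example, `transitive_closure({("Montmartre", "Paris"), ("Paris", "France")})` also contains `("Montmartre", "France")`. Storing these derived facts up front trades index size for simpler queries, which is the alternative to the query-time expansion raised in the last bullet.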
Stanbol Rules
Simple Prolog-like language
uncleRule[
  has(<http://example.org/family.owl#hasParent>, ?x, ?z) .
  has(<http://example.org/family.owl#hasSibling>, ?z, ?y)
  -> has(<http://example.org/family.owl#hasUncle>, ?x, ?y)
]
SPARQL CONSTRUCT or SWRL
PREFIX family: <http://example.org/family.owl#>
CONSTRUCT { ?x family:hasUncle ?y }
WHERE { ?x family:hasParent ?z . ?z family:hasSibling ?y }
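What the rule computes can be sketched as a join over (subject, predicate, object) triples: every parent link is matched with every sibling link that starts where the parent link ends. This is plain Python for illustration, not the Stanbol Rules engine or a SPARQL processor; the function name and sample individuals are hypothetical.

```python
FAMILY = "http://example.org/family.owl#"
HAS_PARENT = FAMILY + "hasParent"
HAS_SIBLING = FAMILY + "hasSibling"
HAS_UNCLE = FAMILY + "hasUncle"


def apply_uncle_rule(triples):
    """Return the hasUncle triples implied by hasParent and hasSibling triples."""
    parents = [(s, o) for s, p, o in triples if p == HAS_PARENT]
    siblings = [(s, o) for s, p, o in triples if p == HAS_SIBLING]
    # Join on ?z: the parent of ?x must be the subject of the sibling triple.
    return {
        (x, HAS_UNCLE, y)
        for x, z in parents
        for z2, y in siblings
        if z == z2
    }
```

Given `("alice", HAS_PARENT, "bob")` and `("bob", HAS_SIBLING, "carl")`, the function returns `{("alice", HAS_UNCLE, "carl")}`, the same triple the CONSTRUCT query would produce.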
Online Demos
Simple analyzer with small index https://stanbol.demo.nuxeo.com
All services deployed http://dev.iks-project.eu:8081
Building Stanbol Enhancer models from Wikipedia
with the Apache data tools
Universal Topic Classification
Use Apache Lucene / Solr MoreLikeThis
to perform a truncated nearest neighbors query
in the TF-IDF vector space of Wikipedia
Universal Topic Classification
• Index text of all articles grouped by topic
• Solr MoreLikeThis query on new document
• DBpedia dumps provide:
o Text summaries for each article
o “subject” relationships between articles and topics
o “broader” / “narrower” SKOS hierarchy between topics
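The approach can be sketched at toy scale in pure Python: one TF-IDF vector per topic, built from the text grouped under that topic, and a query that keeps only the highest-weighted terms of the new document before ranking topics by cosine similarity. This is an in-memory stand-in for the Solr MoreLikeThis handler, not its actual scoring; the toy corpus, the function names and the `max_query_terms` parameter are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical miniature "index": topic name -> text of its articles.
TOY_TOPICS = {
    "Nephrology": "kidney renal disease dialysis",
    "Apollo program": "moon astronaut rocket nasa",
    "Pensions in the United Kingdom": "pension retirement workers benefits",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(topic_texts):
    """Build an IDF table and one TF-IDF vector per topic."""
    counts = {topic: Counter(tokenize(text)) for topic, text in topic_texts.items()}
    doc_freq = Counter(term for tf in counts.values() for term in tf)
    n_topics = len(counts)
    # Smoothed IDF so that every indexed term keeps a positive weight.
    idf = {t: math.log((1 + n_topics) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        topic: {term: count * idf[term] for term, count in tf.items()}
        for topic, tf in counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, idf, vectors, k=5, max_query_terms=25):
    """Return up to k topic names, most similar first."""
    weights = {t: c * idf[t] for t, c in Counter(tokenize(text)).items() if t in idf}
    # Truncation: query with the highest-weighted terms only.
    top_terms = sorted(weights, key=weights.get, reverse=True)[:max_query_terms]
    query = {t: weights[t] for t in top_terms}
    scored = sorted(((cosine(query, vec), topic) for topic, vec in vectors.items()), reverse=True)
    return [topic for score, topic in scored[:k] if score > 0.0]
```

At Wikipedia scale the linear scan over all topics would be far too slow, which is why the talk delegates the same lookup to an inverted index in Lucene / Solr.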
About the Data
500k purely technical categories
“People_with_missing_birth_place”, “Rivers_in_Romania”
70k “semantically grounded” categories
Paths to roots require both “technical” and “grounded” categories
Scale:
o 1.2M topic / topic links
o 30M topic / article links
Some results (Wikinews)
US children who celebrate Independence Day more likely to become Republicans, says Harvard study
o Fireworks
o Voting theory
o Republican Party (United States)
o Statistics
o Electoral systems
Some results (Wikinews)
U.S. space agency NASA sues ex-astronaut
o American astronauts
o Aviation halls of fame
o Edwards Air Force Base
o Apollo program
o Exploration of the Moon
Some results (Wikinews)
Hundreds of thousands of British public sector workers strike over planned pension changes
o Retirement in the United Kingdom
o United Kingdom pensions and benefits
o Pensions in the United Kingdom
o Labor disputes by country
o Labor disputes
Some results (PLoS One)
Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways
o Renal physiology
o Kidney
o Nephrology
o Hypertension
o Membrane biology
Wrap Up
• Web of Data
o brings Structured Context Frame
o to decode User Intention
• NLP + Entities & Topics indices
o to automate Content Enrichment
o to provide Disambiguation
Resources
Documentation, svn, mailing list: http://incubator.apache.org/stanbol
IKS project blog: http://blog.iks-project.eu
Blog posts about Semantic ECM: http://blogs.nuxeo.com/dev/semantic/
Thank you for your attention!
Olivier Grisel
https://twitter.com/ogrisel
Training models for NER from Wikipedia
Extract sentences with link positions in Wikipedia articles
DBpedia to find the type of the target entity (Person, Location, Organization)
Apache Pig scripts to compute the join + format the result as training files for OpenNLP
Apache OpenNLP to build and evaluate the models
Apache Hadoop / Apache Whirr for distributed processing
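The first steps can be sketched on a single sentence: find the wiki links, look up the type of each link target, and emit the sentence in the `<START:type> ... <END>` format that the OpenNLP name finder trainer reads. In the talk this join runs as Apache Pig scripts over full dumps; the version below is a small in-memory Python stand-in, and the function name, the regular expression and the type table (playing the role of the DBpedia lookup) are illustrative assumptions.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def to_opennlp_sentence(sentence, entity_types):
    """Replace typed wiki links with OpenNLP name annotations.

    entity_types maps a link target (article title) to a type label,
    standing in for the DBpedia type lookup.
    """
    def annotate(match):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        entity_type = entity_types.get(target)
        if entity_type is None:
            # Links to untyped articles become plain text.
            return anchor
        return f"<START:{entity_type}> {anchor} <END>"

    return WIKI_LINK.sub(annotate, sentence)


# Hypothetical excerpt of the type table.
TOY_TYPES = {"John Smith (explorer)": "person", "London": "location"}
```

Each output line is one training sentence; sentences are expected to be whitespace-tokenized before they are handed to the trainer.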