8/6/2019 Solutions Linux 2011 Merged
Agenda
A pragmatic introduction to the Semantic Web
Experience report and demos from Nuxeo
Apache tools for Big Linked Data
Wednesday, May 11, 2011
1. Introduction to the Semantic Web
Prelude
Source: Mills Davis, Semantic Social Computing, Sept. 2007
History
Historical perspective
From web 1.0: web of sites and pages, aka the World Wide Web
To web 2.0: web of people and of participation, aka the Social Web (Blogs, RSS, tags, Facebook, Wikipedia, etc.)
To web 3.0: web of data, of meaning and connected knowledge, aka the Semantic Web
Semantics & Ontologies
Some examples
FOAF: relationships between people (social network)
SIOC: relationships between websites, articles, blogs, comments
Rich Snippets: syndicate RDFa content for SEO by Google, Yahoo
GoodRelations: e-commerce (eBay...)
rNews: metadata for news agencies (AFP, Reuters...)
How is it related to the Web?
The traditional Web
A principle: hypertext
A protocol: HTTP
An identification scheme: URNs/URIs
A language: HTML
To a computer, then, the web is a flat, boring world devoid of meaning
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.
Tim Berners-Lee, James Hendler and Ora Lassila, "The Semantic Web", Scientific American, May 2001
The traditional Web
A principle: hypertext
A protocol: HTTP
An identification scheme: URNs/URIs
A language: HTML
The semantic Web
A principle: hypertext
A protocol: HTTP
An identification scheme: URNs/URIs
A language: RDF (replacing HTML)
The W3C Layer Cake
The W3C Layer Cake
Already standardized
URIs and the Web of Things
URIs (Uniform Resource Identifiers) are used to identify things (also called entities) in the real world
For instance: people, places, events, companies, products, movies, etc.
The RDF model
Subject -> Predicate -> Object
RDF is used to describe relationships
between objects, identified by their URIs
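
The triple model can be sketched in a few lines of Python, using plain tuples rather than an RDF library. The DBpedia and FOAF URIs below are illustrative choices, not taken from the slides, and the `objects` helper is a hypothetical name.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Illustrative URIs (assumptions, not from the slides).
TIM = "http://dbpedia.org/resource/Tim_Berners-Lee"
W3C = "http://dbpedia.org/resource/World_Wide_Web_Consortium"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_MEMBER = "http://xmlns.com/foaf/0.1/member"

graph = {
    Triple(TIM, FOAF_NAME, "Tim Berners-Lee"),  # the object is a literal
    Triple(W3C, FOAF_MEMBER, TIM),              # the object is another resource
}


def objects(graph, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return sorted(
        t.obj for t in graph
        if t.subject == subject and t.predicate == predicate
    )


print(objects(graph, TIM, FOAF_NAME))
```

Because subjects, predicates and resource objects are all URIs, two independent data sets that use the same URI for a thing can be merged by simply taking the union of their triples.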
Example
Source: http://www.slideshare.net/AntidotNet/web-smantique-web-de-donnes-web-30-linked-data-quelques-repres-pour-sy-retrouver
RDF serialization
As XML:
Others, e.g. N3:
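
The serialized examples shown on the original slide are not part of this transcript. As a stand-in, here is a minimal Python sketch that writes one statement in N-Triples, a simple line-based syntax from the same family as N3. The `example.org` subject and the function names are hypothetical.

```python
def escape_literal(text: str) -> str:
    """Escape the characters that may not appear raw inside a quoted literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Serialize one statement as a single N-Triples line."""
    obj_term = f'"{escape_literal(obj)}"' if literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


line = to_ntriples(
    "http://example.org/people/alice",  # hypothetical subject URI
    "http://xmlns.com/foaf/0.1/name",
    "Alice",
    literal=True,
)
print(line)
```

The sketch ignores language tags, datatypes and blank nodes; a real application would normally rely on an RDF library for serialization.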
SPARQL
Query language for RDF databases
Several implementations
OSS: Apache Jena, Sesame, 4Store, Virtuoso, Mulgara, Redland, Open Anzo...
Proprietary: 5Store, AllegroGraph RDFStore, Stardog, Dydra, OWLIM...
More expressive than SQL; scalability is still an open question
SPARQL Sample
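
The query shown on the original slide did not survive in this transcript. The sketch below uses a hypothetical query instead and evaluates its two triple patterns with a toy pure-Python matcher, to illustrate how SPARQL variables are bound by joining patterns. The `example.org` URIs, the data and the `match` / `solve` helpers are all assumptions; a real triple store uses indexes and a query planner rather than nested loops.

```python
# A hypothetical query (not the sample from the slide):
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
  ?person foaf:knows <http://example.org/alice> .
  ?person foaf:name ?name .
}
"""

FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/alice"
BOB = "http://example.org/bob"
CAROL = "http://example.org/carol"

TRIPLES = [
    (BOB, FOAF + "knows", ALICE),
    (BOB, FOAF + "name", "Bob"),
    (CAROL, FOAF + "name", "Carol"),
]


def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable: bind it, or check the earlier binding
            if result.setdefault(term, value) != value:
                return None
        elif term != value:       # constant: must match exactly
            return None
    return result


def solve(patterns, triples, bindings=None):
    """Yield every binding that satisfies all patterns (a nested-loop join)."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    for triple in triples:
        extended = match(patterns[0], triple, bindings)
        if extended is not None:
            yield from solve(patterns[1:], triples, extended)


# The WHERE clause of QUERY, written as Python tuples.
patterns = [
    ("?person", FOAF + "knows", ALICE),
    ("?person", FOAF + "name", "?name"),
]
names = [b["?name"] for b in solve(patterns, TRIPLES)]
print(names)
```

Carol has a name but no `foaf:knows` link to Alice, so only Bob satisfies both patterns.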
Where and how to find these data?
Solution 1: Lift
One can use HTML scraping and natural language processing (NLP) techniques to extract semantic information from existing content / sites
Generic solutions: OpenCalais, Zemanta, Apache Stanbol
Pro: no need to change existing content
Con: error-prone, needs human checks
Example: DBpedia
Solution 2: export
RDFa and microformats are used to embed semantic information (expressed using the RDF model) into regular web pages
RDFa does it using existing (rel) and additional (about, property, typeof) attributes
Microformats only use usual HTML attributes (class)
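
To make the RDFa attributes concrete, here is a deliberately simplified Python sketch that reads `about`, `typeof`, `property` and `rel` from a small HTML fragment and collects triples. The HTML fragment, the `example.org` URIs and the `TinyRDFaParser` class are hypothetical. Real RDFa processing has many more rules (prefix resolution, subject scoping through nesting, chaining), so this is not a conforming RDFa parser.

```python
from html.parser import HTMLParser

HTML = """
<div about="http://example.org/alice" typeof="foaf:Person">
  <span property="foaf:name">Alice</span>
  <a rel="foaf:homepage" href="http://alice.example.org/">home page</a>
</div>
"""


class TinyRDFaParser(HTMLParser):
    """Collects (subject, predicate, object) tuples from a small RDFa subset."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.subject = None  # set by the most recent `about` attribute
        self.pending = None  # a `property` still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "typeof" in attrs:
            self.triples.append((self.subject, "rdf:type", attrs["typeof"]))
        if "rel" in attrs and "href" in attrs:
            self.triples.append((self.subject, attrs["rel"], attrs["href"]))
        if "property" in attrs:
            self.pending = attrs["property"]

    def handle_data(self, data):
        # The text inside a `property` element becomes the literal object.
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


parser = TinyRDFaParser()
parser.feed(HTML)
parser.close()
for triple in parser.triples:
    print(triple)
```

Predicates are kept as prefixed names such as `foaf:name`; expanding them to full URIs would require handling the prefix declarations, which the sketch leaves out.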
Solution 3: reuse
Linked Open Data: (usually large) data repositories available on the web (for free or not), expressed using the RDF model
Interoperability between these repositories
(their ontologies) must be defined
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Linked Open Data in 2007
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
2008
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
2009
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
2010
Good for Enterprise apps too!
Diagram source: http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/
Why now?
Key Enablers
Open Data and Linked Online Data
Advances in automatic content analysis (linguistics, image processing) and machine learning
Classical logic and classical AI
Computing power (Moore's law + MapReduce)
The technologies and data are available,
Let's put them to use!
2. Nuxeo & Semantic ECM
Nuxeo: an open source ECM vendor
Our Focus is Enterprise Content Management
ECM as a Platform for Content Applications
Open Source as Efficient Development Model
Modern architecture for 21st Century business
Lean, mobile, social, interoperable
A Social Marketplace in action
Innovation driven by community of customers, partners, and our core developers
Major Customers
Goals for Semantic ECM
Repurpose existing content better
Improve search and collaboration
Make information more contextual
Extract and use information from content
Leverage Open and Linked Data, contribute
Make ECM users' content smarter!
> Gain efficiency, effectiveness and strategic positioning on the ECM market
Demo
IKS project
European project under FP7, with 13 partners (6 SMEs) and an 8.5 MEUR budget
Goal: create a semantic software stack that will be used by CMS vendors to add semantic features to their products
Started in Jan. 2009, will last until Dec. 2012
First tangible result: Apache Stanbol, already integrated in a Nuxeo plugin
The Semantic Engine
From unstructured content to Knowledge
Language guessing
Topic classification (Business, Sports, Media, ...)
Named Entities extraction and linking
Relationships and properties extraction
= Semantic Engines (Apache OpenNLP)
+ Fast Linked Data local index (Apache Solr)
+ Semantic Rule Engine (Apache Jena)
[Diagram: numbered steps 1, 2, 3; data sources: DBpedia, Freebase, Geonames, LDAP]
3. Apache tools for processing Big and/or Linked Data
Training statistical models for NER (named entity recognition) with Wikipedia and DBpedia
Extract sentences with link positions in Wikipedia articles
DBpedia to find the type of the target entity (Person, Location, Organization)
Apache Pig scripts to compute the join + format the result as training files for OpenNLP
Apache OpenNLP to build and evaluate the models
Apache Hadoop for distributed processing
Apache Whirr for deployment and management on Amazon EC2 cluster
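
The join and formatting steps above can be sketched on a single sentence in Python. This is not the Pig code used in the talk: the `annotate` function, the sample sentence, the link offsets and the type mapping are all hypothetical, and only the `<START:type> ... <END>` markup follows the OpenNLP name finder training format.

```python
def annotate(sentence, links, entity_types):
    """Wrap linked spans in OpenNLP name finder tags.

    sentence     -- sentence text, already tokenized with spaces
    links        -- list of (start, end, target) character spans of wiki links
    entity_types -- mapping from link target to a type such as "person"
    """
    out, cursor = [], 0
    for start, end, target in sorted(links):
        etype = entity_types.get(target)
        if etype is None:  # target has no known type: leave it as plain text
            continue
        out.append(sentence[cursor:start])
        out.append(f" <START:{etype}> {sentence[start:end]} <END> ")
        cursor = end
    out.append(sentence[cursor:])
    # OpenNLP expects whitespace-separated tokens; collapse the extra spaces.
    return " ".join("".join(out).split())


sentence = "Tim Berners-Lee founded the W3C in Cambridge ."
links = [
    (0, 15, "Tim_Berners-Lee"),
    (28, 31, "World_Wide_Web_Consortium"),
    (35, 44, "Cambridge,_Massachusetts"),
]
# In the pipeline this mapping would come from the DBpedia type data.
entity_types = {
    "Tim_Berners-Lee": "person",
    "World_Wide_Web_Consortium": "organization",
    "Cambridge,_Massachusetts": "location",
}
print(annotate(sentence, links, entity_types))
```

At Wikipedia scale the same join between link targets and entity types is what the Pig scripts distribute over a Hadoop cluster.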
Training statistical models for topic classification from Wikipedia and DBpedia
Filter category tree from DBpedia SKOS entries (~500k)
Pig scripts to compute the joins with article abstracts for all the articles categorized in Wikipedia
Export as 2.8GB TSV file to be indexed in Apache Solr
Use Solr MoreLikeThisHandler to find the top 5 most related Wikipedia categories for any kind of text
Apache Whirr & Hadoop for deployment and management on Amazon EC2 cluster
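
The idea of ranking categories by textual similarity can be illustrated with a small stand-in: a TF-IDF cosine similarity in pure Python. This is not Solr's MoreLikeThisHandler and does not reproduce its scoring; the three categories, their texts and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical category "documents": in the talk these are built from the
# abstracts of the articles belonging to each Wikipedia category.
CATEGORIES = {
    "Category:Football": "football club league goal match stadium player",
    "Category:Finance": "bank market stock investment fund interest rate",
    "Category:Cinema": "film director actor cinema festival movie",
}


def tfidf(tokens, doc_freq, n_docs):
    """Weight each term by its count and by how rare it is across categories."""
    counts = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1)
        for term, count in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_categories(text, categories, k=5):
    """Return up to k category names, most similar to `text` first."""
    docs = {name: body.lower().split() for name, body in categories.items()}
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    vectors = {n: tfidf(tokens, doc_freq, len(docs)) for n, tokens in docs.items()}
    query = tfidf(text.lower().split(), doc_freq, len(docs))
    scored = sorted(
        ((cosine(query, vec), name) for name, vec in vectors.items()),
        reverse=True,
    )
    return [name for score, name in scored[:k] if score > 0]


print(top_categories("The stock market fell as the bank raised its interest rate", CATEGORIES))
```

With hundreds of thousands of categories, a linear scan like this one is too slow, which is where an inverted index such as Solr becomes useful.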
What's next?
Integrate the R&D results into Stanbol / Nuxeo
Work on user interface / high-level JavaScript toolkits for Linked Data editing
http://github.com/bergie/VIE based on backbone.js
Experiment / Integrate / Refine
Resources
http://iks-project.eu
http://stanbol.demo.nuxeo.com
http://incubator.apache.org/stanbol
http://blogs.nuxeo.com/dev
http://hadoop.apache.org/
http://incubator.apache.org/opennlp/
http://github.com/ogrisel/pignlproc
http://incubator.apache.org/projects/whirr.html