Transcript
Page 1: Apache Stanbol and the Web of Data - ApacheCon 2011

11/7/11

Apache Stanbol (Incubating) and the Web of Data

Olivier Grisel, [email protected], 2011-11-11

Page 2: Apache Stanbol and the Web of Data - ApacheCon 2011

My Background

Olivier Grisel - R&D Engineer

Nuxeo - Open Source ECM

European project: IKS

Stuff I do:
o Machine Learning
o Natural Language Processing
o All things data

Page 3: Apache Stanbol and the Web of Data - ApacheCon 2011

Agenda

The Web of Data: what, why, how?

CMS integration demo

Semantic Components in Stanbol

Building models for Stanbol

Page 4: Apache Stanbol and the Web of Data - ApacheCon 2011

The Web of Data

What, Why, How?

Page 5: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 6: Apache Stanbol and the Web of Data - ApacheCon 2011

“To a computer, then, the web is a flat, boring world devoid of meaning”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 7: Apache Stanbol and the Web of Data - ApacheCon 2011

“This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 8: Apache Stanbol and the Web of Data - ApacheCon 2011

“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 9: Apache Stanbol and the Web of Data - ApacheCon 2011

“Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 10: Apache Stanbol and the Web of Data - ApacheCon 2011

The Web of Data – What?

Shared description of the real world

o Structured with vocabularies
o Decentralized
o Scoped by namespaces
o Linked

Page 11: Apache Stanbol and the Web of Data - ApacheCon 2011

The Web of Data – Why?

• Strings are ambiguous
  o New York / The Big Apple / NYC
  o Washington (Person, State, City, Sports Team...)
• Structured context helps humans
  o Who is this guy?
  o Where is this city?
• Conceptual frame helps machines
  o Explicit user intent decoding
  o Smarter indexing / search?

Page 12: Apache Stanbol and the Web of Data - ApacheCon 2011

Decoding User Intents

Page 13: Apache Stanbol and the Web of Data - ApacheCon 2011

Decoding User Intents

• Next Generation User Interfaces
  o Siri - conversational interface
  o IBM DeepQA: Watson for Health Care
• Tell Google about your stuff
  o Publish structured description of your products
  o "3 bedrooms flat near Montmartre"
• Useful for non-public data as well
  o Intranet query: "ApacheCon slides"
  o Intranet query: "Xerox invoices"
  o Intranet query: "Xerox salesperson email"

Page 14: Apache Stanbol and the Web of Data - ApacheCon 2011

The Web of Data - How?

• RDF / Triple Stores / SPARQL
  o Graph stores with dynamic schemas
  o Strong interoperability
• JSON-LD
  o Upgrade your JSON with scoped vocabularies
  o Web / Mobile / JS developer friendly
• RDFa + schema.org & rNews
  o Publish annotation in structured markup
  o Vocabulary understood by Search Engines

Page 15: Apache Stanbol and the Web of Data - ApacheCon 2011

HTML example

<p>
  My name is Manu Sporny and you can give me a ring via
  1-800-555-0155.
  <img src="http://manu.sporny.org/images/manu.png" />
  I have a <a href="http://manu.sporny.org/">blog</a>.
</p>

Page 16: Apache Stanbol and the Web of Data - ApacheCon 2011

RDFa example

<p vocab="http://schema.org/"
   prefix="foaf: http://xmlns.com/foaf/0.1/"
   about="#manu" typeof="Person">
  My name is <span property="name">Manu Sporny</span>
  and you can give me a ring via
  <span property="telephone">1-800-555-0155</span>.
  <img rel="image"
       src="http://manu.sporny.org/images/manu.png" />
  I have a <a rel="foaf:weblog"
     href="http://manu.sporny.org/">blog</a>.
</p>

Page 17: Apache Stanbol and the Web of Data - ApacheCon 2011

JSON-LD example
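
The body of this slide did not survive extraction. As a stand-in, the following is a minimal sketch of how the facts from the RDFa example could be written as JSON-LD, built and serialized from Python. The `@context`, `@vocab`, `@id` and `@type` keywords are those of the current JSON-LD syntax; the choice of properties simply mirrors the RDFa slide and is an assumption, not the content of the original slide.

```python
import json

# Hypothetical JSON-LD rendering of the facts in the RDFa example.
person = {
    "@context": {
        # Unprefixed keys are resolved against schema.org.
        "@vocab": "http://schema.org/",
        # Prefix for the one FOAF property used below.
        "foaf": "http://xmlns.com/foaf/0.1/",
    },
    "@id": "#manu",
    "@type": "Person",
    "name": "Manu Sporny",
    "telephone": "1-800-555-0155",
    # Values wrapped in {"@id": ...} are links, not plain strings.
    "image": {"@id": "http://manu.sporny.org/images/manu.png"},
    "foaf:weblog": {"@id": "http://manu.sporny.org/"},
}

print(json.dumps(person, indent=2))
```

The document stays plain JSON: a client that ignores the `@`-prefixed keys can still read `name` and `telephone` directly.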

Page 18: Apache Stanbol and the Web of Data - ApacheCon 2011

[Figure: panels labeled 2007, 2008, 2009, 2010]

Page 19: Apache Stanbol and the Web of Data - ApacheCon 2011

[Figure: panel labeled 2011]

Page 20: Apache Stanbol and the Web of Data - ApacheCon 2011

Bridging the Web of Data and my CMS

Page 21: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 22: Apache Stanbol and the Web of Data - ApacheCon 2011

Apache Stanbol

• Enhancer
  o Text analysis with Apache OpenNLP / Tika
• EntityHub / ContentHub
  o Linked Data Indexing with Apache Solr
  o Graph Storage with Apache Clerezza / Jena
• Reasoner / Rules
  o Inference with Apache Jena & OWL API
• Components / HTTP Services
  o OSGi with Apache Felix / JAX-RS with Jersey

Page 23: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 24: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 25: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 26: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 27: Apache Stanbol and the Web of Data - ApacheCon 2011

RESTful is Beautiful

Page 28: Apache Stanbol and the Web of Data - ApacheCon 2011

Minimalist HTTP Client

curl -X POST -H "Accept: text/turtle" \
     -H "Content-type: text/plain" \
     --data "John Smith was born in London." \
     http://stanbol.demo.nuxeo.com/engines
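
The same request can be issued from a script. The sketch below rebuilds the curl call with Python's standard `urllib.request`; the helper names `build_enhance_request` and `enhance` are hypothetical, and the demo endpoint is copied from the slide and may no longer be online, so only the request construction is exercised here.

```python
import urllib.request

# Endpoint taken from the curl command on the slide; it may be offline.
ENGINES_URL = "http://stanbol.demo.nuxeo.com/engines"


def build_enhance_request(text, url=ENGINES_URL):
    """Build (but do not send) the POST request made by the curl command."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": "text/turtle", "Content-type": "text/plain"},
        method="POST",
    )


def enhance(text, url=ENGINES_URL):
    """Send the text to the enhancer and return the response body as text."""
    with urllib.request.urlopen(build_enhance_request(text, url)) as response:
        return response.read().decode("utf-8")


request = build_enhance_request("John Smith was born in London.")
print(request.get_method(), request.full_url)
```

Calling `enhance("John Smith was born in London.")` against a running server should return the enhancements in the format named by the `Accept` header.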

Page 29: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 30: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 31: Apache Stanbol and the Web of Data - ApacheCon 2011

[Diagram labels: Local IT infrastructure (LAN); Nuxeo DM with addon; Apache Stanbol; Engine 1, Engine 2, Engine 3; DBpedia; Freebase; Geonames; LDAP; steps numbered 1 to 3]

Page 32: Apache Stanbol and the Web of Data - ApacheCon 2011

Stanbol Enhancer

Chain of Enhancement Engines

Language Detection (Tika)

Named Entity Detection (OpenNLP)

Linked Data dereferencing (Solr)

Refactoring / Translation (Jena)
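
To make the chain idea concrete, here is a toy sketch in which each engine is a plain function that reads a shared content item and appends to its list of enhancements. Nothing here is Stanbol's API: the engine functions, the stopword lists, the gazetteer and the local index are all hypothetical stand-ins for Tika, OpenNLP and Solr.

```python
# Toy stand-ins for the engines listed above; none of this is Stanbol's API.
STOPWORDS = {"en": {"the", "was", "in", "and"}, "fr": {"le", "la", "est", "et"}}
GAZETTEER = {"John Smith": "Person", "London": "Place"}
LOCAL_INDEX = {"London": "http://dbpedia.org/resource/London"}


def detect_language(item):
    """Guess the language by counting stopwords (stand-in for Tika)."""
    words = set(item["text"].lower().split())
    scores = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
    item["enhancements"].append({"language": max(scores, key=scores.get)})


def detect_entities(item):
    """Find known names by substring match (stand-in for OpenNLP)."""
    for name, kind in GAZETTEER.items():
        if name in item["text"]:
            item["enhancements"].append({"mention": name, "type": kind})


def dereference(item):
    """Attach an entity URI to mentions found in the index (stand-in for Solr)."""
    for enhancement in item["enhancements"]:
        uri = LOCAL_INDEX.get(enhancement.get("mention"))
        if uri:
            enhancement["entity"] = uri


def run_chain(text, engines):
    """Pass one content item through each engine in order."""
    item = {"text": text, "enhancements": []}
    for engine in engines:
        engine(item)
    return item


result = run_chain(
    "John Smith was born in London.",
    [detect_language, detect_entities, dereference],
)
for enhancement in result["enhancements"]:
    print(enhancement)
```

Order matters in such a chain: dereferencing only has something to link once entity detection has run.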

Page 33: Apache Stanbol and the Web of Data - ApacheCon 2011

Stanbol EntityHub

• Referenced Sites
  o DBpedia
  o Geonames
  o (NY Times, MusicBrainz, ProductDB, UniProt...)
• Fast local offline indices (Solr)
  o Batch indexing utilities for RDF dumps
  o Multilingual fulltext search in labels & descriptions
• Vocabulary mapping / merging

Page 34: Apache Stanbol and the Web of Data - ApacheCon 2011

Stanbol Reasoner

• RDFS / OWL-lite / OWL2
• Consistency checks
  o Cardinality checks: each person has 1 birth date
  o Range constraints: birth dates are valid dates
• Materializing types / properties
  o Types from subclass: Musician > Artist > Person
  o Symmetric property: A worked with B
  o Transitive property: A is located in B
• Query-time expansion / inference?
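
As an illustration of what "materializing" means for the three cases above, here is a small pure-Python sketch. It is not how Jena or the OWL API implement inference; the function names and the sample facts are hypothetical, and the subclass mapping is assumed to be acyclic with a single parent per class.

```python
def superclasses(cls, subclass_of):
    """Return every ancestor of cls in a child -> parent mapping."""
    found = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        found.append(cls)
    return found


def symmetric_closure(pairs):
    """Add (b, a) for every (a, b)."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) hold, until nothing new appears."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for a, b in closure for c, d in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


SUBCLASS_OF = {"Musician": "Artist", "Artist": "Person"}
LOCATED_IN = {("Montmartre", "Paris"), ("Paris", "France")}

print(superclasses("Musician", SUBCLASS_OF))
print(symmetric_closure({("A", "B")}))
print(transitive_closure(LOCATED_IN))
```

Materializing these facts up front trades storage for query speed; the alternative raised in the last bullet is to expand the query instead.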

Page 35: Apache Stanbol and the Web of Data - ApacheCon 2011

Stanbol Rules

Simple Prolog-like language

uncleRule[
  has(<http://example.org/family.owl#hasParent>, ?x, ?z) .
  has(<http://example.org/family.owl#hasSibling>, ?z, ?y)
  -> has(<http://example.org/family.owl#hasUncle>, ?x, ?y)
]

SPARQL CONSTRUCT or SWRL

PREFIX family: <http://example.org/family.owl#>
CONSTRUCT { ?x family:hasUncle ?y }
WHERE { ?x family:hasParent ?z . ?z family:hasSibling ?y }
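
Read operationally, the uncle rule is a join on the shared variable `?z`. The sketch below applies that join to a set of triples in plain Python; the function name `apply_uncle_rule` and the individuals `alice`, `bob` and `carl` are hypothetical, and this is an illustration of the rule's meaning rather than the way Stanbol Rules executes it.

```python
FAMILY = "http://example.org/family.owl#"


def apply_uncle_rule(triples):
    """Derive hasUncle triples: x hasParent z, z hasSibling y => x hasUncle y."""
    parents = [(s, o) for s, p, o in triples if p == FAMILY + "hasParent"]
    siblings = [(s, o) for s, p, o in triples if p == FAMILY + "hasSibling"]
    return {
        (x, FAMILY + "hasUncle", y)
        for x, z in parents
        for z2, y in siblings
        if z == z2  # join on the shared variable ?z
    }


# Hypothetical facts: bob is alice's parent, carl is bob's sibling.
triples = {
    ("alice", FAMILY + "hasParent", "bob"),
    ("bob", FAMILY + "hasSibling", "carl"),
}
print(apply_uncle_rule(triples))
```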

Page 36: Apache Stanbol and the Web of Data - ApacheCon 2011

Online Demos

Simple analyzer with small index https://stanbol.demo.nuxeo.com

All services deployed http://dev.iks-project.eu:8081

Page 37: Apache Stanbol and the Web of Data - ApacheCon 2011

Building Stanbol Enhancer models from Wikipedia

with the Apache data tools

Page 38: Apache Stanbol and the Web of Data - ApacheCon 2011

Universal Topic Classification

Use Apache Lucene / Solr MoreLikeThis to perform a truncated nearest neighbors query in the TF-IDF vector space of Wikipedia
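
The idea can be sketched without Lucene. The code below weights terms by TF-IDF, keeps only the top-weighted terms of the query document (the "truncated" part), and ranks topics by cosine similarity. It approximates the concept only and does not reproduce Lucene's scoring; the function names, the `max_query_terms` parameter and the three-topic corpus are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    df = Counter(term for text in docs.values() for term in set(tokenize(text)))
    return {term: math.log((1 + len(docs)) / (1 + n)) + 1 for term, n in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unknown to the corpus are dropped."""
    counts = Counter(tokenize(text))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def more_like_this(query_text, docs, max_query_terms=3):
    """Rank docs against the query, keeping only its top-weighted terms."""
    idf = build_idf(docs)
    query = tfidf(query_text, idf)
    top = dict(sorted(query.items(), key=lambda kv: -kv[1])[:max_query_terms])
    scores = {name: cosine(top, tfidf(text, idf)) for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical three-topic corpus standing in for Wikipedia.
TOPICS = {
    "Nephrology": "kidney renal disease dialysis kidney",
    "Astronauts": "nasa astronaut space shuttle mission",
    "Pensions": "pension retirement benefits workers strike",
}
print(more_like_this("nasa sues former astronaut", TOPICS))
```

Truncating the query to a few high-weight terms is what keeps the lookup cheap on an index the size of Wikipedia.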

Page 39: Apache Stanbol and the Web of Data - ApacheCon 2011

Universal Topic Classification

• Index text of all articles grouped by topic
• Solr MoreLikeThis query on new document
• DBpedia dumps provide:
  o Text summaries for each article
  o “subject” relationships between articles and topics
  o “broader” / “narrower” SKOS hierarchy between topics
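
The grouping step can be sketched as a join between the article summaries and the "subject" links, producing one pseudo-document per topic that can then be indexed. The function name and the miniature data below are hypothetical; the real inputs would be the DBpedia dumps mentioned above.

```python
from collections import defaultdict


def group_abstracts_by_topic(abstracts, subjects):
    """Concatenate article summaries into one pseudo-document per topic."""
    grouped = defaultdict(list)
    for article, topic in subjects:
        if article in abstracts:  # skip links to articles without a summary
            grouped[topic].append(abstracts[article])
    return {topic: " ".join(texts) for topic, texts in grouped.items()}


# Hypothetical miniature of the DBpedia summary and "subject" dumps.
abstracts = {
    "Kidney": "The kidneys are organs that filter blood.",
    "Dialysis": "Dialysis removes waste from blood when kidneys fail.",
}
subjects = [
    ("Kidney", "Nephrology"),
    ("Dialysis", "Nephrology"),
    ("Kidney", "Organs"),
]

topic_docs = group_abstracts_by_topic(abstracts, subjects)
print(topic_docs["Nephrology"])
```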

Page 40: Apache Stanbol and the Web of Data - ApacheCon 2011

About the Data

• 500k purely technical categories
  o “People_with_missing_birth_place”, “Rivers_in_Romania”
• 70k “semantically grounded” categories
• Paths to roots require both “technical” and “grounded” categories
• Scale:
  o 1.2M topic / topic links
  o 30M topic / article links

Page 41: Apache Stanbol and the Web of Data - ApacheCon 2011

Some results (Wikinews)

US children who celebrate Independence Day more likely to become Republicans, says Harvard study

• Fireworks
• Voting theory
• Republican Party (United States)
• Statistics
• Electoral systems

Page 42: Apache Stanbol and the Web of Data - ApacheCon 2011

Some results (Wikinews)

U.S. space agency NASA sues ex-astronaut

• American astronauts
• Aviation halls of fame
• Edwards Air Force Base
• Apollo program
• Exploration of the Moon

Page 43: Apache Stanbol and the Web of Data - ApacheCon 2011

Some results (Wikinews)

Hundreds of thousands of British public sector workers strike over planned pension changes

• Retirement in the United Kingdom
• United Kingdom pensions and benefits
• Pensions in the United Kingdom
• Labor disputes by country
• Labor disputes

Page 44: Apache Stanbol and the Web of Data - ApacheCon 2011

Some results (PLoS One)

Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways

• Renal physiology
• Kidney
• Nephrology
• Hypertension
• Membrane biology

Page 45: Apache Stanbol and the Web of Data - ApacheCon 2011

Wrap Up

• Web of Data
  o brings Structured Context Frame
  o to decode User Intention
• NLP + Entities & Topics indices
  o to automate Content Enrichment
  o to provide Disambiguation

Page 46: Apache Stanbol and the Web of Data - ApacheCon 2011

Resources

Documentation, svn, mailing list:  http://incubator.apache.org/stanbol

IKS project blog:  http://blog.iks-project.eu

Blog posts about Semantic ECM:  http://blogs.nuxeo.com/dev/semantic/

Page 47: Apache Stanbol and the Web of Data - ApacheCon 2011

Thank you for your attention!

Olivier Grisel

[email protected]

https://twitter.com/ogrisel

Page 48: Apache Stanbol and the Web of Data - ApacheCon 2011

Training models for NER from Wikipedia

Extract sentences with link positions in Wikipedia articles

DBpedia to find the type of the target entity (Person, Location, Organization)

Apache Pig scripts to compute the join + format the result as training files for OpenNLP

Apache OpenNLP to build and evaluate the models

Apache Hadoop / Apache Whirr for distributed processing
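
The formatting step at the end of this pipeline can be illustrated on a single sentence. The sketch below turns a tokenized sentence plus its typed link spans into one line of the OpenNLP name finder training format, which marks entities with `<START:type> ... <END>`. The function name and the input representation (token index spans with exclusive end) are assumptions; the talk describes doing this at scale with Pig scripts, not with this code.

```python
def to_opennlp_line(tokens, links):
    """Wrap linked token spans in OpenNLP name finder markup.

    tokens: list of words; links: list of (start, end, entity_type) token
    spans with end exclusive, assumed non-empty and non-overlapping.
    """
    starts = {start: kind for start, _, kind in links}
    ends = {end for _, end, _ in links}
    out = []
    for i, token in enumerate(tokens):
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
        if i + 1 in ends:
            out.append("<END>")
    return " ".join(out)


# Hypothetical sentence; the types would come from a DBpedia lookup.
tokens = "John Smith was born in London .".split()
links = [(0, 2, "person"), (5, 6, "location")]
print(to_opennlp_line(tokens, links))
```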

Page 49: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 50: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 51: Apache Stanbol and the Web of Data - ApacheCon 2011
Page 52: Apache Stanbol and the Web of Data - ApacheCon 2011
