Apache Stanbol and the Web of Data - ApacheCon 2011

Presentation on Apache Stanbol (incubating) and related projects given by Olivier Grisel during ApacheCon 2011. More information: http://incubator.apache.org/stanbol/ and http://www.iks-project.eu

11/7/11

Apache Stanbol (Incubating) and the Web of Data

Olivier Grisel, Nuxeo
ogrisel@apache.org, 2011-11-11

My Background

Olivier Grisel - R&D Engineer

Nuxeo - Open Source ECM

European project: IKS

Stuff I do:
  o Machine Learning
  o Natural Language Processing
  o All things data

Agenda

The Web of Data: what, why, how?

CMS integration demo

Semantic Components in Stanbol

Building models for Stanbol

The Web of Data

What, Why, How?

“To a computer, then, the web is a flat, boring world devoid of meaning”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

“This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

“Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

The Web of Data – What?

Shared description of the real world

  o Structured with vocabularies
  o Decentralized
  o Scoped by namespaces
  o Linked

The Web of Data – Why?

• Strings are ambiguous
  o New York / The Big Apple / NYC
  o Washington (Person, State, City, Sports Team...)
• Structured context helps humans
  o Who is this guy?
  o Where is this city?
• Conceptual frame helps machines
  o Explicit user intent decoding
  o Smarter indexing / search?

Decoding User Intents

• Next Generation User Interfaces
  o Siri - conversational interface
  o IBM DeepQA: Watson for Health Care
• Tell Google about your stuff
  o Publish structured prediction of your products
  o "3 bedrooms flat near Montmartre"
• Useful for non-public data as well
  o Intranet query: "ApacheCon slides"
  o Intranet query: "Xerox invoices"
  o Intranet query: "Xerox salesperson email"

The Web of Data - How?

• RDF / Triple Stores / SPARQL
  o Graph stores with dynamic schemas
  o Strong interoperability
• JSON-LD
  o Upgrade your JSON with scoped vocabularies
  o Web / Mobile / JS developer friendly
• RDFa + schema.org & rNews
  o Publish annotation in structured markup
  o Vocabulary understood by Search Engines

HTML example

<p>
  My name is Manu Sporny and you can give me a ring via
  1-800-555-0155.
  <img src="http://manu.sporny.org/images/manu.png" />
  I have a <a href="http://manu.sporny.org/">blog</a>.
</p>

RDFa example

<p vocab="http://schema.org/"
   prefix="foaf: http://xmlns.com/foaf/0.1/"
   about="#manu" typeof="Person">
  My name is <span property="name">Manu Sporny</span>
  and you can give me a ring via
  <span property="telephone">1-800-555-0155</span>.
  <img rel="image"
       src="http://manu.sporny.org/images/manu.png" />
  I have a <a rel="foaf:weblog"
     href="http://manu.sporny.org/">blog</a>.
</p>

JSON-LD example
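
As a rough illustration, the person described in the RDFa example could be expressed as a JSON-LD document along the following lines. This is only a sketch: it uses the "@context" / "@vocab" syntax later standardized in JSON-LD 1.0, which may differ from the draft syntax available in 2011, and the choice of keys is an assumption, not the content of the original slide.

```python
import json

# Illustrative JSON-LD document (assumed layout, not taken from the slide).
# "@context" scopes the plain JSON keys with schema.org and FOAF vocabularies.
person = {
    "@context": {
        "@vocab": "http://schema.org/",
        "foaf": "http://xmlns.com/foaf/0.1/",
        # Values of these keys are links (IRIs), not plain strings.
        "image": {"@type": "@id"},
        "foaf:weblog": {"@type": "@id"},
    },
    "@id": "#manu",
    "@type": "Person",
    "name": "Manu Sporny",
    "telephone": "1-800-555-0155",
    "image": "http://manu.sporny.org/images/manu.png",
    "foaf:weblog": "http://manu.sporny.org/",
}

document = json.dumps(person, indent=2)
print(document)
```

Apart from the "@"-prefixed keys, the document stays ordinary JSON, which is what makes the format friendly to web and mobile developers.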

[Figure: image panels labeled 2007, 2008, 2009, 2010, 2011]

Bridging the Web of Data and my CMS

Apache Stanbol

• Enhancer
  o Text analysis with Apache OpenNLP / Tika
• EntityHub / ContentHub
  o Linked Data Indexing with Apache Solr
  o Graph Storage with Apache Clerezza / Jena
• Reasoner / Rules
  o Inference with Apache Jena & OWLApi
• Components / HTTP Services
  o OSGi with Apache Felix / JAX-RS with Jersey

RESTful is Beautiful

Minimalist HTTP Client

curl -X POST -H "Accept: text/turtle" \
     -H "Content-type: text/plain" \
     --data "John Smith was born in London." \
     http://stanbol.demo.nuxeo.com/engines
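
The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the function names `build_enhance_request` and `enhance` are hypothetical helpers introduced here, and the endpoint URL is the demo server quoted on the slide, which may no longer be reachable.

```python
import urllib.request

# Endpoint quoted on the slide; availability of this demo server is not guaranteed.
STANBOL_ENGINES_URL = "http://stanbol.demo.nuxeo.com/engines"


def build_enhance_request(text, url=STANBOL_ENGINES_URL):
    """Build a POST request equivalent to the curl command (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": "text/turtle", "Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, url=STANBOL_ENGINES_URL, timeout=10):
    """Send the text to the enhancer and return the response body as a string."""
    request = build_enhance_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network; enhance() would.
request = build_enhance_request("John Smith was born in London.")
print(request.get_method(), request.full_url)
```

Calling `enhance("John Smith was born in London.")` against a running Stanbol instance would be expected to return the enhancements serialized as Turtle, as requested by the Accept header.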

[Architecture diagram. Labels: Local IT infrastructure (LAN); Nuxeo DM with addon; Apache Stanbol with Engine 1, Engine 2, Engine 3; DBpedia; Freebase; Geonames; LDAP]

Stanbol Enhancer

Chain of Enhancement Engines

Language Detection (Tika)

Named Entity Detection (OpenNLP)

Linked Data dereferencing (Solr)

Refactoring / Translation (Jena)

Stanbol EntityHub

• Referenced Sites
  o DBpedia
  o Geonames
  o (NY Times, MusicBrainz, ProductDB, UniProt...)
• Fast local offline indices (Solr)
  o Batch indexing utilities for RDF dumps
  o Multilingual fulltext search in labels & descriptions
• Vocabulary mapping / merging

Stanbol Reasoner

• RDFS / OWL-lite / OWL2
• Consistency checks
  o Cardinality checks: each person has 1 birth date
  o Range constraints: birth dates are valid dates
• Materializing types / properties
  o Types from subclass: Musician > Artist > Person
  o Symmetric property: A worked with B
  o Transitive property: A is located in B
• Query-time expansion / inference?
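
To make "materializing" concrete, here is a toy sketch in plain Python of the three inferences listed above. It does not use Jena or the OWL API and is not how Stanbol implements reasoning; the entity names and function names are made up for illustration.

```python
# Toy class hierarchy (assumed data): Musician > Artist > Person.
SUBCLASS_OF = {"Musician": "Artist", "Artist": "Person"}


def superclasses(cls):
    """Yield every ancestor of cls by following SUBCLASS_OF links."""
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        yield cls


def materialize_types(types):
    """Add inherited types: a Musician is also an Artist and a Person."""
    return {
        entity: set(classes) | {s for c in classes for s in superclasses(c)}
        for entity, classes in types.items()
    }


def symmetric_closure(pairs):
    """If (A, B) holds for a symmetric property, (B, A) holds too."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """Repeatedly chain (A, B) and (B, C) into (A, C) until nothing new appears."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


types = materialize_types({"miles": {"Musician"}})
worked_with = symmetric_closure({("A", "B")})
located_in = transitive_closure({("Montmartre", "Paris"), ("Paris", "France")})
print(types, worked_with, located_in)
```

Materializing stores the inferred facts up front so that later queries stay simple; the alternative raised in the last bullet is to expand the query at search time instead.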

Stanbol Rules

Simple Prolog-like language

uncleRule[
  has(<http://example.org/family.owl#hasParent>, ?x, ?z) .
  has(<http://example.org/family.owl#hasSibling>, ?z, ?y)
  -> has(<http://example.org/family.owl#hasUncle>, ?x, ?y)
]

SPARQL CONSTRUCT or SWRL

PREFIX family: <http://example.org/family.owl#>
CONSTRUCT { ?x family:hasUncle ?y }
WHERE { ?x family:hasParent ?z .
        ?z family:hasSibling ?y }
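
The effect of the uncle rule can be mimicked on a small in-memory set of triples. This is only an illustrative sketch in plain Python, not the Stanbol Rules engine; `apply_uncle_rule` and the sample individuals are hypothetical.

```python
NS = "http://example.org/family.owl#"
HAS_PARENT = NS + "hasParent"
HAS_SIBLING = NS + "hasSibling"
HAS_UNCLE = NS + "hasUncle"


def apply_uncle_rule(triples):
    """Return the hasUncle triples implied by hasParent followed by hasSibling."""
    parents = [(s, o) for s, p, o in triples if p == HAS_PARENT]
    siblings = [(s, o) for s, p, o in triples if p == HAS_SIBLING]
    # Join on the shared variable ?z: x hasParent z, z hasSibling y.
    return {
        (x, HAS_UNCLE, y)
        for x, z in parents
        for z2, y in siblings
        if z == z2
    }


# Made-up individuals for the example.
facts = {
    ("alice", HAS_PARENT, "bob"),
    ("bob", HAS_SIBLING, "charlie"),
}
inferred = apply_uncle_rule(facts)
print(inferred)
```

Both notations on the slide express this same join on `?z`; the CONSTRUCT form simply emits the inferred triples as a new graph.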

Online Demos

Simple analyzer with small index https://stanbol.demo.nuxeo.com

All services deployed http://dev.iks-project.eu:8081

Building Stanbol Enhancer models from Wikipedia

with the Apache data tools

Universal Topic Classification

Use Apache Lucene / Solr MoreLikeThis to perform a truncated nearest neighbors query in the TF-IDF vector space of Wikipedia

Universal Topic Classification

• Index text of all articles grouped by topic
• Solr MoreLikeThis query on new document
• DBpedia dumps provide:
  o Text summaries for each article
  o “subject” relationships between articles and topics
  o “broader” / “narrower” SKOS hierarchy between topics
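
The underlying idea, nearest neighbors in a TF-IDF vector space with one "document" per topic, can be sketched in a few lines of plain Python. This toy version is a stand-in for the concept only: it does not call Solr or MoreLikeThis, it does not truncate the query to its top terms, and the three-topic corpus is invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(topic_texts):
    """Compute one TF-IDF vector per topic from its concatenated article text."""
    counts = {topic: Counter(tokenize(text)) for topic, text in topic_texts.items()}
    n_topics = len(counts)
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n_topics) / (1 + d)) + 1 for t, d in doc_freq.items()}
    vectors = {
        topic: {t: tf * idf[t] for t, tf in c.items()} for topic, c in counts.items()
    }
    return idf, vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def classify(text, idf, vectors, k=3):
    """Return the k topics whose vectors are closest to the new document."""
    query = {t: tf * idf[t] for t, tf in Counter(tokenize(text)).items() if t in idf}
    ranked = sorted(vectors, key=lambda topic: cosine(query, vectors[topic]), reverse=True)
    return ranked[:k]


# Invented miniature corpus: topic name -> text of its articles.
topic_texts = {
    "Apollo program": "nasa moon astronaut apollo mission lunar",
    "Pensions in the United Kingdom": "pension retirement workers united kingdom benefits",
    "Renal physiology": "kidney renal sodium transport nephron",
}
idf, vectors = build_index(topic_texts)
top_topics = classify("nasa sues former astronaut", idf, vectors, k=1)
print(top_topics)
```

Delegating this to a search engine index is what makes the approach practical at Wikipedia scale, since the index only has to score topics that share terms with the query.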

About the Data

• 500k purely technical categories
  o “People_with_missing_birth_place”, “Rivers_in_Romania”
• 70k “semantically grounded” categories
• Paths to roots require both “technical” and “grounded” categories
• Scale:
  o 1.2M topic / topic links
  o 30M topic / article links

Some results (Wikinews)

US children who celebrate Independence Day more likely to become Republicans, says Harvard study

  o Fireworks
  o Voting theory
  o Republican Party (United States)
  o Statistics
  o Electoral systems

Some results (Wikinews)

U.S. space agency NASA sues ex-astronaut

  o American astronauts
  o Aviation halls of fame
  o Edwards Air Force Base
  o Apollo program
  o Exploration of the Moon

Some results (Wikinews)

Hundreds of thousands of British public sector workers strike over planned pension changes

  o Retirement in the United Kingdom
  o United Kingdom pensions and benefits
  o Pensions in the United Kingdom
  o Labor disputes by country
  o Labor disputes

Some results (PLoS One)

Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways

  o Renal physiology
  o Kidney
  o Nephrology
  o Hypertension
  o Membrane biology

Wrap Up

• Web of Data
  o brings Structured Context Frame
  o to decode User Intention
• NLP + Entities & Topics indices
  o to automate Content Enrichment
  o to provide Disambiguation

Resources

Documentation, svn, mailing list:  http://incubator.apache.org/stanbol

IKS project blog:  http://blog.iks-project.eu

Blog posts about Semantic ECM:  http://blogs.nuxeo.com/dev/semantic/

Thank you for your attention!

Olivier Grisel

ogrisel@apache.org

https://twitter.com/ogrisel

Training models for NER from Wikipedia

Extract sentences with link positions in Wikipedia articles

DBpedia to find the type of the target entity (Person, Location, Organization)

Apache Pig scripts to compute the join + format the result as training files for OpenNLP

Apache OpenNLP to build and evaluate the models

Apache Hadoop / Apache Whirr for distributed processing
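
The formatting step can be illustrated in isolation. The sketch below turns one tokenized sentence plus typed link spans into a line using the `<START:type> ... <END>` markup that the OpenNLP name finder trainer reads. It is written in Python rather than Pig purely for illustration; the function name, the span representation and the lowercase type labels are assumptions.

```python
def to_opennlp_line(tokens, spans):
    """Render one tokenized sentence in OpenNLP name finder training markup.

    spans: list of (start, end, entity_type) token offsets, end exclusive,
    assumed to be non-overlapping.
    """
    starts = {start: etype for start, end, etype in spans}
    ends = {end for start, end, etype in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close the span that stopped before this token
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # span running to the end of the sentence
    return " ".join(out)


# Link positions would come from Wikipedia markup, entity types from DBpedia.
tokens = ["John", "Smith", "was", "born", "in", "London", "."]
spans = [(0, 2, "person"), (5, 6, "location")]
line = to_opennlp_line(tokens, spans)
print(line)
```

Each such line is one training sentence; the distributed jobs mentioned above would produce a large file of them for the OpenNLP trainer to consume.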
