31
A framework for knowledge extraction, linked data and semantic search. What do we want computers to do for us?

What do we want computers to do for us?

Embed Size (px)

Citation preview

A framework for knowledge extraction, linked data and semantic search.

What do we want computers to do for us?

We have data.

• From 2005 to 2020, the digital universe will grow in size by a factor of 300, from 30 exabytes to 40 trillion gigabyte (40 ZB).

• From now until 2020, the digital universe will about double every two years.

• Volumes of data are projected to reach 5.247 GB per person with emerging economies playing an increasingly important role (producing two thirds of the world data by the end of this decade).

• Only 0.5% of this data is used today for analysis.

• The amount of information individuals create themselves - writing documents, taking pictures, recording audio - is far less than the information being created about them in the digital universe.

[IDC I V I E W, 2012]

What do we want computers to do for us?

Text

Images/Video

Audio

"language": "de"

Categorisation, Summarisation, Search, Question/Answer, …

"label": "outdoor"

Suggest tags, Image search, …

Automatic Speech Recognition, Speaker identification, Music classification, …

[Andrew NG, 2011]We want computers to process data.

Natural Language Processing

We use it everyday.

[J U RAFSKY & MARTIN, 2008]

a theoretically motivated range of computational techniques for analysing naturally occurring text/speech for the purpose of achieving human-like language processing.

Features extraction in text/speech.

Levels of knowledge encoding in language data.

INPUT

Morphologic

Syntactic

Semantic

FEATURES

NLP{ Parser

Lexical DB

Stemming

AnaphoraPos Tagging

NER

TEXT

NLP FEATURES

WISDOM

What do we want computers to do with a text?

STRUCTURED DATA

CONTEXT

We want computers to make sense of unstructured data.

KNOWLEDGE{Semantic Lifting

TEXT WISDOM

A practical example.

CONTEXT

Combining Semantic Web technologies with NLP technologies.

KNOWLEDGE

Lucoli"label": "Lucoli"

"values":

["13.338889"], "predicate": "http://www.w3.org/2003/01/geo/wgs84_pos#long"

"values":

["42.29194444444445"], "predicate": "http://www.w3.org/2003/01/geo/wgs84_pos#lat""values": [ !!!!! ], "predicate": "http://xmlns.com/foaf/0.1/depiction"

About 20 minutes car drive from L’Aquila.

How we started.

Building an open platform for knowledge extraction, linked data and semantic search.!Delivering the world’s most advanced open source content analysis and making linked data publishing and information discovery accessible to anyone.

• Incorporating requirements from industry partners:

• CMS companies

• System integrators

• Tool providers

• Inheriting 6 years of IP with R&D on:

• Semantic Information Management and Publishing (RDF and Semantic Web Technology)

• Semantic Processing

• Conceptual Search

• Semantic enhancement process chaining

• Multiple NLP features extraction facilities

• Multiple language support

• Content classification and sentiment analysis

• Graduated as Top Level Project of the Apache Foundation in September 2012

STANBOL.APACHE.ORG

A Toolbox for Semantic Processing.

SOLR.APACHE.ORG

The Highly Scalable Search Server.

• Based on Apache Lucene

• Various language specific processing procedures

• Highly scalable (Solr cloud) and highly configurable

• Ultra fast indexing/searching, indexes can be merged/optimised

• Semantic Search available with an easy-to-install Redlink Plugin

DEV.REDLINK.IO/PLUGINS/SOLR

Adding Semantic Search to Apache Solr.

• Boost your existing Apache Solr installation with semantic enhancements via Redlink Content Analysis

• Watch the screencast • Learn more• Customising the semantic enhancements

with user-created vocabularies and Redlink NLP extraction facilities

Managing vocabularies.

VocabulariesDEV.REDLINK.IO/API/1.0-BETA.html#linked-data

• Build your first app • Learn more

• Redlink allows users to create their own Linked Data server for managing vocabularies or publishing datasets for Linked (Open) Data projects

• Datasets managed with Redlink can be made available for content analysis and linking

• Datasets can be either private (Linked Enterprise Data) or public (Linked Open Data)!

• Public Datasets such as DBpedia, Freebase and GeoNames are available for de-referencing and interlinking

• Read-Write Linked Data

• Triple store with transactions, versioning and rule-based reasoning

• SPARQL and LDPath query languages

• Transparent Linked Data Caching

• Graduated as Top Level Project of the Apache Foundation in November 2013

MARMOTTA.APACHE.ORG

The Open Platform for Linked Data.

An Open Linked Data Project for Tourism in Salzburg

• Cross platform publishing as more travellers massively begin using mobile devices

• Multiple Web CMSs (both proprietary and open source) to be managed simultaneously

• Costly manual curation and interlinking

• Increasing demand for content syndication (from big players like foursquare as well as from local application developers)

• Need for better SEO especially for events and sites (too regional to be understood by commercial search engines)

Remixing existing content and creating new value.

A magazine running on WordPress

An online booking system

freshly updated content on locations and events

a database containing: events, facilities, accommodations, …

Everything we know already from Wikipedia

the World’s largest encyclopedia

Using Linked Data to make sense of the information

Linked Data Publishing

• Data from the online booking system (Feratel) is enriched and transformed in triples using identified vocabularies and ontologies

• Triples are stored in the Redlink triple store in a dedicated context

• RDF data and SPARQL end-points are published to the data website (data.salzburgerland.com) running CKAN as Linked Open Data

• CKAN makes the data accessibile to third parties in various formats by querying Redlink

Florianifeier

with RDF different data sources are integrated to provide robot-friendly information that describe real world things

<subject><predicate><object>

Semantic Lifting and Linked Data Principles

• A “word” or “phrase” becomes an identifier used to denote “things” (named entities) existing in the real world

1.Real-world thing are unambiguously represented with web addresses (URI)

2.By accessing these web addresses (HTTP-URI) usable data is sent in return using standard formats (RDF, SPARQL)

3.This data includes links to other data so that people can discover more things

"label":"May",

"reference":

“http://dbpedia.org/

resource/May”

!Type: Thing

"values"["13.7446"],"predicate": "http://www.w3.org/2003/01/geo/wgs84_pos#long" values"["47.10222"],"predicate": “http://www.w3.org/2003/01/geo/wgs84_pos#lat” "reference": “http://dbpedia.org/page/Unternberg” !Type: Place

“label":"Florianifeier", "reference": “http://rdf.salzburgerland.com/events/event/dea7fde1-5583-4002-97eb-007

4a182fa9c.html” !Type: Event

Tim Berners-Lee.

LANGUAGE EVENT THING LOCATION

ENGLISH FLORIANIFEIER MAY UNTERNBERG

[Très Riches Heures du duc de Berry, Raymond Cazelles et Johannes Rathofe]

“This May don't miss the Florianifeier, we'll have fun as usual in Unternberg”

Dynamic Semantic Publishing with ordLiftW

• Data from the Redlink triple store is made available for content enrichment and can be edited using WordLift, a semantic plugin for WordPress.

Data Curation

• Using Linked Data the Web becomes my new CMS

• information is automatically imported in WordPress

• posts are connected with entities

• properties for each entity can be edited using WordPress

• any change is automatically reflected in the triple-store and re-published as Open Data

Using Linked Data and WordLift the Web becomes your new CMS.

editing a blog post

editing an entity

Web Search 19.900 results

no answer

Touristic applications attempting to discover events in Salzburgerland.

“Which events occur in May in Lungau?”

Linked Open Data Query 5 result

5 answer

Unternberg is a village in the area of Lungauon google.at!!

Better SEO using Semantic Markup

Florianifeier

Unternberg

• Using schema.org the data from the triple-store is added to the pages as semantic markup

• Search engines can finally “recognise” entities that were previously unknown (i.e. Florianifeier)

ordLiftW

•Media in cross-media context, allowing to analyse media resources as well as connected content, including video, images, audio, text, link structure and metadata;

• Investigate cross-media analysis along the complete, distributed analysis chain, namely extraction, metadata publishing, querying and recommendations;

•Contribute its main software development results as Open Source components to two established Apache projects, Apache Marmotta and Apache Stanbol, simplifying the use of the technology in industrial products.

What do we want computers to do with Media?

MICO-PROJECT.EU

“Show me the tempo-regional fragments where Lewis Jones is right beside Connor Macfarlane?”

MICO-PROJECT.EU

PREFIX foaf: <http://xmlns.com/foaf/0.1/> PREFIX mm: <http://linkedmultimedia.org/sparql-mm/functions#> PREFIX ma: <http://www.w3.org/ns/ma-ont#> PREFIX dct: <http://purl.org/dc/terms/> !SELECT (mm:boundingBox(?l1,?l2) AS ?left_right) WHERE { ?f1 ma:locator ?l1; dct:subject ?p1. ?p1 foaf:name "Lewis Jones". ?f2 ma:locator ?l2; dct:subject ?p2. ?p2 foaf:name "Connor Macfarlane". ! FILTER mm:rightBeside(?l1,?l2) FILTER mm:temporalOverlaps(?l1,?l2) }

We want computers to process media.

GRAZIE!

foaf:name “Andrea Volpini"

Hopefully soon in the

Linked Data

Cloud!

CREDITS

ANDREW NG, 2011

J U RAFSKY & MARTIN, 2008

Webscale IA using Linked Open Data on slideshare by reduxd

LODE linking open descriptions of events aswc 2009 on slideshare by Raphael Troncy

Semantic SEO in the post-Hummingbird era on slideshare by Kim Renberg and Andrea Volpini

Querying of metadata, media content and context in MICO a demo by Thomas Kurz

this presentation is the result of many inspiring ideas and amazing work from other people and here is the list:

any idea, graphics or meme belonging to us is available for sharing, copying and re-mixing under

creative commons license 3.0