Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux...

Entity-Centric Data Management

Philippe Cudré-Mauroux

eXascale Infolab, University of Fribourg Switzerland

MIT CSAIL – DB GROUP March 29, 2012

eXascale Infolab

•  New lab @ U. of Fribourg, Switzerland •  Financed largely by the Swiss Federal State •  Extremely large-scale, non-relational data

management (… mostly)

XI 10.11.2010

ZenCrowd

XI Projects

Entities

Information Management

•  The story so far: – Strict separation between unstructured and structured

data management infrastructures

Inverted Index

Keywords

Information Integration

•  Information integration is still the one of the biggest CS problem out there (according to many e.g., Gartner)

•  Information integration typically requires some sort of mediation 1.  Unstructured Data: keywords, synsets 2.  Structured Data: global schema, transitive closure of

schemas (mostly syntactic)

⇒ nightmarish if 1, and 2. taken separately, horror marathon if considered together

Entities as Mediation

•  Rising paradigm – Store information at the entity granularity –  Integrate information by inter-linking entities

•  Advantages? – Coarser granularity compared to keywords

•  More natural, e.g., brain functions similarly (or is it the other way around?)

– Denormalized information compared to RDBMSs •  Pre-computed joins •  “Semantic” linking

•  Drawbacks?

Example (1): Yahoo! Portal

Image © R. Rabbat

Example (2): Web Data Integration

http://data.semanticweb.org/person/philippe-cudre-mauroux

Focus on 4 Core Problems

1. Extracting entities from (HTML) text (ZenCrowd)

2.  Searching for entities

3.  Accessing entities (DNS^3)

4. Storing entities (Diplodocus[RDF])

Extracting Entities

•  Extracting entities from text is an old problem… – … and is extremely hard, esp. for machines

•  Dozens of approaches have been suggested

•  What if – We want to combine approaches / frameworks? – We want to leverage both human computations &

algorithms?

ZenCrowd

•  Extracts entities from text using state-of-the-art techniques

•  Uses sets of algorithmic matchers to match entities to online concepts

•  Uses dynamic templating to create micro-matching-tasks and publish them on MTurk

•  Combines both algorithmic and human matchers using probabilistic networks

ZenCrowd Architecture

Micro Matching

HTMLPages

HTML+ RDFaPages

LOD Open Data Cloud

CrowdsourcingPlatform

ZenCrowdEntity

Extractors

LOD Index Get Entity

Input Output

Probabilistic Network

Decision Engine

Workers Decisions

AlgorithmicMatchers

“Black-Magic” Component

•  Probabilistic network to integrate a priori & a posteriori information – Agreement of good turkers & algorithms

•  Learning process – Constraints

•  Unicity •  Equality (SameAs)

– Giant probabilistic graph •  Instantiated selectively

pw1( ) pw2( )

lf1( ) lf2( )

pl1( ) pl2( )

lf3( )

pl3( )

c11 c22c12c21 c13 c23

u2-3( )sa1-2( )

Does it Work?

•  Improves avg. prec. by 0.14 on average! – Minimal crowd involvement – Embarrassingly parallel problem

Top$US$Worker$

0$ 250$ 500$

Worker&P

recision

Number&of&Tasks&

US$Workers$

IN$Workers$

0.6$0.62$0.64$0.66$0.68$0.7$

0.72$0.74$0.76$0.78$0.8$

1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$

Precision)

Top)K)workers)

Searching for Entities

•  How can end-users reach entities? ⇒  Keyword search

•  On their names or attributes

– Obviously not ideal •  BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20 •  Query extension, query completion or pseudo-relevance

feedback yield comparable (or worse) results

Hybrid Entity Search

The Descendants

TheDescendants

titleGeorgeClooney

George Clooney

nameMay 6, 1961

dateOfBirth

ShaileneW

Shailene Woodley

Nov. 15, 1991

dateOfBirth

playsIn

•  Main idea: combine unstructured and structured search –  Inverted index to locate first candidates – Graph queries to refine the results

•  Graph traversals (queries on object properties) •  Graph neighborhoods (queries on

data type properties)

Inverted Index

Keywords

SPARQL

Architecture

LOD Cloud

index()User

Query Annotation and Expansion

Inverted Index

RDF Store

Ranking FunctionsRanking

FunctionsRanking Functions

query()

Entity SearchKeyword Query

intermediatetop-k resultsGraph-Enriched

Results

Graph Traversals(queries on object

properties)

Neighborhoods(queries on datatype

properties)

Structured Inverted Index

WordNet

3rd partysearch engines

Final Ranking Function

Pseudo-Relevance Feedback

Query Refinement

•  Finding the right graph queries to refine results is a real challenge – Which object property to follow? Transitive closures?

– Which data type property to take into account? Scope?

Does it Work?

•  Up to 25% improvement over best IR (stat. sign.) •  Very modest impact on latency (17%)

0" 0.2" 0.4" 0.6" 0.8" 1"

Precision)

Recall)

BM25"baseline"

SAMEAS"

Whom to Ask for Entities?

•  Void + SPARQL end-point •  DOA •  DNS [DNS^3] •  Same_as service / Okkam IDs •  P2P Mesh of entities [idMesh] •  Downscaled registries?

How to Store Entities?

•  Fundamental impedance mismatch between graphs of entities and… – N-ary / decomposition storage model –  Inverted Indices – Key-value paradigms

Entity Storage

•  Materialize the joins! •  Dense-pack the values •  Provide new indices

•  Co-locate •  Co-locate •  Co-locate

DipLODocus: System Architecture

Main Idea - data structures

Template Matching

Hash Table

lexicographic tree to encode URIs

template based indexing

extremely compact lists of homologous nodes

Molecule Clusters •  extremely compact sub-graphs •  precomputed joins

List of Literals •  extremely compact list of sorted values

Basic operations - queries - triple patterns ?x type Student. ?x takesCourse Course0.

?x type Student. ?x takesCourse Course0. ?x takesCourse Course1.

=> intersection of sorted lists

Basic operations - queries aggregates and analytics

?x type Student. ?x age ?y filter (?y < 21)

Basic operations - queries - molecule queries

?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.

Results - LUBM - 100 Universities

35 x faster for OLTP 300 x faster for OLAP

Current Directions

•  ZenCrowd: Web-scale Integration •  Entity Search: recommendation based on graph

mining •  Entity registry: DOA / Downscaling for ad-hoc •  Diplodocus

– Open-Source – Cloud Storage – Visualization (sensor data)

References

•  ZenCrowd [WWW 2012] •  idMesh [WWW 2009] •  Hybrid Entity Search [SIGIR 2012] •  DNS^3 [ISWC 2011 demo] •  Downscaling Entity Registries [Downscale 2012] •  Diplodocus[RDF] [ISWC 2011] •  Graph Data Management for New Application

Domains [VLDB 2011]

Upcoming Events

Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux...

Documents

SEMPEDIA : Sémantisation à partir des documents semi … · 2020-01-22 · méthodes d’apprentissage supervisé et semi-supervisé alors que Smirnova and Cudré-Mauroux [2018],

EarlyBridge case from product centric to customer centric eb

Automatic Discovery of Data-Centric and Artifact-Centric Processes …dfahland/publications/NooijenDF_2012_dab_artifact... · Automatic Discovery of Data-Centric and Artifact-Centric

Customer centric

SciDB: an open-source, array-oriented database management ... · SciDB: an open-source, array-oriented database management system Philippe Cudré-Mauroux Massachusetts Institute of

MLOps: From Model-centric to Data-centric AI

Meter-Centric to Customer-Centric: Madrileña Red de Gas

People-Centric Collaboration · PC Centric Phone Centric Video Centric Next Generation Virtual Workspace ... Collaboration applications virtualized in the data center 1 ... building

Prevention of natural risks and public expenditures by Amélie Mauroux

A (BRIEF) HISTORY OF COMPUTING By Dane Paschal. BIASES Amero-Euro centric Computer science centric Google centric

Architecture Centric Virtual Integration Process (ACVIP · PDF fileArchitecture Centric Virtual Integration Process (ACVIP) ... Architecture Centric Virtual Integration Process

From page-centric to portlet-centric Web development: easing the

Moving from Device Centric to a User Centric Management

Karl Aberer and Philippe Cudr©-Mauroux

Network Centric Net-Centric Enterprise · PDF fileNetwork Centric Enterprise Net-Centric ... • Air Traffic Control Command Control ... Balanced Scorecard – April 2007 CMMI

©2005, Karl Aberer and Philippe Cudré-Mauroux Semantic Overlay Networks VLDB 2005 Karl Aberer and Philippe Cudré-Mauroux School of Computer and Communication

Network Centric Operations Conceptual Framework Version 1 · Network Centric Operations Conceptual Framework Version 1 ... and mature the concepts of Network Centric ... Network Centric

1 ISWC04 11.11.04 GridVine: Building Internet-Scale Semantic Overlay Networks Karl Aberer, Philippe Cudré-Mauroux, Manfred Hauswirth School of Computer

Graph Data Management Systems for New Application Domains: Social Networks & the Web of Data Tutorial at VLDB 2011 Philippe Cudré-MaurouxSameh Elnikety

Rearrangement planning using object-centric and robot-centric … · 2018-03-29 · Rearrangement planning using object-centric and robot-centric action spaces Jennifer E. King ⇤,