The Dream of a Global Network of Knowledge

Page 1: The Dream of a  Global Network of Knowledge

The Dream of a Global Network of Knowledge

Martin Doerr
Center for Cultural Informatics, Institute of Computer Science
Foundation for Research and Technology - Hellas

Amsterdam, Netherlands
November 17, 2011

Page 2: The Dream of a  Global Network of Knowledge

A Global Network

ICS-FORTH November 17, 2011 2

Introduction

Digital Libraries take on different forms and roles.
- Initially: collection management systems, literature collections, digitized resources, resource libraries (Perseus etc.), on-line corpora
- In addition: data services, scientific data collections, research systems (e.g., GIS integrated data)
- “Metadata” Aggregation Services: a new paradigm using semantic networks to integrate diverse forms of information assets and pointers to them, for the support of research and the interested public
- New grand challenges
- Library access paradigm still dominates!

Page 3: The Dream of a  Global Network of Knowledge

Library, Archive, Museum Information

The typical library contents: “The whole stories”, access widely solved!
- Primary literature: Fiction. Categorical: theories and hypotheses
- Secondary literature (research results)
- Facts brought into causal context

The typical museum information: “Museum objects rarely talk”
- Factual documentation of properties and context per object, references, classification
- Highly heterogeneous
- About things taken out of original context, distributed over the world

Page 4: The Dream of a  Global Network of Knowledge

Library, Archive, Museum Information

The typical archive contents: “The needle in the haystack”
- Primary sources, “bits and pieces” (letters, legal documents, administration acts, images, scientific records)
- Factual, kept in the contextual sequence of creation, as arranged by the creator or the responsible party
- Kept due to a mandate related to functions

Similarly, library content itself: “What is in the book?”
- Parts of book content (citations!) as primary source of investigation
- Access: not much more than keyword search, if a digital form exists…

Page 5: The Dream of a  Global Network of Knowledge

Epistemology of Integration

[Figure: diagram relating Libraries, Museums, Archives and SMRs to what they hold (Books; Objects, Sites; primary Documents). Edge labels: illustrate, exemplify; are about; refer to; document features & context; provide finding aids; contain narratives made from; publish; using; document; manage; exhibit.]

Page 6: The Dream of a  Global Network of Knowledge

Traditional Information Access

The traditional library task: collect and preserve documents and provide finding aids.
- The job is solved when the (one, best) document is handed out: “All you want is in this document”.

The digital analogue: implementing “finding aids”.
- Assumption: the user knows a topic, characterized by a noun, or knows associations of a thing he knows to exist.
- Associations may be known properties, but not directly correlated to the problem to be solved (e.g. “organic farming” for “host-parasite studies”).

Semantic interoperability is limited to the aggregation task:
- Metadata are mainly homogeneous (DC, VRA, etc.); the only challenge discussed is the matching of terminologies (KOS).

…still THE dominant global information integration paradigm

Page 7: The Dream of a  Global Network of Knowledge

Problems

No support to learn from the aggregated sources or to retrieve by context, e.g.:
- Who was the employer of Donald Johanson when he found Lucy?
- Which plant species are documented for the Black Sea coast for 6000 BC? (Critical climate hypothesis connected to detecting the Black Sea flood in 5600 BC)
- What resolution did Galileo’s telescope have when he observed...

But understanding depends on relationships. Cultural information has complex relationships. Relationships may be categorical or factual:
- Categorical (e.g., “smoking causes cancer”): richly exploited by Semantic Web technology. Use and integration limited to research results. Not useful for primary research itself.
- Factual associations concatenate information assets into meaningful (“epistemic”) networks (“stories”): they support context-based hypothesis building, cross-disciplinary search etc. (e.g. “John smoked at 20, … 30, … 40”, “John had lung cancer at 60”)

Knowledge of factual associations is the “food” of scholarly research.

Page 8: The Dream of a  Global Network of Knowledge

What Can IT Do Now?

Access to categorical knowledge is well solved, if hypotheses have names:
- subject search, keyword search
- content management systems & search engines
- an increasing amount of structured categorical knowledge built in the form of thesauri and ontologies (life sciences!): access by terms and browsing broader/narrower terms
- access by categorical relationships more rarely touched

Access to facts is idiosyncratic to diverse systems and limited to:
- structured data services: no general access paradigm
- KOS (author lists, gazetteers)
- “surfing and browsing” on the Internet or in Digital Libraries

Page 9: The Dream of a  Global Network of Knowledge

What Can IT Do Now?

New promises: Semantic Networks, Semantic Web
- RDF Triple Stores
- Open World Systems: billions of facts under any number of schemata in one database
- Linked Open Data (LoD): thousands of triple stores to be accessed
- Shift to metadata rich in facts from Archives, Libraries, Museums, Digital Libraries and from research databases -> the difference between data and metadata blurs

A global network of knowledge? ...or a perfect intellectual chaos…?

Page 10: The Dream of a  Global Network of Knowledge

Semantic Networks

[Figure: semantic network laid out over space (Greece, Rome, Germany) and time. Nodes: “LAOKOON”; unknown Roman; unknown Roman copies “Laokoon”; “LAOKOON” (copy); Winckelmann’s mother; Winckelmann’s birth; Winckelmann sees “Laokoon”; Winckelmann writes “…noble simplicity, silent grandeur…”; Winckelmann’s death; Winckelmann; Published Inference. Annotations: (in Vatican museum); (in a library); (in a library?); (archive information?), twice; 1755.]

Page 11: The Dream of a  Global Network of Knowledge

3 Grand Challenges

1. We need a rich, integrating global schema: a core and extensions of any depth.
   Con: impossible, everybody has his own conceptualization.
   Pro: the CIDOC-FRBR work empirically proves the opposite.
2. “Knitting” the network: without co-reference resolution, facts/triples do not connect.
   Con: impossible, automatic means are limited and human labor is not scalable.
   Pro or Con?: LoD.
   Pro: human labor scales if massively organized.
3. End-users need to query large Triple Stores effectively.
   Con: impossible to write ad hoc rich SPARQL statements, impossible to memorize hundreds of properties.
   Pro: use another, simple global schema for querying.

Page 12: The Dream of a  Global Network of Knowledge

A Global Schema: The CIDOC CRM

- Developed by the CRM Special Interest Group of the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM).
- It is an extensible core ontology of 86 classes and 137 properties describing the underlying semantics of over a hundred database schemata and structures from all museum disciplines, archives and libraries.
- Extended by FRBRoo, modeling IFLA’s FRBR, and soon FRSAD and FRAD (RDFS integration with DC, Europeana EDM and ORE exists).
- It is the result of 15 years of interdisciplinary work and agreement.
- In essence, it is a generic model for recording “what has happened” at human scale, i.e. a class of discourse.
- With it we can generate huge, meaningful networks of knowledge by a simple abstraction: history as meetings of people, things and information.
- An interlingua to transform, transport and merge information from most data structures with clear meaning.

Page 13: The Dream of a  Global Network of Knowledge

Explicit Events, Object Identity, Symmetry

[Figure: CRM graph of the Yalta example. Nodes: E7 Activity “Crimea Conference”; E65 Creation Event; E31 Document “Yalta Agreement”; E38 Image; three E39 Actor nodes; E53 Place 7012124; E52 Time-Span February 1945; E52 Time-Span 1945-02-11. Property labels: P14 performed; P11 participated in; P94 has created; P86 falls within; P7 took place at; P67 is referred to by; P81 ongoing throughout; P82 at some time within.]

Page 14: The Dream of a  Global Network of Knowledge

Data example (RDF-like form)

Epitaphios GE34604 (entity E22 Man-Made Object)
  P30 custody transferred through, P24 changed ownership through
    Transfer of Epitaphios GE34604 (entity E10 Transfer of Custody, E8 Acquisition Event)
      P28 custody surrendered by
        Metropolitan Church of the Greek Community of Ankara (entity E39 Actor)
      P23 transferred title from
        Metropolitan Church of the Greek Community of Ankara (entity E39 Actor)
      P29 custody received by
        Museum Benaki (entity E39 Actor)
      P22 transferred title to
        Exchangeable Fund of Refugees (entity E40 Legal Body)
          P2 has type national foundation (entity E55 Type)
      P14 carried out by
        Exchangeable Fund of Refugees (entity E39 Actor)
      P4 has time-span
        GE34604_transfer_time (entity E52 Time-Span)
          P82 at some time within 1923 – 1928 (entity E61 Time Primitive)
      P7 took place at
        Greece (entity E53 Place)
          P2 has type nation (entity E55 Type), republic (entity E55 Type)
          P89 falls within Europe (entity E53 Place)
            P2 has type continent (entity E55 Type)

Annotations in the figure: “TGN data”; “Multiple Instantiation”.
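The data example above can be read as a set of subject-property-object statements. The following is a minimal sketch of that reading, not the author's implementation: the identifier `transfer_GE34604`, the `is a` pseudo-property and the helper functions are assumptions made for illustration, while the property labels and names are taken from the slide.

```python
# Triples transcribed from the Epitaphios example on the slide.
# "transfer_GE34604" is a hypothetical local identifier for the transfer event,
# and "is a" is a stand-in for class membership.
CHURCH = "Metropolitan Church of the Greek Community of Ankara"
FUND = "Exchangeable Fund of Refugees"
EVENT = "transfer_GE34604"

TRIPLES = [
    ("Epitaphios GE34604", "is a", "E22 Man-Made Object"),
    ("Epitaphios GE34604", "P30 custody transferred through", EVENT),
    ("Epitaphios GE34604", "P24 changed ownership through", EVENT),
    # Multiple instantiation: the same event is an instance of two classes.
    (EVENT, "is a", "E10 Transfer of Custody"),
    (EVENT, "is a", "E8 Acquisition Event"),
    (EVENT, "P28 custody surrendered by", CHURCH),
    (EVENT, "P23 transferred title from", CHURCH),
    (EVENT, "P29 custody received by", "Museum Benaki"),
    (EVENT, "P22 transferred title to", FUND),
    (EVENT, "P14 carried out by", FUND),
    (EVENT, "P4 has time-span", "GE34604_transfer_time"),
    ("GE34604_transfer_time", "P82 at some time within", "1923-1928"),
    (EVENT, "P7 took place at", "Greece"),
    ("Greece", "P89 falls within", "Europe"),
]


def objects(subject, predicate):
    """Return the objects of all triples with this subject and predicate, in order."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def who_holds_what(event):
    """Report custody and legal title separately, as the event model does."""
    return {
        "custody": objects(event, "P29 custody received by"),
        "title": objects(event, "P22 transferred title to"),
    }


print(who_holds_what(EVENT))
```

Modelling the transfer as an explicit event is what lets custody (Museum Benaki) and title (Exchangeable Fund of Refugees) end up with different actors without contradiction.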

Page 15: The Dream of a  Global Network of Knowledge

CRM Top-level classes useful for integration

[Figure: top-level CRM classes and the relationships among them. Classes: E39 Actors; E28 Conceptual Objects; E18 Physical Thing; E2 Temporal Entities; E55 Types; E41 Appellations; E53 Places; E52 Time-Spans. Relationship labels: participate in; affect or / refer to; refer to / refine; refer to / identify; location; at; within.]

Page 16: The Dream of a  Global Network of Knowledge

The CIDOC CRM

The types of relationships
- Identification of real world items by real world names
- Observation and Classification of real world items
- Part-decomposition and structural properties of Conceptual & Physical Objects, Periods, Actors, Places and Times
- Participation of persistent items in temporal entities: creates a notion of history, “world-lines” meeting in space-time
- Location of periods in space-time and physical objects in space
- Influence of objects on activities and products and vice-versa
- Reference of information objects to any real-world item

Page 17: The Dream of a  Global Network of Knowledge

The Hierarchy of Participation Properties

P12 occurred in the presence of (was present at)

P16 used specific object (was used for)

P25 moved (moved by)

P33 used specific technique (was used by)

P142 used constituent (was used in)

P143 joined (was joined by)

P144 joined with (gained member by)

P145 separated (left by)

P124 transformed (was transformed by)

P110 augmented (was augmented by)

P112 diminished (was diminished by)

P95 has formed (was formed by)

P22 transferred title to (acquired title through)

P23 transferred title from (surrendered title through)

P135 created type (was created by)

[Arrow label in the figure: Generalization]

P31 has modified (was modified by)

P146 separated from (lost member by)

P28 custody surrendered by (surrendered custody through)

P11 had participant (participated in)

P93 took out of existence (was taken out of existence by)

P92 brought into existence (was brought into existence by)

P96 by mother (gave birth)

P14 carried out by (performed)

P99 dissolved (was dissolved by)

P13 destroyed (was destroyed by)

P100 was death of (died in)

P108 has produced (was produced by)

P123 resulted in (resulted from)

P98 brought into life (was born)

P94 has created (was created by)

P29 custody received by (received custody through)

Page 18: The Dream of a  Global Network of Knowledge

Schema Integration by Property Generalization

Access all data from any level by CRM property generalization.

[Figure: data held under source schemata (Dublin Core, CDWA, MIDAS) reach the CIDOC Conceptual Reference Model (CRM) by automatic data export. The CRM spans levels from “few concepts, high recall” (Thing, Actor, Event; was present at, used object, happened at) to “special concepts, high precision” (e.g. Acquisition).]
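Querying “from any level” can be sketched as follows: a query on a general property also returns facts recorded under its more specific sub-properties. This is an illustrative sketch only. The `SUPER_PROPERTY` table holds a small excerpt of the participation hierarchy as I understand the CRM definitions (the slides list the properties but the tree structure did not survive), and the facts and the identifier `transfer_GE34604` are assumptions reused from the Epitaphios example.

```python
# Assumed excerpt of the CRM participation hierarchy: sub-property -> super-property.
SUPER_PROPERTY = {
    "P22 transferred title to": "P14 carried out by",
    "P29 custody received by": "P14 carried out by",
    "P14 carried out by": "P11 had participant",
    "P11 had participant": "P12 occurred in the presence of",
}

# Facts recorded at the most specific level (high precision).
FACTS = [
    ("transfer_GE34604", "P22 transferred title to", "Exchangeable Fund of Refugees"),
    ("transfer_GE34604", "P29 custody received by", "Museum Benaki"),
    ("transfer_GE34604", "P7 took place at", "Greece"),
]


def generalizations(prop):
    """Return prop followed by all of its super-properties, most specific first."""
    chain = [prop]
    while chain[-1] in SUPER_PROPERTY:
        chain.append(SUPER_PROPERTY[chain[-1]])
    return chain


def query(prop):
    """Return every fact whose property is prop or one of its sub-properties."""
    return [fact for fact in FACTS if prop in generalizations(fact[1])]


# A general question ("who was present?") has high recall ...
print(query("P12 occurred in the presence of"))
# ... while a specific one ("who acquired title?") has high precision.
print(query("P22 transferred title to"))
```

The design choice is that data are stored once, at their most specific property, and generality is obtained at query time by walking up the hierarchy.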

Page 19: The Dream of a  Global Network of Knowledge

Knitting the Network: Extracted Relations & Co-reference

[Figure: facts extracted from documents, data and metadata are integrated into one network, with the CRM as a global classification of relationships. Panel labels: Fact Extraction; Fact Integration; Deductions. Node categories: Actor, Event, Thing, Place, Time-Span. Example nodes: Donald Johanson; Cleveland Museum of Natural History; Johanson's Expedition; Discovery of Lucy; Lucy; AL 288-1; Hadar; Ethiopia. Annotations: “Primary link extracted from one document”; “Linking documents via co-reference, not hyperlinks!”]

Page 20: The Dream of a  Global Network of Knowledge

Co-reference Knowledge and Reality

[Figure: three levels: symbolic level (“vocabulary”), interpretation (“speakers”), real world (“objects”). Two records “M.Smith, born 2-5-65” appear at the symbolic level. Link labels: same as; same as; not same as. Annotations: (data comparison); (direct negotiation), twice.]

Page 21: The Dream of a  Global Network of Knowledge

Theory of Co-reference

- A group of “speakers” (a database) shares unique identifiers for a set of things. Another group “matches” their identifiers to mean the “same as”.
- The transitive closure of “same as” together with “not same as” exhibits “impossible worlds”, the only indication of false knowledge at the data level.
- Ultimate knowledge is what the author meant by “her/him/it”: a part of speech, a database key, an occurrence of a name or URI.
- Co-reference is primary knowledge, true research, not a “cleaning” issue.
- Co-reference is more fundamental than schema integration: it supports integration without schema. Schema integration can be seen as a co-reference problem.
- Co-reference is more fundamental than Reference KOS: no description elements are needed. Reference KOS can help co-reference. Co-reference can be distributed!
- Automatic “duplicate detection” is based on / improved by co-reference.
- “Negotiation with the speakers” is the ultimate confirmation = scholarly research.
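The “impossible worlds” check can be sketched with a union-find structure: “same as” statements are merged into clusters, and any “not same as” statement whose two ends land in one cluster signals false knowledge somewhere in the chain. The class name, method names and identifiers such as `db1:M.Smith` are hypothetical; the slides do not prescribe a data structure.

```python
class CoreferenceIndex:
    """Union-find over identifiers, plus recorded 'not same as' statements."""

    def __init__(self):
        self.parent = {}
        self.not_same = []

    def find(self, x):
        """Return the representative of x's cluster, registering x if new."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def same_as(self, a, b):
        """Merge the clusters of a and b (transitive closure is implicit)."""
        self.parent[self.find(a)] = self.find(b)

    def not_same_as(self, a, b):
        """Record that a and b are known to denote different things."""
        self.find(a)
        self.find(b)
        self.not_same.append((a, b))

    def contradictions(self):
        """Return 'not same as' pairs that the closure of 'same as' has merged."""
        return [(a, b) for a, b in self.not_same if self.find(a) == self.find(b)]


index = CoreferenceIndex()
index.same_as("db1:M.Smith", "db2:M.Smith")
index.same_as("db2:M.Smith", "db3:Smith,M")
index.not_same_as("db1:M.Smith", "db3:Smith,M")
print(index.contradictions())
```

The check only reports that at least one statement in the chain is wrong; deciding which one remains a matter of negotiation with the “speakers”.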

Page 22: The Dream of a  Global Network of Knowledge

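Co-reference Problem

Query “Friends of a Friend”

[Figure: two content sources queried in sequence. 1. query to Source 1, input “Martin”: read the output, find “Kostas”, guess “Κώστας”. 2. query to Source 2, input “Κώστας”, output “George”. Each source holds a “has friend” relation; the friend appears as “Kostas” in one and “Κώστας” in the other.]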

Page 23: The Dream of a  Global Network of Knowledge

Co-reference via Authority

Join across sources by transitivity of co-reference

[Figure: a friend-of-a-friend query with input “Martin” against Content Source 1 and output “George” from Content Source 2. The local ids of both sources are matched to the ids of an authority service through a link table (first match, second match); the resulting link joins “Κώστας” / “Kostas” across the sources.]
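The join through an authority can be sketched as two lookups: from the local id in Source 1 to an authority id, and from that authority id to the local ids of Source 2. Everything below is an assumption for illustration: the authority id `auth:42`, the table layout and the function name are invented; only the names “Martin”, “Kostas”, “Κώστας”, “George” and the “has friend” relation come from the slides.

```python
# Facts held by two independent sources, each using its own local ids.
SOURCE_1 = [("Martin", "has friend", "Kostas")]
SOURCE_2 = [("Κώστας", "has friend", "George")]

# Link tables: local id -> authority id ("auth:42" is a hypothetical id).
LINKS_1 = {"Kostas": "auth:42"}
LINKS_2 = {"Κώστας": "auth:42"}


def friends_of_friends(person):
    """Join Source 1 and Source 2 through the shared authority id."""
    results = []
    for subject, prop, friend in SOURCE_1:
        if subject != person or prop != "has friend":
            continue
        authority_id = LINKS_1.get(friend)  # first match
        if authority_id is None:
            continue  # no co-reference known, so the facts do not connect
        # second match: every local id in Source 2 with the same authority id
        local_ids = {loc for loc, auth in LINKS_2.items() if auth == authority_id}
        for subject_2, prop_2, friend_2 in SOURCE_2:
            if subject_2 in local_ids and prop_2 == "has friend":
                results.append(friend_2)
    return results


print(friends_of_friends("Martin"))
```

Without the two link-table entries the query returns nothing, which is the situation the “Co-reference Problem” slide describes.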

Page 24: The Dream of a  Global Network of Knowledge

Curating Co-reference without Authority

Join across sources by transitivity of co-reference

[Figure: the same friend-of-a-friend query (input “Martin” to Content Source 1, output “George” from Content Source 2). Here each source makes a co-reference directly between local ids, matching “Κώστας” / “Kostas” without an authority service.]

Page 25: The Dream of a  Global Network of Knowledge

Managing Co-reference Clusters

[Figure: clusters of reference occurrences such as “M. Doerr” and “M. Dörr”. Legend: explicit initial “same as” (n-1); explicit redundant “same as”; implicit link ( n(n-1)/2 ); new link connecting clusters! What happens?]

Authority files are good “attractors” of co-reference links, but do not solve co-reference!
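The counts in the legend can be checked with a few lines of arithmetic. This sketch assumes the reading that n is the number of reference occurrences in one cluster, that n-1 explicit links suffice to connect them, and that n(n-1)/2 is the number of pairwise equivalences the cluster then implies; the function names are made up.

```python
def explicit_links(n):
    """Minimum number of explicit 'same as' links connecting n occurrences."""
    return max(n - 1, 0)


def implied_pairs(n):
    """Number of pairwise equivalences implied by a cluster of n occurrences."""
    return n * (n - 1) // 2


def new_pairs_on_merge(a, b):
    """Equivalences created when one new link joins clusters of size a and b."""
    return implied_pairs(a + b) - implied_pairs(a) - implied_pairs(b)


for size in (2, 4, 10):
    print(size, explicit_links(size), implied_pairs(size))
print(new_pairs_on_merge(3, 4))
```

One new link between two clusters of sizes a and b implies a·b new equivalences at once, which is why a single wrong link connecting clusters is so costly and why such links deserve curation.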

Page 26: The Dream of a  Global Network of Knowledge

A New Service: Global Co-reference Indices

- Co-reference links should be persistent and public. Primary co-reference links should be curated and preserved in local databases: “co-reference indices”.
- Use NER and duplicate-detection algorithms to prepopulate co-reference indices. Use appropriate belief values for generated data.
- Automated, global, distributed consistency control services are feasible.
- Co-reference indices are much larger than ontologies, but not larger than search engines.
- Mobilize general users and domain experts to enhance and verify co-reference information by social tagging, to scale up human labor and precision.
- Install global supervision by open consortia setting the rules and running central services.
- Then the network may converge to consistent global knowledge.
- Linked Open Data has no co-reference concept so far. It will lead to a proliferation of URIs.
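Prepopulating an index with machine-generated candidates might look like the sketch below. It is only one possible reading of the slide: plain string similarity from Python's `difflib` stands in for the NER and duplicate-detection algorithms, the similarity ratio is used directly as the “belief value”, and the `CoreferenceLink` record, the `curated` flag and the 0.7 threshold are all assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class CoreferenceLink:
    left: str
    right: str
    belief: float          # confidence attached to a generated link
    curated: bool = False  # set to True once a human has verified the link


def propose_links(names, threshold=0.7):
    """Propose 'same as' candidates for every pair of sufficiently similar names."""
    links = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            links.append(CoreferenceLink(a, b, round(score, 2)))
    return links


candidates = propose_links(["M. Doerr", "M. Dörr", "G. Fauconnier"])
for link in candidates:
    print(link)
```

Generated links start as uncurated candidates; in the scheme described above, users and domain experts would then confirm or reject them.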

Page 27: The Dream of a  Global Network of Knowledge

Last Problem: How to query 250 properties?

- Humans think consciously in “compressed relations” (G. Fauconnier, “The Way We Think”), in particular omitting events: “What do we have from New Guinea?”
- There are a few “Fundamental Categories” that partition our concepts (Ranganathan: “Who, When, Where, What...”) and disambiguate most words, e.g., a “museum” is a “who”, a “where” or a “what”.
- If we implement a simple semantic network with few compressed relationships, we cannot integrate knowledge, because the intermediates are missing, and we cannot manage the immense number of redundant relations.
- If we implement a CIDOC CRM network, end-users cannot write queries.

Solution: define a new “data model” of “Fundamental Categories” and “Fundamental Relationships” for querying only, implemented as automated deductions from a CRM-based network.

Page 28: The Dream of a  Global Network of Knowledge

How to query with 250 properties?

Fundamental Categories: Thing, Actor, Time, Place, Event (E2), Type

Fundamental Relationships:
- has type / is type of
- is similar to or same with
- is part of (is member of) / has part (has member)
- has met
- from (has founder or has parent) / is origin, founder, parent, provider or creator of
- had (=owns, keeps) / were owned/kept by
- at
- refers to or is about / is referred by / is referred to at

Relationships change interpretation depending on category of domain and range.

Page 29: The Dream of a  Global Network of Knowledge

Thing is about Thing: Path Expression

Following this schema, we have implemented over a hundred deductions such as:

Thing -> P130F.shows_features_of (0,n) OR P130B.features_are_also_found_on (0,n) -> {
    E24.Physical_Man-Made_Thing -> P62F.depicts -> Thing
  OR
    E24.Physical_Man-Made_Thing -> P128F.carries (0,n) -> E73.Information Object -> P67F.refers_to -> Thing
  OR
    D1.Digital_Object -> {L11B.was_output_of -> D3.Formal_Derivation -> L10F.had_input -> D1.Digital_Object ->}(0,n) L11B.was_output_of -> {D7.Digital_Machine_Event -> P9B.forms_part_of (0,n) ->}(0,1) D2.Digitization_Process -> L1F.digitized -> E18.Physical_Thing
}

It works!!!
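How such a deduction could be evaluated over a triple graph is sketched below for a simplified version of the path expression: zero or more `P130F.shows_features_of` steps, followed by either `P62F.depicts` or `P128F.carries` then `P67F.refers_to`. The sketch is not the implementation the slide refers to: it ignores class checks, the backward direction `P130B` and the digital-object branch, and the example nodes (`copy_1`, `statue_1`, `book_1`, `text_1`) are invented.

```python
EDGES = {}  # (subject, property) -> set of objects


def add(subject, prop, obj):
    EDGES.setdefault((subject, prop), set()).add(obj)


def step(nodes, prop):
    """Follow prop exactly once from every node in nodes."""
    out = set()
    for node in nodes:
        out |= EDGES.get((node, prop), set())
    return out


def closure(nodes, prop):
    """Follow prop zero or more times, i.e. the (0,n) repetition."""
    seen = set(nodes)
    frontier = set(nodes)
    while frontier:
        frontier = step(frontier, prop) - seen
        seen |= frontier
    return seen


def is_about(thing):
    """Deduce the compressed 'is about' relation from the detailed paths."""
    carriers = closure({thing}, "P130F.shows_features_of")
    depicted = step(carriers, "P62F.depicts")
    referred = step(step(carriers, "P128F.carries"), "P67F.refers_to")
    return depicted | referred


# Hypothetical data: a copy shows features of a statue that depicts Laokoon;
# a book carries a text that refers to the statue.
add("copy_1", "P130F.shows_features_of", "statue_1")
add("statue_1", "P62F.depicts", "Laokoon")
add("book_1", "P128F.carries", "text_1")
add("text_1", "P67F.refers_to", "statue_1")

print(is_about("copy_1"))
print(is_about("book_1"))
```

The end-user asks one question in terms of a fundamental relationship, and the deduction expands it into the alternative property chains stored in the detailed network.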

Page 30: The Dream of a  Global Network of Knowledge

Conclusions

After 50 years of “Artificial Intelligence” research and 15 years of “Semantic Web”, the Global Network of Knowledge is still a dream.

Today, we have the chance to lay foundations for global knowledge network(s!) with
- a limited consistency, with a tendency to converge to something more consistent
- a limited common language
- a limited way to globally explore deep relationships

For that, we have to
- Overcome intellectual barriers in conceptual modelling (“quick & dirty”, W3C “beliefs”, ignoring empirical scientific methods, political thinking, domain blindness)
- Organize domain communities to collectively curate data and co-reference by new awarding methods
- Invest in technology and methodology for a long data life-cycle by mapping and transforming data “for ever”, as we have done since antiquity…