SESAM Lars Marius Garshol, Baksia, 2013-02-13 [email protected], http://twitter.com/larsga 1

Hafslund SESAM - Semantic integration in practice


Page 1: Hafslund SESAM - Semantic integration in practice

SESAM

Lars Marius Garshol, Baksia, [email protected], http://twitter.com/larsga

Page 2: Hafslund SESAM - Semantic integration in practice

Agenda

• Project background
• Functional overview
• Under the hood
• Similar projects (if time)

Page 3: Hafslund SESAM - Semantic integration in practice

Lars Marius Garshol

• Consultant in Bouvet since 2007
  – focus on information architecture and semantics
• Worked with semantic technologies since 1999
  – mostly with Topic Maps
  – co-founder of Ontopia, later CTO
  – editor of several Topic Maps ISO standards 2001-
  – co-chair of TMRA conference 2006-2011
  – developed several key Topic Maps technologies
  – consultant in a number of Topic Maps projects
• Published a book on XML with Prentice-Hall
• Implemented Unicode support in the Opera web browser

Page 4: Hafslund SESAM - Semantic integration in practice

My role on the project

• The overall architecture is the brainchild of Axel Borge

• SDshare came from an idea by Graham Moore

• I only contributed parts of the design
  – and some parts of the implementation

• I don’t actually know the whole system

Page 5: Hafslund SESAM - Semantic integration in practice

Hafslund SESAM

Page 6: Hafslund SESAM - Semantic integration in practice

Hafslund ASA

• Norwegian energy company
  – founded 1898
  – 53% owned by the city of Oslo
  – responsible for energy grid around Oslo
  – 1.4 million customers
• A conglomerate of companies
  – Nett (electricity grid)
  – Fjernvarme (district heating)
  – Produksjon (power generation)
  – Venture
  – ...

Page 7: Hafslund SESAM - Semantic integration in practice

What if...?

[Diagram labels: Customer, Meter, Cables, Transformer, Work order, ERP, CRM, Bill]

Page 8: Hafslund SESAM - Semantic integration in practice

Hafslund SESAM

• An archive system, really
• Generally, archive systems are glorified trash cans
  – putting it in the archive effectively means hiding it
• Because archives are not important, are they?
• Except when you need that contract from 1937 about the right to build a power line across...

Page 9: Hafslund SESAM - Semantic integration in practice
Page 10: Hafslund SESAM - Semantic integration in practice

Problems with archives

• Many documents aren’t there
  – even though they should be,
  – because entering metadata is too much hassle
• Poor metadata, because nobody bothers to enter it properly
  – yet, much of the metadata exists in the user context
• Not used by anybody
  – strange, separate system with poor interface
  – (and the metadata is poor, too)
• Contains only documents
  – not connected to anything else

Page 11: Hafslund SESAM - Semantic integration in practice

Goals for the SESAM project

• Increase percentage of archived documents
  – to do this, archiving must be made easy
• Increase metadata quality
  – by automatically harvesting metadata
• Make it easy to find archived documents
  – by building a well-tuned search application

Page 12: Hafslund SESAM - Semantic integration in practice

What SESAM does

• Automatically enrich document metadata
  – to do that we have to collect background data from the source systems
• Connect document metadata with structured business data
  – this is a side-effect of the enrichment
• Provide search across the whole information space
  – once we’ve collected the data this part is easy

Page 13: Hafslund SESAM - Semantic integration in practice


As seen by customer

Page 14: Hafslund SESAM - Semantic integration in practice

High-level architecture

[Diagram components: ERP, CRM, Sharepoint, Archive, Search engine and Triple store, connected via SDshare and CMIS]

Page 15: Hafslund SESAM - Semantic integration in practice

Main principle of data extraction

• No canonical model!
  – instead, data reflects model of source system
• One ontology per source system
  – subtyped from core ontology where possible
• Vastly simplifies data extraction
  – for search purposes it loses us nothing
  – and translation is easier once the data is in the triple store

Page 16: Hafslund SESAM - Semantic integration in practice

Simplified core ontology

Page 17: Hafslund SESAM - Semantic integration in practice

Data structure in triple store

[Diagram: separate graphs for Sharepoint, CRM, Archive and ERP, linked by sameAs]

Page 18: Hafslund SESAM - Semantic integration in practice

Connecting data

• Contacts in many different systems
  – ERP has one set of contacts,
  – these are mirrored in archive
• Different IDs in these two systems
  – using “sameAs” we know which are the same
  – can do queries across the systems
  – can also translate IDs from one system to the other

[Diagram: a contact in ERP = the same contact in Archive, linked by sameAs]
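
The ID translation described on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's code: the `erp:` and `archive:` identifiers and the `translate_id` helper are made up, and sameAs links are assumed to be symmetric and transitive.

```python
from collections import defaultdict

# Hypothetical sameAs links between contact IDs in two systems.
SAME_AS = [
    ("erp:contact/1001", "archive:contact/A-17"),
    ("erp:contact/1002", "archive:contact/A-42"),
]

def build_index(links):
    """Index sameAs links in both directions, since sameAs is symmetric."""
    index = defaultdict(set)
    for left, right in links:
        index[left].add(right)
        index[right].add(left)
    return index

def equivalents(index, resource):
    """Return every ID reachable from `resource` by following sameAs links."""
    seen, todo = {resource}, [resource]
    while todo:
        for other in index[todo.pop()]:
            if other not in seen:
                seen.add(other)
                todo.append(other)
    return seen - {resource}

def translate_id(index, resource, target_prefix):
    """Translate an ID into the system whose IDs start with `target_prefix`."""
    matches = sorted(r for r in equivalents(index, resource)
                     if r.startswith(target_prefix))
    return matches[0] if matches else None

index = build_index(SAME_AS)
print(translate_id(index, "erp:contact/1001", "archive:"))
```

In a triple store the same lookup would normally be a query over the sameAs statements; the dictionary here just stands in for that.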

Page 19: Hafslund SESAM - Semantic integration in practice

Duplicate suppression

[Diagram labels: Customers, Companies, Customers (CRM), Customers (Billing), Suppliers (ERP), Data hub, Duke, SDshare, owl:sameAs]

Field      Record 1   Record 2     Probability
Name       acme inc   acme inc     0.9
Assoc no   177477707               0.5
Zip code   9161       9161         0.6
Country    norway     norway       0.51
Address 1  mb 113     mailbox 113  0.49
Address 2                          0.5

http://code.google.com/p/duke/
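
One common way to turn per-field probabilities like those in the table into a single match probability is naive Bayesian combination, where 0.5 means "no evidence either way". The sketch below assumes that approach; whether Duke combines fields exactly like this is an assumption here, and the function names are illustrative.

```python
from functools import reduce

def combine(p1, p2):
    """Combine two independent match probabilities with Bayes' rule."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))

def match_probability(field_probabilities):
    """Fold per-field probabilities into one; 0.5 is neutral evidence."""
    return reduce(combine, field_probabilities, 0.5)

# Per-field probabilities taken from the table on this slide.
fields = {
    "Name": 0.9, "Assoc no": 0.5, "Zip code": 0.6,
    "Country": 0.51, "Address 1": 0.49, "Address 2": 0.5,
}
overall = match_probability(fields.values())
print(f"overall match probability: {overall:.3f}")
```

Fields at 0.5 leave the result unchanged, and 0.51 and 0.49 cancel each other out, so here only the name and zip code move the result. A pair whose combined probability exceeds a chosen threshold would then be linked with owl:sameAs.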

Page 20: Hafslund SESAM - Semantic integration in practice

When archiving

• The user works on the document in some system
  – ERP, CRM, whatever
• This system knows the context
  – what user, project, equipment, etc is involved
• This information is passed to the CMIS server
  – it uses already gathered information from the triple store to attach more metadata

Page 21: Hafslund SESAM - Semantic integration in practice

Auto-tagging

[Diagram labels: Work order (sent to archive), Project, Manager, Customer, Equipment]

Page 22: Hafslund SESAM - Semantic integration in practice

Archive integration

• Clients deliver documents via CMIS
  – an OASIS standard for CMS interfaces
  – lots of available implementations
• Metadata translation
  – not only auto-tagging
  – client vocabulary translated to archive vocab
  – required static metadata added automatically
• Benefits
  – clients can reuse existing software for connecting
  – interface is independent of actual archive
  – archive integration is reusable

[Diagram labels: Archive, CMIS]

Page 23: Hafslund SESAM - Semantic integration in practice

Archive integration elsewhere

• Generally, archive integrations are hard
  – web site integration: 400 hours
  – #2 integration: performance problems
• They also reproduce the same functionality over and over again
• Hard bindings against
  – internal archive model
  – archive interface
• Switching archive system will be very hard...
  – even upgrading is hard

[Diagram labels: Archive, Regulation system, System #1, Outlook, System #2, Web site]

Page 24: Hafslund SESAM - Semantic integration in practice

Showing context in the ERP system

Page 25: Hafslund SESAM - Semantic integration in practice

Access control

• Users only see objects they’re allowed to see
• Implemented by search engine
  – all objects have lists of users/groups allowed to see them
  – on login a SPARQL query lists user’s access control group memberships
  – search engine uses this to filter search results
• In some cases, complex access rules are run to resolve ACLs before loading into triple store
  – e.g. archive system
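
The filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: the user names, group names and the `IndexedObject` type are invented, and a plain dictionary stands in for the result of the SPARQL query run at login.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexedObject:
    """A search hit together with the principals allowed to see it."""
    object_id: str
    title: str
    allowed: frozenset  # user names and group names

def principals_for(user, memberships):
    """The user plus every group the user belongs to.

    `memberships` stands in for the group list fetched at login.
    """
    return frozenset({user}) | frozenset(memberships.get(user, ()))

def filter_hits(hits, principals):
    """Keep only hits whose ACL shares at least one principal with the user."""
    return [hit for hit in hits if hit.allowed & principals]

memberships = {"kari": ["grid-ops"], "ola": ["billing", "archive-admins"]}
hits = [
    IndexedObject("doc-1", "Work order", frozenset({"grid-ops"})),
    IndexedObject("doc-2", "Invoice", frozenset({"billing"})),
    IndexedObject("doc-3", "Old contract", frozenset({"archive-admins", "kari"})),
]
visible_to_kari = [h.object_id
                   for h in filter_hits(hits, principals_for("kari", memberships))]
print(visible_to_kari)
```

A real search engine would apply the same test as a filter inside the query rather than over a list of hits, but the rule is the same: one shared principal is enough.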

Page 26: Hafslund SESAM - Semantic integration in practice

Data volumes

Graph            Statements
IFS data           5,417,260
Public 360 data    3,725,963
GeoNIS data           44,242
Tieto CAB data   138,521,810
Hummingbird 1     32,619,140
Hummingbird 2    165,671,179
Hummingbird 3    192,930,188
Hummingbird 4     48,623,178
Address data       2,415,315
Siebel data       36,117,786
Duke links             4,858
Total            626,090,919

Page 33: Hafslund SESAM - Semantic integration in practice


Under the hood

Page 34: Hafslund SESAM - Semantic integration in practice

RDF

• The data hub is an RDF database
  – RDF = Resource Description Framework
  – a W3C standard for data
  – also interchange format and query language (SPARQL)
• Many implementations of the standards
  – lots of alternatives to choose between
• Technology relatively well known
  – books, specifications, courses, conferences, ...

Page 35: Hafslund SESAM - Semantic integration in practice

How RDF works

‘PERSON’ table

ID  NAME                 EMAIL
1   Stian Danenbarger    stian.danenbarger@
2   Lars Marius Garshol  [email protected]
3   Axel Borge           axel.borge@bouvet

RDF-ized data

SUBJECT                      PROPERTY  OBJECT
http://example.com/person/1  rdf:type  ex:Person
http://example.com/person/1  ex:name   Stian Danenbarger
http://example.com/person/1  ex:email  stian.danenbarger@
http://example.com/person/2  rdf:type  ex:Person
http://example.com/person/2  ex:name   Lars Marius Garshol
...                          ...       ...
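
The row-to-triples step shown on this slide is mechanical enough to sketch in code. The function below is illustrative only: the `rdfize` name is made up, the email addresses are placeholders, and a real exporter would emit proper RDF terms rather than plain strings.

```python
BASE = "http://example.com/person/"

# Rows of a hypothetical PERSON table; the email values are placeholders.
rows = [
    {"ID": 1, "NAME": "Stian Danenbarger", "EMAIL": "stian@example.com"},
    {"ID": 2, "NAME": "Lars Marius Garshol", "EMAIL": "larsga@example.com"},
]

def rdfize(row, base=BASE, row_type="ex:Person"):
    """Turn one table row into (subject, property, object) triples."""
    subject = f"{base}{row['ID']}"
    triples = [(subject, "rdf:type", row_type)]
    for column, value in row.items():
        if column == "ID" or value is None:
            continue  # the ID lives in the subject URI; NULLs yield no triple
        triples.append((subject, f"ex:{column.lower()}", value))
    return triples

triples = [t for row in rows for t in rdfize(row)]
for triple in triples:
    print(*triple, sep="  ")
```

The primary key becomes the subject URI, the table name becomes the type, and each remaining column becomes one property, so no per-table code is needed.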

Page 36: Hafslund SESAM - Semantic integration in practice

RDF/XML

• Standard XML format for RDF
  – not really very nice
• However, it’s a generic format
  – that is, it’s the same format regardless of what data you’re transmitting
• Therefore
  – all data flows are in the same format
  – absolutely no syntax transformations whatsoever

Page 37: Hafslund SESAM - Semantic integration in practice

Plug-and-play

• Data source integrations are pluggable
  – if you need another data source, connect and pull in the data
• RDF database is schemaless
  – no need to create tables and columns beforehand
• No common data model
  – do not need to transform data to store it

[Diagram labels: Data hub, System #1, System #2]

Page 38: Hafslund SESAM - Semantic integration in practice

Product queue

• Since sources are pluggable we can guide the project with a product queue
  – consisting of new sources and data elements
• We analyse the items in the queue
  – estimate complexity and amount of work
  – also to see how it fits into what’s already there
• Customer then decides what to do, and in what order

Page 39: Hafslund SESAM - Semantic integration in practice

Don’t push!

• The general IT approach is to push data
  – source calls services provided by recipient
• Leads to high complexity
  – two-way dependency between systems
  – have to extend source system with extra logic
  – extra triggers/threads in source system
  – many moving parts

[Diagram: System #1]

Page 40: Hafslund SESAM - Semantic integration in practice

Pull!

• We let the recipient pull
  – always using same protocol and same format
• Wrap the source
  – the wrapper must support 3 simple functions
• A different kind of solution
  – one-way dependency
  – often zero code
  – wrapper is thin and stateless
  – data moving done by reused code

[Diagram: System #1]

Page 41: Hafslund SESAM - Semantic integration in practice

The data integration

• All data transport done by SDshare
• A simple Atom-based specification for synchronizing RDF data
  – http://www.sdshare.org
• Provides two main features
  – snapshot of the data
  – fragments for each updated resource

Page 42: Hafslund SESAM - Semantic integration in practice

Basics of SDshare

• Source offers
  – a dump of the entire data set
  – a list of fragments changed since time t
  – a dump of each fragment
• Completely generic solution
  – always the same protocol
  – always the same data format (RDF/XML)
• A generic SDshare client then transfers the data
  – to the recipient, whatever it is

Page 43: Hafslund SESAM - Semantic integration in practice

SDshare service structure

Page 44: Hafslund SESAM - Semantic integration in practice

Typical usage of SDshare

• Client downloads snapshot
  – client now has complete data set
• Client polls fragment feed
  – each time asking for new fragments since last check
  – client keeps track of time of last check
  – fragments are applied to data, keeping them in sync
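
The snapshot-then-poll cycle can be sketched as below. This is not the actual SDshare client: `InMemorySource` is a made-up stand-in for a server that would really serve Atom feeds and RDF/XML over HTTP, and timestamps are plain integers to keep the example short.

```python
class InMemorySource:
    """Stand-in for an SDshare server; a real one would serve Atom over HTTP."""

    def __init__(self):
        self.data = {}      # resource URI -> set of (property, value)
        self.changes = []   # (timestamp, resource URI), oldest first

    def update(self, timestamp, resource, statements):
        self.data[resource] = set(statements)
        self.changes.append((timestamp, resource))

    def snapshot(self):
        return {res: set(stmts) for res, stmts in self.data.items()}

    def fragments_since(self, since):
        return [(ts, res) for ts, res in self.changes if ts > since]

    def fragment(self, resource):
        return set(self.data[resource])


class Client:
    def __init__(self, source):
        self.source = source
        self.store = {}
        self.last_check = 0

    def load_snapshot(self, now):
        self.store = self.source.snapshot()
        self.last_check = now

    def poll(self):
        """Fetch and apply every fragment changed since the last check."""
        for timestamp, resource in self.source.fragments_since(self.last_check):
            self.store[resource] = self.source.fragment(resource)
            self.last_check = max(self.last_check, timestamp)


source = InMemorySource()
source.update(1, "http://example.com/person/1", {("ex:name", "Stian Danenbarger")})
client = Client(source)
client.load_snapshot(now=1)
source.update(2, "http://example.com/person/2", {("ex:name", "Lars Marius Garshol")})
client.poll()
print(client.store == source.snapshot())
```

The only state the client keeps besides the data itself is the time of the last check, which is what makes it easy to restart or replace.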

Page 45: Hafslund SESAM - Semantic integration in practice

Implementing the fragment feed

select objid, objtype, change_time
from history_log
where change_time > :since:
order by change_time asc

<atom>
  <title>Fragments for ...</title>
  ...
  <entry>
    <title>Change to 34121</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:23</updated>
  </entry>
  <entry>
    <title>Change to 94857</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:24</updated>
  </entry>
  ...
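
To show how little code the query-to-feed step needs, here is a rough sketch using Python's standard library. Several things are assumptions: the `history_log` columns follow the query on this slide, the `objtype` values and the `base_uri` are invented, the link relation is simplified to `fragment` as on the slide, and the `sdshare:resource` element is left out because its namespace is not given here.

```python
import sqlite3
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def fragment_feed(conn, since, base_uri="http://example.com/fragments/"):
    """Build an Atom feed listing the objects changed after `since`."""
    rows = conn.execute(
        "select objid, objtype, change_time from history_log "
        "where change_time > :since order by change_time asc",
        {"since": since},
    ).fetchall()
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = f"Fragments since {since}"
    for objid, objtype, change_time in rows:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}title").text = f"Change to {objid}"
        ET.SubElement(entry, f"{{{ATOM}}}link", rel="fragment",
                      href=f"{base_uri}{objtype}/{objid}")
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = change_time
    return feed

# In-memory table standing in for the source system's change log.
conn = sqlite3.connect(":memory:")
conn.execute("create table history_log (objid integer, objtype text, change_time text)")
conn.executemany("insert into history_log values (?, ?, ?)", [
    (34121, "workorder", "2012-09-06T08:22:23"),
    (94857, "workorder", "2012-09-06T08:22:24"),
    (11111, "customer", "2012-09-05T10:00:00"),
])
feed = fragment_feed(conn, "2012-09-06T00:00:00")
print(ET.tostring(feed, encoding="unicode"))
```

Because the timestamps are ISO 8601 strings, comparing them as text gives the right ordering, and the wrapper keeps no state of its own: everything comes from the change log on each request.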

Page 46: Hafslund SESAM - Semantic integration in practice

The SDshare client

[Diagram labels: Frontend, Core, SPARQL backend, POST backend, Triple store, WS]

http://code.google.com/p/sdshare-client/

Page 47: Hafslund SESAM - Semantic integration in practice

Getting data out of the triple store

• Set up SPARQL queries to extract the data
• Server does the rest
• Queries can be configured to produce
  – any subset of data
  – data in any shape

[Diagram labels: RDF, SDshare server, SPARQL]

Page 48: Hafslund SESAM - Semantic integration in practice

Contacts into the archive

• We want some resources in the triple store to be written into the archive as “contacts”
  – need to select which resources to include
  – must also transform from source data model
• How to achieve this without hard-wiring anything?

Page 49: Hafslund SESAM - Semantic integration in practice

Contacts solution

• Create a generic archive object writer
  – type of RDF resource specifies type of object to create
  – name of RDF property (within namespace) specifies which property to set
• Set up RDF mapping from source data
  – type1 maps-to type2
  – prop1 maps-to prop2
  – only mapped types/properties included
• Use SPARQL to
  – create SDshare feed
  – do data translation with CONSTRUCT query
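
The mapping-driven translation can be illustrated outside SPARQL as well. The sketch below is a plain-Python stand-in for what the slide does with a CONSTRUCT query; all type and property names in the mapping are hypothetical.

```python
# Hypothetical mapping: source types/properties -> archive vocabulary.
TYPE_MAP = {"erp:Supplier": "archive:Contact", "crm:Customer": "archive:Contact"}
PROP_MAP = {"erp:name": "archive:name", "crm:fullName": "archive:name",
            "crm:email": "archive:email"}

def translate(triples, type_map=TYPE_MAP, prop_map=PROP_MAP):
    """Rewrite triples into the target vocabulary, dropping anything unmapped."""
    # Only resources whose type is mapped are included at all.
    included = {s for s, p, o in triples if p == "rdf:type" and o in type_map}
    out = []
    for s, p, o in triples:
        if s not in included:
            continue
        if p == "rdf:type":
            if o in type_map:
                out.append((s, p, type_map[o]))
        elif p in prop_map:
            out.append((s, prop_map[p], o))
    return out

source = [
    ("ex:c1", "rdf:type", "crm:Customer"),
    ("ex:c1", "crm:fullName", "Acme Inc"),
    ("ex:c1", "crm:segment", "B2B"),          # unmapped property: dropped
    ("ex:w1", "rdf:type", "erp:WorkOrder"),   # unmapped type: resource dropped
    ("ex:w1", "erp:name", "Replace meter"),
]
contacts = translate(source)
for triple in contacts:
    print(*triple)
```

Since the mapping is itself data, adding a new source of contacts means adding mapping entries, not changing the writer.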

Page 50: Hafslund SESAM - Semantic integration in practice

Properties of the system

• Uniform integration approach
  – everything is done the same way
• Really simple integration
  – setting up a data source is generally very easy
• Loose bindings
  – components can easily be replaced
• Very little state
  – most components are stateless (or have little state)
• Idempotent
  – applying a fragment 1 or many times: same result
• Clear and reload
  – can delete everything and reload at any time
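
Idempotence follows if applying a fragment means "replace everything known about this resource with the fragment's statements". The sketch below assumes that reading, and also assumes that an empty fragment means the resource has been deleted; `apply_fragment` is an illustrative name, not the client's API.

```python
def apply_fragment(store, resource, statements):
    """Replace everything known about `resource` with the fragment's statements."""
    updated = dict(store)
    if statements:
        updated[resource] = frozenset(statements)
    else:
        updated.pop(resource, None)  # assumed: an empty fragment means deletion
    return updated

fragment = ("http://example.com/person/3", {("ex:name", "Axel Borge")})
store = {}
once = apply_fragment(store, *fragment)
twice = apply_fragment(once, *fragment)
print(once == twice)
```

Because a fragment replaces rather than appends, a client that crashes halfway through a poll can simply repeat it, and a full clear and reload ends in the same state as incremental updates.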

Page 51: Hafslund SESAM - Semantic integration in practice


Conclusion SESAM

Page 52: Hafslund SESAM - Semantic integration in practice

Project outcome

• Big, innovative project
  – many developers over a long time period
  – innovated a number of techniques and tools
• Delivered on time, and working
  – despite an evolving project context
• Very satisfied customer
  – they claim the project has already paid for itself in cost savings at the document center

Page 53: Hafslund SESAM - Semantic integration in practice

Archive of the year 2012

“The organisation has been innovative and strategic in its use of technology, both for the old paper archives and for the digital archive. They have a strong focus on increasing data capture and simplifying the use of metadata. Their goal has been that users should only have to enter a single metadata item. They are now down to two, with everything else added automatically by the solution, but the goal is still to get down to just one.”

http://www.arkivrad.no/utdanning.html?newsid=10677

Page 54: Hafslund SESAM - Semantic integration in practice

Highly flexible solution

• ERP integration replaced twice
  – without affecting other components
• CRM system replaced while in production
  – all customers got new IDs
  – SESAM hid this from users, making transition 100% transparent

Page 55: Hafslund SESAM - Semantic integration in practice

Status now

• In production since autumn 2011
• Used by
  – Hafslund Fjernvarme
  – Hafslund support centre

Page 56: Hafslund SESAM - Semantic integration in practice

Other implementations

• A very similar system has been implemented for Statkraft
• A system based on the same architecture has been implemented for DSS
  – basically an intranet system

Page 57: Hafslund SESAM - Semantic integration in practice


DSS Intranet

Page 58: Hafslund SESAM - Semantic integration in practice

DSS project background

• Delivers IT services and platform to ministries
  – archive, intranet, email, ...
• Intranet currently based on EPiServer
  – in addition, want to use Sharepoint
  – of course, many other related systems...
• How to integrate all this?
  – integrations must be loose, as systems in 13 ministries come and go

Page 59: Hafslund SESAM - Semantic integration in practice

Metadata structure

[Diagram: metadata model, labels translated from Norwegian]
Entities: Person; Topic / work area / archive key; Project / case; Document / content; Org. unit (section, division, ministry, organisation)
Relations: is responsible for; works on; has role in relation to; is responsible for / has written; belongs to; is about; is expert on; belongs to / has role in; is involved in; is relevant for
Attributes: file type; document type; case / process type; date; description / competence; title, phone, address, …; case no.; status

Page 60: Hafslund SESAM - Semantic integration in practice

Project scope

[Diagram labels: IDM, EPiServer (intranet), Sharepoint, User information, Access groups]

Page 61: Hafslund SESAM - Semantic integration in practice
Page 62: Hafslund SESAM - Semantic integration in practice
Page 63: Hafslund SESAM - Semantic integration in practice
Page 64: Hafslund SESAM - Semantic integration in practice
Page 65: Hafslund SESAM - Semantic integration in practice
Page 66: Hafslund SESAM - Semantic integration in practice
Page 67: Hafslund SESAM - Semantic integration in practice
Page 68: Hafslund SESAM - Semantic integration in practice
Page 69: Hafslund SESAM - Semantic integration in practice
Page 70: Hafslund SESAM - Semantic integration in practice
Page 71: Hafslund SESAM - Semantic integration in practice
Page 72: Hafslund SESAM - Semantic integration in practice
Page 73: Hafslund SESAM - Semantic integration in practice
Page 74: Hafslund SESAM - Semantic integration in practice
Page 75: Hafslund SESAM - Semantic integration in practice
Page 76: Hafslund SESAM - Semantic integration in practice
Page 77: Hafslund SESAM - Semantic integration in practice
Page 78: Hafslund SESAM - Semantic integration in practice
Page 79: Hafslund SESAM - Semantic integration in practice


How we built it

Page 80: Hafslund SESAM - Semantic integration in practice

The web part

• Displays data given
  – a query
  – an XSLT stylesheet
• Can run anywhere
  – standard protocol to the database
• Easy to include in other systems too

[Diagram labels: Virtuoso RDF DB, EPiServer, Sharepoint, Novell IDM, Active Directory, Regjeringen.no, Web part (x2); connections labelled SPARQL and SDShare]

Page 81: Hafslund SESAM - Semantic integration in practice

Data flow (simplified)

[Diagram labels: Virtuoso RDF DB; Novell IDM; Sharepoint; EPiServer; flows labelled Employees, Org. structure, Categories, Page structure, Workspaces, Documents, Lists]

Page 82: Hafslund SESAM - Semantic integration in practice

Independence of source

• Solutions need to be independent of the data source
  – when we extract data, or
  – when we run queries against the data hub
• Independence isolates clients from changes to the sources
  – allows us to replace sources completely, or
  – change the source models
• Of course, total independence is impossible
  – if structures are too different it won’t work
  – but surprisingly often they’re not

Page 83: Hafslund SESAM - Semantic integration in practice

Independence of source #1

[Diagram: three models side by side]
Core model: core:Person, core:Project, core:participant
IDM: idm:Person, idm:Project, idm:has-member
Sharepoint: sp:Person, sp:Project, sp:member-of

Page 84: Hafslund SESAM - Semantic integration in practice

Independence of source #2

• “Facebook streams” require queries across sources
  – every source has its own model
• We can reduce the differences with annotation
  – i.e. we describe the data with RDF
• Allows data models to
  – change, or
  – be plugged in
• with no changes to query

sp:doc-created a dss:StreamEvent;
  rdfs:label "opprettet";            # "created"
  dss:time-property sp:Created;
  dss:user-property sp:Author .

sp:doc-updated a dss:StreamEvent;
  rdfs:label "endret";               # "changed"
  dss:time-property sp:Modified;
  dss:user-property sp:Editor .

idm:ny-ansatt a dss:StreamEvent;
  rdfs:label "ble ansatt i";         # "was employed in"
  dss:time-property idm:ftAnsattFra .
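
To make the idea concrete, here is a small sketch of a stream builder driven only by annotations like the ones on this slide. It is an illustration under assumptions: the annotations are copied into a Python dictionary instead of being read from the triple store, and the resources and dates are invented.

```python
# Annotations mirroring the Turtle on this slide: event -> properties to read.
STREAM_EVENTS = {
    "sp:doc-created": {"label": "opprettet", "time": "sp:Created", "user": "sp:Author"},
    "sp:doc-updated": {"label": "endret", "time": "sp:Modified", "user": "sp:Editor"},
    "idm:ny-ansatt": {"label": "ble ansatt i", "time": "idm:ftAnsattFra", "user": None},
}

def stream(resources, events=STREAM_EVENTS):
    """Build a newest-first activity stream without knowing any source model."""
    items = []
    for uri, props in resources.items():
        for event in events.values():
            when = props.get(event["time"])
            if when is None:
                continue  # this resource has no such event
            who = props.get(event["user"]) if event["user"] else None
            items.append((when, who, event["label"], uri))
    return sorted(items, key=lambda item: item[0], reverse=True)

# Hypothetical resources, each a dictionary of property -> value.
resources = {
    "sp:doc/7": {"sp:Created": "2012-10-01", "sp:Author": "kari",
                 "sp:Modified": "2012-10-03", "sp:Editor": "ola"},
    "idm:person/ola": {"idm:ftAnsattFra": "2012-09-15"},
}
for when, who, label, uri in stream(resources):
    print(when, who, label, uri)
```

Plugging in a new source then only requires a new annotation entry; the `stream` function, like the query it stands in for, stays unchanged.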

Page 85: Hafslund SESAM - Semantic integration in practice


Conclusion

Page 86: Hafslund SESAM - Semantic integration in practice

Execution

• Customer asked for start August 15, delivery January 1
  – sources: EPiServer, Sharepoint, IDM
• We offered fixed price, done November 1
• Delivered October 20
  – also integrated with Active Directory
  – added Regjeringen.no for extra value
• Integration with ACOS Websak
  – has been analyzed and described
  – but not implemented

Page 87: Hafslund SESAM - Semantic integration in practice

The data sources

Source            Connector
Active Directory  LDAP
IDM               LDAP
Intranet          EPiServer
Regjeringen.no    EPiServer
Sharepoint        Sharepoint

Page 88: Hafslund SESAM - Semantic integration in practice

Conclusion

• We could do this so quickly because
  – the architecture is right
  – we have components we can reuse
  – the architecture allows us to plug together the components in different settings
• Once we have the data, the rest is usually simple
  – it’s getting the data that’s hard