SESAM
Lars Marius Garshol, Baksia, [email protected], http://twitter.com/larsga
Agenda
• Project background
• Functional overview
• Under the hood
• Similar projects (if time)
Lars Marius Garshol
• Consultant at Bouvet since 2007
– focus on information architecture and semantics
• Worked with semantic technologies since 1999
– mostly with Topic Maps
– co-founder of Ontopia, later CTO
– editor of several Topic Maps ISO standards 2001-
– co-chair of the TMRA conference 2006-2011
– developed several key Topic Maps technologies
– consultant on a number of Topic Maps projects
• Published a book on XML with Prentice-Hall
• Implemented Unicode support in the Opera web browser
My role on the project
• The overall architecture is the brainchild of Axel Borge
• SDshare came from an idea by Graham Moore
• I only contributed parts of the design
– and some parts of the implementation
• I don't actually know the whole system
Hafslund SESAM
Hafslund ASA
• Norwegian energy company
– founded 1898
– 53% owned by the city of Oslo
– responsible for the energy grid around Oslo
– 1.4 million customers
• A conglomerate of companies
– Nett (electricity grid)
– Fjernvarme (district heating)
– Produksjon (power generation)
– Venture
– ...
What if...?
[Diagram: customer, meter, cables, transformer, work order, and bill, with the data spread across ERP and CRM]
Hafslund SESAM
• An archive system, really
• Generally, archive systems are glorified trash cans
– putting it in the archive effectively means hiding it
• Because archives are not important, are they?
• Except when you need that contract from 1937 about the right to build a power line across...
Problems with archives
• Many documents aren't there
– even though they should be,
– because entering metadata is too much hassle
• Poor metadata, because nobody bothers to enter it properly
– yet much of the metadata exists in the user's context
• Not used by anybody
– strange, separate system with a poor interface
– (and the metadata is poor, too)
• Contains only documents
– not connected to anything else
Goals for the SESAM project
• Increase the percentage of archived documents
– to do this, archiving must be made easy
• Increase metadata quality
– by automatically harvesting metadata
• Make it easy to find archived documents
– by building a well-tuned search application
What SESAM does
• Automatically enrich document metadata
– to do that we have to collect background data from the source systems
• Connect document metadata with structured business data
– this is a side-effect of the enrichment
• Provide search across the whole information space
– once we've collected the data this part is easy
As seen by customer
High-level architecture
[Diagram: ERP, CRM, Sharepoint, and the archive feed a triple store via SDshare; the search engine is fed from the triple store; documents go to the archive via CMIS]
Main principle of data extraction
• No canonical model!
– instead, the data reflects the model of the source system
• One ontology per source system
– subtyped from a core ontology where possible
• Vastly simplifies data extraction
– for search purposes it loses us nothing
– and translation is easier once the data is in the triple store
Simplified core ontology
Data structure in triple store
[Diagram: separate graphs for Sharepoint, CRM, Archive, and ERP data, connected by sameAs links]
Connecting data
• Contacts in many different systems
– ERP has one set of contacts,
– these are mirrored in the archive
• Different IDs in these two systems
– using "sameAs" we know which are the same
– can do queries across the systems
– can also translate IDs from one system to the other
[Diagram: an ERP contact and an archive contact linked by sameAs]
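The ID translation the sameAs links enable is mechanical; a minimal Python sketch, where the URIs and link pairs are invented for illustration:

```python
# Hypothetical sameAs links between ERP and archive contact URIs.
SAME_AS = [
    ("http://example.com/erp/contact/1001", "http://example.com/archive/contact/A-17"),
    ("http://example.com/erp/contact/1002", "http://example.com/archive/contact/A-42"),
]

# Build lookup tables in both directions, so translation is symmetric.
erp_to_archive = dict(SAME_AS)
archive_to_erp = {a: e for e, a in SAME_AS}

def translate(uri):
    """Translate a contact URI to its sameAs twin in the other system."""
    return erp_to_archive.get(uri) or archive_to_erp.get(uri)
```

In the real system the links live in the triple store, so the same translation can be done with a query over owl:sameAs.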
Duplicate suppression
[Diagram: customer and company records from CRM and Billing flow through Duke into the data hub]
Field     | Record 1  | Record 2    | Probability
----------|-----------|-------------|------------
Name      | acme inc  | acme inc    | 0.9
Assoc no  | 177477707 |             | 0.5
Zip code  | 9161      | 9161        | 0.6
Country   | norway    | norway     | 0.51
Address 1 | mb 113    | mailbox 113 | 0.49
Address 2 |           |             | 0.5
http://code.google.com/p/duke/
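Duke combines the per-field probabilities Bayesian-style into one overall match probability; a sketch of that combination using the numbers from the table above (the combination formula follows Duke's documented naive-Bayes approach; note that a missing field contributes a neutral 0.5):

```python
def combine(probs):
    """Naive-Bayes combination of per-field match probabilities:
    p = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    num, den = 1.0, 1.0
    for p in probs:
        num *= p
        den *= (1.0 - p)
    return num / (num + den)

# Per-field probabilities from the comparison table above.
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
overall = combine(fields)  # well above 0.9, so likely the same company
```

If the overall probability crosses a configured threshold, the pair is declared a match and an owl:sameAs link is produced.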
[Diagram: confirmed matches are written back as owl:sameAs links and flow onward via SDshare, e.g. for ERP suppliers]
When archiving
• The user works on the document in some system
– ERP, CRM, whatever
• This system knows the context
– what user, project, equipment, etc. is involved
• This information is passed to the CMIS server
– it uses information already gathered in the triple store to attach more metadata
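The enrichment step amounts to a lookup: the client supplies the context it knows, and the server pulls in related metadata already harvested into the triple store. A minimal sketch, where all identifiers and the data shape are invented for illustration:

```python
# Hypothetical pre-harvested background data: work order -> related metadata.
TRIPLE_STORE = {
    "workorder:34121": {
        "project": "project:P-88",
        "customer": "customer:C-555",
        "equipment": ["equipment:E-1", "equipment:E-2"],
    },
}

def enrich(client_metadata):
    """Attach extra metadata based on the context the client already knows."""
    enriched = dict(client_metadata)
    context = TRIPLE_STORE.get(client_metadata.get("workorder"), {})
    enriched.update(context)
    return enriched

doc = enrich({"title": "Inspection report", "workorder": "workorder:34121"})
```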
Auto-tagging
[Diagram: a document sent to the archive from a work order is auto-tagged with the related project, manager, customer, and equipment]
Archive integration
• Clients deliver documents via CMIS
– an OASIS standard for CMS interfaces
– lots of available implementations
• Metadata translation
– not only auto-tagging
– client vocabulary translated to archive vocabulary
– required static metadata added automatically
• Benefits
– clients can reuse existing software for connecting
– the interface is independent of the actual archive
– the archive integration is reusable
[Diagram: clients talk CMIS to a gateway in front of the archive]
Archive integration elsewhere
• Generally, archive integrations are hard
– web site integration: 400 hours
– integration #2: performance problems
• They also reproduce the same functionality over and over again
• Hard bindings against
– the internal archive model
– the archive interface
• Switching archive systems will be very hard...
– even upgrading is hard
[Diagram: the regulation system, system #1, Outlook, system #2, and the web site each bind directly to the archive]
Showing context in the ERP system
Access control
• Users only see objects they're allowed to see
• Implemented by the search engine
– all objects have lists of users/groups allowed to see them
– on login, a SPARQL query lists the user's access control group memberships
– the search engine uses this to filter search results
• In some cases, complex access rules are run to resolve ACLs before loading into the triple store
– e.g. the archive system
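The filtering itself is a set intersection per object; a sketch of the idea in Python (object IDs, group names, and the ACL layout are invented; in SESAM the group memberships come from a SPARQL query at login and the filtering is done inside the search engine):

```python
# Hypothetical ACLs: each object lists the groups allowed to see it.
OBJECT_ACLS = {
    "doc:1": {"grid-engineers"},
    "doc:2": {"hr"},
    "doc:3": {"grid-engineers", "hr"},
}

def filter_results(results, user_groups):
    """Keep only objects whose ACL intersects the user's group memberships."""
    return [r for r in results if OBJECT_ACLS.get(r, set()) & user_groups]
```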
Data volumes

Graph           | Statements
----------------|------------
IFS data        |   5,417,260
Public 360 data |   3,725,963
GeoNIS data     |      44,242
Tieto CAB data  | 138,521,810
Hummingbird 1   |  32,619,140
Hummingbird 2   | 165,671,179
Hummingbird 3   | 192,930,188
Hummingbird 4   |  48,623,178
Address data    |   2,415,315
Siebel data     |  36,117,786
Duke links      |       4,858
Total           | 626,090,919
Under the hood
RDF
• The data hub is an RDF database
– RDF = Resource Description Framework
– a W3C standard for data
– also an interchange format and a query language (SPARQL)
• Many implementations of the standards
– lots of alternatives to choose between
• The technology is relatively well known
– books, specifications, courses, conferences, ...
How RDF works

'PERSON' table:

ID | NAME                | EMAIL
---|---------------------|-------------------
1  | Stian Danenbarger   | stian.danenbarger@
2  | Lars Marius Garshol |
3  | Axel Borge          | axel.borge@bouvet

RDF-ized data:

SUBJECT                     | PROPERTY | OBJECT
----------------------------|----------|--------------------
http://example.com/person/1 | rdf:type | ex:Person
http://example.com/person/1 | ex:name  | Stian Danenbarger
http://example.com/person/1 | ex:email | stian.danenbarger@
http://example.com/person/2 | rdf:type | ex:Person
http://example.com/person/2 | ex:name  | Lars Marius Garshol
...                         | ...      | ...
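The RDF-izing shown above is mechanical; a sketch of the conversion (the base URI, namespace prefixes, and column-to-property mapping are my own invented conventions):

```python
def row_to_triples(table, row, base="http://example.com"):
    """Turn one table row into RDF triples: the primary key becomes the
    subject URI, and each non-empty column becomes a property."""
    subject = f"{base}/{table.lower()}/{row['ID']}"
    triples = [(subject, "rdf:type", f"ex:{table.capitalize()}")]
    for column, value in row.items():
        if column != "ID" and value:
            triples.append((subject, f"ex:{column.lower()}", value))
    return triples

triples = row_to_triples("PERSON", {"ID": 1, "NAME": "Stian Danenbarger", "EMAIL": ""})
```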
RDF/XML
• The standard XML format for RDF
– not really very nice
• However, it's a generic format
– that is, it's the same format regardless of what data you're transmitting
• Therefore
– all data flows are in the same format
– absolutely no syntax transformations whatsoever
Plug-and-play
• Data source integrations are pluggable
– if you need another data source, connect and pull in the data
• The RDF database is schemaless
– no need to create tables and columns beforehand
• No common data model
– no need to transform data to store it
[Diagram: systems #1 and #2 plug into the data hub]
Product queue
• Since sources are pluggable, we can guide the project with a product queue
– consisting of new sources and data elements
• We analyse the items in the queue
– estimate complexity and amount of work
– also to see how each fits into what's already there
• The customer then decides what to do, and in what order
Don’t push!
• The general IT approach is to push data
– the source calls services provided by the recipient
• Leads to high complexity
– two-way dependency between systems
– have to extend the source system with extra logic
– extra triggers/threads in the source system
– many moving parts
Pull!
• We let the recipient pull
– always using the same protocol and the same format
• Wrap the source
– the wrapper must support 3 simple functions
• A different kind of solution
– one-way dependency
– often zero code
– the wrapper is thin and stateless
– data moving done by reused code
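The three functions the wrapper must support correspond to SDshare's snapshot, changed-fragments list, and per-fragment dump. A sketch of the interface, with a trivial in-memory source as an example (the class and method names are mine, not from the SDshare spec):

```python
from abc import ABC, abstractmethod

class SourceWrapper(ABC):
    """The thin, stateless wrapper around a source system."""

    @abstractmethod
    def snapshot(self):
        """Return every triple in the source, as one full dump."""

    @abstractmethod
    def changed_since(self, timestamp):
        """Return IDs of resources changed after the given time."""

    @abstractmethod
    def fragment(self, resource_id):
        """Return all triples describing a single resource."""

class DictWrapper(SourceWrapper):
    """Minimal in-memory example source."""
    def __init__(self, data, changes):
        self.data, self.changes = data, changes  # {id: triples}, [(time, id)]
    def snapshot(self):
        return [t for triples in self.data.values() for t in triples]
    def changed_since(self, timestamp):
        return [rid for t, rid in self.changes if t > timestamp]
    def fragment(self, resource_id):
        return self.data[resource_id]
```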
The data integration
• All data transport is done by SDshare
• A simple Atom-based specification for synchronizing RDF data
– http://www.sdshare.org
• Provides two main features
– a snapshot of the data
– fragments for each updated resource
Basics of SDshare
• The source offers
– a dump of the entire data set
– a list of fragments changed since time t
– a dump of each fragment
• Completely generic solution
– always the same protocol
– always the same data format (RDF/XML)
• A generic SDshare client then transfers the data
– to the recipient, whatever it is
SDshare service structure
Typical usage of SDshare
• The client downloads the snapshot
– the client now has the complete data set
• The client polls the fragment feed
– each time asking for new fragments since the last check
– the client keeps track of the time of the last check
– fragments are applied to the data, keeping it in sync
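The polling loop can be sketched as follows; the feed is faked with an in-memory list of (timestamp, resource) pairs, where a real client would do HTTP GETs against the Atom fragment feed:

```python
import datetime

# In-memory stand-in for the fragment feed.
FEED = [
    (datetime.datetime(2012, 9, 6, 8, 22, 23), "resource:34121"),
    (datetime.datetime(2012, 9, 6, 8, 22, 24), "resource:94857"),
]

class SDshareClient:
    def __init__(self):
        # The only state the client keeps: the time of the last check.
        self.last_check = datetime.datetime.min
        self.store = {}

    def poll(self):
        """Ask for fragments changed since the last check and apply them."""
        now = datetime.datetime.now()
        for updated, resource in FEED:
            if updated > self.last_check:
                self.apply_fragment(resource)
        self.last_check = now

    def apply_fragment(self, resource):
        # Replacing the resource's data wholesale keeps application idempotent.
        self.store[resource] = f"triples for {resource}"
```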
Implementing the fragment feed

select objid, objtype, change_time
from history_log
where change_time > :since:
order by change_time asc

<atom>
  <title>Fragments for ...</title>
  ...
  <entry>
    <title>Change to 34121</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:23</updated>
  </entry>
  <entry>
    <title>Change to 94857</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:24</updated>
  </entry>
...
The SDshare client
[Diagram: a web-service frontend drives the client core, which writes to the triple store through a SPARQL backend or a POST backend]
http://code.google.com/p/sdshare-client/
Getting data out of the triple store
• Set up SPARQL queries to extract the data
• The server does the rest
• Queries can be configured to produce
– any subset of the data
– data in any shape
[Diagram: the SDshare server runs SPARQL against the triple store and serves the results as RDF]
Contacts into the archive
• We want some resources in the triple store to be written into the archive as "contacts"
– need to select which resources to include
– must also transform from the source data model
• How do we achieve this without hard-wiring anything?
Contacts solution
• Create a generic archive object writer
– the type of the RDF resource specifies the type of object to create
– the name of the RDF property (within a namespace) specifies which property to set
• Set up an RDF mapping from the source data
– type1 maps-to type2
– prop1 maps-to prop2
– only mapped types/properties are included
• Use SPARQL to
– create the SDshare feed
– do the data translation with a CONSTRUCT query
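The mapping-driven translation can be sketched in a few lines; the real system does this with a SPARQL CONSTRUCT query, and the vocabulary names below are invented for illustration:

```python
# Hypothetical mapping from source vocabulary to archive vocabulary.
TYPE_MAP = {"erp:Supplier": "arkiv:Contact"}
PROP_MAP = {"erp:name": "arkiv:name", "erp:phone": "arkiv:phone"}

def translate(triples):
    """Translate types and properties; anything unmapped is excluded."""
    out = []
    for s, p, o in triples:
        if p == "rdf:type" and o in TYPE_MAP:
            out.append((s, p, TYPE_MAP[o]))
        elif p in PROP_MAP:
            out.append((s, PROP_MAP[p], o))
        # unmapped types/properties are silently dropped
    return out
```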
Properties of the system
• Uniform integration approach
– everything is done the same way
• Really simple integration
– setting up a data source is generally very easy
• Loose bindings
– components can easily be replaced
• Very little state
– most components are stateless (or have little state)
• Idempotent
– applying a fragment once or many times gives the same result
• Clear and reload
– we can delete everything and reload at any time
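Idempotency falls out of how a fragment is applied: first delete everything known about the resource, then insert the fragment's triples. A sketch with a list of triples standing in for the store:

```python
def apply_fragment(store, resource, fragment_triples):
    """Apply one fragment: remove all statements whose subject is the
    resource, then add the new triples. Running this once or N times
    leaves the store in exactly the same state."""
    store[:] = [t for t in store if t[0] != resource]
    store.extend(fragment_triples)
```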
Conclusion: SESAM
Project outcome
• Big, innovative project
– many developers over a long time period
– innovated a number of techniques and tools
• Delivered on time, and working
– despite an evolving project context
• Very satisfied customer
– they claim the project has already paid for itself in cost savings at the document center
Archive of the year 2012
"The organization has been innovative and strategic in its use of technology, both for the old paper archives and for the digital archive. They have a strong focus on increasing data capture and simplifying the use of metadata. Their goal has been that users should only have to enter one piece of metadata. They are now down to two – everything else is added automatically by the solution – but the goal is still to get down to just one."
http://www.arkivrad.no/utdanning.html?newsid=10677
Highly flexible solution
• The ERP integration was replaced twice
– without affecting other components
• The CRM system was replaced while in production
– all customers got new IDs
– Sesam hid this from users, making the transition 100% transparent
Status now
• In production since autumn 2011
• Used by
– Hafslund Fjernvarme
– the Hafslund support centre
Other implementations
• A very similar system has been implemented for Statkraft
• A system based on the same architecture has been implemented for DSS
– basically an intranet system
DSS Intranet
DSS project background
• DSS delivers IT services and platform to the ministries
– archive, intranet, email, ...
• The intranet is currently based on EPiServer
– in addition, they want to use Sharepoint
– and of course there are many other related systems...
• How to integrate all this?
– integrations must be loose, as systems in 13 ministries come and go
Metadata structure
[Diagram: metadata model connecting Person, Topic/work area/archive key, Project/case, Document/content, and Organizational unit (section, department, ministry, agency) via relations such as "is responsible for", "works on", "has a role in", "wrote", "belongs to", "is about", "is an expert on", "is involved in", and "is relevant for"; documents also carry file type, document type, case/process type, date, case number, title, phone, address, and status]
Project scope
[Diagram: user information and access groups flow from IDM to EPiServer (intranet) and Sharepoint]
How we built it
The web part
• Displays data given
– a query
– an XSLT stylesheet
• Can run anywhere
– standard protocol to the database
• Easy to include in other systems, too
[Diagram: web parts in EPiServer and Sharepoint query the Virtuoso RDF DB over SPARQL; Novell IDM, Active Directory, and Regjeringen.no feed the database via SDshare]
Data flow (simplified)
[Diagram: Novell IDM supplies employees and org structure; Sharepoint supplies org structure, workspaces, documents, and lists; EPiServer supplies categories and page structure — all flowing into the Virtuoso RDF DB]
Independence of source
• Solutions need to be independent of the data source
– when we extract data, or
– when we run queries against the data hub
• Independence isolates clients from changes to the sources
– allows us to replace sources completely, or
– change the source models
• Of course, total independence is impossible
– if the structures are too different it won't work
– but surprisingly often they're not
Independence of source #1
[Diagram: a core model where core:Person and core:Project are connected by core:participant; idm:Person and idm:Project (related by idm:has-member) and sp:Person and sp:Project (related by sp:member-of) are subtyped from the core model]
Independence of source #2
• "Facebook streams" require queries across sources
– every source has its own model
• We can reduce the differences with annotation
– i.e. we describe the data with RDF
• Allows data models to
– change, or
– be plugged in
• with no changes to the query

sp:doc-created a dss:StreamEvent;
  rdfs:label "opprettet";
  dss:time-property sp:Created;
  dss:user-property sp:Author.

sp:doc-updated a dss:StreamEvent;
  rdfs:label "endret";
  dss:time-property sp:Modified;
  dss:user-property sp:Editor.

idm:ny-ansatt a dss:StreamEvent;
  rdfs:label "ble ansatt i";
  dss:time-property idm:ftAnsattFra .
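A generic stream builder then only reads the annotations to find out which property holds the timestamp and which holds the user for each event type. A Python sketch of the idea, with the annotations as plain data and an invented resource shape:

```python
# The annotations above, as plain data: event type -> which properties to read.
ANNOTATIONS = {
    "sp:doc-created": {"label": "opprettet", "time": "sp:Created", "user": "sp:Author"},
    "sp:doc-updated": {"label": "endret", "time": "sp:Modified", "user": "sp:Editor"},
}

def stream_events(resources):
    """Build a cross-source event stream using only the annotations,
    so new models can be plugged in without changing this code."""
    events = []
    for res in resources:
        ann = ANNOTATIONS.get(res.get("event-type"))
        if ann:
            events.append((res[ann["time"]], res.get(ann["user"]), ann["label"]))
    return sorted(events)

events = stream_events([
    {"event-type": "sp:doc-updated", "sp:Modified": "2012-09-06T08:22", "sp:Editor": "lmg"},
    {"event-type": "sp:doc-created", "sp:Created": "2012-09-05T10:00", "sp:Author": "axel"},
])
```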
Conclusion
Execution
• The customer asked for a start on August 15 and delivery on January 1
– sources: EPiServer, Sharepoint, IDM
• We offered a fixed price, done November 1
• Delivered October 20
– also integrated with Active Directory
– added Regjeringen.no for extra value
• Integration with ACOS Websak
– has been analyzed and described
– but not implemented
The data sources
Source           | Connector
-----------------|-----------
Active Directory | LDAP
IDM              | LDAP
Intranet         | EPiServer
Regjeringen.no   | EPiServer
Sharepoint       | Sharepoint
Conclusion
• We could do this so quickly because
– the architecture is right
– we have components we can reuse
– the architecture allows us to plug the components together in different settings
• Once we have the data, the rest is usually simple
– it's getting the data that's hard