SESAM
Lars Marius Garshol, Baksia, [email protected], http://twitter.com/larsga
Agenda
• Project background
• Functional overview
• Under the hood
• Similar projects (if time)
Lars Marius Garshol
• Consultant at Bouvet since 2007
– focus on information architecture and semantics
• Worked with semantic technologies since 1999
– mostly with Topic Maps
– co-founder of Ontopia, later CTO
– editor of several Topic Maps ISO standards 2001-
– co-chair of the TMRA conference 2006-2011
– developed several key Topic Maps technologies
– consultant on a number of Topic Maps projects
• Published a book on XML with Prentice-Hall
• Implemented Unicode support in the Opera web browser
My role on the project
• The overall architecture is the brainchild of Axel Borge
• SDshare came from an idea by Graham Moore
• I only contributed parts of the design
– and some parts of the implementation
• I don't actually know the whole system
Hafslund SESAM
Hafslund ASA
• Norwegian energy company
– founded 1898
– 53% owned by the city of Oslo
– responsible for the energy grid around Oslo
– 1.4 million customers
• A conglomerate of companies
– Nett (electricity grid)
– Fjernvarme (district heating)
– Produksjon (power generation)
– Venture
– ...
What if...?
[Diagram: customer, meter, cables, transformer, work order, and bill, with the data spread across ERP and CRM]
Hafslund SESAM
• An archive system, really
• Generally, archive systems are glorified trash cans
– putting it in the archive effectively means hiding it
• Because archives are not important, are they?
• Except when you need that contract from 1937 about the right to build a power line across...
Problems with archives
• Many documents aren't there
– even though they should be,
– because entering metadata is too much hassle
• Poor metadata, because nobody bothers to enter it properly
– yet much of the metadata exists in the user's context
• Not used by anybody
– strange, separate system with a poor interface
– (and the metadata is poor, too)
• Contains only documents
– not connected to anything else
Goals for the SESAM project
• Increase the percentage of archived documents
– to do this, archiving must be made easy
• Increase metadata quality
– by automatically harvesting metadata
• Make it easy to find archived documents
– by building a well-tuned search application
What SESAM does
• Automatically enrich document metadata
– to do that we have to collect background data from the source systems
• Connect document metadata with structured business data
– this is a side-effect of the enrichment
• Provide search across the whole information space
– once we've collected the data this part is easy
As seen by customer
High-level architecture
[Diagram: ERP, CRM, Sharepoint, and the archive feed a triple store via SDshare; the search engine is fed from the triple store; documents go to the archive via CMIS]
Main principle of data extraction
• No canonical model!
– instead, the data reflects the model of the source system
• One ontology per source system
– subtyped from a core ontology where possible
• Vastly simplifies data extraction
– for search purposes it loses us nothing
– and translation is easier once the data is in the triple store
Simplified core ontology
Data structure in triple store
[Diagram: separate graphs for Sharepoint, CRM, Archive, and ERP data, connected by sameAs links]
Connecting data
• Contacts in many different systems
– ERP has one set of contacts,
– these are mirrored in the archive
• Different IDs in these two systems
– using "sameAs" we know which are the same
– can do queries across the systems
– can also translate IDs from one system to the other
[Diagram: an ERP contact and an archive contact linked by sameAs]
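The ID translation the sameAs links enable is mechanical; a minimal Python sketch, where the URIs and link pairs are invented for illustration:

```python
# Hypothetical sameAs links between ERP and archive contact URIs.
SAME_AS = [
    ("http://example.com/erp/contact/1001", "http://example.com/archive/contact/A-17"),
    ("http://example.com/erp/contact/1002", "http://example.com/archive/contact/A-42"),
]

# Build lookup tables in both directions, so translation is symmetric.
erp_to_archive = dict(SAME_AS)
archive_to_erp = {a: e for e, a in SAME_AS}

def translate(uri):
    """Translate a contact URI to its sameAs twin in the other system."""
    return erp_to_archive.get(uri) or archive_to_erp.get(uri)
```

In the real system the links live in the triple store, so the same translation can be done with a query over owl:sameAs.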
Duplicate suppression
[Diagram: customer and company records from CRM and Billing flow through Duke into the data hub]
Field     | Record 1  | Record 2    | Probability
----------|-----------|-------------|------------
Name      | acme inc  | acme inc    | 0.9
Assoc no  | 177477707 |             | 0.5
Zip code  | 9161      | 9161        | 0.6
Country   | norway    | norway     | 0.51
Address 1 | mb 113    | mailbox 113 | 0.49
Address 2 |           |             | 0.5
http://code.google.com/p/duke/
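Duke combines the per-field probabilities Bayesian-style into one overall match probability; a sketch of that combination using the numbers from the table above (the combination formula follows Duke's documented naive-Bayes approach; note that a missing field contributes a neutral 0.5):

```python
def combine(probs):
    """Naive-Bayes combination of per-field match probabilities:
    p = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    num, den = 1.0, 1.0
    for p in probs:
        num *= p
        den *= (1.0 - p)
    return num / (num + den)

# Per-field probabilities from the comparison table above.
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
overall = combine(fields)  # well above 0.9, so likely the same company
```

If the overall probability crosses a configured threshold, the pair is declared a match and an owl:sameAs link is produced.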
[Diagram: confirmed matches are written back as owl:sameAs links and flow onward via SDshare, e.g. for ERP suppliers]
When archiving
• The user works on the document in some system
– ERP, CRM, whatever
• This system knows the context
– what user, project, equipment, etc. is involved
• This information is passed to the CMIS server
– it uses information already gathered in the triple store to attach more metadata
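The enrichment step amounts to a lookup: the client supplies the context it knows, and the server pulls in related metadata already harvested into the triple store. A minimal sketch, where all identifiers and the data shape are invented for illustration:

```python
# Hypothetical pre-harvested background data: work order -> related metadata.
TRIPLE_STORE = {
    "workorder:34121": {
        "project": "project:P-88",
        "customer": "customer:C-555",
        "equipment": ["equipment:E-1", "equipment:E-2"],
    },
}

def enrich(client_metadata):
    """Attach extra metadata based on the context the client already knows."""
    enriched = dict(client_metadata)
    context = TRIPLE_STORE.get(client_metadata.get("workorder"), {})
    enriched.update(context)
    return enriched

doc = enrich({"title": "Inspection report", "workorder": "workorder:34121"})
```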
Auto-tagging
[Diagram: a document sent to the archive from a work order is auto-tagged with the related project, manager, customer, and equipment]
Archive integration
• Clients deliver documents via CMIS
– an OASIS standard for CMS interfaces
– lots of available implementations
• Metadata translation
– not only auto-tagging
– client vocabulary translated to archive vocabulary
– required static metadata added automatically
• Benefits
– clients can reuse existing software for connecting
– the interface is independent of the actual archive
– the archive integration is reusable
[Diagram: clients talk CMIS to a gateway in front of the archive]
Archive integration elsewhere
• Generally, archive integrations are hard
– web site integration: 400 hours
– integration #2: performance problems
• They also reproduce the same functionality over and over again
• Hard bindings against
– the internal archive model
– the archive interface
• Switching archive systems will be very hard...
– even upgrading is hard
[Diagram: the regulation system, system #1, Outlook, system #2, and the web site each bind directly to the archive]
Showing context in the ERP system
Access control
• Users only see objects they're allowed to see
• Implemented by the search engine
– all objects have lists of users/groups allowed to see them
– on login, a SPARQL query lists the user's access control group memberships
– the search engine uses this to filter search results
• In some cases, complex access rules are run to resolve ACLs before loading into the triple store
– e.g. the archive system
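The filtering itself is a set intersection per object; a sketch of the idea in Python (object IDs, group names, and the ACL layout are invented; in SESAM the group memberships come from a SPARQL query at login and the filtering is done inside the search engine):

```python
# Hypothetical ACLs: each object lists the groups allowed to see it.
OBJECT_ACLS = {
    "doc:1": {"grid-engineers"},
    "doc:2": {"hr"},
    "doc:3": {"grid-engineers", "hr"},
}

def filter_results(results, user_groups):
    """Keep only objects whose ACL intersects the user's group memberships."""
    return [r for r in results if OBJECT_ACLS.get(r, set()) & user_groups]
```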
Data volumes

Graph           | Statements
----------------|------------
IFS data        |   5,417,260
Public 360 data |   3,725,963
GeoNIS data     |      44,242
Tieto CAB data  | 138,521,810
Hummingbird 1   |  32,619,140
Hummingbird 2   | 165,671,179
Hummingbird 3   | 192,930,188
Hummingbird 4   |  48,623,178
Address data    |   2,415,315
Siebel data     |  36,117,786
Duke links      |       4,858
Total           | 626,090,919
Under the hood
RDF
• The data hub is an RDF database
– RDF = Resource Description Framework
– a W3C standard for data
– also an interchange format and a query language (SPARQL)
• Many implementations of the standards
– lots of alternatives to choose between
• The technology is relatively well known
– books, specifications, courses, conferences, ...
How RDF works

'PERSON' table:

ID | NAME                | EMAIL
---|---------------------|-------------------
1  | Stian Danenbarger   | stian.danenbarger@
2  | Lars Marius Garshol |
3  | Axel Borge          | axel.borge@bouvet

RDF-ized data:

SUBJECT                     | PROPERTY | OBJECT
----------------------------|----------|--------------------
http://example.com/person/1 | rdf:type | ex:Person
http://example.com/person/1 | ex:name  | Stian Danenbarger
http://example.com/person/1 | ex:email | stian.danenbarger@
http://example.com/person/2 | rdf:type | ex:Person
http://example.com/person/2 | ex:name  | Lars Marius Garshol
...                         | ...      | ...
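The RDF-izing shown above is mechanical; a sketch of the conversion (the base URI, namespace prefixes, and column-to-property mapping are my own invented conventions):

```python
def row_to_triples(table, row, base="http://example.com"):
    """Turn one table row into RDF triples: the primary key becomes the
    subject URI, and each non-empty column becomes a property."""
    subject = f"{base}/{table.lower()}/{row['ID']}"
    triples = [(subject, "rdf:type", f"ex:{table.capitalize()}")]
    for column, value in row.items():
        if column != "ID" and value:
            triples.append((subject, f"ex:{column.lower()}", value))
    return triples

triples = row_to_triples("PERSON", {"ID": 1, "NAME": "Stian Danenbarger", "EMAIL": ""})
```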
RDF/XML
• The standard XML format for RDF
– not really very nice
• However, it's a generic format
– that is, it's the same format regardless of what data you're transmitting
• Therefore
– all data flows are in the same format
– absolutely no syntax transformations whatsoever
Plug-and-play
• Data source integrations are pluggable
– if you need another data source, connect and pull in the data
• The RDF database is schemaless
– no need to create tables and columns beforehand
• No common data model
– no need to transform data to store it
[Diagram: systems #1 and #2 plug into the data hub]
Product queue
• Since sources are pluggable, we can guide the project with a product queue
– consisting of new sources and data elements
• We analyse the items in the queue
– estimate complexity and amount of work
– also to see how each fits into what's already there
• The customer then decides what to do, and in what order
Don’t push!
• The general IT approach is to push data
– the source calls services provided by the recipient
• Leads to high complexity
– two-way dependency between systems
– have to extend the source system with extra logic
– extra triggers/threads in the source system
– many moving parts
Pull!
• We let the recipient pull
– always using the same protocol and the same format
• Wrap the source
– the wrapper must support 3 simple functions
• A different kind of solution
– one-way dependency
– often zero code
– the wrapper is thin and stateless
– data moving done by reused code
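The three functions the wrapper must support correspond to SDshare's snapshot, changed-fragments list, and per-fragment dump. A sketch of the interface, with a trivial in-memory source as an example (the class and method names are mine, not from the SDshare spec):

```python
from abc import ABC, abstractmethod

class SourceWrapper(ABC):
    """The thin, stateless wrapper around a source system."""

    @abstractmethod
    def snapshot(self):
        """Return every triple in the source, as one full dump."""

    @abstractmethod
    def changed_since(self, timestamp):
        """Return IDs of resources changed after the given time."""

    @abstractmethod
    def fragment(self, resource_id):
        """Return all triples describing a single resource."""

class DictWrapper(SourceWrapper):
    """Minimal in-memory example source."""
    def __init__(self, data, changes):
        self.data, self.changes = data, changes  # {id: triples}, [(time, id)]
    def snapshot(self):
        return [t for triples in self.data.values() for t in triples]
    def changed_since(self, timestamp):
        return [rid for t, rid in self.changes if t > timestamp]
    def fragment(self, resource_id):
        return self.data[resource_id]
```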
The data integration
• All data transport is done by SDshare
• A simple Atom-based specification for synchronizing RDF data
– http://www.sdshare.org
• Provides two main features
– a snapshot of the data
– fragments for each updated resource
Basics of SDshare
• The source offers
– a dump of the entire data set
– a list of fragments changed since time t
– a dump of each fragment
• Completely generic solution
– always the same protocol
– always the same data format (RDF/XML)
• A generic SDshare client then transfers the data
– to the recipient, whatever it is
SDshare service structure
Typical usage of SDshare
• The client downloads the snapshot
– the client now has the complete data set
• The client polls the fragment feed
– each time asking for new fragments since the last check
– the client keeps track of the time of the last check
– fragments are applied to the data, keeping it in sync
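The polling loop can be sketched as follows; the feed is faked with an in-memory list of (timestamp, resource) pairs, where a real client would do HTTP GETs against the Atom fragment feed:

```python
import datetime

# In-memory stand-in for the fragment feed.
FEED = [
    (datetime.datetime(2012, 9, 6, 8, 22, 23), "resource:34121"),
    (datetime.datetime(2012, 9, 6, 8, 22, 24), "resource:94857"),
]

class SDshareClient:
    def __init__(self):
        # The only state the client keeps: the time of the last check.
        self.last_check = datetime.datetime.min
        self.store = {}

    def poll(self):
        """Ask for fragments changed since the last check and apply them."""
        now = datetime.datetime.now()
        for updated, resource in FEED:
            if updated > self.last_check:
                self.apply_fragment(resource)
        self.last_check = now

    def apply_fragment(self, resource):
        # Replacing the resource's data wholesale keeps application idempotent.
        self.store[resource] = f"triples for {resource}"
```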
Implementing the fragment feed

select objid, objtype, change_time
from history_log
where change_time > :since:
order by change_time asc

<atom>
  <title>Fragments for ...</title>
  ...
  <entry>
    <title>Change to 34121</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:23</updated>
  </entry>
  <entry>
    <title>Change to 94857</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:24</updated>
  </entry>
...
The SDshare client
[Diagram: a web-service frontend drives the client core, which writes to the triple store through a SPARQL backend or a POST backend]
http://code.google.com/p/sdshare-client/
Getting data out of the triple store
• Set up SPARQL queries to extract the data
• The server does the rest
• Queries can be configured to produce
– any subset of the data
– data in any shape
[Diagram: the SDshare server runs SPARQL against the triple store and serves the results as RDF]
Contacts into the archive
• We want some resources in the triple store to be written into the archive as "contacts"
– need to select which resources to include
– must also transform from the source data model
• How do we achieve this without hard-wiring anything?
Contacts solution
• Create a generic archive object writer
– the type of the RDF resource specifies the type of object to create
– the name of the RDF property (within a namespace) specifies which property to set
• Set up an RDF mapping from the source data
– type1 maps-to type2
– prop1 maps-to prop2
– only mapped types/properties are included
• Use SPARQL to
– create the SDshare feed
– do the data translation with a CONSTRUCT query
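The mapping-driven translation can be sketched in a few lines; the real system does this with a SPARQL CONSTRUCT query, and the vocabulary names below are invented for illustration:

```python
# Hypothetical mapping from source vocabulary to archive vocabulary.
TYPE_MAP = {"erp:Supplier": "arkiv:Contact"}
PROP_MAP = {"erp:name": "arkiv:name", "erp:phone": "arkiv:phone"}

def translate(triples):
    """Translate types and properties; anything unmapped is excluded."""
    out = []
    for s, p, o in triples:
        if p == "rdf:type" and o in TYPE_MAP:
            out.append((s, p, TYPE_MAP[o]))
        elif p in PROP_MAP:
            out.append((s, PROP_MAP[p], o))
        # unmapped types/properties are silently dropped
    return out
```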
Properties of the system
• Uniform integration approach
– everything is done the same way
• Really simple integration
– setting up a data source is generally very easy
• Loose bindings
– components can easily be replaced
• Very little state
– most components are stateless (or have little state)
• Idempotent
– applying a fragment once or many times gives the same result
• Clear and reload
– we can delete everything and reload at any time
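Idempotency falls out of how a fragment is applied: first delete everything known about the resource, then insert the fragment's triples. A sketch with a list of triples standing in for the store:

```python
def apply_fragment(store, resource, fragment_triples):
    """Apply one fragment: remove all statements whose subject is the
    resource, then add the new triples. Running this once or N times
    leaves the store in exactly the same state."""
    store[:] = [t for t in store if t[0] != resource]
    store.extend(fragment_triples)
```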
Conclusion: SESAM
Project outcome
• Big, innovative project
– many developers over a long time period
– innovated a number of techniques and tools
• Delivered on time, and working
– despite an evolving project context
• Very satisfied customer
– they claim the project has already paid for itself in cost savings at the document center
Archive of the year 2012
"The organization has been innovative and strategic in its use of technology, both for the old paper archives and for the digital archive. They have a strong focus on increasing data capture and simplifying the use of metadata. Their goal has been that users should only have to enter one piece of metadata. They are now down to two – everything else is added automatically by the solution – but the goal is still to get down to just one."
http://www.arkivrad.no/utdanning.html?newsid=10677
Highly flexible solution
• The ERP integration was replaced twice
– without affecting other components
• The CRM system was replaced while in production
– all customers got new IDs
– Sesam hid this from users, making the transition 100% transparent
Status now
• In production since autumn 2011
• Used by
– Hafslund Fjernvarme
– the Hafslund support centre
Other implementations
• A very similar system has been implemented for Statkraft
• A system based on the same architecture has been implemented for DSS
– basically an intranet system
DSS Intranet
DSS project background
• DSS delivers IT services and platform to the ministries
– archive, intranet, email, ...
• The intranet is currently based on EPiServer
– in addition, they want to use Sharepoint
– and of course there are many other related systems...
• How to integrate all this?
– integrations must be loose, as systems in 13 ministries come and go
Metadata structure
[Diagram: metadata model connecting Person, Topic/work area/archive key, Project/case, Document/content, and Organizational unit (section, department, ministry, agency) via relations such as "is responsible for", "works on", "has a role in", "wrote", "belongs to", "is about", "is an expert on", "is involved in", and "is relevant for"; documents also carry file type, document type, case/process type, date, case number, title, phone, address, and status]
Project scope
[Diagram: user information and access groups flow from IDM to EPiServer (intranet) and Sharepoint]
How we built it
The web part
• Displays data given
– a query
– an XSLT stylesheet
• Can run anywhere
– standard protocol to the database
• Easy to include in other systems, too
[Diagram: web parts in EPiServer and Sharepoint query the Virtuoso RDF DB over SPARQL; Novell IDM, Active Directory, and Regjeringen.no feed the database via SDshare]
Data flow (simplified)
[Diagram: Novell IDM supplies employees and org structure; Sharepoint supplies org structure, workspaces, documents, and lists; EPiServer supplies categories and page structure — all flowing into the Virtuoso RDF DB]
Independence of source
• Solutions need to be independent of the data source
– when we extract data, or
– when we run queries against the data hub
• Independence isolates clients from changes to the sources
– allows us to replace sources completely, or
– change the source models
• Of course, total independence is impossible
– if the structures are too different it won't work
– but surprisingly often they're not
Independence of source #1
[Diagram: a core model where core:Person and core:Project are connected by core:participant; idm:Person and idm:Project (related by idm:has-member) and sp:Person and sp:Project (related by sp:member-of) are subtyped from the core model]
Independence of source #2
• "Facebook streams" require queries across sources
– every source has its own model
• We can reduce the differences with annotation
– i.e. we describe the data with RDF
• Allows data models to
– change, or
– be plugged in
• with no changes to the query

sp:doc-created a dss:StreamEvent;
  rdfs:label "opprettet";
  dss:time-property sp:Created;
  dss:user-property sp:Author.

sp:doc-updated a dss:StreamEvent;
  rdfs:label "endret";
  dss:time-property sp:Modified;
  dss:user-property sp:Editor.

idm:ny-ansatt a dss:StreamEvent;
  rdfs:label "ble ansatt i";
  dss:time-property idm:ftAnsattFra .
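A generic stream builder then only reads the annotations to find out which property holds the timestamp and which holds the user for each event type. A Python sketch of the idea, with the annotations as plain data and an invented resource shape:

```python
# The annotations above, as plain data: event type -> which properties to read.
ANNOTATIONS = {
    "sp:doc-created": {"label": "opprettet", "time": "sp:Created", "user": "sp:Author"},
    "sp:doc-updated": {"label": "endret", "time": "sp:Modified", "user": "sp:Editor"},
}

def stream_events(resources):
    """Build a cross-source event stream using only the annotations,
    so new models can be plugged in without changing this code."""
    events = []
    for res in resources:
        ann = ANNOTATIONS.get(res.get("event-type"))
        if ann:
            events.append((res[ann["time"]], res.get(ann["user"]), ann["label"]))
    return sorted(events)

events = stream_events([
    {"event-type": "sp:doc-updated", "sp:Modified": "2012-09-06T08:22", "sp:Editor": "lmg"},
    {"event-type": "sp:doc-created", "sp:Created": "2012-09-05T10:00", "sp:Author": "axel"},
])
```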
Conclusion
Execution
• The customer asked for a start on August 15 and delivery on January 1
– sources: EPiServer, Sharepoint, IDM
• We offered a fixed price, done November 1
• Delivered October 20
– also integrated with Active Directory
– added Regjeringen.no for extra value
• Integration with ACOS Websak
– has been analyzed and described
– but not implemented
The data sources
Source           | Connector
-----------------|-----------
Active Directory | LDAP
IDM              | LDAP
Intranet         | EPiServer
Regjeringen.no   | EPiServer
Sharepoint       | Sharepoint
Conclusion
• We could do this so quickly because
– the architecture is right
– we have components we can reuse
– the architecture allows us to plug the components together in different settings
• Once we have the data, the rest is usually simple
– it's getting the data that's hard