The MetaLex Document Server
Legal Documents as Versioned Linked Data
Rinke Hoekstra
Universiteit van Amsterdam & VU University Amsterdam
Monday, November 14, 2011
Um... why is this interesting?
Regulation A (01-01-2011)
Art 12 (04-02-2011)
Art 14, lid 3, 2e volzin (11-06-2008)
Art 14, lid 3, 2e volzin (01-07-2011)
• Open Data: current public service falls short
• Hidden agenda:
• Large scale validation of CEN MetaLex
• Linked Open Government Data HOT!
[Figure: Linking Open Data cloud diagram, as of September 2011. Legend categories: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content]
CEN MetaLex
• CEN Workshop Agreement
• Interchange format
• Highly generic XML elements (hcontainer, block, inline)
• “Content models” signal content (e.g. chapter, article, sentence)
• Schema extension
• Metadata as RDFa
• Naming convention
“Open XML Interchange Format for Legal and Legislative Resources”
http://www.metalex.eu
Public content services hosted at wetten.nl
Current Situation
Wetten.nl XML Service
• Only available format is BWB XML
• Only current version
• Content at document level
• Identification at document level
• Identifiers are not dereferenceable
• Hardly any metadata (e.g. version date)
• Only available context is position in text
http://wetten.overheid.nl/xml.php?regelingID=...
BWBId Web Service
http://wetten.overheid.nl/BWBIdService/BWBIdList.xml.zip
NB: The problem with the XML processing instruction was reported and fixed, but returned some weeks later
Identifiers & Juriconnect
• Existing identification standard: Juriconnect
• URN-like... but no naming server (cf. Digital Object Identifiers)
• Named elements do not carry identifier
• No explicit version information, only contextual
1.0:c:BWBR0005416&artikel=6
vs
http://wetten.overheid.nl/cgi-bin/deeplink/law1/bwbid=BWBR0005416/article=6/date=2005-01-14
vs
http://wetten.overheid.nl/BWBR0005416/TitelII698946/HoofdstukII/Artikel16/geldigheidsdatum_14-01-2005
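To make the contrast concrete, here is a minimal Python sketch that splits a Juriconnect-style reference such as `1.0:c:BWBR0005416&artikel=6` into its parts. The field names (`version`, `kind`, `bwb_id`, `params`) are labels chosen here for illustration and are not taken from the Juriconnect standard. Note that the parsed result carries no version date unless the reference happens to supply one.

```python
def parse_juriconnect(ref: str) -> dict:
    """Split a reference like '1.0:c:BWBR0005416&artikel=6' into parts.

    Field names are illustrative assumptions, not Juriconnect terminology.
    """
    head, _, query = ref.partition("&")
    version, kind, bwb_id = head.split(":")
    params = {}
    # Each remaining '&'-separated chunk is treated as a key=value pair.
    for pair in filter(None, query.split("&")):
        key, _, value = pair.partition("=")
        params[key] = value
    return {"version": version, "kind": kind, "bwb_id": bwb_id, "params": params}


parsed = parse_juriconnect("1.0:c:BWBR0005416&artikel=6")
```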
Step 1
Requirements
Goals
• “Deserialize” regulation content (e.g. topic-based browsing)
• Extract and reconstruct implicit information (identifiers, metadata)
• Annotate regulations (reconstructed metadata, third-party metadata)
• Annotate using regulations (knowledge-based systems, services, business processes ...)
• Accessible and reusable for any other party (shared vocabularies, standard access)
Requirements
• Unique, persistent identification (URL-like URIs)
• Generic XML structure of documents (CEN MetaLex XML documents)
• Extensible metadata framework (Linked Open Data)
• Flexible web services (Transparent REST services, Cool URIs)
Available Sources...
• List of all regulations in “XML”
• Wetten.nl XML Service
• Metadata in HTML table on wetten.nl (the “info page”)
• ... so let’s get started already
Come up with persistent identifiers at element level and a solid versioning scheme
Step 2
Levels of Identification
• IFLA FRBR levels
• Work
• Expression
• Manifestation
[Figure: FRBR bibliographic entities]
Work: Regulation
Expression: Version of regulation (realizes the Work)
Manifestation: XML version of regulation (embodies the Expression)
Item: XML version of regulation on my harddisk (exemplifies the Manifestation)
Transparent Identifiers
• Hierarchical information (work)
• Version and language (expression)
• Format information (manifestation)
http://doc.metalex.eu/id/BWBR0011823/hoofdstuk/1/artikel/1
http://doc.metalex.eu/id/BWBR0011823/hoofdstuk/1/artikel/1/nl/2010-09-01
http://doc.metalex.eu/doc/BWBR0011823/hoofdstuk/1/artikel/1/nl/2010-09-01/data.xml
http://doc.metalex.eu/id/BWBR0011823/artikel/1
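As an illustration of how the three FRBR levels map onto the example URIs, the following Python sketch builds work, expression, and manifestation URIs from their components. The function names and the `path` tuple are hypothetical; only the URI patterns themselves are taken from the examples on this slide, and the actual MDS implementation may construct them differently.

```python
BASE = "http://doc.metalex.eu"


def _join(prefix, bwb_id, path):
    # prefix is "id" for works and expressions, "doc" for manifestations.
    return "/".join([BASE, prefix, bwb_id, *(str(p) for p in path)])


def work_uri(bwb_id, path=()):
    """Hierarchical information only, e.g. hoofdstuk/1/artikel/1."""
    return _join("id", bwb_id, path)


def expression_uri(bwb_id, path, lang, date):
    """A work URI extended with language and version date."""
    return f"{work_uri(bwb_id, path)}/{lang}/{date}"


def manifestation_uri(bwb_id, path, lang, date, filename="data.xml"):
    """The expression path under /doc/, extended with a file name."""
    return f"{_join('doc', bwb_id, path)}/{lang}/{date}/{filename}"


path = ("hoofdstuk", 1, "artikel", 1)
work = work_uri("BWBR0011823", path)
expression = expression_uri("BWBR0011823", path, "nl", "2010-09-01")
manifestation = manifestation_uri("BWBR0011823", path, "nl", "2010-09-01")
```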
Problem
• URIs don’t carry semantics...
• Detect changes:
• which element versions are the same
• ... and which versions are different?
Art. 44, lid 4 (2011-03-26)
Art. 44, lid 4 (2011-04-05)
from: Besluit prudentiële regels Wft, BWBR0020420
Opaque Identifiers
• Content information
• Unique SHA1 hash of text
http://doc.metalex.eu/BWBR0011823/hoofdstuk/1/artikel/34b0cee26ee5138c74aa2c62caf2c117d3c616e9
[Figure: sentence versions at three points in time, each sentence text labelled with an abbreviated hash]
AE6: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
B6C: Aliquam accumsan ullamcorper dolor, aliquam consequat lorem elementum a.
3F5: Suspendisse ornare mi non lectus bibendum gravida.
t1: Sentence 1, Sentence 2
t2: Sentence 1, Sentence 2
t3: Sentence 1, Sentence 2, Sentence 3
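The idea of a content-based identifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace normalisation step and the function name `opaque_uri` are choices made here, not details taken from the slides. Two versions of an element whose text is identical receive the same opaque URI, while any change in the text produces a different one.

```python
import hashlib


def opaque_uri(parent_uri: str, text: str) -> str:
    """Append the SHA1 hash of an element's text to its parent URI."""
    # Assumption: collapse whitespace so that layout changes alone
    # do not count as a new version.
    normalized = " ".join(text.split())
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return f"{parent_uri}/{digest}"


base = "http://doc.metalex.eu/BWBR0011823/hoofdstuk/1/artikel"
same_a = opaque_uri(base, "Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
same_b = opaque_uri(base, "Lorem  ipsum dolor sit amet,\nconsectetur adipiscing elit.")
changed = opaque_uri(base, "Suspendisse ornare mi non lectus bibendum gravida.")
```

Comparing `same_a` and `same_b` for equality is then enough to decide whether an element changed between two versions of a regulation.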
Generic conversion of BWB XML to a generic XML format (CEN MetaLex) and appropriate metadata
Step 3
Procedure
For each BWB XML file listed, if an update has occurred since the latest run, download the latest version, scrape the metadata, and produce:
• Persistent URIs
• CEN MetaLex + Citations
• Inline RDFa (optional) or RDF graph (optional), Pajek “.net” files (optional)
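The first step of this procedure, deciding which regulations need to be converted again, might look like the Python sketch below. The `listing` dictionary (BWB identifier mapped to last modification date) is a hypothetical stand-in for whatever the regulation list actually provides, and the dates are made up for the example; downloading, scraping, and conversion are left out.

```python
from datetime import date


def needs_update(listing: dict, last_run: date) -> list:
    """Return the BWB ids modified after the previous conversion run."""
    return sorted(
        bwb_id for bwb_id, modified in listing.items() if modified > last_run
    )


# Hypothetical modification dates, for illustration only.
listing = {
    "BWBR0011823": date(2010, 9, 1),
    "BWBR0020420": date(2011, 4, 5),
    "BWBR0005416": date(2005, 1, 14),
}
to_convert = needs_update(listing, last_run=date(2011, 1, 1))
```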
BWB to CEN MetaLex?
• Straightforward 1:1 mapping
• ... some minor fixes
Table 1. Conversion performance for 300 randomly selected regulations.
Substitutions   Number   %       Corrections   Number   %
container       22312    29 %    artikel       2525     72 %
hcontainer      3730     5 %     divisie       519      15 %
htitle          3730     5 %     colspec       289      8 %
block           34325    44 %    illustratie   54       2 %
inline          13527    17 %    others        99       3 %
Total           77624            Total         3486
Total no. of regulations: 300
Revoked regulations: 109 (30 %)
Correction %: 4 %
Lastly, the MDS offers a simple search interface for finding regulations based on the title and version date.
6 Conclusion and Results
We ran the MetaLex conversion script on all regulations available through the wetten.nl portal, resulting in a total of 27,687 versions of regulations being converted, roughly 1 GB in size for BWB XML, and 2.27 GB as MetaLex [40]. The size increase is primarily due to the length and number of identifiers in MetaLex, which aren’t present in BWB XML. The generated Turtle files comprise 9.9 GB, and contain 87.9 million triples. At this moment, the MDS runs on a lightweight 8GB Ubuntu box with a 4Store triple store [41]. This is hardly a sustainable situation as new and modified regulations are published almost every week, which means that the number of triples will accumulate with time.
We evaluated the ability of MetaLex to map onto the BWB XML by running the converter on 300 randomly selected BWB identifiers; results are presented in Table 1. The artikel element accounts for 72% of all corrections, and corresponds to 68% of all htitle substitutions (5% of total). This means that only a very small part of BWB XML does not directly fit onto the MetaLex schema. We have conducted a similar exercise on a single example of a CHLexML document and results were comparable; the main cause for incompatibility is the restriction in MetaLex that hcontainer elements are not allowed to contain block elements.
6.1 Meeting Requirements
Publishing identifiers and metadata of regulations in RDF meets the minimal requirements for facilitating regulatory compliance and annotation. Third parties can freely and transparently annotate regulations with specialised vocabularies and business rules. Our versioning scheme allows these annotations to be fine-grained and stable through time. For instance, annotating an opaque expression
[40] The actual number of regulations available at a single time is typically a bit lower. The conversion was done in several batches, and several modified regulations were published in the meantime.
[41] See http://4store.org.
Events & Provenance
[Figure: RDF graph describing the provenance of one regulation version]
Nodes:
• http://doc.metalex.eu/id/BWBR0017869/2009-10-23: the expression (version) URI of a regulation
• http://doc.metalex.eu/id/process/BWBR0017869/2009-10-23: the process that generated the expression
• http://doc.metalex.eu/id/event/BWBR0017869/2009-10-23: the creation event of the regulation
• http://doc.metalex.eu/id/date/2009-10-23: the date at which the expression was created, with value "2009-10-23"^^xsd:date
Classes shown (via rdf:type): ml:BibliographicExpression, opmv:Artifact, opmv:Process, ml:LegislativeModification, sem:Event, ml:Date, time:Instant, sem:Time
Properties shown: opmv:wasGeneratedBy, opmv:wasGeneratedAt, ml:resultOf, ml:date, rdf:value, time:inXSDDateTime, time:hasEnd, sem:eventType, sem:hasTime, sem:timeType, sem:hasTimeStamp
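The four URIs in this graph follow a regular pattern, which the Python sketch below reproduces for a given BWB identifier and version date. The function names are hypothetical. The three triples are written with prefixed property names rather than full namespace URIs, and the direction of `ml:resultOf` and `ml:date` is an assumption read off the figure labels rather than something the slides state explicitly.

```python
BASE = "http://doc.metalex.eu/id"


def provenance_uris(bwb_id: str, version_date: str) -> dict:
    """Derive the expression, process, event and date URIs for a version."""
    return {
        "expression": f"{BASE}/{bwb_id}/{version_date}",
        "process": f"{BASE}/process/{bwb_id}/{version_date}",
        "event": f"{BASE}/event/{bwb_id}/{version_date}",
        "date": f"{BASE}/date/{version_date}",
    }


def provenance_triples(bwb_id: str, version_date: str) -> list:
    """Return (subject, property, object) tuples linking the four nodes."""
    u = provenance_uris(bwb_id, version_date)
    return [
        # An OPMV artifact is generated by a process.
        (u["expression"], "opmv:wasGeneratedBy", u["process"]),
        # Assumed direction: the expression is the result of the event.
        (u["expression"], "ml:resultOf", u["event"]),
        # Assumed direction: the event points at its date node.
        (u["event"], "ml:date", u["date"]),
    ]


uris = provenance_uris("BWBR0017869", "2009-10-23")
triples = provenance_triples("BWBR0017869", "2009-10-23")
```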
Dublin Core
OPMV
SEM
W3C Time
MetaLex
FOAF
Publish: The MetaLex Document Server (MDS)
Step 4
• RESTful API
• Cool URIs (Dereference to XML, RDF, .net)
• Shorthands (‘/latest’)
• SPARQL endpoint
• Citation graphs
• Whoosh-based search
• CSS Stylesheet for CEN MetaLex
119,935,096
[Figure: content negotiation for an expression URI]
Request: http://doc.metalex.eu/id/BWBR0011823/nl/2010-09-01
Accept: text/xml
1. Client requests URI
2. Server redirects to manifestation URI (HTTP 303): http://doc.metalex.eu/doc/BWBR0011823/nl/2010-09-01/data.xml
3. Server queries file store for XML manifestation (glob)
4. If no manifestation exists, extract from parent
5. Triplestore returns URI of manifestation
6. MDS redirects to manifestation URI (HTTP 302)
Location of manifestation: http://doc.metalex.eu/files/BWBR0011823_2010-03-01_mls.xml
Supported content types:
• RDF syntaxes (SCBD): application/rdf+xml, text/rdf+n3, application/x-turtle
• XML documents: text/xml
• HTML clients: application/xml, application/xhtml+xml, text/html
• Pajek clients: text/plain
• Download .net file
• View using Gephi Toolkit http://gephi.org
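The mapping from media types to representations listed above can be sketched as a small content negotiation function in Python. This is an illustration only: the representation labels (`rdf`, `xml`, `html`, `pajek`), the `default` fallback, and the q-value handling are assumptions made here, and the MDS itself may resolve the Accept header differently.

```python
# Media types as listed on the slide; the labels on the right are hypothetical.
REPRESENTATIONS = {
    "application/rdf+xml": "rdf",
    "text/rdf+n3": "rdf",
    "application/x-turtle": "rdf",
    "text/xml": "xml",
    "application/xml": "html",
    "application/xhtml+xml": "html",
    "text/html": "html",
    "text/plain": "pajek",
}


def negotiate(accept_header: str, default: str = "html") -> str:
    """Pick a representation for the highest-ranked supported media type."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        media_type, *params = (part.strip() for part in item.split(";"))
        quality = 1.0
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        key = media_type.lower()
        if key in REPRESENTATIONS and quality > 0:
            # Sort by descending quality, then by order of appearance.
            candidates.append((-quality, position, REPRESENTATIONS[key]))
    return min(candidates)[2] if candidates else default


xml_choice = negotiate("text/xml")
browser_choice = negotiate(
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
)
rdf_choice = negotiate("text/plain;q=0.5, application/x-turtle")
```

A server would then issue the HTTP 303 redirect to the manifestation URI that corresponds to the chosen representation.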
Technical Details
• Current situation
• +/- 29 thousand incremental regulation versions
• 119.3 million triples (legislation.gov.uk: 1.9 billion)
• Updated daily
• Technical details
• Dell PowerEdge II T110, 32GB RAM
• Garlik 4Store triplestore (http://4store.org)
• Python Django web applications
• Tomcat servlet + Gephi Toolkit API
• See http://doc.metalex.eu
Use it
Step 5
In Degree
Betweenness Centrality
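For readers who want to reproduce the in-degree measure without Gephi, here is a minimal Python sketch that counts incoming arcs in a Pajek ".net" file. The sample network is made up for illustration (the citation links between these BWB identifiers are not real data), and the parser is deliberately simplified: it assumes one quoted label per vertex line with no coordinates, and it only reads the `*Arcs` section.

```python
from collections import Counter


def in_degrees(net_text: str) -> dict:
    """Count incoming arcs per vertex label in simplified Pajek .net text."""
    labels = {}
    counts = Counter()
    section = None
    for raw in net_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            # Section headers look like '*Vertices 3' or '*Arcs'.
            section = line.split()[0].lower()
            continue
        if section == "*vertices":
            number, _, label = line.partition(" ")
            labels[number] = label.strip().strip('"')
        elif section == "*arcs":
            _source, target = line.split()[:2]
            counts[target] += 1
    return {labels.get(number, number): n for number, n in counts.items()}


# Hypothetical citation network, for illustration only.
SAMPLE = """*Vertices 3
1 "BWBR0011823"
2 "BWBR0020420"
3 "BWBR0005416"
*Arcs
1 2
3 2
2 1
"""
degrees = in_degrees(SAMPLE)
```

Betweenness centrality needs shortest-path computations and is better left to a graph library or to Gephi itself.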
[Figure: annotation graph linking regulation text to concepts across four layers: MetaLex Document Server, MetaLex Annotator, Cornetto Wordnet, Princeton Wordnet]
Articles: SW Hoofdstuk I, Artikel 10; SW Hoofdstuk I, Artikel 13; SW Hoofdstuk III, Artikel 32
Terms: vermogen van de erflater, vermogen erflater, aanspraken, aard, algemeen belang, [...], vermogen (geld), vermogen (het kunnen), weten, capaciteit, erflater, legator, testator, [...]
Properties shown: dcterms:subject, skos:closeMatch, skos:relatedMatch, skos:broader, ma:cooccursWith
>500k triples for Inheritance Tax law
Discussion
• Linked Open Data is not just for citizens
• Content service of Dutch gov. falls short
• Successful transformation to CEN MetaLex
• Successful transformation to RDF
• Dual versioning at element level
• Extend to other (international) XML formats
• Empirical study of network analysis
http://doc.metalex.eu
http://github.com/RinkeHoekstra
Fin