connected-data-london
Autodiscovery
or
The long tail of open data
Christopher Gutteridge
University of Southampton
& data.ac.uk
Bragsheet
Christopher Gutteridge - @cgutteridge
• Previously: Lead Developer of EPrints (Open access research repository software).
• “Linked Open Data Architect” for University of Southampton.
(or whatever we currently call doing LOD stuff for an organisation)
• Benevolent technical dictator of data.ac.uk (recently deposed)
• Webmaster WWW2006
• Assistant Webmaster WWW2007, WWW2009
Image Attributions:
• Backgrounds:
– http://www.fansshare.com/gallery/photos/14646865/abstract-background-brown-and-blue-circles/
– http://www.pptback.com/old-machine-gears-pptbackground.html
• Cliff leap pic: Justin De La Ornellas @ Flickr
• Train tracks: duncanh1 @ Flickr
• Lego bricks: rawdonfox @ Flickr
• Mechano Box: Lady alys @ Wikipedia
• Stickle Bricks: Simon Jobling @ Flickr
• Free Universal Construction Kit: F.A.T. Lab + Sy-Lab.
• Telescope: Brongaeh @ Flickr
• Pinata: Peasap @ Flickr
• Containers: l2f1 @ Flickr
Why don’t organisations
share data?
(and what stops them)
Us early adopters have shared data because it’s cool.
We were not 100% clear on the benefits, but it looked like fun and might gain us some reputation.
Fear. Uncertainty. Doubt.
Open Data Excuse Bingo
• Terrorists will use it
• We'll get spam
• It's too big
• It's not very interesting
• Thieves will use it
• I don't mind, but someone else might
• We will get too many enquiries
• Lawyers want a custom License
• There's no API
• Poor Quality
• There's already a project to...
• We might want to use it in a paper
• It's too complicated
• Data Protection
• People may misinterpret the data
• What if we want to sell it later
Don’t get depressed! Go here for antidotes: http://is.gd/odbingo
Menu
Burger ….. £3.50
Chips ….. £1.50
≠
Greater than the sum of
its parts
Interoperable datasets
allow results that are
greater than the sum
of the parts…
http://bus.southampton.ac.uk/
http://www.minecraftworldmap.com/worlds/xO3X4/full#/4469/64/-1806/-3/0/0
data.southampton.ac.uk
Discrete Facts
Statistics
What I want from data
• Where am I going?
• How can I get there?
• Where can I get a coffee en route?
Why aren’t they using
our data?
“If you build it, they will come.”
Probability of open dataset reuse =
Value of dataset to audience
× Potential audience size
× Ease of discovery
× Ease of grasping the value of the dataset
× Ease of exploiting dataset
× Perceived quality & reliability
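Because the factors are multiplied, one factor near zero drags the whole product down, however good the others are. A minimal sketch of that arithmetic follows. It assumes each factor is scored between 0 and 1; the slides define no scale, and the function name and sample scores are hypothetical.

```python
from math import prod

# The six factors named on the slide. Scoring each on a 0..1 scale is an
# assumption made for illustration; the slides do not define a scale.
FACTORS = (
    "value_to_audience",
    "potential_audience_size",
    "ease_of_discovery",
    "ease_of_grasping_value",
    "ease_of_exploiting",
    "perceived_quality_and_reliability",
)


def reuse_score(scores: dict[str, float]) -> float:
    """Multiply the six factor scores together (hypothetical helper)."""
    missing = [name for name in FACTORS if name not in scores]
    if missing:
        raise ValueError(f"missing factor scores: {missing}")
    for name in FACTORS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return prod(scores[name] for name in FACTORS)


# Made-up numbers: the same dataset, once nearly impossible to find and
# once easy to find.
hard_to_find = dict.fromkeys(FACTORS, 0.9) | {"ease_of_discovery": 0.05}
easy_to_find = dict.fromkeys(FACTORS, 0.9)
print(reuse_score(hard_to_find) < reuse_score(easy_to_find))
```

Under these assumed scores, improving discovery alone lifts the product by more than an order of magnitude.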
…Autodiscoverable
and interoperable data
can massively increase
the potential audience
$ ./generate-world Demo --postcode PO381NL --size 250
data.ac.uk
• Automatically discovers equipment data from all .ac.uk sites
– 2769 websites
– 42 providing data
– 11,028 records
• Automation massively reduces staffing costs
• Low effort for institutions
– A third just provide a well-structured spreadsheet!
• Not a single-point-of-failure
UK National Equipment Portal
http://equipment.data.ac.uk
UNIQUIP
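Column Heading | Required
Type | No
Name | At least one of Name and Description must be completed.
Description | At least one of Name and Description must be completed.
Related Facility ID | No
Technique (:cpv) or (:N8) | No
Location | No
Contact Name | No
Contact Telephone | At least one of the three Contact fields (Telephone, URL, Email) must be completed.
Contact URL | At least one of the three Contact fields (Telephone, URL, Email) must be completed.
Contact Email | At least one of the three Contact fields (Telephone, URL, Email) must be completed.
Secondary Contact Name | No
Secondary Contact Telephone | At least one of the three Secondary Contact fields must be completed with second contact name.
Secondary Contact URL | At least one of the three Secondary Contact fields must be completed with second contact name.
Secondary Contact Email | At least one of the three Secondary Contact fields must be completed with second contact name.
ID | No
Photo | No
Department | No
Site Location | Yes
Building | No
Service Level | No
Web Address | No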
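The UNIQUIP rules are simple enough to check by machine before a spreadsheet is published. The sketch below is one possible reading of the "Required" column, not an official validator. The column names come from the table; the function names and the sample row are hypothetical.

```python
# Column names are taken from the UNIQUIP table; the validation logic is an
# illustrative reading of its "Required" column, not an official validator.
EITHER_OF = ("Name", "Description")
CONTACT = ("Contact Telephone", "Contact URL", "Contact Email")
SECONDARY = (
    "Secondary Contact Telephone",
    "Secondary Contact URL",
    "Secondary Contact Email",
)


def filled(row: dict[str, str], column: str) -> bool:
    """True when the column is present and not blank."""
    return bool((row.get(column) or "").strip())


def validate_row(row: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    if not filled(row, "Site Location"):
        problems.append("Site Location is required")
    if not any(filled(row, c) for c in EITHER_OF):
        problems.append("need at least one of Name or Description")
    if not any(filled(row, c) for c in CONTACT):
        problems.append("need at least one of Contact Telephone, URL or Email")
    # Assumed reading: secondary contact details are only checked when a
    # Secondary Contact Name has been supplied.
    if filled(row, "Secondary Contact Name") and not any(
        filled(row, c) for c in SECONDARY
    ):
        problems.append("secondary contact named but no way to reach them")
    return problems


# Made-up sample row: it has a name and a site but no contact details.
sample = {"Name": "Mass spectrometer", "Site Location": "Main campus"}
print(validate_row(sample))
```

Returning a list of problems rather than raising on the first one lets an institution fix a whole spreadsheet in one pass.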
Doin’ it on the cheap
Ensuring a sustainable
service through
autodiscovery
Sustainability via Autodiscovery
• How do we add new datasets?
• How are changes made?
• How do we know the data is open data?
Sustainability via Autodiscovery
• Have a machine-readable document describing the institution and any open datasets (with licences)
• Place a link to it on the institution’s home page
/.well-known/openorg
http://www.soton.ac.uk/.well-known/openorg
or
<link rel="openorg" href="http://id.southampton.ac.uk/dataset/profile/latest">
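A harvester can look in both of those places. The sketch below uses only the Python standard library: it scans a home page for a `link` element with `rel="openorg"` and adds the `/.well-known/openorg` path on the same host as a fallback. The class and function names are hypothetical, and trying the link before the well-known path is an assumption; the slide only says "or".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OpenOrgLinkFinder(HTMLParser):
    """Collects href values of <link rel="openorg"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, so split before matching.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "openorg" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def candidate_profile_urls(homepage_url: str, homepage_html: str) -> list[str]:
    """List places to look for a profile document, explicit links first."""
    finder = OpenOrgLinkFinder()
    finder.feed(homepage_html)
    candidates = [urljoin(homepage_url, href) for href in finder.hrefs]
    # Fallback from the slide: the well-known path on the same host.
    candidates.append(urljoin(homepage_url, "/.well-known/openorg"))
    return candidates


page = (
    '<html><head><link rel="openorg" '
    'href="http://id.southampton.ac.uk/dataset/profile/latest"></head></html>'
)
for url in candidate_profile_urls("http://www.soton.ac.uk/", page):
    print(url)
```

Fetching and parsing the candidates is left out, so the sketch makes no network requests.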
What is an Organisation Profile Document?
An RDF document that describes the organisation:
– General information provided:
• Official name, postal address, contact phone number, the correct logo, physical location
– Links to the parts of the organisation:
• Admissions, Alumni, Freedom of Information, Complaints
– A semantic sitemap
• Key pages such as jobs, news, events…
– Links to the organisation’s discoverable open data sets and APIs
• The equipment dataset
Autodiscovery
• Dataset publicly available on website.
• Dataset has to be added manually along with all the institution’s details, contacts, etc.
Requires staff time (especially if any dataset changes location)
• Organisation has an OPD linking to dataset
• The OPD has to be added manually, but the dataset location and
institution info is consumed directly from the OPD.
Requires less staff time (as any changes made to OPD will get updated)
• Link to OPD from organisation’s home page
• OPD autodiscovered, so the dataset is automatically added to the
service.
Requires no staff time (as data is autodiscovered)
Never appeal to a man’s “better nature.” He may not have one.
Invoking his “self-interest” gives you more leverage.
- Robert Heinlein, “The Notebooks of Lazarus Long”
Status Report – Contributors and data statistics
Criterion | Bronze | Silver | Gold
Data is on the internet and in an acceptable format. | ✔ | ✔ | ✔
Description of dataset is provided by a remotely hosted OPD. | | ✔ | ✔
The OPD is discovered via autodiscovery. | | | ✔
The OPD/dataset has a recognised and supported open licence (e.g. CC0, ODCA or OGL). | | | ✔
All items in the dataset are assigned an ID code which is unique within the assigning organisation. | | | ✔
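The Bronze, Silver and Gold criteria can be read as a small decision rule. The sketch below assumes the levels are cumulative and that every criterion with a single tick belongs to Gold only. That is an interpretation of the slide layout, not a published specification, and the function and parameter names are hypothetical.

```python
from typing import Optional


def medal_level(
    *,
    acceptable_format: bool,
    remote_opd: bool = False,
    autodiscovered: bool = False,
    open_licence: bool = False,
    unique_ids: bool = False,
) -> Optional[str]:
    """Return "Bronze", "Silver", "Gold", or None if no level is reached.

    Assumption: levels are cumulative, and the single-tick criteria
    (autodiscovery, open licence, unique IDs) are all needed for Gold.
    """
    if not acceptable_format:
        return None
    if not remote_opd:
        return "Bronze"
    if autodiscovered and open_licence and unique_ids:
        return "Gold"
    return "Silver"


# A contributor with a remotely hosted OPD that is not yet autodiscovered.
print(medal_level(acceptable_format=True, remote_opd=True, open_licence=True))
```

Keyword-only arguments keep call sites readable, since five positional booleans are easy to mix up.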
Exploiting profile
documents
Exploiting profile documents
• We’ve barely begun
• Let’s try a live demo....
Warning:
Metaphor mixing detected
Needless heterogeneity means research doesn’t join up.
Aligning datasets every time costs too much.
Tools can’t be reused.
So what do we do about it?
Building easy-to-use tools to cross between formats, platforms and paradigms is very specialist work.
The solutions need to be discoverable.
Just putting it on GitHub does not make a tool discoverable!
https://github.com/cgutteridge/
Organisation Datasets
Well known formats available for:
• Events
• Publications
• News headlines
Nothing in common use for:
• Staff Expertise
• Programmes of Events
• Vacancies
• Organisational Structure
• Buildings, Rooms
• Points of service
• Products
– Food Menus
RDF or XML Vocabularies don’t solve the problem
by themselves.
You need:
Examples to copy.
Tools which consume and produce the format.
Online checking tools.
A dataset should solve at least one use case.
Over modelling is fun.
Stop it.
• TODO:
• OPD DOCUMENTATION
Thank-you.
Christopher Gutteridge
University of Southampton
@[email protected]
http://opd.data.ac.uk/