Linked Library Data in the wild


How we use Linked Open Data to drive our next generation discovery interface, and how we've gone about it.

Linked Library Data in the wild

Technical Lead for Prism

Phil John

Introductions...

So, what’s Prism then?

Introductions...

a next generation discovery interface

Prism

Introductions

(yes…even configuration settings)

Built entirely on Linked Data

Prism

Discovery of library catalogue resources

Prism

but grander plans afoot...

...some future sources...

Prism

journal metadata

archives/records (e.g. DS Calm)

thesis repositories

rare items/special collections

and more!

SaaS/Cloud Based

Prism

MARC 21 to RDF

Performs data conversion

Prism

this ensures it keeps in sync with the LMS

Initial “bulk” conversion then periodic “delta” files

Prism

provided by a suite of RESTful web services

Borrower/Availability data pulled from LMS “live”

Prism

just add .rss to collections or .rdf/.nt/.ttl/.json to items

Linked Data API
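As a rough illustration of the suffix convention described above, here is a minimal sketch of a helper that builds the URL for a given representation. The function name `with_format` and the `example.org` URLs are assumptions made for illustration; only the list of suffixes comes from the slide.

```python
from urllib.parse import urlsplit, urlunsplit

# Suffixes named on the slide; everything else here is hypothetical.
ITEM_FORMATS = {"rdf", "nt", "ttl", "json"}
COLLECTION_FORMATS = {"rss"}


def with_format(url: str, fmt: str, *, collection: bool = False) -> str:
    """Append a representation suffix to the path part of a URL."""
    allowed = COLLECTION_FORMATS if collection else ITEM_FORMATS
    if fmt not in allowed:
        raise ValueError(f"unsupported format {fmt!r}")
    parts = urlsplit(url)
    # Only the path changes, so any query string is preserved.
    path = parts.path.rstrip("/") + "." + fmt
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


item_ttl = with_format("http://example.org/items/123", "ttl")
search_rss = with_format("http://example.org/search?q=x", "rss", collection=True)
```

Keeping the suffix on the path rather than in a query parameter means the same search query can be reused unchanged for the feed.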

Prism

The Challenges

Prism

Extracting data from MARC 21

The Challenges

Some quotes...

Extracting Data from MARC 21

...cataloguers may want to look away now

...and even if it does, there are millions of existing records that we’ll want to convert

MARC 21 is not going away anytime soon...

Extracting Data from MARC 21

How are we approaching it?

Extracting Data from MARC 21

By tackling it in small chunks!

Extracting Data from MARC 21

We’ve created a solution that...

Extracting Data from MARC 21

allows us to build the model iteratively

compartmentalises code for different sections

provides robustness

is performant

allows us to experiment

Parser Observer Handlers

Our conversion pipeline

Extracting Data from MARC 21

Parser Observer Handlers

fires events when it encounters a MARC 21 data structure; very strict with syntax

MARC 21 Parser

Extracting Data from MARC 21

Parser Observer Handlers

listens for MARC 21 data structures and hands control over to one or more handlers

Event Observer

Extracting Data from MARC 21

Parser Observer Handlers

know how to convert MARC 21 structures and fields into linked data

Bibliographic Handlers
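The parser, observer and handler roles described above can be sketched roughly as follows. This is not the Prism code: the class and function names, the simplified field representation and the `dct:title` predicate are all assumptions chosen to keep the example short.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Field = Tuple[str, Dict[str, str]]          # (tag, {subfield code: value})
Triple = Tuple[str, str, str]               # (subject, predicate, object)
Handler = Callable[[str, Dict[str, str]], List[Triple]]


class Observer:
    """Listens for fields and hands control to the registered handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, tag: str, handler: Handler) -> None:
        self._handlers[tag].append(handler)

    def notify(self, tag: str, subfields: Dict[str, str]) -> List[Triple]:
        triples: List[Triple] = []
        for handler in self._handlers.get(tag, []):
            triples.extend(handler(tag, subfields))
        return triples


def parse(fields: Iterable[Field], observer: Observer) -> List[Triple]:
    """Strict 'parser': rejects malformed tags, fires one event per field."""
    triples: List[Triple] = []
    for tag, subfields in fields:
        if not (len(tag) == 3 and tag.isdigit()):
            raise ValueError(f"malformed tag: {tag!r}")
        triples.extend(observer.notify(tag, subfields))
    return triples


def title_handler(tag: str, subfields: Dict[str, str]) -> List[Triple]:
    # Hypothetical handler: emits a title triple with an assumed predicate.
    return [("_:record", "dct:title", subfields["a"].rstrip(" /:"))]


observer = Observer()
observer.register("245", title_handler)
result = parse([("245", {"a": "Goodfellas"}), ("538", {"a": "Blu-Ray."})], observer)
```

Fields with no registered handler (the 538 here) are simply ignored, which is what lets a model be built up one handler at a time.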

Extracting Data from MARC 21

So, where are we up to?

Extracting Data from MARC 21

we tackled this one first as it allows us to reason more fully about the record

Format (and duration)

Extracting Data from MARC 21

In theory quite easy...

Format

...in practice not so much...

Format

no code for CD (12cm sound disk, 1.4m/s)

DVD and LaserDisc share(d) a code

LC slow(ish) to support new formats in M21

limited use of control field (007) codings...

...so need to parse text from 3xx, 5xx fields

LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher |
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert

Teasing format from a MARC 21 Record
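To make the idea of falling back to free text concrete, here is a minimal sketch that looks for format keywords in 3xx and 5xx fields and only then falls back to a coarse reading of the 007. The keyword table and the function name are assumptions; the only 007 knowledge used is that `v` in position 00 means videorecording and `d` in position 01 means videodisc.

```python
import re
from typing import Dict, List, Optional

# Hypothetical keyword table; a real one would be far longer.
TEXT_HINTS = [
    (re.compile(r"blu-?\s?ray", re.I), "Blu-ray"),
    (re.compile(r"\bdvd\b", re.I), "DVD"),
    (re.compile(r"\bcompact dis[ck]\b|\bcd\b", re.I), "CD"),
]


def detect_format(fields: Dict[str, List[str]]) -> Optional[str]:
    """fields maps a tag to the list of its subfield values (simplified)."""
    # 1. Free text in 3xx/5xx is often more specific than the coded data.
    text = " ".join(
        value
        for tag in sorted(fields)
        if tag[0] in "35"
        for value in fields[tag]
    )
    for pattern, label in TEXT_HINTS:
        if pattern.search(text):
            return label
    # 2. Fall back to a coarse category from the 007 control field.
    f007 = (fields.get("007") or [""])[0]
    if f007[:2] == "vd":
        return "Videodisc (unspecified)"
    return None


goodfellas = {
    "007": ["vd||s||||"],
    "300": ["1 Blu-Ray (139 min.) :", "col."],
    "538": ["Blu-Ray."],
}
detected = detect_format(goodfellas)
```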

Which gives us...

an important part of the record to model, or so I’ve been told

Title

Extracting Data from MARC 21

Quite tricky because...

Title

don’t want to duplicate data that appears elsewhere (e.g. in 100/700)

‡c must be last subfield in a 245...

...so sometimes data from ‡n / ‡p is in ‡c instead...

...which means we can’t just drop the ‡c

http://journal.code4lib.org/articles/3832

Got a helping hand from Code4Lib Journal (thanks!)

Title

Now with more title

sounds easy...acronyms from EAN to UPC describing 13 digit codes...right?

Identifier

Extracting Data from MARC 21

what are all those other things doing in the ‡a?

...STOP!

Identifier

Identifier

“For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.”

Library of Congress Rule Interpretation 1.8

(and then validate whatever’s left)

So we need to parse them out
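A minimal sketch of that two-step approach, assuming 13-digit codes only: first separate the number from any qualifier text in the ‡a, then validate what is left with the standard EAN-13 check digit. The function names are hypothetical, and ISBN-10 validation is deliberately left out.

```python
import re
from typing import Optional, Tuple


def split_identifier(subfield_a: str) -> Tuple[Optional[str], str]:
    """Split '<number> (<qualifier>)' into (digits, qualifier text)."""
    match = re.match(r"^\s*([0-9][0-9\- ]*[0-9Xx])\s*(.*)$", subfield_a)
    if match is None:
        # No leading number at all: everything is qualifier text.
        return None, subfield_a.strip()
    digits = re.sub(r"[\- ]", "", match.group(1))
    qualifier = match.group(2).strip(" ()")
    return digits, qualifier


def ean13_is_valid(digits: str) -> bool:
    """Standard EAN-13 check: weights 1,3,1,3... over the first 12 digits."""
    if not re.fullmatch(r"\d{13}", digits):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


code, qualifier = split_identifier("9780306406157 (pbk. : alk. paper)")
```

Because the qualifier wording is not consistent between records, the sketch keeps it as raw text rather than trying to map it to a fixed vocabulary.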

Identifier

LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher |
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert

Phew, this one’s easy, no (pbk), (hbk) or even (pbk. , alk. paper) to contend with

Now we can start performing lookups against other sources!

hardest of the lot...

Author

Extracting Data from MARC 21

...why?

Author

Newt Scamander

Rowling, J.K. vs Rowling, Joanne K.

Few records with relator term in 100/700 ‡e...

...so we have to parse that from the 245 ‡c...

...and we don’t just deal with English records.
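As a rough sketch of pulling roles out of a 245 ‡c statement of responsibility, the example below splits on semicolons and matches a small phrase table. The table is hypothetical and English-only (which is exactly the limitation noted above), and the role labels, including reading "music by" as composer, are assumptions.

```python
from typing import List, Tuple

# Hypothetical, English-only phrase table; role labels are assumed.
ROLE_PHRASES = {
    "directed by": "director",
    "music by": "composer",
    "edited by": "editor",
    "illustrated by": "illustrator",
}


def roles_from_statement(statement: str) -> List[Tuple[str, str]]:
    """Return (name, role) pairs found in a statement of responsibility."""
    results: List[Tuple[str, str]] = []
    for segment in statement.split(";"):
        segment = segment.strip().rstrip(".")
        lowered = segment.lower()
        for phrase, role in ROLE_PHRASES.items():
            if lowered.startswith(phrase + " "):
                # Everything after the phrase is treated as the name.
                results.append((segment[len(phrase):].strip(), role))
                break
    return results


roles = roles_from_statement(
    "directed by Martin Scorsese ; music by Christopher Brooks"
)
```

Segments that match no phrase are dropped rather than guessed at, so an unknown wording yields no role instead of a wrong one.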

we’ve licensed the names/subjects authority files, and created RDF from them

Library of Congress to the rescue!

Author

LDR: 01425ngm a22005058 4504
001: 750785
003: xxxxxxx
005: 20090824164118.0
007: vd||s||||
008: 080623s2007 enk||| e v|eng d
020: , | $c Retail (S24.99) |
024: 3, | $a 7321900108089 |
028: 4, 0 | $a BDY10808 | $b Warner Home Video |
029: , | $a 7321900108089 |
082: , | $a 812
245: 0, 0 | $a Goodfellas | $h [videorecording] / | $c directed by Martin Scorsese ; music by Christopher Brooks
260: , | $b Warner Home Video, | $c 2007. |
300: , | $a 1 Blu-Ray (139 min.) : | $b col. |
306: , | $a 021900 |
366: , | $b 20070611 |
511: , | $a Starring Robert De Niro, Ray Liotta and Joe Pesci
521: 8, | $a BBFC code: 18. |
538: , | $a Blu-Ray. |
700: 1, | $a Scorsese, Martin |
700: 1, | $a Brooks, Christopher | $e music
852: , | $b John Harvard | $c BLU-RAY DISC | $m 18 | $z , $z Blu Ray Disc. 18Cert

A contrived example (sorry!) with and without relator terms

Hope you can all read this at the back!

A closer look at Authority Matching

Author

Some requirements:

Author

needs to be fast...

...(able to process 2M records in several hours)

requires accuracy

must handle pseudonyms and variant spellings

which means that for bulk conversions we aren’t incurring HTTP overhead millions of times

So we store as RDF, but index in Solr
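The benefit of a local index can be illustrated with a small in-memory stand-in. This is only a sketch: a plain dictionary takes the place of the Solr index, the authority URI is a made-up `example.org` address, and the normalisation rules are assumptions rather than the ones Prism uses.

```python
import re
import unicodedata
from typing import Dict, Iterable, Optional


def normalise(name: str) -> str:
    """Fold case, accents and punctuation so near-identical headings collide."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^\w]+", " ", stripped).strip().lower()


def build_index(authorities: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map every normalised name form (preferred or variant) to its URI."""
    index: Dict[str, str] = {}
    for uri, names in authorities.items():
        for name in names:
            index[normalise(name)] = uri
    return index


def match(index: Dict[str, str], heading: str) -> Optional[str]:
    # A local lookup: no HTTP round trip per record during bulk conversion.
    return index.get(normalise(heading))


ROWLING = "http://example.org/authority/rowling"  # hypothetical URI
index = build_index(
    {ROWLING: ["Rowling, J. K.", "Rowling, Joanne K.", "Scamander, Newt"]}
)
matched = match(index, "Rowling, J.K.")
```

Pseudonyms and fuller name forms only match because the authority record lists them as variants; normalisation alone covers punctuation and spacing differences, not different names.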

Author

You can tell J.K. Rowling is successful, she’s been translated lots

Language/Alternate Graphical Representation

Extracting Data from MARC 21

Nice “high impact” feature

Language

allows switching between representations

both forms can be searched for

uses RDF content language feature, so useful for people using machine readable RDF

001: | 3013197
008: | 080624s2007\\\\cc\a\\\\\\\\\\000\0\chi\d
041: , | $a chi
043: , | $a a-cc--- |
050: , 4 | $a NE1300.8.C6 | $b S48 2007 |
100: 1, | $6 880-01 | $a Shu, Huifang. |
245: 1, 0 | $6 880-02 | $a Fan chen su zi : | $b Min jian nian hua zhong de wen qing feng su / | $c Shu Huifang, Shen Hong zhu. |
246: 3, 1 | $6 880-03 | $a Min jian nian hua zhong de wen qing feng su |
250: , | $6 880-04 | $a Di 1 ban. |
260: , | $6 880-05 | $a Beijing : | $b Zhongguo gong ren chu ban she, | $c 2007. |
300: , | $a 3, 3, 229 p. : | $b col. ill. ; | $c 24 cm. |
440: , 0 | $6 880-06 | $a Zhongguo min su wen hua cong shu |
700: 1, | $6 880-07 | $a Shen, Hong. |
880: 1, | $6 100-01/$1 | $a 舒惠芳 . |
880: 1, 0 | $6 245-02/$1 | $a 凡尘俗子 : | $b 民间年画中的温情风俗 / | $c 舒惠芳 , 沈泓著 .
880: 3, 1 | $6 246-03/$1 | $a 民间年画中的温情风俗 |
880: , | $6 250-04/$1 | $a 第 1 版 . |
880: , | $6 260-05/$1 | $a 北京 : | $b 中国工人出版社 | $c 2007. |
880: , 0 | $6 440-06/$1 | $a 中国民俗文化丛书 |
880: 1, | $6 700-07/$1 | $a 沈泓 . |
852: , | $b Main Library | $c East Asian Coll.,Purple 2 | $h 398.351 | $m S4 |

Dealing with language in MARC 21

MARC Parser Observer Handlers

tagged with an ISO-639-2 language and masquerading as the field listed in ‡6

Passes 880s back into Observer
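The rerouting step can be sketched as follows: the ‡6 linkage of an 880 is parsed to find the field it stands in for, and the remaining subfields are handed back under that tag together with a language code. The names `parse_linkage` and `reroute_880` are hypothetical, and how the language code is chosen for a record is assumed to happen elsewhere.

```python
import re
from typing import Dict, NamedTuple, Tuple


class Linkage(NamedTuple):
    tag: str          # the field this 880 stands in for, e.g. "245"
    occurrence: str   # pairs the 880 with its romanised twin, e.g. "02"
    script: str       # script identification code, e.g. "$1" (may be empty)


LINKAGE = re.compile(r"^(\d{3})-(\d{2})(?:/([^/]+))?")


def parse_linkage(subfield_6: str) -> Linkage:
    match = LINKAGE.match(subfield_6.strip())
    if match is None:
        raise ValueError(f"unparseable linkage: {subfield_6!r}")
    return Linkage(match.group(1), match.group(2), match.group(3) or "")


def reroute_880(
    subfields: Dict[str, str], language: str
) -> Tuple[str, Dict[str, str], str]:
    """Return (masquerade tag, payload subfields, language tag)."""
    link = parse_linkage(subfields["6"])
    payload = {code: value for code, value in subfields.items() if code != "6"}
    return link.tag, payload, language


rerouted = reroute_880({"6": "245-02/$1", "a": "凡尘俗子 :"}, "chi")
```

Once rerouted, the vernacular title goes through the same 245 handling as the romanised one, differing only in its language tag.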

Language

Which gives us...

it’s part of the reason we use Linked Data...but it’s got some challenges at the moment

Using/Linking to External Datasets

The Challenges

Pitfalls:

Language

what if a datasource suffers downtime...

...or worse, is taken offline permanently?

can we trust this data?

can we display it, or is it susceptible to vandalism?

Potential solutions (YMMV):

Language

harvest datasets and keep them close to the app...

...or, if that’s not practical, proxy requests using a caching proxy such as Squid

if using Wikipedia and worried about vandalism...

...check for lots of rapid edits, consider caching (or turning off temporarily)
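One way to picture the caching idea is a small cache that serves a stale copy when the external source cannot be reached. This is a sketch under assumptions, not a replacement for a real caching proxy such as Squid: the class name is hypothetical, and the fetch function is injected so that no particular HTTP library is implied.

```python
import time
from typing import Callable, Dict, Tuple


class StaleTolerantCache:
    """Cache responses for a TTL; on fetch failure, fall back to the old copy."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]                  # still fresh
        try:
            body = self._fetch(url)
        except Exception:
            if cached is not None:
                return cached[1]              # source is down: serve stale copy
            raise                             # nothing cached, nothing to serve
        self._store[url] = (now, body)
        return body
```

A stale copy is never re-stamped as fresh, so the next request after the source recovers fetches new data.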

...or – what we’d like to see happen to Linked Library Data

The Future...

especially on the peripheries – authority data, author information, links to other resources

More library data as LOD

The Future

seriously – this would make our lives so much simpler

LMS vendors adopting LOD

The Future

LOD replacing MARC 21 as the standard representation of bibliographic records

The Future

Photo Credits

Slide 15 - http://www.flickr.com/photos/gammaman/5241860326/
Slide 21 - http://www.flickr.com/photos/agizienski/3778965891/
Slide 40 - http://www.flickr.com/photos/54409200@N04/5070012761/
Slide 42 - http://www.flickr.com/photos/proimos/4199675334/
Slide 48 - http://www.flickr.com/photos/maveric2003/91198458/
Slide 63 - http://richard.cyganiak.de/2007/10/lod/
Slide 67 - http://www.flickr.com/photos/markchapmanphoto/5139429152/
Slide 72 - http://www.flickr.com/photos/-bast-/349497988/