Text to data

MashCat 2012

Ed Chamberlain

+Me

Librarian (systems)

Data ‘munger’

Data consumer?

+The way it used to be …

Control over record consumption

Control over record environment

Control over technology

+

+Competition …

No longer the single authority for content and description

Commercial, social and academic discovery mechanisms

Explosion of digital content

Illusion of ‘all on the web’

+Fit for purpose?

Studies into Google Generation / ‘Generation Y’ (1)

Cambridge Arcadia IRIS report 2009 (2)

Preference for search engine over catalogue

Online over in-building

Trust tutors and peers over Librarian

Still respect the library ‘brand’

1) “The Google generation: the information behaviour of the researcher of the future”, Aslib Proceedings, vol. 60, issue 4 - http://dx.doi.org/10.1108/00012530810887953

2) Arcadia IRIS Project report - http://arcadiaproject.lib.cam.ac.uk/docs/Report_IRIS_final.pdf

+Keyword-based discovery services

New ways to exploit old data

Relevancy ranking

Rich faceting

Greater linking

Search is the new browse

Repositories and archives

Is the OPAC dead?

Improve catalogues

+Different but the same?

Catalogue data is now:

Consumed as keywords (not left anchored access points)

Faceted (not browsed)

Supplemented

Transformed

Merged

Amalgamated

+Prepare for the future …

‘Use case you’ve not yet thought of’

‘Consumer as producer’

‘Pro-Am’

‘Free from silo’

Developers as well as readers

Preference for data over text

+

Library data

Our local catalogues

National / international aggregations

Joe Public

Teenage software developer / hacker

Booksellers

Web start-ups

Search engines

Wikipedia

Other libraries

Research group website

+Libraries have a lot to offer

Bibliographic data is linked to many aspects of successful teaching and research:

Citation lists – measure output

Shared bibliography – core of research group work

Reading lists – backbone of undergraduate teaching

High quality data needed for re-use

Not all possible whilst data resides in the library ‘silo’

+

'Open metadata creates the opportunity for enhancing impact through the release of descriptive data about library, archival and museum resources. It allows such data to be made freely available and innovatively reused to serve researchers, teachers, students, service providers and the wider community in the UK and internationally.'

http://discovery.ac.uk

+Open data releases …

+But …

Is Marc21 the right format for developers (or for libraries)?

Is it easy to convert into something more palatable?

+What can we do with an ISBN?

Build Union catalogues

Find existing or alternative records (copy catalogue)

Find related works (XISBN, ISBNThing)

Match and mash with resources on the web:

Images

Reviews

Citations and references
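Matching on ISBN only works if both sides hold the number in the same form. As a minimal sketch, assuming one side holds 10-character ISBNs and the other 13-digit ones, the conversion looks like this (isbn10_to_13 is a hypothetical helper name, not part of any library):

```perl
use strict;
use warnings;

# Hypothetical helper: convert a 10-character ISBN to its 13-digit form.
# Returns undef if the input is not a bare 10-character ISBN.
sub isbn10_to_13 {
    my ($isbn10) = @_;
    return unless defined $isbn10 && $isbn10 =~ /^[0-9]{9}[0-9Xx]$/;

    # Drop the old check character and add the 978 prefix.
    my @digits = split //, '978' . substr($isbn10, 0, 9);

    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
    my $sum = 0;
    for my $i (0 .. $#digits) {
        $sum += $digits[$i] * ($i % 2 ? 3 : 1);
    }
    my $check = (10 - $sum % 10) % 10;

    return join('', @digits) . $check;
}

print isbn10_to_13('0306406152'), "\n";   # prints 9780306406157
```

The input still has to be a clean ISBN first, which is where the 020 field gets awkward.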

+020 - ISBN

What catalogue record users want:

Accuracy

Contextualization

Access point

Something legible to read

What data consumers want:

– Accuracy

– Contextualization

– Access point

– Reusability

– Granularity

+So …

Take ISBN from an 020$a:

my $isbn = $record->field('020')->as_string("a");

0123456789(pbk)

(pbk) ?

Is it the same as (.pbk) I noticed earlier?

I’m a developer – I can solve this …

Regex /^[0-9]+/ - just gets the leading numbers …

Oh hang on, don’t some ISBNs end in X?

And all that information on hardback /paperback is lost …
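One way round this, as a rough sketch only: pull the ISBN (allowing a final X) and the bracketed qualifier out separately, so neither is lost. parse_020a is a hypothetical helper, and the assumption that the qualifier is the first parenthesised chunk will not hold for every record.

```perl
use strict;
use warnings;

# Hypothetical helper: split an 020$a string into ISBN and binding qualifier.
sub parse_020a {
    my ($raw) = @_;
    return unless defined $raw;

    # ISBN: digits with an optional final X; hyphens are tolerated and removed.
    my ($isbn) = $raw =~ /^\s*([0-9][0-9-]*[0-9Xx])/;
    return unless defined $isbn;
    $isbn =~ s/-//g;
    $isbn = uc $isbn;
    return unless length($isbn) == 10 || length($isbn) == 13;

    # Qualifier: first parenthesised chunk, minus stray leading or trailing
    # dots and spaces, so "(pbk)", "(.pbk)" and "(pbk.)" all become "pbk".
    my ($qualifier) = $raw =~ /\(([^)]*)\)/;
    if (defined $qualifier) {
        $qualifier =~ s/^[.\s]+|[.\s]+$//g;
        $qualifier = lc $qualifier;
    }

    return { isbn => $isbn, qualifier => $qualifier };
}

my $parsed = parse_020a('0123456789(pbk)');
print "$parsed->{isbn} $parsed->{qualifier}\n";   # prints "0123456789 pbk"
```

It works, but every consumer of the record has to write (and maintain) something like it.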

+Non Marc …

<identifier type="isbn" relation="hardback">0123456789x</identifier>

"identifier": {"id": "0123456789", "type": "isbn", "rel": "hardback"}

<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_100045> <http://purl.org/dc/terms/identifier> "urn:isbn:2853990060" .

<http://data.lib.cam.ac.uk/id/type/46657eb180382684090fda2b5670335d> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/Book> .

+Advantages

Self describing (if you read English)

Granular

Data NOT text for display (although this can be easily generated)
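To illustrate the last point, here is a small sketch, assuming the identifier is held as a simple hash with the same keys as the JSON example (display_identifier is a hypothetical helper). The data serialises directly, and the display text is generated from it when needed.

```perl
use strict;
use warnings;
use JSON::PP;   # ships with core Perl

my $identifier = { id => '0123456789', type => 'isbn', rel => 'hardback' };

# canonical() sorts the keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode({ identifier => $identifier });
print "$json\n";

# Display text is derived from the data, not stored alongside it.
sub display_identifier {
    my ($id) = @_;
    return sprintf('%s %s (%s)', uc $id->{type}, $id->{id}, $id->{rel});
}

print display_identifier($identifier), "\n";   # prints "ISBN 0123456789 (hardback)"
```

Going the other way, from display text back to data, is the hard direction.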

+$100 …

"author" : [
  {
    "birthDate" : "1832",
    "firstname" : " James",
    "deathDate" : "1929",
    "name" : "Greenwood, James",
    "lastname" : "Greenwood"
  }
]

• 1001_ |a Greenwood, James, |d 1832-1929.

• Greenwood, James, 1832-1929.

+

# Assumes $record is a MARC::Record, %exportRecord is the export object,
# and trim() is a helper defined elsewhere in the script.
my @exportAuthors = ();

if ($record->field('100')) {
    foreach my $eachAuthor ($record->field('100')) {
        my %exportAuthor = ();

        my $authorFull = trim($eachAuthor->subfield('a'));
        $exportAuthor{'name'} = $authorFull;

        my @parsed_author = split(/,/, $authorFull);
        $exportAuthor{'lastname'}  = $parsed_author[0];
        $exportAuthor{'firstname'} = $parsed_author[1];

        # The glorious 100$d disassembled ...
        my $dates = $eachAuthor->subfield('d');
        if ($dates) {
            # First of all, skip ca. and fl., which aren't real birth or death dates
            if ($dates =~ /fl\.|ca\./) {
                # do nothing
            }
            # Otherwise, if the date contains a hyphen, assume a range
            # (this should also work for unterminated dates)
            elsif ($dates =~ /-/) {
                my @range = split(/-/, $dates);
                $exportAuthor{'birthDate'} = trim($range[0]);
                if ($range[1]) {
                    $exportAuthor{'deathDate'} = trim($range[1]);
                }
            }
            # No hyphen: assume a single date.
            # Look for a definitive birth event with a 'b.' ...
            elsif ($dates =~ /\bb\.\s*(.*)/) {
                $exportAuthor{'birthDate'} = trim($1);
            }
            # ... or a definitive death event with a 'd.'
            elsif ($dates =~ /\bd\.\s*(.*)/) {
                $exportAuthor{'deathDate'} = trim($1);
            }
            # Final assumption for authors with a single recorded date and
            # no hyphen: assume it's a birth date?
            else {
                $exportAuthor{'birthDate'} = trim($dates);
            }
        }

        # Assemble author object
        push(@exportAuthors, \%exportAuthor);
    }

    # Add list of authors to export object
    $exportRecord{'author'} = \@exportAuthors;
}
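To see how the date branches behave without a MARC record to hand, the same logic can be pulled into a standalone function and run against a few sample 100$d strings. This is a sketch under assumptions: parse_100d is a hypothetical name, and this trim() is a stand-in that also strips trailing full stops and commas, which may differ from the helper used in the real script.

```perl
use strict;
use warnings;

# Stand-in for the script's trim() helper (an assumption): strips leading
# whitespace plus trailing whitespace, full stops and commas.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+|[\s.,]+$//g;
    return $s;
}

# Hypothetical helper: turn a 100$d string into birth and death dates.
sub parse_100d {
    my ($dates) = @_;
    my %out;
    return \%out unless defined $dates && length $dates;

    if ($dates =~ /fl\.|ca\./) {
        # flourished / circa: not real birth or death dates, so record nothing
    }
    elsif ($dates =~ /-/) {
        # a range, possibly unterminated ("1950-")
        my ($birth, $death) = split /-/, $dates;
        $out{birthDate} = trim($birth);
        $out{deathDate} = trim($death) if defined $death && $death =~ /\d/;
    }
    elsif ($dates =~ /\bb\.\s*(.*)/) { $out{birthDate} = trim($1); }
    elsif ($dates =~ /\bd\.\s*(.*)/) { $out{deathDate} = trim($1); }
    else                              { $out{birthDate} = trim($dates); }

    return \%out;
}

for my $d ('1832-1929.', '1950-', 'b. 1901', 'd. 1688', 'fl. 1500') {
    my $p = parse_100d($d);
    printf "%-12s birth=%s death=%s\n",
        $d, $p->{birthDate} // '-', $p->{deathDate} // '-';
}
```

Five branches for one subfield, and it still guesses in the final case.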

+How is this being solved?

Fix it at the source:

RDA

Marc transition initiative

Other initiatives – BL, OCLC linked data releases

ONIX

MODS

+Pragmatism: the end of big standards

Adoption of one new standard (or several) for its own sake is pointless

Fit in around changing needs of libraries and systems

Data needs to be flexible and re-purposable

No standard to ‘rule them all’ in the post-Marc21 world

+If we do nothing?

Education

Text to data