New approaches to the catalog

T. Hickey http://errol.oclc.org/laf/n82-54463.html

Svensk Biblioteksförening (Swedish Library Association), 2005 October 28

OCLC

Founded 1967
Nonprofit membership organization
> 53,000 libraries
96 countries
~1,000 employees

Cataloging
Interlibrary Loan
Preservation
Dewey Decimal Classification
netLibrary
FirstSearch

OCLC Research

Research for both
• OCLC services
• Membership

Metadata management
Knowledge organization
Content management
Interoperability
Systems & interaction design
~30 employees

What do users want?

The right information, with minimum effort

How to give them what they want

Catch them where they are
Increase our data
Improve our data
Make the data work harder
Interconnect with other systems
Do all this efficiently

What has changed

Computers and telecommunications
• User expectations
• Digital materials
• Remoteness of our users
• Huge amounts of bandwidth, storage

The competition

Online booksellers
• Reviews
• Tables of contents
• Excerpts
• Inside-the-book searching

Web search engines
• Speed
• Full-text searching
• Global coverage (of web resources)
• Good enough

Ourselves
• Electronic journals

Current projects (my group)

Live search
Registries, PURLs
Dewey browser
Harvesting, electronic theses
VIAF, LAF
SRU/W, OpenURLs, OAI
FRBR, xISBN
Beowulf cluster
Map-reduce
Text searching
Batch loading

Open WorldCat
WorldCat Wiki
Publisher Names
MXG

Other Research Projects

FictionFinder, Curiouser
Schema Transformation
Terminology Services
Digital Preservation
Collection Analysis
Dublin Core
FAST
User Studies
Data mining

Also: http://www.oclc.org/research/researchworks/

Catch them where they are

Google, Yahoo, etc.
• Open WorldCat
• OpenURL
• OAI-PMH

Creation too
• WorldCat Wiki
• Tags?

Open WorldCat

Editions

OpenURL

OpenURL registry
• Supports version 1.0
• Also registry of OpenURL servers
• Used for WikiD

WorldCat ‘Wiki’

Opening up WorldCat to user annotations
• Reviews
• Notes
• Tables of contents
• Cover art?
• Book lists?

Based on WikiD software
• Full Wiki
• Many features off for WorldCat
• Uses OpenURL 1.0 protocol internally
• Allows collections of pages of arbitrary XML schemas
• Tools for the creation of simple collections

Doesn’t look like a Wiki

Reviews

Tags?

Folksonomies?
User-generated keywords
We’ve been here before
• Is it different?
• Is there another direction?

Opening Dewey

More data

Harvesting
• OAI-PMH
• ETDs

Batch load
• 60 million records
• 3 million new manifestations

Other
• Cover art
• Reviews
• WC

Better data and organization

VIAF
FRBR
Authority files in general
• LAF
• Publisher names
• Genre
• FAST

Registries
• PURLs
• Generalized solution?

Get them nearer to creation

FRBR

Work-set algorithm
• Keys based on author/title
• Authority files
• Auxiliary authority files
• xISBN

Used for
• xISBN
• Open WorldCat
• FirstSearch (coming)
• Collection analysis (coming)
• Research
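
The idea of keys based on author/title can be illustrated with a minimal sketch. This is not the OCLC work-set algorithm itself, which also draws on authority files; the normalization rules, the `work_key` format, and the placeholder ISBN labels below are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def work_key(author, title):
    """Hypothetical work-set key: normalized author, a slash, normalized title."""
    return normalize(author) + "/" + normalize(title)


def group_into_work_sets(records):
    """Group (author, title, isbn) tuples that share the same work key."""
    work_sets = defaultdict(list)
    for author, title, isbn in records:
        work_sets[work_key(author, title)].append(isbn)
    return dict(work_sets)


# Placeholder identifiers, not real ISBNs.
records = [
    ("Lindgren, Astrid", "Pippi Långstrump", "ISBN-A"),
    ("LINDGREN, ASTRID.", "Pippi Langstrump", "ISBN-B"),
    ("Lindgren, Astrid", "Emil i Lönneberga", "ISBN-C"),
]
work_sets = group_into_work_sets(records)
```

Here the first two records collapse into one work set despite differences in case, punctuation, and diacritics, which is the kind of grouping a service such as xISBN can then expose as "other editions of this ISBN".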

Authority Files

LAF
• http://errol.oclc.org/laf/n82-54463.html

Publisher names
• Not normally controlled
• Looking for variations with ISBN prefixes
• Also worked with dissertations


VIAF

Merge national-level files
Library of Congress (NACO) and Die Deutsche Bibliothek
• Bibliographic records analyzed
• 15% of matches would be erroneous if based on names alone

Basic matching now completed
• 435,000 matching names
• < 1% mismatched

Working on
• Public interface
• OAI harvesting
• Persistent identifiers

Maj

Registries

Show relationships between metadata
Often associated with an identifier
General solution?
Examples
• Authority files
• WorldCat
• PURLs

PURLs

Persistent URLs
• Map one URL to another
• http://purl.org/hickey/outgoing -> http://outgoing.typepad.com/
• 500,000+ PURLs
• 111 million resolutions

Port to WikiD platform?
• http://www.oclc.org/research/projects/wikid/

String of PURL servers?
• Use OAI-PMH for synchronization
• Spread responsibility

Generalized solution?
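
The "map one URL to another" idea amounts to a lookup table that answers with a redirect. The sketch below is a toy in-memory version under that assumption; the `PurlTable` class and its methods are hypothetical names, not part of the PURL server software, and the choice of HTTP status 302 is an assumption for illustration.

```python
class PurlTable:
    """Toy registry mapping persistent URLs to their current targets."""

    def __init__(self):
        self._targets = {}
        self.resolutions = 0  # running count of successful lookups

    def register(self, purl, target):
        """Create or update a mapping; the PURL stays fixed, the target may move."""
        self._targets[purl] = target

    def resolve(self, purl):
        """Return (status, location) as an HTTP-style redirect answer."""
        target = self._targets.get(purl)
        if target is None:
            return 404, None
        self.resolutions += 1
        return 302, target


table = PurlTable()
table.register("http://purl.org/hickey/outgoing", "http://outgoing.typepad.com/")
status, location = table.resolve("http://purl.org/hickey/outgoing")
```

If the target later moves, only `register` is called again; every link that cites the PURL keeps working.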


More connectivity

OpenURL
RSS feeds
OpenSearch, SRU/W
OAI-PMH

OpenURL

Developed to address the ‘appropriate copy’ problem
Transitioning to OpenURL 1.0
OpenURL resolver
• Accepts requests specifying
• Resource
• Services

Generalized syntax
• Specifying a resource
• Services to be performed

Metadata elements specified in registry
• http://purl.org/openurl/
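
A request that names a resource and, optionally, a service can be sketched as an OpenURL 1.0 key/encoded-value (KEV) query string. The resolver address and the service identifier below are hypothetical placeholders; the `url_ver`, `rft_val_fmt`, `rft.isbn`, `rft.btitle`, and `svc_id` keys follow the OpenURL 1.0 KEV conventions for the book metadata format.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical resolver address, used only for illustration.
RESOLVER = "http://resolver.example.org/openurl"


def book_openurl(isbn, title, service=None):
    """Build an OpenURL 1.0 KEV link describing a book (the referent)."""
    params = [
        ("url_ver", "Z39.88-2004"),                    # OpenURL 1.0 version tag
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),  # referent described as a book
        ("rft.isbn", isbn),
        ("rft.btitle", title),
    ]
    if service is not None:
        params.append(("svc_id", service))  # requested service, if any
    return RESOLVER + "?" + urlencode(params)


# Placeholder ISBN and a made-up service identifier.
url = book_openurl("0123456789", "New approaches to the catalog",
                   service="info:example/svc/holdings")
query = parse_qs(urlsplit(url).query)
```

The resolver, not the link author, decides which copy or service is appropriate for the requesting user, which is how the scheme addresses the ‘appropriate copy’ problem.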


SRU

Simplified version of Z39.50
• Web based
• SRW – SOAP
• SRU – URL

Even simpler?
• OpenSearch
• No search syntax
• Looking for common ground

MXG
• Metasearch XML Gateway
• Simplifies metasearchers’ lives
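
To show what "SRU – URL" means in practice, here is a minimal sketch that builds a searchRetrieve request as a plain URL. The base address is a hypothetical placeholder and the CQL query is only an example; the parameter names are those of SRU 1.1.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical SRU endpoint.
url = sru_search_url("http://sru.example.org/catalog", 'dc.title = "catalog"')
query = parse_qs(urlsplit(url).query)
```

Because the whole request is a URL, it can be issued from a browser, a script, or a metasearch gateway without any SOAP tooling.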

OAI-PMH

Method of harvesting metadata
• More generally, a way of synchronizing databases

No real restriction to metadata
Becomes a repository protocol
• Identifiers
• Timestamps

Layered implementation
• OAI
• SRU
• Pears
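
The reason identifiers plus timestamps are enough to synchronize databases can be sketched as follows. The repository is modeled as an in-memory dictionary, which is an assumption for illustration; `list_records_url` shows the corresponding OAI-PMH `ListRecords` request, with a hypothetical base address. Deleted records and resumption tokens are left out of this sketch.

```python
from urllib.parse import urlencode


def list_records_url(base_url, since=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request, optionally limited by 'from'."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since is not None:
        params["from"] = since
    return base_url + "?" + urlencode(params)


def changed_since(repository, since):
    """Select records whose datestamp is at or after 'since' (inclusive)."""
    return {identifier: (stamp, record)
            for identifier, (stamp, record) in repository.items()
            if since is None or stamp >= since}


def synchronize(local, repository, last_harvest):
    """Copy changed records into 'local'; return the new high-water mark."""
    updates = changed_since(repository, last_harvest)
    local.update(updates)  # identifiers make updates overwrite older copies
    return max((stamp for stamp, _ in updates.values()), default=last_harvest)


# ISO 8601 UTC datestamps compare correctly as plain strings.
repository = {
    "oai:example:1": ("2005-10-01T00:00:00Z", "record one"),
    "oai:example:2": ("2005-10-20T00:00:00Z", "record two"),
}
local = {}
first_mark = synchronize(local, repository, None)
repository["oai:example:1"] = ("2005-10-25T00:00:00Z", "record one, revised")
second_mark = synchronize(local, repository, first_mark)
```

Nothing in the loop depends on the payload being metadata, which is why the same mechanism can serve as a general repository protocol.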

Efficient processing

Beowulf cluster
Map reduce
Text searching

Beowulf Cluster

24 nodes
• 2 processors, 4 gigabytes of RAM, 120 gigabytes disk
• Gigabit network

Use it for
• FRBR processing
• Text indexing
• Text searching

~30-fold speed up on many tasks
• 1 year ⇒ 2 weeks
• 1 week ⇒ 1 day
• 1 day ⇒ 1 hour
• 1 hour ⇒ 2 minutes

Extremely cheap processing

Map reduce

Pioneered by Google
• Petabytes of data on thousands of nodes

Adapted to our cluster
• Tens of gigabytes of data on dozens of nodes

Simple functional programming paradigm
Allows batch processing across cluster
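
The functional paradigm can be shown in a few lines. This is a single-process sketch, not the cluster implementation: on a real cluster the map calls and the per-key reductions would be spread across nodes. The record layout and the `language_mapper` function are assumptions for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase: emit (key, value) pairs
            groups[key].append(value)
    # reduce phase: fold each key's values into a single result
    return {key: reduce(reducer, values) for key, values in groups.items()}


def language_mapper(record):
    """Emit one count per record, keyed by its language code."""
    yield record["language"], 1


def add(a, b):
    return a + b


records = [{"language": "swe"}, {"language": "eng"}, {"language": "swe"}]
counts = map_reduce(records, language_mapper, add)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, the work partitions naturally across machines.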

Text Searching

Spread database across cluster
Two levels of aggregation
• 3 servers/node
• 24-way aggregation
• Aggregators run across cluster

SRU used
• HTTP based
• SRW (SOAP) slowed it down

Open source software
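
One possible reading of the two levels of aggregation is sketched below: results from the servers on each node are merged first, and the per-node lists are then merged across the cluster. That layering, the term-count scoring, and the in-memory partitions are all assumptions for illustration; the actual system communicates between servers over SRU.

```python
import heapq
from itertools import islice


def search_partition(partition, term):
    """Search one server's slice; return (score, doc_id) hits, best first."""
    hits = []
    for doc_id, text in partition.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)


def aggregate(result_lists, limit):
    """Merge already-sorted hit lists and keep the best 'limit' hits."""
    merged = heapq.merge(*result_lists, reverse=True)
    return list(islice(merged, limit))


def cluster_search(nodes, term, limit=10):
    """Level 1: merge servers within each node. Level 2: merge across nodes."""
    per_node = [
        aggregate([search_partition(p, term) for p in node], limit)
        for node in nodes
    ]
    return aggregate(per_node, limit)


# Two toy nodes with two partitions each.
nodes = [
    [{"d1": "catalog catalog"}, {"d2": "catalog"}],
    [{"d3": "catalog catalog catalog"}, {"d4": "dewey"}],
]
top_hits = cluster_search(nodes, "catalog", limit=2)
```

Each aggregator only ever handles a bounded number of hits per input, so the top level does not need to see every partition's full result list.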

Better interfaces

More interactive
• Live search
• Dewey Browser

Better connected

Post-coordination of Services

Systems that expose low level services
Higher level coordination of those services
Loosely coupled services
Examples from OCLC
• Validation service
• RSS feeds
• SRU
• OpenURL, OAI-PMH
• xISBN
• DDC Browser built this way
• Very different interfaces have been built

DDC Browser XML

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/ddcbrowser/xsl/wcat.xsl" ?>
<cells>
  <language>swe</language>
  <cell ddc="330" count="23" />
  <cell ddc="331" count="28" />
  <cell ddc="332" count="5" />
  <cell ddc="333" count="7" />
  <cell ddc="334" count="2" />
  <cell ddc="335" count="1" />
  <cell ddc="336" count="3" />
  <cell ddc="337" count="2" />
  <cell ddc="338" count="26" />
  <cell ddc="339" count="5" />
</cells>
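
Because the service returns plain XML, a client other than the browser stylesheet can consume it directly. The sketch below parses an abbreviated copy of the response above with Python's standard library; the `cell_counts` function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the DDC Browser response, without the XML declaration.
SAMPLE = """<cells>
  <language>swe</language>
  <cell ddc="330" count="23" />
  <cell ddc="331" count="28" />
  <cell ddc="338" count="26" />
</cells>"""


def cell_counts(xml_text):
    """Return the language code and a {ddc_number: count} mapping."""
    root = ET.fromstring(xml_text)
    language = root.findtext("language")
    counts = {cell.get("ddc"): int(cell.get("count"))
              for cell in root.iter("cell")}
    return language, counts


language, counts = cell_counts(SAMPLE)
busiest = max(counts, key=counts.get)  # DDC number with the most records
```

The same response could feed a chart, a tag cloud, or another service, which is the point of coordinating low level services after the fact.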

Do We Need It?

Just have Google harvest everything
• Our experience with Google
• Fielded searching
• Reliable searching

Possibility of user-supplied metadata
Cost of good metadata
Cost of non-existent metadata

Conclusions

Shift to remote users
Online availability – trend towards centralization
More flexibility in implementations

Patrons are better served
Less emphasis on physical collections

Thank you

T. Hickey http://errol.oclc.org/laf/n82-54463.html

Swedish Library Association, 2005 October 28

