XML for Catalogers in 2009:
Emerging Technologies, Tools, and Trends
Kevin Reiss, Librarian
Office of Library Services
City University of New York
AJL-NYMA's 2009 Cataloging Workshop 4/22/2009
Outline
XML Basics
XML and MARC
XML Formats
Usage Scenarios
XML Tools
Experimentation & Questions
Purpose
I'm not here to teach you how to catalog in XML
Give a basic understanding of XML syntax
Put in XML in the context of library, specifically cataloging, work
Highlight usage scenarios for XML
Discuss tools for editing XML
XML Basics
Extensible Markup Language
World Wide Web Consortium (W3C) Standard
Officially a Recommendation
XML 1.0 became a Recommendation in February 1998; working drafts circulated from 1996
SGML for the Web
Standard Generalized Markup Language
Came out of the text-encoding community
Software Documentation (DocBook)
Literary Texts (TEI)
XML is:
So useful it has outlived its own hype. It is ubiquitous within most modern applications and on the web. It isn't even cool any longer.
Future Proof Your Data
Data Outlasts Code
Ian Davis, Code4Lib 2009
How many of you have lived through an ILS migration?
XML is:
The best data format we have to deal with this issue at the moment since MARC, in some respects, is becoming a liability where modern software is concerned.
XML is also:
Machine-readable
Human-readable
Platform Independent
Verbose
Unicode-compliant
Used in data-centric applications
Used in document-centric applications
Editable by any editor that can handle plain-text files
XML is a meta-language
Self describing Data
Machine-readable semantic data
You define your application vocabulary
XML applications are defined with a schema
Example: XHTML is an XML application
Adhere to a few simple rules
Hierarchy
Nested Tags
Quoted attributes
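The rules above (one hierarchy, properly nested tags, quoted attributes) are what an XML parser checks before it will accept a document. A minimal sketch using Python's standard library follows; the sample documents and the function name is_well_formed are made up for illustration and are not from the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule.
good = '<record id="1"><title>Ulysses</title></record>'
bad_nesting = '<record><title>Ulysses</record></title>'         # tags overlap
unquoted_attr = '<record id=1><title>Ulysses</title></record>'  # attribute not quoted
two_roots = '<record/><record/>'                                # no single hierarchy

for sample in (good, bad_nesting, unquoted_attr, two_roots):
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted; each of the others breaks exactly one of the rules.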
Two Approaches to Markup
Descriptive
Page Title
Paragraph one.
Paragraph two.
Procedural
Page Title
Paragraph one.
Paragraph two.
Similar Display/Different Approaches
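As an illustration of the two approaches, here is a small sketch in Python. The element names in both samples are assumptions chosen for the example, not the markup shown on the original slide. The point is that software can ask the descriptive version "what is the title?", while the procedural version only records how the text should look.

```python
import xml.etree.ElementTree as ET

# Descriptive markup (hypothetical element names): says what each piece IS.
descriptive = """<page>
  <title>Page Title</title>
  <para>Paragraph one.</para>
  <para>Paragraph two.</para>
</page>"""

# Procedural markup (hypothetical): says how each piece should LOOK.
procedural = """<body>
  <font size="6"><b>Page Title</b></font><br/>
  Paragraph one.<br/>
  Paragraph two.
</body>"""

def title_of(xml_text):
    """Return the text of a top-level <title> element, or None if absent."""
    node = ET.fromstring(xml_text).find("title")
    return None if node is None else node.text

print(title_of(descriptive))  # the title can be found by name
print(title_of(procedural))   # nothing marks the title as a title
```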
Descriptive Markup
Seeks to separate content from presentation
Which of the previous code snippets succeeds?
Descriptive markup makes data
More portable
Easier to repurpose and share
In many ways MARC is a partially descriptive, partially procedural markup language
Field/subfield definitions and validation rules
ISBD Punctuation
090 |a ML410 .S18 |b J3 2007
24500 |a J. B. Sancho : |b compositor pioner de Califòrnia = compositor pionero de California : pioneer composer of California / |c William J. Summers ... [et. al.] ; ed. Antoni Pizà.
250 |a 1a ed.
260 |a Palma : |b Universitat de les Illes Balears, |c c2007.
300 |a 366 p. : |b ill., music ; |c 30 cm. + |e 1 CD-ROM.
500 |a Parallel text in Catalan, Spanish, and English.
504 |a Includes bibliographical references and thematic catalogue of the works of J. B. Sancho.
500 |a CD-ROM contains Artaserse facsimiles; transcriptions of Misa de los Ángeles, Gloria, and Misa del sol; and audio recordings of Misa de los Ángeles and Gloria de la Misa en sol.
590 |a At GC, CD-ROMs shelved at Circulation Desk under call no.: CD-ROM 545
50500 |t Sancho : l'eminent músic de l'Alta Califòrnia / |r William J. Summers -- |t Juan Bautista Sancho : a la recerca dels orígens del primer compositor de Califòrnia i de 'estil musical primitiu de les missions / |r Craig H. Russell -- |t Els Sanzo d'Artà / |r Antoni Gili -- |t Catàleg temàtic / |r William J. Summers.
650 0 |a Composers |z California |x Biography.
60010 |a Sancho, Juan Bautista, |d 1772-1830.
60010 |a Sancho, Juan Bautista, |d 1772-1830 |v Thematic catalogs.
7001 |a Pizà, Antoni.
7001 |a Summers, William John.
7001 |a Russell, Craig H.
7001 |a Gili Ferrer, Antonio.
Procedural or Descriptive?
Basic XML Syntax
Files end in .xml
Individual XML documents are instances
Documents must adhere to a nested hierarchy
Start with an optional XML declaration
Declares XML version used
Declares the character set
The Root Element
Every document instance has only one
All other elements nest within this one
For example, every XHTML document has exactly one <html> element
Start: <html>
End: </html>
Web Page Source
Elements
Sometimes called tags
Can contain other elements and text
Must have a start tag and an end tag
Sometimes elements are empty
These must also be closed
The image element in XHTML is a good example
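A short sketch of these points in Python: an element that contains text and another element, plus an empty element that is still closed. The file name logo.png is a placeholder, and the img element here stands in for the XHTML image element mentioned above.

```python
import xml.etree.ElementTree as ET

# An element can hold text and other elements.
para = ET.Element("p")
para.text = "Library logo: "

# An empty element: it has attributes but no content.
ET.SubElement(para, "img", {"src": "logo.png"})

# On output the empty element is written in self-closing form,
# so it is still properly closed.
serialized = ET.tostring(para, encoding="unicode")
print(serialized)
```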
Elements in MODS
City and town life
Fiction
Attributes
Attached to a specific element
Must be quoted, e.g. myattribute="my attribute content"
Order is not important when attached to a given element
HTML Example
Visit Google
MARCXML Example
Ulysses
[by] James Joyce.
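To show how attributes carry the MARC tag, indicators, and subfield codes in MARCXML, here is a hedged reconstruction of the Ulysses title field. The namespace is the Library of Congress MARCXML namespace; the tag and indicator values are assumptions, since the slide's markup did not survive.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Assumed reconstruction: tag 245 with indicators 1 and 0 is a guess.
datafield_xml = f"""<datafield xmlns="{MARC_NS}" tag="245" ind1="1" ind2="0">
  <subfield code="a">Ulysses</subfield>
  <subfield code="c">[by] James Joyce.</subfield>
</datafield>"""

field = ET.fromstring(datafield_xml)

# Attributes are read by name, so their order in the file does not matter.
print(field.get("tag"), field.get("ind1"), field.get("ind2"))

# Map each subfield code (an attribute) to its text content.
subfields = {sf.get("code"): sf.text
             for sf in field.findall(f"{{{MARC_NS}}}subfield")}
print(subfields)
```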
Entities
Five reserved special characters (XML predefined general entities)
& - &amp;
> - &gt;
< - &lt;
" - &quot;
' - &apos;
MARC => MARCXML
This step requires programming
Utilize Perl Programming to parse MARC to MARCXML
PHP also has a MARC library
These have internal crosswalks that produce a MARCXML representation
MARC => MARCXML
Ulysses
[by] James Joyce.
[New York,
Random house,
1934]
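A rough sketch of what such a conversion produces, written in Python rather than Perl or PHP. It does not read binary MARC; it starts from fields already held in memory and wraps them in MARCXML elements. The tags and indicators for the Ulysses record are assumptions, and to_marcxml is a name invented for this example.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# (tag, ind1, ind2, [(subfield code, value), ...])
# Tags and indicators are assumed; the values come from the slide.
fields = [
    ("245", "1", "0", [("a", "Ulysses"), ("c", "[by] James Joyce.")]),
    ("260", " ", " ", [("a", "[New York,"), ("b", "Random house,"), ("c", "1934]")]),
]

def to_marcxml(fields):
    """Wrap in-memory MARC fields in a MARCXML <record> element."""
    record = ET.Element(f"{{{MARC_NS}}}record")
    for tag, ind1, ind2, subfields in fields:
        datafield = ET.SubElement(
            record, f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": ind1, "ind2": ind2})
        for code, value in subfields:
            subfield = ET.SubElement(
                datafield, f"{{{MARC_NS}}}subfield", {"code": code})
            subfield.text = value
    return record

record = to_marcxml(fields)
print(ET.tostring(record, encoding="unicode"))
```

Note that the ISBD punctuation and brackets travel into the XML unchanged.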
Tough Example
24500 |a J. B. Sancho : |b compositor pioner de Califòrnia = compositor pionero de California : pioneer composer of California / |c William J. Summers ... [et. al.] ; ed. Antoni Pizà.
Converting this to MARCXML will not necessarily make it any easier for a piece of software to digest
MARCXML essentially maintains MARC as it is and puts it into a parsable XML wrapper
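To make the point concrete, the sketch below parses subfield b of the tough example as it might appear in MARCXML. The XML parser returns one opaque string; pulling out the three parallel titles still means interpreting ISBD punctuation. The regular expression is a naive assumption for this one record, not a general ISBD parser.

```python
import re
import xml.etree.ElementTree as ET

subfield_xml = (
    '<subfield code="b">compositor pioner de Califòrnia = '
    'compositor pionero de California : '
    'pioneer composer of California /</subfield>'
)

# XML parsing gets us the subfield content, but only as a single string.
value = ET.fromstring(subfield_xml).text

# Software still has to understand the punctuation to go any further.
# Naive split on " = " and " : " after trimming the trailing " /".
parts = [p.strip() for p in re.split(r"\s[=:]\s", value.rstrip(" /"))]
print(parts)
```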
Other XML Formats
MARC-Derivatives
MODS (The Semantic or Readable MARC)
MARCXML
Dublin Core
MARCXML's little brother
EAD
TEI
XHTML
RSS/ Atom
RDF
Data v. Document Centric
Data Centric
Database export formats
Spreadsheet export formats
Metadata
Most cataloging formats fall into this category
Document Centric
Encoding full-text resources
Mixed content
MODS
Metadata Object Description Schema
http://www.loc.gov/standards/mods/
The semantic or descriptive XML MARC Surrogate
Inconsistent support
ILS Systems
Institutional Repositories
MADS
Metadata Authority Description Schema
http://www.loc.gov/standards/mads/
Computer programming
Computers
Programming languages
Systems Analysis
Dublin Core
Popular simple metadata format
15 basic elements
key=>value pairs
Title =
Publisher =
DC Element Name =
Qualified vocabulary available
Default format for the OAI-PMH Protocol for Metadata Harvesting
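A small sketch of the key=>value idea: a list of Dublin Core pairs turned into the oai_dc XML container used by OAI-PMH. The two namespace URIs are the standard ones; the record values and the function name to_oai_dc are invented for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Hypothetical record expressed as simple element=value pairs.
pairs = [
    ("title", "Ulysses"),
    ("creator", "Joyce, James"),
    ("publisher", "Random House"),
    ("date", "1934"),
]

def to_oai_dc(pairs):
    """Build an <oai_dc:dc> element with one child per key/value pair."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in pairs:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root

dc_record = to_oai_dc(pairs)
print(ET.tostring(dc_record, encoding="unicode"))
```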
EAD
Encoded Archival Description
Archival Finding Aids
One of the oldest XML formats
Straddles the data and document-centric worlds
Crosswalks available in MarcEdit and other places
TEI
Text Encoding Initiative
Designed to encode any kind of text
Humanities Computing Initiative
Support in the special collections community
Intellectually rich XML application
Many dialects ranging from:
Basic descriptive encoding of a text's structure
Detailed linguistic analysis
XHTML
Extensible HTML
HTML that conforms to XML rules
Has become ubiquitous on the web
Used in conjunction with Cascading Style Sheets
XHTML provides the content
CSS controls how it displays
If your Content Management System (CMS) doesn't use XHTML you are in trouble
RSS Syndication
Really Simple Syndication
An instance of RSS is known as a feed
Users can subscribe to a particular RSS feed
New additions to the feed are pushed out
RSS feeds are easily incorporated into web pages
Most web portals (e.g. your Yahoo or Google account) are built around RSS feeds
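Because a feed is just XML, incorporating one into a web page or catalog starts with a few lines of parsing. The sketch below reads a made-up RSS 2.0 feed of new titles; the feed content and URLs are placeholders, not a real library feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical "new books" feed in RSS 2.0.
feed_xml = """<rss version="2.0">
  <channel>
    <title>New Books</title>
    <link>http://library.example.org/new</link>
    <description>Recently cataloged titles</description>
    <item>
      <title>Ulysses</title>
      <link>http://library.example.org/record/1</link>
    </item>
    <item>
      <title>Dubliners</title>
      <link>http://library.example.org/record/2</link>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(feed_xml).find("channel")

# Each <item> is one entry in the feed.
items = [(item.findtext("title"), item.findtext("link"))
         for item in channel.findall("item")]

for title, link in items:
    print(title, link)
```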
In a catalog
RSS within a Catalog
RSS and Repositories
Emerging area of functionality for RSS
RSS can be used as an export protocol to a repository, i.e. turn something into a Connexion for institutional repositories
Any content creation tool could send items to a repository
SWORD (Simple Web-service Offering Repository Deposit)
Uses Atom, an XML syndication format similar to RSS, to accomplish this
http://www.swordapp.org/
RDF
Resource Description Framework
Semantic Web Technology
Linked Data using URI(L)s
Machine Readable semantics a level above what XML provides
RDF fragment of Project Gutenberg data
Sample RDF Assertion describing a Person
taken from RDF Primer
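An RDF assertion is a set of subject, predicate, object statements in which things are named by URIs. The sketch below pulls those triples out of a tiny RDF/XML fragment. The person, the example.org vocabulary, and the triples function are all hypothetical (this is not the RDF Primer's data), and the code handles only this simple rdf:Description pattern, not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/terms#"  # hypothetical vocabulary

rdf_xml = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:ex="{EX_NS}">
  <rdf:Description rdf:about="http://example.org/people/jane">
    <ex:fullName>Jane Cataloger</ex:fullName>
    <ex:mailbox rdf:resource="mailto:jane@example.org"/>
  </rdf:Description>
</rdf:RDF>"""

def triples(xml_text):
    """Yield (subject, predicate, object) for simple rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # The object is either a linked resource (a URI) or literal text.
            obj = prop.get(f"{{{RDF_NS}}}resource") or prop.text
            # ElementTree writes tags as "{namespace}local"; join them into a URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            yield (subject, predicate, obj)

statements = list(triples(rdf_xml))
for statement in statements:
    print(statement)
```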
RDA and XML
Some crosswalks in the works
XML versions of RDA will likely be produced in RDF
Early Example - Using Library of Congress MARC data
http://code.google.com/p/code4rda/wiki/MilestoneOne
RDA in RDF/XML
XML Usage Scenarios
Web Interfaces (AJAX)
Data processing (ILS go-between)
Crosswalks (MARCXML=>All of the Above)
Metadata Harvesting (OAI-PMH)
Full-text Indexing
AJAX XML Behind the Scenes
ILS Go-between Format
OCLC Connexion
Connexion records are actually created in MARCXML
Get converted to MARC for export
ILS Example - Aleph
Notices
Reports
Customizable XSL stylesheets to format the XML produced by these transactions
Crosswalks
Library of Congress
Various MARCXML crosswalks
Other formats
EAD => MARCXML
Anything to Dublin Core
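At its simplest, a crosswalk is a lookup table from one format's fields to another's. The sketch below maps a few MARCXML subfields to Dublin Core element names. The three-entry mapping is a deliberately tiny assumption for illustration; it is not the Library of Congress crosswalk, and the punctuation stripping is equally rough.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Illustrative mapping only: (MARC tag, subfield code) -> Dublin Core element.
CROSSWALK = {
    ("245", "a"): "title",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
}

# Assumed MARCXML for the Ulysses example.
record_xml = f"""<record xmlns="{MARC_NS}">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Ulysses</subfield>
    <subfield code="c">[by] James Joyce.</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">[New York,</subfield>
    <subfield code="b">Random house,</subfield>
    <subfield code="c">1934]</subfield>
  </datafield>
</record>"""

def marcxml_to_dc(xml_text):
    """Return (dc_element, value) pairs for subfields listed in CROSSWALK."""
    record = ET.fromstring(xml_text)
    pairs = []
    for datafield in record.findall(f"{{{MARC_NS}}}datafield"):
        for subfield in datafield.findall(f"{{{MARC_NS}}}subfield"):
            element = CROSSWALK.get((datafield.get("tag"), subfield.get("code")))
            if element:
                # Trim ISBD punctuation and brackets left over from MARC.
                pairs.append((element, subfield.text.strip(" ,/[]")))
    return pairs

dc_pairs = marcxml_to_dc(record_xml)
print(dc_pairs)
```

Subfields with no entry in the table, such as 245 c here, are silently dropped, which is the usual cost of crosswalking to a simpler format.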
OAI-PMH
Open Archives Initiative Protocol for Metadata Harvesting
Dublin Core is the default format here
Expose information about digital collections/repository content to the wider world
Participants in METRO grants have data available via OAI in XML
Collection List
OAI Metadata Example with Dublin Core
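A harvest begins with an ordinary web request. The sketch below builds an OAI-PMH ListRecords URL asking for Dublin Core; verb, metadataPrefix, and set are parameters defined by the protocol, while the repository address, the set name, and the function list_records_url are placeholders invented for this example. Nothing is fetched.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint; substitute a real OAI-PMH base URL.
BASE_URL = "http://repository.example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one collection (set).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

all_records = list_records_url(BASE_URL)
one_collection = list_records_url(BASE_URL, set_spec="maps")
print(all_records)
print(one_collection)
```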
Indexing XML
There are numerous full-text indexing tools for XML, some utilized by ILS systems
Parse XML into their own indexing format
Solr (actually uses its own XML format)
Lucene
Native XML Indexers
eXist
Ex Libris' Primo
Catalog Records are converted to OAI-PMH Dublin Core and then indexed
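As a sketch of what "its own XML format" means for Solr: documents are sent to the index as an add message containing doc and field elements. The field names below (id, title, author) are assumptions, since they depend on how a particular Solr schema is configured, and to_solr_add is a name made up for this example.

```python
import xml.etree.ElementTree as ET

def to_solr_add(doc_id, fields):
    """Build a Solr XML <add> message for a single document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", {"name": "id"}).text = doc_id
    for name, value in fields:
        # Field names must match the target index's schema (assumed here).
        ET.SubElement(doc, "field", {"name": name}).text = value
    return ET.tostring(add, encoding="unicode")

message = to_solr_add("rec1", [("title", "Ulysses"), ("author", "Joyce, James")])
print(message)
```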
MarcEdit
Simplest tool to integrate into existing library workflows; free to download
Direct MARC Support
Global Editing of MARC Data
Crosswalk utilities
Most useful for:
Special Collections Work
Electronic MARC Record Processing
MarcEdit Crosswalk Options
Harvest OAI Data
End of OAI Harvest in MarcEdit
Specialty Editors
Archivist's Toolkit
Useful for EAD
Also has MARC support
Oxygen
Most useful low-cost option for:
Special Collections work
Document-centric work
General XML authoring
Oxygen
Low-cost
Complete XML Management Solution
Supports the major schema languages (DTD, W3C XML Schema, RELAX NG)
XSLT Support w/debugger
Many academic users
XML Aware Editing in Oxygen
XML and Programming Languages
Strong XML support, often in the standard library, in most mainstream programming languages
Familiar data structure to programmers
Remember the tree structure?
Internationalization support via Unicode
Library data has a better chance of strong support in XML than not in XML
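The tree structure is what makes XML comfortable for programmers: one short recursive function can walk any document. A minimal sketch follows; the sample record is a loose, made-up imitation of a MODS-style record (namespaces omitted), and outline is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, loosely modeled on MODS.
sample = """<mods>
  <titleInfo><title>Ulysses</title></titleInfo>
  <subject>
    <topic>City and town life</topic>
    <genre>Fiction</genre>
  </subject>
</mods>"""

def outline(element, depth=0):
    """Return the element names of a tree as indented lines."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(sample))
print("\n".join(tree_lines))
```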
MARC and Programming Languages
Full Support by a small number of software vendors
Perl, PHP, Python, and Ruby all have MARC libraries, with varying levels of support
MARC tools in these languages are typically:
Specialty modules
Maintained by a small, but dedicated group of programmers
Not part of most languages' standard distribution
For Future Reference
A classic introduction to basic XML concepts from the TEI: "A Gentle Introduction to XML"
Terry Reese's Weblog
Watch for how RDA interacts with XML
Eric Lease Morgan's Workshop for those with a more technical bent - XML in Libraries
Conclusion
XML is just a tool
It is a useful one
The intellectual work of cataloging will still be the same
Relying on the MARC format as our primary data store is becoming problematic