Xml Overview

XML for Catalogers in 2009:
Emerging Technologies, Tools, and Trends

Kevin [email protected] Librarian

Office of Library Services

City University of New York

AJL-NYMA's 2009 Cataloging Workshop 4/22/2009

Outline

XML Basics

XML and MARC

XML Formats

Usage Scenarios

XML Tools

Experimentation & Questions

Purpose

I'm not here to teach you how to catalog in XML

Give a basic understanding of XML syntax

Put XML in the context of library work, specifically cataloging

Highlight usage scenarios for XML

Discuss tools for editing XML

XML Basics

Extensible Markup Language

World Wide Web Consortium (W3C) Standard

Officially a Recommendation

First published as a W3C Recommendation in 1998 (working drafts from 1996)

SGML for the Web

Standard Generalized Markup Language

Came out of the text-encoding community

Software Documentation (DocBook)

Literary Texts (TEI)

XML is:

So useful it has outlived its own hype. It is ubiquitous within most modern applications and on the web. It isn't even cool any longer.

Future Proof Your Data

Data Outlasts Code

Ian Davis, Code4lib 2009

How many of you have lived through an ILS migration?

XML is:

The best data format we currently have for dealing with this issue, since MARC is, in some respects, becoming a liability where modern software is concerned.

XML is also:

Machine-readable

Human-readable

Platform Independent

Verbose

Unicode-compliant

Used in data-centric applications

Used in document-centric applications

Editable by any editor that can handle plain-text files

XML is a meta-language

Self describing Data

Machine-readable semantic data

You define your application vocabulary

XML applications are defined with a schema

Example: XHTML is an XML application

Adhere to a few simple rules

Hierarchy

Nested Tags

Quoted attributes
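The rules above (a single hierarchy, properly nested tags, quoted attributes) are what a parser checks when it decides whether a document is well-formed. A minimal sketch using Python's standard library; the sample records and the `is_well_formed` helper are invented for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags and a quoted attribute: well-formed.
good = '<record id="1"><title>Ulysses</title></record>'
# Overlapping tags break the nesting rule.
bad_nesting = '<record><title>Ulysses</record></title>'
# Attribute values must be quoted.
bad_attribute = '<record id=1><title>Ulysses</title></record>'

for sample in (good, bad_nesting, bad_attribute):
    print(is_well_formed(sample), sample)
```

Well-formedness only covers syntax. Whether the element names make sense for a given application is a separate question answered by a schema.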

Two Approaches to Markup

Descriptive

Page Title Paragraph one. Paragraph two.

Procedural

Page Title

Paragraph one.

Paragraph two.

Similar Display/Different Approaches

Descriptive Markup

Seeks to separate content from presentation

Which of the previous code snippets succeeds?

Descriptive markup makes data

More portable

Easier to repurpose and share
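One way to see why descriptive markup is easier to repurpose: when element names say what the content is, software can pick out the pieces it needs and present them differently. The sketch below is illustrative only; the `page`, `title`, and `para` element names are assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: element names describe the content,
# not how it should look.
descriptive = """<page>
  <title>Page Title</title>
  <para>Paragraph one.</para>
  <para>Paragraph two.</para>
</page>"""

page = ET.fromstring(descriptive)

# Repurposing 1: pull out just the title, e.g. for an index.
title = page.findtext("title")

# Repurposing 2: render the same content as plain text.
paragraphs = [para.text for para in page.findall("para")]
plain_text = "\n\n".join([title.upper()] + paragraphs)

print(title)
print(plain_text)
```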

In many ways MARC is a partially descriptive, partially procedural markup language

Field/subfield definitions and validation rules

ISBD Punctuation

090 |a ML410 .S18 |b J3 2007
24500 |a J. B. Sancho : |b compositor pioner de Califòrnia = compositor pionero de California : pioneer composer of California / |c William J. Summers ... [et. al.] ; ed. Antoni Pizà.
250 |a 1a ed.
260 |a Palma : |b Universitat de les Illes Balears, |c c2007.
300 |a 366 p. : |b ill., music ; |c 30 cm. + |e 1 CD-ROM.
500 |a Parallel text in Catalan, Spanish, and English.
504 |a Includes bibliographical references and thematic catalogue of the works of J. B. Sancho.
500 |a CD-ROM contains Artaserse facsimiles; transcriptions of Misa de los Ángeles, Gloria, and Misa del sol; and audio recordings of Misa de los Ángeles and Gloria de la Misa en sol.
590 |a At GC, CD-ROMs shelved at Circulation Desk under call no.: CD-ROM 54
50500 |t Sancho : l'eminent músic de l'Alta Califòrnia / |r William J. Summers -- |t Juan Bautista Sancho : a la recerca dels orígens del primer compositor de Califòrnia i de l'estil musical primitiu de les missions / |r Craig H. Russell -- |t Els Sanzo d'Artà / |r Antoni Gili -- |t Catàleg temàtic / |r William J. Summers.
650 0 |a Composers |z California |x Biography.
60010 |a Sancho, Juan Bautista, |d 1772-1830.
60010 |a Sancho, Juan Bautista, |d 1772-1830 |v Thematic catalogs.
7001 |a Pizà, Antoni.
7001 |a Summers, William John.
7001 |a Russell, Craig H.
7001 |a Gili Ferrer, Antonio.

Procedural or Descriptive?

Basic XML Syntax

Files end in .xml

Individual XML documents are instances

Documents must adhere to a nested hierarchy

Start with an optional XML declaration

Declares XML version used

Declares the character set

The Root Element

Every document instance has only one

All other elements nest within this one

For example, every XHTML document has only one <html> element

Start: <html>

End: </html>

Web Page Source

Elements

Sometimes called tags

Can contain other elements and text

Must have a start tag and an end tag

Sometimes elements are empty

These must also be closed

The image element in XHTML is a good example

Elements in MODS


City and town life
Fiction
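To make the element rules concrete, here is a small sketch that reads a MODS-style subject and lists its child elements. The fragment is an assumption built around the two values shown above; the namespace URI is the standard MODS one, and the `subject`/`topic`/`genre` nesting is illustrative.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Assumed fragment: a subject element containing two child elements.
fragment = """<mods xmlns="http://www.loc.gov/mods/v3">
  <subject authority="lcsh">
    <topic>City and town life</topic>
    <genre>Fiction</genre>
  </subject>
</mods>"""

root = ET.fromstring(fragment)
subject = root.find("mods:subject", MODS_NS)

# Each child element has a name (its tag) and text content.
# ElementTree stores tags as "{namespace}name", so strip the namespace part.
parts = [(child.tag.split("}")[1], child.text) for child in subject]
print(parts)
```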

Attributes

Attached to a specific element

Must be quoted, e.g. myattribute="my attribute content"

Order is not important when attached to a given element

HTML Example

Visit Google

MARCXML Example


Ulysses
[by] James Joyce.
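A quick way to confirm that attribute order does not matter is to parse the same element written two ways and compare the results. The sketch below uses a MARCXML-style `datafield` element; the indicator values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The same element with its attributes written in a different order.
first = ET.fromstring('<datafield tag="245" ind1="1" ind2="0"/>')
second = ET.fromstring('<datafield ind2="0" tag="245" ind1="1"/>')

# Attributes are exposed as name/value pairs, so order plays no role.
same_attributes = first.attrib == second.attrib

print(first.get("tag"), same_attributes)
```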

Entities

Five reserved special characters (XML predefined entities)

&amp; - &

&gt; - >

&lt; - <

&quot; - "

&apos; - '

MARC => MARCXML

This step requires programming

Utilize Perl Programming to parse MARC to MARCXML

PHP also has a MARC library

These have internal crosswalks that produce a MARCXML representation

MARC => MARCXML


Ulysses
[by] James Joyce.


[New York,
Random house,
1934]
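The conversion itself is mostly mechanical: each MARC field becomes a `datafield` element, and each subfield becomes a `subfield` element. A minimal sketch with Python's standard library, assuming the fields have already been read into simple tuples (the tuple layout and the `to_marcxml` helper are hypothetical; real projects would use a MARC library to read the binary record):

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Hypothetical in-memory form: (tag, indicator 1, indicator 2, subfields).
fields = [
    ("245", "1", "0", [("a", "Ulysses"), ("c", "[by] James Joyce.")]),
    ("260", " ", " ", [("a", "[New York,"), ("b", "Random house,"), ("c", "1934]")]),
]


def to_marcxml(fields):
    """Wrap MARC fields in MARCXML record/datafield/subfield elements."""
    ET.register_namespace("", MARC_NS)  # serialize with a default namespace
    record = ET.Element(f"{{{MARC_NS}}}record")
    for tag, ind1, ind2, subfields in fields:
        datafield = ET.SubElement(
            record,
            f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": ind1, "ind2": ind2},
        )
        for code, value in subfields:
            subfield = ET.SubElement(datafield, f"{{{MARC_NS}}}subfield", {"code": code})
            subfield.text = value
    return record


record = to_marcxml(fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Note that the content of each subfield, punctuation included, is carried over unchanged.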

Tough Example

24500 |a J. B. Sancho : |b compositor pioner de Califòrnia = compositor pionero de California : pioneer composer of California / |c William J. Summers ... [et. al.] ; ed. Antoni Pizà.

Converting this to MARCXML will not necessarily make it any easier for a piece of software to digest

MARCXML essentially maintains MARC as it is and puts it into a parsable XML wrapper
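The sketch below illustrates the point under some assumptions: the MARCXML fragment is a hand-written rendering of the 245 field above, and the splitting rule is a deliberately naive guess at ISBD punctuation handling. An XML parser gets as far as the subfield boundary; separating the three parallel titles inside subfield b is still left to the program.

```python
import re
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written MARCXML for the 245 field (an assumption for illustration).
datafield_xml = """<datafield xmlns="http://www.loc.gov/MARC21/slim" tag="245" ind1="0" ind2="0">
  <subfield code="a">J. B. Sancho :</subfield>
  <subfield code="b">compositor pioner de Califòrnia = compositor pionero de California : pioneer composer of California /</subfield>
  <subfield code="c">William J. Summers ... [et. al.] ; ed. Antoni Pizà.</subfield>
</datafield>"""

field = ET.fromstring(datafield_xml)

# XML parsing stops here: subfield b is still one undifferentiated string.
subfield_b = field.find("marc:subfield[@code='b']", NS).text

# Naive split on ISBD punctuation (" = " and " : "), then trim the trailing " /".
pieces = [piece.strip(" /") for piece in re.split(r" = | : ", subfield_b)]

for piece in pieces:
    print(piece)
```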

Other XML Formats

MARC-Derivatives

MODS (The Semantic or Readable MARC)

MARCXML

Dublin Core

MARCXML's little brother

EAD

TEI

XHTML

RSS/ Atom

RDF

Data v. Document Centric

Data Centric

Database export formats

Spreadsheet export formats

Metadata

Most cataloging formats fall into this category

Document Centric

Encoding full-text resources

Mixed content

MODS

Metadata Object Description Schema

http://www.loc.gov/standards/mods/

The semantic or descriptive XML MARC Surrogate

Inconsistent support

ILS Systems

Institutional Repositories

MADS

Metadata Authority Description Schema

http://www.loc.gov/standards/mads/


Computer programming



Computers


Programming languages


Systems Analysis

Dublin Core

Popular simple metadata format

15 basic elements

key=>value pairs

Title =

Publisher =

DC Element Name =

Qualified vocabulary available

Default metadata format for OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting)
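Because Dublin Core is essentially a flat list of key=>value pairs, producing it in XML takes very little code. A minimal sketch, assuming invented values and a plain `record` wrapper element (the wrapper is an assumption; the `dc` namespace URI is the standard one for the 15 basic elements):

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Illustrative key=>value pairs; the values are assumptions.
pairs = [
    ("title", "Ulysses"),
    ("creator", "Joyce, James"),
    ("publisher", "Random House"),
    ("date", "1934"),
]


def to_dublin_core(pairs):
    """Turn (element name, value) pairs into simple Dublin Core XML."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in pairs:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


dc_record = to_dublin_core(pairs)
dc_xml = ET.tostring(dc_record, encoding="unicode")
print(dc_xml)
```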

EAD

Encoded Archival Description

Archival Finding Aids

One of the oldest XML formats

Straddles the data and document-centric worlds

Crosswalks available in MarcEdit and other places

TEI

Text Encoding Initiative

Designed to encode any kind of text

Humanities Computing Initiative

Support in the special collections community

Intellectually rich XML application

Many dialects ranging from:

Basic descriptive encoding of a text's structure

Detailed linguistic analysis

XHTML

Extensible HTML

HTML that conforms to XML rules

Has become ubiquitous on the web

Used in conjunction with Cascading Style Sheets

XHTML provides the content

CSS controls how it displays

If your Content Management System (CMS) doesn't use XHTML you are in trouble

RSS Syndication

Really Simple Syndication

An instance of RSS is known as a feed

Users can subscribe to a particular RSS feed

New additions to the feed reach subscribers automatically (their feed readers check the feed for updates)

RSS feeds are easily incorporated into webpages

Most web portals (e.g. your Yahoo or Google account) are built around RSS feeds
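Part of the reason feeds are so easy to incorporate elsewhere is that an RSS feed is just a small, predictable XML document. A sketch of reading one with Python's standard library; the feed content and the `library.example.org` URLs are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed of newly cataloged titles.
feed_xml = """<rss version="2.0">
  <channel>
    <title>New Books</title>
    <link>http://library.example.org/new</link>
    <description>Recently cataloged titles</description>
    <item>
      <title>Ulysses</title>
      <link>http://library.example.org/record/1</link>
    </item>
    <item>
      <title>J. B. Sancho</title>
      <link>http://library.example.org/record/2</link>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(feed_xml).find("channel")
feed_title = channel.findtext("title")

# Every <item> is one entry in the feed.
items = [(item.findtext("title"), item.findtext("link")) for item in channel.findall("item")]

print(feed_title)
for title, link in items:
    print(title, link)
```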

In a catalog

RSS within a Catalog

RSS and Repositories

Emerging area of functionality for RSS

RSS can be used as an export protocol to a repository, i.e. to turn a tool into something like Connexion for an institutional repository

Any content creation tool could send items to a repository

SWORD (Simple Web-service Offering Repository Deposit)

Uses Atom, a syndication format closely related to RSS, to accomplish this

http://www.swordapp.org/

RDF

Resource Description Framework

Semantic Web Technology

Linked Data using URI(L)s

Machine Readable semantics a level above what XML provides

RDF fragment of Project Gutenberg data

Sample RDF Assertion describing a Person
taken from RDF Primer
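An RDF assertion boils down to subject, predicate, object statements, each identified by URIs. The sketch below is not the RDF Primer example; it uses an invented person at `example.org` and handles only the simplest RDF/XML pattern (a typed node with `rdf:about` and flat properties), which is an assumption made to keep it short. Real work would use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RDF/XML description of a person, using the FOAF vocabulary.
rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/people/jane">
    <foaf:name>Jane Example</foaf:name>
    <foaf:homepage rdf:resource="http://example.org/jane/"/>
  </foaf:Person>
</rdf:RDF>"""


def clark_to_uri(tag):
    """Convert ElementTree's "{namespace}name" form into a plain URI."""
    return tag[1:].replace("}", "", 1)


triples = []
for node in ET.fromstring(rdf_xml):
    subject = node.get(f"{{{RDF}}}about")
    # The element name of the node states its type.
    triples.append((subject, RDF + "type", clark_to_uri(node.tag)))
    for prop in node:
        # The object is either another resource (a URI) or a literal value.
        obj = prop.get(f"{{{RDF}}}resource") or prop.text
        triples.append((subject, clark_to_uri(prop.tag), obj))

for triple in triples:
    print(triple)
```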

RDA and XML

Some crosswalks in the works

XML versions of RDA will likely be produced in RDF

Early Example - Using Library of Congress MARC data

http://code.google.com/p/code4rda/wiki/MilestoneOne

RDA in RDF/XML

XML Usage Scenarios

Web Interfaces (AJAX)

Data processing (ILS go-between)

Crosswalks (MARCXML=>All of the Above)

Metadata Harvesting (OAI-PMH)

Full-text Indexing

AJAX XML Behind the Scenes

ILS Go-between Format

OCLC Connexion

Connexion records are actually created in MARCXML

Get converted to MARC for export

ILS Example - Aleph

Notices

Reports

Customizable XSL stylesheets to format the XML produced by these transactions

Crosswalks

Library of Congress

Various MARCXML crosswalks

Other formats

EAD => MARCXML

Anything to Dublin Core
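At its core a crosswalk is a mapping table plus some cleanup. The sketch below is a toy version, not the Library of Congress stylesheet: the three-entry mapping, the plain `record` wrapper, and the punctuation trimming are all simplifying assumptions.

```python
import xml.etree.ElementTree as ET

MARC = {"marc": "http://www.loc.gov/MARC21/slim"}
DC_NS = "http://purl.org/dc/elements/1.1/"

# Simplified, assumed mapping: (MARC tag, subfield code) -> Dublin Core element.
CROSSWALK = {("245", "a"): "title", ("260", "b"): "publisher", ("260", "c"): "date"}

marcxml = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Ulysses</subfield>
    <subfield code="c">[by] James Joyce.</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">[New York,</subfield>
    <subfield code="b">Random house,</subfield>
    <subfield code="c">1934]</subfield>
  </datafield>
</record>"""


def crosswalk(record):
    """Copy mapped MARCXML subfields into Dublin Core elements."""
    dc = ET.Element("record")  # hypothetical wrapper element
    for datafield in record.findall("marc:datafield", MARC):
        for subfield in datafield.findall("marc:subfield", MARC):
            name = CROSSWALK.get((datafield.get("tag"), subfield.get("code")))
            if name:
                # Trim leading/trailing ISBD punctuation and brackets.
                ET.SubElement(dc, f"{{{DC_NS}}}{name}").text = subfield.text.strip(" ,[]/:")
    return dc


dc = crosswalk(ET.fromstring(marcxml))
result = [(element.tag.split("}")[1], element.text) for element in dc]
print(result)
```

Unmapped subfields (here 245 c and 260 a) are simply dropped, which is why crosswalks to simpler formats usually lose information.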

OAI - PMH

Open Archives Initiative Protocol for Metadata Harvesting

Dublin Core is the default format here

Expose information about digital collections/repository content to the wider world

Participants in METRO grants have data available via OAI in XML

Collection List

OAI Metadata Example with Dublin Core
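Harvesting starts with an ordinary web request. The sketch below builds an OAI-PMH `ListRecords` request URL; `verb`, `metadataPrefix`, `set`, and the `oai_dc` prefix come from the protocol, while the base URL and the `list_records_url` helper are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optionally limit the harvest to one set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address.
url = list_records_url("http://repository.example.org/oai")
print(url)
```

The repository answers with an XML document containing the records, by default in Dublin Core.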

Indexing XML

There are numerous full-text indexing tools for XML, some utilized by ILS systems

Parse XML into their own indexing format

Solr (actually uses its own XML format)

Lucene

Native XML Indexers

eXist

Ex Libris' Primo

Catalog records are converted to OAI-PMH Dublin Core and then indexed

MarcEdit

Simplest tool to integrate into existing library workflows; freely downloadable

Direct MARC Support

Global Editing of MARC Data

Crosswalk utilities

Most useful for:

Special Collections Work

Electronic MARC Record Processing

MarcEdit Crosswalk Options

Harvest OAI Data

End of OAI Harvest in MarcEdit

Specialty Editors

Archivists' Toolkit

Useful for EAD

Also has MARC support

Oxygen

Most useful low-cost option for:

Special Collections work

Document-centric work

General authoring XML

Oxygen

Low-cost

Complete XML Management Solution

Supports all types of XML Schema

XSLT Support w/debugger

Many academic users

XML Aware Editing in Oxygen

XML and Programming Languages

Strong XML support in nearly all modern programming languages

Familiar data structure to programmers

Remember the tree structure?

Internationalization support via Unicode

Library data has a better chance of strong support in XML than not in XML
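The tree structure is what makes XML feel familiar to programmers: a few lines of recursion can walk any document. A minimal sketch with Python's standard library; the sample record and the `outline` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record used only to show the tree shape.
doc = ET.fromstring(
    "<record><title>Ulysses</title>"
    "<imprint><place>New York</place><date>1934</date></imprint></record>"
)


def outline(element, depth=0):
    """Return one indented line per element, children below their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(doc)
print("\n".join(tree_lines))
```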

MARC and Programming Languages

Full Support by a small number of software vendors

Perl, PHP, Python, and Ruby all have MARC libraries, with varying levels of support

Marc tools in these languages are typically:

Specialty modules

Maintained by a small but dedicated group of programmers

Not part of most languages' standard distribution

For Future Reference

A classic introduction to basic XML concepts from the TEI: A Gentle Introduction to XML

Terry Reese's Weblog

Watch for how RDA interacts with XML

For those with a more technical bent, Eric Lease Morgan's workshop: XML in Libraries

Conclusion

XML is just a tool

It is a useful one

The intellectual work of cataloging will still be the same

Relying on the MARC format as our primary data store is becoming problematic