
AND WHY DO I CARE?

What is XML?

In the age of Google, why have fielded data?

More efficient for both data entry and for systems to search, retrieve and ingest

Parsed, discretely fielded data can be recombined mechanically for a variety of outputs and uses, including XML

A popular YouTube video that illustrates the power of XML:

“The Machine is Us/ing Us” http://youtube.com/watch?v=NLlGopyXT_g

By Michael Wesch, an Assistant Professor of Cultural Anthropology at Kansas State University, this clip illustrates how he can supply the same data content to many Web 2.0 sites. The same principles can be applied to the model of supplying data to various software interfaces and tools in an automated fashion. Stop and watch it now; it will get you in the XML mood!

So…..? This changes the landscape of digital tools for users and support staff

It is no longer a matter of “one-size-fits-all” tools, but a new scenario of multiple tools to fit the users and the use. Supporting multiple tools is less of a burden because the data can be generated once and then automatically transformed by XML stylesheets for each tool, interface, or digital collection.

What is XML?

Extensible Markup Language (XML) is a universal language for sharing data between applications. XML is most appropriate for situations where the volume of data is generally small, as the data is transmitted as text, and controlling the structure of the data is important.

TRANSLATION: It shuffles data between applications, and users can grab it and send it to a new application too

What XML does

Tags information

Facilitates transfer of that information between applications and also out to the Web (Web 2.0)

Allows information to be provided by schemas, which organize information and can represent standards (like MARC or VRA Core 4 or Dublin Core)

How does XML work?

It “tags” data—identifies what that data is (what meaning it holds).

MARC tags by using numeric designators: for instance, a “245” field is always a title, and a “700” or “7xx” field is a personal name (creator)

MARC example

XML tags

XML tags with natural language—easy to see what the information (the data value) is within the “chicken lips”

< >

XML example (in VRA Core 4)

<!-- AGENT -->
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
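Because every value sits inside a named tag, a program can pull out exactly the piece it needs. The following is a minimal sketch, assuming Python and its standard-library ElementTree parser (neither is part of the original slides); the function name read_agent and the dictionary keys are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The agent record from the slide above, kept as a plain string.
AGENT_XML = """
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
""".strip()


def read_agent(xml_text):
    """Return the tagged values of one agent as a dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    agent = root.find("./index/agent")
    name = agent.find("name")
    return {
        "display": root.findtext("display"),
        "name": name.text,
        "authority": name.get("vocab"),      # attribute on the <name> tag
        "authority_id": name.get("refid"),
        "born": agent.findtext("dates/earliestDate"),
        "died": agent.findtext("dates/latestDate"),
        "culture": agent.findtext("culture"),
        "role": agent.findtext("role"),
    }


record = read_agent(AGENT_XML)
print(record["name"], "-", record["role"])
```

The tag names do the work here: the code asks for "role" or "culture" by name rather than guessing at a position in a line of text.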

Schema: Where the data standard and XML meet

Once a data standard like VRA Core 4.0 is devised, with all the elements and qualifiers laid out, the standard can be expressed in a single XML document called the schema. The schema serves as a road map for writing an XSLT stylesheet that tells a database (or another type of application) how to export data into (Core 4) XML. A schema is a set of rules to which an XML document must conform to be “valid”.

VRA Core 4.0 XML schema (a small sample)

<!-- Agent -->
<xsd:complexType name="agentType">
  <xsd:annotation>
    <xsd:documentation>VRA Agent element. Subelements are used for different types of data (names, roles, dates, etc.). At least one subelement must be provided.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence minOccurs="1" maxOccurs="unbounded">
    <xsd:element name="attribution" type="basicString" minOccurs="0" />
    <xsd:element name="culture" type="basicString" minOccurs="0" />
    <xsd:element name="dates" type="agentDateType" minOccurs="0" />
    <xsd:element name="name" type="agentNameType" minOccurs="0" />
    <xsd:element name="role" type="basicString" minOccurs="0" />
  </xsd:sequence>
  <xsd:attributeGroup ref="vraAttributes" />
</xsd:complexType>
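To make “valid” concrete, here is a rough sketch of the kind of rule checking a schema enables. It is an assumption-laden illustration in Python, not real XSD validation: Python's standard library has no schema validator, so the check below hand-codes only two rules read off the sample above (the allowed subelement names, and “at least one subelement must be provided”) and ignores element order, attributes, and data types. The function name check_agent is hypothetical.

```python
import xml.etree.ElementTree as ET

# Subelement names copied from the agentType sample above.
ALLOWED_AGENT_SUBELEMENTS = {"attribution", "culture", "dates", "name", "role"}


def check_agent(agent):
    """Return a list of problems found in one <agent> element (empty list = passes)."""
    problems = []
    children = list(agent)
    if not children:
        problems.append("agent has no subelements")
    for child in children:
        if child.tag not in ALLOWED_AGENT_SUBELEMENTS:
            problems.append(f"unexpected subelement <{child.tag}>")
    return problems


good = ET.fromstring(
    "<agent><name>Cropsey, Jasper Francis</name><role>painter</role></agent>"
)
bad = ET.fromstring("<agent><nickname>JFC</nickname></agent>")

print(check_agent(good))
print(check_agent(bad))
```

In practice a schema-aware XML editor or validator applies the published .xsd file directly, so nobody has to write checks like these by hand.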

XML example (compare this output to the previous slide: the schema outline for the agent data element)

<!-- AGENT -->
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>

What is XSLT?

You can export XML data from FileMaker or Access (and many other programs) to use in an assortment of applications simply by applying the appropriate Extensible Stylesheet Language Transformations (XSLT) stylesheet. XSLT is also XML-based. You can use a stylesheet to take an XML document and turn it into plain text, PDF documents, web pages, or to import fielded data into other applications.
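The idea of “one source, several outputs” can be sketched in a few lines. The example below is written in Python rather than XSLT, purely as a stand-in to show the principle; it is not how FileMaker or an XSLT processor works internally. The field names AgentDisplay and AgentSortName echo the stylesheet sample in these slides, while AgentRole and all function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fielded record, as a database might hold it.
record = {
    "AgentDisplay": "Jasper Francis Cropsey (American painter, 1823-1900)",
    "AgentSortName": "Cropsey, Jasper Francis",
    "AgentRole": "painter",
}


def to_xml(rec):
    """Wrap the fielded values in tags (the 'export' step)."""
    set_el = ET.Element("set")
    ET.SubElement(set_el, "display").text = rec["AgentDisplay"]
    index = ET.SubElement(set_el, "index")
    agent = ET.SubElement(index, "agent")
    ET.SubElement(agent, "name", type="personal").text = rec["AgentSortName"]
    ET.SubElement(agent, "role").text = rec["AgentRole"]
    return set_el


def to_plain_text(set_el):
    """Output 1: the display string, ready for a label or caption."""
    return set_el.findtext("display")


def to_html(set_el):
    """Output 2: a web-page list item built from the indexed values."""
    name = set_el.findtext("index/agent/name")
    role = set_el.findtext("index/agent/role")
    li = ET.Element("li")
    li.text = f"{name} ({role})"
    return ET.tostring(li, encoding="unicode")


doc = to_xml(record)
print(ET.tostring(doc, encoding="unicode"))
print(to_plain_text(doc))
print(to_html(doc))
```

An XSLT stylesheet expresses the same kind of mapping declaratively, in XML, so the database only has to be told which stylesheet to apply.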

XSLT sample: how the XML is actually exported from a database (in this case FileMaker Pro)

<!-- Agent -->
<set>
  <display><xsl:value-of select="fm:AgentDisplay" /></display>
  <index>
    <xsl:for-each select="fm:AgentSortName/fm:DATA">
      <xsl:variable name="i"><xsl:value-of select="position()" /></xsl:variable>
      <agent>

File Extensions for the 3 parts of XML

So when you see these file extensions, you will know what you are looking at:

The XML document is .xml

The XML schema is .xsd

The XSLT stylesheet is .xsl

Ummm, yeah, OK

Will you do coding/tagging for schemas? (No, you will use schemas provided/published for standards—MARC (MODS), VRA 4.0, CDWA lite, etc.)

Will you do coding/tagging for XSLT? (Maybe, if you take a class and are interested. More likely you will get tech support or support from user groups)

Will you be able to look at an XML document and basically understand it and edit it? (Yes, this is similar to learning HTML and HTML editors)

So how does this fit into my cataloging?

VRA Core 4 and CCO were both formed with an eye to output and expression in XML

They can be used in “flat” systems, but there is a clear benefit to using relational databases, and XML is also good at capturing/transmitting relational structure

Relational Databases

Relate information stored in multiple tables

Ideally, there is no redundancy of data entry: each value that might be reused in data entry is only entered once and stored in one table that is related for use everywhere else in the database (made available anywhere needed in the data entry workflow)

Numeric keys are normally used in this process
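A small sketch of how numeric keys avoid redundant data entry, using plain Python dictionaries in place of database tables. All key values, table names, and the example work title are hypothetical; only the Cropsey agent values come from the slides.

```python
# Agent authority table: each agent is entered once, under a numeric key.
agents = {
    7: {"sort_name": "Cropsey, Jasper Francis", "nationality": "American"},
}

# Work table: the work points at its agent by key instead of retyping the name.
works = {
    1: {"title": "Example landscape (hypothetical work)", "agent_id": 7},
}

# Image table: each image points at its work by key.
images = [
    {"image_id": 1001, "work_id": 1, "view": "overall"},
    {"image_id": 1002, "work_id": 1, "view": "detail"},
]


def image_record(image):
    """Assemble the full description of an image by following the keys."""
    work = works[image["work_id"]]
    agent = agents[work["agent_id"]]
    return {
        "image_id": image["image_id"],
        "view": image["view"],
        "title": work["title"],
        "agent": agent["sort_name"],
        "nationality": agent["nationality"],
    }


for img in images:
    print(image_record(img))
```

Both images inherit the same work and agent information, and a correction made once in the authority table shows up everywhere the key is used.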

Excel sample (“flat file” output)

Notice that each row represents an image file and conflates the work and image records (repeats the information about the work for each image).

Each repeating value (like Artist) must have a column reserved for possible use.

A pithy answer to “why relational?” (for cataloging)

Message from Jan Eklund to VRA-L, Feb 20, 2008, subject: Re: CONTENTdm and metadata (search list archive for full message)

Complexity: “complexity cannot be captured efficiently in a flat data model because basically you have to leave space in every record to accommodate the most complex object you will ever encounter. This adds up to a lot of wasted space, and wasted space means more money…”

Consistency: “all the descriptive data about the work is entered once, and every image that shows this work inherits the same information”

Image and Work records (example from VCat)

A note field is possible for every Core 4 element

Repeating values are supported for each element

Numeric key

“indexed” value (in this case the sort name)

“display” value done to CCO recommended formatting. Note that the Agent Nationality is supplied automatically here by the link (numeric key) to the Agent Authority

Authority record

Numeric key

All the information about the agent is supplied from this file on the basis of the numeric key

<agentSet>
  <display>ACT Architecture (French architectural firm, ca. 1982-present); Gaetana Aulenti (Italian interior designer, born 1927); Victor Alexandre Frédéric Laloux (French architect, 1850-1937)</display>
  <notes>ACT Architecture (Renaud Bardon, Pierre Colboc and Jean-Paul Philippon)</notes>
  <agent>
    <name vocab="ULAN" refid="500023967" type="personal">Laloux, Victor Alexandre Frédéric</name>
    <dates type="life">
      <earliestDate>1850</earliestDate>
      <latestDate>1937</latestDate>
    </dates>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="LCNAF" refid="nr 95039966" type="corporate">ACT Architecture</name>
    <dates type="activity">
      <earliestDate>1982</earliestDate>
      <latestDate>2082</latestDate>
    </dates>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="ULAN" refid="500031019" type="personal">Aulenti, Gaetana</name>
    <dates type="life">
      <earliestDate>1927</earliestDate>
      <latestDate>9999</latestDate>
    </dates>
    <culture>Italian</culture>
  </agent>
</agentSet>

The same information expressed in Core 4 XML—this is automatically output from the database

The Element Set of Core 4

Format and Global Attributes

Reciprocity in Relationships

Easy to show relationships between works in a relational database and via XML. In this case the XSLT stylesheet (in conjunction with programming within the database) can be written to supply the reciprocity (the other related work) based on the numeric key.

Stylesheets can do a lot!

They literally do “transformations”—they can change the XML into other formats, they can recombine parsed information—and they can even take that more efficient and consistent relational data and “flatten” it, and output it in csv (Excel) for import into delivery systems or other uses that are not yet XML-compatible!
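As a rough illustration of “flattening”, the sketch below turns a trimmed copy of the Core 4 agentSet from the earlier slide into CSV rows, one row per agent. It is written in Python with the standard library as a stand-in for what a stylesheet would do; the column names and function names are assumptions chosen for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two of the agents from the Core 4 example, trimmed for brevity.
AGENT_SET = """
<agentSet>
  <agent>
    <name vocab="ULAN" refid="500023967" type="personal">Laloux, Victor Alexandre Frédéric</name>
    <dates type="life"><earliestDate>1850</earliestDate><latestDate>1937</latestDate></dates>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="LCNAF" refid="nr 95039966" type="corporate">ACT Architecture</name>
    <dates type="activity"><earliestDate>1982</earliestDate><latestDate>2082</latestDate></dates>
    <culture>French</culture>
  </agent>
</agentSet>
""".strip()

COLUMNS = ["name", "vocab", "refid", "earliest", "latest", "culture"]


def flatten(xml_text):
    """Turn each nested <agent> into one flat row."""
    rows = []
    for agent in ET.fromstring(xml_text).findall("agent"):
        name = agent.find("name")
        rows.append({
            "name": name.text,
            "vocab": name.get("vocab"),
            "refid": name.get("refid"),
            "earliest": agent.findtext("dates/earliestDate"),
            "latest": agent.findtext("dates/latestDate"),
            "culture": agent.findtext("culture"),
        })
    return rows


def to_csv(rows):
    """Write the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten(AGENT_SET)
print(to_csv(rows))
```

Names that contain commas are quoted automatically by the csv module, so “Laloux, Victor Alexandre Frédéric” stays in a single column.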

Other Data Standards (field structures) and XML

MARC; MODS

CDWA

Dublin Core

VRA Core 4.0

EAD

METS

MARC—Machine Readable Cataloging

Emerged from a Library of Congress-led initiative that began in the 1960s for bibliographic (reprographic) materials

Uses numeric tags to designate the fields (“245” means title, “700” fields are makers/creators etc)

This enabled computer protocols to share data worldwide

“The future of the MARC formats is a matter of some debate in the worldwide library science community. On the one hand, the formats are quite complex and are based on outdated technology. On the other, there is no alternative bibliographic format with an equivalent degree of granularity. The huge user base, billions of records in tens of thousands of individual libraries, also creates inertia” (Wikipedia entry)

MODS—Metadata Object Description Schema

A schema that allows the traditional numerically tagged MARC to be turned into XML

Can carry data from existing MARC plus allows creation of new XML-based records—a way to integrate and move forward?

http://www.loc.gov/standards/mods/

CDWA (Categories for the Description of Works of Art)

Developed by the Getty specifically to describe art, architecture and cultural artifacts

A very granular standard: the fields are very narrowly defined and there are many specific fields (as opposed to a few fields that use “qualifiers”)

Example: Creation - Commissioner - Commissioner Role

See the CDWA Lite XML schema: http://www.getty.edu/research/conducting_research/standards/cdwa/cdwalite.html

Dublin (Ohio) Core

Developed by OCLC (headquartered in Dublin OH) (serving 53,500 libraries in 96 countries)

Created to describe “born digital” items in particular

Simple “bins” of data that can be further “qualified” (difference in Simple DC and Qualified DC)

A qualifier is an element refinement—example Date. Creation

The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 elements:

Title Creator Subject Description Publisher Contributor Date Type Format Identifier Source Language Relation Coverage Rights
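A minimal sketch of a Simple Dublin Core record expressed in XML, built with Python's standard library. The namespace URI is the published Dublin Core elements namespace; the wrapper element name "record", the function name simple_dc, and the sample title are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 Simple Dublin Core elements, in the lowercase form used in XML.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def simple_dc(fields):
    """Build a record from (element, value) pairs, rejecting non-DC element names."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a Simple Dublin Core element")
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


rec = simple_dc([
    ("title", "Example image (hypothetical)"),
    ("creator", "Cropsey, Jasper Francis"),
    ("type", "Image"),
])
print(ET.tostring(rec, encoding="unicode"))
```

Elements may repeat, which is why the input is a list of pairs rather than a dictionary.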

VRA Core 4.0

Published in April 2007: http://www.vraweb.org/datastandards/VRA_Core4_Welcome.html

A data standard guiding data structure

Formed with an eye to expressing content in XML, with both index and display values

Formed like library records, with a “bib” (work) record and an item (image) record

Formed, as is Dublin Core, with a 1:1 relationship: one record describes one object

EAD (Encoded Archival Description)

Started 1993 at Berkeley; now maintained by Library of Congress with SAA (Society of American Archivists)

Began using SGML, now uses XML

So, tagged and machine-readable, but not necessarily 1:1 records—simple way to make groups/boxes of material retrievable

Sample EAD Finding Aid

http://webtext.library.yale.edu/art/art.VRC1.htm

152 boxes; 64 linear feet of mounted photographs of American painting now in storage

Simply used the outline of the original filing/drawers and tagged them—this translates now to boxes of material with barcodes

METS (Metadata Encoding and Transmission Standard)

http://www.loc.gov/standards/mets/

Think of it as an XML “wrapper”: it can describe a group of objects or a collection of different objects, and it can “wrap” around a set of XML items that are in different formats, and therefore may be a way to integrate and present these

METS Profiles

UCSD Simple Object Profile abstract:

The UCSD Libraries uses the UCSD Simple Object profile for composing METS instances for digital objects consisting of a single digital content file and associated descriptive, administrative, and structural metadata. The single digital content file may be of any format type, e.g., audio, image, text, or video, and it may be represented in the METS instance with content equivalent file versions. For example, a digital image may be represented in the METS instance by a TIFF file, a JPEG file, and a GIF file, with each containing the same content image.

What do [book] librarians have that VR professionals don’t?

Tools and networked utilities for COPY CATALOGING:

MARC (Machine Readable Cataloging) for field structure (data standard)

AACR2 (Anglo-American Cataloging Rules) for data formatting (data content)

XML and Z39.50 (and other protocols) for transmitting data

OCLC as a shared records repository (sustainable business model)

How do we get to shared VR image cataloging?

Have to develop the same general mechanisms as the library world

VRA Core 4.0 = MARC

CCO = AACR2

XML will be one transmission vehicle/protocol

OAI (Open Archives Initiative) may become a harvesting and retrieval mechanism for record sharing

OAI (Open Archives Initiative)—XML Based

http://www.openarchives.org/

Started by 2 computer scientists at Cornell to quickly share information via mechanical “harvesting”: databases are opened to allow harvesting, and the results are then put in a central repository for searching. It is a “low-barrier” interoperability framework using Dublin Core (in XML) as its minimum standard, but one can also use other standards (expressed in XML) on top of that.

Google is using OAI to harvest data from the National Library of Australia. (See also U Michigan’s OAIster project).
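Harvesting amounts to a program asking a repository for its records over the Web. The sketch below only builds the request address, using the OAI-PMH verb ListRecords and the metadataPrefix oai_dc (the Dublin Core minimum); it does not contact any server. The base address repository.example.org is a placeholder, not a real repository, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: limit the harvest to one named set within the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL; a real harvester would use the repository's published address.
url = list_records_url("https://repository.example.org/oai")
print(url)
```

The repository answers such a request with an XML document containing the records, which the harvester then stores in its central index.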

See—XML matters!

Susan Jane Williams

Independent Cataloging and Consulting

williams.susanjane@gmail.com
