XML Basics for Digital Humanists

Alabama Digital Humanities Center
September 19 & 23, 2011
Instructor: Shawn Averkamp, Metadata Librarian
smaverkamp@ua.edu

What is XML?

eXtensible

Markup

Language

Language

• XML is a language for structuring data. (Other methods of structuring data: databases, Excel spreadsheets, etc.)

• Not a data model, but a way of encoding a data model or knowledge domain so that it is machine-processable.

• XML is composed of syntax rules (just like any other language).

Markup

• XML uses “markup” to structure data.

• XML uses labels within angle brackets (like in HTML) to “tag” text.

Ingredients
3 avocados
1/4 cup onions
1/4 teaspoon garlic salt
12 corn tortillas
1 bunch fresh cilantro leaves
jalapeno pepper sauce

<ingredients>
  <ingredient qty="3">avocados</ingredient>
  <ingredient qty="1/4" unit="cup">onions, diced</ingredient>
  <ingredient qty="1/4" unit="t">garlic salt</ingredient>
  <ingredient qty="12">corn tortillas</ingredient>
  <ingredient qty="1">fresh cilantro leaves</ingredient>
  <ingredient>jalapeno pepper sauce</ingredient>
</ingredients>

Elements (e.g., <ingredient>) = things we care about
Attributes (e.g., qty, unit) = properties of those things

eXtensible

• You can extend your data model with other XML data models (“schemas”).

<mods>
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
  </titleInfo>
  <name type="personal">
    <namePart>Red Ghost</namePart>
    <role>
      <roleTerm>Author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart>Dot Chomper</namePart>
    <role>
      <roleTerm>Advisor</roleTerm>
    </role>
  </name>
  <abstract>Pac-man shaped magnetic tunnel junctions are proposed for CMOS-based magnetic flip flops for space applications…</abstract>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>

The etd schema (the elements with the etd: prefix, shown in red on the slide) “extends” the mods schema.

Where is XML?

XML drives applications and information you use every day:

• RSS feeds (Really Simple Syndication) for blogs, podcasts, and more
• iTunes stores your music library metadata and usage data in XML
• Google uses XML to display geographic data in Google Maps and Google Earth (more info: http://code.google.com/apis/kml/documentation/kml_tut.html )

What’s XML good for?

• Sharing/exchanging data online
• Storing data
• Controlling data display
• Syndication

The XML Family

XML: The document language
XPath: Language for navigating XML documents
XSD (XML Schema Definition): Schema language
XSLT (Extensible Stylesheet Language Transformations): Language for transforming XML into other formats (HTML, text, other XML documents)
XQuery: Language for querying XML (similar to SQL database querying)
XForms: Language for creating web input forms
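
To make the XSLT and XPath entries above more concrete, here is a minimal sketch of a stylesheet. It assumes an input document with a <movies> root whose <movie> children each contain a <title> and a <year>; the HTML layout is an arbitrary choice for illustration, not something prescribed by the workshop.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <!-- The match pattern "/movies" is XPath: it selects the root <movies> element -->
  <xsl:template match="/movies">
    <html>
      <body>
        <ul>
          <!-- "movie" selects each child <movie>; the sort orders them by year -->
          <xsl:for-each select="movie">
            <xsl:sort select="year" data-type="number"/>
            <li>
              <xsl:value-of select="title"/>
              <xsl:text> (</xsl:text>
              <xsl:value-of select="year"/>
              <xsl:text>)</xsl:text>
            </li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Applied with any XSLT 1.0 processor, a stylesheet like this would turn each movie into one HTML list item, which is one way XML can be used to control data display.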

XML in the Humanities

• TEI (Text Encoding Initiative)
– Shakespeare Quartos Archive: http://www.quartos.org/
– Lewis & Clark Journals: http://lewisandclarkjournals.unl.edu/

• Syriac Reference Portal: http://www.syriac.ua.edu/

Getting Started

• Open Oxygen
• Open the movies.xml example (in the sample.xpr sidebar on the left) or paste the code below into a new document

<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie id="1">
    <title>The Green Mile</title>
    <year>1999</year>
  </movie>
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  <movie id="3">
    <title>The Matrix: Revolutions</title>
    <year>2004</year>
  </movie>
  <movie id="4">
    <title>Shrek II</title>
    <year>2004</year>
  </movie>
</movies>

Well-formedness

XML documents must be “well-formed” to be machine-readable.

• XML documents must have a single root element
• XML elements must have a closing tag (or be self-closing, e.g. <br/>)
• XML tags are case sensitive
• XML elements must be properly nested
• XML attribute values must be quoted
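
As a rough illustration of these rules, the sketch below is a small well-formed document. The <library>, <book>, <author>, and <name> elements and the id and lang attributes are invented for this example and are not part of the workshop files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical example: element and attribute names are invented for illustration -->
<library>                                 <!-- a single root element wraps everything -->
  <book id="b1" lang="en">                <!-- attribute values are quoted -->
    <title>Moby Dick</title>              <!-- every opening tag has a matching closing tag -->
    <author><name>Herman Melville</name></author>   <!-- the inner element closes before the outer one -->
  </book>
  <book id="b2" lang="en">
    <title>Walden</title>                 <!-- opening and closing tags match in case; </Title> would not -->
    <author><name>Henry David Thoreau</name></author>
  </book>
</library>
```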

Exercise 1

Copy and paste the following code into a new XML document in Oxygen. Correct all errors necessary to make this a well-formed XML document.

<movie id=1>
  <title>The Green Mile<title>
  <year>1999</year>
</movie>
<movie id="2">
  <title>Taxi Driver</title>
  <year>1976</year>
</movie>
<movie id="3">
  <title>The Matrix: Revolutions</title>
  <Year>2004</year>
</movie>
<movie id="4">
  <title>Shrek II</title>
  <year>2004</movie>
</year>

<!-- Comments -->

Enclose comments within double-hyphen/angle bracket notation:

<!-- a brief comment -->

<!--
This is a very long block of comments…
… more comments…
… comments…
(still more comments here…)
-->
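
Comments can also be used to “comment out” a block of elements so that a parser ignores them. A minimal sketch, reusing two entries from the movies example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie id="1">
    <title>The Green Mile</title>
    <year>1999</year>
  </movie>
  <!-- Temporarily hidden from processing:
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  -->
</movies>
```

Two things to watch for: a comment may not contain a double hyphen in its body, and comments cannot be nested, so a block that already contains a comment cannot simply be wrapped in another one.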

5 special symbols

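To use the following characters in a text or attribute value, replace them with these entities:

&    &amp;
<    &lt;
>    &gt;
"    &quot;
'    &apos;

(& and < must always be replaced; the other three are required only in certain contexts, such as a quotation mark inside an attribute value delimited by the same kind of quote.)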
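
A short sketch of the entities in use. The nickname attribute and the <note> element are invented for illustration and do not appear in the workshop's movies.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical entry: the nickname attribute and <note> element are invented for illustration -->
<movies>
  <movie id="5" nickname="the &quot;spy couple&quot; movie">
    <title>Mr. &amp; Mrs. Smith</title>
    <year>2005</year>
    <note>1 &lt; 2 and 2 &gt; 1; the collection&apos;s fifth entry</note>
  </movie>
</movies>
```

A parser hands the unescaped characters back to the application, so the title is read as Mr. & Mrs. Smith.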

Exercise 2

In your movies.xml document, add another movie to the collection. Add a comment somewhere in the document (or “comment out” a block of elements). When you’ve finished, check for well-formedness (blue check icon).

XML Schemas

Schemas describe the syntax rules for encoding a data model in XML:

– Allowable elements, attributes, and values
– Element types (simple or complex)
  • Simple: contains a value
  • Complex: contains other elements
– Constraints on elements, attributes, and values
  • Repeatability (how many instances of each element are allowed)
  • Obligation (is the element or attribute mandatory?)
– Datatypes of values (integer, string, date, etc.)

<movies xmlns="http://example.com/schema.xsd">
  <movie id="1">
    <title>The Green Mile</title>
    <year>1999</year>
  </movie>
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  <movie id="3">
    <title>The Matrix: Revolutions</title>
    <year>2004</year>
  </movie>
  <movie id="4">
    <title>Shrek II</title>
    <year>2004</year>
  </movie>
</movies>

XML Schemas

• Schemas are themselves XML files but with a .xsd file extension.

• In our XML document, we reference the schema by using a “namespace”
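
To give a feel for what a schema file looks like, here is a minimal sketch of an XSD that the movies document could be validated against. It is an assumption made for illustration, not the workshop's actual schema: the namespace URI is copied from the movies example, and the type name movieType and the chosen datatypes are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/schema.xsd"
           xmlns="http://example.com/schema.xsd"
           elementFormDefault="qualified">

  <!-- Complex element: the root contains other elements -->
  <xs:element name="movies">
    <xs:complexType>
      <xs:sequence>
        <!-- Repeatability: at least one <movie>, with no upper limit -->
        <xs:element name="movie" type="movieType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- movieType is a hypothetical name for the shape of each <movie> -->
  <xs:complexType name="movieType">
    <xs:sequence>
      <!-- Simple elements: each contains a value with a datatype -->
      <xs:element name="title" type="xs:string"/>
      <xs:element name="year" type="xs:gYear"/>
    </xs:sequence>
    <!-- Obligation: the id attribute is mandatory -->
    <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
  </xs:complexType>
</xs:schema>
```

The targetNamespace value is what the xmlns attribute in the movies document points to, which is how a document and its schema are matched up.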

Namespaces

The namespace is the unique identifier for the schema.

<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
  </titleInfo>
  ……
</mods>

Namespace prefixes

When two or more schemas are used in an XML document, we use “prefixes” to distinguish between the elements of each.

<mods xmlns="http://www.loc.gov/mods/v3"
      xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/">
  ……
  <originInfo>
    <dateIssued>2011</dateIssued>
  </originInfo>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>
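
The prefix itself is only a local label; what identifies the schema is the namespace URI it is bound to. As a sketch of that point, the fragment below expresses the same kind of record with an explicit mods: prefix instead of a default namespace. The choice of the prefix name mods is an assumption for illustration, and the content is trimmed to a few elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Same namespaces as in the slide, but MODS elements carry an explicit prefix -->
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3"
           xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/">
  <mods:originInfo>
    <mods:dateIssued>2011</mods:dateIssued>
  </mods:originInfo>
  <mods:extension>
    <etd:degree>Ph.D.</etd:degree>
  </mods:extension>
</mods:mods>
```

A namespace-aware parser treats <mods:dateIssued> here and the unprefixed <dateIssued> in the previous example as the same element, because both resolve to the same namespace URI.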

Valid XML

To be “valid” an XML document must:

• Be well-formed
• Include the schema declaration in the root element (e.g., <mods xmlns="http://www.loc.gov/mods/v3">)
• Conform to the rules of the schema
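
The difference between well-formed and valid can be sketched with a deliberately misspelled element. The document below is an illustrative assumption, not one of the workshop files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Well-formed (every tag closes and nests correctly) but NOT valid:
     the MODS schema defines <titleInfo>, not <titelInfo>, so validation
     against the schema would report the misspelled element below. -->
<mods xmlns="http://www.loc.gov/mods/v3">
  <titelInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
  </titelInfo>
</mods>
```

A well-formedness check alone would pass this document; only validation against the schema catches the problem.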

Exercise 3

Copy and paste the code on the next slide into a new XML document in Oxygen. Add a <name> element to the document, then validate (red check icon). If it validates, then introduce an error into your document to see what error messages Oxygen gives you.

<mods xmlns="http://www.loc.gov/mods/v3"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/"
      xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd"
      version="3.4">
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
  </titleInfo>
  <name type="personal">
    <namePart>Red Ghost</namePart>
    <role>
      <roleTerm>Author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart>Dot Chomper</namePart>
    <role>
      <roleTerm>Advisor</roleTerm>
    </role>
  </name>
  <abstract>Pac-man shaped magnetic tunnel junctions are proposed for CMOS-based magnetic flip flops for space applications…</abstract>
  <originInfo>
    <dateIssued>2011</dateIssued>
  </originInfo>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>

Using and creating schemas

• Always start with the data model!

• Decide what entities and properties are important to you and your project before choosing or creating a schema.

Things to consider

• Are there existing schemas that meet your needs?

• Are there commonly used schemas within your field?

• If you find a schema that almost meets your needs, can you extend it to cover the entire scope of what you want to model?

• Who (or what software applications) will you be sharing the data with?

• What kind of functionality do you want to support? Indexing? Flexible display? Visualizations?

Tailor schemas to meet your needs

• You can make schema rules stricter (but not more lax)

• Extend schemas with other schemas (Your primary schema must allow extensions)

• If you expect use of your XML data to be very limited, you can change the schema. (Not recommended if you plan to share your data widely or beyond your own software applications)

Documentation

• Data dictionaries, markup guidelines, and best practices are important, especially if you have assistants entering your data.

• Examples of documentation:
– MODS guidelines: http://www.loc.gov/standards/mods/userguide/generalapp.html
– UVa Library TEI guidelines: http://www.lib.virginia.edu/digital/reports/teiPractices/dlpsPractices_postkb.html

Exercise 4

Work together to create a data model for a dictionary (or a knowledge domain of your choosing). What should the root element be? What are the elements that will be contained within the root? What are the attributes* (properties) of each of your elements?

Create an instance of your data model in XML. What adjustments or enhancements would you need to make for your schema to be extensible?

*How do you know when something should be an attribute or an element? There is often no wrong answer to this. Use your best judgment—if you think you will not need to further refine a property (for instance, in our recipe example we would not need to refine quantity or unit any further), an attribute is probably the best choice.
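
The trade-off can be sketched with the recipe example. The first entry keeps quantity and unit as attributes, as in the earlier slide; the second models the same information as child elements. The names <quantity>, <unit>, <name>, and <preparation> are hypothetical and chosen only for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ingredients>
  <!-- Attributes: compact, but each attribute holds one plain value and cannot contain further markup -->
  <ingredient qty="1/4" unit="cup">onions, diced</ingredient>

  <!-- Child elements (hypothetical names): more verbose, but each part can later
       be refined with attributes or sub-elements of its own -->
  <ingredient>
    <quantity>1/4</quantity>
    <unit>cup</unit>
    <name>onions</name>
    <preparation>diced</preparation>
  </ingredient>
</ingredients>
```

If the preparation step might later need its own properties, the element form leaves room for that; if not, the attribute form is shorter to write and read.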

Resources

• Books, tutorials, and other resources: http://www.lib.ua.edu/digitalhumanities/xml-resources

• http://www.xml.com/
