5 February 2008 Kaiser: COMS E6125 1
COMS E6125 Web-enHanced Information Management (WHIM)
Prof. Gail Kaiser
Spring 2008
Today’s Topic: Markup Languages
• History of markup languages
• SGML = Standard Generalized Markup Language
• HTML = HyperText Markup Language
• XML = eXtensible Markup Language
What is Markup?
• Special text (“mark”) that is added to the regular text of a document in order to convey some information about it
• A markup language is a formalized way of providing markup, and specifies:
  – what markup is allowed (the lexicon)
  – what markup is required
  – how markup is distinguished from content text
  – what the markup “means”
Specific Coding
• Historically, electronic manuscripts contained procedural control codes (markup) that caused the text to be formatted in a particular way
  – tj6
  – troff
  – TeX
Procedural Markup
• Advantages:
  – Instructs agent how to process text
  – Generally concerned with formatting and presentation
  – Is “efficient” because requires little further interpretation
• Disadvantages
  – Often specific to one proprietary processing system
  – Usually ties a document to a single purpose
    • printing on paper
    • viewing on a screen
  – Provides no information on “meaning”
Markup Steps
1. Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type
2. Author then determines, from memory or a style book, the processing instructions (“marks”) that will produce the format desired for that type of element
3. Finally, s/he inserts the chosen marks into the text
Example Specific Coding
.SK 1
Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called "markup", serves two purposes:
.TB 4
.OF 4
.SK 1
1.#Separating the logical elements of the document; and
.OF 4
.SK 1
2.#Specifying the processing functions to be performed on those elements.
.OF 0
.SK 1
Callouts: .TB = TaB stop; .OF = OFfset; .SK = SKipping vertical space
Generic Coding
• In contrast, generic (or generalized, or descriptive) coding uses descriptive tags (e.g., “heading”)
  – Scribe
  – LaTeX
  – HTML
Descriptive Markup
• Advantages:
  – Identifies the logical components of a document
  – Generally concerned with what text is
  – Does not specify what procedures are to be applied to text
  – Therefore requires that other process(es) supply formatting and presentation
Descriptive Markup
• Disadvantages
  – Is (usually) human and machine readable
  – Identifies information content
  – Is not directed towards a particular purpose or rendition of the document
  – Therefore can be non-proprietary
Markup Steps
1. Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type (same as above)
2. Author then associates each significant element with the mnemonic tag (“mark”) that s/he feels best characterizes it
Example Generic Coding
<p>Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called <em>markup</em>, serves two purposes:
<ol>
  <li>Separating the logical elements of the document; and
  <li>Specifying the processing functions to be performed on those elements.
</ol>
The Case for Generalized Markup
• Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, so markup need be done only once and will suffice for all future processing
• Markup should be rigorous so that the techniques available for rigorously-defined objects like programs and data bases can be used for processing documents as well
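The "mark up once, process many times" argument can be illustrated with a small sketch. The tag names (heading, paragraph, item) and both renderer functions are assumptions made up for this example, not part of any markup standard: one descriptively tagged document is fed to two independent formatting processes.

```python
import html

# Descriptive markup: each entry says what the text IS, not how to format it.
# The tag vocabulary here is hypothetical, chosen only for illustration.
document = [
    ("heading", "Markup Languages"),
    ("paragraph", "Markup serves two purposes:"),
    ("item", "Separating the logical elements of the document"),
    ("item", "Specifying the processing functions to be performed"),
]

def render_plain(doc):
    """One possible processing: plain text for a terminal."""
    styles = {
        "heading": lambda t: t.upper(),
        "paragraph": lambda t: t,
        "item": lambda t: "  * " + t,
    }
    return "\n".join(styles[tag](text) for tag, text in doc)

def render_html(doc):
    """A second processing of the same markup: HTML elements."""
    tags = {"heading": "h1", "paragraph": "p", "item": "li"}
    return "\n".join(
        f"<{tags[tag]}>{html.escape(text)}</{tags[tag]}>" for tag, text in doc
    )

print(render_plain(document))
print(render_html(document))
```

The document itself never changes; only the process that supplies the formatting does.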
Who Invented Markup?
• Specialized markup: ???
• Generalized markup:
  – Many credit William Tunnicliffe, chairman of the Graphic Communications Association Composition Committee, who presented a talk on the separation of information content of documents from their format during a meeting at the Canadian Government Printing Office, September 1967
  – Others credit Stanley Rice, a New York book designer, who proposed the idea of a universal catalog of parameterized editorial structure macros in several articles, e.g., "Editorial Text Structures," Memorandum to Standards Planning and Requirements Committee, ANSI, March 17, 1970
An Early Implementation
• At IBM in 1969, Charles Goldfarb, Ed Mosher and Ray Lorie invented Generalized Markup Language (GML) as part of a law office project integrating text editing with information retrieval and page composition
• Instead of a simple tagging scheme, GML introduced the concept of a formally-defined document type (DTD = Document Type Definition) with an explicit nested element structure
• By 1971 developed first DTD, for the manuals for IBM's “Telecommunications Access Method”, which enabled all the headings of a given head-level to be automatically formatted identically
• Productized in 1973 in IBM’s Document Composition Facility (DCF)
Example GML
:h1.Chapter 1: Introduction
:p.GML supported hierarchical containers, such as
:ol
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.
as well as simple structures.
:p.Markup minimization (later generalized and formalized in SGML), allowed the end-tags to be omitted for the "h1" and "p" elements.
SGML = Standard GML
• Standardization effort started in 1978, when ANSI (American National Standards Institute) created the Computer Languages for the Processing of Text Committee
• Series of draft standards 1980-1986 (1983 version adopted by IRS and DoD); ISO (International Organization for Standardization) joined the ANSI effort in 1984
• Final international standard in 1986, based in part on an SGML system developed by Anders Berglund, then of the European Particle Physics Laboratory (CERN)
• Hmm… isn’t CERN where Tim Berners-Lee invented the “World Wide Web” in 1989?
SGML
• A metalanguage (grammar)
• How to write tags, how to define the document structure
• Structural paradigm is that of
  – an inverted tree structure, a root component branching out into leaves
  – or a series of nested containers
• Defines three kinds of objects
  – Elements are the basic structural components
  – Attributes are qualities of elements
  – Entities are a short representation of special characters
SGML Pro and Con
• Advantages:
  – Documents held in a standards-based, non-proprietary, platform-independent storage format
  – Scope for document re-use and re-presentation, enhancement of retrieval possibilities
  – Easy to process
  – Can (optionally) validate against DTDs
• Disadvantages:
  – Remained a niche market in the 1980s, unknown to the masses
  – Not well supported by the major document processing vendors, tools expensive
Then Came the Web…
• HyperText Markup Language (HTML) is derived from SGML
• As an SGML-compliant language, it has a DTD with a fixed set of tags
• Initially, the number of tags was very limited (~10), and they were very easy to remember and to use
HTML Example
<html>
  <head>
    <title> My title </title>
  </head>
  <body>
    <h1> A huge heading </h1>
    <h2> A smaller one </h2>
    <ul>
      <li> a list item in <b>bold</b> </li>
      <li> a list item in <i>italics</i> </li>
    </ul>
    <p> A paragraph </p>
  </body>
</html>
Another HTML Example
• From original IETF Internet Draft for HTML
See <A HREF="http://info.cern.ch/">CERN</A>'s information for more details.
A <A NAME=serious>serious</A> crime is one which is associated with imprisonment.
The Organization may refuse employment to anyone convicted of a <a href="#serious">serious</A> crime.
Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This must be done by a qualified technician.
<A HREF="Go"><IMG SRC="Button"> Press to start</A>
HTML Pro and Con
• Advantages
  – Simple to learn and to use
  – Easy to create from scratch or by converting legacy text files
  – Easy to parse and render
• Drawbacks
  – Syntaxless
  – Much more a presentation language than a structural language
  – Too limited, not a good substitute for a word processor
HTML History
• 1990: First implementation by Tim Berners-Lee (TBL) on a NeXT computer at CERN
  – Used SGML tools to create original HTML language (DTD, parser)
  – Scalability and simplicity of HTML (and HTTP), compared to OHS or Gopher, were part of the basis for WWW success
• 1991-1992: Various text-only and graphical browsers developed, the latter usually platform-specific
HTML History
• 1993: NCSA Mosaic
  – First widely available graphical WWW browser (Unix X-Windows and Mac)
  – Developed primarily by UIUC undergraduate Marc Andreessen
  – The killer application of the Internet is born and the number of Web servers explodes
• 1994: Competition
  – Mosaic team leaves NCSA to found Netscape
  – Microsoft adopts the Web (Internet Explorer bundled with Windows 95)
  – Divergence of supported HTML tags between Internet Explorer and Netscape -> browser wars
  – HTTP traffic becomes more common than telnet and ftp
HTML History
• 1994-1995: HTML 2.0 adds image maps, forms
• 1995 and beyond: Commercial websites
  – Java development started (as “Oak”) for programming settop boxes in 1991, BIG FAILURE - but launched on Web in March 1995 (in HotJava) and May 1995 (in Netscape), BIG SUCCESS
  – Amazon.com opens in July 1995
  – “dot com” era begins (and soon ends)
HTML History
• Jan 1997: HTML 3.2 adds tables, applets, text flow around images, superscripts and subscripts
• Dec 1997: HTML 4.0 adds frames, cascading style sheets, more multimedia options, scripting languages, web accessibility conventions, internationalization
XHTML = eXtensible HyperText Markup Language
• XHTML 1.0 W3C Recommendation January 2000, revised August 2002 (XHTML 1.1 still working draft)
• Made element and attribute names case-sensitive (in particular, use lowercase)
• Include end tags, e.g., <p> … </p>
• Add a “/” to empty elements, e.g., <br/> and <hr/>
• Quote all attribute values, e.g., <img src="duck.jpg" alt="A Duck"/>
• Most browsers still work fine with older HTML
Where did the “X” come from?
• XML = eXtensible Markup Language
• XHTML is a reformulation of HTML 4.x in XML
• XHTML can be used in conjunction with other XML vocabularies
  – SMIL (Synchronized Multimedia Integration Language)
  – SVG (Scalable Vector Graphics)
  – MathML (Mathematical Markup Language)
  – Plus hundreds dedicated to specific applications (the extensible part)
What is XML for?
• The universal markup format for structured documents and data on the Web
• For data exchange (messages) and persistent data
• Syntax
• Data Modeling
• Data Processing
XML History
• XML 1.0 became a W3C Recommendation in February 1998, revised several times - most recently September 2006
• XML 1.1 draft released Nov 2003, recommendation last revised September 2006 (addresses various issues wrt Unicode and mainframe compatibility)
• Conceptually an SGML descendant
• Unlike SGML, it quickly became widespread
SGML->XML
• Like SGML, XML is a grammar (or a metalanguage), NOT a specific language
• Specification simplified
  – SGML spec ~600 pages
  – XML spec 36 pages (initial 1.0) -> 54 pages (1.1 2nd edition)
• Parsing made simpler through two-level mechanism
  – Well-formed
  – Valid
Well-Formed
• (Optionally) starts with XML declaration
  <?xml version="1.0"?>
• Rest of document inside the root element
  <myroot>…</myroot>
• All text contained in some element
  <someelement>text text text</someelement>
• Explicit empty elements
  <anotherelement></anotherelement>
  <anotherelement/>
Well-Formed
• Element tags must be properly nested (no crossing tags)
  NO <i><b>blah blah blah</i></b>
• Start and end tags must match exactly (same case)
• Quotes placed around all attribute values
  <a href="stuff.html">stuff</a>
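The well-formedness rules can be checked mechanically with any non-validating parser. A minimal sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample strings are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Each sample exercises one rule from the slides.
samples = {
    "properly nested": "<myroot><i><b>blah</b></i></myroot>",
    "crossing tags": "<myroot><i><b>blah</i></b></myroot>",
    "case mismatch": "<myroot><P>text</p></myroot>",
    "unquoted attribute": "<a href=stuff.html>stuff</a>",
    "empty element": "<myroot><anotherelement/></myroot>",
}

results = {name: is_well_formed(doc) for name, doc in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'NOT well-formed'}")
```

Note that this only tests well-formedness; checking validity additionally needs a DTD or schema and a validating parser.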
Valid
• Well-formed, plus
• Conforms to a DTD or Schema
  – tags and attributes are all declared
  – tags and attributes are used correctly
• XML browsers and editors usually require validity
• Other tools might not (e.g., search engines)
XML Goes Beyond Document Processing
• XML more oriented to distributed computing than to document markup
• Thus complements rather than replaces HTML (or XHTML)
• DOM = Document Object Model
• SAX = Simple API for XML
• SOAP = Simple Object Access Protocol
• Web Services
Let’s Reinvent XML
• Someone in the far future sends a message in a virtual bottle, containing parts of the universal library of human and post-human literature, back into the 1970s when ...
• … the Web, XML, P2P, Java were unheard of
• ... computer manufacturers talked about mips and kilobytes
• … music was played by rotating vinyl discs under a diamond-tip stylus or on cassette tapes
… and Microsoft looked like
The Message in the Bottle, 1st try
ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@þÿ^@
^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^@^@^@^@^P^@^@%^@^@^@^A^@^@^@þÿÿÿ^@^@^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á^@q^@^D^@^@^@^R¿^@^@^@^@^@^@^P^@^@^@^@^@^D^@^@Ç^G^@^@^N^@bjbjt+t+^@^@^@
^@Some Quotations from the Universal Library^M1 Famous Quotes^M1.1 By William I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day?^MThou art more lovely and more temperate.^MRough winds do shake the darling buds of May,^MAnd summer's lease hath all too short a date.^MSometime too hot the eye of heaven shines,^MAnd often is his gold complexion dimmed.^MAnd every fair from fair some declines,^MBy chance or nature's changing course untrimmed.^MBut thy eternal summer shall not fade,^MNor lose possession of that fair thou owest,^MNor shall Death brag thou wander'st in his shade^MWhile in eternal lines to time thou growest.^MSo long as men can breathe, or eyes can see,^MSo long live this, and this gives life to thee.^M1.2 ^M[2] W. Shakespeare. The Sonnets of Shakespeare.609.^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
The Message in the Bottle, 2nd try
\documentclass{article}
\begin{document}
\title{Some Quotations from the Universal Library}
...
\section{Famous Quotes}
\subsection{By William I}
\textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}}
\begin{verse}
Shall I compare thee to a summer's day?\\
Thou art more lovely and more temperate. \\
Rough winds do shake the darling buds of May, \\
…
\end{verse}
\bibliographystyle{abbrv}
\bibliography{msg}
\end{document}
The Message in the Bottle, finally
<?xml version="1234.56"?>
<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <subsection>
          <title>By William I</title>
          <quote bibref="shakespeare-sonnets-1609">
            <title>Sonnet XVIII</title>
            <verse>
              <line>Shall I compare thee to a summer's day?</line>
              <line>Thou art more lovely and more temperate.</line>
              <line>Rough winds do shake the darling buds of May,</line>
              …
            </verse>
          </quote>
        </subsection>
      </section>
    </book>
    …
  </books>
</universal_library>
XML as a Self-Describing Data Exchange Format
• Someone from the 1970s receives the message in the virtual bottle, and it …
• … can be easily “understood” (even using CP/M & edlin)
• … can be parsed easily
• … allows the application programmer to rediscover schema and semantics (sort of…)
• … may include an explicit schema description
• … allows separation of marked-up content from presentation
XML Anatomy
<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra</author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>
Callouts: element name; element; element content; attribute name; attribute value (attributes cannot contain elements); empty element; character content; number content
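The parts named in the anatomy slide map directly onto what a parser hands back. A minimal sketch with Python's standard xml.etree.ElementTree, using the slide's bibliography example (with straight quotes); the variable names are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

doc = """<bibliography>
  <paper ID="goto">
    <authors><author>Edsger W. Dijkstra</author></authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>"""

root = ET.fromstring(doc)            # element; root.tag is its element name
paper = root.find("paper")
paper_id = paper.get("ID")           # attribute name "ID" -> attribute value
title = paper.findtext("title")      # character content
# "Number content" is still text in the document; the application converts it.
year = int(paper.findtext("year"))
full_paper = paper.find("fullPaper")
# An empty element has no text and no child elements, only attributes.
is_empty = full_paper.text is None and len(full_paper) == 0

print(root.tag, paper_id, title, year, is_empty)
```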
Perspectives on XML
• Document (SGML) Community
  – data = linear text documents
  – markup (annotate) text to describe context, structure, semantics
• Database Community
  – XML as a prominent example of the semi-structured data model
  – captures the whole spectrum from highly structured, regular data to unstructured data
“XML is the cure for your data exchange, information integration, e-commerce, … problems” (also cures baldness, lose 28 pounds in 14 days, get rich quick, …)
Pure XML - Instance Model
<A> <B>foo</B> <C>bar</C> <C>psl</C> </A>
[Figure: the same instance drawn as a labeled tree (root A with children B "foo", C "bar", C "psl") and as nested boxes (A: containing B: "foo", C: "bar", C: "psl"); children are ordered]
• XML 1.0 implicit data model (infoset):
  – nested containers ("boxes within boxes")
  – labeled ordered trees (= semistructured data model)
  – relational, object-oriented easy to encode
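The labeled ordered tree view can be seen directly in a parsed document. A minimal sketch with Python's standard xml.etree.ElementTree on the slide's instance; the helper to_nested is an assumption for illustration and ignores attributes and mixed content:

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>")

# Iterating an element yields its children in document order,
# so the two C elements stay distinguishable by position.
children = [(child.tag, child.text) for child in a]

def to_nested(elem):
    """Encode an element as (label, text) for leaves or (label, [children])."""
    if len(elem) == 0:
        return (elem.tag, elem.text)
    return (elem.tag, [to_nested(c) for c in elem])

print(children)
print(to_nested(a))
```

The nested tuple form is the "boxes within boxes" picture written out as a data structure.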
Identifying Vocabularies
• My element may not be your element:
  – geometry context: <element>line</element>
  – chemistry context: <element>oxygen</element>
Identifying Vocabularies
• An XML Schema (with XML 1.1) defines a vocabulary of names of type definitions, element and attribute declarations [Schema ~= new improved DTD]
• Use XML Namespaces (with XML 1.1) to identify which vocabulary
  – Simple method for qualifying element and attribute names used in XML documents
  – Useful when a single XML document contains elements and attributes that are defined for and used by multiple software modules
Namespace Scoping
• XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace
• The declaration is in scope for the element containing the attribute and all its descendants
<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
  <html:head>
    <html:title>Frobnostication</html:title>
  </html:head>
  <html:body>
    <html:p>Moved to
      <html:a href='http://frob.example.com'>here.</html:a>
    </html:p>
  </html:body>
</html:html>
Namespace Defaulting
<?xml version="1.1"?>
<!-- elements are in the HTML namespace, in this case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>Frobnostication</title>
  </head>
  <body>
    <p>Moved to <a href='http://frob.example.com'>here</a>.</p>
  </body>
</html>
Multiple Namespaces
<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'
         xmlns:money='urn:Finance:AllAboutMoney'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
  <bk:price money:currencySymbol="$">99.99</bk:price>
</bk:book>
All element types are prefixed
Namespace Defaulting with Multiple Namespaces
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
</book>
Unprefixed element types are from books
Nested Scoping
<?xml version="1.1"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace for some commentary -->
    <p xmlns='urn:w3-org-ns:HTML'> This is a <i>funny</i> book! </p>
  </notes>
</book>
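How scoping and defaulting resolve can be observed with a namespace-aware parser. A minimal sketch using Python's standard xml.etree.ElementTree, which reports each element as {namespace-URI}local-name. The XML declaration is left out here because the underlying expat parser handles XML 1.0 only; the prefix "bk" used in the lookup is an arbitrary choice for this example:

```python
import xml.etree.ElementTree as ET

doc = """<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <p xmlns='urn:w3-org-ns:HTML'>This is a <i>funny</i> book!</p>
  </notes>
</book>"""

root = ET.fromstring(doc)

# Expanded names in document order: the inner default namespace applies
# to <p> and its descendant <i>, and nowhere else.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)

# Prefixes are local shorthand; only the URI matters when looking up names.
ns = {"bk": "urn:loc.gov:books"}
title = root.find("bk:title", ns).text
print(title)
```

Nothing is fetched from any of the URIs; they act purely as qualifiers.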
How to Define the Actual Namespace
• W3C namespace specification doesn’t say (!)
• A namespace doesn’t actually have to exist as a physical or conceptual entity
• All that is needed is a qualifier, the XML namespace URI, that, in combination with an element type or attribute name, creates a universal (and universally unique) name
• In other words, there doesn’t actually have to be a definition or anything else at that URI
XML Namespaces
• Allows mixing of different tag vocabularies
• Only identifies the vocabulary (lexicon)
• Additional mechanisms required for structure and meaning of tags
Processing XML
• Non-validating parser:
  – checks that XML doc is syntactically well-formed
• Validating parser:
  – checks that XML doc is also valid wrt a given XML Schema (or, historically, DTD)
Processing XML
• Tree representation:
  – Document Object Model (DOM) API
  – Cursor APIs, e.g., .NET’s XPathNavigator, Java StAX
• Stream of events representation:
  – Push Model, e.g., Simple API for XML (SAX)
  – Pull Model, e.g., Common API for XML Pull Parsing (XmlPull)
• Others
Document Object Model
• Object-oriented approach to traversing the XML document as a tree
• Typically loads the entire XML document into memory (random access but memory intensive)
• Provides mechanisms for loading, saving, accessing, querying, modifying, and deleting nodes from an XML document
DOM API
• Hierarchy of Node objects mapping to XML concepts: document, element, attribute, processing instruction, comment, …
• Language-independent API:
  – get first/last child, previous/next sibling, set of nodes
  – insert before/after, replace
  – getElementsByTagName
• W3C DOM offers fairly limited functionality, so implementations often add helper method extensions
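The DOM operations listed above look roughly like this in practice. A minimal sketch using xml.dom.minidom from the Python standard library (one DOM implementation among many); the sample document and the new element name D are assumptions for illustration:

```python
from xml.dom import minidom

# The whole document is loaded into memory as a tree of Node objects.
dom = minidom.parseString("<A><B>foo</B><C>bar</C><C>psl</C></A>")
root = dom.documentElement

first = root.firstChild                 # <B>
last = root.lastChild                   # the second <C>
first_name = first.tagName
next_name = first.nextSibling.tagName   # the first <C>

# getElementsByTagName returns matching descendants in document order.
c_texts = [c.firstChild.data for c in dom.getElementsByTagName("C")]

# Modify the tree: insert a new element before the last child,
# then delete the first child.
new_elem = dom.createElement("D")
new_elem.appendChild(dom.createTextNode("new"))
root.insertBefore(new_elem, last)
root.removeChild(first)

serialized = root.toxml()
print(serialized)
```

Random access like this is convenient, at the cost of holding every node in memory.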
Push Model
• XML producer (typically an XML parser) controls the pace of the application and informs the XML consumer when certain events occur (e.g., reports events when encountering begin/end tags)
• XML consumer registers callbacks with the producer, which invokes the callbacks as various parts of the XML document are seen (as events are reported)
• Does not necessarily build a parse tree
Push Model Pro
• The entire XML document does not need to be stored in memory, only the information about the node currently being processed is needed
• This makes it possible to process large XML documents without incurring massive memory costs
• Can also process XML streams whose contents arrive over time
• Allows consumer to ignore less interesting data
Push Model Con
• Certain context and state information such as the parents of the current node or its depth in the XML tree must be tracked by the programmer
• Limited expressive power (query/update) when working on streams
• To register callbacks one needs to create a class devoted to handling events from the producer
• Many developers find callbacks to be an unintuitive way to control program flow
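The push model and its bookkeeping burden can be sketched with Python's standard xml.sax module. The handler class DepthTracker and the sample document are assumptions for illustration; note that the depth counter must be maintained by hand, since the parser only reports one event at a time:

```python
import xml.sax

class DepthTracker(xml.sax.ContentHandler):
    """A class devoted to handling events pushed by the parser."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # context the programmer has to track manually
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, self.depth))
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        self.events.append(("end", name, self.depth))

    def characters(self, content):
        # The parser may deliver text in several chunks; skip blank ones.
        if content.strip():
            self.events.append(("text", content, self.depth))

handler = DepthTracker()
# The parser (producer) drives the control flow and calls back into handler.
xml.sax.parseString(b"<A><B>foo</B><C>bar</C></A>", handler)

for event in handler.events:
    print(event)
```

No tree is built: once an event has been reported, the parser keeps nothing about it.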
Pull Model
• XML Consumer controls the program flow by requesting events from the XML producer as needed
• Operates in a forward-only, streaming fashion while only showing information about a single node at any given time
• Programmer creates a loop that continually reads from the XML document until the end of the document is reached, but acts solely on items of interest as they are seen
Pull Model Comparison
• As memory efficient as push model processing but with a more familiar programming model
• Does not require a specialized class for handling XML processing to implement specific interfaces or subclass certain classes to register callbacks
• The need to explicitly track application states using boolean flags and similar variables is significantly reduced
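For comparison, the pull model can be sketched with xml.dom.pulldom from the Python standard library (a stand-in here for pull APIs such as XmlPull, which the slides name). The consumer owns the loop and acts only on the elements it cares about; the sample document is an assumption for illustration:

```python
from xml.dom import pulldom

# The consumer requests events one at a time; no handler class is needed.
events = pulldom.parseString("<A><B>foo</B><C>bar</C><C>psl</C></A>")

c_values = []
for event, node in events:
    # Act solely on items of interest, skipping everything else.
    if event == pulldom.START_ELEMENT and node.tagName == "C":
        # Pull the remainder of this element into a small subtree.
        events.expandNode(node)
        c_values.append(node.firstChild.data)

print(c_values)
```

The filtering logic sits in ordinary straight-line code inside the loop instead of being spread across callbacks.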
XML Cursors
• Cursor acts like a lens that focuses on one XML node at a time, but, unlike pull-based or push-based APIs, the cursor can be positioned anywhere along the XML document at any given time
• Allows one to navigate, query, and manipulate an XML document loaded in memory
• Does not require the heavyweight interface of a traditional tree model API, where every significant token in the underlying XML must map to an object
• Can create XML views of non-XML data
Other Alternatives
• Object to XML Mapping APIs
  – Represent nodes and text as classes and programming language primitives
  – Cannot represent all XML information with full fidelity, e.g., lose processing instructions and comments, element ordering
  – Impedance mismatches between XML Schema and object-oriented concepts
• XML-specific languages
  – XPath, XQuery, XSLT, …
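The fidelity loss in object-to-XML mapping can be shown with a hand-rolled sketch. The Paper class and both mapping functions are hypothetical, written only for this example (real mapping APIs are far more elaborate); parsing uses Python's standard xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Paper:                     # hypothetical class for illustration
    paper_id: str
    title: str
    year: int

def paper_from_xml(elem: ET.Element) -> Paper:
    """Map an XML element onto an object with primitive-typed fields."""
    return Paper(
        paper_id=elem.get("ID"),
        title=elem.findtext("title"),
        year=int(elem.findtext("year")),
    )

def paper_to_xml(paper: Paper) -> ET.Element:
    """Map the object back to XML, in whatever order the code chooses."""
    elem = ET.Element("paper", {"ID": paper.paper_id})
    ET.SubElement(elem, "title").text = paper.title
    ET.SubElement(elem, "year").text = str(paper.year)
    return elem

# The source has a comment and lists <year> before <title>.
source = (
    '<paper ID="goto"><!-- a classic --><year>1968</year>'
    "<title>Go To Statement Considered Harmful</title></paper>"
)
paper = paper_from_xml(ET.fromstring(source))
round_trip = ET.tostring(paper_to_xml(paper), encoding="unicode")
print(round_trip)
```

After the round trip the comment is gone and the child elements appear in the object's field order, not the document's original order.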
Summary
• Webpages intended for human audience usually written in HTML, where descriptive markup is interpreted by browser
• Webpages intended for machine processing (other than browser) usually written in some XML vocabulary understood by both the producer and the consumer
Second Assignment: Revised Paper Proposal
• Due Monday February 18th at 5pm
• Maximum three pages (not including figures, if any), plus references (required)
• Plan and outline your paper (which will be ~15 pages)
• See http://york.cs.columbia.edu/classes/cs6125/revised_paper_proposal.htm
Revised Paper Proposal
• Each full paper should have title, author, abstract (~200 words), introduction, body sections, conclusions, bibliography (cited references)
• The point of this assignment is to determine what will be in those sections
• Assume a reader who is taking the class but may not know anything at all about your specific topic
Revised Paper Proposal: Introduction and Conclusion
• What is your topic?
• What is the problem being addressed?
• What is the solution, or design space of solutions, proposed or actualized?
• What is your argument?
• What is your point of view?
• What is the opposing point of view?
Revised Paper Proposal: Body Sections
• What sections? (usually 3-5)
• What subsections? (perhaps down to subsubsections)
• Motivate your literature reading to fill those sections
• Full paper will be due March 14th
A Note about Citations and Bibliographic References
• References should be cited in the text like this “Kaiser said blah blah [1]” or this “[Kai07] describes mumble”
• Bibliography entry should appear something like this
  [Kai07] Gail Kaiser, COMS E6125 Web-enHanced Information Management, Columbia University Department of Computer Science, 2007, http://york.cs.columbia.edu/classes/cs6125/.
Second Assignment: Logistics
• Due Monday February 18th by 5pm• Maximum three pages when printed (not
including optional figures and required reference list)
• Submit by posting in Revised Paper Proposal folder on CourseWorks
• Must be in a format I can read, which means pdf, word, powerpoint, html, plain ascii text (with all figures embedded or viewable in an ordinary browser)
Heads Up on Project
• Preliminary Proposal due Monday March 10th (note this is before the full paper)
• Optionally work in teams (see http://york.cs.columbia.edu/classes/cs6125/team_advice)
• Build a new system or extend an existing system – submit code, demo system
• OR evaluate/compare one or more existing system(s) – submit procedures and findings, show system(s)
• You may "continue" your paper topic towards the project, or do something entirely different
Heads Up on Presentation
• Individual ~10 talk in class during one of last few class sessions
• No proposal, just do it
• May be based on paper, project, or some other topic (in the case of team members all presenting on the same project, please coordinate to avoid redundancy and discuss your plans with the instructor in advance)
Reminders
• Class participation is important! (10% corresponds to a whole letter grade)
• Revised paper proposal due February 18th
• Preliminary project proposal due March 10th
• Paper must be individual, projects may optionally be done in teams