XML BASIC. Master's Degree in Computer Science (Laurea Magistrale in Informatica). Chapter 02 of the course module Technologies for Innovation


Agenda

Syntax: elements and attributes

XML Prolog

Examples

Additional Resources

DTD and XML Schema: introduction

Well-Formed and Valid Documents

Validation


Syntax of an XML document (I)

An XML document is a text file that contains a series of tags, attributes, and text arranged according to well-defined syntax rules.

An XML document is intrinsically hierarchical in structure.

It is made up of components called elements.

Each element represents a logical component of the document and can contain other elements (sub-elements) or text.


Syntax of an XML document (II)

Elements can have additional information associated with them that describes their properties. These pieces of information are called attributes.

The elements are organized hierarchically as a tree with one main element, called the root element, or simply the root.

The root contains all the other elements of the document. The structure of an XML document can be drawn as a tree, generally known as the document tree.


Document Tree Example (I)

[Figure: document tree. The root articolo (attribute titolo) has paragrafo children (attribute titolo), which contain testo, immagine (attribute file), and codice nodes.]

<?xml version="1.0" ?>
<articolo titolo="Title of the article">
  <paragrafo titolo="Title of the first paragraph">
    <testo> Text block of the first paragraph </testo>
    <immagine file="immagine1.jpg"> </immagine>
  </paragrafo>
  <paragrafo titolo="Title of the second paragraph">
    <testo> Text block of the second paragraph </testo>
    <codice> Code example </codice>
    <testo> Another block of text </testo>
  </paragrafo>
  <paragrafo tipo="bibliografia">
    <testo> Reference to an article </testo>
  </paragrafo>
</articolo>


Document Tree Example (Newspaper)

<newspaper>
  <section>
    <page>
      <article>
        <headline>XML 8 Announced</headline>
        <byline>Jan Doe</byline>
        <body>The W5C today announced...</body>
      </article>
      <ad>
        <client>Crazy Ed's Cars</client>
        <size>1/4 page</size>
        <run>2 weeks</run>
      </ad>
    </page>
  </section>
</newspaper>

The structure of the document reflects the structure of the newspaper: The newspaper contains sections, which in turn have pages, and on each page are articles and advertisements.


Trees and Relationships

As you can see from the preceding example, XML documents are structured as trees, and there are relationships that exist between the elements in an XML document.

For example, with these elements:

<newspaper>

<section>

</section>

</newspaper>

the <newspaper> element is the parent of the <section> element, and the <section> element is the child of the <newspaper> element.

These relationships become very important as you move into more advanced areas of XML, as you will use these relationships for navigating and locating information within the XML tree with technologies such as XPath.
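The parent and child relationship can be explored programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names (parent_of, children_of_root, parent_tag) are illustrative choices, not part of the slides.

```python
import xml.etree.ElementTree as ET

# The two-element fragment from the slide.
doc = ET.fromstring("<newspaper><section></section></newspaper>")

# ElementTree only stores links from parent to child,
# so build a child -> parent map once by walking the tree.
parent_of = {child: parent for parent in doc.iter() for child in parent}

section = doc.find("section")
children_of_root = [child.tag for child in doc]
parent_tag = parent_of[section].tag

print(children_of_root, parent_tag)
```

Iterating over an element yields its children, which is why the reverse map has to be built explicitly when the parent is needed.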


ELEMENTS

The bulk of actual data in your XML documents will be in the form of elements.

Elements are tag pairs, which are case sensitive, consisting of both a start tag, and an end tag.

The name of the element itself is called the element type, whereas within a document, when the element occurs it is referred to as an instance of the element.

<example>An Example Element</example>

The element type here is "example". The element itself is the entire string: the start tag, content, and end tag all together. The text contained between the tags is called the element content.


ELEMENTS: different types of content

PCData (text): When elements have PCData or text content, they do not contain any child elements, only text. "PCData" stands for "Parsed Character Data," which is simply data that is read by the XML parser.

Element: If an element has only child elements as its content:

<example><child>Some text...</child></example>

then the element is said to have element content.

Mixed: If an element has both text and element content:

<example>Text and <bold>emphasized</bold> text.</example>

then the element is said to have mixed content.
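As a rough illustration of the three content types, here is a sketch of a classifier written with Python's xml.etree.ElementTree. The function name content_type and the decision to ignore whitespace-only text are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def content_type(element):
    """Classify an element as 'empty', 'text', 'element', or 'mixed'."""
    has_children = len(element) > 0
    # Text can sit before the first child (.text) or after any child (.tail).
    chunks = [element.text] + [child.tail for child in element]
    # Assumption: whitespace-only text does not count as text content.
    has_text = any(chunk and chunk.strip() for chunk in chunks)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    return "text" if has_text else "empty"

for source in (
    "<example>An Example Element</example>",
    "<example><child>Some text...</child></example>",
    "<example>Text and <bold>emphasized</bold> text.</example>",
):
    print(content_type(ET.fromstring(source)))
```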


Empty Tags

There are instances where you might have an element that is empty, or does not contain any text or child elements.

If this is the case, you can write the element with both start and end tags:

<empty></empty>

However, there is also a shorthand that can be used for elements that do not have any content:

<empty/>
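A quick way to convince yourself that the two spellings are equivalent is to parse both and compare the results. This sketch uses Python's xml.etree.ElementTree; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<empty></empty>")
short_form = ET.fromstring("<empty/>")

# Neither form carries text or child elements.
same = (
    long_form.tag == short_form.tag
    and long_form.text is None and short_form.text is None
    and len(long_form) == 0 and len(short_form) == 0
)
print(same)
```

After parsing, nothing records which spelling was used, so both elements also serialize identically.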


ATTRIBUTES

Not all data in XML documents is stored in element content. Some information may be stored in attributes.

Attributes are simply a means for associating named values with elements.

HTML example: in the tag <img src="myimage.gif">, the src specification is an attribute.

Attributes are placed in the start-tag of the element, separated with a space. The content of the attribute is enclosed in quotation marks, either single or double, and an element can have any number of attributes, so long as each attribute name is unique.

<shirt size="medium"/>

<pants size="30">Bell Bottoms</pants>

As you can see, attributes can be used with empty elements or elements with text or mixed content as well.
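The rules for attributes can be checked with a short sketch in Python's xml.etree.ElementTree. The color attribute and the duplicated size attribute are hypothetical additions used only to show a default value and a uniqueness violation.

```python
import xml.etree.ElementTree as ET

shirt = ET.fromstring('<shirt size="medium"/>')
# Single quotes are accepted as well as double quotes.
pants = ET.fromstring("<pants size='30'>Bell Bottoms</pants>")

shirt_size = shirt.get("size")           # attribute on an empty element
pants_size = pants.attrib["size"]        # attribute values are always strings
missing = shirt.get("color", "unknown")  # hypothetical attribute, default returned

# Repeating an attribute name makes the document not well formed.
try:
    ET.fromstring('<shirt size="medium" size="large"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

print(shirt_size, pants_size, missing, duplicate_rejected)
```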


Structure: XML Declaration

Every XML document should begin with the XML Declaration, which takes the following form:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

The XML declaration always starts with "<?xml" and always ends with "?>".

version: The version attribute is required whenever the declaration is present. It tells the XML processor which version of XML was used to author the document. The value is almost always "1.0"; XML 1.1 also exists but is rarely used.

encoding: The encoding attribute specifies the character encoding of the document, for example "UTF-8", "UTF-16", or "ISO-8859-1". This attribute is not required; without it (and without other encoding information) a parser assumes UTF-8 or UTF-16.

standalone: The standalone attribute states whether the document depends on markup declarations that are external to it, such as an external DTD. The value "yes" means that no external declarations affect the document's content; "no" means that they may. This attribute is not required; when it is omitted and external markup declarations exist, "no" is assumed.
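To see what a parser actually reads from the declaration, the sketch below uses the XmlDeclHandler callback of Python's xml.parsers.expat module. The helper name read_declaration is an assumption made for this example; expat reports standalone as 1 for "yes", 0 for "no", and -1 when the attribute is omitted.

```python
import xml.parsers.expat

def read_declaration(data):
    """Return (version, encoding, standalone) from the XML declaration, or None."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once if the document starts with an XML declaration.
    parser.XmlDeclHandler = lambda version, encoding, standalone: found.append(
        (version, encoding, standalone))
    parser.Parse(data, True)
    return found[0] if found else None

full = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><journal/>')
minimal = read_declaration(b'<?xml version="1.0"?><journal/>')
absent = read_declaration(b"<journal/>")

print(full, minimal, absent)
```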


Structure: XML Prolog

The XML Prolog consists of two parts: the XML Declaration, which we have just discussed, and a DOCTYPE Declaration.

The DOCTYPE Declaration is used to associate an internal set of declarations with the document, or to link the XML document to an external DTD file for validation.

The XML Prolog is not required to work with well-formed XML; however, to work with XML that is valid against a DTD you will need to use the DOCTYPE declaration.


EXAMPLE (I)

In this example, we're going to create a simple XML document for a technical journal. This journal XML document will contain some elements that describe the cover, a table of contents, the journal articles themselves, and an index.

First, we need to start our XML document with the XML declaration and the root element:

<?xml version="1.0"?>
<journal></journal>

The XML declaration contains the mandatory version attribute, but because we are not going to do anything with special character sets or with validation, we can leave out the encoding and standalone attributes. By not specifying them, the default values will be used.

Next, we will create the element for the cover of our journal, and call it <cover>:

<cover art="photo.jpg"><slug>Learn the Secrets of XML</slug></cover>

The cover element has an art attribute, which is used to specify the cover art. The cover element has element content; that is, it contains another element, called slug, which contains the text for the slug, or teaser, that will appear on the cover. The slug element contains PCData content, which is just text.


EXAMPLE (II)

Next, we want to create the element for the table of contents. We'll call this element contents and, like the cover element, it will have element content, in the form of a title element. The title element will contain the title of the article as it would appear in the table of contents. The other piece of information we need in the table of contents is the page number on which the article appears, which we store in a page attribute of the title element:

<contents><title page="3">Authoring XML Documents</title></contents>

For the articles, we're going to use a number of elements to describe the article:

• article— This element will contain the child elements which contain the data for the article and its author.

• headline— This element is the headline of the article.

• byline— The byline for the author of the article.

• body— The text of the article.

The resulting XML code looks like this:

<article>
  <headline>Authoring XML Documents</headline>
  <byline>Joe Smith</byline>
  <body>So you want to work with XML...</body>
</article>


EXAMPLE (III)

Finally, we want to create an index to track references to technologies within the article. The index element will be used to store each reference that will appear in the index, and it will contain child elements for each reference. That reference element will also need to have a page number associated with it, so we can once again make use of a page attribute to track the page number of the reference.

The resulting XML code is as follows:

<index>

<reference page="4">XML Prolog</reference>

</index>


EXAMPLE complete listing

<?xml version="1.0"?>
<journal>
  <cover art="photo.jpg">
    <slug>Learn the Secrets of XML</slug>
    <slug>XSLT Transforms the Web</slug>
    <slug>Namespaces and Why You Need Them</slug>
  </cover>
  <contents>
    <title page="3">Authoring XML Documents</title>
    <title page="6">Transforming the Web with XSLT</title>
    <title page="9">What's in a Namespace?</title>
    <title page="12">Graphics and XML with SVG</title>
  </contents>
  <article>
    <headline>Authoring XML Documents</headline>
    <byline>Joe Smith</byline>
    <body>So you want to work with XML...</body>
  </article>
  <article>
    <headline>Transforming the Web with XSLT</headline>
    <byline>Jane Doe</byline>
    <body>XML can easily be turned into HTML...</body>
  </article>
  <article>
    <headline>What's in a Namespace?</headline>
    <byline>Jane Jones</byline>
    <body>When is a name not a name...</body>
  </article>
  <article>
    <headline>Graphics and XML with SVG</headline>
    <byline>Sally Smith</byline>
    <body>Drawing on the Web with SVG...</body>
  </article>
  <index>
    <reference page="4">XML Prolog</reference>
    <reference page="8">apply-templates</reference>
    <reference page="11">xmlns</reference>
    <reference page="15">SVG</reference>
  </index>
</journal>
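Once the journal is marked up this way, a program can pull structured information out of it. The sketch below works on a shortened copy of the listing (two titles and two articles only) with Python's xml.etree.ElementTree; the variable names toc and authors are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the journal listing, kept small for the example.
JOURNAL = """<?xml version="1.0"?>
<journal>
  <contents>
    <title page="3">Authoring XML Documents</title>
    <title page="6">Transforming the Web with XSLT</title>
  </contents>
  <article>
    <headline>Authoring XML Documents</headline>
    <byline>Joe Smith</byline>
  </article>
  <article>
    <headline>Transforming the Web with XSLT</headline>
    <byline>Jane Doe</byline>
  </article>
</journal>"""

journal = ET.fromstring(JOURNAL)

# Table of contents: page attribute plus the text content of each title.
toc = [(int(title.get("page")), title.text) for title in journal.find("contents")]

# Map each headline to its byline.
authors = {article.findtext("headline"): article.findtext("byline")
           for article in journal.iter("article")}

for page, title in toc:
    print(page, title, "by", authors[title])
```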


Syntax summary (I)

The XML prolog (the XML declaration) should open every XML document:

<?xml version="1.0" ?>

Every XML document must contain a single top-level element (the root) that contains all the other elements of the document.

Every element must have a closing tag; empty elements may use the abbreviated form (/>).

Elements must be properly nested, that is, closing tags must appear in the reverse order of their opening tags.

XML is case-sensitive.

Attribute values must always be enclosed in single or double quotes.


Syntax summary (II)

Violating any one of these rules means that the resulting document is not considered well formed. Although the rules may look simple, they require close attention when working in a plain text editor. Code such as

<articolo titolo=test>...</Articolo>

will cause problems (the attribute value is unquoted and the closing tag differs in case), and the same goes for situations like the following, where the closing tags are in the wrong order:

<paragrafo><testo>abcdefghi...</paragrafo></testo>
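Any XML parser will reject the two fragments above. As a minimal sketch, the helper below (is_well_formed is a name chosen for this example) wraps Python's xml.etree.ElementTree parser and reports whether a string is well formed; the fixed variant is an assumed correction that combines both fragments.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

unquoted_and_case = "<articolo titolo=test>...</Articolo>"
bad_nesting = "<paragrafo><testo>abcdefghi...</paragrafo></testo>"
# Assumed correction: quoted value, matching case, properly nested tags.
fixed = ('<articolo titolo="test"><paragrafo><testo>abcdefghi...'
         "</testo></paragrafo></articolo>")

for sample in (unquoted_and_case, bad_nesting, fixed):
    print(is_well_formed(sample))
```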


Syntax summary (III)

The text enclosed by the root tags may contain an arbitrary number of XML elements.

The basic syntax for one element is:

<element_name attribute_name="attribute_value">Element Content</element_name>

The two instances of "element_name" are referred to as the start-tag and end-tag, respectively.

"Element Content" is some text which may again contain XML elements.

So, a generic XML document contains a tree-based data structure.


Recipe Data Structure

<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
  <title>Basic bread</title>
  <ingredient amount="8" unit="dL">Flour</ingredient>
  <ingredient amount="10" unit="grams">Yeast</ingredient>
  <ingredient amount="4" unit="dL" state="warm">Water</ingredient>
  <ingredient amount="1" unit="teaspoon">Salt</ingredient>
  <instructions>
    <step>Mix all ingredients together.</step>
    <step>Knead thoroughly.</step>
    <step>Cover with a cloth, and leave for one hour in warm room.</step>
    <step>Knead again.</step>
    <step>Place in a bread baking tin.</step>
    <step>Cover with a cloth, and leave for one hour in warm room.</step>
    <step>Bake in the oven at 180°C for 30 minutes.</step>
  </instructions>
</recipe>


Syntax summary (IV)

The choice of tag names must also follow some rules. A tag name can start with a letter or an underscore (_) and can contain letters, digits, periods, underscores (_), and hyphens (-). Spaces and other characters are not allowed.

It may be necessary to insert into an XML document special characters that could make it not well formed. For example, text that contains the < symbol risks being interpreted as the start of a new tag, as in the following example:

<testo> the symbol < means less than</testo>


Entity references

In markup languages, a character entity reference is a reference to a particular kind of named entity that has been predefined or explicitly declared in a Document Type Definition (DTD).

The replacement text of the entity consists of a single character from the Universal Character Set/Unicode. The purpose of a character entity reference is to provide a way to refer to a character that is not universally encodable.

Strictly speaking, XML has two relevant concepts:

a "predefined entity reference" is a reference to one of the special characters <, >, &, ", or ', written &lt;, &gt;, &amp;, &quot;, and &apos; respectively;

while a "character reference" (or "numeric character reference") is a construct such as &#38; or &#x26; that refers to a character by means of its numeric Unicode code point.


Entity Reference Examples (I)

Here is an example using a predeclared XML entity to represent the ampersand in the name "AT&T":

<company_name>AT&amp;T</company_name>

In the same way, &lt; escapes the < symbol:

<testo> the symbol &lt; means less than</testo>

An example of a numeric character reference is "&#x20AC;", which refers to the Euro symbol by means of its Unicode code point in hexadecimal.
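Escaping by hand is error-prone, so programs usually delegate it to a library. The sketch below uses escape from Python's xml.sax.saxutils to produce the predefined entity references and then parses the result back; the sample sentence and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "the symbol < means less than"
escaped = escape(raw)  # replaces &, < and > with entity references

# The escaped text can now be embedded safely as element content.
element = ET.fromstring("<testo>" + escaped + "</testo>")
company = ET.fromstring("<company_name>" + escape("AT&T") + "</company_name>")

print(escaped)
print(element.text)
print(company.text)
```

The parser turns the references back into the original characters, so the application never sees the escaped form.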


Entity references DTD declaration

Additional entities (beyond the predefined ones) can be declared in the document's Document Type Definition (DTD). Declared entities can describe single characters or pieces of text, and can reference each other.

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE example [

<!ENTITY copy "&#xA9;">

<!ENTITY copyright-notice "Copyright &copy; 2006, XYZ Enterprises">

]>

<example>

&copyright-notice;

</example>

When viewed in a suitable browser, the XML document above appears as:

Copyright © 2006, XYZ Enterprises
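The expansion can also be observed outside a browser. This sketch parses the same document (without the XML declaration, for brevity) with Python's xml.etree.ElementTree, whose underlying expat parser expands entities declared in the internal DTD subset. Expanding entities from untrusted input can be abused, so treat this as an illustration rather than a recommended setup.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE example [
  <!ENTITY copy "&#xA9;">
  <!ENTITY copyright-notice "Copyright &copy; 2006, XYZ Enterprises">
]>
<example>&copyright-notice;</example>"""

example = ET.fromstring(DOCUMENT)
# The nested &copy; reference is expanded along with &copyright-notice;.
notice = example.text
print(notice)
```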


Numeric character references

Numeric character references look like entity references, but instead of a name, they contain the "#" character followed by a number.

The number (in decimal or "x"-prefixed hexadecimal) represents a Unicode code point.

They have typically been used to represent characters that are not easily encodable, such as an Arabic character in a document produced on a European computer.

The ampersand in the "AT&T" example could also be escaped like this (decimal 38 and hexadecimal 26 both represent the Unicode code point for the "&" character):

<company_name>AT&#38;T</company_name>

<company_name>AT&#x26;T</company_name>

Similarly, in the previous example, notice that "&#xA9;" is used to generate the "©" symbol.
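The equivalence of the decimal, hexadecimal, and named forms can be checked with a short sketch in Python. The helper numeric_reference is a hypothetical function written for this example; it builds the reference for a single character from its Unicode code point.

```python
import xml.etree.ElementTree as ET

# All three spellings decode to the same text.
from_decimal = ET.fromstring("<company_name>AT&#38;T</company_name>").text
from_hex = ET.fromstring("<company_name>AT&#x26;T</company_name>").text
from_named = ET.fromstring("<company_name>AT&amp;T</company_name>").text

def numeric_reference(char, hexadecimal=False):
    """Build the numeric character reference for a single character."""
    code_point = ord(char)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"

print(from_decimal, from_hex, from_named)
print(numeric_reference("&"), numeric_reference("&", hexadecimal=True))
```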


CDATA SECTION

In some situations many characters have to be replaced with entities, which risks making the text unreadable to a human being. Consider a block of text that itself illustrates XML code:

<codice> &lt;libro&gt; &lt;capitolo&gt; &lt;/capitolo&gt; &lt;/libro&gt; </codice>

In this case, instead of replacing every occurrence of the special symbols with the corresponding entities, you can use a CDATA section.


Character Data

A CDATA (Character DATA) section is a block of text that is always treated as text, even if it contains XML code or other special characters. To mark a CDATA section, enclose it between the character sequences <![CDATA[ and ]]>. Our example becomes:

<codice> <![CDATA[ <libro> <capitolo> </capitolo> </libro> ]]></codice>
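From the application's point of view a CDATA section and entity-escaped text carry the same content. The sketch below, using Python's xml.etree.ElementTree with whitespace removed from the example for easier comparison, parses both spellings and compares the resulting text.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<codice><![CDATA[<libro><capitolo></capitolo></libro>]]></codice>")
with_entities = ET.fromstring(
    "<codice>&lt;libro&gt;&lt;capitolo&gt;&lt;/capitolo&gt;&lt;/libro&gt;</codice>")

# The markup inside the CDATA section is plain text, not child elements.
print(with_cdata.text)
print(len(with_cdata))
print(with_cdata.text == with_entities.text)
```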


Comments

Comments can be placed almost anywhere in the tree, including in the text if the content of the element is text or #PCDATA. They cannot appear inside a tag or before the XML declaration.

XML comments start with <!-- and end with -->.

Two consecutive dashes (--) may not appear anywhere in the text of the comment.




Processing Instruction (PI)

XML provides the processing instruction as an alternative means of passing information to particular applications that may read the document.

A processing instruction begins with <? and ends with ?>.

Immediately following the <? is an XML name called the target, possibly the name of the application for which this processing instruction is intended or possibly just an identifier for this particular processing instruction.

The rest of the processing instruction contains text in a format appropriate for the applications for which the instruction is intended.


PI EXAMPLE

The most common processing instruction, xml-stylesheet, is used to attach stylesheets to documents.

It must appear in the prolog, before the root element.

In this example, the xml-stylesheet processing instruction tells browsers to apply the CSS stylesheet person.css to this document before showing it to the reader.

<?xml-stylesheet href="person.css" type="text/css"?>

<person>

Alan Turing

</person>
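A program can read the target and the data of a processing instruction separately. The following sketch uses Python's xml.dom.minidom, which keeps processing instructions as nodes of the document; the variable names are illustrative.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<?xml-stylesheet href="person.css" type="text/css"?>'
    "<person>Alan Turing</person>")

# Processing instructions in the prolog are children of the document node.
instructions = [node for node in document.childNodes
                if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]

target = instructions[0].target  # the XML name right after "<?"
data = instructions[0].data      # everything else, as uninterpreted text

print(target)
print(data)
```

The data looks like a list of attributes, but the parser hands it over as a single string; interpreting it is left to the application named by the target.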


SCHEMAS

Example: EndML, a system for cataloguing endangered species. Elements: Animal, Subspecies, Population, Threats.

1. Documents are not written in XML itself.

2. XML is used to create specific, custom markup languages (XML applications).

3. Documents are written in those languages.

A specific language is defined by stating which elements and attributes are allowed or required in a conforming document.

This set of rules is the document schema.


DTD & XML Schema

Both define rules for producing structured documents.

A DTD (Document Type Definition) contains the definitions of element types, attributes, entities, and notations. A DTD declares which elements, types, entities, and notations are legal, and in which part of the document they are legal.

XML Schema is the successor of DTDs. It is itself based on XML and provides a more powerful alternative to DTDs.


Document Type Definition (DTD)

The DTD is the oldest schema format for XML, inherited from SGML.

It has no support for newer features of XML, most importantly namespaces.

It lacks expressiveness: certain formal aspects of an XML document cannot be captured in a DTD.

It uses a custom non-XML syntax to describe the schema.

DTDs are still used in many applications because they are often considered the easiest to read and write.


XML Schema

XML Schema is the XML schema language described by the W3C as the successor of DTDs.

Its initialism is XSD (XML Schema Definition).

XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system, allow for more detailed constraints on an XML document's logical structure, and must be processed in a more robust validation framework.

XSDs also use an XML-based format, which makes it possible to use ordinary XML tools to help process them, although XSD implementations require much more than just the ability to read XML.


Well-formed and valid XML documents: two levels of correctness

Well-formed. A well-formed document conforms to all of XML's syntax rules. For example, if a start-tag appears without a corresponding end-tag, the document is not well-formed. A document that is not well-formed is not considered to be XML; a conforming parser must report the error and is not allowed to continue processing it normally.

Valid. A valid document is well-formed and additionally conforms to some semantic rules. These rules are either user-defined or supplied as an XML schema, most commonly a DTD. For example, if a document contains an element that the schema does not declare, then it is not valid; a validating parser must report the violation.


Validating and non-validating parsers

The heart of an XML application is the parser, the module that reads the XML document and builds an internal representation of it for later processing (such as display).

A validating parser, when a DTD is present, can check the validity of the document and report any markup errors it finds.

A non-validating parser, even when a DTD is present, can only check that the document is syntactically well formed.

A non-validating parser is much simpler and faster to write, but it performs fewer checks. In some applications, however, documents do not need to be validated; checking that they are well formed is enough.


Validation of the markup language

[Figure: two processing flows. In the first, an XML document and the XML schema linked to it go through a validating parser: the document is valid if it conforms to all the rules. In the second, an XML document and the XML schema linked to it go through a non-validating parser: the document is well formed if it is syntactically correct.]


Well-formed verification: book markup

[Figure: tree of the book markup. The root book has the nodes title, author (first name, last name), preface, chapter 1, chapter 2, appendix, and media.]

<?xml version = "1.0"?>



<book isbn = "999-99999-9-X">

<title> XML Primer</title>

<author>

<firstName>Paul</firstName>

<lastName>Deitel</lastName>

</author>

<chapters>

<preface num = "1" pages = "2">Welcome</preface>

<chapter num = "1" pages = "4">Easy XML</chapter>

<chapter num = "2" pages = "2">XML Elements</chapter>

<appendix num = "1" pages = "9">Entities</appendix>

</chapters>

<media type = "CD"/>

</book>


Well-formed verification: book markup (II)

In this version the </chapters> end tag is missing, so the document is not well formed.




<book isbn = "999-99999-9-X">

<title> XML Primer</title>

<author>

<firstName>Paul</firstName>

<lastName>Deitel</lastName>

</author>

<chapters>

<preface num = "1" pages = "2">Welcome</preface>

<chapter num = "1" pages = "4">Easy XML</chapter>

<chapter num = "2" pages = "2">XML Elements</chapter>

<appendix num = "1" pages = "9">Entities</appendix>



<media type = "CD"/>

</book>


Book markup with output obtained through a stylesheet

Applying the stylesheet Usage.xsl to Usage.xml produces the output.

Processing instruction (PI):

<?xml-stylesheet type="text/xsl" href="usage.xsl"?>

1. <? and ?> delimit the PI

2. Target (xml-stylesheet)

3. Value: type="text/xsl" href="usage.xsl"


Validation analysis

DTD document intro.dtd

<!ELEMENT myMessage ( message )>

Declares the element myMessage (the root in this example) with exactly one child named message.

<!ELEMENT message ( #PCDATA )>

Declares that the element message must contain character data that the XML parser parses.

XML document intro.xml

<!DOCTYPE myMessage SYSTEM "intro.dtd">

Document prolog: the !DOCTYPE document type declaration.

myMessage: name of the document type (the name of the root element).

SYSTEM: the declaration is external to the document and is located at the URL intro.dtd.

<myMessage>

<message>Welcome to XML!</message>

</myMessage>







<!DOCTYPE myMessage SYSTEM "intro.dtd">



<myMessage>

</myMessage>


Additional Resources (I)

XML 1.0 Recommendation (http://www.w3.org/TR/REC-xml): The XML 1.0 Recommendation (Second Edition) from the W3C is the final word on XML. If you have a question about a technical aspect of XML, this should be the first source you consult.

Annotated XML Recommendation (http://www.xml.com/axml/testaxml.htm): The Annotated XML Recommendation is an excellent resource for making sense of the sometimes difficult-to-read XML Recommendation. Written by Tim Bray (one of the XML 1.0 Editors), the Annotated XML Recommendation provides some clarification on confusing areas of the Rec, and offers some historical tidbits as well.


Additional Resources (II)

XML-DEV ([email protected]): The XML-DEV mailing list is a good resource for developers actively working with XML. Discussion ranges from Recommendation debates to practical tips. To subscribe, send an e-mail to the address with "subscribe" in the body of the message.

comp.text.xml: The comp.text.xml USENET Newsgroup can also be a great resource for interacting with other XML developers.

The XML FAQ (http://www.ucc.ie/xml/): The XML Frequently Asked Questions can address some issues such as why XML is structured the way it is, and when it might be appropriate to use XML as a solution.


Additional Resources (III)

XML.com (http://www.xml.com): XML.com is a commercial Web site dedicated to tracking and reporting on XML and XML-related issues. The site covers not only XML 1.0, but also any and all related activities, and can be a great source of tutorials, articles, and general XML information.

XML.org (http://www.xml.org): XML.org is another commercial site, billing itself as the industry portal for XML. The site features the XML Cover Pages, which is Robin Cover's news column tracking developments in SGML/XML.


PARSER

AltovaXML: free parser from Altova, also included in XMLSpy, MapForce, and StyleVision.

RomXML: embedded XML commercial toolkit written in ANSI C.

XDOM: open-source XML parser (and DOM and XPath implementation) in Delphi/Kylix.

XML resources at the Open Directory Project:

TinyXml: simple and small C++ XML parser.

FoX: fully validating XML parser library, written in Fortran.

Intel_XSS: XML parsing, validation, XPath, XSLT.

sw8t.xml: lightweight, high-performance, intuitive JavaScript XML parser. Includes API docs and developer's guide.