XML for Information Management – Day 4

Airi Salminen

University of Erlangen-Nuremberg
Computational Linguistics

Instructor: Professor Airi Salminen
http://users.jyu.fi/~airi/

26.4.-30.4.2010



Outline

1. Entity types
2. Entity declarations and references
3. XML processor treatment of entity references
4. Motivations for the use of entities
5. XML family of languages


1. Entity types

The physical structure of an XML document consists of entities.

An entity is a unit recognized by the XML processor. The content of an entity is text or some other kind of data.


Three-dimensional categorization:

• parsed entities vs. unparsed entities
• internal entities vs. external entities
• general entities vs. parameter entities


parsed entity
• intended to be parsed by the XML processor
• content consists of marked-up text

unparsed entity
• not intended to be parsed by the XML processor
• content can be any kind of data


internal entity
• name and value given in an entity declaration
• always a parsed entity

external entity
• not internal
• parsed or unparsed


general entity
• used in elements and attributes
• parsed or unparsed
• internal or external

parameter entity
• used in the document type definition
• always parsed
• internal or external


Alternatives

parsed:
• internal parameter
• internal general
• external parameter
• external general

unparsed:
• external general


INPUT FILES for XML processing:
• root entity, external subset of DTD
• other files intended for XML processing

INTERNAL ENTITIES:
• name and textual content given in DTD

UNPARSED ENTITIES:
• files not intended for XML processing but referred to by entity references in the INPUT FILES

The XML processor reads the input files and passes to the application information about:
• elements and attributes
• comments
• processing instructions
• character data
• namespaces
• notations and locations of unparsed entities


2. Entity declarations and references

EntityDecl ::= GEDecl | PEDecl

GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'

PEDecl ::= '<!ENTITY' S '%' Name S PEDef S? '>'

EntityDef ::= EntityValue | ( ExternalID NDataDecl?)

PEDef ::= EntityValue | ExternalID

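In these productions, EntityValue is the entity definition for an internal entity, and ExternalID (followed by an optional NDataDecl in a general entity declaration) is the entity definition for an external entity.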


internal entity

name and value ( = literal value) given

<!ENTITY % Shape "(rect | circle | poly | default )">

<!ENTITY JY "Jyväskylän yliopisto">

In each declaration the name (Shape, JY) is followed by the literal value in quotes.


external entity

name and system identifier (possibly together with public identifier) given, for an unparsed entity also notation

<!ENTITY virtuaaliyliopistouutiset SYSTEM "http://virtuaaliyliopisto.jyu.fi/kotisivut/sisalto/etusivu/newsfeed.xml">

Declarations from XHTML specification:

<!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN"
  "xhtml-symbol.ent">
<!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN"
  "xhtml-special.ent">

http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html


Unparsed entity

<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>

Here gif is the notation name. The notation must have been declared, for example:

<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN" >
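The three dimensions of the categorization can be read off each declaration mechanically. The following Python sketch is an illustration only, not part of the course material. It feeds a small DTD to the expat parser in Python's standard library and classifies every entity from what the EntityDeclHandler callback reports. The declarations are adapted from the slides; the doc element, the shortened news entity and the SYSTEM form of the notation are assumptions made for brevity.

```python
import xml.parsers.expat

# Declarations adapted from the slides; <doc/>, "news" and the SYSTEM
# notation identifier are illustrative assumptions.
DOCUMENT = """<!DOCTYPE doc [
<!ENTITY % Shape "(rect | circle | poly | default)">
<!ENTITY JY "Jyväskylän yliopisto">
<!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" "xhtml-symbol.ent">
<!ENTITY news SYSTEM "newsfeed.xml">
<!NOTATION gif SYSTEM "image/gif">
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
]>
<doc/>"""


def classify_entities(document):
    """Return {name: (parsed/unparsed, internal/external, general/parameter)}."""
    kinds = {}

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation_name):
        # A notation name is reported only for NDATA (unparsed) entities.
        parsing = "unparsed" if notation_name is not None else "parsed"
        # A literal value is reported only for internal entities.
        storage = "internal" if value is not None else "external"
        scope = "parameter" if is_parameter else "general"
        kinds[name] = (parsing, storage, scope)

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(document, True)
    return kinds


if __name__ == "__main__":
    for name, kind in classify_entities(DOCUMENT).items():
        print(name, kind)
```

The external entities are only declared here, never referenced, so the parser does not try to fetch them.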


References to parameter entities:

%Shape;
%HTMLsymbol;

References to parsed general entities:

&JY;
&virtuaaliyliopistouutiset;

Reference to an unparsed general entity:

<poem image="image1">

The type of the attribute has to be ENTITY or ENTITIES.
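To see a reference to a parsed general entity being resolved, the JY declaration from the slides can be placed in an internal DTD subset and parsed with Python's standard xml.etree.ElementTree module. This is a minimal sketch; the university element and its name attribute are assumptions introduced for the example.

```python
import xml.etree.ElementTree as ET

# The JY declaration is the one from the slides; the <university> element
# and its "name" attribute are illustrative assumptions.
DOCUMENT = """<!DOCTYPE university [
<!ENTITY JY "Jyväskylän yliopisto">
]>
<university name="&JY;">&JY;</university>"""

root = ET.fromstring(DOCUMENT)

# The parser has already replaced both references by the entity's value.
element_text = root.text
attribute_value = root.get("name")

if __name__ == "__main__":
    print(element_text)
    print(attribute_value)
```

The application sees only the expanded text; the reference itself is not visible in the parsed tree.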


In addition to entity references, XML documents may contain character references.

A character reference:
• refers to a specific Unicode character
• provides a decimal or hexadecimal representation of the character's code point in Unicode

Example: &#34;

A one-character entity defined with a character reference: <!ENTITY quot "&#34;">
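The decimal and hexadecimal forms name the same code point. A short Python check, given as a sketch with a hypothetical sample element, parses both forms together with the quot entity declared above (quot is also one of XML's predefined entities, so the reference would resolve even without the declaration):

```python
import xml.etree.ElementTree as ET

# <sample> and its attribute names are illustrative assumptions.
DOCUMENT = """<!DOCTYPE sample [
<!ENTITY quot "&#34;">
]>
<sample decimal="&#34;" hex="&#x22;">&quot;</sample>"""

root = ET.fromstring(DOCUMENT)

decimal_char = root.get("decimal")   # from the decimal reference &#34;
hex_char = root.get("hex")           # from the hexadecimal reference &#x22;
entity_char = root.text              # from the entity reference &quot;

if __name__ == "__main__":
    print(decimal_char, hex_char, entity_char, ord(entity_char))
```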


Where can an entity or character reference occur?

Reference to | Can occur in
parameter entity | document type definition
parsed general entity | element content; attribute value (either in the start-tag or in the attribute definition); entity value
unparsed general entity | attribute value (either in the start-tag or in the attribute definition)
character | element content; attribute value (either in the start-tag or in the attribute definition); entity value


3. XML processor treatment of entity references

References to unparsed entities

A validating processor makes the identifiers for the entities and associated notations available to the application.

<poem image="figure1">
  <!-- From a poem of Aale Tynni -->
  <line>Seisoin ikkunassa ja nauroin. Ihana puu.</line>
  <line>Ihana pesä.</line>
</poem>
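What "making the identifiers available" can look like from the application side is sketched below in Python. The expat parser in the standard library is non-validating, but it reports notation, entity and attribute-list declarations through handlers, so the application can do the lookup itself. The DTD is assembled from the declarations shown earlier (entity image1, notation gif); the ATTLIST line and the shortened poem are assumptions added to make the document complete, and resolve_unparsed is a hypothetical helper name.

```python
import xml.parsers.expat

# The ATTLIST declaration and the trimmed poem are illustrative assumptions.
DOCUMENT = """<!DOCTYPE poem [
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
<!ATTLIST poem image ENTITY #IMPLIED>
]>
<poem image="image1"><line>Ihana pesä.</line></poem>"""


def resolve_unparsed(document):
    """Map (element, attribute) pairs of type ENTITY to entity and notation identifiers."""
    notations = {}        # notation name -> (system id, public id)
    entities = {}         # unparsed entity name -> (system id, notation name)
    entity_attrs = set()  # (element, attribute) pairs declared with type ENTITY
    resolved = {}

    def on_notation(name, base, system_id, public_id):
        notations[name] = (system_id, public_id)

    def on_entity(name, is_parameter, value, base,
                  system_id, public_id, notation_name):
        if notation_name is not None:  # only NDATA entities are unparsed
            entities[name] = (system_id, notation_name)

    def on_attlist(element, attribute, attr_type, default, required):
        if attr_type == "ENTITY":
            entity_attrs.add((element, attribute))

    def on_start(element, attributes):
        for attribute, value in attributes.items():
            if (element, attribute) in entity_attrs and value in entities:
                system_id, notation_name = entities[value]
                resolved[(element, attribute)] = {
                    "entity": value,
                    "entity_system_id": system_id,
                    "notation": notation_name,
                    "notation_ids": notations.get(notation_name),
                }

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.EntityDeclHandler = on_entity
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return resolved


if __name__ == "__main__":
    print(resolve_unparsed(DOCUMENT))
```

The sketch handles only attributes of type ENTITY; an attribute of type ENTITIES would additionally need its value split on whitespace.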


References to parsed entities

Dealing with two kinds of entity values:

literal value - the character string written between quotes in the entity definition

replacement text - derived by replacing the character references and parameter entity references in the literal value by their character values and replacement texts, respectively.

The XML processor replaces the entity reference by its replacement text.


entity declaration:

<!ENTITY rhyme1 "<rhyme xml:lang="fi"><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">

The XML processor is not able to parse this! The literal value is delimited by double quotes, so the double quotes around fi end it prematurely.


entity declaration:

<!ENTITY rhyme1 "<rhyme><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">

entity reference:

<rhymecollection>&rhyme1; </rhymecollection>

replacement text = literal value:

<rhyme><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>


entity declaration with character references:

<!ENTITY rhyme1 "<rhyme xml:lang=&#34;fi&#34;><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">

literal value:

<rhyme xml:lang=&#34;fi&#34;><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>

replacement text:

<rhyme xml:lang="fi"><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>

entity reference:

<rhymecollection>&rhyme1; </rhymecollection>
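The effect of the character references can be checked with Python's standard xml.etree.ElementTree module. This is a sketch rather than a description of any particular processor's internals: the rhyme1 declaration is placed in an internal DTD subset, and after parsing the rhymecollection element contains a real rhyme element whose xml:lang attribute has the value fi.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

DOCUMENT = """<!DOCTYPE rhymecollection [
<!ENTITY rhyme1 "<rhyme xml:lang=&#34;fi&#34;><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">
]>
<rhymecollection>&rhyme1; </rhymecollection>"""

root = ET.fromstring(DOCUMENT)

# The reference has been replaced by the replacement text and parsed as markup.
rhyme = root.find("rhyme")
language = rhyme.get(XML_NS + "lang")
lines = [line.text for line in rhyme.findall("line")]

if __name__ == "__main__":
    print(language)
    print(lines)
```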


Declarations from XHTML specification:

<!ENTITY % StyleSheet "CDATA"> <!-- style sheet data -->
<!ENTITY % Text "CDATA"> <!-- used for titles etc. -->
<!ENTITY % coreattrs
 "id ID #IMPLIED
  class CDATA #IMPLIED
  style %StyleSheet; #IMPLIED
  title %Text; #IMPLIED">

http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html

literal value of coreattrs:
id ID #IMPLIED class CDATA #IMPLIED style %StyleSheet; #IMPLIED title %Text; #IMPLIED

replacement text of coreattrs:
id ID #IMPLIED class CDATA #IMPLIED style CDATA #IMPLIED title CDATA #IMPLIED


Exercise

Entity declaration from XHTML Strict-DTD:

<!ENTITY % Block " (%block; | form | %misc; )*">

What is the (a) literal value and (b) replacement text of entity Block?

(a) literal value: (%block; | form | %misc; )*


Other entity declarations needed from the DTD:

Declarations from XHTML specification:

<!ENTITY % heading "h1| h2| h3| h4| h5| h6">
<!ENTITY % lists "ul | ol | dl">
<!ENTITY % blocktext "pre | hr | blockquote | address">
<!ENTITY % block "p | %heading; | div | %lists; | %blocktext; | fieldset | table">
<!ENTITY % misc.inline "ins | del | script">
<!ENTITY % misc "noscript | %misc.inline;">

http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html


Deriving the replacement text of Block: references to parameter entities in the literal value (%block; | form | %misc;)* are replaced by their replacement texts.

Literal value of block:
p | %heading; | div | %lists; | %blocktext; | fieldset | table

Replacement text of block:
p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table

Literal value of misc:
noscript | %misc.inline;

Replacement text of misc:
noscript | ins | del | script

Replacement text of Block:
(p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table | form | noscript | ins | del | script )*
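The derivation is a recursive substitution, which the following Python sketch mimics on plain strings. It is not an XML processor: replacement_text and the LITERALS table are hypothetical names, the regular expression only approximates the XML Name production, and character references are not handled. The literal values are those of the XHTML declarations quoted above.

```python
import re

# Simplified pattern for a parameter entity reference such as %misc.inline;
PE_REFERENCE = re.compile(r"%([A-Za-z_:][\w.:\-]*);")

# Literal values of the XHTML parameter entities quoted in the slides.
LITERALS = {
    "heading": "h1| h2| h3| h4| h5| h6",
    "lists": "ul | ol | dl",
    "blocktext": "pre | hr | blockquote | address",
    "block": "p | %heading; | div | %lists; | %blocktext; | fieldset | table",
    "misc.inline": "ins | del | script",
    "misc": "noscript | %misc.inline;",
    "Block": "(%block; | form | %misc;)*",
}


def replacement_text(name, literals, _active=()):
    """Replace every %name; in the literal value by that entity's replacement text."""
    if name in _active:
        raise ValueError("entity refers to itself: " + name)

    def expand(match):
        return replacement_text(match.group(1), literals, _active + (name,))

    return PE_REFERENCE.sub(expand, literals[name])


if __name__ == "__main__":
    print(replacement_text("misc", LITERALS))
    print(replacement_text("Block", LITERALS))
```

The _active tuple tracks the chain of entities currently being expanded, so a declaration that refers to itself directly or indirectly raises an error instead of recursing forever.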


4. Motivations for the use of entities

The use of entities supports:

• use of non-textual data (audio, graphics, etc.) in XML documents (such data can also be added in stylesheets)
• modularization of documents
• consistency
• multiple use of definitions
• adding semantic information through informative entity names and comments attached to entity declarations


5. XML family of languages

The specification of XML 1.0 was just the first step in the development of languages for the management of data on the Web.

‣ W3C (World Wide Web Consortium) develops specifications to support the use of the Web; the specifications are publicly available at http://www.w3.org/TR/
‣ Development is systematic
‣ The development process is specified and published


Phases of the W3C development process

‣ Working Draft: represents work in progress.
‣ Candidate Recommendation: has received significant review from its immediate technical community, explicit call for implementation and technical feedback.
‣ Proposed Recommendation: represents consensus in the development group, proposed to the Advisory Committee for review.
‣ Recommendation: represents consensus within W3C, widespread implementation encouraged.


XML family = XML + XML-related languages

A. Salminen, XML Family of Languages. Overview and Classification. http://users.jyu.fi/~airi/xmlfamily.html


XML-related languages fall into the following categories:

• XML accessory: intended for wide use to extend the capabilities of XML
• XML transducer: intended for transducing some input XML data into some output form
• XML application: intended for some special application domain, defines constraints for XML data on the domain


XML Accessory

• additional rules extending the capabilities specified in XML
• intended for wide use
• development primarily at W3C
• for realizing the modularization principle of W3C: keep XML itself small and as stable as possible
• most important: XML Names, XML Schema, XPath, XLink


W3C Recommendations for XML Accessories:

Language | Purpose | Recommendation
XML Names | Qualifying element and attribute names | 1999, 2004, 2006, 2009
XML Stylesheet | Associating style sheets with an XML document | 1999
XPath | Addressing parts of XML documents | 1999, 2007
XML Schema | Constraining a class of XML documents | 2001, 2004
XLink | To create and describe links | 2001
XML Base | A base URI service | 2001
XPointer | Fragment identifiers especially for URI references | 2003
xml:id | Attribute xml:id in XML documents | 2005
ITS | Mechanism to support internationalization and localization of content | 2007


XML Transducer

• to convert XML input data (a document, part of document, a set of documents) into output
• associated with a processing model
• active development at W3C
• most important: CSS, XSL, XSLT, XQuery


W3C Recommendations for XML Transducers:

Language | Purpose | Recommendation
CSS | Rendering | (1996), 1998
XSLT | Transformation | 1999, 2007
Canonical XML | Canonicalization | 2001, 2002
XSL | Rendering | 2001, 2006
XInclude | Merging | 2004, 2006
XQuery | Querying | 2007


XML Application

• defines constraints for a class of XML data on a particular application domain
• usually defined by a DTD or some other schema language
• development work both at W3C and outside
• examples from W3C: SMIL, RDF, XHTML


XML Applications developed at W3C for:

• Non-textual Data
• Web Publishing
• Metadata and Semantic Web
• Web Communication and Services


W3C Recommendations for non-textual data:

Language | Purpose | Recommendation
SMIL (Synchronized Multimedia Integration Language) | Integrating a set of independent multimedia objects into a synchronized multimedia presentation | 1998, 2001, 2005
MathML (Mathematical Markup Language) | Mathematical notation, especially for enabling the encoding of mathematical material for the Web | 1999, 2001, 2003
Ruby Annotation | Markup for ruby, short annotations alongside the base text typically used in East Asian documents | 2001
SMIL Animation | Animation functionality in XML documents | 2001
SVG | To describe two-dimensional vector and mixed vector/raster graphics | 2001, 2003
VoiceXML (Voice Extensible Markup Language) | To describe audio dialogs and thus support interactive voice response applications on the Web | 2004, 2007
SSML (Speech Synthesis Markup Language) | To assist generation of synthetic speech in Web and other applications | 2004
EMMA (Extensible MultiModal Annotation markup language) | To enable Web access using multimodal interfaces | 2009


W3C Recommendations for Web publishing:

Language | Purpose | Recommendation
XHTML | Reformulation of HTML 4.0 in XML specified by three document types: Strict, Transitional, Frameset | 1999, 2000, 2002
XHTML Modularization | Defining XHTML elements and attributes in a set of modules | 2001
XHTML Basic | The minimal core of XHTML | 2000
XML Events | To represent asynchronous occurrences, such as mouse clicks, in XHTML or in other XML markup | 2003
XForms | For Web forms allowing online interaction between human users and software, to be used in XHTML or in other XML markup | 2003, 2006
XHTML-Print | Simple XHTML suitable for printing from mobile devices as well as for display | 2006


W3C Recommendations for Semantic Web:

Language | Purpose | Recommendation
RDF (Resource Description Framework) | A model and XML-based language for metadata describing Web resources | 1999, 2004
RDF Schema | To define RDF vocabularies | 2004
OWL (Web Ontology Language) | Publishing and sharing ontologies | 2004
WebCGM XCF | Metadata for WebCGM pictures | 2007
GRDDL (Gleaning Resource Descriptions from Dialects of Languages) | Markup for declaring that an XML document includes RDF compatible data | 2007
SPARQL | Query language for RDF | 2008
POWDER | Metadata to describe a group of resources | 2009


W3C Recommendations for Web communication and services:

Language | Purpose | Recommendation
P3P (Platform for Privacy Preferences) | To enable Web sites to express their practices to collect and use data collected from users of sites | 2002
XML-Signature | Associating digital objects by digital signatures in XML format | 2002
XML Encryption | Encrypting data and representing the result in XML | 2002
SOAP (Simple Object Access Protocol) | Rules to exchange structured and typed information between peers in a decentralized, distributed environment | 2003, 2007
CC/PP (Composite Capabilities/Preference Profiles) | A format for how a client device tells an origin server about its user agent profile | 2004
XKMS (XML Key Management Specification) | Protocol for distributing and registering public keys | 2005
WSDL (Web Services Description Language) | To describe Web services | 2007
SML | Service modeling | 2009


For more information:

A. Salminen, XML Family of Languages. Overview and Classification. http://users.jyu.fi/~airi/xmlfamily.html