Structured-Document Processing Languages (3 cu), Spring 2004
Pekka Kilpeläinen
University of Kuopio
Department of CS & Applied Math
Pekka.Kilpelainen@cs.uku.fi

SDPL 2004 Notes 1: Introduction

1 Introduction

First: Overview and Arrangements
  – What is this course about?

1.1 Structured Documents
  – Review of basic notions


Goals of the Course

To get familiar with the most important models and languages for
  – manipulating
  – representing
  – transforming and
  – querying
structured documents (or XML)

"Generic XML processing technology"
  – very little about specific XML applications or commercial systems


NOT an Exhaustive Survey

Bias in selecting course topics:
  – estimated usefulness/value
    » centrality (implying longer-lasting value)
    » maturity: Stable specifications? Existing implementations?
  – Lecturer up-to-date?

Emphasis on processing data in the form of documents, rather than describing it


Motivation?

Practical relevance: "eBusiness" is HOT!

Academic interest in models of information processing

[Figure: diagram with the labels XML, Internet, order, invoice]


Preliminary Outline

1 Introduction
Overview and Arrangements
1.1 Structured Documents

2 Document Instances and Grammars
2.1 Trees and their Grammars
2.2 Review of XML basics: DTDs, Namespaces, Schemas

3 Programmatic Manipulation of Structured Documents (XML APIs)
3.1 SAX
3.2 DOM
3.3 JAXP


Preliminary Outline (2)

4 Styling Structured Documents I
4.1 Essentials of Cascading Style Sheets

5 Transforming Structured Documents
5.1 Addressing: XPath
5.2 XSLT

6 Styling Structured Documents II: XSL

7 XML wrapping (or translating data to XML)

8 Querying Structured Documents
- W3C XML Query Language XQuery


Methodological Goals

Some central professional skills
  – consulting technical specifications
  – experimenting with SW implementations
Ability to think…?
  – to find out relationships
  – to apply knowledge in new situations
("Pidgin English" for scientific communication)


Administration

An elective graduate-level (laudatur) special course
  – suitable for all specialisation lines (esp. CS/SWE)
3 cu (120 hours of work)
Lectures March 9 – May 6, MT2/E26–27
  – Lecturer: [email protected]
  – Assistant: [email protected]


Administration: Exercises

Exercises March 24 – May 12, MT2/E26–27
  – essential for becoming familiar with the technology
  – mainly normal homework assignments, some hands-on practice; solutions discussed in class
+ a "mini-project"
    » programming/modifying a document processing application (XML/Java/DOM/JAXP/XSLT)
    » individually or in small groups
    » to be handed in to the lecturer
  – credited like other exercises (grading based on quality by a factor in [0, 1.5])


Administration: Grading

Course exam on Tuesday, May 18, in SL
  – minimum of 50% of exam points to pass the course

Grade = (12*Exam/MaxExam + 4*HomeWork/MaxHomeWork - 4)

Opportunity to retake the exam
  – June 3 (again 50% to pass; grade with/without homework credits, whichever is better)
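As a worked example of the grading formula, the following minimal sketch computes the raw value of the expression on the slide. The function name and the sample point totals are hypothetical. The slide does not say how the raw value is rounded or mapped to the final grade scale, nor how the retake variant without homework credits is computed, so neither is modelled here.

```python
def raw_grade(exam, max_exam, homework, max_homework):
    """Return the raw value of the slide's grade formula, or None if the exam is failed."""
    # The slide requires at least 50% of the exam points to pass.
    if exam < 0.5 * max_exam:
        return None
    # Grade = 12*Exam/MaxExam + 4*HomeWork/MaxHomeWork - 4
    return 12 * exam / max_exam + 4 * homework / max_homework - 4


# Hypothetical point totals, chosen only for illustration.
print(raw_grade(30, 40, 10, 20))  # prints 7.0
```

With 75% of the exam points and half of the homework points, the terms are 9 + 2 - 4 = 7.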


Material

No single textbook
Reports, articles
Course home page
  – http://www.cs.uku.fi/~kilpelai/RDK04/
  – lecture notes, exercises, reference material, announcements, …

Possible background text: Deitel, Deitel, Nieto, Lin & Sadhu: XML - How to Program. Prentice Hall, 2001.


Background Check

Basic knowledge of structured documents and document standards
  – Course "Introduction to Document standards"?
  – HTML?
Programming languages and concepts
  – Java? OO programming?
  – Unix/Linux \ Windows?
Formal language theory
  – Theory of Computation / "Ohjelmoinnin ja laskennan teoria"?
  – regular expressions, automata?
  – context-free grammars, parse trees?


Course Expectations?


1.1. Structured Documents

Document:
  – a structured representation of information on some medium ( message)
  – normally for a human reader
    » memos, manuals, articles, books, …
  – also application-to-application messages
    » EDI (electronic data interchange)
  – "prose-oriented XML" vs "data-oriented XML"
  – possibly non-permanent, dynamically generated
  – processable or conceivable as a unit
    » (a web page vs a web site)


Text-Based Documents

We concentrate on textual or text-based documents
  – character data is the major constituent of the information content
  – as opposed to, say, multimedia documents

Next: Presentation vs Structure


Presentation vs Structure

Presentation informs the human reader about the meaning of text and the role of its parts

Markup (Finnish: merkkaus): indicating the presentation or the meaning of different parts of text
  – originally hand-written annotations for the typesetter
  – nowadays primarily codes embedded in digital documents


Markup

Procedural markup
  – formatting commands (start boldface, produce an empty line, indent 5 mm, ...)
  – proprietary word processor formats, nroff, TeX, ...
Descriptive or generic markup
  – indicating the logical structure of text using chosen names
  – LaTeX: \begin{abstract} ... \end{abstract}
  – HTML: <TITLE> .... </TITLE>
Markup language (Finnish: merkkauskieli)
  – a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …)
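To illustrate why descriptive markup is useful, here is a minimal sketch in Python: because the markup names the logical parts of the text instead of prescribing formatting, the same source can be rendered in more than one presentation. The element names `article`, `title` and `abstract`, the sample text, and both rendering functions are assumptions made up for this example; only the standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptively marked-up fragment; the element names are made up.
SOURCE = "<article><title>Markup</title><abstract>Short summary.</abstract></article>"


def render_plain(doc):
    """Plain-text presentation: upper-case title, blank line, abstract."""
    return f"{doc.findtext('title').upper()}\n\n{doc.findtext('abstract')}"


def render_html(doc):
    """HTML presentation: heading for the title, italic paragraph for the abstract."""
    return f"<h1>{doc.findtext('title')}</h1><p><i>{doc.findtext('abstract')}</i></p>"


doc = ET.fromstring(SOURCE)
print(render_plain(doc))
print(render_html(doc))
```

With procedural markup the formatting decisions would be baked into the source, so producing the second presentation would mean editing the document itself rather than swapping the rendering function.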


Structured Documents?

Most liberally, any document is structured (punctuation, words, sentences, fields, …)

but especially descriptively marked-up documents ...

especially if they adhere to a rigorous specification of structure
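What "adhering to a rigorous specification of structure" can mean is sketched below as a toy check, not as real DTD or schema validation. Each element name is mapped to a regular expression over the sequence of its child element names, and a document conforms if every element matches its model. The `memo` vocabulary, the `MODELS` table and the `conforms` function are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models: element name -> regex over its child names,
# where each child name is followed by a single space.
MODELS = {
    "memo": r"to from (para )+",  # one <to>, one <from>, then one or more <para>
    "to": r"",                    # no child elements allowed
    "from": r"",
    "para": r"",
}


def conforms(elem):
    """Return True if elem and all its descendants match their content models."""
    model = MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element name
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in elem)


good = ET.fromstring("<memo><to>A</to><from>B</from><para>Hello</para></memo>")
bad = ET.fromstring("<memo><from>B</from><to>A</to><para>Hello</para></memo>")
print(conforms(good), conforms(bad))
```

The second memo contains the same elements in a different order and is rejected, which is the kind of constraint a structure specification makes checkable by a program.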


Structure in Documents

Hierarchy or nesting is ubiquitous
  – chapters of books, warnings in maintenance manuals, ...
Linear order essential in prose documents
  – less important in documents representing data objects
Hypertext and cross-references
We'll be mainly dealing with manipulation of hierarchical, or tree-like document structures
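The hierarchical view can be made concrete with a short sketch that parses a small document and lists its elements in document order, indented by nesting depth. The `book` example and the `outline` function are made up for illustration; the parsing relies on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: chapters of a book, one of them containing a warning.
SOURCE = """<book>
  <chapter><title>Intro</title><para>Text</para></chapter>
  <chapter><title>Trees</title><warning>Careful</warning></chapter>
</book>"""


def outline(elem, depth=0):
    """Return element names in document order, indented two spaces per level."""
    lines = ["  " * depth + elem.tag]
    for child in elem:  # children are visited in their linear order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SOURCE)
print("\n".join(outline(root)))
```

The recursion follows the nesting (hierarchy), while the loop over children preserves the linear order that matters in prose documents.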

Next: How are these modelled?