Structured-Document Processing Languages
(3 cu), Spring 2004
Pekka Kilpeläinen
University of Kuopio
Department of CS & Applied Math
[email protected]
SDPL 2004 Notes 1: Introduction 2

1 Introduction
First: Overview and Arrangements
What is this course about?
1.1 Structured Documents
Review of basic notions

Goals of the Course
To get familiar with the most important models and languages for
  – manipulating
  – representing
  – transforming and
  – querying
structured documents (or XML)
“Generic XML processing technology”
  – very little about specific XML applications or commercial systems

NOT an Exhaustive Survey
Bias in selecting course topics:
  – estimated usefulness/value
    » centrality (implying longer-lasting value)
    » maturity: Stable specifications? Existing implementations?
  – Lecturer up-to-date?
Emphasis on processing data in the form of documents, rather than describing it

Motivation?
Practical relevance: “eBusiness” is HOT!
Academic interest in models of information processing
[Figure labels: XML, Internet, order, invoice]

Preliminary Outline
1 Introduction
  Overview and Arrangements
  1.1 Structured Documents
2 Document Instances and Grammars
  2.1 Trees and their Grammars
  2.2 Review of XML basics: DTDs, Namespaces, Schemas
3 Programmatic Manipulation of Structured Documents (XML APIs)
  3.1 SAX
  3.2 DOM; 3.3 JAXP

Preliminary Outline (2)
4 Styling Structured Documents I
  4.1 Essentials of Cascading Style Sheets
5 Transforming Structured Documents
  5.1 Addressing: XPath
  5.2 XSLT
6 Styling Structured Documents II: XSL
7 XML wrapping (or translating data to XML)
8 Querying Structured Documents
  – W3C XML Query Language XQuery

Methodological Goals
Some central professional skills
  – consulting technical specifications
  – experimenting with SW implementations
Ability to think…?
  – to find out relationships
  – to apply knowledge in new situations
("Pidgin English" for scientific communication)

Administration
An elective graduate-level (laudatur) special course
  – suitable for all specialisation lines (esp. CS/SWE)
3 cu (120 hours of work)
Lectures March 9 – May 6, MT2/E26–27
  – Lecturer: [email protected]
  – Assistant: [email protected]

Administration: Exercises
Exercises March 24 – May 12, MT2/E26–27
  – essential for becoming familiar with the technology
  – mainly normal homework assignments, some hands-on practice; solutions discussed in class
+ a "mini-project"
    » programming/modifying a document processing application (XML/Java/DOM/JAXP/XSLT)
    » individually or in small groups
    » to be handed in to the lecturer
  – credited like other exercises (grading based on quality by a factor in [0, 1.5])

Administration: Grading
Course exam on Tuesday, May 18, in SL
  – minimum of 50% of exam points to pass the course
Grade = (12*Exam/MaxExam + 4*HomeWork/MaxHomeWork - 4)
Opportunity to retake the exam
  – June 3 (again 50% to pass; grade with/without homework credits, whichever is better)
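
As a worked example of the grading formula above, the following Python sketch evaluates it for made-up point totals. The function names and the sample numbers are assumptions for illustration only. The slides do not say how the result is rounded or mapped to the final grade scale, nor how a retake grade "without homework credits" is computed, so the sketch returns the raw value and leaves the retake rule out.

```python
def passes_exam(exam_points, max_exam_points):
    """Pass rule from the slide: at least 50% of the exam points."""
    return exam_points >= 0.5 * max_exam_points


def course_grade(exam_points, max_exam_points, homework_points, max_homework_points):
    """Raw value of 12*Exam/MaxExam + 4*HomeWork/MaxHomeWork - 4."""
    if not passes_exam(exam_points, max_exam_points):
        return None  # assumption: a failed exam yields no grade at all
    return (12 * exam_points / max_exam_points
            + 4 * homework_points / max_homework_points
            - 4)


# Hypothetical totals: 30 of 40 exam points, 15 of 20 homework points.
# 12 * 0.75 + 4 * 0.75 - 4 = 9 + 3 - 4
print(course_grade(30, 40, 15, 20))  # prints 8.0
```

Note that the exam term carries three times the weight of the homework term, and that homework alone cannot rescue an exam result below the 50% threshold.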

Material
No single textbook
Reports, articles
Course home page
  – http://www.cs.uku.fi/~kilpelai/RDK04/
  – lecture notes, exercises, reference material, announcements, …
Possible background text:
  Deitel, Deitel, Nieto, Lin & Sadhu: XML - How to Program. Prentice Hall, 2001.

Background Check
Basic knowledge of structured documents and document standards
  – Course "Introduction to Document standards"?
  – HTML?
Programming languages and concepts
  – Java? OO programming?
  – Unix/Linux \ Windows?
Formal language theory
  – Theory of Computation / "Ohjelmoinnin ja laskennan teoria"?
  – regular expressions, automata?
  – context-free grammars, parse trees?

Course Expectations?

1.1. Structured Documents
Document:
  – a structured representation of information on some medium ( message)
  – normally for a human reader
    » memos, manuals, articles, books, …
  – also application-to-application messages
    » EDI (electronic data interchange)
  – "prose-oriented XML" vs "data-oriented XML"
  – possibly non-permanent, dynamically generated
  – processable or conceivable as a unit
    » (a web page vs a web site)
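
The distinction between "prose-oriented" and "data-oriented" XML can be made concrete with a small Python sketch. It rests on an assumption that the slides do not state: that prose-oriented documents typically show mixed content (running text interleaved with child elements), while data-oriented documents keep text only in leaf elements. The two sample documents and the helper name `has_mixed_content` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-oriented document: text appears only in leaf elements.
DATA_ORIENTED = (
    "<invoice><customer>ACME</customer>"
    "<total currency='EUR'>120.50</total></invoice>"
)

# Hypothetical prose-oriented document: text and elements are interleaved.
PROSE_ORIENTED = (
    "<para>Check the <emph>oil level</emph> "
    "before every <term>long trip</term>.</para>"
)


def has_mixed_content(element):
    """True if some element holds both child elements and non-blank text."""
    for node in element.iter():
        children = list(node)
        if not children:
            continue  # leaf elements cannot have mixed content
        own_text = (node.text or "").strip()
        tail_text = "".join(child.tail or "" for child in children).strip()
        if own_text or tail_text:
            return True
    return False


print(has_mixed_content(ET.fromstring(DATA_ORIENTED)))   # prints False
print(has_mixed_content(ET.fromstring(PROSE_ORIENTED)))  # prints True
```

In the prose example the order of text and elements carries the meaning of the sentence, whereas the fields of the invoice could be reordered without loss.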

Text-Based Documents
We concentrate on textual or text-based documents
  – character data is the major constituent of information content
  – as opposed to, say, multimedia documents
Next: Presentation vs Structure

Presentation vs Structure
Presentation informs the human reader about the meaning of text and the role of its parts
Markup (Finnish: merkkaus): indicating the presentation or the meaning of different parts of text
  – originally hand-written annotations for the typesetter
  – nowadays primarily codes embedded in digital documents

Markup
Procedural markup
  – formatting commands (start boldface, produce an empty line, indent 5 mm, ...)
  – proprietary word processor formats, nroff, TeX, ...
Descriptive or generic markup
  – indicating the logical structure of text using chosen names
  – LaTeX: \begin{abstract} ... \end{abstract}
  – HTML: <TITLE> .... </TITLE>
Markup language (Finnish: merkkauskieli)
  – a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …)
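
A practical consequence of descriptive markup is that the same marked-up text can be presented in several ways, because the markup names what each part is rather than how it should look. The Python sketch below illustrates this under stated assumptions: the element names (`memo`, `title`, `warning`, `para`) and both style tables are invented for the example, and the renderer handles only a flat list of child elements.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: the (hypothetical) element names describe roles, not looks.
SOURCE = (
    "<memo><title>Server move</title>"
    "<warning>Back up first.</warning>"
    "<para>The move starts at noon.</para></memo>"
)

# Two hypothetical style tables mapping a logical name to a presentation rule.
PLAIN_STYLE = {
    "title": lambda text: text.upper(),
    "warning": lambda text: "!! " + text,
    "para": lambda text: "  " + text,
}
HTML_STYLE = {
    "title": lambda text: "<h1>" + text + "</h1>",
    "warning": lambda text: "<strong>" + text + "</strong>",
    "para": lambda text: "<p>" + text + "</p>",
}


def render(source, style):
    """Apply one presentation rule per child element of the document root."""
    root = ET.fromstring(source)
    return [style[child.tag](child.text) for child in root]


for line in render(SOURCE, PLAIN_STYLE):
    print(line)
for line in render(SOURCE, HTML_STYLE):
    print(line)
```

With procedural markup the formatting commands would be embedded in the text itself, so producing the second presentation would mean editing the document rather than swapping a style table.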

Structured Documents?
Most liberally, any document is structured (punctuation, words, sentences, fields, …)
but especially descriptively marked-up documents ...
especially if they adhere to a rigorous specification of structure

Structure in Documents
Hierarchy or nesting is ubiquitous
  – chapters of books, warnings in maintenance manuals, ...
Linear order essential in prose documents
  – less important in documents representing data objects
Hypertext and cross-references
We'll be mainly dealing with manipulation of hierarchical, or tree-like document structures
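
To make the tree view of a document concrete, the following Python sketch parses a small nested document and walks it as a tree. The sample manual, its element names, and the helper names `outline` and `height` are assumptions made for this illustration; the course itself works with Java APIs such as DOM, which are not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical maintenance manual: chapters nest titles, warnings and paragraphs.
MANUAL = (
    "<book><title>Maintenance Manual</title>"
    "<chapter><title>Engine</title>"
    "<warning>Let the engine cool.</warning>"
    "<para>Check the oil level.</para></chapter>"
    "<chapter><title>Brakes</title>"
    "<para>Inspect the pads.</para></chapter></book>"
)


def outline(element, depth=0):
    """List element names in document order, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:  # children are visited in their linear order
        lines.extend(outline(child, depth + 1))
    return lines


def height(element):
    """Number of nesting levels in the subtree rooted at element."""
    return 1 + max((height(child) for child in element), default=0)


root = ET.fromstring(MANUAL)
print("\n".join(outline(root)))
print(height(root))  # prints 3
```

The traversal reflects both kinds of structure named above: indentation shows the hierarchy, and the order of the output lines follows the linear order of the document.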
Next: How are these modelled?