© Maria Indrawan Monash University 2003
CSE3201/4500
Information Retrieval Systems
Maria Indrawan
C4.26, [email protected]
Type of Data
[Diagram: a spectrum from structured to non-structured data, showing relational database, XML documents, and free text (search engine)]
• data representation
• query formulation
• matching
Introduction
• What will I learn in this unit?
  – How to manage data that cannot be effectively handled by a relational DBMS:
    • XML documents
    • Text (free text)
• There will be no SQL in this unit.
Objectives
• On completion of this unit, you will (hopefully!) be able to:
  – Understand the different natures of information (structured, semi-structured, unstructured) and the issues each raises for information retrieval.
  – Understand the XML technologies and their role in information retrieval.
  – Create and manipulate XML documents.
  – Understand the design issues and various approaches to the development of text databases.
Prerequisite Knowledge
• Relational database concepts, such as SQL, indexing.
• Basic UNIX commands, e.g. file and directory manipulation commands.
• HTML.
• Basic level of Maths (year-12 level).
Assessment
• There are different assessments for CSE3201 and CSE4500.
• Undergraduate students => CSE3201
• Masters students => CSE4500
CSE4500 Assessment
• Component A:
  – Assignment 1 – XML Schema: 10% (week 6)
  – Assignment 2 – XSLT: 15% (week 9)
  – Unit Test – XML, XSLT: 15% (week 10)
CSE4500 Assessment
• Component B:
  – Assignment 3 – Research Paper: 10% (week 12)
• Component C:
  – Exam: 50%
CSE3201 Assessment
• Component A:
  – Assignment 1 – XML Schema: 10% (week 6)
  – Assignment 2 – XSLT: 15% (week 9)
  – Unit Test – XML, XSLT: 15% (week 10)
CSE3201 Assessment
• Component B:
  – Unit Test on text retrieval: 10% (week 12)
• Component C:
  – Exam: 50%
Assessment Rules
• The result of the unit test will determine the final grade for Component A as follows:
Unit Test        Maximum grade for Component A
Fail             Pass
Pass             Credit
Credit           Distinction
Distinction      High Distinction
Assessment Rules
• In order to pass this unit you must attain:
  – 50% overall, and
  – at least 40% of the available marks in each of components A, B and C.
Textbook
Prescribed:
  – XML: How To Program (1st ed.), Deitel, H.M., Deitel, P.J., Nieto, T.R., Lin, T. and Sadhu, P., Prentice Hall.
Recommended:
  – Professional XML, 2nd ed., WROX Publisher.
  – Beginner XML, WROX Publisher.
  – XML Schema, Eric Van Der Vlist, O’Reilly Publishing.
Resources
• Unit website:
  – www.csse.monash.edu.au/courseware/cse4500
  – www.csse.monash.edu.au/courseware/cse3201
• Useful links:
  – WWW Consortium: http://www.w3.org/
  – http://www.topxml.com
  – http://www.xmlsoftware.com.au
  – XML editor: http://www.xmlspy.com
Plagiarism/Cheating
• Please read all the necessary university materials on cheating/plagiarism (listed in the unit guide).
Computing Facilities
• Quota system
• Acceptable use policy
  – http://www.infotech.monash.edu.au/myfit/students/student_labinfo_rules_netusage.cfm
  – http://www.adm.monash.edu.au/unisec/pol/itec12.html
Being Resourceful and Independent
• I have a question on …
– Read the textbook or reading list.
– Explore additional materials, eg W3C.
– Ask my tutor.
– Ask my lecturer.
• Can I ask my tutor/helpdesk to find the bugs in my work?
– No.
• Will the solution to the tutorial exercises be published?
– No. Students are encouraged to discuss their work with the tutors.
• Will studying the lecture notes be sufficient for this unit?
– No. Students need to read the textbook and additional reading list.
Basic XML
Objectives
• Be able to:
  – Understand XML technologies and their roles.
  – Understand the different components of an XML document.
  – Create a well-formed XML document.
What is XML?
• XML = Extensible Markup Language.
• Markup languages:
  – HTML
  – SGML
• Utilise the markup to define the
  – structure
  – semantics => to a certain level.
• WWW Consortium (W3C) recommendation
  – www.w3c.org
XML vs HTML
HTML
• Tags define the presentation layout.
<p> CSE3201 </p>
<p> Information Retrieval </p>
XML
• Tags define the structure and the meaning of the data.
<unit>
  <unitCode> CSE3201 </unitCode>
  <unitName> Information Retrieval </unitName>
</unit>
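A minimal sketch of what "tags define the meaning" buys an application. It assumes Python's standard xml.etree.ElementTree module, which is not part of the original slides; the variable names are illustrative. The program asks for a value by element name instead of by its position on the page.

```python
import xml.etree.ElementTree as ET

# The <unit> document from the slide, held as a string.
UNIT_XML = """
<unit>
  <unitCode> CSE3201 </unitCode>
  <unitName> Information Retrieval </unitName>
</unit>
"""

unit = ET.fromstring(UNIT_XML)

# Look values up by element name; strip() drops the padding spaces.
unit_code = unit.findtext("unitCode").strip()
unit_name = unit.findtext("unitName").strip()

print(unit_code, "-", unit_name)
```

The same lookup against the HTML version would have to guess which <p> holds the code and which holds the name.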
Why XML?
• Distributed applications need to share data.
  – Plain text
  – Structure and the meaning of the data are tightly defined.
• Delivery of data to multiple devices
  – Separation of data and presentation.
XML Document – an Example
<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer's Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
  …
</bookshop>

Tree view of the document:
bookshop
  book
    title
    author
      initials
      surname
    price
      value
  book
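The tree view of the bookshop document can be reproduced programmatically. The following is a sketch only, assuming Python's standard xml.etree.ElementTree module; the helper name outline is hypothetical, and the document is cut down to a single book (the slide elides the rest).

```python
import xml.etree.ElementTree as ET

# One-book version of the bookshop document from the slide.
BOOKSHOP_XML = """
<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer's Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
</bookshop>
"""

def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and all its descendants."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(BOOKSHOP_XML)
tree_rows = outline(root)
for depth, tag in tree_rows:
    print("  " * depth + tag)

# Attributes hang off their element; they are not child elements.
price = root.find("book/price").get("value")
```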
XML Technologies
• DTD/Schema
  – definition of XML structures
• XSL (XSLT and XSL-FO)
  – presentation
• XPath
  – locating nodes
• XLink, XPointer
  – linking
• DOM and SAX
  – APIs to manipulate XML
XML Parser
• Required to read and manipulate XML documents.
• Reads the XML document as plain text and transforms it into a data structure, typically a tree, in memory.
• Applications, such as a web browser, access the data structure and process the data according to their objectives.
• Example: MSXML
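The two steps above (the parser builds a tree, the application reads the tree) can be sketched as follows. This uses Python's built-in xml.dom.minidom as a stand-in parser, which is an assumption for illustration; the slide itself names MSXML.

```python
from xml.dom import minidom

text = "<bookshop><book><title>Harry Potter</title></book></bookshop>"

# Step 1: the parser reads plain text and builds a tree in memory.
document = minidom.parseString(text)

# Step 2: the application works on the tree, not on the raw text.
root_name = document.documentElement.tagName
titles = [node.firstChild.data
          for node in document.getElementsByTagName("title")]

print(root_name, titles)
```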
XML Usage
• SOAP (simple object access protocol)
• Microsoft BizTalk Server
• WSDL and UDDI in Web Services
• Semantic Web
XML Issues
• Performance
  – text processing vs binary processing
• Security
XML Document – Basic Components
• Elements.
• Attributes.
• Character and Entity References.
• Character Data (CDATA).
• Processing Instruction.
• Comments.
Elements
bookshop (root element, compulsory)
  book (branch element)
    title (leaf element)
    author (branch element)
      initials (leaf element)
      surname (leaf element)
    price (leaf element)
      value (attribute)
  book (branch element)
Element
• The basic building block of XML markup.
• It may contain:
  – Text
  – Other elements (child elements)
  – Attributes
  – Character data
  – Other markup, e.g. comments
• Delimited by a start-tag and an end-tag.
• An element can be empty.
• Unlike in HTML, the end-tag CANNOT be omitted.
• Each tag must contain a valid element type name.
Element’s Name
• An element's name (tag name) is CASE SENSITIVE.
  – <BOOK>, <Book> and <book> are three different names.
• Trailing space inside a tag is legal but will be ignored.
  – <BOOK > = <BOOK>
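Both naming rules can be checked against a real parser. A small sketch, assuming Python's standard xml.etree.ElementTree module; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

same_case = parses("<BOOK></BOOK>")
mixed_case = parses("<BOOK></book>")  # start and end names differ in case

# The space before '>' is dropped; the element name is still BOOK.
trailing_space_tag = ET.fromstring("<BOOK ></BOOK>").tag

print(same_case, mixed_case, trailing_space_tag)
```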
Empty Element
• Has no content.
• May carry attributes.
• Example: <img src='logo.png'></img>
  can be abbreviated to
  <img src='logo.png'/>
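A quick check that the long and abbreviated forms of an empty element mean the same thing to a parser. This is a sketch assuming Python's standard xml.etree.ElementTree module; the helper name describe is hypothetical.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<img src='logo.png'></img>")
short_form = ET.fromstring("<img src='logo.png'/>")

def describe(element):
    """Summarise an element as (tag, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text or "", len(element))

forms_match = describe(long_form) == describe(short_form)
print(describe(short_form), forms_match)
```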
Attributes
• Information regarding the element.
  "If elements are the 'nouns' of XML, then attributes are its 'adjectives'."
• Syntax: <tagname attribute_name="attribute_value">
• The same information as a child element:
<book>
  <title> Harry Potter</title>
</book>
• And as an attribute:
<book title="Harry Potter">
</book>
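The two ways of recording the title are read differently by an application. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring("<book><title> Harry Potter</title></book>")
as_attribute = ET.fromstring('<book title="Harry Potter"></book>')

# Child element: navigate to it and read its text content.
title_from_element = as_element.findtext("title").strip()

# Attribute: read it directly from the element that carries it.
title_from_attribute = as_attribute.get("title")

print(title_from_element, "|", title_from_attribute)
```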
Attributes vs Element
• Determined by the semantic content.
• Attributes are characteristics of an element.
<book>
  <title> Harry Potter</title>
</book>
<book title="Harry Potter">
</book>
Character References
• Used to enter characters that are not supported by the input device (keyboard).
  – e.g. entering £ using a US-ASCII keyboard.
• Format: &#NNNNN; or &#xXXXX;
  – N: decimal digit
  – X: hexadecimal digit
• Example: $ => &#36; or &#x24;
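A short sketch of character references in action, assuming Python's standard xml.etree.ElementTree module. The dollar sign is code point 36 in decimal (24 in hexadecimal) and the pound sign is 163 (A3), so each character can be written either way.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same character.
dollar = ET.fromstring("<p>&#36; and &#x24;</p>").text
pound = ET.fromstring("<p>&#163; and &#xA3;</p>").text

# ord() gives the code point behind each reference.
dollar_decimal = ord("$")
dollar_hex = format(ord("$"), "x")

print(dollar, pound, dollar_decimal, dollar_hex)
```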
Entity References
• Entities may be defined and used for:
  – Representing characters used in markup
    • &lt; == "<"
    • &amp; == "&"
  – Strings
    • &IR; == Information Retrieval
• Predefined entities: &lt;, &gt;, &quot;, etc.
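A sketch of both kinds of entity reference, assuming Python's standard xml.etree.ElementTree module. The predefined entities work without any declaration; a string entity such as IR has to be declared first, and declaring it in an internal DTD subset (as done here) is one illustrative way to do so.

```python
import xml.etree.ElementTree as ET

# Predefined entities need no declaration.
predefined = ET.fromstring("<p>&lt; &gt; &amp; &quot; &apos;</p>").text

# A custom string entity, declared in an internal DTD subset.
CUSTOM = """<!DOCTYPE unit [
  <!ENTITY IR "Information Retrieval">
]>
<unit>&IR; Systems</unit>"""
expanded = ET.fromstring(CUSTOM).text

print(predefined)
print(expanded)
```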
Character Data
• Used to escape blocks of text containing characters which would otherwise be recognized as markup.
• Syntax: <![CDATA[…]]>
• Example: <![CDATA[<greeting>Hello, world!</greeting>]]>
Character Data(2)
<example>
<![CDATA[&Warn;-&Disclaimer;<&copy; 2001; &PM;>]]>
</example>
is equivalent to
<example>
&amp;Warn;-&amp;Disclaimer;&lt;&amp;copy; 2001; &amp;PM;&gt;
</example>
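The effect of a CDATA section can be checked with a parser: markup characters inside it arrive as ordinary text, exactly as if each one had been escaped by hand. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and reusing the greeting example.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<example><![CDATA[<greeting>Hello, world!</greeting>]]></example>"
)
with_escapes = ET.fromstring(
    "<example>&lt;greeting&gt;Hello, world!&lt;/greeting&gt;</example>"
)

cdata_text = with_cdata.text
same_text = with_cdata.text == with_escapes.text

# The <greeting> inside CDATA is text, not a child element.
child_count = len(with_cdata)

print(cdata_text, same_text, child_count)
```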
Processing Instruction(PI)
• Processing instructions (PIs) allow documents to contain instructions for applications.
• <?target … instruction … ?>
• Target is used to identify the application or other object to which the PI is directed.
• Example: <?xml-stylesheet href="mystyle.css" type="text/css"?>
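A sketch of how an application sees a processing instruction, assuming Python's built-in xml.dom.minidom module and a hypothetical one-element document. The parser hands over the target and the instruction text and leaves their interpretation to the application.

```python
from xml.dom import minidom

DOC = '<?xml-stylesheet href="mystyle.css" type="text/css"?><unit/>'
document = minidom.parseString(DOC)

# The PI is the first node of the document, ahead of the root element.
pi = document.firstChild
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
target = pi.target
instruction = pi.data

print(is_pi, target, instruction)
```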
Comments
• Syntax: <!-- comment text -->
• Comments cannot be used within element tags.
<tag>… some content … <tag <!-- it is illegal -->>
• Comments may never be nested.
<!-- Comments cannot <!-- be nested --> like this -->
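The comment rules can be confirmed by feeding each case to a parser. A small sketch, assuming Python's standard xml.etree.ElementTree module; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A comment between tags is fine.
after_content = parses("<tag>some content <!-- this is fine --></tag>")

# A comment inside a tag is rejected.
inside_tag = parses("<tag <!-- it is illegal -->></tag>")

# A nested comment is rejected.
nested = parses(
    "<tag><!-- Comments cannot <!-- be nested --> like this --></tag>"
)

print(after_content, inside_tag, nested)
```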
Structure of XML Document
• An XML document has to be well-formed.
  – Conform to syntax requirements
  – Conform to a simple container structure
• Common structure of an XML document:
  – Prolog
  – Body
  – Epilog
Prolog
• Includes:
  – XML declaration
    <?xml version="1.0" encoding='utf-8' standalone="yes"?>
    • Version is mandatory; encoding and standalone are optional.
  – Document Type Declaration: <!DOCTYPE
    • It is not the same as DTD = Document Type Definition.
    • A simple well-formed XML document does not need it.
  – Schema declaration
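The rules for the XML declaration can be tried out directly. A sketch assuming Python's standard xml.etree.ElementTree module; the helper name parses and the one-element document are hypothetical. Documents are passed as bytes so that the encoding named in the declaration is the one the parser applies.

```python
import xml.etree.ElementTree as ET

def parses(data):
    """Return True if the parser accepts the data as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

full = parses(b'<?xml version="1.0" encoding="utf-8" standalone="yes"?><doc/>')
version_only = parses(b'<?xml version="1.0"?><doc/>')
no_declaration = parses(b"<doc/>")

# A declaration that leaves out the version is rejected.
missing_version = parses(b'<?xml encoding="utf-8"?><doc/>')

print(full, version_only, no_declaration, missing_version)
```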
Body & Epilog
• Body
  – Contains 1 or more elements
  – The "contents"
• Epilog
  – Hardly used
  – Can be used to identify the end of the document
Well-formed XML Document
• Contains a root element.
• Has valid tag names.
• Has no overlapping tags.
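Each well-formedness rule listed here can be tested with a parser, which refuses any document that breaks one of them. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    "single root": is_well_formed("<bookshop><book/></bookshop>"),
    "no root element": is_well_formed("just text"),
    "two root elements": is_well_formed("<book/><book/>"),
    "invalid tag name": is_well_formed("<1book/>"),
    "overlapping tags": is_well_formed("<b><i>text</b></i>"),
}

for rule, ok in checks.items():
    print(f"{rule}: {'well-formed' if ok else 'rejected'}")
```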