© Maria Indrawan Monash University 2003

CSE3201/4500

Information Retrieval Systems

Maria Indrawan

C4.26, 9903-1916, maria.indrawan@infotech.monash.edu.au

Type of Data

[Figure: types of data on a scale from structured to non-structured, showing XML documents, relational database, and free text (search engine)]

• data representation

• query formulation

• matching

Introduction

• What will I learn in this unit?
– How to manage data that cannot be effectively handled by a relational DBMS.
  • XML documents
  • Text (free text)

• There will be no SQL in this unit.

Objectives

• On completion of this unit, you will (hopefully!) be able to:
– Understand the different natures of information (structured, semi-structured, unstructured) and the issues associated with each in information retrieval.
– Understand the XML technologies and their role in information retrieval.
– Create and manipulate XML documents.
– Understand the design issues and various approaches to the development of text databases.

Prerequisite Knowledge

• Relational database concepts, such as SQL, indexing.

• Basic UNIX commands, e.g. file and directory manipulation commands.

• HTML.

• Basic level of Maths (year-12 level).

Assessment

• There are different assessments for CSE3201 and CSE4500.

• Undergraduate students => CSE3201

• Masters students => CSE4500

CSE4500 Assessment

• Component A:
– Assignment 1: XML Schema, 10% (week 6)
– Assignment 2: XSLT, 15% (week 9)
– Unit Test: XML, XSLT, 15% (week 10)

CSE4500 Assessment

• Component B:
– Assignment 3: Research Paper, 10% (week 12)

• Component C:
– Exam, 50%

CSE3201 Assessment

• Component A:
– Assignment 1: XML Schema, 10% (week 6)
– Assignment 2: XSLT, 15% (week 9)
– Unit Test: XML, XSLT, 15% (week 10)

CSE3201 Assessment

• Component B:
– Unit Test on text retrieval, 10% (week 12)

• Component C:
– Exam, 50%

Assessment Rules

• The result of the unit test will determine the final grade for Component A as follows:

Unit Test        Maximum grade for Component A
Fail             Pass
Pass             Credit
Credit           Distinction
Distinction      High Distinction

Assessment Rules

• In order to pass this unit you must attain:
– 50% overall, and
– at least 40% of the available marks in each of components A, B and C.

Textbook

Prescribed:
XML: How To Program (1st ed), Deitel, H.M., Deitel, P.J., Nieto, T.R., Lin, T. and Sadhu, P., Prentice Hall.

Recommended:
Professional XML, 2nd Ed, WROX Publisher.
Beginning XML, WROX Publisher.
XML Schema, Eric van der Vlist, O’Reilly Publishing.

Plagiarism/Cheating

• Please read all the necessary university materials on cheating/plagiarism (listed in the unit guide).

Computing Facilities

• Quota system

• Acceptable use policy
– http://www.infotech.monash.edu.au/myfit/students/student_labinfo_rules_netusage.cfm
– http://www.adm.monash.edu.au/unisec/pol/itec12.html

Being Resourceful and Independent

• I have a question on …

– Read the textbook or reading list.

– Explore additional materials, eg W3C.

– Ask my tutor.

– Ask my lecturer.

• Can I ask my tutor/helpdesk to find the bugs in my work?

– No.

• Will the solutions to the tutorial exercises be published?

– No. Students are encouraged to discuss their work with the tutors.

• Will studying the lecture notes be sufficient for this unit?

– No. Students need to read the textbook and the additional reading list.

Basic XML

Objectives

• Be able to:
– Understand XML technologies and their roles.
– Understand the different components of an XML document.
– Create a well-formed XML document.

What is XML?

• XML = Extensible Markup Language.

• Markup languages:
– HTML
– SGML

• Use markup to define the
– structure
– semantics => to a certain level.

• World Wide Web Consortium (W3C) recommendation
– www.w3c.org

XML vs HTML

HTML
• tags define the presentation layout

<p> CSE3201 </p>
<p> Information Retrieval </p>

XML
• tags define the structure and the meaning of the data

<unit>
  <unitCode> CSE3201 </unitCode>
  <unitName> Information Retrieval </unitName>
</unit>
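
Because XML tags name the data rather than its layout, a program can ask for a field by name. The sketch below illustrates this with the unit record above. It uses Python's standard `xml.etree.ElementTree` module, which is an assumption of this example and not a tool prescribed in the unit.

```python
import xml.etree.ElementTree as ET

# The unit record from the slide, held as a plain string.
unit_xml = """
<unit>
  <unitCode> CSE3201 </unitCode>
  <unitName> Information Retrieval </unitName>
</unit>
"""

unit = ET.fromstring(unit_xml)

# Element names carry the meaning, so each field is looked up by name.
unit_code = unit.findtext("unitCode").strip()
unit_name = unit.findtext("unitName").strip()
print(unit_code, "-", unit_name)
```

The HTML version offers no such handle: both values sit in `<p>` tags, so a program would have to rely on their position.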

Why XML?

• Distributed applications need to share data.
– plain text
– structure and the meaning of the data are tightly defined.

• Delivery of data to multiple devices
– Separation of data and presentation.

XML Document – an Example

<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer’s Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
  …
</bookshop>

[Figure: tree view of the document. Nodes shown: bookshop; book (twice); title, author, price; initials, surname; value]
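
The tree view of the bookshop document can be reproduced in code. The following is a minimal sketch that assumes Python's standard `xml.etree.ElementTree` module and a single-book version of the example (the elided part of the slide is left out); `outline` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Single-book version of the slide's example, with straight quotes.
bookshop_xml = """<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer's Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
</bookshop>"""

def outline(element, depth=0):
    """Return one indented line per element and one '@name' line per attribute."""
    lines = ["  " * depth + element.tag]
    for name in element.attrib:
        lines.append("  " * (depth + 1) + "@" + name)
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(bookshop_xml))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting in the document, and the `@value` line marks the attribute of `price`.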

XML Technologies

• DTD/Schema
– definition of XML structures

• XSL (XSLT and XSL-FO)
– presentation

• XPath
– locating nodes

• XLink, XPointer
– linking

• DOM and SAX
– APIs to manipulate XML

XML Parser

• Required to read and manipulate XML documents.

• Reads the XML document as plain text and transforms it into a data structure, typically a tree, in memory.

• Applications, such as a web browser, access the data structure and process the data according to their objectives.

• Example: MSXML
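
The two steps described above (text in, tree out, then the application works on the tree) can be sketched as follows. The example assumes Python's standard `xml.dom.minidom` DOM parser as a stand-in for a parser such as MSXML; the short document is made up for illustration.

```python
from xml.dom import minidom

# Plain text, exactly as it would be read from a file or a network stream.
text = "<bookshop><book><title>Harry Potter</title></book></bookshop>"

# Step 1: the parser reads the text and builds an in-memory tree (a DOM).
document = minidom.parseString(text)

# Step 2: the application works on the tree, not on the raw text.
root = document.documentElement
titles = [node.firstChild.data for node in root.getElementsByTagName("title")]
print(root.tagName, titles)
```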

XML Usage

• SOAP (Simple Object Access Protocol)

• Microsoft BizTalk Server

• WSDL and UDDI in Web Services

• Semantic Web

XML Issues

• Performance
– text processing vs binary processing

• Security

XML Document – Basic Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Elements

[Figure: the bookshop tree (bookshop; book; title, author, price; initials, surname; value), labelled with Root Element (compulsory), Branch Elements, Leaf Element and attribute]

Element

• The basic building block of XML markup.

• It may contain:
– Text
– Other elements (child elements)
– Attributes
– Character Data
– Other markup, e.g. comments

• Delimited with a start-tag and an end-tag.

• An element can be empty.

• The end-tag CANNOT be omitted as in HTML.

• Each tag must contain a valid element type name.

Element’s Name

• Element’s Name (Tag’s name) is CASE SENSITIVE.
– <BOOK> ≠ <Book> ≠ <book>

• Trailing space is legal but will be ignored.
– <BOOK > = <BOOK>
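
Both naming rules can be checked with a parser. This is a small sketch assuming Python's standard `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper that simply reports whether parsing succeeds.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# <BOOK> and </Book> are different names, so the tags do not match.
mismatched_case = is_well_formed("<BOOK>Harry Potter</Book>")

# A trailing space inside a tag is accepted by the parser...
trailing_space = is_well_formed("<BOOK >Harry Potter</BOOK >")

# ...and does not become part of the element name.
tag_with_space = ET.fromstring("<BOOK >Harry Potter</BOOK>").tag
```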

Empty Element

• Has no content.

• May have attributes.

• Example: <img src='logo.png'></img>

can be abbreviated to

<img src='logo.png'/>
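
The two spellings of an empty element produce the same result once parsed. The sketch below, which assumes Python's standard `xml.etree.ElementTree`, compares them.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<img src='logo.png'></img>")
short_form = ET.fromstring("<img src='logo.png'/>")

# Both forms carry the same attribute and have neither children nor text.
same_attributes = long_form.attrib == short_form.attrib
no_children = len(long_form) == 0 and len(short_form) == 0
no_text = long_form.text is None and short_form.text is None
```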

XML Document – Basic Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Attributes

• Information regarding the element.
“If elements are the ‘nouns’ of XML, then attributes are its ‘adjectives’.”

• <tagname attribute_name="attribute_value">

<book>
  <title> Harry Potter</title>
</book>

<book title="Harry Potter">
</book>

Attributes vs Element

• Determined by the semantic content.

• Attributes are characteristics of an element.

<book>
  <title> Harry Potter</title>
</book>

<book title="Harry Potter">
</book>
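
The choice between an element and an attribute changes how an application reaches the value. The sketch below reads the title from both versions of the book; it assumes Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring("<book><title> Harry Potter</title></book>")
as_attribute = ET.fromstring('<book title="Harry Potter"></book>')

# A child element is located in the tree and its text is read.
title_from_child = as_element.find("title").text.strip()

# An attribute is read directly from the element that owns it.
title_from_attribute = as_attribute.get("title")
```

A child element can later be given children or attributes of its own, while an attribute value is always a single string. This is one practical consideration when deciding between the two.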

XML Document – Basic Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Character References

• Used to represent characters that are not supported by the input device (keyboard).
– e.g. entering £ using a US-ASCII keyboard.

• Format: &#NNNNN; or &#xXXXX;
– N decimal
– X hexadecimal

• Example: $ => &#36; OR &#x24;
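
A parser resolves the decimal and hexadecimal forms to the same character. The sketch below checks this for the dollar sign and for the pound sign (code point 163); it assumes Python's standard `xml.etree.ElementTree`, and the `price` element is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same character written as a decimal and as a hexadecimal reference.
decimal_form = ET.fromstring("<price>&#36;16.95</price>").text
hex_form = ET.fromstring("<price>&#x24;16.95</price>").text

# A character that is missing from a US-ASCII keyboard.
pound = ET.fromstring("<price>&#163;10</price>").text

# Building a reference from a character: its code point between '&#' and ';'.
dollar_reference = "&#{};".format(ord("$"))
```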

Entity References

• Entities may be defined and used for:
– Representing characters used in markup
  • &lt; == "<"
  • &amp; == "&"
– Strings
  • &IR; == Information Retrieval

• Predefined entities: &lt;, &gt;, &quot;, etc.
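
Both uses of entities can be tried out with a parser. In this sketch the `IR` entity is declared in an internal DTD subset, which is one way to define it and an assumption of the example; the code relies on Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils.escape`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Predefined entities are replaced by the characters they stand for.
predefined = ET.fromstring("<note>a &lt; b &amp; c &gt; d</note>").text

# A string entity, declared in the document's internal DTD subset.
document = """<!DOCTYPE unit [
  <!ENTITY IR "Information Retrieval">
]>
<unit>&IR; Systems</unit>"""
expanded = ET.fromstring(document).text

# Going the other way: escaping text before placing it inside an element.
escaped = escape("a < b & c")
```

Note that every entity reference ends with a semicolon; a reference without it is rejected by the parser.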

XML Document – Basic Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Character Data

• To escape blocks of text containing characters which would otherwise be recognized as markup.

• <![CDATA[…]]>

• <![CDATA[<greeting>Hello, world!</greeting>]]>

Character Data(2)

<example>
<![CDATA[&Warn;-&Disclaimer;&lt;&copy 2001; &PM;&gt;]]>
</example>

is equivalent to

<example>
&amp;Warn;-&amp;Disclaimer;&amp;lt;&amp;copy 2001; &amp;PM;&amp;gt;
</example>
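
The equivalence between a CDATA section and character-by-character escaping can be confirmed with a parser. The sketch below uses the shorter greeting example and assumes Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Markup characters inside a CDATA section are kept as literal text.
with_cdata = ET.fromstring(
    "<example><![CDATA[<greeting>Hello, world!</greeting>]]></example>"
)

# The same text written with entity references instead.
with_escapes = ET.fromstring(
    "<example>&lt;greeting&gt;Hello, world!&lt;/greeting&gt;</example>"
)

same_text = with_cdata.text == with_escapes.text
print(with_cdata.text)
```

In both cases `example` has no child element: the greeting tags are data, not markup.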

XML Document – Basic Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Processing Instruction(PI)

• Processing instructions (PIs) allow documents to contain instructions for applications.

• <?target … instruction … ?>

• Target is used to identify the application or other object to which the PI is directed.

• <?xml-stylesheet href="mystyle.css" type="text/css"?>
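
A DOM parser exposes a processing instruction as a node with a target and a data string. The sketch below assumes Python's standard `xml.dom.minidom`; the `<unit/>` root element is added only so that the text forms a complete document.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<?xml-stylesheet href="mystyle.css" type="text/css"?><unit/>'
)

# The PI appears in the tree before the root element.
pi = document.firstChild
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE

# target names the application; data is the instruction passed to it.
print(pi.target, "|", pi.data)
```

The parser does not interpret the data. It is up to the application named by the target to make sense of it.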

XML Document – Basic Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Comments

• Syntax: <!-- comment text -->

• Comments cannot be used within element tags.

<tag>… some content … <tag <!-- it is illegal -->>

• Comments may never be nested.

<!-- Comments cannot <!-- be nested --> like this -->
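
The comment rules can be tested against a parser. This sketch assumes Python's standard `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper and the `unit` documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A comment between tags is fine.
plain = is_well_formed("<unit><!-- comment text --><code>CSE3201</code></unit>")

# A comment inside a tag is rejected.
inside_tag = is_well_formed("<unit <!-- it is illegal -->></unit>")

# A nested comment is rejected: '--' may not appear inside a comment.
nested = is_well_formed(
    "<unit><!-- Comments cannot <!-- be nested --> like this --></unit>"
)
```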

Structure of XML Document

• XML document has to be well-formed.
– Conform to syntax requirements
– Conform to a simple container structure

• Common structure of XML document:
– Prolog
– Body
– Epilog
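
A parser refuses any document that is not well-formed, so it can serve as a simple checker. The sketch below assumes Python's standard `xml.etree.ElementTree`; `check` is a hypothetical helper, and the sample documents are made up to show the container structure being respected and broken.

```python
import xml.etree.ElementTree as ET

def check(text):
    """Return None if the text is well-formed, else the parser's message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as error:
        return str(error)

samples = {
    "properly nested": "<book><title>XML</title></book>",
    "overlapping tags": "<book><title>XML</book></title>",
    "two root elements": "<book/><book/>",
}

# Map each sample name to None (well-formed) or to an error message.
results = {name: check(text) for name, text in samples.items()}
for name, message in results.items():
    print(name, "->", "well-formed" if message is None else message)
```

Only the first sample is a valid container structure: every element opened inside `book` is closed before `book` itself, and there is exactly one outermost element.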

Prolog

• Includes:
– XML Declaration
<?xml version="1.0" encoding='utf-8' standalone="yes"?>
  • Version is mandatory; encoding and standalone are optional.
– Document Type Declaration
<!DOCTYPE
  • It is not the DTD (Document Type Definition).
  • A simple well-formed XML document does not need it.
– Schema declaration

Body & Epilog

• Body
– Contains 1 or more elements
– The “contents”

• Epilog
– Hardly used
– Can be used to identify end of document

Well-formed XML Document

• Contains a root element.

• Valid tag names.

• No overlapping tags.