Hsin-Hsi Chen 6-1

Chapter 6: Text and Multimedia Languages and Properties

Hsin-Hsi Chen

Department of Computer Science and Information Engineering

National Taiwan University

Part of the following material is selected from Dr. Kuang-hua Chen's talk on XML and RDF (Department of Library and Information Science, National Taiwan University).

Hsin-Hsi Chen 6-2

What is a document?

• document: a single unit of information
  – a complete logical unit
    • research paper, book, manual
  – part of a larger text
    • paragraph, passage, an entry in a dictionary, …
  – a physical unit
    • file, email, Web page

Hsin-Hsi Chen 6-3

Characteristics of a document

[Figure: a document, its creator/author, and its characteristics]
• Document: text + structure + other media
• Syntax: expresses structure, presentation style, or even external actions; it is implicit, or expressed in a language
• Presentation style: how a document is displayed or printed
• Semantics
• Creator / author

Hsin-Hsi Chen 6-4

Metadata (Chinese renderings: 元資料, 超資料, 中介資料, 中間資料, 後設資料, 詮釋資料)

• Definition
  – Data about the data
  – Describes other information based on some rules or policies
• Types
  – Descriptive metadata
    • Metadata that is external to the meaning of the document
    • Example: Dublin Core
  – Semantic metadata
    • Metadata that can be found within the document's content
    • Example: Library of Congress subject codes

Hsin-Hsi Chen 6-5

Dublin Core

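• Metadata Element Set (15 elements)
  – Subject and keywords (Subject)
    • The topic of the resource, i.e., keywords or phrases that describe its subject or content, including controlled vocabularies or classification schemes
  – Title
    • The name given to the resource by its creator or publisher
  – Author (Creator)
    • The person, organization, or institution that created the content of the resource
  – Description
    • A textual description of the resource's content, such as the abstract of a document or a summary of an image resource
  – Publisher
    • The organization that publishes the resource, e.g., a publishing house, university department, group, or organization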

Hsin-Hsi Chen 6-6

Dublin Core (Continued)

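  – Other contributors (Contributors)
    • Other persons or organizations that contributed to the creation of the resource, e.g., editors, translators, or illustrators
  – Publication date (Date)
    • The date the resource was published
  – Resource type (Type)
    • The category of the resource, e.g., home page, novel, poem, technical report, dictionary
  – Data format (Format)
    • The file format of the resource, e.g., text/html, ASCII, or a JPEG image file
  – Resource identifier (Identifier)
    • A string or number that uniquely identifies the resource, e.g., a URL or URN for a network resource, an ISBN, or another formal name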

Hsin-Hsi Chen 6-7

Dublin Core (Continued)

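  – Relation
    • Relationships to other resources, e.g., the series the resource belongs to, or other relations
  – Source
    • Where the work is derived from
  – Language
    • The language used for the content of the resource
  – Temporal and spatial coverage (Coverage)
    • The temporal and spatial characteristics of the resource
  – Rights
    • The copyright statement of the resource and the rules governing rights management and use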

Hsin-Hsi Chen 6-8

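MARC
• Machine-Readable Cataloging Record
• The most widely used format for library records
• An example (NTU Library)

Title: 公共藝術年鑑 = Public art in Taiwan (eng) / 何政廣, editor-in-chief
Imprint: Taipei: Council for Cultural Affairs, Executive Yuan, ROC year 88-
Imprint: 1999.
Physical description: volumes; color illustrations; 29 cm
Note: Cataloged from ROC year 87 bibliographic data
Chinese subject heading (csh): Public art -- Yearbooks
Added author: 何政廣
Control number: 100982322.
ISBN: 957-02-4468-2 (paperback) NT$500.
LC card number: cw 88008821.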

Hsin-Hsi Chen 6-9

Web Metadata

• purposes
  – cataloging (e.g., BibTeX)
  – content rating
    • protects children from reading certain types of documents
  – intellectual property rights
  – digital signatures (for authentication)
  – privacy levels
  – applications to electronic commerce
  – …
• RDF (Resource Description Framework)

Hsin-Hsi Chen 6-10

RDF

• description of nodes and attached attribute/value pairs

• nodes: any Web resource

• attributes: properties of nodes

• values: text strings or other nodes (Web resources or metadata instances)

Hsin-Hsi Chen 6-11

RDF basic model

[Figure: Resource --Property--> Value]
• A statement consists of a subject (the resource), a predicate (the property), and an object (the value)

Hsin-Hsi Chen 6-12

Example 1

[Figure: RDF graph]
機器貓小叮噹 (Doraemon) --author--> 籐子不二雄 (Fujiko Fujio)
機器貓小叮噹 (Doraemon) --type--> comic

Hsin-Hsi Chen 6-13

RDF structural model

[Figure: Resource --Property--> Resource; the second Resource has two Property arcs, each pointing to a value]

Hsin-Hsi Chen 6-14

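Example 2

[Figure: RDF graph with an intermediate node]
機器貓小叮噹 (Doraemon) --author--> Dummy
Dummy --name--> 籐子不二雄 (Fujiko Fujio)
Dummy --email--> [email protected]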
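The two example graphs can be written down as (subject, predicate, object) triples. The following is a minimal sketch, not an RDF library: the predicate names `author`, `type`, `name`, and `email` are English stand-ins for the arc labels, `_:dummy` is an assumed label for the intermediate "Dummy" node, and the helper functions are hypothetical.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# "_:dummy" is an assumed label for the intermediate node in Example 2.
TRIPLES = [
    ("機器貓小叮噹", "author", "_:dummy"),
    ("_:dummy", "name", "籐子不二雄"),
    ("_:dummy", "email", "[email protected]"),  # address is elided in the source
    ("機器貓小叮噹", "type", "漫畫"),
]


def objects(triples, subject, predicate):
    """Return every object attached to `subject` through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_names(triples, work):
    """Follow work --author--> node --name--> value."""
    names = []
    for node in objects(triples, work, "author"):
        names.extend(objects(triples, node, "name"))
    return names


if __name__ == "__main__":
    print(author_names(TRIPLES, "機器貓小叮噹"))
```

Because a value may itself be a node, reaching the author's name takes two hops; this is the difference between the basic model and the structural model.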

Hsin-Hsi Chen 6-15

Name Space

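• Provides a mechanism for using controlled vocabularies defined by other organizations
• Provides a mechanism for authoritative organizations to define controlled vocabularies
• Example (the dc prefix refers to Dublin Core):
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax/"
     xmlns:dc="http://purl.org/dc/elements/1.0/">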

Hsin-Hsi Chen 6-16

DC in RDF

[Figure: a Resource node with one arc for each of the following properties]

dc:type

dc:title

dc:description

dc:subject

dc:coverage

dc:creator

dc:contributor

dc:publisher

dc:date

dc:relation

dc:language

dc:identifier

dc:rights

dc:format

dc:source

Hsin-Hsi Chen 6-17

A DC Example in RDF

[Figure: http://x.html --dc:creator--> Kevin Chen]

<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://x.html">
    <dc:creator> Kevin Chen </dc:creator>
  </Description>
</RDF>

Hsin-Hsi Chen 6-18

RDF syntax

<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://www.lis.ntu.edu.tw/~khchen/">
    <dc:Title> The Magic Shelter </dc:Title>
    <dc:Creator> Kuang-hua Chen </dc:Creator>
  </Description>
</RDF>

[Figure: http://www.lis.ntu.edu.tw/~khchen/ --dc:title--> "The Magic Shelter"; http://www.lis.ntu.edu.tw/~khchen/ --dc:creator--> "Kuang-hua Chen"]
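To see how the XML serialization maps back to statements, the RDF example on this slide can be read with a generic XML parser. This is a sketch that assumes Python's standard `xml.etree.ElementTree` module as the parser; `rdf_to_triples` is a hypothetical helper written for this one document shape, not a general RDF reader.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax#"
DC_NS = "http://purl.org/dc/elements/1.0/"

RDF_XML = """<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://www.lis.ntu.edu.tw/~khchen/">
    <dc:Title> The Magic Shelter </dc:Title>
    <dc:Creator> Kuang-hua Chen </dc:Creator>
  </Description>
</RDF>"""


def rdf_to_triples(xml_text):
    """Turn each property element of each Description into one triple."""
    root = ET.fromstring(xml_text)
    triples = []
    # ElementTree spells a namespaced tag as "{namespace-uri}localname".
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")
        for prop in desc:
            namespace, _, local = prop.tag[1:].partition("}")
            prefix = "dc:" if namespace == DC_NS else ""
            triples.append((subject, prefix + local, (prop.text or "").strip()))
    return triples


if __name__ == "__main__":
    for triple in rdf_to_triples(RDF_XML):
        print(triple)
```

The namespace declarations are what let the parser tell the Dublin Core properties apart from the RDF elements that wrap them.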

Hsin-Hsi Chen 6-19

Text

• Formats
  – Basic form
    • ASCII, …
  – Document interchange
    • Rich Text Format (RTF): used by word processors
    • Portable Document Format (PDF) and PostScript: used for displaying or printing documents
    • MIME (Multipurpose Internet Mail Extensions): supports multiple character sets, multiple languages, and multiple media

Hsin-Hsi Chen 6-20

Text (Continued)

  – Compression
    • Compress (Unix)
    • ARJ (PCs)
    • ZIP (gzip in Unix and WinZip in Windows)

Hsin-Hsi Chen 6-21

Information Theory

• entropy
  – Measures information content or information uncertainty

    $$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$

    where $\sigma$ is the number of symbols in the alphabet and $p_i$ is the probability of symbol i
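The entropy of a text, E = -Σ p_i log2 p_i over the symbols of its alphabet, can be estimated by using relative character frequencies as the probabilities p_i. This is a small sketch under that assumption; the function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, with p_i estimated from symbol counts."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    for sample in ("aaaa", "aabb", "abcd"):
        print(sample, entropy(sample))
```

A text that repeats one symbol carries no uncertainty, while a text whose symbols are all equally likely reaches the maximum of log2(σ) bits per symbol.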

Hsin-Hsi Chen 6-22

Modeling Natural Language

• Issue 1: how a word is formulated
  – Symbols (those that separate words and those that belong to words)
  – Vowels are more frequent than most consonants
  – Binomial model (0-order Markov model): each symbol is generated with a certain probability
  – k-order Markov model
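The difference between the 0-order and k-order models is only how much preceding context conditions the next symbol. The following is a minimal sketch that estimates these probabilities from counts; `train_markov` and `probability` are hypothetical helper names, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_markov(text, k):
    """Count which symbol follows each context of the k preceding symbols."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        model[text[i - k:i]][text[i]] += 1
    return model


def probability(model, context, symbol):
    """Estimated P(symbol | context); 0.0 if the context was never seen."""
    total = sum(model[context].values())
    return model[context][symbol] / total if total else 0.0


if __name__ == "__main__":
    order0 = train_markov("abab", 0)  # the context is always the empty string
    order1 = train_markov("abab", 1)  # the context is the previous symbol
    print(probability(order0, "", "a"), probability(order1, "a", "b"))
```

With k = 0 every symbol is drawn from one fixed distribution; with k = 1 the sample text "abab" becomes fully predictable.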

Hsin-Hsi Chen 6-23

Modeling Natural Language(Continued)

• Issue 2: how different words are distributed inside each document

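• Zipf's law
  – The frequency of the i-th most frequent word is $1/i^{\theta}$ times that of the most frequent word
  – In a text of n words with a vocabulary of V words, the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where

    $$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}, \qquad \theta = 1.5 \sim 2.0$$

  – Simplest case ($\theta = 1$): if the most frequent word appears $n_1$ times, then

    $$n_1 \left(1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{V}\right) = n_1 \sum_{j=1}^{V} \frac{1}{j} = n$$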
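Zipf's law gives the expected count of the i-th most frequent word as n / (i^θ · H_V(θ)), where H_V(θ) is the sum of 1/j^θ for j = 1..V. The sketch below computes these counts; the text size, vocabulary size, and function names are illustrative assumptions, not data from the slides.

```python
def harmonic(V, theta):
    """H_V(theta) = sum over j = 1..V of 1 / j**theta."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_counts(n, V, theta=1.0):
    """Expected count of the i-th most frequent word, for i = 1..V."""
    h = harmonic(V, theta)
    return [n / (i ** theta * h) for i in range(1, V + 1)]


if __name__ == "__main__":
    # Hypothetical text: 100,000 words over a 5,000-word vocabulary.
    counts = zipf_counts(100_000, 5_000, theta=1.0)
    print(round(counts[0]), round(counts[1]), round(sum(counts)))
```

Dividing by H_V(θ) is what makes the V expected counts add up to the text size n.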

Hsin-Hsi Chen 6-24

[Figure: word frequency F plotted against words; vocabulary size V plotted against text size]

• A few hundred words account for 50% of the text.
• Words that are too frequent (stopwords) can be disregarded.

Hsin-Hsi Chen 6-25

Modeling Natural Language(Continued)

• Issue 3: the distribution of words in the documents of a collection

• Negative binomial distribution
  – The fraction of documents containing a word k times is

    $$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$

    where p and $\alpha$ depend on the word and the document collection
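The negative binomial fraction F(k) = C(α+k-1, k) · p^k · (1+p)^(-α-k) can be evaluated for non-integer α by writing the binomial coefficient with gamma functions. This is a sketch; the parameter values used below are hypothetical and do not come from any real collection.

```python
import math


def negative_binomial_fraction(k, p, alpha):
    """F(k): fraction of documents that contain the word exactly k times."""
    # C(alpha + k - 1, k) = Gamma(alpha + k) / (Gamma(k + 1) * Gamma(alpha))
    log_coeff = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_coeff + k * math.log(p) - (alpha + k) * math.log1p(p))


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    p, alpha = 1.0, 0.5
    for k in range(4):
        print(k, negative_binomial_fraction(k, p, alpha))
```

Summed over all k, the fractions add up to 1, which is a quick sanity check on the formula.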

Hsin-Hsi Chen 6-26

Modeling Natural Language(Continued)

• Issue 4: number of distinct words in a document

• Heaps' Law
  – The vocabulary of a text of size n words is

    $$V = K n^{\beta}$$

    where K and $\beta$ depend on the particular text
    • K: between 10 and 100
    • $\beta$: a positive value less than 1
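Heaps' law, V = K · n^β, can be compared with the vocabulary actually observed in a token list. The sketch below assumes illustrative values for K and β; both function names are hypothetical.

```python
def heaps_vocabulary(n, K, beta):
    """Predicted vocabulary size V = K * n**beta for a text of n words."""
    return K * n ** beta


def observed_vocabulary(words):
    """Number of distinct words actually present in a list of tokens."""
    return len(set(words))


if __name__ == "__main__":
    # Hypothetical parameters inside the ranges given on the slide.
    K, beta = 20, 0.5
    for n in (10_000, 40_000):
        print(n, heaps_vocabulary(n, K, beta))
    print(observed_vocabulary("to be or not to be".split()))
```

Because β is below 1 the growth is sublinear: with β = 0.5, a text four times longer has only twice the vocabulary.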

Hsin-Hsi Chen 6-27

Modeling Natural Language(Continued)

• Issue 5: average length of words

• Heaps' law
  – The length of the words in the vocabulary increases logarithmically with the text size

Hsin-Hsi Chen 6-28

Similarity Model

• distance function
  – symmetric: distance(a,b) = distance(b,a)
  – triangle inequality: distance(a,c) ≤ distance(a,b) + distance(b,c)
  – measures
    • Edit distance: minimum number of character insertions, deletions, and substitutions
      e.g., Edit-distance(color, colour) = 1, Edit-distance(survey, surgery) = 2
    • Longest common subsequence: only deletion is allowed
      e.g., LCS(survey, surgery) = surey
    • Longest common sequence of lines between two files: e.g., the diff command in Unix
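Both string measures on this slide have standard dynamic-programming formulations. The sketch below uses the textbook recurrences rather than anything specific to the slides, and reproduces the slide's examples; the function names are illustrative.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


if __name__ == "__main__":
    print(edit_distance("color", "colour"), edit_distance("survey", "surgery"))
    print(lcs("survey", "surgery"))
```

Applying the same subsequence idea to lines instead of characters gives the line-based comparison that the slide associates with diff.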

Hsin-Hsi Chen 6-29

Markup Languages

• Definition
  – Textual syntax that describes formatting actions, structure information, text semantics, attributes, etc.
• Types
  – Procedural markup
  – Descriptive markup

Hsin-Hsi Chen 6-30

Procedural Markup

Hsin-Hsi Chen 6-31

Descriptive Markup

Hsin-Hsi Chen 6-32

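Characteristics of descriptive markup

• Separates document content from presentation format
• Marks up the semantic structure of the document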

Hsin-Hsi Chen 6-33

SGML(Standard Generalized Markup Language)

• A standard established by ISO in 1986: ISO 8879
• A form of descriptive markup
• A meta-language
  – HTML is an application of SGML

Hsin-Hsi Chen 6-34

Features of SGML

• Flexibility
  – Can describe any information structure and any complex document
• Non-proprietary, platform-independent, and system-independent
  – Facilitates document interchange and long-term preservation
• Information re-usability

Hsin-Hsi Chen 6-35

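Components of an SGML document

• SGML declaration
  – Specifies the character set used by the document and particular optional features
• DTD (Document Type Definition)
  – Defines the elements that the document contains
  – Defines the content and attributes of the elements
  – ...
• DI (Document Instance)
  – The marked-up document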

Hsin-Hsi Chen 6-36

SGML Declaration

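• Specifies the character set used by the SGML document and particular optional features.
• The SGML declaration may be omitted, in which case the document uses SGML's default character set and feature settings.
• <!SGML "ISO 8879-1986" ...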

Hsin-Hsi Chen 6-37

Example: document structure of an Email

[Figure: tree with root Email and children From, Date, To, Subject, Body]

Hsin-Hsi Chen 6-38

An SGML DTD for Email

<!-- Elements      Min   Content                            -->
<!-- -----------   ----- ---------------------------------- -->
<!ELEMENT Email    - -   (From,Date,To+,Subject,Body?)>
<!ELEMENT From     - O   (#PCDATA)>
<!ELEMENT Date     - O   (#PCDATA)>
<!ELEMENT To       - -   (#PCDATA)>
<!ELEMENT Subject  - O   (#PCDATA)>
<!ELEMENT Body     - -   (#PCDATA)>
<!-- End of Email DTD -->

Notes:
• <!-- ... -->: comment
• Min column: whether the starting and ending tags are compulsory (-) or optional (O)
• Content model operators
  – , : concatenation
  – | : logical or
  – ? : 0 or 1 occurrence
  – * : 0 or more occurrences
  – + : 1 or more occurrences
• Content types
  – PCDATA: ASCII characters (parsed character data)
  – NDATA: binary data
  – EMPTY

Hsin-Hsi Chen 6-39

An SGML DI for Email DTD

<!DOCTYPE Email SYSTEM "c:\temp\email.dtd">
<Email>
<From>Joe
<Date>1999-7-14 AM 09:20
<To>Jay</To>
<To>Jennifer</To>
<Subject>Learning XML
<Body>XML is going to shine on the Web, so start learning it now! …</Body>
</Email>

Notes:
• SYSTEM: the DTD is user defined (vs. PUBLIC)
• The ending tag is optional (here for From, Date, and Subject)

Hsin-Hsi Chen 6-40

Hsin-Hsi Chen 6-41

SGML, DTDs, Document Instances, and Presentation Instances

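[Figure: SGML is used to define DTDs (DTD, DTD, …); each DTD has document instances (DI, DI, DI, …); each DI can be rendered as presentation instances: print version, hypertext version, Braille version, …]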

Hsin-Hsi Chen 6-42

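Limitations on the development of SGML
• SGML applications are hard to develop
• SGML documents are hard to distribute on the Web
• Lack of vendor support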

Hsin-Hsi Chen 6-43

HTML (Hypertext Markup Language)

• An application of SGML:
  – HTML 2.0 DTD
  – HTML 3.2 DTD
  – HTML 4.0 DTD
• The current standard data format for authoring Web pages
• Simple and easy to learn
• Portable
• Can combine hyperlinks and multimedia

Hsin-Hsi Chen 6-44

Characteristics of HTML
• The HTML DTD is designed mainly to meet the needs of online display
• HTML has built-in styles
• HTML uses SGML's markup minimization feature
• HTML does not adopt SGML's hyperlink mechanism

Hsin-Hsi Chen 6-45

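Limitations of HTML
• Structural limitations
• Limited information reuse
• Limited data interchange
• Limited automatic document processing
• Cannot support more precise queries
• The HTML extensions released by different vendors are incompatible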

Hsin-Hsi Chen 6-46

XML (eXtensible Markup Language)

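• W3C Recommendation 10-February-1998
  – XML 1.0
• Backed by major vendors: Microsoft, Netscape, Sun, ...
• XML is SGML-- rather than HTML++
• Takes the strengths of SGML and makes up for the weaknesses of HTML
  – Lets users define their own tags as needed
  – Can be delivered over the Web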

Hsin-Hsi Chen 6-47

W3C Data Format

http://www.w3c.org/

Hsin-Hsi Chen 6-48

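The most important properties of XML
• Extensibility
  – XML lets users define their own tags as needed.
• Structure
  – XML can describe all kinds of complex document structures.
• Validation
  – XML can validate the structure of a document against a DTD.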

Hsin-Hsi Chen 6-49

XML standards

• XML-Language: SGML without tears
  – Self-describing documents
  – Well-formed and valid documents
• XML-Link: Power linking
  – Simple and extended links
• XML-Style: Separate style from content
  – XSL (Extensible Stylesheet Language)

Hsin-Hsi Chen 6-50

Current status of XML standardization
• XML 1.0:
  – W3C Recommendation 10-Feb-1998
• XML Namespace:
  – W3C Recommendation 14-Jan-1999
• XLink & XPointer:
  – W3C Working Draft 03-March-1998
• XSL:
  – W3C Working Draft 16-Dec-1998

Hsin-Hsi Chen 6-51

Well-formed XML Rules

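• The document contains one or more elements
• There is exactly one root element
• Neither the start-tag nor the end-tag may be omitted
• All tags must be properly nested (e.g., <B><I>bold and italic</B>italic</I> is not allowed)
• Empty tags must follow the special XML syntax (e.g., <img src="…"/>)
• All attribute values must be enclosed in single or double quotes (e.g., <font size="2">)
• All entities must be declared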

Hsin-Hsi Chen 6-52

Writing Well-Formed XML

• Step 1 : Make an XML Declaration

• Step 2 : Creating a Root Element

• Step 3 : Writing in XML

• Step 4 : Parsing your document

Hsin-Hsi Chen 6-53

Step 1:Make an XML Declaration

• <?xml version="1.0" standalone="yes"?>
• <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
• <?xml version="1.0" encoding="big5" standalone="yes"?>

Hsin-Hsi Chen 6-54

Step 2:Creating a Root Element

<?xml version="1.0" standalone="yes"?>
<Email>
……
</Email>

Hsin-Hsi Chen 6-55

Step 3:Writing in XML

<?xml version="1.0" encoding="big5" standalone="yes"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML is going to shine on the Web, so start learning it now! …</Body>
</Email>

Hsin-Hsi Chen 6-56

Step 4:Parsing your document

• Check that your XML document conforms to the well-formed XML rules.
• Use a parser to check well-formedness
  – for example: the XML parser embedded in IE5
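Any non-validating XML parser can perform this check. As a sketch, the function below uses Python's standard `xml.etree.ElementTree` module in place of the IE5 parser named on the slide (an assumption made for illustration); `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    samples = [
        "<Email><From>Joe</From><To>Jay</To></Email>",  # well-formed
        "<B><I>bold and italic</B>italic</I>",          # improperly nested
        "<Email><From>Joe</Email>",                     # missing end-tag
        "<font size=2>text</font>",                     # unquoted attribute value
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```

The parser reports the first violation it meets as an error, which is the behavior the well-formedness rules require.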

Hsin-Hsi Chen 6-57

Viewing well-formed XML in Explorer 5.0

Hsin-Hsi Chen 6-58

Viewing an erroneous XML document in Explorer 5.0

Hsin-Hsi Chen 6-59

Writing Valid XML

• Step 1 : Make an XML declaration.

• Step 2 : Designing a DTD.

• Step 3 : Writing Valid XML.

• Step 4 : Parsing your Valid XML document.

Hsin-Hsi Chen 6-60

Step 1:Make an XML Declaration

• <?xml version="1.0" standalone="no"?>
• <?xml version="1.0" encoding="UTF-8" standalone="no"?>
• <?xml version="1.0" encoding="big5" standalone="no"?>

Hsin-Hsi Chen 6-61

Step 2 : Designing a DTD

<!-- Elements Content -->

<!-- ----------- ---------------------------------- -->

<!ELEMENT Email (From,Date,To+,Subject,Body?)>

<!ELEMENT From (#PCDATA)>

<!ELEMENT Date (#PCDATA)>

<!ELEMENT To (#PCDATA)>

<!ELEMENT Subject (#PCDATA)>

<!ELEMENT Body (#PCDATA)>

<!-- End of Email DTD -->

Hsin-Hsi Chen 6-62

Step 3 : Writing Valid XML

<?xml version="1.0" encoding="big5" standalone="no"?>
<!DOCTYPE Email SYSTEM "email.dtd">
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML is going to shine on the Web, so start learning it now! …</Body>
</Email>
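Validity adds a second check on top of well-formedness: the children of Email must match the content model (From,Date,To+,Subject,Body?) from the DTD. Python's standard `xml.etree.ElementTree` does not validate against DTDs, so the sketch below hand-codes that one content model as a regular expression. It is an illustration of what a validating parser checks, not a general DTD validator, and `is_valid_email` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Content model of the Email DTD: (From,Date,To+,Subject,Body?)
# Child tag names are joined as "From,Date,To,...," before matching.
EMAIL_MODEL = re.compile(r"From,Date,(To,)+Subject,(Body,)?")


def is_valid_email(xml_text):
    """Well-formed, rooted at Email, and children match the content model."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    if root.tag != "Email":
        return False
    sequence = "".join(child.tag + "," for child in root)
    # Every child is declared as (#PCDATA), so it may not contain elements.
    text_only = all(len(child) == 0 for child in root)
    return text_only and EMAIL_MODEL.fullmatch(sequence) is not None


if __name__ == "__main__":
    good = ("<Email><From>Joe</From><Date>1999-7-14</Date><To>Jay</To>"
            "<To>Jennifer</To><Subject>Learning XML</Subject></Email>")
    bad = "<Email><From>Joe</From><Subject>Learning XML</Subject></Email>"
    print(is_valid_email(good), is_valid_email(bad))
```

The second sample is well-formed but not valid, because the required Date and To elements are missing.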

Hsin-Hsi Chen 6-63

XML Simple Link

Hsin-Hsi Chen 6-64

XML Extended linking: multiple ends

Hsin-Hsi Chen 6-65

XML Extended linking: addressing by structure

Hsin-Hsi Chen 6-66

XML Extended linking

Hsin-Hsi Chen 6-67

CSS (Cascading Style Sheet)

• Sample: email.css

Email, From, Date, To, Subject, Body
{display: block; margin-left: 5%;
 margin-right: 5%; border-style: groove;}

Hsin-Hsi Chen 6-68

XML document with Style

<?xml version="1.0" encoding="big5" standalone="no"?>
<?xml-stylesheet href="email.css" type="text/css"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML is going to shine on the Web, so start learning it now! …</Body>
</Email>

Hsin-Hsi Chen 6-69

Viewing an XML document combined with CSS in Explorer 5.0

Hsin-Hsi Chen 6-70

Applications of XML

• Database interchange

• Client-side processing

• User views of the data

• Information filtering

Hsin-Hsi Chen 6-71

Multimedia

• media
  – text, sound, images, video
• issues
  – volume, format, processing requirements

Hsin-Hsi Chen 6-72

Formats

• image
  – bit-mapped/pixel-based display
    • The simplest format
    • XBM, BMP, PCX
    • disadvantage: redundancy
  – compression
    • CompuServe's Graphics Interchange Format (GIF)
  – lossy compression
    • Joint Photographic Experts Group (JPEG)
  – exchange
    • Tagged Image File Format (TIFF)

Hsin-Hsi Chen 6-73

Formats

• Audio
  – AU, MIDI, WAVE
• Video
  – MPEG, AVI, QuickTime

Hsin-Hsi Chen 6-74

Textual Images

• definition
  – images of documents that contain mainly typed or typeset text
  – obtained by OCR
• image retrieval
  – Alternative 1
    • At creation time, a set of keywords (called metadata) is associated with each image
    • Conventional text retrieval techniques can be applied to the keywords

Hsin-Hsi Chen 6-75

Textual Images (Continued)

  – Alternative 2
    • Use OCR to extract the text of the image
    • The resultant ASCII text can be used to extract keywords
  – Alternative 3
    • Use the symbols extracted from the images as basic units to combine image retrieval techniques with sequence retrieval techniques

Hsin-Hsi Chen 6-76

Taxonomy of Web languages