Hsin-Hsi Chen 6-1
Chapter 6: Text and Multimedia Languages and Properties
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Part of the following material is selected from Dr. Kuang-hua Chen's talk on XML and RDF (Department of Library and Information Science, National Taiwan University)
What is a document?
• document: a single unit of information
  – a complete logical unit
    • research paper, book, manual
  – part of a larger text
    • paragraph, passage, an entry in a dictionary, …
  – a physical unit
    • file, email, Web page
Characteristics of a document
• Document: text + structure + other media
• Syntax: implicit, or expressed in a language; it expresses structure, presentation style, or even external actions
• Presentation style: how a document is displayed or printed
• Semantics
• Roles: creator, author
Metadata (rendered in Chinese as 元資料, 超資料, 中介資料, 中間資料, 後設資料, or 詮釋資料)
• Definition
  – data about the data, e.g., the schema in a DBMS
  – describes other information according to some rules or policies
• Types
  – Descriptive metadata
    • metadata that is external to the meaning of the document
    • e.g., Dublin Core
  – Semantic metadata
    • metadata that can be found within the document's content
    • e.g., Library of Congress subject codes
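Dublin Core
• Metadata Element Set (15 elements)
  – Subject (subject and keywords)
    • the topic of the resource: keywords or phrases that describe its subject or content, including controlled vocabularies or classification schemes
  – Title
    • the name given to the resource by its creator or publisher
  – Creator
    • the person, organization, or institution that created the content of the resource
  – Description
    • a textual description of the content, such as a document abstract or a summary of an image resource
  – Publisher
    • the organization that issues the resource, e.g., a publishing house, a university department, a group, or an organization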
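Dublin Core (Continued)
  – Contributors
    • other persons or organizations that contributed to the creation of the resource, e.g., editors, translators, or illustrators
  – Date (publication date)
    • the date on which the resource was issued
  – Type (resource type)
    • the kind of resource, e.g., home page, novel, poem, technical report, dictionary
  – Format (data format)
    • the file format of the resource, e.g., text/html, ASCII, or a JPEG image file
  – Identifier (resource identifier)
    • a string or number that uniquely identifies the resource, e.g., a URL or URN for network resources, or an ISBN or other formal name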
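Dublin Core (Continued)
  – Relation
    • relationships to other resources, e.g., the series the resource belongs to
  – Source
    • where the work is derived from
  – Language
    • the language used for the content of the resource
  – Coverage (temporal and spatial coverage)
    • the temporal and spatial characteristics of the resource
  – Rights (rights management)
    • the copyright statement and the rules governing rights management and use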
Example: an artifact (器物)
<?xml version="1.0"?>
<dc-record>
  <type> 器物 </type>
  <format> 銅、琺瑯 </format>
  <format> 掐絲 </format>
  <title> 景泰掐絲琺瑯番蓮紋盒 </title>
  <title>cloisonné box with lotus-spray decoration</title>
  <description>1400/1500</description>
  <description> 銅胎,蓋與器身鑄成浮雕式八瓣蓮花形 </description>
  <description> 高 63.cm 口徑 12.4cm 重 634.6 克 </description>
  <description> 陳夏生,明清琺瑯器展覽圖錄。台北:國立故宮博物院,民 88 年 2 月。 </description>
  <subject> 景泰掐絲琺瑯番蓮紋盒 </subject>
  <subject> 日常生活 </subject>
  <subject> 容器 </subject>
  <subject> 銅、琺瑯 </subject>
  <subject> 掐絲 </subject>
  <subject> 地區 ( 社的座落位置 )(r) place</subject>
  <date>1400/1500</date>
  <coverage> 地區 ( 社的座落位置 )(r) place</coverage>
  <rights> 臺灣 , 故宮 </rights>
</dc-record>
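A record like this one can be read back with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the record; the helper name `dc_fields` is made up for illustration, and repeated elements such as `format` are collected into lists.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abridged copy of the artifact record; only a few fields are kept.
RECORD = """<?xml version="1.0"?>
<dc-record>
  <type>器物</type>
  <format>銅、琺瑯</format>
  <format>掐絲</format>
  <title>景泰掐絲琺瑯番蓮紋盒</title>
  <date>1400/1500</date>
  <rights>臺灣, 故宮</rights>
</dc-record>"""

def dc_fields(xml_text):
    """Group element text by element name (hypothetical helper; elements may repeat)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    fields = defaultdict(list)
    for child in root:
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)

fields = dc_fields(RECORD)
print(fields["format"])
print(fields["date"])
```

Grouping by element name keeps the repeatable nature of Dublin Core elements visible: every field maps to a list, even when it occurs once.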
Example: an ink painting on paper (紙本水墨)
<?xml version="1.0"?>
<dc-record>
  <type> 紙本水墨 </type>
  <type> 原件 </type>
  <title> 古木流泉 </title>
  <description> 全文 </description>
  <description> 紙本水墨 </description>
  <description>30*48.7</description>
  <description> 蓼塘。楊世家藏。神。品。項元汴印。項子京家珍藏。項墨林鑑賞章。墨林秘玩。?李項氏士家寶玩。張澤之。柯亭文房之印。乾隆御覽之寶。石渠寶笈。重華宮鑑藏寶。樂善堂圖書記。 </description>
  <description>1127/1189</description>
  <description> 國立故宮博物院編輯委員會,宋代書畫冊頁名品特展。台北:國立故宮博物院,民 84 年 9 月。 </description>
  <subject>風景 </subject>
  <creator>馬和之 </creator>
  <date>1127/1189</date>
  <language>zh</language>
  <rights> 臺灣 , 故宮 </rights>
</dc-record>
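MARC
• Machine-Readable Cataloging record
• The most widely used format for library records
• An example (NTU Library)
  Title (書名): 公共藝術年鑑 Public art in Taiwan eng 何政廣 總編輯
  Imprint (出版項): 臺北市 行政院文化建設委員會 民 88-
  Imprint (出版項): 1999.
  Physical description (稽核項): 冊 彩圖 29公分
  Note (附註): 據民 87 年書目資料著錄
  Chinese subject heading (中文標題): csh 公共藝術 -- 年鑑
  Added author (其他作者): 何 政廣
  Control number (控制號): 100982322.
  Control number (控制號): 100982322.
  International standard number (國際標準號): 957-02-4468-2 平裝 NT$500.
  LC card number (國會卡片號): cw 88008821.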
Web Metadata
• purposes
  – cataloging (e.g., BibTeX)
  – content rating
    • protects children from reading certain types of documents
  – intellectual property rights
  – digital signatures (for authentication)
  – privacy levels
  – applications to electronic commerce
  – …
• RDF (Resource Description Framework)
RDF
• description of nodes and attached attribute/value pairs
• nodes: any Web resource
• attributes: properties of nodes
• values: text strings or other nodes (Web resources or metadata instances)
RDF Basic Model
• Statement = Resource (subject) + Property (predicate) + Value (object)
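One way to picture the model is as a list of (subject, predicate, object) triples. The following is a minimal sketch, not an RDF library: the `Statement` type, the helper `properties_of`, and the example URL and values are all hypothetical.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property
    obj: str        # the value: a text string or another resource

# Hypothetical statements about a made-up resource.
statements = [
    Statement("http://example.org/doc", "dc:creator", "A. Author"),
    Statement("http://example.org/doc", "dc:language", "zh"),
    Statement("http://example.org/other", "dc:creator", "B. Writer"),
]

def properties_of(resource, stmts):
    """Collect the property/value pairs attached to one resource."""
    return {s.predicate: s.obj for s in stmts if s.subject == resource}

print(properties_of("http://example.org/doc", statements))
```

Storing statements as flat triples makes it easy to add a new property to a resource without changing any schema.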
Example 1
• Doraemon (機器貓小叮噹) → author (作者) → Fujiko Fujio (籐子不二雄)
• Doraemon (機器貓小叮噹) → type (型態) → comic (漫畫)
RDF Structural Model
• A resource is connected through a property to another resource
• Each resource can carry further properties, each with its own value
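Name Space
• Provides a mechanism for using controlled vocabularies maintained by other organizations
• Provides a mechanism for authorities to define their own controlled vocabularies
• Example (the xmlns:dc attribute declares the Dublin Core name space):
  <RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax/"
       xmlns:dc="http://purl.org/dc/elements/1.0/">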
DC in RDF
A Resource can carry any of the following properties:
dc:type
dc:title
dc:description
dc:subject
dc:coverage
dc:creator
dc:contributor
dc:publisher
dc:date
dc:relation
dc:language
dc:identifier
dc:rights
dc:format
dc:source
A DC Example in RDF
Graph form: http://x.html → dc:creator → Kevin Chen
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://x.html">
    <dc:creator> Kevin Chen </dc:creator>
  </Description>
</RDF>
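To make the link between the XML serialization and the statement model concrete, the sketch below walks such a document with Python's standard `xml.etree.ElementTree` and yields one triple per property element. The function name `triples` is an assumption for illustration; only the `dc` prefix is mapped back, and this is not a general RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax#"
DC_NS = "http://purl.org/dc/elements/1.0/"

DOC = """<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://x.html">
    <dc:creator> Kevin Chen </dc:creator>
  </Description>
</RDF>"""

def triples(rdf_xml):
    """Yield (subject, predicate, object) for each property of each Description."""
    root = ET.fromstring(rdf_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")  # unprefixed attributes carry no namespace
        for prop in desc:
            # ElementTree reports tags as "{namespace-uri}localname".
            uri, _, local = prop.tag[1:].partition("}")
            prefix = "dc:" if uri == DC_NS else ""
            yield subject, prefix + local, (prop.text or "").strip()

print(list(triples(DOC)))
```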
RDF Syntax
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://www.lis.ntu.edu.tw/~khchen/">
    <dc:Title> The Magic Shelter </dc:Title>
    <dc:Creator> Kuang-hua Chen </dc:Creator>
  </Description>
</RDF>
Graph form:
• http://www.lis.ntu.edu.tw/~khchen/ → dc:title → "The Magic Shelter"
• http://www.lis.ntu.edu.tw/~khchen/ → dc:creator → "Kuang-hua Chen"
Text
• Formats
  – Basic form
    • ASCII, …
  – Document interchange
    • Rich Text Format (RTF): used by word processors
    • Portable Document Format (PDF) and PostScript: used for displaying or printing documents
    • MIME (Multipurpose Internet Mail Extensions): supports multiple character sets, multiple languages, and multiple media
Text (Continued)
  – Compression
    • Compress (Unix)
    • ARJ (PCs)
    • ZIP (gzip in Unix and WinZip in Windows)
Information Theory
• entropy
  – Measures information content or information uncertainty
    $$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$
    where $\sigma$ is the number of symbols in the alphabet and $p_i$ is the probability of symbol i
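As a small illustration, the entropy of a string can be estimated by taking each symbol's relative frequency as its probability and summing p·log2(1/p), which equals the negative sum of p·log2 p. This is a sketch under that assumption (probabilities estimated from the text itself); the function name `entropy` is chosen here for illustration.

```python
import math
from collections import Counter

def entropy(text):
    """Entropy in bits per symbol, with each p_i estimated from symbol counts."""
    counts = Counter(text)
    total = len(text)
    # p * log2(1 / p) is the same quantity as -p * log2(p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy("abab"))   # two equally likely symbols
print(entropy("aaaa"))   # a single symbol, so no uncertainty
print(entropy("abcd"))   # four equally likely symbols
```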
Modeling Natural Language
• Issue 1: how a word is formulated
  – symbols (those that separate words and those that belong to words)
  – Vowels are more frequent than most consonants
  – Binomial model (0-order Markov model): each symbol is generated with a certain probability
  – k-order Markov model
• Extension: how a sentence is formulated
  – 5-order Markov model in Bible
  – finite-state model (regular languages)
  – grammar model (context free and other languages)
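A k-order Markov model over symbols can be estimated by counting which symbol follows each context of k preceding symbols. The sketch below is a minimal maximum-likelihood estimate with no smoothing; the function name `markov_model` and the sample sentence are illustrative assumptions. With k = 0 the context is empty and the model reduces to plain symbol probabilities.

```python
from collections import Counter, defaultdict

def markov_model(text, k):
    """Estimate P(next symbol | previous k symbols) from raw counts."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return {
        context: {sym: c / sum(nxt.values()) for sym, c in nxt.items()}
        for context, nxt in counts.items()
    }

model = markov_model("the theme of the thesis", k=2)
print(model["th"])   # distribution of the symbol that follows "th"
```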
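Modeling Natural Language (Continued)
• Issue 2: how different words are distributed inside each document
• Zipf's law
  – The frequency of the i-th most frequent word is $1/i^{\theta}$ times that of the most frequent word
  – In a text of n words with a vocabulary of V words, the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where
    $$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$$
  – $\theta$ = 1.5~2.0
  – The counts of all V words add up to the text size; for $\theta = 1$:
    $$n = \sum_{j=1}^{V} \frac{n}{j\,H_V(1)}, \qquad H_V(1) = 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{V}$$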
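The following sketch computes the counts that Zipf's law predicts, n / (i^θ · H_V(θ)) for rank i, and checks that they add back up to n. The values of n, V, and θ are illustrative only, and `zipf_counts` is a name chosen for this example.

```python
def zipf_counts(n, V, theta):
    """Predicted occurrences of the words at ranks 1..V in a text of n words."""
    # H_V(theta) = sum over j = 1..V of 1 / j**theta, computed once.
    h = sum(1.0 / j ** theta for j in range(1, V + 1))
    return [n / (i ** theta * h) for i in range(1, V + 1)]

n, V, theta = 1_000_000, 1_000, 1.5   # illustrative values only
counts = zipf_counts(n, V, theta)
print(round(sum(counts)))             # the predicted counts sum to n
print(counts[0] / counts[1])          # rank 1 over rank 2 equals 2**theta
```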
[Figure: word frequency F plotted against words, and vocabulary size V plotted against text size]
• There are a few hundred words which take up 50% of the text.
• Words that are too frequent (stopwords) can be disregarded.
Modeling Natural Language (Continued)
• Issue 3: the distribution of words in the documents of a collection
• Negative binomial distribution
  – The fraction of documents containing a word k times is
    $$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$
    where p and $\alpha$ depend on the word and the document collection
  – p = 9.24 and $\alpha$ = 0.42 for the word "said" in the Brown corpus
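As a worked illustration, the negative binomial fraction F(k) = C(α + k − 1, k) · p^k · (1 + p)^(−α − k) can be evaluated with the log-gamma function, which handles a non-integer α. This is a sketch using the parameter values quoted for "said"; the function name `neg_binomial` is an assumption.

```python
import math

def neg_binomial(k, p, alpha):
    """F(k) = C(alpha + k - 1, k) * p**k * (1 + p)**(-alpha - k)."""
    # Generalized binomial coefficient: Gamma(alpha + k) / (Gamma(alpha) * k!).
    log_coeff = math.lgamma(alpha + k) - math.lgamma(alpha) - math.lgamma(k + 1)
    return math.exp(log_coeff + k * math.log(p) - (alpha + k) * math.log(1 + p))

p, alpha = 9.24, 0.42   # values quoted for "said" in the Brown corpus
for k in range(4):
    print(k, round(neg_binomial(k, p, alpha), 4))
```

Summed over all k the fractions add up to 1, which is a quick sanity check on the reconstruction of the formula.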
Modeling Natural Language (Continued)
• Issue 4: number of distinct words in a document (document vocabulary)
• Heaps' Law
  – The vocabulary of a text of size n words is
    $$V = K n^{\beta}$$
    where K and $\beta$ depend on the particular text
  – K: between 10 and 100
  – $\beta$: a positive value less than 1 (e.g., 0.4 < $\beta$ < 0.6)
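A short sketch of Heaps' law, V = K·n^β, shows how slowly the vocabulary grows compared with the text. The values K = 20 and β = 0.5 are picked from the ranges above purely for illustration, and `heaps_vocabulary` is a hypothetical helper name.

```python
def heaps_vocabulary(n, K, beta):
    """Vocabulary size predicted by Heaps' law for a text of n words."""
    return K * n ** beta

# K and beta are illustrative values within the ranges quoted above.
for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n, K=20, beta=0.5)))
```

With β = 0.5, a text 100 times larger has a vocabulary only 10 times larger.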
Modeling Natural Language (Continued)
• Issue 5: average length of words
• Heaps' law
  – The length of the words in the vocabulary increases logarithmically with the text size
    • Longer words should appear as the text grows.
    • The average word length in the overall text is constant.
Similarity Model
• distance function
  – symmetric: distance(a,b) = distance(b,a)
  – triangle inequality: distance(a,c) ≤ distance(a,b) + distance(b,c)
  – measures
    • Edit distance: minimum number of character insertions, deletions, and substitutions
      e.g., Edit-distance(color, colour) = 1, Edit-distance(survey, surgery) = 2
    • Longest common subsequence: only deletion is allowed
      e.g., LCS(survey, surgery) = surey (non-common characters are deleted)
    • Longest common sequence of lines between two files: e.g., the diff command in Unix
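Both character-level measures can be computed with standard dynamic programming. The sketch below is one common textbook formulation, not necessarily the one the lecture had in mind; the function names `edit_distance` and `lcs` are chosen for illustration.

```python
def edit_distance(a, b):
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def lcs(a, b):
    """One longest common subsequence of a and b."""
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)

print(edit_distance("color", "colour"), edit_distance("survey", "surgery"))
print(lcs("survey", "surgery"))
```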
Markup Languages
• Definition
  – Textual syntax that describes formatting actions, structure information, text semantics, attributes, etc.
• Types
  – Procedural Markup
  – Descriptive Markup
Procedural Markup
Descriptive Markup
Features of Descriptive Markup
• Separates the content of a document from its presentation format
• Marks up the semantic structure of the document
SGML (Standard Generalized Markup Language)
• A standard established by ISO in 1986: ISO 8879
• A descriptive markup language
• A meta-language
  – HTML is an application of SGML.
Features of SGML
• Flexibility
  – can describe any information structure and arbitrarily complex documents
• Non-proprietary, platform-independent, and system-independent
  – facilitates document interchange and long-term preservation
• Information re-usability
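Components of an SGML Document
• SGML declaration
  – specifies the character set used by the document and specific optional features
• DTD (Document Type Definition)
  – defines the elements that a document contains
  – defines the content and attributes of the elements
  – ...
• DI (Document Instance)
  – the marked-up document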
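SGML Declaration
• Specifies the character set used by the SGML document and specific optional features.
• The SGML declaration may be omitted; the document then uses SGML's default character set and feature settings.
• <!SGML "ISO 8879-1986" ...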
Example: Document Structure of an Email
• Email: From, Date, To, Subject, Body
An SGML DTD for Email
<!-- Elements    Min   Content                             -->
<!-- ----------- ----- ----------------------------------  -->
<!ELEMENT Email   - -  (From, Date, To+, Subject, Body?)>
<!ELEMENT From    - O  (#PCDATA)>
<!ELEMENT Date    - O  (#PCDATA)>
<!ELEMENT To      - -  (#PCDATA)>
<!ELEMENT Subject - O  (#PCDATA)>
<!ELEMENT Body    - -  (#PCDATA)>
<!-- End of Email DTD -->
Notes:
• <!-- … -->: comment
• Min: whether the starting and ending tags are compulsory (-) or optional (O)
• ,: concatenation; |: logical or; ?: 0 or 1 occurrence; *: 0 or more occurrences; +: 1 or more occurrences
• PCDATA: ASCII characters; NDATA: binary data; EMPTY
An SGML DI for Email DTD
<!DOCTYPE Email SYSTEM "c:\temp\email.dtd">
<Email>
<From>Joe
<Date>1999-7-14 AM 09:20
<To>Jay</To>
<To>Jennifer</To>
<Subject>Learning XML
<Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>
Notes:
• SYSTEM: the DTD is user defined (vs. PUBLIC)
• The ending tag is optional for From, Date, and Subject
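SGML, DTDs, Document Instances, and Presentation Instances
• SGML: the language in which DTDs are defined (DTD, DTD, …)
• DTD: each DTD governs its document instances (DI, DI, DI, …)
• DI: each document instance can be rendered as a print version, a hypertext version, a Braille version, …
• Output specification: DSSSL (Document Style Semantics and Specification Language), FOSI (Formatting Output Specification Instance)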
Limitations on the Development of SGML
• SGML applications are hard to develop
• SGML documents are hard to distribute on the Web
• Lack of vendor support
HTML (Hypertext Markup Language)
• An application of SGML:
  – HTML 2.0 DTD
  – HTML 3.2 DTD
  – HTML 4.0 DTD
• The current standard data format for authoring Web pages
• Simple and easy to learn
• Portable
• Can combine hyperlinks and multimedia
Note: most HTML instances do not explicitly make reference to the DTD.
Characteristics of HTML
• The HTML DTD is designed mainly to meet the needs of online display
• HTML has built-in styles
• HTML uses SGML's markup minimization feature
• HTML does not adopt SGML's hyperlink mechanism
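Limitations of HTML
• Structural limitations
• Limited information reuse
• Limited data interchange
• Limited automatic document processing
• Cannot support more precise queries
• The HTML extensions released by different vendors are incompatible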
XML (eXtensible Markup Language)
• W3C Recommendation 10-February-1998
  – XML 1.0
• Backed by major vendors: Microsoft, Netscape, Sun, ...
• XML is SGML-- rather than HTML++
• Takes the strengths of SGML and remedies the weaknesses of HTML
  – lets users define their own tags as needed
  – can be delivered over the Web
W3C Data Format
http://www.w3c.org/
The Most Important Features of XML
• Extensibility
  – XML lets users define their own tags as needed.
• Structure
  – XML can describe all kinds of complex document structures.
• Validation
  – XML documents can be structurally validated against a DTD.
XML Standards
• XML-Language: SGML without tears
  – Self-describing documents
  – Well-formed and valid documents
• XML-Link: Power linking
  – simple and extended links
• XML-Style: Separate style from content
  – XSL (Extensible Stylesheet Language)
Status of XML Standardization
• XML 1.0:
  – W3C Recommendation 10-Feb-1998
• XML Namespace:
  – W3C Recommendation 14-Jan-1999
• XLink & XPointer:
  – W3C Working Draft 03-March-1998
• XSL:
  – W3C Working Draft 16-Dec-1998
Well-formed XML Rules
• The document contains one or more elements
• There is exactly one root element
• Neither the start-tag nor the end-tag may be omitted
• All tags must be properly nested (e.g., <B><I>bold and italic</B>italic</I> is not allowed)
• Empty tags must follow the special XML syntax (e.g., <img src="…"/>)
• All attribute values must be enclosed in single or double quotes (e.g., <font size="2">)
• All entities must be declared
Writing Well-Formed XML
• Step 1 : Make an XML Declaration
• Step 2 : Creating a Root Element
• Step 3 : Writing in XML
• Step 4 : Parsing your document
Step 1: Make an XML Declaration
• <?xml version="1.0" standalone="yes"?>
• <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
• <?xml version="1.0" encoding="big5" standalone="yes"?>
(standalone="yes": without DTD)
Step 2: Creating a Root Element
<?xml version="1.0" standalone="yes"?>
<Email>
……
</Email>
Step 3: Writing in XML
<?xml version="1.0" encoding="big5" standalone="yes"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>
Note: end tags cannot be omitted.
Step 4: Parsing your document
• Check that your XML document conforms to the well-formedness rules.
• Use a parser to check well-formedness
  – for example: the XML parser embedded in IE5
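Any conforming XML parser can play the role of the checker. As a sketch, the snippet below uses the parser in Python's standard library rather than IE5: `xml.etree.ElementTree.fromstring` raises `ParseError` on input that is not well-formed. The helper name `is_well_formed` and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if the parser accepts the text, else (False, message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

good = "<Email><From>Joe</From><To>Jay</To></Email>"
bad_nesting = "<B><I>bold and italic</B>italic</I>"    # tags overlap
missing_end = "<Email><From>Joe<To>Jay</To></Email>"   # </From> omitted
unquoted = "<font size=2>text</font>"                  # attribute value not quoted

for doc in (good, bad_nesting, missing_end, unquoted):
    print(is_well_formed(doc))
```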
Viewing a Well-formed XML Document in Explorer 5.0
Viewing an Erroneous XML Document in Explorer 5.0
Writing Valid XML
• Step 1 : Make an XML declaration.
• Step 2 : Designing a DTD.
• Step 3 : Writing Valid XML.
• Step 4 : Parsing your Valid XML document.
Step 1: Make an XML Declaration
• <?xml version="1.0" standalone="no"?>
• <?xml version="1.0" encoding="UTF-8" standalone="no"?>
• <?xml version="1.0" encoding="big5" standalone="no"?>
Step 2 : Designing a DTD
<!-- Elements Content -->
<!-- ----------- ---------------------------------- -->
<!ELEMENT Email (From,Date,To+,Subject,Body?)>
<!ELEMENT From (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT To (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Body (#PCDATA)>
<!-- End of Email DTD -->
Step 3: Writing Valid XML
<?xml version="1.0" encoding="big5" standalone="no"?>
<!DOCTYPE Email SYSTEM "email.dtd">
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>
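To show what validity adds on top of well-formedness, the sketch below checks the children of `Email` against the content model `(From,Date,To+,Subject,Body?)` from the Email DTD. Python's standard library does not validate against DTDs, so this is a hand-rolled stand-in, not a real validating parser: the content model is transcribed by hand into a regular expression, the function name `matches_email_model` is hypothetical, and only the order of the root's direct children is checked.

```python
import re
import xml.etree.ElementTree as ET

# Content model (From,Date,To+,Subject,Body?) rewritten as a regular expression
# over the sequence of child tag names, each followed by a comma.
EMAIL_MODEL = re.compile(r"From,Date,(To,)+Subject,(Body,)?")

def matches_email_model(xml_text):
    """True if the root is Email and its children follow the content model."""
    root = ET.fromstring(xml_text)
    if root.tag != "Email":
        return False
    child_sequence = "".join(child.tag + "," for child in root)
    return EMAIL_MODEL.fullmatch(child_sequence) is not None

valid = ("<Email><From>Joe</From><Date>1999-7-14</Date><To>Jay</To>"
         "<To>Jennifer</To><Subject>Learning XML</Subject></Email>")
no_recipient = ("<Email><From>Joe</From><Date>1999-7-14</Date>"
                "<Subject>Learning XML</Subject></Email>")

print(matches_email_model(valid))         # To occurs twice, Body is omitted
print(matches_email_model(no_recipient))  # To+ requires at least one To
```

Both sample documents are well-formed; only the first one also respects the structure that the DTD prescribes.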
XML Simple Link
XML Extended linking: multiple ends
XML Extended linking: addressing by structure
XML Extended linking
XSL: XML counterpart of CSS (Cascading Style Sheets)
• Sample: email.css
Email, From, Date, To, Subject, Body {
  display: block;
  margin-left: 5%;
  margin-right: 5%;
  border-style: groove;
}
XML document with Style
<?xml version="1.0" encoding="big5" standalone="no"?>
<?xml-stylesheet href="email.css" type="text/css"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML 將在 Web 上大放異彩,趕快學喔! …</Body>
</Email>
Viewing an XML Document Combined with CSS in Explorer 5.0
Applications of XML
• Database interchange
• Client-side processing
• User views of the data
• Information filtering
Multimedia
• media
  – text, sound, images, video
• issues
  – volume, format, processing requirements
Formats
• image
  – bit-mapped/pixel-based display
    • the simplest format
    • XBM, BMP, PCX
    • disadvantage: redundancy
  – compression
    • CompuServe's Graphics Interchange Format (GIF)
  – lossy compression
    • Joint Photographic Experts Group (JPEG)
  – exchange
    • Tagged Image File Format (TIFF)
Formats (Continued)
• audio
  – AU, MIDI, WAVE
• video
  – MPEG, AVI, QuickTime
Textual Images
• definition
  – images of documents that contain mainly typed or typeset text
  – obtained by OCR
• image retrieval
  – Alternative 1
    • At creation time, a set of keywords (called metadata) is associated with each image
    • Conventional text retrieval techniques can be applied to the keywords
Textual Images (Continued)
  – Alternative 2
    • Use OCR to extract the text of the image
    • The resultant ASCII text can be used to extract keywords
  – Alternative 3
    • Use the symbols extracted from the images as basic units to combine image retrieval techniques with sequence retrieval techniques
Taxonomy of Web languages
Related Resources
• HTML-4: http://www.w3.org/TR/REC-html40
• W3C: http://www.w3c.org/
• OCLC: http://purl.oclc.org/
• XML: http://www.xml.org/
• XML Parser: http://xdev.datachannel.com/
• DDML (Document Definition Markup Language): http://www.w3.org/TR/NOTE-ddml
• Xschema: http://purl.oclc.org/NET/xschema