Representation of Web Data in a Web Warehouse Ragini A.S. & Shipra Dutta November 20th, 2001




Representation of Web Data in a Web Warehouse

Ragini A.S. &

Shipra Dutta

November 20th, 2001


Overview

• Need for a web warehouse
• WHOM: data model for WHOWEDA
• Concept of node and link
• Model for representing metadata, structure, and content of web documents and hyperlinks
• NMT & LMT
• Modeling of structural and textual content of web documents
• NDT & LDT
• Advantages
• Disadvantages
• Conclusion & future work


Need of a Web Warehouse

• Rapid growth of the WWW, which is a distributed global information resource.
• Applications must be able to harness and analyze web data.
• Emergence of mobile users.
• Necessity to exploit historical web data.
• Traditional information retrieval techniques and search engines are not satisfactory.


Data Model of WHOWEDA: WHOM (Warehouse Object Model)

• Consists of two components: (1) a set of web objects and (2) a set of web operators
• Centered on the notion of a web table, which is a set of web tuples
• A web tuple is a set of directed graphs, each consisting of a set of nodes and links, and satisfies a web schema
• A set of operators, such as global web coupling, web join, and web select, is used to manipulate the web data


Metadata associated with HTML & XML documents (web documents)

HTML or XML documents may have metadata such as:

• URL
• Format, size (in bytes), date of last modification
• Information about the author

A hyperlink in a web document may have metadata such as:

• Source URL
• Target URL
• Type of hyperlink (interior, local, or global)


Node Meta Data Attribute

• Represented using the data type node metadata-attribute
• A meta-attribute may be either atomic or complex
• Example of a complex attribute: URL, which can be decomposed into server, port, protocol, path, and filename
• Example of an atomic attribute: size, as it cannot be decomposed further
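
The atomic/complex distinction can be pictured with a small type sketch. This is only an illustration under assumptions: the class name `MetaAttribute` and its `components` field are hypothetical and do not come from the WHOM definition.

```python
from dataclasses import dataclass, field


@dataclass
class MetaAttribute:
    """Hypothetical model of a node metadata attribute."""
    name: str
    # Sub-attributes; an empty list means the attribute is atomic.
    components: list["MetaAttribute"] = field(default_factory=list)

    @property
    def is_atomic(self) -> bool:
        return not self.components


# URL is complex: it decomposes into the parts listed on the slide.
url = MetaAttribute(
    "URL",
    [MetaAttribute(part) for part in ("server", "port", "protocol", "path", "filename")],
)
# Size is atomic: it has no sub-attributes.
size = MetaAttribute("size")
```

Here `url.is_atomic` is false and `size.is_atomic` is true, mirroring the two examples on the slide.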


Set of node meta data attributes

Figure 1


Node Meta data Tree(NMT)

• Representation of an instance of a node meta-attribute
• Internal vertices of the tree are meta-attribute names
• Leaf vertices of the tree are values of the metadata attributes


Example


Example of NMT

Consider:

URL: http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm

Last modification date: Thursday, 15th July, 1999, 04:50:53

Size: 10761K

The attribute URL has the following attribute/value pairs:

(Protocol, "http"), (Server, "www.ninds.nih.gov"), (Path, "patients/Disorder/Alexander"), and (Filename, "Alexander.htm")

The attribute Server has the following sub-attribute/value pairs:

(Name, "www.ninds.nih"), (Domain name, "gov")
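
As a rough sketch, the decomposition in this example can be reproduced with Python's standard `urllib.parse`. The function name `build_nmt` and the nested-dictionary layout are assumptions made for illustration: dictionary keys stand for internal vertices (meta-attribute names) and non-dictionary values stand for leaf vertices.

```python
import posixpath
from urllib.parse import urlsplit


def build_nmt(url, last_modified, size):
    """Build a nested dict standing in for a node metadata tree (hypothetical layout)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Split "www.ninds.nih.gov" into ("www.ninds.nih", "gov") at the last dot.
    name, _, domain = host.rpartition(".")
    directory, filename = posixpath.split(parts.path)
    return {
        "URL": {
            "Protocol": parts.scheme,
            "Server": {"Name": name, "Domain name": domain},
            "Path": directory.strip("/"),
            "Filename": filename,
        },
        "Last modification date": last_modified,
        "Size": size,
    }


def leaves(tree, path=()):
    """Yield (path of attribute names, leaf value) pairs."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from leaves(value, path + (key,))
        else:
            yield path + (key,), value


nmt = build_nmt(
    "http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm",
    "Thursday, 15th July, 1999, 04:50:53",
    "10761K",
)
```

Iterating over `leaves(nmt)` lists each leaf value together with the chain of meta-attribute names above it, which is one way to read the tree.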


Node Meta Data Tree

Figure 2


Link Meta Data Attribute

• Represented using the data type link metadata-attribute
• Each attribute can be either atomic or complex
• Example of a complex attribute: source URL or target URL, which can be decomposed into server, port, protocol, path, and filename
• Example of an atomic attribute: link type (local, global, or interior)
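
The slides name the three link types without defining them. The sketch below assumes one common reading: an interior link stays within the same document, a local link points to another document on the same server, and a global link points to a different server. The function name `link_type` is hypothetical.

```python
from urllib.parse import urldefrag, urlsplit


def link_type(source_url, target_url):
    """Classify a hyperlink under the assumed interior/local/global definitions."""
    # Drop "#fragment" so that links within one document compare equal.
    source_doc, _ = urldefrag(source_url)
    target_doc, _ = urldefrag(target_url)
    if source_doc == target_doc:
        return "interior"
    if urlsplit(source_doc).netloc.lower() == urlsplit(target_doc).netloc.lower():
        return "local"
    return "global"


def link_metadata(source_url, target_url):
    """Collect the three link metadata attributes listed on the earlier slide."""
    return {
        "Source URL": source_url,
        "Target URL": target_url,
        "Link type": link_type(source_url, target_url),
    }
```

Source URL and Target URL could each be decomposed further in the same way as a node's URL; Link type stays a single atomic value.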


Set of link meta data attributes

Figure 3


Link metadata tree (LMT)

• Representation of an instance of a link metadata-attribute
• Corresponds to the link metadata attribute/value pairs of hyperlinks
• Internal vertices of the tree are meta-attribute names of hyperlinks
• Leaf vertices of the tree are values of the metadata attributes


Figure 4


Example of LMT

Figure 5


Issues for modeling Structure & Content

• Web data embedded within an HTML or XML document should be written in compliance with the HTML and XML specifications, respectively
• Modeling tags and tagless data
• Modeling hierarchical structure
• Attribute/value pairs associated with tags
• Order of text
• Location information of a portion of tagless data


Components of Node structural attribute

• Name

• Attribute_list

• Content

• Identifier

• Location_attribute


Node Data Tree(NDT)

• Represents the structure and content of a web page

• Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is the NDT.
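
One possible way to picture node structural objects and the tree they form is sketched below. The five fields follow the component names used in this deck; the `children` list, the integer type of `location_attribute`, and the helper `preorder` are assumptions for illustration, since the dependency constraints themselves are not spelled out here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NodeStructuralObject:
    """Hypothetical instance of a node structural attribute."""
    name: str                      # tag name
    identifier: int                # unique within the document
    attribute_list: dict = field(default_factory=dict)
    content: str = ""              # tagless (textual) data
    # Assumption: position of the text among its siblings.
    location_attribute: Optional[int] = None
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child


def preorder(node, depth=0):
    """Walk the rooted, directed tree from the root downwards."""
    yield depth, node
    for child in node.children:
        yield from preorder(child, depth + 1)


# Made-up three-element document: html > body > p
root = NodeStructuralObject(name="html", identifier=1)
body = root.add(NodeStructuralObject(name="body", identifier=2))
body.add(NodeStructuralObject(name="p", identifier=3, content="Some text", location_attribute=0))
outline = [(depth, node.name) for depth, node in preorder(root)]
```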


Figure 6


Figure 6


Features Of NDT

• Rooted, directed tree

• Loss of structural information

• No loss of content data

• Exclusion of anchor tags


Components of NDT

• Name

• Attribute_list

• Identifier

• Content

• Location_attribute


Definition of Dependency Constraints


NDT for HTML Documents

Classification of HTML tags:

• Non-noisy tags: HTML tags which are considered in the node data tree.

• Noisy tags: HTML tags which are ignored while mapping an HTML document to an NDT. The three types of noisy tags that our model considers are:

1. Tags used for formatting purposes

2. Tags used to represent a hyperlink

3. Tags with specification of executable content
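
The noisy/non-noisy split can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the authors' mapping procedure: the members of the three tag sets are assumptions (the slides name the categories only), and the choices to keep text inside formatting and hyperlink tags while dropping text inside executable tags are also assumptions.

```python
from html.parser import HTMLParser

# Illustrative tag sets; the slides give the categories, not their members.
FORMATTING_TAGS = {"b", "i", "u", "font", "center", "br", "hr"}
HYPERLINK_TAGS = {"a"}
EXECUTABLE_TAGS = {"script", "applet", "object"}
NOISY_TAGS = FORMATTING_TAGS | HYPERLINK_TAGS | EXECUTABLE_TAGS
VOID_TAGS = {"img", "meta", "link", "input"}  # elements that never have an end tag


def new_node(name, attrs=()):
    return {"name": name, "attribute_list": dict(attrs), "content": [], "children": []}


class NDTBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = new_node("#document")
        self.stack = [self.root]
        self.executable_depth = 0  # > 0 while inside an executable tag

    def handle_starttag(self, tag, attrs):
        if tag in EXECUTABLE_TAGS:
            self.executable_depth += 1
        elif tag not in NOISY_TAGS and not self.executable_depth:
            node = new_node(tag, attrs)
            self.stack[-1]["children"].append(node)
            if tag not in VOID_TAGS:
                self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in EXECUTABLE_TAGS:
            self.executable_depth = max(0, self.executable_depth - 1)
        elif tag not in NOISY_TAGS and not self.executable_depth:
            # Close the nearest open element with this name, if any.
            for index in range(len(self.stack) - 1, 0, -1):
                if self.stack[index]["name"] == tag:
                    del self.stack[index:]
                    break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.executable_depth:
            # Text inside skipped tags lands on the enclosing non-noisy element.
            self.stack[-1]["content"].append(text)


sample = ('<html><body><p>See <b>all</b> <a href="x.html">monographs</a></p>'
          '<script>var x = 1;</script></body></html>')
builder = NDTBuilder()
builder.feed(sample)
builder.close()
```

In this sketch the `<b>` and `<a>` tags leave no vertex in the tree, but their text is kept under the enclosing `<p>`, so content survives even though those tags are ignored.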


Representation of non-noisy tags in NDT

Classified into three types:

• Type 1 tags
• Type 2 tags
• Type 3 tags


Noisy and Non-noisy tag attributes

Noisy attributes: attributes which are ignored while generating an NDT from an HTML document.

The three types of noisy attributes are:

• Attributes used for formatting purposes
• Attributes used to represent the behavior of the web document
• Attributes which specify executable content

Non-noisy attributes: attributes considered important in the context of modeling an HTML document; these are represented in the NDT.


Representation Of Content and Structure in XML Document.

• XML documents are mapped into a node data tree.

• Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is the NDT.


Issues related to generating NDT from XML Documents.

• XML documents do not have a fixed set of tags and attributes like HTML.
• The tags and attributes are defined by the user.
• XML does not encounter the problem of elements with no end tags and elements whose tags may be omitted.
• Thus there is no need to address the issues related to type 2 and type 3 tags while generating NDTs from XML documents.
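
Because well-formed XML always has matching end tags, a direct recursive mapping is enough for a sketch. The example below uses the standard `xml.etree.ElementTree` module; the function name `xml_to_ndt`, the dictionary layout, the pre-order numbering of identifiers, and the sample document are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import count


def xml_to_ndt(xml_text):
    """Map an XML document to a nested-dict stand-in for a node data tree."""
    ids = count(1)

    def convert(element):
        node = {
            "identifier": next(ids),  # assigned in document (pre-order) order
            "name": element.tag,
            "attribute_list": dict(element.attrib),
            "content": [],
            "children": [],
        }
        if element.text and element.text.strip():
            node["content"].append(element.text.strip())
        for child in element:
            node["children"].append(convert(child))
            # Text after a child's end tag belongs to the parent element.
            if child.tail and child.tail.strip():
                node["content"].append(child.tail.strip())
        return node

    return convert(ET.fromstring(xml_text))


# Made-up document with user-defined tags.
ndt = xml_to_ndt('<catalog><item id="1">first</item><item id="2">second</item>trailing</catalog>')
```

No tag classification step is needed here, which reflects the point that the type 2 and type 3 issues do not arise for XML.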


Example of NDT generated from XML Document.


Representing Structure and Contents of hyperlinks.

• A hyperlink is an explicit relationship between two or more data objects or portions of data objects.

• A hyperlink is defined by the data type Link type.

• A Link type consists of three components: a set of meta data attributes, a set of link structural attributes and a reference identifier.

• Link structural attributes are used to express the structure and content of hyperlinks, and the reference identifier is used to specify the location of hyperlinks in web documents.


Issues for modeling hyperlinks

• The <a> tag and the attributes href or name are used to specify hyperlinks in HTML documents. XML links are specified by the use of an attribute named xml:link. Possible values are simple and extended, as well as locator, group, and document.

• When authors add a hyperlink to a document D, they include a description of the document in addition to its URL; both are important.

• The location of hyperlinks is important, as we may need to impose constraints in a query to follow only those links which are located in a particular portion of a web page.


Link Structural attributes.

Similar to a node structural attribute, a link structural attribute consists of three components:

• Name: corresponds to the start tag of the HTML or XML link.

• Attribute_list: a finite, possibly empty set of attributes associated with the tag.

• Content: the text between the start and end tags.


Reference Identifier

It is a unique identifier that references an identifier in node structural attribute. For example consider the web page in


Node Data tree for XML Document.


Link Data Tree

• It can be represented by a set of instances of link structural attributes.

• HTML or XML documents are mapped into instances of link structural attributes called link structural objects; these objects and the dependency constraints can be visualized as a rooted, directed tree called a link data tree.

• The internal vertices represent tagged elements containing tag names and a list of attribute/value pairs; the leaf vertices represent the label of the link.


Link Data Tree for HTML Documents

• The <a> tag marks a block of an HTML document as a hypertext link.

• <a> can take several attributes like href or name which specify the destination of hypertext link or indicate that the marked text can be the target of a hypertext link.


Example of LDT for HTML Document

• Consider the code snippet:

<a href="http://www.rxlist.com/cgi/generic/index.html">all RxList monographs (Nearly 300 of them)</a>
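
A rough sketch of how such an anchor could be turned into a link structural object, using Python's standard `html.parser`. The class name `LDTBuilder` and the dictionary layout are hypothetical, and the href value is quoted here so that the snippet is well-formed HTML.

```python
from html.parser import HTMLParser


class LDTBuilder(HTMLParser):
    """Collect one link structural object per <a> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.current = None  # the <a> element being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current = {"name": "a", "attribute_list": dict(attrs),
                            "children": [], "label": []}
            self.links.append(self.current)
        elif self.current is not None:
            # Elements nested inside the link, e.g. an <img>.
            self.current["children"].append({"name": tag, "attribute_list": dict(attrs)})

    def handle_endtag(self, tag):
        if tag == "a":
            self.current = None

    def handle_data(self, data):
        if self.current is not None and data.strip():
            # Normalize whitespace in the link label (the leaf vertex).
            self.current["label"].append(" ".join(data.split()))


builder = LDTBuilder()
builder.feed('<a href="http://www.rxlist.com/cgi/generic/index.html">'
             'all RxList monographs (Nearly 300 of them)</a>')
builder.close()
link = builder.links[0]
```

The resulting object has the tag name and its attribute/value pairs as the internal vertex and the link text as the leaf; a link wrapping an image would additionally get an `img` entry under `children`.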


The Link data tree of the web page


Illustration of the LDT of hyperlinks which contain an image.


Link Data Tree for XML Documents

• There are no fixed link tags to express links in XML data, so an element is an XML link if either it has an xml:link attribute or the element and all of its attributes and content adhere to the syntactic requirements.

• Two types of links are to be considered: simple and extended links.

• A simple link, when mapped to an LDT, is always a linear tree.


Example of Extended XML Link


Noisy Tags and Attributes

• XML tags are user defined and are not used for formatting purposes. Thus, there are no noisy tags to be ignored while generating LDTs.

• The attributes which specify link behavior such as show and actuate are however ignored.
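
A minimal sketch of these two points, using the standard `xml.etree.ElementTree` module: elements are recognized as links by their xml:link attribute, and the behavior attributes show and actuate are dropped. The function names, the dictionary layout, and the sample document are assumptions. Note that ElementTree reports attributes with the reserved `xml` prefix under the XML namespace URI.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_LINK = f"{{{XML_NS}}}link"      # how ElementTree names the xml:link attribute
BEHAVIOR_ATTRIBUTES = {"show", "actuate"}  # noisy: describe link behavior only


def to_ldt(element):
    """Convert a link element (and any nested elements) to a nested dict."""
    attributes = {}
    for key, value in element.attrib.items():
        if key in BEHAVIOR_ATTRIBUTES:
            continue
        attributes["xml:link" if key == XML_LINK else key] = value
    return {
        "name": element.tag,
        "attribute_list": attributes,
        "label": (element.text or "").strip(),
        "children": [to_ldt(child) for child in element],
    }


def link_trees(xml_text):
    """Return one tree per simple or extended link found in the document."""
    root = ET.fromstring(xml_text)
    return [to_ldt(element) for element in root.iter()
            if element.get(XML_LINK) in ("simple", "extended")]


# Made-up document containing one simple link.
trees = link_trees(
    '<doc><ref xml:link="simple" href="http://example.org/a.xml" '
    'show="replace" actuate="user">A</ref></doc>'
)
```

The simple link here has no child elements, so its tree is a single chain from the tag vertex to its label; an extended link would carry its nested elements under `children`.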


Advantages

• It provides location independent information to the mobile users.

• It is used in building web data repository that supports historical web data.

• It is an effective and efficient information retrieval technique.


Disadvantages

• Information retrieval is complex.

• It does not handle the executable contents of the web documents.

• Attributes used to represent behavior of web document are not considered in this model.


Conclusions

• This model logically separates the hyperlinks from the web documents

• Aids in the representation of metadata, contents and structure of HTML and XML data as a tree-like structure.