25
Conceptual and Systematic Design Approach for XML Document Warehouses Vicky Nassis, La Trobe University, Australia Rajugan Rajagopalapillai, University of Technology, Australia Tharam S. Dillon, University of Technology, Australia Wenny Rahayu, La Trobe University, Australia ABSTRACT EXtensible Markup Language (XML) has emerged as the dominant standard in describing and exchanging data among heterogeneous data sources. The increasing presence of large volumes of data appearing creates the need to investigate XML Document Warehouses as a means of handling the data. In this paper our focus is twofold. First we utilise Object Oriented (OO) concepts to develop and propose a conceptual design formalism to build meaningful XML Document Warehouses (XDW). This includes: (1) XML (warehouse) repository (xFACT) using OO concepts followed by the transformation of XML Schema constructs and (2) Conceptual Virtual Dimensions (VDims) using Conceptual views (Rajugan, Chang, Dillon, & Feng, 2003, 2004). Secondly we address several important outstanding issues related to our proposed design of an XML Document Warehouse. Specifically we note that the xFACT portion is now a complex structure, involving several entities and relationships as opposed to being a simple FACT table as was the case in relational data warehouses, and the notion of Virtual Dimensions (VDims) has considerably greater complexity. Keywords: conceptual design; data modeling; data warehouse; database conceptual design; database design; database logical design; database requirements analysis; database views; information engineering; information requirements analysis; structural modeling; UML; XML INTRODUCTION Data Warehousing (DW) has been an approach adopted for handling large volumes of historical data for detailed analysis and management support. Trans- actional data in different databases is cleaned, aligned and combined to produce data warehouses. Since its introduction in 1996, eXtensible Markup Language (XML) has become the defacto standard for stor- ing and manipulating self-describing infor- mation, which creates vocabularies in as- sisting information exchange between het- erogenous data sources over the Web (Pokorny, 2002). Amongst the purposes of This paper appears in the journal International Journal of Data Warehousing and Mining edited by David Taniar. Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission 701 E. Chocolate Avenue, Suite 200, Hershey PA 17033-1240, USA Tel: 717/533-8845; Fax 717/533-8661; URL-http://www.idea-group.com ITJ2884 IDEA GROUP PUBLISHING

Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 63

Conceptual and SystematicDesign Approach for

XML Document WarehousesVicky Nassis, La Trobe University, Australia

Rajugan Rajagopalapillai, University of Technology, AustraliaTharam S. Dillon, University of Technology, Australia

Wenny Rahayu, La Trobe University, Australia

ABSTRACT

EXtensible Markup Language (XML) has emerged as the dominant standard in describing andexchanging data among heterogeneous data sources. The increasing presence of large volumesof data appearing creates the need to investigate XML Document Warehouses as a means ofhandling the data. In this paper our focus is twofold. First we utilise Object Oriented (OO)concepts to develop and propose a conceptual design formalism to build meaningful XMLDocument Warehouses (XDW). This includes: (1) XML (warehouse) repository (xFACT) usingOO concepts followed by the transformation of XML Schema constructs and (2) ConceptualVirtual Dimensions (VDims) using Conceptual views (Rajugan, Chang, Dillon, & Feng, 2003,2004). Secondly we address several important outstanding issues related to our proposeddesign of an XML Document Warehouse. Specifically we note that the xFACT portion is now acomplex structure, involving several entities and relationships as opposed to being a simpleFACT table as was the case in relational data warehouses, and the notion of Virtual Dimensions(VDims) has considerably greater complexity.

Keywords: conceptual design; data modeling; data warehouse; database conceptual design;database design; database logical design; database requirements analysis;database views; information engineering; information requirements analysis;structural modeling; UML; XML

INTRODUCTION

Data Warehousing (DW) has beenan approach adopted for handling largevolumes of historical data for detailedanalysis and management support. Trans-actional data in different databases iscleaned, aligned and combined to produce

data warehouses. Since its introduction in1996, eXtensible Markup Language (XML)has become the defacto standard for stor-ing and manipulating self-describing infor-mation, which creates vocabularies in as-sisting information exchange between het-erogenous data sources over the Web(Pokorny, 2002). Amongst the purposes of

This paper appears in the journal International Journal of Data Warehousing and Mining edited by David Taniar.Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission

701 E. Chocolate Avenue, Suite 200, Hershey PA 17033-1240, USATel: 717/533-8845; Fax 717/533-8661; URL-http://www.idea-group.com

�������

IDEA GROUP PUBLISHING

Page 2: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

64 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

XML is to carry out electronic documenthandling, electronic storage, retrieval andexchange. It is envisaged that XML will alsobe used for logically encoding documentsfor many domains. Hence, it is likely that alarge number of XML documents will popu-late the would-be repository and includeseveral disparate transactional databases.

The need for managing large amountsof XML document data raises the neces-sity to explore the data warehouse ap-proach through the use of XML documentmarts and XML document warehouses.

Since the introduction of dimensionalmodeling (which revolves around facts anddimensions), several design techniqueshave been proposed to capture multidimen-sional data (MD) at the conceptual level.Ralph Kimball’s Star Schema (Kimball &Ross, 2002) proved very popular, fromwhich the well-known conceptual modelsSnowFlake and StarFlake were derived.More recent comprehensive data ware-house design models are built using Ob-ject-Oriented concepts on the foundationsof the Star Schema. In Trujillo, Palomar,Gomez and Song (2001), Lujan-Mora,Trujillo, and Song (2002), and Abello, Samos,and Saltor (2001), two different OO mod-eling approaches are demonstrated wherea data cube is transformed into an OOmodel integrating class hierarchies. TheObject-Relational Star Schema (O-R Star)model (Rahayu, Dillon, Mohammad, &Taniar, 2001) aims to envisage data mod-els and their object features, focusing onhierarchical dimension presentation, differ-entiation and their different sorts of em-bedded hierarchies.

These models, both object and rela-tional, have a number of drawbacks if onewishes to use them for XML documentwarehouses, namely: (1) data-orientedwithout sufficient emphasis or capturinguser requirements, (2) extensions of seman-

tically poor relational models (star, snow-flake models), (3) original conceptual se-mantics are lost before building data ware-houses as the operational data source isrelational, (4) further loss of semantics re-sults from oversimplified dimensional mod-eling, (5) time consuming if additional datasemantics are required to satisfy evolvinguser requirements, and (6) complex querydesign and processing is needed, thereforemaintenance is troublesome. Traditionaldesign models lack the ability to utilise orrepresent XML design level constructs ina well-defined abstract and implementation-independent form.

One of the early XML data ware-house implementations includes the XylemeProject (Xyleme, 2001). The Xylemeproject was successful and it was madeinto a commercial product in 2002. It has awell-defined implementation architectureand proven techniques to collect andarchive Web XML documents into an XMLwarehouse for further analysis. Anotherapproach by Fankhauser and Klements(2003) explores some of the changes andchallenges of a document centric XMLwarehouse. Coupling these approacheswith a well defined conceptual and logicaldesign methodology will help future designof such XML warehouse for large-scaleXML systems. In this article we are con-cerned with the design of a data warehouserather than the implementation of the XMLdocument warehouse. We propose a con-ceptual modelling approach for the devel-opment of an XML Document Warehouse(XDW), while emphasising the design tech-niques to build the XDW conceptual modelin UML. We also carry out a systematictransformation of this conceptual model intoan XML Schema. An in-depth analysis forthe design of the xFACT and the VirtualDimension structures is illustrated using awalk-through with a case study example.

Page 3: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 65

Our WorkUML, a widely adopted standard for

Object-Oriented (OO) conceptual models,is the foundation to build this conceptualmodel for XML document warehousing.The main aspects of our methodology areas follows: (1) User requirements (UR;assist in determining different perspectivesof the document warehouse), (2) XMLDocument structure (W3C Consortium,2000): Using XML document capability inaccommodating and explicitly describingheterogeneous data along with their inter-relationships semantics, unlike flat-relationaldata), (3) XML Schema: Describes, vali-dates, and provides semantics for its cor-responding instance document; XML Docu-ment (W3C Consortium, 2001). Also, it cancapture all OO concepts and relationships(Feng, Dillon, & Chang, 2002, 2003) as wellas intuitive XML specific constructs suchas ordering of components (see sectionXML Schema and OO Concepts), and (4)Conceptual Views (Rajugan et al., 2003,2004; describes how a collection of XMLtags contained in an XML document re-lates to the direct utilisation by a domainuser at the conceptual/abstract level.”

The rest of the article is organised asfollows. In the section XML Schema andOO Concepts, we outline some uniqueXML Schema concepts that must be cap-tured and modeled using the UML docu-ment data model. The section of the XMLDocument Warehouse Model is dedicatedto in-depth discussion of the XDW con-ceptual model (with examples). The sec-tion of the case study example, Confer-ence Publication System (CPSys), pro-vides a brief description of the examplecase study used in this paper, of which wewill be gradually building the XML ware-house model (using UML). This is followedby a discussion, conclusion and future work.

XML Schema and OO ConceptsA widely used approach for concep-

tual modeling for developing XML Schemasfor XML Data is the OO conceptual model,frequently expressed in UML (Feng et al.,2003). In order to understand the seman-tics that must be captured in abstract mod-els of XML Data Warehouses, two signifi-cant issues must be noted, namely: orderedcomposition and homogeneous composi-tion. An illustration of each case is providedin the sections that follow using the pro-posed mapping techniques in Feng et al.(2003) for the transformation of the OOdiagrams to XML Schema. The examplesare extracted from the complete UML dia-gram in Figure 10.

Ordered CompositionConsider the XML Schema fragment

in Box A, modeled applying this UML nota-tion. The composite element Publ_Papersis an aggregation of the sub-elements:Publ_Title, Publ_Abstract, andPubl_References. Interpreting this XMLSchema fragment we observe that the tag<xs:sequence> signifies that the em-bedded elements are not only a simple as-sortment of components but these have aspecific ordering. Hence we add to UMLan annotation that allows capturing of theordered composition as shown in Figure 1,utilising stereotypes to specify the objects’order of occurrence such as <<1>>, <<2>>,<<3>>,…,<<n>>. Figure 1(b) shows theXML Schema segment above modeledapplying this UML notation.

Once we have the XML Schemacomponent, we can generate the corre-sponding XML instance document (see BoxB). This is used to check and verify thevalidity of the XML Schema document.

Page 4: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

66 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

Homogeneous CompositionIn a homogeneous aggregation, one

“whole” object consists of “part” objects,which are of the same type (Rahayu &Chang, 2002). Two important cases are

considered when applied to the homogenouscomposition, which are as follows:

Case of One-to-Many RelationshipConsider the XML Schema segment

in Box C of the relationship between the

Figures 1(a) and Figure 1(b). Ordered composition example

Box A. XML Schema fragment

<xs:element name="Publ_Papers" type=" Publ_PaperType" maxOccurs="unbounded"/> <xs:complexType name="Publ_PaperType"> <xs:sequence> <xs:element name="Paper_ID" type="xs:ID"/> <xs:element name="Paper_Desc" type="xs:string"/> <xs:element name="Paper_Type" type="xs:string"/> <xs:sequence> <xs:element name="Publ_Abstract" type="Publ_AbstractType"/> <xs:element name="Publ_Content" type="Publ_ContentType"/>

<xs:element name="Publ_References" type="Publ_ReferenceType"/> <xs:sequence> </xs:sequence> </xs:complexType>

Component A<<1>>

Component B<<2>>

Component C<<3>>

Composite Object

1 1 1

a. UML Stereotype for an Ordered Composition

Publ_Abst ract

Abst_P aperIDAbst_Contents

<< 1>>Publ_Content

Con t_Paper IDCon t_NoO fPag esCon t_W ordsCount

<< 2>>Publ_Refere nces

Reference_IDReference_DetailsTotal_References

<< 3>>

Publ_Papers

Paper_IDPaper_DescPaper_Length

11 11 11

b. The XML Schema Fragment for an Ordered Composition shown in UML

Box B. XML instance document

<Publ_Papers> <Paper_ID> 131 </Paper_ID> <Paper_Desc> Conceptual Design for XML Document Warehouses </Paper_Desc> <Paper_Type> Long Journal Paper </Paper_Type>

<Publ_Abstract> <Abst_PaperID> 131 </Abst_PaperID> <Abst_Contents> This is the abstract of the paper </Abst_Contents>

</Publ_Abstract> <Publ_Content>

<Cont_PaperID> 131 </Cont_PaperID> <Cont_NoOfPages> 25 </Cont_NoOfPages> <Cont_WordsCount> 6,900 </Cont_WordCount>

</Publ_Content> <Publ_References>

<Reference_ID> 1 </Reference_ID> <Reference_Details> These are the details of reference 1 </Reference_Details> <Total_References> 30 </Total References>

</Publ_References> </Publ_Papers>

Page 5: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 67

two elements Publ_Chapter andPubl_Section.

The object Publ_Chapter canconsist of one or more Publ_Sections.We specify the maxOccurs value ofPubl_Section to “unbounded” (defaultvalue is “1”), enabling the elementPubl_Section to occur from one tomany times within Publ_Chapter. Weconsider the assumption that a given sec-tion cannot be contained in any other chap-ter from the papers submitted. This char-acteristic is shown in Figure 2(a) using theproposed UML notation while Figure 2(b)shows the model corresponding to the XMLSchema fragment in Box D.

In Box D is the equivalent XML docu-ment for the homogeneous compositionwith the one-to-many relationship.

Case of Many-to-Many RelationshipConsider the XML Schema segment

in Box E of the relationship betweenPubl_Year and Publ_Month ele-ments.

In this category a composite object(Publ_Year) may have many compo-nents (Publ_Months) and each compo-nent may belong to many composite ob-jects. The maxOccurs value set to “un-bounded” in both elements forms the many-

Figure 2(a) and Figure 2(b). Homogeneous composition (Case 1)

Box C. XML Schema segment

<xs:element name="Publ_Chapter" type="Publ_ChapterType" maxOccurs="unbounded"/> <xs:complexType name=" Publ_ChapterType"> <xs:sequence> <xs:element name="Ch_ID" type="xs:ID"/> . . . . . . </xs:sequence> <xs:element name="Publ_Section" type="Publ_SectionType

maxOccurs="unbounded"/> </xs:sequence> </xs:sequence> </xs:complexType> <xs:complexType name="Publ_SectionType"> <xs:sequence> . . . . . . <xs:sequence> <xs:element name="Section_SubSection" type="Publ_SectionType" maxOccurs="unbounded"/> </xs:sequence> </xs:sequence> </xs:complexType>

Component A

Com posi te Object

1...n

a. UML Notation for a Homogeneous Composition featuring One to Many relationship

P u b l _ C h a p t e r

C h _ I DC h _ N oC h _ T i t l eC h _ K e y w o r d s

P u b l _ S e c t i o nS e c t i o n _ I DS e c t i o n _ T i t l eS e c t i o n _ C o n t e n t sS e c t i o n _ F i g u r e sS e c t i o n _ T a b l e s

1 . . n1 . . n

b. Homogeneous Composition example

Page 6: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

68 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

to-many relationship between Publ_Yearand Publ_Month elements. The UMLnotation illustrating this case is shown inFigure 3(a) while Figure 3(b) shows themodel corresponding to the XML Schemafragment in Box F.

Box D. XML document

<Publ_Chapter> <Ch_ID> 1A </Ch_ID> <Ch_No> 1 </Ch_No> <Ch_Title> Introduction </Ch_Title> <Ch_Keywords> Design, Conceptual, Data Warehouse, XML </Ch_Keywords>

<Publ_Section> <Section_ID> 1.1 </Section_ID> <Section_Title> This is the title of section 1.1 </Section_Title> <Section_Contents> This is the content of section 1.1 </Section_Contents> <Section_Figures> Figure 1.1a, Figure 1.1b </Section_Figures> <Section_Tables> Table 1.1a, Table 1.1b </Section_Tables>

</Publ_Section> <Publ_Section>

<Section_ID> 1.2 </Section_ID> <Section_Title> This is the title of section 1.2 </Section_Title> <Section_Contents> This is the content of section 1.2 </Section_Contents> <Section_Figures> Figure 1.2a, Figure 1.2b </Section_Figures> <Section_Tables> Table 1.2a, Table 1.2b </Section_Tables>

</Publ_Section> </Publ_Chapter>

Box E. XML Schema statement

<xs:element name="Publ_Year" type="Publ_YearType" maxOccurs="unbounded"/> <xs:complexType name="Publ_YearType"> <xs:sequence> <xs:element name="Year_ID" type="xs:IDREF"/> . . . <xs:sequence> <xs:element name="Publ_Month" type="Publ_MonthType" maxOccurs="unbounded"/> </xs:sequence> </xs:sequence> </xs:complexType>

In Box F is the XML document forthe second case of homogeneous com-position involving a many-to-many rela-tionship.

Figure 3(a) and Figure 3(b). Homogeneous composition (Case 2)

Component A

Composite Object

1...n

1...n

a. UML Notation for a Homogeneous Composition featuring Many to Many relationship

P u b l _ Y e a r

Y e a r _ IDY r _ Y e a rY r _ S p e c i a l E v e n t s

1 . . n

P u b l _ M o n t h

M o n t h _ IDM o n t h _ N a m e

1 . . n

b. Homogeneous Composition example (n:m)

Page 7: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 69

PROPOSED MODEL FORXML DOCUMENTWAREHOUSE (XDW)

The XDW model is shown in Figure4. To our knowledge, this is unique in that itutilises XML itself (together with XMLSchema) to provide: (1) structural con-structs, (2) metadata, (3) validity, and (4)expressiveness (via refined granularity andclass decompositions). The proposed modelis composed of two levels: (1) User Re-quirement Level, which includes the ware-house user requirement document and OOrequirement model, and (2) XML Ware-house Conceptual Level, composed of anXML FACT repository (section on XMLFACT Repository, xFACT) and a collec-tion of logically grouped conceptual viewswhich employ the dimensions that satisfycaptured warehouse user requirements.

User Requirement LevelThe first level of the XDW model

captures the warehouse end-user require-ments. As opposed to the classical datawarehouse models, this requirement modeldoes not consider the transactional data asthe focal point in the design of the ware-house. The XDW conceptual level datamodel is designed based on the user re-quirements. This level is further divided intotwo more sub-components, which are dis-cussed next.

Warehouse User RequirementsThe warehouse user requirements

correspond to the written, non-technicaloutline of the XML warehouse. These areusually the typical or the predictable resultsexpected of the XML Document Ware-house (XDW). Also, these user require-ments can be further refined and/or en-hanced by the participation of domain ex-perts or the users of the transactional sys-tem in question. In addition, further refine-ment can be added to this model by usingapproaches that generate user requirementdocuments such as analysing frequent userquery patterns (Zhang, Ling, Bruckner, &Tjoa, 2003) or other automated approaches.

OO Requirement ModelThe OO requirement model trans-

forms all non-technical user requirementsinto technical, software model specific con-cepts using UML (actors, use-case, andobjects). The OO requirement model canbe further refined through the involvementof domain experts, system analysts, pro-grammers or the operators of the transac-tional system. This OO requirement model,by making more precise the representationof the requirements allows domain expertsand users to move more effectively, evalu-ate their requirements and identifying re-quirements or misrepresentations. It helpsXDW model developers in providing a less

Box F. XML document

<Publ_Year> <xs:element name="Year_ID"/>

<Year_ID> 3 </Year_ID> <Yr_Year> 2004 </Yr_Year> <Yr_SpecialEvents> Conference </Yr_Special_Events>

<Publ_Month> <Month_ID> 9 </Month_ID> <Month_Name> September </Month_Name>

</Publ_Month> </Publ_Year>

Page 8: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

70 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

ambiguous and more meaningful model onwhich to base the data models.

XML Data WarehouseConceptual Level in UMLThe process of deriving the XDW

conceptual data model involves taking theformalised user requirements expressed inUML and then validating them against theXML transactional systems, from which thedata in the XML data warehouse is as-sembled, to check for data availability. Thecase of lacking transactional data neces-sary to assemble certain dimensional orxFACT data highlights an ambiguous ornon-achievable user requirement, whichthen has to be re-defined and/or modified.At the stage of transactional system main-tenance, these will be considered as futureuser requirements to define what dataneeds to be collected.

The xFACT is a snapshot of the un-derlying transactional system(s) for a given

context. A context is more than a mea-sure (Golfarelli, Maio, & Rizzi, 1998; Trujillo,Palomar, Gomez, & Song, 2001) but insteadis an item that is of interest for the organi-sation as a whole.

The role of conceptual views is toprovide perspectives of the document hi-erarchy stored in xFACT repository. Thesecan be grouped into logical groups, whereeach one is very similar to that of a givensubject area (Dillon & Tan, 1993; Coad& Yourdon, 1991) appearing in Object-Ori-ented conceptual modeling techniques.Each subject-area in the XDW model isreferred to as a Virtual Dimension (VDim)in accordance with the language of dimen-sional models. VDim is called virtual sinceit is modeled using a conceptual view(Rajugan et al., 2003) (which is an imagi-nary XML document) behaving as a di-mension to a given perspective from thexFACT. The following sections discuss indetail the modeling of VDims, xFACT re-

Figure 4. XDW Context Diagram

Page 9: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 71

pository and the issues likely to be encoun-tered during this process.

XML FACT Repository (xFACT)and Meaningful FACT

In building the xFACT Repository, wenote that it differs from a traditional rela-tional data warehouse, which has a flatFACT table that is normally modeled as anID packed FACT table with its associateddata perspectives as dimensions. But, inregard to XML, a context refers to thepresence of embedded declarative se-mantics and relationships including or-dered and homogeneous compositions,associations and inheritance as well asnon-relational constructs such as set, list,and bag (see Figure 10). Therefore, we ar-gue that a plain FACT does not providesemantic constructs that are needed to ac-commodate an XML context.

At the logical level we utilize the XMLSchema definition language to capture thexFACT semantics (i.e., classes, relation-ships, ordering and constraints). Therefore,modeling of the xFACT is constrained onlyby the availability of XML Schema ele-ments and constructs. Below we introducea case study example and provide an illus-trative demonstration, which goes throughthe steps involved in the formation of ourxFACT conceptual model.

Case Study: ConferencePublication System (CPSys)

The main component of a conferencepublication comprises of a collection of papers.All the papers are documents originallywritten by one or more authors and which arethen submitted to the appropriate publicationchief editor for review. The structure of themanuscript has three components, namely:Abstract, Content and References. In thereview process, papers are distributed to two

or four referees with expertise in the subjectarea. The given feedback will assist in theselection process of papers to be included inthe proceedings book. A conference is notlimited to one area of research and mayencompass several areas, therefore workshopsare formed to cover each or closely relatedsubject topics. Usually the conferenceparticipants can originate from any part ofworld and belong to various internationallylocated institutes. The conference organisers’responsibility is to develop the schedule andinform the participants, as well as to ensurethat events unfold in an orderly manner.

Taking the above outlined points intoconsideration, we are able to grasp andrepresent these through a conceptual modelusing UML. Concentrating most impor-tantly on the interaction amongst the ob-jects will help determine their relationshipconfiguration and cardinality, which will bereflected in the conceptual design. An ex-amination of our complete UML concep-tual model (Figure 10) emphasises thestructural complexity of the xFACT inwhich real-world objects can be hierarchi-cally decomposed into several sub-elementswhere each of these can include furtherembedded elements. Such decomposi-tions are necessary to provide represen-tation of a real-world object with appro-priate granularity and if needed, addi-tional semantics are added at differentlevels of the hierarchy. For example, thePubl_Conference class hierarchy isdecomposed with additional semantics suchas Publ_Region, Publ_Country andPubl_City. It is important to make thexFACT as expressive as possible whilstretaining the overriding objective of rel-evance.

Virtual DimensionsA user requirement, which is captured

in the OO Requirement Model, is trans-

Page 10: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

72 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

formed into one or more Conceptual Views(Rajugan et al., 2003) also referred to asVirtual Dimension(s), in association withthe xFACT. These are typically views in-volving aggregation or perspectives of theunderlying stored documents of the trans-action system(s). A valid user requirementis such that it can be satisfied by one ormore XML conceptual views for a givencontext (i.e., xFACT). But in the casewhere for a given user requirement thereis no transactional document or data frag-ment to satisfy it, further enhancements arenecessary to make the requirement fea-sible to model with a certain xFACT. There-fore modeling of VDims is an iterative pro-cess where user requirements are validatedagainst the xFACT in conjunction with thetransactional system. It might be simply theinformation required was present in thetransactional systems but not included inthe xFACT or alternatively it was notpresent in the transactional systems.

Definition 1: One Virtual Dimen-sion is composed of one (or more logi-cally grouped) conceptual view, satisfy-ing one (or more logically related) userdocument warehouse requirement(s)(Rajugan et al., 2003, 2004).

We introduced a new UML stereo-type called <<VDim>> to model the vir-tual dimensions at the XDW conceptuallevel (Nassis et al., 2004). This stereotypeis similar to a UML class notation with adefined set of attributes and methods. Theset of methods here can have either con-structors (to construct a VDim) or manipu-lators (to manipulate the VDim attributeset).

In Figures 5 and 6, the relationshipbetween the VDims is modeled with adashed, directed line, denoting the <<con-struct>> stereotype. Though VDims

can have additional semantic relationshipssuch as generalisation, aggregation, and as-sociation (Rajugan et al., 2003, 2004), thesecan be shown using standard UML nota-tion. In addition to this, two VDims can alsohave <<construct>> relationshipswith dependencies such as thoseshown in Figure 6, between VDimsAbstracts_By_Year andAbstracts_By_Keyword.

Dimensions as UML PackagesWe stated that semantically related

conceptual views could be logically groupedtogether as grouping classes into a subjectarea. Further, a new view-hierarchy and/or constructs can be added to include addi-tional semantics for a given user require-ment. In the XDW conceptual model, whena collection of similar or related concep-tual views are logically grouped together,we called it grouped Virtual Dimension(Figures 5-6), implying that it satisfies oneor more logically related user requirement(s).In addition, we can also construct additionalconceptual view hierarchies such as shownin Figures 5-6. These hierarchies may formadditional structural or dependency relation-ships with existing conceptual views or viewhierarchies (grouped VDims) as shown inFigures 7-8. Thus it is possible that a clus-ter of dimensional hierarchy(ies) can beused to model a certain set of userrequirement(s). Therefore, we argue thatthis aggregate aspect can give us enoughabstraction and flexibility to design a user-centred XDW model.

In order to model an XML view hier-archy or VDim and capture the logicalgrouping among them, we utilise the pack-age construct in UML. According to OMGspecification:

“A package is a grouping of model elements.Packages themselves may be nested within

Page 11: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 73

other packages. A package may containsubordinate packages as well as otherkinds of model elements. All kinds of UMLmodel elements can be organized intopackages” (OMG, 2003, Part 3, Section3.13.1, p. 416 of 736).

The definition of Virtual Dimensionhere is a package. This in practice describesour logical grouping of XML conceptual

views and their hierarchies. Thus we uti-lise packages to model our connected di-mensions (Figure 9).

Following the arguments similar tothese above, we can show that, the xFACT(shown in Figure 8) can be grouped intoone logical construct and can be shown inUML as one package. In Figures 8 and 9,we show our case study XDW model with

Figure 5. A conceptual view hierarchy with <<construct>> stereotype

Figure 6. VDim “Abstract” package contents

Authors_By_Conference<<VDim>>

Author_List<<VDim>>

Authors_By_Publication<<VDim>>

Authors_By_Institute<<VDim>>

Abstracts_By_Keyword<<VDim >>

Ab stracts _By _Confe rence<<VDim >>

Abstract_Lis t<< VDi m >>

Abstracts_By_Year<< VDi m >>

A u tho r_ L is t < <V D im > >

A bs tract_ L is t_B y_A utho r < <V D im > >

A bst_ L is t_B y_ A u tho r_B y_ Y e ar < <V D im > >

A bs trac t_ L is t < <V D im > >

Figure 7. A VDim hierarchy (derived from grouped VDims)

Page 12: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

74 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

xFACT and VDims connected via<<construct>> stereotype.

Transformation of xFACT OOConceptual Model to XML Schema

Using the generic rules (Feng et al.,2003), we are able to accomplish the sys-tematic transformation of our OO concep-tual model into XML Schema, which is thelogical model. Initially we envisage thatthe xFACT table will be an entire majordocument of its own containing elements,further embedded elements, and relation-ships amongst these, which would trans-late into smaller complex or simple struc-tured components.

We assume that class C, corre-sponds to the composite document(xFACT) containing additional classesC

1,…C

n. The steps to transform this into

XML Schema segment are given next(Feng et al., 2003).

Step 1: Create an element named Cwith a ComplexType, CType<xs:element name=”C” type=”CType”/>

Step 2: For each of the embeddedclass C

i of C, create a sub-element C

i with

a ComplexType, CiType. <xs:element

name=”Ci” type=”C

iType”/> Elements can

also have a simple structure when they carrysingle valued attributes from the in-built datatypes of the XML Schema (e.g., string, in-teger). To apply these rules and illustratethis in more detail, we will refer to our OOconceptual model of xFACT (Figure 10).The real-world object Publications(root element) is hierarchically decomposedinto Publ_Conference andPubl_Papers. This can be translatedinto XML Schema as shown in Box G.

The corresponding instance documentfor the XML Schema segment in Box G isshown in Box H.

Figure 8. “Conference_List”, “Abstract_List” and “Author_List” packages

Figure 9. XDW Conceptual Model (in UML)

Publications ( xFACT )

Author_List <<VDim>>

Conference_List <<VDim>>

Abstract_List <<VDim>>

Proceedings (xFACT)

Author_List<<VDim>>

Conference_List<<VDim>>

Abstract_List_By_Author<<VDim>>

Abs_Lst_By_Author_By_Year<<VDim>>

Abstract_List<<VDim>>

Publication_List<<VDim>>

Page 13: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 75

Box G. XML Schema

Step 1 <xs:element name="Publications" type=" Publ_PublicationType "/> Step 2 <xs:complexType name="Publ_PublicationType"> <xs:sequence>

<xs:element name="Publ_INDEX" type="xs:ID"/> . . . . . . <xs:sequence> <xs:element name="Publ_Conference" type="Publ_ConferenceType" maxOccurs="unbounded"/> <xs:element name="Publ_Papers" type="Publ_PaperType" maxOccurs="unbounded"/> </xs:sequence> </xs:sequence> </xs:complexType> <xs:complexType name="Publ_ConferenceType"> <xs:sequence> <xs:element name="Conf_Conference_ID" type="xs:ID"/> . . . . . . <xs:sequence> <xs:element name="Publ_Region" type="Publ_RegionType"/>

<xs:element name="Publ_Schedule" type="Publ_ScheduleType" maxOccurs="unbounded"/>

</xs:sequence> </xs:sequence> </xs:complexType>

Box H. Instance document for XML Schema in Box G

<Publications> <Publ_INDEX> This is the index </Publ_INDEX> <Publ_ISBN> This is the ISBN of the publication </Publ_ISBN> <Publ_Publisher> This is the publisher name </Publ_Publisher> <Publ_Title> This is the publication title </Publ_Title> <Table_Of_Contents> This is the publisher name </Table_Of_Contents> <Publ_Editors> Editor list of the publication </Publ_Editors> <Publ_Office> Office name </Publ_Office> <Publ_ContactNO> +61-3-1234-5678 </Publ_ContactNO> <Publ_Address> This is the address </Publ_Address>

</Publication> <Publ_Conference>

<Conf_Conference_ID> Identication number of conference </Conf_Conference_ID>

<Conf_Name> Name of conference </Conf_Name> <Conf_PCChair> Chair List </Conf_PCChair> <Conf_PCCommitte> Committee List </Conf_PCCommitte> <Conf_Editors> Editor list </Conf_Editors>

</Publ_Conference>

Page 14: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

76 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

Another important relationship exist-ing in the xFACT conceptual model is thegeneralisation relationship for whichclasses are categorised according to theirdifferences and similarities. As seen fromFigure 10, class Publ_Person is an ab-stract class, which creates instance classessuch as Publ_Author andPubl_Referee. In Box I is the trans-formation of class Publ_Person (super-class) also showing the connection ofPer_BelongsTo with classPubl_Institute.

In Box J is the transformation of thesub-class Publ_Author into XMLSchema.

A child class can inherit all the prop-erties from its parent class. In this casePubl_Author class has all the attributesof the super class and in additionAu_AuthorID is included, which is exclusiveonly in this class. The tag <xs:extensionbase=”Publ_PersonType”> shows thatthe current class created as an extension ofan existing class. An instance of a corre-sponding XML document is given in Box K.

Figure 10. The complete xFACT of the Conference Publication System (CPSys) case study

Publ_Institute

Inst_IDInst_NameInst_TypeInst_ContactPersonInst_ContactNOInst_Address

Publ_Person

Per_TitlePer_FirstNamePer_Las tNamePer_ContactNOPer_Address

1

1..n

1

1..n

belongs_to1..n

conducts

Publ_Time

Session_IDSession_TimeSession_Notes

Publ_Section

Sec tion_IDSec tion_Ti tl eSec tion_Cont entsSec tion_FiguresSec tion_Tables

nn

Publ_Location

Loc_Lo cationIDLoc_Room NoLoc_CapacityLoc_AV_Faci lity

0..n1 0..n1

may have

Publ_Day

Day_IDDay_NameDay_Date

1..n

1..n

1..n

1..n

Publ_Keywords

Key_PaperIDKey_Keyword

Publ_Referee

Ref_RefereeID

Publ_Chapter

Ch_IDCh_NoCh_TitleCh_Keywords

1..n1..n

Publ_City

City_IDCity_NameCity_MajorAttractionsCity_Spec ialEvents

1 .. n1 .. n

Publ_Month

Month_IDMonth_Name

1..n

1..n

1..n

1..n

Publ_Abstract

Abst_PaperIDAbst_Contents

<<1>>

1..n1..n

Publ_Authors

Au_AuthorID

Publ_Research_Area

ReasearchArea_IDPubl_ResearchAreaPubl_CoRelatedArea

interested_in

Publ_RefereeComments

Ref_Com mentIDRef_FinalComm ents 1 11 1

write

Publ_Content

Cont_PaperIDCont_NoOfPagesCont_WordsCount

<<2>>

1..n1..n

Publ_References

Reference_IDReference_DetailsTotal_References

<<3>>Publ_Country

Co_CountryIDCo_CountryNameCo_VIS A_ReqCo_W eat her

1..n1..n

Publ_Year

Year_IDYr_YearYr_SpecialEvents

Publ_Papers

Paper_IDPaper_DescPaper_Type

11

1

1..6

1

1..6sub mit

1

0..n

1

0..n

assoc iated_with

2..4

1

2..4

1

referred_comments

11

Publ_Region

Reg_Region_IDReg_Region_CommentsReg_Region_Desc

Publ_Schedule

Sch_ScheduleIDSch_Desc

11

PublicationsPubl_INDEXPubl_ISBNPubl_PublisherPubl_TitleTable_Of_ContentsPubl_EditorsPubl_OfficePubl_ContactNOPubl_Address

1..n1..n

Publ_Conference

Conf_ConferenceIDConf_NameConf_PCChairConf_PCCommitteeConf_Editors

11 1..n1..n

1..n

1..n

1..n

1..n

W orkshop

Workshop_IDWorkshop_Name

1..n

1..n

1..n

1..n

1..n

1

Page 15: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 77

User Requirements (URs) and theSystematic Querying Approach for

Virtual Dimension/s (VDim/s)Previously for the conversion of

xFACT into XML Schema we used thetransformations discussed in the papers byFeng et al. (2003) and Xiaou, Dillon, Chang,& Feng (2001). In the case of VDims, thisis actually the transformation between the

conceptual views into XML View (Rajuganet al., 2003, 2004) Schemas. An XML Viewis defined as:

Definition 2: An XML View is animaginary XML Document which pointsto a collection of semantically relatedXML tags from an XML domain and sat-isfies a Conceptual View definition from

Box I. Transformation of class Publ_Person (superclass) also showing the connection ofPer_BelongsTo with class Publ_Institute

<xs:element name="Publ_Person" type=" Publ_PersonType "/> <xs:complexType name="Publ_PersonType"> <xs:sequence> <xs:element name="Per_Title" type="xs:string"/> . . . . . . <xs:element name="Per_belongsTo"> <xs:complexType> <xs:all> <xs:element name="Publ_Institute" type="Publ_InstituteType"/> </xs:all> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType>

Box J. Transformation of the sub-class Publ_Author into XML Schema

<xs:element name="Publ_Author" type=" Publ_AuthorType "/> <xs:complexType name="Publ_AuthorType"> <xs:complexContent> <xs:extension base="Publ_PersonType"> <xs:sequence> <xs:element name="Au_AuthorID" type="xs:ID"/> </xs:sequence> </xs:extension> </xs:complexContent> </xs:complexType>

Box K. XML document

<Publ_Author> <Per_Title> Mrs. </Per_Title> <Per_FirstName> Helen </Per_FirstName> <Per_LastName> Smith </Per_LastName> <Per_ContactNO> +61-3-1234-3458 </Per_ContactNO> <Per_Address> LaTrobe University, Bundoora, 3086, Australia </Per_Address> <Au_AuthorID> 12345 </Au_AuthorID>

</Publ_Author>

Page 16: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

78 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

the target XML conceptual domain(Rajugan et al., 2004, p. 8).

Definition 2 indicates that for eachconceptual view there is a correspondingXML document, which may contain moreembedded XML documents. Querying theresulting XML document using any querylanguage for XML with language specificsyntax allows one to extract the informationneeded in order to meet all the conditions ofeach user requirement. The query processis well suited for simple, straightforward re-quirements but it tends to get very com-plex especially when dealing with user re-quirements having a wide-ranging context.

Logical Implementationof the Virtual Dimension

Considering the above facts regard-ing the query process, in the segment ofqueries that follow, the preferred query lan-guage is XQuery (W3C Consortium, 2004).However XQuery at this stage proves lim-ited regarding “group by” and aggregatefunctions for data warehouse operations(need to use nested loops which are diffi-cult to operate) and are still in progress forfuture development (Fankhauser et al.,2003). Due to this, and in order to illustratethe purpose of querying XML documentsin relation to VDim, using the notion ofXQuery, in this article we also propose ageneric query algorithm, which will bethe foundation to build smaller algorithmicsegments to suit the structure of the casequeries from common cases and henceenable full capture of each user require-ment.

We use the terms defined in W3C,namely Query, Context, and ReturnContext, extracted from the illustrated usecase queries, in order to guide us in deriv-ing and defining the main parameters to beused in our proposed algorithm. We con-

sider that each class involved in the searchquery can be of simple or complex struc-ture, meaning that it can contain only at-tributes, only elements and sometimes acombination of both. What follows is a listof the conditions and explanations of thekeywords that will aid to design our ge-neric query algorithm.

• Keyword 1: Query Context: Thisspecifies the full path of classes used tolocate the attributes and elements to bequeried. Classes can have further em-bedded attributes and/or elements.

• Keyword 2: Query ContextValue: This is the parameter value(specified by the user) used to executethe query and is located in the last oc-curring attribute or element from theclass specified in the query con-text path. The value can be a preciseword, phrase, number or include a groupof values denoted in the query by theword All.

• Keyword 3: Return Context: Theresults are obtained having that all thequery’s conditions have been met. Whilethe search is conducted within theclass(es) specified in the query con-text, it is possible that the requiredattribute(s) or element(s) might originatefrom a different class. Also in somecases the outcome may be comprisedof new assembled attribute(s) and/orelement(s).

• Keyword 4: Sort: This function ap-plies to the queries requiring the ‘order-ing’ of resulting records. This is per-formed based on the subject fac-tor, which is directly related to theuser’s need regarding the display of thequery outcome. A subject factorvalue is defined by a name, a number,or it can be of any value provided it ex-

Page 17: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 79

ists within the class’ attribute(s) and/orelement(s) involved in the query.

• Keyword 5: Merge: The mergingfunction facilitates in combining sepa-rate query outcomes together. The sub-ject factor as previously stated isused to sort records based on a valuetype. When records are ordered, it islikely that there may be more than oneentry that belongs under the same sub-ject factor value. In order topresent the results in a more uniformlayout, we merge the records occur-ring under the common subject factorvalue and remove duplicates.

Regardless of what each VDim aimsto fulfil, we categorise four different typesof VDims. A description and diagram pro-vides illustration of each of the differenttypes of VDims, namely: Selection, Sort-ing, Implicit Join, and Explicit Join.

Selection Virtual DimensionThis consists of selecting and extract-

ing instance documents originating from oneor more classes. The required instancedocuments correspond to what has been

specified in the query contextvalue, which can be an exact term orinvolve a group of values. The records ob-tained construct a new “partial” class, sig-nifying that only a part of the originalclass(es) are required and therefore ex-tracted. This is shown in Figure 11.

Sorting Virtual DimensionIn the Selection VDim the resultant

document list would be displayed in ran-dom order, but the Sorting VDim requiresfor this to appear in a certain order. Commoninstance cases highlight the fact that sortingis to be done in alphabetical or chronologicalorder. Figure 12 demonstrates that the result-ing class A1 is now sorted.

Implicit Join Virtual DimensionThe concept of class hierarchical de-

composition proves necessary to providegranularity to a real-world object. There-fore it is likely that a VDim will involvejoining of two or more classes/elements,which may appear at different levels withina hierarchy. An Implicit Join VDim (Fig-ure 13) applies when classes originatingfrom different hierarchical paths and sharea common parent class, are joined to form

Figure 11. Selection VDim

Figure 12. Sorting VDim

Page 18: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

80 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

a combined class; in other words, the theprocess carried out is analogous to that ofPath Traversing (Bertino, 1994).

Explicit Join Virtual DimensionThis type of VDim denotes that join-

ing is not limited to occurring within the hi-erarchical paths originating from the sameparent. Now it can also emerge fromclasses belonging to different source com-ponents, which are not directly related. Forinstance, in Figure 14 we are able to jointhe components B and G.

Using the keywords presented andthe four types of Virtual Dimensions pre-sented, we are able to construct a generic

query algorithm as shown in Box L, en-compassing all these existing components.Note that “[ ]” indicates that the containedfunction is optional.

Table 1 provides a full illustration of thealgorithm’s purpose by conveying its maincapabilities. These have been applied to asample set of queries based on our case studyexample, Conference Publication System(CPSys), including the design and implemen-tation of the most suited algorithm for eachcase. Each type of the Virtual Dimension sofar discussed is also demonstrated.

The three main keywords; QueryContext, Query Context Value,and Return Context, are treated as

Figure 13. Implicit Join VDim

Figure 14. Explicit Join VDim

Page 19: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 81

mandatory for any algorithm designed andSubject Factor Value is applicableonly when the functions Sort and Mergeare required. The format of each casequery, shown in Table 1, will be presentedas follows: Query Number, QueryName, Query Context, Query Con-text Value, Return Context, andComments (written explanation of thequery highlighting the involved classes andthe connection amongst these). Note thatupper case letters signify classes and lowercase attributes and elements. A full refer-ence to each class’s components and rela-tionships involved in the queries is shownin Figure 10.

CONCLUSION ANDFUTURE WORK

XML has become an increasinglyimportant data format for storing structuredand semi-structured text intended for dis-semination and ultimate publication in avariety of media. It is a mark up lan-guage that supports user-defined tagsand encourages the separation of docu-ment content from presentation. In thisarticle, we present a coherent way to

integrate a conceptual design method-ology to build a native XML documentwarehouse. The proposed XML designmethodology consists of user requirementlevel, warehouse conceptual level and XMLSchema level, which capture warehouseuser requirements and the warehousemodel respectively. The only model not dis-cussed in detail is the OO user requirementmodel, as this is part of our impending work.

We also investigated a number of theissues in the conceptual design of XML datawarehouses rather than dealing with itsimplementation. We used existing genericrules (Feng et al., 2003) to develop the for-mal steps and demonstrate the transforma-tion of the xFACT conceptual model intoXML Schema. Simultaneously we high-lighted two cases of semantic involvementwithin the xFACT, namely ordered and ho-mogeneous compositions. The User Re-quirements (URs) are important compo-nents and aimed to be fully captured at thedesign level. We proposed a systematicquerying approach to access the data ware-house considering the need to accommo-date XML semantics such as aggregatefunctions. Also, we formally defined four

Box L. Generic Query Algorithm

Get Query Context Value For All | Each value/s within the attribute/s or element/s of the Query Context path

If the value is empty Go to the next value

[Check for exact match If there is no match Go to the next value]

Return Context attribute/s and/or element/s value/s [Sort by Subject Factor Value] [If the Subject Factor Value is a duplicate Merge entries under one common Subject Factor Value Delete duplicated Subject Factor Value ELSE ( )]

Page 20: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

82 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

VD

im T

ype

&

Req

uire

men

t Q

uery

Con

text

Q

uery

Con

tex t

V

alue

Su

bjec

t Fac

tor

Val

ue

Ret

urn

Con

text

C

omm

ents

Q

uery

Alg

orit

hm

Sele

ctio

n V

Dim

on

an E

xact

Val

ue

A

utho

r lis

t ba

sed

on g

iven

inst

itut

e na

me

./PUBL_PAPERS

/PUBL_AUTHORS

/PUBL_INSTITUTE

/Inst_Name

La Trobe

University

N/A

./PUBL_PAPERS

/PUBL_AUTHORS/

It

is r

equi

red

for

all a

utho

rs’

de-

tail

s to

be

list

ed a

ccor

ding

to th

eir

asso

ciat

ed in

stit

ute,

whi

ch is

“La

Trobe University.”

A s

im-

ple

asso

ciat

ion

betw

een

the

two

clas

ses Publ_Authors

and

Publ_Institute

ena

bles

one

to

ext

ract

the

requ

ired

dat

a.

Get La Trobe University

For each value of the element

Inst_Name

If the value is empty

Go to the next value

Check for exact match

If there is no match

Go to the next value

Return Context values of all the

elements in the PUBL_AUTHORS class

Se

lect

ion

VD

im o

n a

Gro

up o

f Val

ues

L

ist o

f co

nfer

ence

na

mes

./PUBL_CONFERENCE

/Conf_Name

All

N/A

./PUBL_CONFERENCE

/Conf_Name

In

ord

er to

gen

erat

e th

e re

quir

ed

list

of

all t

he c

onfe

renc

es n

ames

, th

e pr

opos

ed q

uery

wou

ld p

erfo

rm

a se

arch

with

in th

e cl

ass

Publ_Conference

and

mor

e sp

ecif

ical

ly in

the

elem

ent

Conf_Name.

A g

roup

of

valu

es

is in

clud

ed, a

s op

pose

d in

the

pre-

viou

s ca

se c

heck

ing

for

a m

atch

ba

sed

on a

spe

cifi

c te

rm v

alue

.

Get All values of element Conf_Name

For All values of the element

Conf_Name

IF the value is empty

Go to the next value

Return Context all values in the

element Conf_Name

Impl

icit

Joi

n V

Dim

Alp

habe

tica

l lis

t of

co

nfer

ence

nam

es

./PUBL_CONFERENCE

/Conf_Name

ALL

Conference

Name

./Publ_Conference

_Sort_By_Name

In

the

prev

ious

alg

orit

hms,

the

list

of

res

ults

wou

ld b

e sh

own

in a

ra

ndom

ord

er. I

t is

now

req

uire

d fo

r th

e re

sult

s to

be

pres

ente

d in

a

spec

ific

ord

er, b

ased

on

a sub-

ject factor value

(al

pha-

beti

cal)

spe

cifi

ed b

y th

e us

er. I

n th

is c

ase

the

resu

lt is

the

new

ly

form

ed e

lem

ent

Publ_Conference_Sort_By

_Name.

Get All values of element Conf_Name

For All values of the element

Conf_Name

IF the value is empty

Go to the next value

Return Context of all values of the

element Conf_Name

Sort Conference Name list in alpha-

betical order

Table 1. Illustration of VDims logical implementation

Page 21: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 83

Impl

icit

Joi

n V

Dim

Chr

onol

ogic

al li

st

of c

onfe

renc

es

./PUBL_CONFERENCE

/Conf_Name

./PU

BL

_CO

NFE

RE

NC

E

/PU

BL

_SC

HE

DU

LE

/P

UB

L_Y

EA

R

/Yr_

Yea

r

All

Year

./PUBL_CONFERENCE

/Conf_Name

./PUBL_CONFERENCE

/PUBL_SCHEDULE

/PUBL_YEAR

/Yr_Year

./Publ_Conference

_Sort_By_Year

The

out

com

e li

st is

like

ly to

hav

e m

ore

than

one

con

fere

nce

nam

e oc

curr

ing

unde

r th

e sa

me year

, w

hich

mea

ns h

avin

g m

any

in-

stan

ce d

ocum

ents

und

er th

e sa

me

subject factor value.

App

lyin

g th

e m

erge

alg

orit

hm

ensu

res

that

all

conf

eren

ces

taki

ng

plac

e in

the

sam

e ye

ar a

re to

be

grou

ped

and

appe

ar u

nder

one

co

mm

on year

hea

ding

.

Get All values of elements

Conf_Name and Yr_Year

For All values of the elements

Conf_Name and Yr_Year

If the value is empty

Go to the next value

Return Context of all values in

elements Conf_Name and Yr_Year

Sort Conf_Name by Year

If the Year value is a duplicate

Merge entries under one common

Year Value

Delete duplicate Year Value

Else ( )

Exp

lici

t Jo

in V

Dim

Chr

onol

ogic

al li

st-

ing

of c

onfe

renc

e na

mes

wit

h cr

oss

refe

renc

e to

aut

hor

deta

ils

./PUBL_CONFERENCE

/Conf_Name

./PUBL_CONFERENCE

/PUBL_SCHEDULE

/PUBL_YEAR

/Yr_Year

All

Year

./PUBL_CONFERENCE

/Conf_Name

./PUBL_CONFERENCE

/PUBL_SCHEDULE

/PUBL_YEAR

/Yr_Year

./PUBL_PAPERS

/PUBL_AUTHORS

./Publ_Conference

_Sort_By_Year

A c

hron

olog

ical

list

ing

is a

gain

re

quir

ed w

ith th

e di

stin

ctio

n to

in

clud

e cr

oss-

refe

renc

ing

from

the

clas

s Publ_Authors

. Thi

s sh

ows

that

the

retu

rn c

onte

xt is

no

t bou

nd to

be

wit

hin

the

clas

ses

that

hav

e in

tern

al d

irec

t con

nec-

tion

s. I

nste

ad a

t tim

es it

may

be

nece

ssar

y to

go

thro

ugh

seve

ral

intr

a-co

nnec

tions

wit

h ex

tern

al

obje

cts

to o

btai

n th

e re

quir

ed v

al-

ues.

The

refo

re th

e al

gori

thm

in-

clud

ing

the

sort

and

mer

ge f

unc-

tion

s is

app

lied

.

Get All values of elements

Conf_Name and Yr_Year

For All values of the elements

Conf_Name and Yr_Year

If the value is empty

Go to the next value

Return Context of all values in

elements Conf_Name, Yr_Year and all

values in class Publ_Authors

Sort Conf_Name by Year

If the Year value is a duplicate

Merge entries under one common

Year Value

Delete duplicate Year Value

Else ( )

Table 1. Illustration of VDims logical implementation (continued)

Page 22: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

84 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

different types of Virtual Dimensions at theconceptual level.

For future work, the following sub-ject matter deserve investigation: (1) de-velopment of formal semantics to automatemapping between XML Data and XDWSchema where the view is defined moreprecisely, (2) incremental update of mate-rialised views, (3) derivation of user require-ments by investigating frequent path que-ries (Zhang et al., 2003), and (4) investiga-tion of performance issues upon query ex-ecution in relation to accessing the datawarehouse.

REFERENCESAbelló, A., Samos, J., & Saltor, F. (2001).

Understanding facts in a multidimen-sional object-oriented model. In Pro-ceedings of the Fourth InternationalWorkshop on Data Warehousing andOLAP (DOLAP), Altlanta, November,(pp. 32-39) .

Bertino, E. (1994). A survey on indexingtechniques for object-oriented data-bases. In Freytag et al. (Eds.), QueryProcessing for Advanced DatabaseSystems. Morgan Kaufmann Pub-lisher.

Coad, P., & Yourdon, E. (1991). Object-oriented analysis. Australia: PrenticeHall.

Dillon, T.S., & Tan P.L. (1993). Object-oriented conceptual modeling. Aus-tralia: Prentice Hall.

Elmasri, R., & Navathe, S.B. (2000).Fundamentals of database systems(3rd Ed.). Addison-Wesley.

Fankhauser, P., & Klement, T. (2003).XML for data warehousing changes& challenges. In Data Warehousingand Knowledge Discovery, Proceed-ings of the Fifth International Con-ference (DaWaK 2003), Prague,

Czech Republic, September 3-5, Lec-ture Notes in Computer Science,(Vol. 2737, pp. 1-3). Berlin: Springer-Verlag.

Feng, L., Dillon, T., & Chang. E. (2002).A semantic network based designmethodology for XML documents.ACM Transactions on InformationSystems, 20(4), 390-421.

Feng, L., Chang, E., & Dillon, T. (2003).Schemata transformation of object-ori-ented conceptual models to XML. In-ternational Journal of ComputerSystems Engineering (CSSE), 18(1),45-60.

Golfarelli, M., Maio, D., & Rizzi, S.(1998). The dimensional fact model:A conceptual model for data ware-houses. International Journal of Co-operative Information Systems (In-vited proceeding), 7-2-3.

Kimball, R., & Ross, M. (2002). The datawarehouse toolkit: The completeguide to dimensional modeling (2nd

Ed.). Wiley Computer Publishing.Lujan-Mora, S., Trujillo, J., & Song, I-Y.

(2002). Extending the UML for multi-dimensional modeling. In UML 2002:The Unified Modeling Language,Proceedings of the Fifth Interna-tional Conference, Dresden, Ger-many, September 30-October 4, (pp.290-304). Lecture Notes in ComputerScience (Vol. 2460). Berlin: Springer-Verlag.

Lujan-Mora, S., Trujillo, J., & Song, I-Y.(2002). Multidimensional modelingwith UML package diagrams. In Con-ceptual Modeling: ER 2002, Pro-ceedings of the 21st InternationalConference on Conceptual Model-ing, Tampere, Finland, October 7-11,Lecture Notes in Computer Science(Vol. 2503, pp. 199-213). Berlin:Springer-Verlag.

Page 23: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 85

Nassis, V., Rajugan, R., Dillon, T.S., &Rahayu, W. (2004). Conceptual designof XML document warehouses. InProceedings of the Sixth Interna-tional Conference on Data Ware-housing and Knowledge Discovery(DaWak 2004), Zaragoza, Spain, Sep-tember 1-3, (pp. 1-14).

Nassis, V., Rajugan, R., Dillon, T.S., &Rahayu, W. (2005). A systematic designapproach for XML-view driven Webdocument warehouses. InternationalWorkshop on Ubiquitous Web Systemsand Intelligence (UWSI ’05),Singapore, May 9-12, (pp. 914-924).

Object Management Group. (n.d.). Data ware-housing, CWM™and MOF™resourcepage. Retrieved from http://www.omg.org/cwm/

Object Management Group. (2003,March). Unified modeling languagespecification. Version 1.5 formal/03-03-01. Retrieved from http://www.omg.org/ technology/docu-ments/formal/uml.htm

Pokorny, J. (2002). XML data ware-house: Modelling and querying. TheNetherlands: Kluwer Academic Pub-lishers.

Rahayu, W.J. et al. (2002). Aggregationversus association in object modelingand databases. In Proceedings of theACS Conference on IS, (pp. pp 521 –532). Australian Computer SocietyInc..

Rahayu, W.J., Dillon, T.S., Mohammad,S., & Taniar, D. (2001). Object-rela-tional star schemas. The ThirteenthIASTED International ConferenceParallel and Distributed Computingand Systems. Anaheim, California:IASTED/ACTA Press.

Rajugan, R., Chang, E., Dillon, T.S., &Feng, L. (2003). XML views: Part I.In Proceedings of the 14th Interna-

tional Conference on Database andExpert Systems Applications (DEXA2003), Prague, Czech Republic, Sep-tember 1-5, (pp. 148-159).

Rajugan, R., Chang, E., Dillon, T.S., &Feng, L. (2004). XML Views, Part II:Modelling Conceptual Views UsingXSemantic Nets. Workshop & Spe-cial Session on IndustrialInformatics, The 30th Annual Con-ference of the IEEE Industrial Elec-tronics Society (IECON ’04), SouthKorea, November.

Trujillo, J., Palomar, M., Gomez, J., &Song, I-Y. (2001). Designing datawarehouses with OO conceptual mod-els. Computer, IEEE Computer So-ciety, (12), 66-75.

W3C Consortium. (2000). EXtensibleMarkup Languages. Retrieved fromhttp://www.w3.org/XML/

W3C Consortium. (2001). XML schema.Retrieved from http://www.w3c.org/XML/Schema

W3C Consortium. (2003). XML Query(XQuery) requirements. Retrievedfrom http://www.w3.org/TR/xquery-requirements/

W3C Consortium. (2004). XQuery1.0and XPath 2.0 Full-text specifica-tion . Retrieved from http://www.w3.org/TR/2004/WD-xquery-full-text-20040709/

W3C Consortium. (2004).XQuery1.0and XPath 2.0 Full-text use cases.Retrieved from http://www.w3.org/TR/2004/WD-xmlquery-full-text-use-cases-20040709/

Xiaou, R., Dillon, T.S., Chang, E., & Feng,L. (2001). Modeling and transfor-mation of object-oriented concep-tual models into XML schema. InProceedings of the 12th Interna-tional Workshop on Database andExpert Systems Applications

Page 24: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

86 International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

(DEXA), Munich, Germany, Septem-ber 3-7, (pp. 795-804).

Xyleme Project. (n.d.). PublicationsWeb site . Retrieved from http://www.xyleme.com/

Xyleme, L. (2001). A dynamic warehousefor XML data of the Web. IEEE DataEngineering Bulletin, 24(2), 28, 40-47.

Zhang, J., Tok, W.L., Bruckner, R.M., &Tjoa, A M. (2003). Building XML datawarehouse based on frequent patternsin user queries. In Data Warehous-ing and Knowledge Discovery, Pro-ceedings of the Fifth InternationalConference (DaWaK 2003), Prague,Czech Republic, September 3-5, Lec-ture Notes in Computer Science (Vol.2737, pp. 99-108). Berlin: Springer-Verlag.

Miss Vicky Nassis received her Bachelor of Information Systems and Bachelor of Business fromLa Trobe University and is currently a PhD student within the school of Computer Science andComputer Engineering at the same university. Her research interests include object-orientedconceptual models, XML and data warehousing. Her thesis is concerned with the conceptualdesign of data warehouses in integration with XML (eXtreme Markup Language). She hasbeen examining techniques to overcome some of the problems encountered during the conceptualdesign process. She has also been involved in exploring the issues of XML and its use inhandling publications over the web. She has had publications in the proceedings ofinternational conferences, in the field of data warehousing, knowledge discovery as well as inweb systems and intelligence.

Mr. Rajugan Rajagopalapillai holds a bachelor’s degree in information systems (BInfoSys)from La Trobe University, Australia. He has worked in the industry as chief application/databaseprogrammer in developing sports planing and sports fitness & injury management softwareand as database administrator. He was also involved in developing an e-commerce solution fora global logistics (logistics, cold-storage and warehousing) company as a software engineer/architect. He is currently a PhD student at University of Technology, Sydney (UTS) Australia.He has published research articles which have appeared in international refereed conferenceand journal proceedings. His research interests include object-oriented conceptual models,XML, data warehousing, software engineering, database and e-commerce systems. He is amember of the IEEE, a member of the ACM and an associate member (AACS) of the AustralianComputer Society.

Tharam S. Dillon is the dean of the Faculty of Information Technology at the University ofTechnology, Sydney (UTS). His research interests include data mining, internet computing,e-commerce, hybrid neuro-symbolic systems, neural nets, software engineering, database systemsand computer networks. He has also worked with industry and commerce in developingsystems in telecommunications, health care systems, e-commerce, logistics, power systems,and banking and finance. He is editor-in-chief of the International Journal of Computer SystemsScience and Engineering and the International Journal of Engineering Intelligent Systems, aswell as co-editor of the Journal of Electric Power and Energy Systems. He is on the advisoryeditorial board of Applied Intelligence (Kluwer, US) and Computer Communications (Elservier,UK). He has published more than 400 papers in international and national journals andconferences and has written four books and edited five other books. He is a fellow of the IEEE,fellow of the Institution of Engineers (Australia), and fellow of the Australian Computer Society.

Page 25: Conceptual and Systematic Design Approach for XML Document ... · dimensions), several design techniques have been proposed to capture multidimen-sional data (MD) at the conceptual

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

International Journal of Data Warehousing & Mining, 1(3), 63-87, July-September 2005 87

Dr. Rahayu is an associate professor at the Department of Computer Science and ComputerEngineering, LaTrobe University, Australia. She has been actively working in the areas ofdatabase design and implementation covering object-relational databases, Web and e-commercedatabases, semi-structured databases and XML, Semantic Web and ontology, data warehousing,and parallel databases. She has worked with industry and expertise from other disciplines in anumber of projects including bioinformatics databases, parallel data mining, and e-commercecatalogues. She has been invited to give seminars in the area of e-commerce databases in anumber of international institutions. She has edited three books which forms a series in webapplications, including Web databases, Web information systems and Web Semantics. She hasalso published over 70 papers in international journals and conferences.