Ontology based ETL process for creation of ontological data warehouse

Joel Villanueva Chávez¹, Xiaoou Li¹
¹Computer Science Department, CINVESTAV-IPN, México D.F., México

Email: [email protected]; [email protected]

Abstract—Extraction, Transformation and Loading (ETL) is a key process of data warehouse building. It integrates data sources with diverse features and structures. Numerous approaches and implementations of ETL have been introduced. However, they still have the following disadvantages: human dependence, information integration only at the syntactic level, incomplete solutions to the heterogeneity problem, difficulty to install and configure, etc.

In this paper, we propose an alternative approach to the ETL process that attacks the heterogeneity of data sources with an ontology-based methodology. Our approach can overcome the drawbacks of most existing approaches, as it automates the key activities of the process, such as the extraction of meta-information, the generation of logical and physical data models, and the transformation of information.

Keywords – ETL, Data Warehouse, Ontology, Knowledge, Semantic Interoperability, Data, Information, Data Model.

I. INTRODUCTION

The Data Warehouse idea was proposed to store a huge amount of data, unifying and consolidating the different data sources of an organization under a unique centralized, unified and standardized schema [6]. The main challenge in building a data warehouse is the integration of data sources with different characteristics, such as structure and access and retrieval mechanisms. ETL (Extraction, Transformation and Loading) is considered the most effective way to load information into a data warehouse.

The main goal of the ETL process is to integrate and unify the various sources of information within an organization in order to populate, and in some cases create, a data warehouse [13].

ETL is intuitive but complex: it must deal with heterogeneity in the data sources, and current tools require a high degree of human intervention in activities such as selecting and assigning the origin and destination of data, constructing the logical and physical models, and verifying the data model.

In 1994 Tim Berners-Lee described a new concept called "The Semantic Web" [22]. It defines a new Web where information is accessed based on its meaning; ontologies are the main tools that enable meaning-based data exchange between information systems. An ontology is a set of concepts that represents a vocabulary describing a specific domain in a formal way, helping to solve the interoperability problem.

In this paper we discuss the process of constructing an ontology model of the knowledge domain of a data warehouse, and propose an alternative approach to the ETL process which automates the following tasks using a master ontology: the categorization of data sources, the generation of routines for data extraction, the generation of the logical and physical data models of the data warehouse, and the generation of routines for data storage.

II. BACKGROUND AND RELATED WORK

Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. Data warehouse systems are valuable tools in today's competitive, fast-evolving world.

A. Data warehouse and ETL process

According to William H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process" [11]. A data warehouse is often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making. A data warehouse can be structured according to the following schemas: 1) the star schema: a multidimensional approach whose main component is a fact table, which represents data that occur many times; surrounding the fact table are dimension tables, each describing one important aspect of the facts; 2) the snowflake schema: different fact tables are connected by sharing one or more common dimensions, sometimes called conformed dimensions, and the combined fact and dimension tables form a shape similar to a snowflake.
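To make the star schema concrete, a minimal sketch follows. The table and column names are hypothetical (a toy sales domain) and the DDL is shown as Java string constants, since the statements handled later in the process are generated and executed as strings.

// Hypothetical star-schema DDL for a toy sales domain; names are illustrative only.
public final class StarSchemaExample {
    static final String DIM_PRODUCT =
        "CREATE TABLE dim_product (product_key INT PRIMARY KEY, name VARCHAR(80), category VARCHAR(40))";
    static final String DIM_TIME =
        "CREATE TABLE dim_time (time_key INT PRIMARY KEY, day_date DATE, month_num INT, year_num INT)";
    // The fact table references each surrounding dimension table.
    static final String FACT_SALES =
        "CREATE TABLE fact_sales (product_key INT REFERENCES dim_product(product_key), "
      + "time_key INT REFERENCES dim_time(time_key), quantity INT, amount DECIMAL(10,2))";

    public static void main(String[] args) {
        System.out.println(DIM_PRODUCT);
        System.out.println(DIM_TIME);
        System.out.println(FACT_SALES);
    }
}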

As shown in Figure 1, the inputs of the ETL process are the data sources of the operational systems; they can be flat files, text files, XML files, etc. The output is the information of all input data sources loaded into the data warehouse. The process has the following steps:

• Extraction: The data and meta-data of all input data sources are analyzed, categorized and described. Then procedures are generated to extract the data and information satisfactorily.

• Transformation: The meta-data and the structure of the data sources and of the targets in the data warehouse are analyzed to generate transformation and association rules between the different data sources and the data warehouse.

• Loading: Using the association rules, insertion tasks and procedures are generated and, finally, the transformed data are loaded into the data warehouse.

Figure 1. The ETL process

SAS Data Integration Studio & DataFlux, IBM Websphere DataStage [20], Microsoft Integration Services and Oracle Warehouse Builder [18] are commercial tools that perform the ETL process. They have some points in common: (i) they work with a proprietary data model; (ii) they use a proprietary transformation model; (iii) updating and configuring them is complex; (iv) their Application Programming Interface (API) is not standardized.

B. Ontology and heterogeneity

Ontologies have been considered an important topic by the artificial intelligence community since the 1990s. The term has expanded and become more popular because it offers a shared and common understanding of a knowledge domain. The basic components used to define and describe ontologies are classes, attributes, relations, functions, axioms and instances [7].

Based on their content, ontologies can be classified as domain ontologies, task ontologies and general ontologies [17].

So that ontologies can be understood and represented by computers, several languages have been proposed, each defining its own semantics and rules to express the different elements of an ontology. Some ontological languages are: SHOE [16], developed at the University of Maryland as an HTML extension to support semantic knowledge in web documents; OML [19], developed at the University of Washington and based on SHOE, in which ontologies are represented as a set of entities of many types; XOL, created by the bioinformatics community in the USA to interchange information among heterogeneous systems; OIL [10], presented as part of the OntoKnowledge project, whose syntax and grammar are based on XOL and RDF and which represents ontologies on three levels (an object level and first and second meta-data levels); RDF [14], proposed by the W3C to specify standardized, interoperable, XML-based semantic content; and OWL, developed by the W3C as a markup language to publish and share data using ontologies on the Web, built on RDF and XML.

The central problem of ETL is to overcome the heterogeneity of information systems. Independently, some approaches have been proposed to assist the ETL process using ontologies. In [15], Cao et al. present a list of key points to implement data warehouses with an ontology-based approach, such as the creation of ontological profiles, the definition of ontological agreements, and the aggregation and transformation of domain elements. In [8], [3], [9], ontologies are used to define business processes in the extraction phase at the conceptual level. These publications use ontologies to improve the ETL process, but they apply them only to define business processes in the data warehouse and only in the extraction phase. There are approaches developed by private enterprises or academic research institutions that attempt to simplify the ETL process [23]. However, none of them attacks the heterogeneity problem and its consequences through real and efficient implementations [13].

To the best of our knowledge, there are no approaches that generate ontology-based physical and logical models and automatically generate the routines to extract and insert data. We will discuss how to apply ontologies in the ETL process to obtain a more efficient and automatic process.

III. ONTOLOGY BASED ETL PROCESS

A. Overall process

The proposed ETL process has two main components: a methodology that describes in detail the steps of the process, and a library of software components to assist the process.

The proposed methodology involves several elements: roles, processes, tasks and generated products. The roles involved in the methodology are the general manager of the ETL (ETL administrator) and the domain experts.

The methodology is applied here to create an ontology of a specific domain, in order to show its feasibility and to show that it can be extended to more general knowledge domains.

The general process is composed of the following modules, shown in Figure 2: meta-information extraction, addition of meta-information, generation of the logical data model, and generation of the physical data model.

• The first process is responsible for extracting meta-information from the various data sources.

• The meta-information is linked to the master ontology.

• A logical model of the data warehouse is built based on the Master Ontology. The generated model is presented to the ETL administrator, who checks it and makes adjustments or corrections. The main function of the ontology is to model the domain elements of the data store, its concepts and relationships; thus, a domain ontology was built with the help of domain experts. In order to build the ontology, the methodologies presented in [21] and [1] were applied; in addition, the Protege 4.0 framework and the OWL language were used (a minimal OWL sketch follows this list).

• The ETL administrator selects the physical model to generate.

• Mappings from the logical model to the physical model are generated automatically.

• The physical model of the data store is generated automatically.

• Mappings from the data sources to the physical model are generated automatically.

• Routines for the extraction and insertion of data are generated, and then the data store is populated.
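The following is a minimal sketch of the kind of domain ontology fragment produced when building the logical model, using the Apache Jena ontology API purely for illustration (the Master Ontology itself was authored with Protege 4.0, and the namespace, class and property names below are hypothetical):

import org.apache.jena.ontology.ObjectProperty;
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;

// A minimal sketch of a domain ontology fragment; names are illustrative only.
public final class DomainOntologySketch {
    public static void main(String[] args) {
        OntModel model = ModelFactory.createOntologyModel();
        String ns = "http://example.org/autoparts#";        // hypothetical namespace

        OntClass sale    = model.createClass(ns + "Sale");
        OntClass product = model.createClass(ns + "Product");
        ObjectProperty concerns = model.createObjectProperty(ns + "concernsProduct");
        concerns.addDomain(sale);                            // a Sale concerns a Product
        concerns.addRange(product);

        model.write(System.out, "RDF/XML");                  // serialize the OWL/RDF fragment
    }
}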

[Flowchart summary of Figure 2: extract meta-information; check whether it is already in the ontology and, if not, integrate it with expert help; build and validate the data warehouse data model (ontology); select the physical model; generate the mappings from the logical model to the physical model and from the data sources to the physical model; extract the data and populate the data warehouse.]

Figure 2. The General Methodology

B. Meta-data extraction

This process takes the data sources as input; they can be databases, XML files, or structured and unstructured text files. The process is divided into four tasks (see Figure 3):

Interconnection with the data sources: JDBC technology is used to communicate with databases, the Xerces API parses the XML files, and the GATE framework together with the approach discussed in [5] is used to access data from text files.

It is important to mention that all the tools used are written in Java, which facilitates their interoperability with the generated software modules.

The extraction of meta-data: Each data source contains meta-data: databases have tables and attributes, XML files have simple and composite data types, and text files have relevant terms such as nouns, verbs and adjectives. In this task we locate and extract all of these items for subsequent treatment.

The generation of Java beans: With the metadata extracted from the data sources and their elements, a hierarchical structure of Java beans that stores all the meta-information is created.
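For the relational case, the two previous tasks can be sketched as follows; the connection URL and the TableBean class are hypothetical stand-ins for the actual generated bean hierarchy, and only the standard JDBC DatabaseMetaData interface is assumed:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class MetaDataExtractionSketch {
    // Hypothetical bean holding the meta-information of one relation.
    static final class TableBean {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();
        TableBean(String name) { this.name = name; }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and credentials.
        Connection con = DriverManager.getConnection("jdbc:mysql://localhost/autoparts", "user", "pwd");
        DatabaseMetaData md = con.getMetaData();

        List<TableBean> beans = new ArrayList<>();
        ResultSet tables = md.getTables(null, null, "%", new String[] { "TABLE" });
        while (tables.next()) {
            TableBean bean = new TableBean(tables.getString("TABLE_NAME"));
            ResultSet cols = md.getColumns(null, null, bean.name, "%");
            while (cols.next()) {                      // attribute name and SQL type
                bean.attributes.put(cols.getString("COLUMN_NAME"), cols.getString("TYPE_NAME"));
            }
            beans.add(bean);
        }
        beans.forEach(b -> System.out.println(b.name + ": " + b.attributes));
        con.close();
    }
}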

The matching with the Master Ontology: Using the object-ontology approach presented in [2], we generate the respective OWL ontological elements from the Java beans. Then SPARQL statements are generated to look for the items in the master ontology. Finally, a list of mappings is created with the located and unlocated items.
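The look-up itself can be sketched as below, using the Apache Jena API only for illustration (the original implementation does not prescribe a particular library); the ontology file name and concept URI are hypothetical:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public final class OntologyMatchingSketch {
    public static void main(String[] args) {
        // Hypothetical master ontology file and concept URI.
        Model master = ModelFactory.createDefaultModel().read("master-ontology.owl");
        String ask =
            "PREFIX owl: <http://www.w3.org/2002/07/owl#> " +
            "ASK { <http://example.org/autoparts#Product> a owl:Class }";

        QueryExecution qe = QueryExecutionFactory.create(ask, master);
        boolean known = qe.execAsk();       // true: the term is already in the master ontology
        qe.close();
        System.out.println(known ? "located" : "unlocated");
    }
}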

[Flowchart summary of Figure 3: text files, XML documents and relational databases are processed (with GATE, Xerces and JDBC respectively) to obtain table and attribute names, relations, simple and complex data types, and relevant nouns, verbs and adjectives; these terms are then matched against the Master Ontology, yielding terms and relations already known to the ontology and terms and relationships that are new.]

Figure 3. Meta-Data Extraction Process

C. Addition of meta-information

This process has the goal of adding the terms and relations unknown to the Master Ontology. It has the following steps (see Figure 4):

Grouping of related terms: With the help of the GATE tool [4], objects and relations with common properties are grouped in an association matrix to build small kernels of information.

Building of relationships: Based on the analysis of the association matrix, relationships between related objects are created.

Generation of micro-ontologies: Both the objects and their relations are transformed into ontological terms with the object-ontology mapping. We thus obtain a set of isolated micro-ontologies.

Addition of micro-ontologies to the Master Ontology: Using the ontology merge algorithm proposed in [12], the isolated ontologies are added to the Master Ontology. The result is displayed to a domain expert, who corroborates its correctness and makes the appropriate adjustments.
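A naive version of this step can be sketched with Jena as below; the real process relies on the merge algorithm of [12], which also aligns equivalent concepts instead of simply adding statements, and the file names are hypothetical:

import java.io.FileOutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public final class MicroOntologyMergeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names for the master ontology and one isolated micro-ontology.
        Model master = ModelFactory.createDefaultModel().read("master-ontology.owl");
        Model micro  = ModelFactory.createDefaultModel().read("micro-ontology.owl");

        master.add(micro);   // naive union: copies the new concepts and relations into the master
        master.write(new FileOutputStream("master-ontology-extended.owl"), "RDF/XML");
    }
}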

[Flowchart summary of Figure 4: terms and relationships unknown to the ontology are grouped into related concepts; relationships and associations between the terms are built; micro-ontologies with the new concepts are generated and presented to the user, who makes adjustments and corrections; finally, the new concepts and relationships are added to the master ontology, producing a Master Ontology with new features.]

Figure 4. Addition of Meta-information Process

D. Generation of logical data model

The tasks of the process "Generation of the Logical Data Model" are (see Figure 5):

Generation of a list of involved terms: With the list of correspondences created in the first process, "Meta-Data Extraction", a list of Java beans and their respective ontological equivalents is generated.

Generation of queries to extract terms: Based on the ontological terms, SPARQL statements are generated to extract a sub-ontology from the Master Ontology.
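A minimal sketch of one such statement, again assuming Jena and a hypothetical concept URI, is the following; the real queries are generated automatically from the list of terms:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public final class SubOntologyExtractionSketch {
    public static void main(String[] args) {
        Model master = ModelFactory.createDefaultModel().read("master-ontology.owl");
        // CONSTRUCT copies every statement about the hypothetical Product concept
        // (and about the resources it points to) into a small isolated model.
        String construct =
            "PREFIX ex: <http://example.org/autoparts#> " +
            "CONSTRUCT { ex:Product ?p ?o . ?o ?p2 ?o2 } " +
            "WHERE     { ex:Product ?p ?o . OPTIONAL { ?o ?p2 ?o2 } }";

        QueryExecution qe = QueryExecutionFactory.create(construct, master);
        Model subOntology = qe.execConstruct();   // one isolated sub-ontology for the term
        qe.close();
        subOntology.write(System.out, "TURTLE");
    }
}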

Extraction of terms: The generated statements are executed, and we obtain a set of terms and isolated ontologies which represent a formal definition of the terms involved in the ETL process.

Construction of the new ontology: In order to construct the final ontology, the terms and isolated ontologies are merged with the algorithm presented in [12]. The obtained ontology is a formal data model of the involved data sources.

[Flowchart summary of Figure 5: from the terms and relationships known to the ontology and those newly added, a list of terms to extract is generated; queries and mappings for the extraction are produced; the terms are extracted and stored temporarily; and the ontology data model is constructed from the extracted terms, yielding the ontology-based data model.]

Figure 5. Generation of Logical Data Model

E. Generation of physical data model

As displayed in Figure 6, this process generates a particular physical data schema from the logical data model. For the final data warehouse schema we consider the star and snowflake schemas, which are the most widely used approaches due to their performance and scalability. The tasks of the process are:

Selection of the physical data model: The ETL administrator selects one of the two final physical models, star or snowflake.

Generation of mappings from the logical data model to the physical data model: The object-ontology mapping algorithm is executed to obtain a set of Java beans from the logical data model; then an internal method of each generated Java object is executed, which returns a standard SQL statement that creates the object in an RDBMS. Finally, a data map is created between the Java beans of the logical model and the generated statements.
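The internal method mentioned above can be sketched as follows; the bean class, method names and attributes are illustrative only, not the actual generated code:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical bean obtained from the logical data model by the object-ontology mapping.
public final class EntityBean {
    private final String name;
    private final Map<String, String> attributes = new LinkedHashMap<>();

    public EntityBean(String name) { this.name = name; }

    public void addAttribute(String attribute, String sqlType) {
        attributes.put(attribute, sqlType);
    }

    // Returns a standard SQL statement that creates this element in an RDBMS.
    public String toCreateStatement() {
        StringJoiner columns = new StringJoiner(", ");
        attributes.forEach((a, t) -> columns.add(a + " " + t));
        return "CREATE TABLE " + name + " (" + columns + ")";
    }

    public static void main(String[] args) {
        EntityBean product = new EntityBean("dim_product");
        product.addAttribute("product_key", "INT PRIMARY KEY");
        product.addAttribute("name", "VARCHAR(80)");
        System.out.println(product.toCreateStatement());
    }
}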

Execution of procedures to create the physical model: The SQL statements are executed to create the physical structure of the data warehouse.

Generation of mappings from the data sources to the physical model: The object-ontology mapping algorithm is applied again, and another internal method of each generated Java object is executed, which returns a statement that inserts instances of the element into an RDBMS. In combination with the data map between the Java beans and the creation statements, a new map is generated between the elements of the physical model and the insertion statements.

Generation of procedures for extraction and insertion: Using the list of correspondences between the data sources and their generated Java beans, an internal method of each Java bean is executed which returns an extraction statement, and a script is created with all the extraction statements. With the data map between the physical model and the insertion statements, a second script is created with all the insertion statements.
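A hand-written example of one extraction/insertion pair and its execution is sketched below; in the actual process both statements come from the generated scripts, and the connections, table and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public final class LoadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical source database and target data warehouse connections.
        Connection source    = DriverManager.getConnection("jdbc:mysql://localhost/autoparts", "user", "pwd");
        Connection warehouse = DriverManager.getConnection("jdbc:postgresql://localhost/dw", "user", "pwd");

        String extraction = "SELECT product_id, quantity, amount FROM sales";                          // generated per source
        String insertion  = "INSERT INTO fact_sales (product_key, quantity, amount) VALUES (?, ?, ?)"; // generated per target

        Statement src = source.createStatement();
        ResultSet rows = src.executeQuery(extraction);
        PreparedStatement tgt = warehouse.prepareStatement(insertion);
        while (rows.next()) {                       // move every extracted row into the warehouse
            tgt.setInt(1, rows.getInt(1));
            tgt.setInt(2, rows.getInt(2));
            tgt.setBigDecimal(3, rows.getBigDecimal(3));
            tgt.addBatch();
        }
        tgt.executeBatch();
        source.close();
        warehouse.close();
    }
}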

Execution of procedures: The extraction script is executed to extract the data, and then the insertion script is executed to populate the new data warehouse.

[Diagram summary of Figure 6: the logical data model is mapped to a physical data model (star schema or snowflake schema), and mappings are generated from the data sources to that physical model.]

Figure 6. Generation of Physical Data Model

F. Architecture in layers

Figure 7 shows a conceptual, layered view of the main software components of the proposed approach. It also presents the most important interactions along the methodology between the data and the software components developed.

In the data sources layer, we can observe the different types of information sources involved in the process.

In the information extraction layer, we present the meta-data extracted from the data sources and their mapping to Java beans.

In the abstraction and generalization layer, we have the mapping of Java beans to ontological elements.

In the model layer, we can view the mapping of the ontology to the logical data model and the generation of the physical model from the logical data model.

Finally, in the information integration layer, we can see the data warehouse built under the star or snowflake schema.

IV. IMPLEMENTATION OF A STUDY CASE

A. Problem description

Our case study is about a company that sells auto parts. The task was to create a central data warehouse for four specific organizational topics: inventory, sales, human resources and purchasing. The data sources consist of:

• MySQL database: 35 relations with more than 20,000 records.

• PostgreSQL database: 26 relations with more than 16,000 records.

• XML files: 60 XML files with more than 15,000 records.

According to the proposed methodology, the first step is the design and building of the Master Ontology, achieved in collaboration with the domain experts; the second step consists of applying the methodology and the software tools to build the data warehouse.

[Diagram summary of Figure 7: the data sources layer (databases and files) feeds the information extraction layer, which produces data maps; the abstraction and generalization layer maps these to the ontology; the model layer derives the logical and physical models; and the information integration layer holds the resulting data warehouse.]

Figure 7. Architecture In Layers


As a final result, the data warehouse was created and populated in its entirety using the methodology and software components proposed in Section III. However, when applying the methodology we observed some small failures, such as:

• Elements of data sources that could not be properly extracted.

• Generation of the logical data model with some errors.

• Errors in some of the extraction and insertion statements.

B. Obtained results

After completing the process, we summarize the most relevant results in the following table:

Table I. RESULTS

Feature                                    Result
Integration time                           1 day
Implementation time                        1 week
Percentage of data correctly integrated    95%
Errors in the logic model                  12
Errors in extraction queries               8
Errors in creation queries                 13

where Integration time refers to the time it takes to execute the integration process, i.e., from the moment the data sources are selected until the data store is populated.

Implementation time is the time it took us to fully apply the methodology, from the first contact with the organization until we delivered the final data warehouse.

Percentage of data correctly integrated is the percentage of records that arrived from their source at their correct destination.

Errors in the logic model are the mistakes found in the automatically generated logic model.

Errors in extraction/insertion statements are the mistakes that occurred in the automatically generated extraction and insertion statements.

Errors in creation statements are the mistakes that occurred in the automatically generated creation statements.

V. CONCLUSIONS AND FUTURE WORK

A good definition and development of the ETL process is key to the successful construction of a data warehouse. It is very important to properly characterize and classify the data sources and to create mechanisms that extract, categorize and transform the data in a simple, accurate and consistent manner.

This paper presents a new approach for the generation of a data warehouse. Through the use of ontologies, it is able to integrate data based on its meaning. Moreover, an alternative approach was presented that automates key steps of the process, such as the extraction and categorization of the metadata, the generation of the logical and physical data models, and the generation of routines to extract and insert information.

The proposed methodology is clear, intuitive and easy to follow, and the software components are easy to configure.

Given that in the literature there are no approaches similar to our proposal, we can say that the results are favorable. Regarding the comparison with ETL tools, the implementation time is acceptable, because in general existing approaches have longer implementation times (on some occasions weeks) [24] due to how hard they are to configure and use.

Although the results are encouraging, we are currently working on:

• Taking advantage of the ontology-based data model to perform tasks such as: using reasoners to find trends or patterns in the information; generating customized data sets; generating routines to check data consistency; and managing the evolution of the data.

• Adding support for other data sources to extend the functionality.

REFERENCES

[1] N. Aussenac-Gilles, B. Biebow, and S. Szulman. Modelling the travelling domain from a NLP description with TERMINAE. Workshop on Evaluation of Ontology Tools, European Knowledge Acquisition Workshop, 1:70–78, 2002.
[2] Peter Bartalos and Maria Bielikova. An approach to object-ontology mapping. In Slovak University of Technology, pages 9–16, 2007.
[3] L.B. Cao, C.Z. Zhang, and J.M. Liu. Ontology-based integration of business intelligence. Web Intelligence and Agent Systems, 4:4–10, 2006.
[4] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. Annual Meeting of the Association for Computational Linguistics, 40:169–175, 2002.
[5] D. Ferrucci and A. Lally. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 3:327–348, 2004.
[6] Narasimhaiah Gorla. Features to consider in a data warehousing system. Commun. ACM, 46:111–115, November 2003.
[7] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–220, 1993.
[8] Hilary Cheng, Yi-Chuan Lu, and Calvin Sheu. An ontology-based business intelligence application in a financial knowledge management system. Expert Systems with Applications, 36:3614–3622, 2009.
[9] Horacio Saggion, Adam Funk, Diana Maynard, and Kalina Bontcheva. Ontology-based information extraction for business intelligence. The Semantic Web, 4825:843–856, 2007.
[10] I. Horrocks, D. Fensel, J. Broekstra, S. Decker, M. Erdmann, C. Goble, F. van Harmelen, M. Klein, S. Staab, R. Studer, and E. Motta. OIL: The Ontology Inference Layer. Technical report, Vrije Universiteit Amsterdam, Faculty of Sciences, 2000.
[11] W. H. Inmon. Building the Data Warehouse, 3rd Edition. John Wiley & Sons, Inc., New York, NY, USA, 2002.
[12] J.A. Ramos. Mezcla automática de catálogos electrónicos. PhD thesis, Universidad Politécnica de Madrid, 2001.
[13] R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley, 2004.
[14] O. Lassila and R. Swick. Resource Description Framework (RDF) model and syntax specification, 2002.
[15] Longbing Cao, Jiarui Ni, and Dan Luo. Ontological engineering in data warehousing. Frontiers of WWW Research and Development, 3841:923–929, 2006.
[16] S. Luke and J. Heflin. SHOE 1.01 proposed specification, SHOE Project, 2000.
[17] R. Mizoguchi, J. Vanwelkenhuysen, and M. Ikeda. Task ontology for reuse of problem solving knowledge. Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, pages 46–59, 1995.
[18] Oracle. Oracle XE documentation, July 2011.
[19] Robert E. Kent. Conceptual Knowledge Markup Language version 0.2.
[20] Ascential Software. DataStage product family architectural overview.
[21] S. Staab, R. Studer, H.P. Schnurr, and Y. Sure. Knowledge processes and ontologies. IEEE Intelligent Systems, 16(1), 2001.
[22] Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, 284(5):34–43, 2001.
[23] Panos Vassiliadis, Alkis Simitsis, Panos Georgantas, and Manolis Terrovitis. A framework for the design of ETL scenarios. In CAiSE'03, pages 520–535, 2003.
[24] Len Wyatt, Brian Caufield, and Daniel Pol. Principles for an ETL benchmark. In Raghunath Nambiar and Meikel Poess, editors, Performance Evaluation and Benchmarking, volume 5895 of Lecture Notes in Computer Science, pages 183–198. Springer Berlin Heidelberg, 2009.