Chapter 1 Introduction to Data Warehousing System (source: shodhganga.inflibnet.ac.in/bitstream/10603/98388/11/11_chapter1.pdf)

1.1 Introduction

1.2 Need for Data Warehousing

1.3 Evolution of Data Warehousing

1.4 Definitions of Data Warehouse

1.4.1 Characteristics of Data Warehousing

1.5 Goals and Applications of Data Warehousing

1.6 Future of Data Warehousing

1.7 Importance of Data Warehousing

1.8 Business Intelligence and Data Warehousing (BIDWH)

1.9 Issues of Data Warehousing Design

1.9.1 Top Down Design Approach

1.9.2 Bottom Up Design Approach

1.10 Basic Architecture of Data Warehousing System

1.11 Components of Data Warehouse System and Their Problems

1.11.1 Wrapper/Monitor

1.11.2 Integrator

1.12 Motivation

1.13 Objectives of the Research

1.14 Research Methodology

1.15 Organization of the Thesis

1.16 Conclusions

Chapter summary

1.1 Introduction

In the 1990s, as business grew more complex, corporate offices spread

across the globe, and competition became fiercer, business executives became

desperate for information to be competitive and improve the bottom line. The

operational computer systems did provide information to run day-to-day operations,

but what the executives needed were different kinds of information that could be

readily used to make strategic decisions. They wanted to know where to build the

next warehouse for their product and which markets they should strengthen. The

operational systems, important as they were, could not provide strategic information.

Rapidly changing market dynamics, competitive pressure, globalization and

other similar factors forced businesses to review their structures, approaches and

strategies. Therefore, businesses were compelled to look into new ways of getting

information for dynamic markets.

During the last decade, interest in analyzing data has increased

significantly because of the competitive advantage that data offers in the decision-making process.

A key to survival in the business world is being able to analyze, plan and react to

changing business conditions as fast as possible. Many organizations own billions of

bytes of data, but they suffer from different problems because data are spread over

different computer systems, data from different sources are incompatible, data are

available too late, etc.

To solve these problems, new concepts and tools have evolved

into an information technology called Data Warehousing. The Data Warehouse

(DW) can meet informational needs of knowledge workers and can provide strategic

business opportunities by allowing customers and vendors to access corporate data.

A large retail store collects vast amounts of information about its

day-to-day activities. The same retail store probably collects other types of information as

well, such as customer data, inventory data, advertisement data, employee data, etc.

An increasing number of organizations are realizing that the vast amounts of

collected data can and must be used to guide their business decisions [116][12].

Typically, the management of the organization wants to answer complex analytical

queries based on the collected data.

Building a Data Warehouse provides a number of benefits such as:

• The processing of analytical queries is simplified because only the data

warehouse needs to be accessed.

• The warehouse data can keep a historical record of the various source data.

By retaining all of this data, the current activity of an organization can be

compared against history and can also be used for forecasting the future

activities of an organization.

1.2 Need for Data Warehousing

From a business perspective, in order to survive and succeed in today’s

highly competitive global environment, business users demand business answers

mainly because [12],[79],[132]:

• Decisions need to be made quickly and correctly, using all available data.

• Users are business domain experts, not computer professionals.

• The amount of data collected doubles every 18 months, which affects

response time and the sheer ability to comprehend its content.

• Competition is heating up in the areas of business intelligence.

In addition, the necessity for data warehouses has increased as organizations

distribute control away from the middle-management layer that has traditionally

provided and screened business information. As users depend more on information

obtained from information technology systems, the need to provide an information

warehouse for the remaining staff to use becomes more critical.

There are several technology reasons for the existence of data warehousing.

First, the data warehouse is designed to address the incompatibility of informational

and operational transactional systems. These two classes of information systems are

designed to satisfy different, often incompatible, requirements. At the same time, the

IT infrastructure is changing rapidly and its capabilities are increasing.

The analysis carried out in [143] showed that the percentage of data stored in

digital form increased drastically after the 1980s. The study documented the rise of

digitization. The volume of data is growing at an exponential rate. Several research

groups have been studying the amount of data that enterprises and individuals are

generating, storing, and consuming in the world economy. All analyses, each

with different methodologies and definitions, agree on one fundamental point—the

amount of data in the world has been expanding rapidly and will continue to grow

exponentially for the foreseeable future despite there being a question mark over

how much data we, as human beings, can absorb.

Table 1.1 (a) Growth of Data Volume (By Decade)

Year    Data volume (Exabytes)    Growth in Percentage
1970                        39    --
1980                       127    325
1990                      1997    1572
2000                     69877    3499
2010                   4453761    6373

Table 1.1 (b) Yearly Increment in Data Volume

Year    Data volume (Exabytes)    Growth in Percentage
2001                    103738    148
2002                    161324    155
2003                    244737    151
2004                    408644    166
2005                    599763    146
2006                    867822    144
2007                   1183545    149
2008                   1779867    151.38
2009                   2737963    153.83
2010                   4453761    162.67
2011                   7622324    171.14
2012                  12368961    162.27
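
The "Growth in Percentage" column of Table 1.1 (a) appears to express each volume as a percentage of the preceding decade's volume, rather than as a percentage increase. This reading is inferred from the printed figures and is not stated in the text. A minimal Python sketch of that calculation follows; the function names are illustrative, and truncation to a whole number is an assumption chosen because it reproduces the printed values.

```python
# Decade values copied from Table 1.1 (a): year -> data volume in exabytes.
DECADE_VOLUME = {1970: 39, 1980: 127, 1990: 1997, 2000: 69877, 2010: 4453761}


def ratio_percent(previous: int, current: int) -> int:
    """Current volume as a whole-number percentage of the previous volume.

    Assumption: the value is truncated, not rounded.
    """
    return current * 100 // previous


def growth_column(volumes: dict) -> dict:
    """Compute the ratio for every year that has a predecessor in the table."""
    years = sorted(volumes)
    return {
        year: ratio_percent(volumes[prev], volumes[year])
        for prev, year in zip(years, years[1:])
    }


if __name__ == "__main__":
    for year, pct in growth_column(DECADE_VOLUME).items():
        print(year, pct)
```

Under this reading, a value of 325 for 1980 means that the 1980 volume was about 3.25 times the 1970 volume, which corresponds to an increase of roughly 225 percent.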

Figure 1.1 Growth of Data (Yearly)

It is estimated that the percentage of data stored in digitized form increased

by 35 percent in the year 2000. It also increased by 59 percent in the year 2012

compared with the year 2011 (Figure 1.1). It is also found that data generation is

increasing much faster than the world’s data storage capacity is

expanding, pointing strongly to a continued widening of the gap between the two

(Table 1.1 (a) and (b)). Data can create significant value for the world economy,

enhancing the productivity and competitiveness of private sector

companies and the public sector and creating substantial economic surplus for

consumers [107],[116],[170],[177].

The generation of data is growing exponentially (Figure 1.1) and advancing

technology may allow the global economy to store and process ever greater

quantities of data. Human beings may have limits in their ability to consume and

understand huge volumes of data. Despite these apparent limits, there are ways to help

organizations and individuals to process, visualize, and synthesize meaning from

such data. For instance, more sophisticated visualization techniques and algorithms,

including automated algorithms, enable people to see patterns in large amounts of

data and help them to unearth the most pertinent insights.

[Figure 1.1 content. Line chart of Data Volume (Exabytes), from 0 to 14x10^6, against Year, from 2001 to 2012.]

The price of computer processing speed, measured in million instructions per second

(MIPS), continues to decline, while the power of microprocessors doubles every four

years [175]. In addition, computing capacity is going to rise for the following

reasons:

• The price of digital storage is rapidly dropping.

• Network bandwidth is increasing, while the price of high bandwidth is

decreasing.

• The workplace is increasingly heterogeneous in terms of hardware and

software and

• Legacy systems need to, and can, be integrated with new applications.

1.3 Evolution of Data Warehousing

In the early days of computing, disk storage was extremely expensive, so data

was stored on magnetic tape and had to be read sequentially from flat files [65][116].

The management of data was tightly integrated with the application system and file

description was also stored within each application program. Every application had

its own private files with little opportunity to share data outside their own

applications and required the developer to start from scratch by designing file

formats. As computers grew in capability, this trade-off became increasingly

unnecessary and a number of general-purpose database systems emerged.

As the cost of disk storage fell, opportunities to store data for real-time

access arose. Specialized Data Base Management System (DBMS) software

emerged during the 1960s for the sole purpose of managing data. Application

systems were then able to focus on the user interface, screen navigation, data

validations etc. and could leave the data management tasks to the specialized DBMS

technology. The application system simply had to call the DBMS when it needed to

read or store data.

In 1970, Edgar F. Codd set the foundations of the relational database

model. Codd’s relational model introduced the notion of data independence, which

separated the physical representation of data from the logical representation

presented to applications. Data could be moved from one part of the disk to another

or stored in a different format without causing applications to be rewritten. RDBMS

technology is fairly stable now. Improvements have been made, but the fundamentals of the

technology have not changed.

Figure 1.2 Evolution of Data Warehouse

Before 1960: Flat files were used to store data.

During the 1960s: DBMS software emerged.

During the 1970s: RDBMS software emerged; Dr. E. F. Codd set the foundations of the RDBMS.

In the 1980s: Business operations became decentralized geographically and DDBMS emerged. OLTP technology was used to capture and store business user needs.

In the 1990s: Organizations began to achieve competitive advantage by building data warehouse systems. OLAP technology was used to analyze data stored in the data warehouse.

In the 1980s, business operations became decentralized geographically

[12],[32]. Competition increased at the global level. Customer demands and market

needs favored a decentralized management style. Rapid technological change created

low-cost microcomputers. Local Area Networks (LANs) became the basis for

computerized solutions. The large number of applications based on DBMSs and the

need to protect investments in centralized DBMS software made the notion of data

sharing attractive. Distributed Database Management Systems (DDBMSs) became

popular. Online transaction processing (OLTP) systems were also developed to capture

and store business operations data. Their most obvious shortcoming was the

inability to address business users’ need to access stored transaction data and

management’s decision support requirements. The OLTP systems did not address history

and summarization requirements or support integration needs, that is, the ability to analyze

data across different systems.

Figure 1.3 Levels of Sophistication

In the 1990s, organizations began to achieve competitive advantage by

building data warehouse systems. Data warehousing has become the most feasible

solution to optimize and manipulate data. The current trend is to gather the data that

is needed in an optimized database regardless of the number of different applications

and different platforms that are used to generate the source data. The data warehouse,

by bringing together data stored in disparate systems, is a return to a centralized

concept. The main difference is that data warehousing enables enterprise and local

decision support needs to be met while allowing independent data islands to flourish.

1.4 Definitions of Data Warehouse

“A Data Warehouse is a subject oriented, integrated, nonvolatile, and time-

variant collection of data in support of management’s decisions [65]”.

“A data warehouse is a copy of transaction data specifically structured for

query and analysis”. This means that users access the data as they wish for

analysis by querying [115][116].

“A data warehouse combines various data sources into a single source for

end user access. End user can perform ad-hoc querying, analysis, data mining and

visualization of warehouse information. The goal of data warehouse is to establish a

data repository that makes operational data accessible in a form that is readily

acceptable for decision support and other applications [41]”

A Data Warehouse is divided into three parts [51]:

1. Data Management Layer: Data warehouse itself, which contains data and

its associated software.

2. Data Staging Layer: Data acquisition software, which extracts data from

legacy systems and external sources, consolidates and summarizes the data,

and loads it into the Data Warehouse.

3. Data Accessing Layer: Client software, which allows users to access and

analyze data in the Data Warehouse.

“The data in the data warehouse is: Separate, Available, Integrated, Time

stamped, Subject oriented, Nonvolatile and Accessible [83]”.

1.4.1 Characteristics of Data Warehousing

The first characteristic is subject orientation, which means that data is stored by business

subjects, which differ from enterprise to enterprise. These subjects are critical for the

enterprise. For a manufacturing company, sales, shipments and inventory are

critical business subjects, whereas in a retail store, sales at the check-out counter is a

critical subject.

In operational systems, we store data by individual applications.

For example, for an order processing application, we keep the data for that particular

application. This application provides the data to all the functions for entering orders,

checking stock, verifying the customer’s credit, and accessing the order for shipment. But

these data sets contain only the data that is needed for those functions relating to this

particular application. In striking contrast, in the data warehouse, data is stored by

subjects, not by applications. For example, in the data warehouse for an insurance

company, claims data are organized around the subject of claims and not by

individual applications of general insurance and life insurance.
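
The contrast between application-oriented and subject-oriented storage can be sketched in a few lines of Python. The record layout, field names and sample values below are hypothetical and serve only to illustrate the regrouping; they are not taken from any system described in this chapter.

```python
from collections import defaultdict

# Hypothetical operational records, each held by the application that created it.
operational_records = [
    {"application": "general_insurance", "subject": "claims", "id": "G-101", "amount": 1200},
    {"application": "life_insurance", "subject": "claims", "id": "L-007", "amount": 5000},
    {"application": "general_insurance", "subject": "policy", "id": "P-330", "amount": 0},
]


def organize_by_subject(records):
    """Regroup records around business subjects, ignoring the source application."""
    by_subject = defaultdict(list)
    for record in records:
        by_subject[record["subject"]].append(record)
    return dict(by_subject)


warehouse = organize_by_subject(operational_records)
# All claims now sit together, whichever application produced them.
claim_ids = [record["id"] for record in warehouse["claims"]]
```

Here the claims from the general insurance and life insurance applications end up under the single subject "claims", which mirrors the insurance example given above.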

Figure 1.4 The Data Warehouse is Subject Oriented

The second characteristic of the data warehouse is that it is integrated. The

integrator component of a data warehouse integrates the data fetched from different

data sources. For proper decision making, we need to pull together all the relevant

data from the various applications and remove the inconsistencies. The data in the

data warehouse comes from several disparate operational systems. Because these are

disparate applications, the operational platforms and operating systems could be

different. The file layouts, character code representations and field naming

conventions could all be different. The data is converted, reformatted, resequenced,

summarized and so forth. We have to standardize the various data elements and

make sure of the meanings of data names in each source application. For example,

naming conventions, codes, data attributes and measurements would need

standardization. The result is that data, once it resides in the data warehouse, has a

single physical corporate image.
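
The kind of standardization described above can be illustrated with a small Python sketch. Everything in it is an assumption made for the example: the two source layouts, the field names, the gender codes and the currency units (rupees and paise) are hypothetical, and a real integrator would need far more rules.

```python
# Hypothetical source conventions; none of these names come from the thesis.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
FIELD_NAMES = {
    "cust_nm": "customer_name", "CustomerName": "customer_name",
    "sex": "gender", "gndr": "gender",
    "bal_inr": "balance", "balance_paise": "balance",
}
# Divisor that converts each source field's unit to the warehouse unit (rupees).
UNIT_DIVISORS = {"bal_inr": 1, "balance_paise": 100}


def standardize(record: dict) -> dict:
    """Map one source record onto the warehouse's names, codes and units."""
    clean = {}
    for field, value in record.items():
        target = FIELD_NAMES.get(field, field)  # unify naming conventions
        if target == "gender":
            value = GENDER_CODES[str(value).strip().lower()]  # unify codes
        elif field in UNIT_DIVISORS:
            value = float(value) / UNIT_DIVISORS[field]  # unify measurement units
        clean[target] = value
    return clean


source_a = {"cust_nm": "A. Rao", "sex": "male", "bal_inr": 1500}
source_b = {"CustomerName": "B. Shah", "gndr": 0, "balance_paise": 250050}
integrated = [standardize(source_a), standardize(source_b)]
```

After standardization both records share the same field names, the same gender codes and the same unit of measurement, so they can be stored side by side in the warehouse.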

[Figure 1.4 content. Operational applications: Order Processing, Consumer Loans, Customer Billing, Accounts Receivable, Claims Processing, Savings Accounts. Data warehouse subjects: Sales, Product, Customer, Account, Claims, Policy.]

Figure 1.5 The Data Warehouse is integrated

The third characteristic of a data warehouse is that it is non-volatile. Operational data

is regularly accessed and manipulated one record at a time. Data is updated in the

operational environment as a regular matter of course, but data warehouse data

exhibits a very different set of characteristics. The data in the data warehouse is not

intended to run the day-to-day business. When we want to process the next order

received from a customer, we do not look into the data warehouse to find the current

stock status. Data warehouse data is loaded and accessed, but it is not updated.

Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static

format. When subsequent changes occur, a new snapshot record is written. In doing

so, a historical record of data is kept in the data warehouse.
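
The load-and-read behaviour described above can be sketched as an append-only store. This is a minimal illustration in Python, assuming a hypothetical stock-level snapshot; the class and field names are not from the thesis.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a stored snapshot cannot be modified in place
class StockSnapshot:
    item: str
    quantity: int
    as_of: date


class SnapshotStore:
    """Append-only store: supports load and read, but no update or delete."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot: StockSnapshot) -> None:
        self._rows.append(snapshot)

    def history(self, item: str) -> list:
        return [row for row in self._rows if row.item == item]


store = SnapshotStore()
store.load(StockSnapshot("widget", 120, date(2012, 1, 31)))
# The stock level changed later: a new snapshot is written and the old one is kept.
store.load(StockSnapshot("widget", 95, date(2012, 2, 29)))
quantities = [row.quantity for row in store.history("widget")]
```

Because earlier snapshots are never overwritten, the store accumulates the historical record that the paragraph above refers to.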

The fourth characteristic of a data warehouse is time variant. The data in

the warehouse is meant for analysis and decision making. If a user is looking at the

buying pattern of a specific customer, the user needs data not only about the current

purchase, but about past purchases as well. When a user wants to find out the reason

for a drop in sales in a particular region, the user needs all the sales data for that

region over a period extending back in time.

[Figure 1.5 content. Data from the Savings Account, Checking Account and Loans Account applications is integrated in the data warehouse under the single subject Account.]

Figure 1.6 The Data Warehouse is Nonvolatile

Time variancy implies that every unit of data in the data warehouse is

accurate as of some moment in time. In some cases, a record is time stamped. In

other cases, a record has a date of transaction. But in every case, there is some form of

time marking to show the moment in time during which the record is accurate.
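
Time variancy can be illustrated by an "as of" lookup over time-stamped records. The following Python sketch assumes a hypothetical price history for one product, sorted by effective date; the data and the function name are illustrative only.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical time-stamped price records for one product, sorted by effective date.
price_history = [
    (date(2011, 1, 1), 40.0),
    (date(2011, 7, 1), 42.5),
    (date(2012, 3, 1), 45.0),
]


def price_as_of(history, when: date):
    """Return the price that was accurate at `when`, or None if none existed yet."""
    effective_dates = [effective for effective, _ in history]
    position = bisect_right(effective_dates, when)
    if position == 0:
        return None
    return history[position - 1][1]
```

For example, `price_as_of(price_history, date(2011, 12, 31))` returns the price that took effect on 1 July 2011, because that record was the accurate one at the end of 2011.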

1.5 Goals and Applications of Data Warehousing

Data warehousing is primarily used to archive data that is of the nature of

business knowledge [12]. A primary requirement is the fast analysis of this shared

data, resulting in multidimensional views of the data, which in turn results in

knowledge acquisition. OLAP comes into play when it is required to analyze data

stored in a warehouse, so as to generate complex results, in a time-constrained

environment, that is, just-in-time information. Most data warehouses support what is called

ad-hoc querying, which implies that any combination of complex queries can be

executed against the stored data.
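
A multidimensional view with ad-hoc grouping can be sketched as follows. This Python example is only an illustration of the idea, not of any particular OLAP product; the sales facts and the dimension names (region, product, quarter) are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
SALES = [
    ("North", "Tea", "Q1", 100),
    ("North", "Tea", "Q2", 150),
    ("North", "Coffee", "Q1", 80),
    ("South", "Tea", "Q1", 60),
    ("South", "Coffee", "Q2", 90),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, by):
    """Total the amount for any chosen combination of dimensions."""
    indexes = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[i] for i in indexes)
        totals[key] += fact[-1]
    return dict(totals)


# Two different ad-hoc views over the same stored facts.
by_region = roll_up(SALES, ["region"])
by_product_quarter = roll_up(SALES, ["product", "quarter"])
```

The same stored facts answer both questions; only the combination of dimensions passed to `roll_up` changes, which is the sense in which the querying is ad hoc.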

The significant goals of data warehousing are [32][41]:

• To provide an environment to store and maintain an organization’s

historical information,

• To be an adaptive and resilient source of information,

• To be the foundation for decision making and

• To provide analysis as well as reporting through OLAP tools.

[Figure 1.6 content. OLTP databases support Read and Add/Change/Delete operations; the data warehouse supports Loads and Read operations.]

Here are some examples of applications that can use the OLAP option to realize valuable

gains in functionality and performance [41][12][13]:

• Planning applications allow organizations to predict outcomes. They

generate new data using predictive analytical tools such as models,

forecasts, aggregation, allocation, and scenario management. Some

examples of this type of application are corporate budgeting, financial

analysis, and demand planning systems.

• Budgeting and financial analysis systems allow organizations to analyze

past performance, build revenue and spending plans, manage to attain

profit goals and model the effects of change on the financial plan.

Management can determine spending and investment levels that are

appropriate for the anticipated revenue and profit levels. Financial

analysts can prepare alternative budgets and investment plans contingent

on factors such as fluctuations in currency values.

• Demand planning systems allow organizations to predict market demand

based on factors such as sales history, promotional plans and pricing

models. They can model different scenarios that forecast product demand

and then determine appropriate manufacturing goals.

As these points highlight, the data processing that is required to answer

analytical questions is fundamentally different from the data processing required to

answer transactional questions. The users are different, their goals are different, their

queries are different and the type of data that they need is different. A relational data

warehouse enhanced with the OLAP option provides the best environment for data

analysis.

OLAP holds several benefits for businesses:

1. OLAP helps managers in decision-making through the multidimensional

data views that it is capable of providing, thus increasing their

productivity.

2. OLAP applications are self-sufficient owing to the inherent flexibility

provided to the organized databases.

3. It enables simulation of business models and problems through extensive

usage of analysis-capabilities.

4. In conjunction with data warehousing, OLAP can be used to provide

reduction in the application backlog, faster information retrieval and

reduction in query drag.

1.6 Future of Data Warehousing

Data warehouses have become indispensable for many enterprises, as they

store and analyze large amounts of structured business data [177][115][12]. A data

warehouse’s architecture consolidates important business events, for example sales,

and models them as facts. These facts are characterized by a number of hierarchical

dimensions like time or products with associated numerical measures like sales price.

The world of higher education as well as business in general is becoming

increasingly competitive. Those institutions and businesses that realize the potential

benefit of the information resource first will gain a competitive advantage. As stated

in the closing statement of the White Paper by E. F. Codd and Associates, the quality

of strategic business decisions made as a result of OLAP is significantly higher and

more timely than those made traditionally. Ultimately, an enterprise’s ability to

compete successfully and to grow and prosper will be in direct correlation to the

quality, efficiency, effectiveness and pervasiveness of its OLAP capability. It is,

therefore, incumbent upon IT organizations within enterprises of all sizes, to prepare

for and to provide rigorous OLAP support for their organizations.

• As a DW becomes a mature part of an organization, it is likely that it will

become as “anonymous” as any other part of the Information System (IS).

• One challenge to face is coming up with a workable set of rules that

ensure privacy as well as facilitating the use of large data sets.

• Another is the need to store unstructured data such as multimedia, maps

and sound.

• The growth of the internet allows integration of external data into a DW,

but its varying quality is likely to lead to the evolution of third-party

intermediaries whose purpose is to rate data quality.

1.7 Importance of Data Warehousing

The amount of information produced by today’s large-scale enterprises has

been growing rapidly. Operational sources, such as auction databases, inventory and order

processing systems, are rapidly and continuously generating new data. In order

to make intelligent business decisions, complex analytical queries are issued and

answered across data sources [117]. For instance, a large retail store’s data

warehouse may collect information from its regional inventory and sales databases.

Once the data warehouse is built, it is used to answer analytical or decision support

queries. Such a data warehouse may be used to answer queries such as:

• In which stores and for which months was a particular item in high demand

but short in supply?

• How much of a specific item did we sell last year, last month and last week in

store XYZ, and how do sales of this item compare across stores?

• What internal factors influenced sales of a specific item, and what external

factors (such as weather) influenced them? and

• How can we help suppliers to reduce their cost?
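
The first question in the list above can be expressed as a query over a warehouse table. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table name, its columns, the sample rows and the threshold of 200 units for "high demand" are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_item_facts ("
    "store TEXT, month TEXT, item TEXT, "
    "units_requested INTEGER, units_available INTEGER)"
)
conn.executemany(
    "INSERT INTO monthly_item_facts VALUES (?, ?, ?, ?, ?)",
    [
        ("XYZ", "2012-01", "umbrella", 40, 100),
        ("XYZ", "2012-07", "umbrella", 300, 120),
        ("ABC", "2012-07", "umbrella", 250, 260),
        ("ABC", "2012-08", "umbrella", 280, 90),
    ],
)

# "High demand" is taken here to mean at least 200 units requested; that
# threshold is an arbitrary choice for the example.
shortages = conn.execute(
    "SELECT store, month FROM monthly_item_facts "
    "WHERE item = ? AND units_requested >= ? "
    "AND units_available < units_requested "
    "ORDER BY store, month",
    ("umbrella", 200),
).fetchall()
conn.close()
```

The result lists the store and month combinations in which demand for the item was high but the available stock fell short of it.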

Hence, the importance of data warehousing may be understood with

reference to the following examples.

Indian Railways needs a data warehouse for:

− Studying booking pattern of trains,

− Changing quota distribution to ensure maximum utilization,

− Changing train frequency depending on traffic patterns so as to provide

maximum revenue and

− Predicting a season during which additional coaches should be added or

additional trains should be scheduled so as to meet the passenger’s

demand and to maximize revenues.

Such analysis enables railways to tackle the competition posed by low cost

travelling options offered by competitors.

A data warehouse in government places powerful decision-making tools in

the hands of end users in order to facilitate prompt decision making [41][42]. It also

reduces the resources (time and manpower) spent on managing the volume

and variety of databases handled by their informatics centers. Such a data warehouse

contains a number of data marts.

1. Data mart for agriculture stores the information of land-holding patterns

across the villages in state/country. It can be used to analyze information

about land-holding amongst citizens, institutions, males, females,

scheduled castes and scheduled tribes, etc.

2. Data mart for amenities contains the census data of village amenities. It

contains information on the availability of amenities like education, health,

drinking water, transportation, communication and irrigation. Different

types of analysis can be done, e.g. village amenities analysis, irrigation

analysis etc.

3. Weather forecast data mart holds statistical information on daily levels

of rainfall across various weather stations in the state/country. This helps the

concerned authority to plan water supply to various cities and to use

various models to forecast rainfall levels.

4. The health status data mart stores the information on various health

camps conducted across state/country to detect and cure malaria patients.

This has vital information like number of people suffering from malaria,

deaths caused due to malaria, source of malaria infection, demographic

information of malaria patients, etc. Using the data warehouse, the end

users will be able to plan various precautionary measures to reduce the

number of people suffering from malaria.

Using data mining tools, knowledge is discovered by applying mining

rules to data warehouse data. Knowledge discovery is the creation of knowledge

from structured and unstructured data. Data mining discovers interesting knowledge

from large amounts of data stored in databases, data warehouses, or other

information repositories. Before data mining, various processes take place on the

data to purify it. The data mining step may interact with the user or a knowledge

base. The interesting patterns are presented to the user and may be stored as new

knowledge in the knowledge base. Figure 1.7 represents the process of

knowledge discovery.

Figure 1.7 Process of Knowledge Discovery

By performing data mining, interesting knowledge, regularities, or high-

level information can be extracted from databases and viewed or browsed from

different angles. The discovered knowledge can be applied to decision making,

process control, information management and query processing. Therefore, data

mining is considered one of the most important frontiers in database and information

systems and one of the most promising interdisciplinary developments in

information technology.
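
The stages of knowledge discovery (cleaning and integration, selection, data mining and pattern evaluation) can be sketched as a chain of small functions. In the Python sketch below, the transactions are hypothetical and simple pair counting stands in for the mining step; the thesis does not prescribe a particular mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw transactions from two source databases; None marks a bad record.
source_a = [["bread", "milk"], ["bread", "butter", "milk"], None]
source_b = [["milk", "bread"], ["butter"], ["bread", "milk", "jam"]]


def clean_and_integrate(*sources):
    """Drop unusable records and merge the sources into one warehouse list."""
    return [sorted(set(t)) for source in sources for t in source if t]


def select_relevant(warehouse, min_items=2):
    """Keep only the task-relevant data: baskets with at least `min_items` items."""
    return [t for t in warehouse if len(t) >= min_items]


def mine_pairs(transactions):
    """Data mining step: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(t, 2))
    return counts


def evaluate(patterns, min_support):
    """Pattern evaluation: keep pairs seen in at least `min_support` baskets."""
    return {pair: n for pair, n in patterns.items() if n >= min_support}


warehouse = clean_and_integrate(source_a, source_b)
knowledge = evaluate(mine_pairs(select_relevant(warehouse)), min_support=3)
```

With this sample data, the only pattern that survives evaluation is that bread and milk are bought together, which is the kind of result that could then be stored in a knowledge base.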

[Figure 1.7 content, stages in order: Databases; Data Cleaning and Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Patterns; Pattern Evaluation; Knowledge Discovery.]

1.8 Business Intelligence and Data Warehousing (BIDWH)

Success depends on how quickly and in what manner a company responds

to rapidly changing market conditions. Business Intelligence (BI) solutions empower

organizations with the insight necessary to make better decisions faster. BI solutions

channel data and processes into a single source of the truth, providing every

employee with an accurate view of an organization and actionable items based on

clearly defined key performance indicators.

The rapid pace of today’s business environment has made Business

Intelligence (BI) systems indispensable to an organization’s success. BI systems turn

a company's raw data into useable information that can help management identify

important trends, analyze customer behavior and make intelligent business decisions

quickly. Over the past few years, business intelligence systems have been used to

understand and address back office needs such as efficiency and productivity. Now,

organizations are increasingly using BI to analyze customer behavior, understand

market trends, and search for new opportunities.

Figure 1.8 Business Intelligence Process

BI relies on data warehousing, making cost-effective storage and management

of warehouse data critical to any BIDWH solution. Without an effective data

warehouse, organizations cannot extract the data required for information analysis in

time to facilitate expedient decision-making. The ability to obtain information in

real-time has become increasingly critical in recent years because decision-making

cycle times have been drastically reduced. Competitive pressures require businesses

to make intelligent decisions based on their incoming business data—and do it

quickly. Simply put, the ability to turn raw data into useful information in a timely

manner can add hundreds of thousands—up to millions—of dollars to an

organization’s bottom line.

1.9 Issues of Data Warehousing Design

A data warehouse is a single data repository where data from multiple data

sources is integrated for online analytical processing (OLAP) [65][115].

This implies that a data warehouse needs to meet the requirements of all the business

processes within the entire organization. Thus, data warehouse design is a highly

complex, lengthy and therefore error-prone process. Furthermore, business analytical

tasks change over time, which results in changes in the requirements for the systems.

Therefore, data warehouse and OLAP systems are rather dynamic and the design

process is continuous.

In industry, data warehouse design takes an approach different from view

materialization. It sees data warehouses as database systems with

special needs such as answering management-related queries. The focus of the design

becomes how the data from multiple data sources should be extracted, transformed

and loaded (ETL) to be organized in a database as the data warehouse. There are two

dominant approaches, the “top-down” approach [65] and the “bottom-up” approach

[115].
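
The extract, transform and load steps mentioned above can be sketched end to end with Python's standard library. The two sources, their layouts and date formats, and the `sales` table are hypothetical; the sketch only shows the shape of an ETL flow, not a production design.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different layouts and date formats.
ORDERS_CSV = "order_id,order_date,amount\n1,2012-01-15,250.00\n2,2012-01-16,99.50\n"
BILLING_ROWS = [{"invoice": 9, "date": "17/01/2012", "total": "40"}]


def extract():
    """Read raw rows from each source, tagged with the source name."""
    yield from (("orders", row) for row in csv.DictReader(io.StringIO(ORDERS_CSV)))
    yield from (("billing", row) for row in BILLING_ROWS)


def transform(source, row):
    """Bring a raw row to one common layout: (source, id, ISO date, amount)."""
    if source == "orders":
        return source, int(row["order_id"]), row["order_date"], float(row["amount"])
    day, month, year = row["date"].split("/")
    return source, int(row["invoice"]), f"{year}-{month}-{day}", float(row["total"])


def load(conn, rows):
    """Write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE sales (source TEXT, id INTEGER, sale_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, [transform(source, row) for source, row in extract()])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
dates = [r[0] for r in conn.execute("SELECT sale_date FROM sales ORDER BY sale_date")]
```

Once loaded, rows from both sources share one layout and one date format, so a single query can total or order them together.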

1.9.1 Top Down Design Approach

In the “top-down” design approach, a data warehouse is defined as a

subject-oriented, time-variant, non-volatile and integrated data repository for the

entire enterprise [115]. Data from multiple sources are validated, reformatted and

stored in a normalized (up to 3NF) database as the data warehouse. The data

warehouse stores “atomic” data, the data at the lowest level of granularity, from

where dimensional data marts can be built by selecting the data needed for specific

business subjects or specific departments. The approach is a data driven approach as

the data is gathered and integrated first and then business requirements by subjects

for building data marts are formulated. The advantage of this approach is that it

provides a single integrated data source, thus data marts built from it will have

consistency when they overlap. The diagram of Top Down approach is as shown in

Figure 1.9.

Figure 1.9 Top Down Design Approach

With this approach it is possible to apply centralized rules and control, and quick
results may be seen if it is implemented in iterations. The disadvantage of this
method is that the initial effort, cost and time for implementing a data warehouse
are significant. It also needs a high level of cross-functional skills. The
development time from building the data warehouse to having the first data mart
available to the users is substantial, leading to a late Return On Investment (ROI).

1.9.2 Bottom-Up Design Approach

In the “Bottom-Up" approach, a data warehouse is defined as “a copy of

transaction data specifically structured for query and analysis" [115], namely the star

schema. In this approach, data marts are created first to provide reporting and

analytical capabilities for specific business processes (or subjects). Thus it is

considered to be a business driven approach in contrast to Inmon's data driven

approach. Data marts contain the lowest grain data and, if needed, aggregated data

too. Instead of a normalized database for the data warehouse, a denormalised

dimensional database is adopted to meet the information delivery requirements of

data warehouses. Using this approach, in order to use the set of data marts as the

enterprise data warehouse, data marts should be built with conformed dimensions in

mind, meaning that common objects are represented the same in different data marts.

The conformed dimensions link the data marts to form a data warehouse, which is

usually called a virtual data warehouse. The advantage of the “bottom-up" design

approach is that it has quick ROI, as creating a data mart, a data warehouse for a

single subject, takes far less time and effort than creating an enterprise-wide data

warehouse. The risk of failure is also lower. This approach is inherently

incremental and allows the project team to learn and grow.

However, the independent development by subjects can result in inconsistencies if
common objects are not integrated and are updated at different times in the
individual data marts. Although the method provides guidelines, such as a planning
stage for the data warehouse, it does not provide effective techniques, so it is not
guaranteed that the method resolves the integrity problem in practice [121][123].
This approach can also lead to a proliferation of unmanageable interfaces. The
Bottom-Up design approach is shown in Figure 1.10.
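The role of conformed dimensions described above can be illustrated with a minimal sketch. It assumes two hypothetical data marts (sales and shipments) whose fact rows reference the same product dimension; the names `product_dim`, `sales_facts`, `shipment_facts`, `total_by_category` and `drill_across` are illustrative only and do not come from the cited method.

```python
# Hypothetical conformed dimension shared by two data marts.
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Books"},
}

# Fact rows of two independently built data marts; both use product_dim keys.
sales_facts = [
    {"product_key": 1, "amount": 100.0},
    {"product_key": 2, "amount": 50.0},
    {"product_key": 3, "amount": 20.0},
]
shipment_facts = [
    {"product_key": 1, "units": 10},
    {"product_key": 3, "units": 4},
]


def total_by_category(facts, measure, dimension):
    """Aggregate one measure of a fact table by the dimension's category."""
    totals = {}
    for row in facts:
        category = dimension[row["product_key"]]["category"]
        totals[category] = totals.get(category, 0) + row[measure]
    return totals


def drill_across(sales, shipments, dimension):
    """Combine both marts on the shared category attribute."""
    sales_totals = total_by_category(sales, "amount", dimension)
    unit_totals = total_by_category(shipments, "units", dimension)
    categories = sorted(set(sales_totals) | set(unit_totals))
    return {c: (sales_totals.get(c, 0), unit_totals.get(c, 0)) for c in categories}
```

Because both marts represent a product in the same way, their results can be lined up by category without any reconciliation step. If each mart kept its own product coding, the combination in `drill_across` would not be meaningful.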

Figure 1.10 Bottom Up Design Approach

Table 1.2 Top-Down Design Approach vs. Bottom-Up Design Approach

| Top-Down Design Approach | Bottom-Up Design Approach |
| --- | --- |
| A truly corporate effort, an enterprise view of data. | A departmental view of data. Its own narrow view. |
| Inherently architected, not a union of disparate data marts. | Inherently incremental; can schedule important data marts first. |
| Single, central storage of data about the content. | Departmental data stored. |
| Centralized rules and control. | Departmental rules and control. |
| May see quick results if implemented with iterations. | Less risk of failure, favorable return on investment and proof of concepts. |

1.10 Basic Architecture of Data Warehousing System

The architecture of a data warehouse includes data sources, wrapper, monitor,
integrator and the data warehouse data repository (Figure 1.11). The bottom level of
the architecture depicts the data sources, which store day-to-day transaction data.
The monitor component is responsible for automatically detecting changes of interest
in the source data and reporting them to the integrator component. The wrapper
component is responsible for translating information from the native format of the
source into a format compatible with the warehouse. New information sources and
changes in existing data sources are propagated to the integrator. The integrator
acts as a liaison between the wrapper/monitor and the data warehouse data
repository. It brings source data into the warehouse data repository, which may
include filtering the information, summarizing it and merging it. To integrate new
change information from the same or different data sources properly, a process is
needed that stores the change data in the data repository without affecting the
Quality of Service (QoS).

Figure 1.11 Basic Architecture of Data Warehousing System

[Diagram labels: Data Warehouse; Integrator; Wrapper/Monitor (one per data source); Data Source 1, Data Source 2, ..., Data Source n]

The information stored at the warehouse is in the form of derived views of data from
the sources. These views stored at the warehouse are often referred to as
materialized views. The data warehouse itself can use an off-the-shelf or
special-purpose database management system. Although Figure 1.11 shows a single,
centralized warehouse, the warehouse may certainly be implemented as a distributed
database system, and in fact data parallelism or distribution may be necessary to
provide the desired performance.

The architecture and basic functionality we have described is more

general than that provided by most commercial data warehousing systems. In

particular current systems usually assume that the sources and the warehouse

subscribe to a single data model (normally relational), that propagation of

information from the sources to the warehouse is performed as a batch process

(perhaps off-line) and that queries from the integrator to the information sources are

never needed.
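The data flow described in this section, from monitor and wrapper through the integrator into the repository, can be sketched as follows. This is only an illustration under stated assumptions: the classes `Change`, `WrapperMonitor` and `Integrator`, the `key=value` native record format and the rule that discards negative values are all hypothetical and are not taken from any particular warehousing system.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A change of interest, already in the common (warehouse) format."""
    source: str
    key: str
    value: float


class WrapperMonitor:
    """Hypothetical wrapper/monitor for one source holding 'key=value' strings."""

    def __init__(self, name):
        self.name = name
        self._seen = set()  # records already reported to the integrator

    def poll(self, native_records):
        """Monitor: find records not seen before. Wrapper: translate them."""
        changes = []
        for record in native_records:
            if record in self._seen:
                continue
            self._seen.add(record)
            key, raw_value = record.split("=")
            changes.append(Change(self.name, key, float(raw_value)))
        return changes


@dataclass
class Integrator:
    """Filters incoming changes and merges them into the repository."""
    repository: dict = field(default_factory=dict)

    def integrate(self, changes):
        for change in changes:
            if change.value < 0:  # assumed filtering rule, for illustration
                continue
            current = self.repository.get(change.key, 0.0)
            self.repository[change.key] = current + change.value
```

One `WrapperMonitor` instance is created per data source, while a single `Integrator` receives the changes from all of them, which mirrors the layout of Figure 1.11.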

1.11 Components of Data Warehouse System and Their Problems

A leading issue in database research is to provide integrated access to multiple,
distributed, heterogeneous databases. Each component of a data warehouse raises
issues of its own that need to be overcome. The components and their problems are
given below:

1.11.1 Wrapper/Monitor

The data in the data warehouse is extracted from data sources through

wrapper/monitor components. The wrapper/monitor components have two

interrelated responsibilities.

a) Translation: Making the underlying information source appear as if it
subscribes to the data model used by the warehousing system. For example,
if the information source consists of a set of flat files but the warehouse
model is relational, then the wrapper/monitor must support an interface
that presents the data from the information source as if it were relational.
The translation problem is inherent in almost all approaches to data
integration, both lazy and eager, and is not specific to data warehousing.
Typically, a component that translates an information source into a
common integrating model is called a translator or wrapper. Most
commercial data warehousing systems assume that both the information
sources and the warehouse are relational, so translation is not an issue.
However, some vendors do provide wrappers for other common types of
information sources.

b) Change detection: Monitoring the information source for changes to the
data that are relevant to the warehouse and propagating those changes to
the integrator. Note that this functionality relies on translation since, like
the data itself, changes to the data must be translated from the format and
model of the information source into the format and model used by the
warehousing system.
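The flat-file example in item a) can be made concrete with a small sketch. It assumes a pipe-delimited file whose first line names the columns and uses an in-memory SQLite database as a stand-in for the relational warehouse model. The function name `wrap_flat_file`, the table name `customer` and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3


def wrap_flat_file(flat_text, conn, table="customer"):
    """Expose the rows of a delimited flat file as a relational table.

    Assumes a pipe-delimited file whose first line names the columns. The
    table and column names are trusted here; they are not user input.
    """
    reader = csv.reader(io.StringIO(flat_text), delimiter="|")
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # The remaining rows of the reader are loaded as tuples of text values.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return conn


# Hypothetical flat-file source with two records.
flat_source = "id|name|city\n1|Asha|Pune\n2|Ravi|Mumbai\n"
conn = wrap_flat_file(flat_source, sqlite3.connect(":memory:"))
rows = conn.execute("SELECT name FROM customer WHERE city = 'Pune'").fetchall()
```

After wrapping, the rest of the system can query the source with ordinary SQL, as the final `SELECT` does, without knowing that the data came from a flat file.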

One approach is to ignore the change detection issue altogether and simply

propagate entire copies of relevant data from the information source to the

warehouse periodically. The integrator can combine this data with existing

warehouse data from other sources or it can request complete information from all

sources and recompute the warehouse data from scratch. Ignoring change detection

may be acceptable in certain scenarios, for example when it is not important

for the warehouse data to be current and it is acceptable for the warehouse to

be off-line occasionally. However, if currency, efficiency and continuous access are

required then we believe that detecting and propagating changes and incrementally

folding the changes into the warehouse will be the preferred solution. In considering

the change detection problem, we have identified several relevant types of

information sources [78]:

a) Cooperative sources: Sources that provide triggers or other active database

capabilities so that notifications of changes of interest can be programmed to

occur automatically.

b) Logged sources: Sources maintaining a log that can be queried or inspected,

so changes of interest can be extracted from the log.

c) Queryable sources: Sources that allow the wrapper/monitor to query the
information at the source so that periodic polling can be used to detect
changes of interest.

d) Snapshot sources: Sources that do not provide triggers, logs, or queries.
Instead, periodic dumps, or snapshots, of the data are provided off-line, and
changes are detected by comparing successive snapshots.
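For snapshot sources, change detection amounts to computing the difference between two successive dumps. The sketch below assumes that each snapshot fits in memory as a dictionary keyed by primary key, which is a simplification; the function name `diff_snapshots` and the sample data are hypothetical, and dumps of realistic size would need a more scalable comparison.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key and return the changes.

    Returns three dictionaries: inserted rows, deleted rows and updated rows
    (the latter mapping each key to its old and new value).
    """
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


# Two hypothetical successive snapshots of the same source table.
monday = {1: ("Asha", "Pune"), 2: ("Ravi", "Mumbai"), 3: ("Meera", "Delhi")}
tuesday = {1: ("Asha", "Pune"), 2: ("Ravi", "Nagpur"), 4: ("John", "Goa")}
inserts, deletes, updates = diff_snapshots(monday, tuesday)
```

Only the three resulting change sets need to be translated and sent to the integrator; the unchanged row with key 1 is not propagated.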

Each type of information source capability provides interesting research

problems for change detection.

• In cooperative sources, although triggers and active databases have been

explored in depth, putting such capabilities to use in the warehousing context

still requires addressing the translation aspect; similarly for logged sources.

• In queryable sources, in addition to translation one must consider

performance and semantic issues associated with polling frequency. If the

frequency is too high performance will degrade, while if the frequency is too

low, changes of interest may not be detected in a timely way.

• In snapshot sources the challenge is to compare very large database dumps,

detecting the changes of interest in an efficient and scalable way.

An important related problem in all of these scenarios is to develop appropriate
representations for the changes to the data, especially if a non-relational model
is used.

It may be noted that a different wrapper/monitor component is needed for

each information source, since the functionality of the wrapper/monitor is dependent

on the type of the source (database system, legacy system, etc.) as well as on the data

provided by that source [98]. Clearly, it is undesirable to hard-code a

wrapper/monitor for each information source participating in a warehousing system,

especially if new information sources become available frequently. Hence, a

significant research issue is to develop techniques and tools that automate or

semi-automate the process of implementing wrapper/monitors through a toolkit

or specification-based approach.

1.11.2 Integrator

The ongoing job of the integrator component is to receive change

notifications from the wrapper/monitor component and reflect these changes in the

data warehouse data repository.

At a sufficiently abstract level the data in the warehouse can be seen as a

materialized view or set of views where the base data resides at the information

sources. Viewing the problem in this way, the job of the integrator is essentially to

perform materialized view maintenance. There are a number of reasons that

conventional view maintenance techniques cannot be used, and each of these reasons

highlights a research problem associated with data warehousing.

• Data warehouses may contain a significant amount of historical information,
while the underlying sources may not maintain this information. A relevant
area of research here is the efficient monitoring of historical information.

• Data warehouses contain highly aggregated and summarized information.

Therefore, efficient view maintenance in the presence of aggregation and

summary information appears to be an open problem.

• The data sources are updated independently, without regard to the maintenance
of the data warehouse views. In this scenario, certain anomalies arise when
attempting to keep views consistent with the base data, and algorithms must be
used that are considerably more complicated than conventional view
maintenance algorithms.

• In a data warehousing environment it may be necessary to transform the base

data (sometimes referred to as data scrubbing) before it is integrated into the

warehouse. Transformations might include, for example, aggregating or

summarizing the data, sampling the data to reduce the size of the warehouse,

discarding or correcting data suspected of being erroneous, inserting default

values or eliminating duplicates and inconsistencies.

• Although integrators can be based purely on the data model used by the

warehousing system, a different integrator still will be needed for each

data warehouse since a different set of views over different base data will be

stored. As with wrapper/monitors, it is desirable not to require that each

integrator be hard coded from scratch, but rather to provide techniques

and tools for generating integrators from high-level, nonprocedural

specifications.
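To illustrate what maintaining an aggregated view from change notifications can look like, the following sketch keeps a SUM and a COUNT per group and folds a batch of inserts and deletes into them without recomputing from the base data. It is a simplified illustration, not one of the algorithms studied in this thesis; the function name `apply_delta`, the grouping by region and the sample figures are assumptions.

```python
def apply_delta(view, delta):
    """Fold a batch of changes into a SUM/COUNT view grouped by region.

    view:  {region: [total_amount, row_count]}
    delta: iterable of (op, region, amount), op being "insert" or "delete"
    """
    for op, region, amount in delta:
        sign = 1 if op == "insert" else -1
        total, count = view.get(region, [0, 0])
        total, count = total + sign * amount, count + sign
        if count == 0:
            # No base rows remain in this group, so it leaves the view.
            view.pop(region, None)
        else:
            view[region] = [total, count]
    return view


# Hypothetical view over base rows east: 100, 200.
view = {"east": [300, 2]}
delta = [
    ("insert", "west", 50),
    ("delete", "east", 100),
    ("insert", "east", 25),
    ("delete", "west", 50),
]
apply_delta(view, delta)
```

The row count is stored alongside the sum so that a group can be dropped when its last base row is deleted. Aggregates such as MIN or MAX cannot be maintained this simply under deletions, which is one reason aggregation complicates view maintenance.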

1.12 Motivation

Business data are growing very fast, and they need to be stored and analyzed for
further use. The amount of data doubles every 18 months, which affects the response
time and the sheer ability to comprehend its contents. System users are business
domain experts, not computer professionals. Businesses need decisions to be made
quickly using all available current and past data. A data warehouse stores views
derived from data that may not reside at the warehouse. These views are called
materialized views. Using these materialized views, user queries can be answered
quickly because querying the external sources where the base data reside is avoided.
However, when the data sources change, the views in the warehouse can become
inconsistent with the base data and must be maintained.

A variety of approaches have been proposed to maintain these views, and they can be
classified into two broad categories. Most database systems use eager or immediate
maintenance, where all affected views are maintained as part of the update
transaction. This method is suitable for views whose base tables are seldom updated
and whose updates are likely to be followed immediately by queries. The second
method is deferred view maintenance, or the lazy approach, where view maintenance is
delayed and takes place only when explicitly triggered by a user. Under this
approach, update transactions do not maintain views but just store away enough
information so that affected views can be maintained later. Actual maintenance is
done when low-priority jobs are running or when the system has free cycles. Lazy or
deferred maintenance allows updates to complete faster, so locks are released
sooner, which reduces the frequency of lock contention, lock conflicts and
transaction aborts. If a view is not up to date when needed by a query, it is
transparently brought up to date before the query is allowed to access it. In this
case, the first beneficiary of the view pays for all or part of the view's
maintenance by experiencing a delay. This approach has the serious drawback that a
query may see an out-of-date view and produce an incorrect result.
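The difference between the two categories can be sketched with a toy materialized view that stores the SUM of a base table. This is a minimal illustration under assumed names (`SumView`, `insert`, `refresh`, `query`), not the implementation of any database system.

```python
class SumView:
    """A materialized SUM over a base table, maintained eagerly or lazily."""

    def __init__(self, eager):
        self.eager = eager
        self.base = []      # base table rows (amounts)
        self.total = 0      # contents of the materialized view
        self.pending = []   # changes logged for deferred maintenance

    def insert(self, amount):
        """Update transaction on the base table."""
        self.base.append(amount)
        if self.eager:
            self.total += amount         # view maintained inside the update
        else:
            self.pending.append(amount)  # only record the change for later

    def refresh(self):
        """Deferred maintenance: fold the logged changes into the view."""
        self.total += sum(self.pending)
        self.pending.clear()

    def query(self, refresh_first=True):
        """Read the view, optionally bringing it up to date first."""
        if refresh_first:
            self.refresh()
        return self.total


eager_view = SumView(eager=True)
lazy_view = SumView(eager=False)
for amount in (10, 5):
    eager_view.insert(amount)
    lazy_view.insert(amount)

stale_answer = lazy_view.query(refresh_first=False)  # reads the view as it is
fresh_answer = lazy_view.query()                     # pays for the refresh first
```

In the eager case every insert pays the maintenance cost. In the lazy case inserts are cheaper, but a query that skips the refresh reads a stale total, which is the drawback noted above.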

Both view maintenance techniques require access to the base relations in order to
maintain a materialized view. In a data warehousing scenario, accessing base
relations can be difficult since these relations are distributed across different
sources. Often the data sources may be unavailable or, even if they are available,
the cost of accessing them may be prohibitive due to communication costs. Accessing
the base relations, computing the changes and then reflecting these changes in the
warehouse takes considerable time. For these reasons, the self-maintainability of
the view is an important issue in data warehousing. We call a view self-maintainable
if it can be maintained at the warehouse without accessing the source data.
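As a simple illustration of self-maintainability, consider a view defined by a selection over a single relation. Under the assumption that each row carries a unique key and that the delta contains the full inserted or deleted rows, such a view can be maintained from the delta alone. The function name `maintain_selection_view` and the sample data are hypothetical.

```python
def maintain_selection_view(view, delta, predicate):
    """Maintain the view {row | predicate(row)} using only the delta.

    view:  set of rows (tuples whose first field is assumed to be a unique key)
    delta: iterable of (op, row), op being "insert" or "delete"
    """
    for op, row in delta:
        if not predicate(row):
            continue  # a row outside the selection cannot affect the view
        if op == "insert":
            view.add(row)
        else:
            view.discard(row)
    return view


# Hypothetical view: orders with an amount above 100, as (order_id, amount).
large_orders = {(1, 150)}
delta = [("insert", (2, 80)), ("insert", (3, 200)), ("delete", (1, 150))]
maintain_selection_view(large_orders, delta, lambda row: row[1] > 100)
```

No source relation is consulted at any point. A view that joins two relations generally does not have this property, because an inserted row must be matched against rows of the other relation, which may reside only at the source.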

Therefore, research is required to optimize the overall view

maintenance process, to reduce the cost of deriving changes and performing

updates to the warehouse site.

1.13 Objectives of the Research

The following objectives have been set for the present study.

1. To study the Data Warehousing System to understand data model,

architecture and components.

2. To understand materialized view maintenance problem and to analyze

different methods used in view maintenance process.

3. To propose the techniques for materialized view maintenance to

overcome problems in integrator component of a data warehouse.

4. To test the proposed methods and compare them with existing methods.

Studies have been carried out on data warehouse design issues, various challenges in
data warehousing system architecture, efficient change detection, materialized view
maintenance and data warehouse view optimization. Different maintenance algorithms
have also been studied.

1.14 Research Methodology

A literature survey was conducted to study the existing methods of materialized view
maintenance. IEEE journals, ACM transactions, IEEE conference proceedings, the IEEE
digital library, online papers and text books were consulted to obtain the latest
information about the topic. Data warehouse consultants, developers and BI
developers were consulted to understand the requirements and identify the exact
problems in the relevant area.

Simulation is one of the most widely used techniques in operations research and
management science. Therefore, this approach is used to determine the effect of our
proposed technique and to compare its results with those of the existing techniques.

1.15 Organization of the Thesis

The thesis is divided into six chapters consisting of Introduction to Data

Warehousing System, Literature Review, Materialized View Maintenance Methods

and Performance Evaluation, Analytical Models for Materialized View Maintenance

Methods, Proposed Simplification and Optimization of View Maintenance Process

and the last chapter is Conclusions, Limitations and Future Work.

Chapter 1 Introduction to Data Warehousing System

The first chapter covers the preliminary concepts and a general overview of the
Data Warehousing (DWH) system: its need, evolution, problems, goals and
applications. It also describes the future of Data Warehousing, its importance, its
general architecture and various design issues in DWH. Business Intelligence (BI)
and its need are explained. The chapter also describes the motivation of the
research work, the statement of the problem, the objectives of the research, the
research methodology and the organization of the thesis.

Chapter 2 Literature Review

The second chapter presents a review of the literature surveyed for the research
study. The theory presented here has been collected from books, articles, research
papers and the internet. The processes of materialized view updating, maintenance
and selection are described. Data modeling techniques, multidimensional databases
and a detailed DWH architecture are presented in this chapter. The various
materialized view maintenance techniques found in the literature are classified into
categories and compared.

Chapter 3 Materialized View Maintenance Methods and Performance Evaluation

The third chapter deals with the problem of materialized view maintenance and
maintenance overhead in data warehousing. The standard approaches to the
materialized view maintenance process are described. Eager maintenance, or
Incremental View Maintenance (IVM), and lazy maintenance, or Deferred View
Maintenance (DVM), are explained. We propose a materialized view maintenance
framework using a maintenance manager that keeps track of active view maintenance
tasks. Scheduling of maintenance tasks is also explained in this chapter. We have
also checked the effect of maintenance tasks on the response time of queries. To
verify the feasibility and effectiveness of these standard view maintenance
strategies, an experimental study is carried out, and the results are recorded and
compared. The results are calculated using a single view, two views and multiple
views.

Chapter 4 Analytical Models for Materialized View Maintenance Methods

In the fourth chapter we have categorized the maintenance methods and

proposed an analytical model. The results are calculated and compared to show the

best performance from these models and methods. The concept of unrestricted base

access and run time view maintenance are also explained in detail with the

advantages and comparison of various analytical models. We have considered the

parameter total amount of space required to store the change at the data warehouse

site.

Chapter 5 Proposed Simplifications and Optimization of View Maintenance

Process

The fifth chapter describes the proposed view maintenance method. We use secondary
relations to store the intermediate results of the view at the data warehouse site.
Whenever the data sources change, the changes are computed incrementally and stored
in these secondary relations. These changes are subsequently sent to the
higher-level secondary relations. At the time of view maintenance, the entire
contents of the final view are integrated into the warehouse views. The
characteristic of this method is that the entire view maintenance process is hidden
from the data warehouse user and does not affect warehouse performance. An
experimental model has been developed and the results are compared with the existing
techniques.

Chapter 6 Conclusions, Limitations and Future Work

The sixth chapter describes the conclusions, limitations and future work.
Conclusions are drawn against the objectives defined for the research, and the
limitations of the research work are given. Directions for future research are also
described, and finally the author's list of publications relevant to the thesis is
given.

1.16 Conclusions

In today's fast-paced and ever-changing economy, information is seen as a key
business resource for gaining market advantage. To compete in today's turbulent
market, organizations need to do considerable market research to find out what
exactly people want rather than what they need. The last three decades have seen
exponential growth in the area of information technology, catering to the
information processing needs of business in the form of capturing, storing,
analyzing and transferring data that will help knowledge workers and decision makers
make sound business decisions. This is exactly where Data Warehousing comes into the
picture.

Data Warehousing is the foundation of the Decision Support System (DSS), whose goal
is to enable decision makers to make better business decisions based on analysis of
historic data related to the business operation. Data warehousing has become a major
business trend, both for product and service sectors and for application to daily
business in all industries. Without a data warehouse it is difficult to answer
analytical queries because the data sources are distributed and heterogeneous. In
the data warehousing area in general, the problems that receive the most attention
are data integration, extraction and transformation, and data warehouse design and
maintenance.

Before complex analytical queries are executed, the ETL process needs to be
performed on the data warehouse so that its users get the latest, integrated data.
Business analysts execute their queries over this centralized data repository to
gain insights into the data.

Traditionally, data warehouses have been used to provide storage and analysis of
large amounts of historical data. In a typical data warehouse, updates occur in
batches at regular time intervals (e.g., every night). At all other times, the data
warehouse is regarded as a "read-only" database, where users can pose long-running
decision support queries.

Chapter Summary

This chapter gives an insight into the area of Data Warehousing, its problems and
the work on which the research study is focused. The various research issues in the
field of data warehousing are introduced. We have discussed the evolution of data
warehousing, its architecture and its challenges. An overview of the applications of
data warehousing is also presented. As a data warehouse becomes a mature part of an
organization, it is likely that it will become as anonymous as any other part of the
information system. In this regard the future of the data warehouse is given. As a
business grows more complex, sustaining it requires valuable and concise
information. The diagram given in the section on the importance of data warehousing
shows the process of knowledge discovery. The general architecture of data
warehousing is divided into three parts, namely data sources, the ETL component and
the data warehouse. There are two broad approaches to data warehouse design: the
top-down approach and the bottom-up approach. Each approach has its pros and cons.
In the later part of this chapter, the motivation and objectives of our research
study and the organization of the thesis have been presented.
