Chapter 1
Introduction to Data Warehousing System
1.1 Introduction
1.2 Need for Data Warehousing
1.3 Evolution of Data Warehousing
1.4 Definitions of Data Warehouse
1.4.1 Characteristics of Data Warehousing
1.5 Goals and Applications of Data Warehousing
1.6 Future of Data Warehousing
1.7 Importance of Data Warehousing
1.8 Business Intelligence and Data Warehousing (BIDWH)
1.9 Issues of Data Warehousing Design
1.9.1 Top Down Design Approach
1.9.2 Bottom Up Design Approach
1.10 Basic Architecture of Data Warehousing System
1.11 Components of Data Warehouse System and Their Problems
1.11.1 Wrapper/Monitor
1.11.2 Integrator
1.12 Motivation
1.13 Objectives of the Research
1.14 Research Methodology
1.15 Organization of the Thesis
1.16 Conclusions
Chapter summary
1.1 Introduction
In the 1990s, as business grew more complex, corporate offices spread
across the globe, and competition became fiercer, business executives became
desperate for information to be competitive and improve the bottom line. The
operational computer systems did provide information to run day-to-day operations,
but what the executives needed were different kinds of information that could be
readily used to make strategic decisions. They wanted to know where to build the
next warehouse for their product and which markets they should strengthen. The
operational systems, important as they were, could not provide strategic information.
Rapidly changing market dynamics, competitive pressure, globalization and
other similar factors forced businesses to review their structures, approaches and
strategies. Therefore, businesses were compelled to look into new ways of getting
information for dynamic markets.
During the last decade, interest in analyzing data has increased
significantly because of the competitive advantage data offers in the decision-making process.
A key to survival in the business world is being able to analyze, plan and react to
changing business conditions as fast as possible. Many organizations own billions of
bytes of data, but they suffer from different problems because data are spread over
different computer systems, data from different sources are incompatible, data are
available too late, etc.
In order to solve these problems, new concepts and tools have evolved
into an information technology called Data Warehousing. The Data Warehouse
(DW) can meet informational needs of knowledge workers and can provide strategic
business opportunities by allowing customers and vendors to access corporate data.
A large retail store collects vast amounts of information about its
day-to-day activities. The same retail store probably collects other types of information as
well, such as customer data, inventory data, advertisement data, employee data, etc.
An increasing number of organizations are realizing that the vast amounts of
collected data can and must be used to guide their business decisions [116][12].
Typically, the management of the organization wants to answer complex analytical
queries based on the collected data.
Building a Data Warehouse provides a number of benefits such as:
• The processing of analytical queries is simplified because only the data
warehouse needs to be accessed.
• The warehouse data can keep a historical record of the various source data.
By retaining all of this data, the current activity of an organization can be
compared against history and can also be used for forecasting the future
activities of an organization.
1.2 Need for Data Warehousing
From a business perspective, in order to survive and succeed in today’s
highly competitive global environment, business users demand business answers
mainly because [12], [79], [132]:
• Decisions need to be made quickly and correctly, using all available data.
• Users are business domain experts, not computer professionals.
• The amount of data collected doubles every 18 months, which affects
response time and the sheer ability to comprehend its content.
• Competition is heating up in the areas of business intelligence.
In addition, the necessity for data warehouses has increased as organizations
distribute control away from the middle-management layer that has traditionally
provided and screened business information. As users depend more on information
obtained from information technology systems, the need to provide an information
warehouse for the remaining staff to use becomes more critical.
There are several technology reasons for the existence of data warehousing.
First, the data warehouse is designed to address the incompatibility of informational
and operational transactional systems. These two classes of information systems are
designed to satisfy different, often incompatible, requirements. At the same time, the
IT infrastructure is changing rapidly and its capabilities are increasing.
The analysis carried out in [143] showed that the percentage of data stored in
digital form increased drastically after the 1980s. The study documented the rise of
digitization. The volume of data is growing at an exponential rate. Several research
groups have been studying the amount of data that enterprises and individuals are
generating, storing, and consuming in the world economy. All analyses, each
with different methodologies and definitions, agree on one fundamental point—the
amount of data in the world has been expanding rapidly and will continue to grow
exponentially for the foreseeable future despite there being a question mark over
how much data we, as human beings, can absorb.
Table 1.1 (a) Growth of Data Volume (By Decade)
Year    Data volume (Exabytes)    Growth in Percentage
1970                        39    --
1980                       127    325
1990                      1997    1572
2000                     69877    3499
2010                   4453761    6373
Table 1.1 (b) Yearly Increment in Data Volume
Year    Data volume (Exabytes)    Growth in Percentage
2001                    103738    148
2002                    161324    155
2003                    244737    151
2004                    408644    166
2005                    599763    146
2006                    867822    144
2007                   1183545    149
2008                   1779867    151.38
2009                   2737963    153.83
2010                   4453761    162.67
2011                   7622324    171.14
2012                  12368961    162.27
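The growth column of Table 1.1 (b) can be reproduced for the most recent rows under one assumption: that each figure is the year's volume expressed as a percentage of the previous year's volume. The sketch below is illustrative only. The function name `percent_of_previous` is hypothetical, the volumes are copied from the table, and earlier rows may have been rounded or derived differently.

```python
# Volumes (in exabytes) copied from Table 1.1 (b) for 2008 to 2012.
volumes = {
    2008: 1779867,
    2009: 2737963,
    2010: 4453761,
    2011: 7622324,
    2012: 12368961,
}


def percent_of_previous(series):
    """Return {year: volume as a percentage of the previous year's volume}.

    Assumption: this is how the table's growth column was derived.
    """
    years = sorted(series)
    return {
        year: round(100.0 * series[year] / series[prev], 2)
        for prev, year in zip(years, years[1:])
    }


growth = percent_of_previous(volumes)
for year, pct in growth.items():
    print(year, pct)
```

Read this way, a value of 162.27 means that the volume grew to roughly 1.62 times that of the previous year.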
Figure 1.1 Growth of Data (Yearly)
It is estimated that the percentage of data stored in digitized form increased
by 35 percent in the year 2000. It also increased by 59 percent in the year 2012
compared with the year 2011 (Figure 1.1). It is also found that data generation is
increasing much faster than the world’s data storage capacity is expanding,
pointing strongly to a continued widening of the gap between the two
(Table 1.1 (a) and (b)). Data can create significant value for the world economy,
enhancing the productivity and competitiveness of private sector companies and
the public sector and creating substantial economic surplus for
consumers [107],[116],[170],[177].
The generation of data is growing exponentially (Figure 1.1) and advancing
technology may allow the global economy to store and process ever greater
quantities of data. Human beings may have limits in their ability to consume and
understand huge data. Despite these apparent limits, there are ways to help
organizations and individuals to process, visualize, and synthesize meaning from
huge data. For instance, more sophisticated visualization techniques and algorithms,
including automated algorithms, enable people to see patterns in large amounts of
data and help them to unearth the most pertinent insights.
[Figure 1.1 plot: x-axis Year (2001 to 2012); y-axis Data Volume (Exabytes), 0 to 14x10^6.]
The price of computer processing speed, measured in million instructions per second
(MIPS), continues to decline, while the power of microprocessors doubles every four
years [175]. Alongside rising computation capacity, the following trends are also
relevant:
• The price of digital storage is rapidly dropping.
• Network bandwidth is increasing, while the price of high bandwidth is
decreasing.
• The workplace is increasingly heterogeneous in terms of hardware and
software and
• Legacy systems need to, and can, be integrated with new applications.
1.3 Evolution of Data Warehousing
In the early days of computing, disk storage was extremely expensive, so data
was stored on magnetic tape and had to be read sequentially from flat files [65][116].
The management of data was tightly integrated with the application system, and the file
description was also stored within each application program. Every application had
its own private files, with little opportunity to share data outside the
application, and the developer had to start from scratch by designing file
formats. As computers grew in capability, this trade-off became increasingly
unnecessary and a number of general-purpose database systems emerged.
As the cost of disk storage fell, opportunities to store data for real-time
access arose. Specialized Data Base Management System (DBMS) software
emerged during the 1960s for the sole purpose of managing data. Application
systems were then able to focus on the user interface, screen navigation, data
validations etc. and could leave the data management tasks to the specialized DBMS
technology. The application system simply had to call the DBMS when it needed to
read or store data.
In 1970, Edgar F. Codd set the foundations of the relational database
model. Codd’s relational model introduced the notion of data independence, which
separated the physical representation of data from the logical representation
presented to applications. Data could be moved from one part of the disk to another
or stored in a different format without causing applications to be rewritten.
RDBMS technology is fairly stable now: improvements have been made, but the
fundamentals of the technology have not changed.
Figure 1.2 Evolution of Data Warehouse
Before 1960: Flat files were used to store data.
During the 1960s: DBMS software emerged.
During the 1970s: RDBMS software emerged; Dr. E. F. Codd set the foundations of RDBMS.
In the 1980s: Business operations became decentralized geographically and DDBMS emerged. OLTP technology was used to capture and store business user needs.
In the 1990s: Organizations began to achieve competitive advantage by building data warehouse systems. OLAP technology was used to analyze data stored in the data warehouse.
In the 1980s, business operations became decentralized geographically
[12],[32]. Competition increased at the global level. Customer demands and market
needs favored a decentralized management style. Rapid technological change created
low-cost microcomputers. Local Area Networks (LANs) became the basis for
computerized solutions. The large number of applications based on DBMSs and the
need to protect investments in centralized DBMS software made the notion of data
sharing attractive. Distributed Database Management Systems (DDBMSs) became
popular. Online transaction processing (OLTP) systems were also developed to capture
and store business operations data. Their most obvious shortcomings were the
inability to address business users’ needs to access stored transaction data and
management’s decision support requirements. OLTP systems did not address history
and summarization requirements or support integration needs, that is, the ability to
analyze data across different systems.
Figure 1.3 Levels of Sophistication
In the 1990s, organizations began to achieve competitive advantage by
building data warehouse systems. Data warehousing has become the most feasible
solution to optimize and manipulate data. The current trend is to gather the data that
is needed in an optimized database regardless of the number of different applications
and different platforms that are used to generate the source data. The data warehouse
by bringing together data stored in disparate systems is a return to a centralized
concept. The main difference is that data warehousing enables enterprise and local
decision support needs to be met while allowing independent data islands to flourish.
1.4 Definitions of Data Warehouse
“A Data Warehouse is a subject oriented, integrated, nonvolatile, and time-
variant collection of data in support of management’s decisions [65]”.
“A data warehouse is a copy of transaction data specially structured for
query and analysis”. This means that users can access the data they want for
analysis by querying it [115][116].
“A data warehouse combines various data sources into a single source for
end user access. End user can perform ad-hoc querying, analysis, data mining and
visualization of warehouse information. The goal of data warehouse is to establish a
data repository that makes operational data accessible in a form that is readily
acceptable for decision support and other applications [41]”
A Data Warehouse is divided into three parts [51]:
1. Data Management Layer: Data warehouse itself, which contains data and
its associated software.
2. Data Staging Layer: Data acquisition software, which extracts data from
legacy systems and external sources, consolidates and summarizes the data,
and loads it into the Data Warehouse.
3. Data Accessing Layer: Client software, which allows users to access and
analyze data in the Data Warehouse.
“The data in the data warehouse is: Separate, Available, Integrated, Time
stamped, Subject oriented, Non volatile and Accessible [83]”.
1.4.1 Characteristics of Data Warehousing
The first characteristic is subject oriented, which means data is stored by business
subjects, which differ from enterprise to enterprise. These subjects are critical for the
enterprise. In the case of a manufacturing company, sales, shipments and inventory are
critical business subjects, whereas in a retail store, sales at the check-out counter is a
critical subject.
In operational systems, by comparison, we store data by individual applications.
For example, for an order processing application, we keep the data for that particular
application. This application provides the data for all the functions for entering orders,
checking stock, verifying the customer’s credit, and accessing the order for shipment. But
these data sets contain only the data that is needed for those functions relating to this
particular application. In striking contrast, in the data warehouse, data is stored by
subjects, not by applications. For example, in the data warehouse for an insurance
company, claims data are organized around the subject of claims and not by
individual applications of general insurance and life insurance.
Figure 1.4 The Data Warehouse is Subject Oriented
The second characteristic of the data warehouse is integrated. The
integrator component of a data warehouse integrates the data fetched from different
data sources. For proper decision making, we need to pull together all the relevant
data from the various applications and remove the inconsistencies. The data in the
data warehouse comes from several disparate operational systems. These are
disparate applications, so the operational platforms and operating systems could be
different. The file layouts, character code representations and field naming
conventions could all be different. The data is converted, reformatted, resequenced,
summarized and so forth. We have to standardize the various data elements and
make sure of the meanings of data names in each source application. For example,
naming conventions, codes, data attributes and measurements would need
standardization. The result is that data, once it resides in the data warehouse, has a
single physical corporate image.
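The standardization step described above can be sketched as follows. This is a minimal illustration, not the integrator component of any particular system: the two source layouts, the field names, the gender codes and the paise-to-rupees conversion are all hypothetical assumptions chosen to show differing naming conventions, codes and units of measurement.

```python
# Two hypothetical source systems describe customers differently:
# field names, gender codes and units of measurement all differ.
ORDER_SYSTEM_ROWS = [
    {"cust_nm": "A. Rao", "sex": "M", "balance_paise": 125000},
]
BILLING_SYSTEM_ROWS = [
    {"CustomerName": "B. Shah", "gender": "female", "balance_rupees": 980.5},
]

# One agreed warehouse code per source code (an assumed mapping).
GENDER_CODES = {"M": "M", "male": "M", "F": "F", "female": "F"}


def from_order_system(row):
    """Convert an order-system row to the standard warehouse layout."""
    return {
        "customer_name": row["cust_nm"],
        "gender": GENDER_CODES[row["sex"]],
        "balance_rupees": row["balance_paise"] / 100,  # paise -> rupees
    }


def from_billing_system(row):
    """Convert a billing-system row to the standard warehouse layout."""
    return {
        "customer_name": row["CustomerName"],
        "gender": GENDER_CODES[row["gender"]],
        "balance_rupees": float(row["balance_rupees"]),
    }


def integrate():
    """Return all source rows in one consistent layout."""
    rows = [from_order_system(r) for r in ORDER_SYSTEM_ROWS]
    rows += [from_billing_system(r) for r in BILLING_SYSTEM_ROWS]
    return rows


warehouse_rows = integrate()
```

After integration every row carries the same field names, the same code set and the same unit, regardless of which application produced it.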
[Figure 1.4 content. Operational Applications: Order Processing, Consumer Loans, Customer Billing, Accounts Receivable, Claims Processing, Savings Accounts. Data Warehouse Subjects: Sales, Product, Customer, Account, Claims, Policy.]
Figure 1.5 The Data Warehouse is integrated
The third characteristic of a data warehouse is non-volatile. Operational data
is regularly accessed and manipulated one record at a time. Data is updated in the
operational environment as a regular matter of course, but data warehouse data
exhibits a very different set of characteristics. The data in the data warehouse is not
intended to run the day-to-day business. When we want to process the next order
received from a customer, we do not look into the data warehouse to find the current
stock status. Data warehouse data is loaded and accessed, but it is not updated.
Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static
format. When subsequent changes occur, a new snapshot record is written. In doing
so, a historical record of data is kept in the data warehouse.
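The load-and-read behaviour described above can be illustrated with an append-only store. This is a sketch under stated assumptions: the class name `SnapshotStore`, its methods and the sample stock figures are hypothetical and are not taken from any specific warehouse product.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: rows are loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, key, value, as_of):
        # A change in the source produces a NEW snapshot row; older rows
        # are left untouched, so history accumulates.
        self._rows.append({"key": key, "value": value, "as_of": as_of})

    def history(self, key):
        """Return all snapshots for a key, oldest first."""
        return sorted(
            (r for r in self._rows if r["key"] == key),
            key=lambda r: r["as_of"],
        )


store = SnapshotStore()
store.load("item-42", {"stock": 100}, date(2012, 1, 1))
store.load("item-42", {"stock": 80}, date(2012, 2, 1))
stock_levels = [r["value"]["stock"] for r in store.history("item-42")]
```

The store deliberately offers no update or delete method, which is what keeps the historical record intact.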
The fourth characteristic of a data warehouse is time variant. The data in
the warehouse is meant for analysis and decision making. If a user is looking at the
buying pattern of a specific customer, the user needs data not only about the current
purchase, but about the past purchases as well. When a user wants to find out the reason
for the drop in sales in a particular region, the user needs all the sales data for that
region over a period extending back in time.
[Figure 1.5 content. Data from applications (Savings Account, Checking Account, Loans Account) is integrated in the Data Warehouse under Subject = Account.]
Figure 1.6 The Data Warehouse is Nonvolatile
Time variance implies that every unit of data in the data warehouse is
accurate as of some moment in time. In some cases, a record is time stamped. In
other cases, a record has a date of transaction. But in every case, there is some form of
time marking to show the moment in time during which the record is accurate.
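One way to use such time marking is an "as of" lookup, which returns the record that was accurate at a given moment. The sketch below assumes each snapshot carries an effective date and that snapshots are kept sorted by that date. The price data and the function name `price_as_of` are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical price snapshots for one product: (effective_date, price),
# kept sorted by effective date.
PRICE_SNAPSHOTS = [
    (date(2011, 1, 1), 10.0),
    (date(2011, 7, 1), 12.0),
    (date(2012, 3, 1), 11.5),
]


def price_as_of(snapshots, when):
    """Return the price that was accurate on the date `when`.

    Returns None if `when` is earlier than the first snapshot.
    """
    dates = [d for d, _ in snapshots]
    idx = bisect_right(dates, when)  # number of snapshots effective <= when
    if idx == 0:
        return None
    return snapshots[idx - 1][1]
```

For example, `price_as_of(PRICE_SNAPSHOTS, date(2011, 12, 31))` looks up the snapshot that took effect on 1 July 2011, because that record was still the accurate one at the end of 2011.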
1.5 Goals and Applications of Data Warehousing
Data warehousing is primarily used to archive data that is of the nature of
business knowledge [12]. A primary requirement is the fast analysis of this shared
data, resulting in multidimensional views of the data, which in turn results in
knowledge acquisition. OLAP comes into play when it is required to analyze data
stored in a warehouse, so as to generate complex results, in a time-constrained
environment – that is, just-in-time information. Most data warehouses support what is
called ad-hoc querying, which implies that any combination of complex queries can be
executed against the stored data.
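A multidimensional view can be thought of as the same measure summed over any chosen combination of dimensions. The following sketch shows that idea in plain Python. The sales facts, the dimension names and the function `aggregate` are hypothetical, and a real OLAP engine would precompute and index such aggregates rather than scan the facts each time.

```python
from collections import defaultdict

# Hypothetical sales facts: three dimensions and one measure.
SALES = [
    {"region": "North", "product": "Pen", "year": 2011, "amount": 100},
    {"region": "North", "product": "Ink", "year": 2011, "amount": 40},
    {"region": "South", "product": "Pen", "year": 2011, "amount": 70},
    {"region": "North", "product": "Pen", "year": 2012, "amount": 120},
]


def aggregate(facts, dimensions, measure="amount"):
    """Sum `measure` grouped by any combination of dimension names."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# Ad-hoc views: the caller picks the dimensions at query time.
by_region = aggregate(SALES, ["region"])
by_product_year = aggregate(SALES, ["product", "year"])
grand_total = aggregate(SALES, [])
```

Because the grouping key is built from whatever dimension list is passed in, the same function answers a query by region, by product and year, or for the grand total.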
The significant goals of Data Warehousing are [32][41]:
• To provide an environment to store and maintain an organization’s
historical information,
• To be an adaptive and resilient source of information,
• To be the foundation for decision making and
• To provide analysis as well as reporting through OLAP tools.
[Figure 1.6 content. OLTP databases support Read, Add, Change and Delete; the data warehouse supports Loads and Read.]
Here are some examples of applications that can use the OLAP option to realize valuable
gains in functionality and performance [41][12][13]:
• Planning applications allow organizations to predict outcomes. They
generate new data using predictive analytical tools such as models,
forecasts, aggregation, allocation, and scenario management. Some
examples of this type of application are corporate budgeting, financial
analysis, and demand planning systems.
• Budgeting and financial analysis systems allow organizations to analyze
past performance, build revenue and spending plans, manage to attain
profit goals and model the effects of change on the financial plan.
Management can determine spending and investment levels that are
appropriate for the anticipated revenue and profit levels. Financial
analysts can prepare alternative budgets and investment plans contingent
on factors such as fluctuations in currency values.
• Demand planning systems allow organizations to predict market demand
based on factors such as sales history, promotional plans and pricing
models. They can model different scenarios that forecast product demand
and then determine appropriate manufacturing goals.
As these points highlight, the data processing that is required to answer
analytical questions is fundamentally different from the data processing required to
answer transactional questions. The users are different, their goals are different, their
queries are different and the type of data that they need is different. A relational data
warehouse enhanced with the OLAP option provides the best environment for data
analysis.
OLAP holds several benefits for businesses:
1. OLAP helps managers in decision-making through the multidimensional
data views that it is capable of providing, thus increasing their
productivity.
2. OLAP applications are self-sufficient owing to the inherent flexibility
provided to the organized databases.
3. It enables simulation of business models and problems through extensive
usage of analysis-capabilities.
4. In conjunction with data warehousing, OLAP can be used to provide
reduction in the application backlog, faster information retrieval and
reduction in query drag.
1.6 Future of Data Warehousing
Data warehouses have become indispensable for many enterprises, as they
store and analyze large amounts of structured business data [177][115][12]. A data
warehouse’s architecture consolidates important business events, for example sales,
and models them as facts. These facts are characterized by a number of hierarchical
dimensions like time or products with associated numerical measures like sales price.
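The fact and dimension vocabulary above can be made concrete with a small sketch. It assumes a sale is a fact with a time dimension, a product dimension and a sales price measure, and that the time dimension has a day, month and year hierarchy. The class `SaleFact`, the table `TIME_LEVELS` and the sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SaleFact:
    day: date           # time dimension at its finest level
    product: str        # product dimension
    sales_price: float  # numerical measure


# Levels of the time hierarchy, from fine to coarse (an assumed hierarchy).
TIME_LEVELS = {
    "day": lambda d: d.isoformat(),
    "month": lambda d: f"{d.year}-{d.month:02d}",
    "year": lambda d: str(d.year),
}


def roll_up(facts, level):
    """Total the sales_price measure at one level of the time hierarchy."""
    totals = defaultdict(float)
    for fact in facts:
        totals[TIME_LEVELS[level](fact.day)] += fact.sales_price
    return dict(totals)


FACTS = [
    SaleFact(date(2012, 1, 5), "Pen", 10.0),
    SaleFact(date(2012, 1, 20), "Ink", 4.0),
    SaleFact(date(2012, 2, 1), "Pen", 10.0),
]
monthly = roll_up(FACTS, "month")
yearly = roll_up(FACTS, "year")
```

Moving from `"month"` to `"year"` is a roll-up along the hierarchy: the same facts are summarized at a coarser level without being changed.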
The world of higher education as well as business in general is becoming
increasingly competitive. Those institutions and businesses that realize the potential
benefit of the information resource first will gain a competitive advantage. As stated
in the closing statement of the White Paper by E. F. Codd and Associates, the quality
of strategic business decisions made as a result of OLAP is significantly higher and
more timely than those made traditionally. Ultimately, an enterprise’s ability to
compete successfully and to grow and prosper will be in direct correlation to the
quality, efficiency, effectiveness and pervasiveness of its OLAP capability. It is,
therefore, incumbent upon IT organizations within enterprises of all sizes, to prepare
for and to provide rigorous OLAP support for their organizations.
• As a DW becomes a mature part of an organization, it is likely that it will
become as “anonymous” as any other part of the Information System (IS).
• One challenge to face is coming up with a workable set of rules that
ensure privacy as well as facilitating the use of large data sets.
• Another is the need to store unstructured data such as multimedia, maps
and sound.
• The growth of the internet allows integration of external data into a DW,
but its varying quality is likely to lead to the evolution of third-party
intermediaries whose purpose is to rate data quality.
1.7 Importance of Data Warehousing
The amount of information produced by today’s large-scale enterprises has
been growing rapidly. Operational sources, such as auction databases and inventory
and order processing systems, are continuously generating new data. In order
to make intelligent business decisions, complex analytical queries are issued and
answered across data sources [117]. For instance, a large retail store’s data
warehouse may collect information from its regional inventory and sales databases.
Once the data warehouse is built, it is used to answer analytical or decision support
queries. Such a data warehouse may be used to answer queries such as:
• In which stores and in which months was a particular item in high demand
but short in supply?
• How much of a specific item did we sell last year, last month or last week in
store XYZ, and how do sales of this item compare across stores?
• What internal factors influenced sales of a specific item, and what external
factors (such as weather) influenced them? and
• How can we help suppliers to reduce their cost?
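The first of these queries can be sketched against a small in-memory table using Python's standard `sqlite3` module. The table `item_month`, its columns and the sample rows are hypothetical, and "high demand but short in supply" is simplified here to "units demanded exceeded units supplied".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item_month ("
    " store TEXT, month TEXT, item TEXT,"
    " units_demanded INTEGER, units_supplied INTEGER)"
)
conn.executemany(
    "INSERT INTO item_month VALUES (?, ?, ?, ?, ?)",
    [
        ("XYZ", "2012-01", "umbrella", 500, 520),
        ("XYZ", "2012-07", "umbrella", 900, 600),
        ("ABC", "2012-07", "umbrella", 700, 650),
        ("ABC", "2012-08", "umbrella", 300, 400),
    ],
)


def shortages(connection, item):
    """Return (store, month, shortfall) rows where demand exceeded supply."""
    cursor = connection.execute(
        "SELECT store, month, units_demanded - units_supplied AS shortfall"
        " FROM item_month"
        " WHERE item = ? AND units_demanded > units_supplied"
        " ORDER BY shortfall DESC",
        (item,),
    )
    return cursor.fetchall()


result = shortages(conn, "umbrella")
```

Because all stores and months sit in one table, the question is answered with a single query instead of one query per regional source system.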
Hence, the importance of Data Warehousing may be understood with
reference to the following examples.
Indian Railways needs a data warehouse for:
− Studying booking pattern of trains,
− Changing quota distribution to ensure maximum utilization,
− Changing train frequency depending on traffic patterns so as to provide
maximum revenue and
− Predicting a season during which additional coaches should be added or
additional trains should be scheduled so as to meet the passenger’s
demand and to maximize revenues.
Such analysis enables the railways to tackle the competition posed by low-cost
travel options offered by competitors.
A data warehouse in Government puts powerful decision-making tools in
the hands of end users in order to facilitate prompt decision making [41][42]. It also
reduces the amount of resources (time and manpower) spent on managing the volumes
and variety of databases handled by their informatics centers. In such a data warehouse,
there are a number of data marts.
1. Data mart for agriculture stores the information of land-holding patterns
across the villages in state/country. It can be used to analyze information
about land-holding amongst citizens, institutions, males, females,
scheduled castes and scheduled tribes, etc.
2. Data mart for amenities contains the census data of village amenities. It
contains information on availability for amenities like education, health,
drinking water, transportation, communication and irrigation. Different
types of analysis can be done e.g. village amenities analysis, irrigation
analysis etc.
3. Weather forecast data mart holds statistical information on daily levels
of rainfall across various weather stations in the state/country. This helps the
concerned authority to plan water supply to various cities and to use
various models to forecast rainfall levels.
4. The health status data mart stores information on the various health
camps conducted across the state/country to detect and cure malaria patients.
It holds vital information such as the number of people suffering from malaria,
deaths caused by malaria, sources of malaria infection, demographic
information of malaria patients, etc. Using the data warehouse, the end
users will be able to plan various precautionary measures to reduce the
number of people suffering from malaria.
Using data mining tools, knowledge is discovered by applying mining
rules to data warehouse data. Knowledge discovery is the creation of knowledge
from structured and unstructured data. Data mining discovers interesting knowledge
from large amounts of data stored in databases, data warehouses, or other
information repositories. Before data mining, various processes are applied to the
data to clean it. The data mining step may interact with the user or a knowledge
base. The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base. Figure 1.7 represents the process of
knowledge discovery.
Figure 1.7 Process of Knowledge Discovery
By performing data mining, interesting knowledge, regularities, or high-
level information can be extracted from databases and viewed or browsed from
different angles. The discovered knowledge can be applied to decision making,
process control, information management and query processing. Therefore, data
mining is considered one of the most important frontiers in database and information
systems and one of the most promising interdisciplinary developments in
information technology.
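The stages of the knowledge discovery process (cleaning and integration, selection, data mining and pattern evaluation) can be sketched as a chain of small functions. This is a toy illustration only. The basket data, the function names and the choice of item-pair counting with a support threshold as the mining step are assumptions, not the method prescribed in this chapter.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical source databases holding purchase baskets.
DB_A = [{"id": 1, "items": ["bread", "milk"]}, {"id": 2, "items": None}]
DB_B = [
    {"id": 3, "items": ["bread", "milk", "eggs"]},
    {"id": 4, "items": ["milk", "eggs"]},
    {"id": 5, "items": ["bread", "milk"]},
]


def clean_and_integrate(*sources):
    """Drop records with missing baskets and merge the sources."""
    return [r for src in sources for r in src if r["items"]]


def select_task_relevant(warehouse):
    """Keep only the attribute needed for the mining task."""
    return [sorted(set(r["items"])) for r in warehouse]


def mine_pairs(baskets):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts


def evaluate(patterns, n_baskets, min_support=0.5):
    """Keep only the patterns whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in patterns.items()
        if count / n_baskets >= min_support
    }


warehouse = clean_and_integrate(DB_A, DB_B)
baskets = select_task_relevant(warehouse)
knowledge = evaluate(mine_pairs(baskets), len(baskets))
```

Each function corresponds to one stage, so the output of pattern evaluation is the only part presented to the user as discovered knowledge.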
[Figure 1.7 labels: Knowledge discovery, Pattern Evaluation, Pattern, Data Mining, Task relevant Data, Selection, Data Warehouse, Data Cleaning and Integration, Databases.]
1.8 Business Intelligence and Data Warehousing (BIDWH)
Success depends on how quickly and in what manner a company responds
to rapidly changing market conditions. Business Intelligence (BI) solutions empower
organizations with the insight necessary to make better decisions faster. BI solutions
channel data and processes into a single source of the truth, providing every
employee with an accurate view of an organization and actionable items based on
clearly defined key performance indicators.
The rapid pace of today’s business environment has made Business
Intelligence (BI) systems indispensable to an organization’s success. BI systems turn
a company's raw data into useable information that can help management identify
important trends, analyze customer behavior and make intelligent business decisions
quickly. Over the past few years, business intelligence systems have been used to
understand and address back office needs such as efficiency and productivity. Now,
organizations are increasingly using BI to analyze customer behavior, understand
market trends, and search for new opportunities.
Figure 1.8 Business Intelligence Process
BI relies on Data warehousing, making cost-effective storing and managing
of warehouse data critical to any BIDW solution. Without an effective data
warehouse, organizations cannot extract the data required for information analysis in
time to facilitate expedient decision-making. The ability to obtain information in
real-time has become increasingly critical in recent years because decision-making
cycle times have been drastically reduced. Competitive pressures require businesses
to make intelligent decisions based on their incoming business data—and do it
quickly. Simply put, the ability to turn raw data into useful information in a timely
manner can add hundreds of thousands—up to millions—of dollars to an
organization’s bottom line.
1.9 Issues of Data Warehousing Design
A data warehouse is a single data repository where data from multiple data
sources is integrated for online analytical processing (OLAP) [65][115].
This implies a data warehouse needs to meet the requirements from all the business
processes within the entire organization. Thus, data warehouse design is a highly
complex, lengthy and thus error-prone process. Furthermore, business analytical
tasks change over time, which results in changes in the requirements for the systems.
Therefore, data warehouse and OLAP systems are rather dynamic and the design
process is continuous.
In industry, data warehouse design takes approaches different from view
materialization. It sees data warehouses as database systems with
special needs such as answering management related queries. The focus of the design
becomes how the data from multiple data sources should be extracted, transformed
and loaded (ETL) to be organized in a database as the data warehouse. There are two
dominant approaches, the “top-down” approach [65] and the “bottom-up” approach
[115].
1.9.1 Top Down Design Approach
In the “Top-Down” design approach, a data warehouse is defined as a
subject-oriented, time-variant, non-volatile and integrated data repository for the
entire enterprise [115]. Data from multiple sources are validated, reformatted and
stored in a normalized (up to 3NF) database as the data warehouse. The data
warehouse stores “atomic” data, the data at the lowest level of granularity, from
where dimensional data marts can be built by selecting the data needed for specific
business subjects or specific departments. The approach is a data driven approach as
the data is gathered and integrated first and then business requirements by subjects
for building data marts are formulated. The advantage of this approach is that it
provides a single integrated data source, thus data marts built from it will have
consistency when they overlap. The Top-Down approach is shown in
Figure 1.9.
Figure 1.9 Top Down Design Approach
With this approach it is possible to apply centralized rules and control, and quick
results may be seen if it is implemented in iterations. The disadvantage of this method
is that the initial effort, cost and time for implementing a data warehouse are
significant. It also needs a high level of cross-functional skills. The development time
from building the data warehouse to having the first data mart available to the users is
substantial, leading to a late Return On Investment (ROI).
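The top-down flow, in which atomic data is held in normalized form and data marts are derived from it, can be sketched as follows. The tables, column names and the function `build_data_mart` are hypothetical, and plain Python structures stand in for a normalized database.

```python
# Normalized (3NF-style) warehouse tables holding atomic data.
# All names and rows are hypothetical.
CUSTOMERS = {1: {"name": "A. Rao", "region_id": 10}}
REGIONS = {10: {"region": "West"}}
ORDERS = [
    {"order_id": 100, "customer_id": 1, "department": "sales", "amount": 250.0},
    {"order_id": 101, "customer_id": 1, "department": "service", "amount": 40.0},
]


def build_data_mart(department):
    """Derive a denormalized mart for one department from the warehouse."""
    mart = []
    for order in ORDERS:
        if order["department"] != department:
            continue
        customer = CUSTOMERS[order["customer_id"]]
        mart.append({
            "order_id": order["order_id"],
            "customer": customer["name"],
            "region": REGIONS[customer["region_id"]]["region"],
            "amount": order["amount"],
        })
    return mart


sales_mart = build_data_mart("sales")
service_mart = build_data_mart("service")
```

Both marts take the customer name and region from the same warehouse tables, which is why marts built this way stay consistent where they overlap.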
1.9.2 Bottom-Up Design Approach
In the “Bottom-Up” approach, a data warehouse is defined as “a copy of
transaction data specifically structured for query and analysis” [115], namely the star
schema. In this approach, data marts are created first to provide reporting and
analytical capabilities for specific business processes (or subjects). Thus it is
considered to be a business-driven approach, in contrast to Inmon's data-driven
approach. Data marts contain the lowest-grain data and, if needed, aggregated data
too. Instead of a normalized database for the data warehouse, a denormalized
dimensional database is adopted to meet the information delivery requirements of
data warehouses. For the set of data marts to serve as the
enterprise data warehouse, the data marts should be built with conformed dimensions in
mind, meaning that common objects are represented in the same way in different data marts.
The conformed dimensions link the data marts to form a data warehouse, which is
usually called a virtual data warehouse. The advantage of the “bottom-up” design
approach is that it has a quick ROI, as creating a data mart, a data warehouse for a
single subject, takes far less time and effort than creating an enterprise-wide data
warehouse. The risk of failure is also lower. The approach is inherently
incremental and allows the project team to learn and grow.
However, the independent development by subjects can result in
inconsistencies if common objects are not integrated and are updated at different
times in the individual data marts. Although the method provides guidelines,
such as a planning stage for the data warehouse, it does not provide effective
techniques, and thus it is not guaranteed that the method resolves the integrity
problem in practice [121][123]. This approach can also lead to a proliferation of
unmanageable interfaces. The bottom-up design approach is shown in Figure 1.10.
Figure 1.10 Bottom Up Design Approach
Table 1.2 Top-Down Design Approach vs. Bottom-Up Design Approach
Top-Down Design Approach | Bottom-Up Design Approach
A truly corporate effort, an enterprise view of data. | A departmental view of data. Its own narrow view.
Inherently architected, not a union of disparate data marts. | Inherently incremental; can schedule important data marts first.
Single, central storage of data about the content. | Departmental data stored.
Centralized rules and control. | Departmental rules and control.
May see quick results if implemented with iterations. | Less risk of failure, favorable return on investment and proof of concepts.
1.10 Basic Architecture of Data Warehousing System
The architecture of a data warehouse includes data sources, wrappers,
monitors, an integrator and the data warehouse data repository (Figure 1.11). The bottom
level of the architecture depicts the data sources, which store day-to-day transaction
data. The monitor component of the architecture is responsible for automatically
detecting changes of interest in the source data and reporting them to the integrator
component. The wrapper component is responsible for translating information from
the native format of the source to a compatible format. New information sources
and changes in existing data sources are propagated to the integrator. The integrator
acts as a liaison between the wrapper/monitor and the data warehouse data repository. It
brings source data into the warehouse data repository, which may include
filtering, summarizing and merging the information. In order to properly integrate
new change information into the data repository, a process is needed that stores
change data from the same or different data sources without affecting the Quality of
Service (QoS).
Figure 1.11 Basic Architecture of Data Warehousing System (Data Source 1, Data Source 2, ..., Data Source n, each with its own Wrapper/Monitor, feeding the Integrator, which loads the Data Warehouse)
The information stored at the warehouse is in the form of derived views of
data from the sources. These views stored at the warehouse are often referred to as
materialized views. The data warehouse itself can use an off-the-shelf or special
purpose database management system. Although Figure 1.11 shows a single,
centralized warehouse, the warehouse may be implemented as a
distributed database system, and in fact data parallelism or distribution may be
necessary to provide the desired performance.
The architecture and basic functionality we have described is more
general than that provided by most commercial data warehousing systems. In
particular current systems usually assume that the sources and the warehouse
subscribe to a single data model (normally relational), that propagation of
information from the sources to the warehouse is performed as a batch process
(perhaps off-line) and that queries from the integrator to the information sources are
never needed.
1.11 Components of Data Warehouse System and Their Problems
A leading issue in database research is to provide integrated access to
multiple, distributed, heterogeneous databases. Each component of a data
warehouse raises its own issues, which need to be addressed. The
components and their problems are given below:
1.11.1 Wrapper/Monitor
The data in the data warehouse is extracted from data sources through
wrapper/monitor components. The wrapper/monitor components have two
interrelated responsibilities.
a) Translation: Making the underlying information source appear as if it
subscribes to the data model used by the warehousing system. For example,
if the information source consists of a set of flat files but the warehouse
model is relational, then the wrapper/monitor must support an interface
that presents the data from the information source as if it were relational.
The translation problem is inherent in almost all approaches to data
integration, both lazy and eager, and is not specific to data warehousing.
Typically, a component that translates an information source into a
common integrating model is called a translator or wrapper. Most
commercial data warehousing systems assume that both the information
sources and the warehouse are relational, so translation is not an issue.
However, some vendors do provide wrappers for other common types of
information sources.
b) Change detection: Monitoring the information source for changes to the
data that are relevant to the warehouse and propagating those changes to
the integrator. Note that this functionality relies on translation since, like the
data itself, changes to the data must be translated from the format and model
of the information source into the format and model used by the
warehousing system.
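The translation responsibility can be illustrated with a minimal sketch. It assumes a hypothetical pipe-delimited customer file and a hypothetical relational schema `CustomerRow`; neither comes from a particular warehousing product, and a real wrapper would also have to handle malformed records.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class CustomerRow(NamedTuple):
    """Hypothetical relational schema expected by the warehouse."""
    customer_id: int
    name: str
    city: str


def wrap_flat_file(lines: Iterable[str], delimiter: str = "|") -> Iterator[CustomerRow]:
    """Present delimited flat-file records as typed relational tuples."""
    for record in csv.reader(lines, delimiter=delimiter):
        if not record:  # skip blank lines in the source file
            continue
        cust_id, name, city = (field.strip() for field in record)
        yield CustomerRow(int(cust_id), name, city)


# A small in-memory stand-in for a flat-file information source.
source = io.StringIO("101|Asha|Pune\n102|Ravi|Mumbai\n")
rows = list(wrap_flat_file(source))
```

The integrator can then treat `rows` like the result of a relational query, regardless of how the source stores its data.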
One approach is to ignore the change detection issue altogether and simply
propagate entire copies of relevant data from the information source to the
warehouse periodically. The integrator can combine this data with existing
warehouse data from other sources or it can request complete information from all
sources and recompute the warehouse data from scratch. Ignoring change detection
may be acceptable in certain scenarios, for example when it is not important
for the warehouse data to be current and it is acceptable for the warehouse to
be off-line occasionally. However, if currency, efficiency and continuous access are
required then we believe that detecting and propagating changes and incrementally
folding the changes into the warehouse will be the preferred solution. In considering
the change detection problem, we have identified several relevant types of
information sources [78]:
a) Cooperative sources: Sources that provide triggers or other active database
capabilities so that notifications of changes of interest can be programmed to
occur automatically.
b) Logged sources: Sources maintaining a log that can be queried or inspected,
so changes of interest can be extracted from the log.
c) Queryable sources: Sources that allow the wrapper/monitor to query the
information at the source so that periodic polling can be used to detect
changes of interest.
d) Snapshot sources: Sources that do not provide triggers, logs, or queries.
Instead periodic dumps, or snapshots, of the data are provided off-line, and
changes are detected by comparing successive snapshots.
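For snapshot sources, change detection by comparing successive snapshots can be sketched as follows. This is an illustrative in-memory version that assumes every record has a primary key and that both snapshots fit in memory; the function name `diff_snapshots` and the sample data are hypothetical, and very large dumps would need a more scalable technique.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[int, Tuple]  # primary key -> remaining column values


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Tuple[str, int, Tuple]]:
    """Derive insert/update/delete records from two successive snapshots."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update", key, row))
    for key, row in old.items():
        if key not in new:
            changes.append(("delete", key, row))
    return changes


yesterday = {1: ("Asha", "Pune"), 2: ("Ravi", "Mumbai"), 3: ("Meena", "Nagpur")}
today = {1: ("Asha", "Pune"), 2: ("Ravi", "Delhi"), 4: ("John", "Goa")}
changes = diff_snapshots(yesterday, today)
# changes holds one update (key 2), one insert (key 4) and one delete (key 3)
```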
Each type of information source capability provides interesting research
problems for change detection.
• In cooperative sources, although triggers and active databases have been
explored in depth, putting such capabilities to use in the warehousing context
still requires addressing the translation aspect; similarly for logged sources.
• In queryable sources, in addition to translation one must consider
performance and semantic issues associated with polling frequency. If the
frequency is too high performance will degrade, while if the frequency is too
low, changes of interest may not be detected in a timely way.
• In snapshot sources the challenge is to compare very large database dumps,
detecting the changes of interest in an efficient and scalable way.
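Polling a queryable source can be sketched with Python's built-in `sqlite3` module. The sketch assumes a hypothetical `orders` table whose `version` column increases on every modification; that column is an assumption of this example, not a feature every source offers. Note that this simple scheme does not detect deleted rows.

```python
import sqlite3


def poll_changes(conn, last_seen_version):
    """Return rows modified since the previous poll and the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, version FROM orders WHERE version > ? ORDER BY version",
        (last_seen_version,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen_version
    return rows, new_watermark


# In-memory stand-in for a queryable information source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, version INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 10.0, 1), (2, 25.0, 2)])

first_batch, watermark = poll_changes(conn, 0)          # initial poll sees both rows
conn.execute("UPDATE orders SET amount = 30.0, version = 3 WHERE id = 2")
second_batch, watermark = poll_changes(conn, watermark)  # later poll sees only the update
```

How often `poll_changes` is called is exactly the trade-off described above: frequent polls load the source, while infrequent polls delay the detection of changes.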
An important related problem in all of these scenarios is to develop
appropriate representations for the changes to the data, especially if a
non-relational model is used.
It may be noted that a different wrapper/monitor component is needed for
each information source, since the functionality of the wrapper/monitor is dependent
on the type of the source (database system, legacy system, etc.) as well as on the data
provided by that source [98]. Clearly, it is undesirable to hard-code a
wrapper/monitor for each information source participating in a warehousing system,
especially if new information sources become available frequently. Hence, a
significant research issue is to develop techniques and tools that automate or
semi-automate the process of implementing wrapper/monitors through a toolkit
or specification-based approach.
1.11.2 Integrator
The ongoing job of the integrator component is to receive change
notifications from the wrapper/monitor component and reflect these changes in the
data warehouse data repository.
At a sufficiently abstract level the data in the warehouse can be seen as a
materialized view or set of views where the base data resides at the information
sources. Viewing the problem in this way, the job of the integrator is essentially to
perform materialized view maintenance. There are a number of reasons that
conventional view maintenance techniques cannot be used, and each of these reasons
highlights a research problem associated with data warehousing.
• Data warehouses may contain a significant amount of historical information,
while the underlying sources may not maintain this information. Relevant
area of research here certainly includes efficient monitoring of historical
information.
• Data warehouses contain highly aggregated and summarized information.
Therefore, efficient view maintenance in the presence of aggregation and
summary information appears to be an open problem.
• The data sources are updated independently, without regard to the warehouse
views defined over them. In this scenario, certain anomalies arise when attempting to keep
views consistent with base data, and algorithms must be used that are
considerably more complicated than conventional view maintenance
algorithms.
• In a data warehousing environment it may be necessary to transform the base
data (sometimes referred to as data scrubbing) before it is integrated into the
warehouse. Transformations might include, for example, aggregating or
summarizing the data, sampling the data to reduce the size of the warehouse,
discarding or correcting data suspected of being erroneous, inserting default
values or eliminating duplicates and inconsistencies.
• Although integrators can be based purely on the data model used by the
warehousing system, a different integrator still will be needed for each
data warehouse since a different set of views over different base data will be
stored. As with wrapper/monitors, it is desirable not to require that each
integrator be hard coded from scratch, but rather to provide techniques
and tools for generating integrators from high-level, nonprocedural
specifications.
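Some of the transformations listed above (eliminating duplicates, discarding suspect values, inserting defaults) can be sketched in a few lines. The record layout, the rule that a negative amount is erroneous and the default value `UNKNOWN` are all assumptions made for this illustration.

```python
def scrub(records, default_city="UNKNOWN"):
    """Apply simple scrubbing rules before records enter the warehouse."""
    seen_ids = set()
    clean = []
    for rec in records:
        if rec["id"] in seen_ids:
            continue  # eliminate duplicates
        amount = rec.get("amount")
        if amount is None or amount < 0:
            continue  # discard data suspected of being erroneous
        city = rec.get("city")
        city = city.strip().title() if city else default_city  # default value
        clean.append({"id": rec["id"], "amount": amount, "city": city})
        seen_ids.add(rec["id"])
    return clean


raw = [
    {"id": 1, "amount": 120.0, "city": " pune "},
    {"id": 1, "amount": 120.0, "city": "Pune"},   # duplicate of id 1
    {"id": 2, "amount": -5.0, "city": "Mumbai"},  # negative amount, discarded
    {"id": 3, "amount": 40.0, "city": None},      # missing city, gets the default
]
cleaned = scrub(raw)
```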
1.12 Motivation
Business data are growing very fast, and these data need to be stored and
analyzed for further use. The amount of data doubles every 18 months, which affects
both response time and the sheer ability to comprehend its contents. System
users are business domain experts, not computer professionals. Business
needs decisions to be made quickly using all available current and past data. A data
warehouse stores views derived from data that may not reside at the warehouse.
These views are called materialized views. Using these materialized views, user
queries can be answered quickly because querying the external sources where the
base data reside is avoided. However, when the data sources change, the views in
the warehouse can become inconsistent with the base data and must be
maintained.
A variety of approaches have been proposed to maintain these views, and they
are classified into two broad categories. Most database systems use eager
maintenance (immediate maintenance), where all affected views are maintained as
part of the update transaction. This method is suitable for views whose base tables
are seldom updated and whose updates are likely to be followed immediately by
queries. The second method is deferred view maintenance, or the
lazy approach, where view maintenance is delayed and takes place only when
explicitly triggered by a user. Under this approach, update transactions do not
maintain views but just store away enough information so that affected views can be
maintained later. Actual maintenance is done when low-priority jobs are running or
when the system has free cycles. Lazy or deferred maintenance allows updates to
complete faster, so locks are released sooner, which reduces the frequency of lock
contention, lock conflicts and transaction aborts. If a view is not up to date when
needed by a query, it is transparently brought up to date before the query is allowed
to access it. In this case, the first beneficiary of the view pays for all or part of the
view's maintenance by experiencing a delay. This approach has a serious drawback:
a query may see an out-of-date view and produce an incorrect result.
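The deferred approach can be sketched as follows. The class `DeferredSumView` is a hypothetical illustration of a view holding total sales per region; it is not the framework proposed in this thesis. Updates only log their changes, and the first query that finds pending changes pays for applying them. An eager variant would apply each change inside `on_update` instead.

```python
class DeferredSumView:
    """Hypothetical view: total sales per region, maintained lazily."""

    def __init__(self):
        self.totals = {}    # materialized view contents
        self.pending = []   # logged changes not yet applied to the view

    def on_update(self, region, delta_amount):
        # The update transaction only logs the change, so it finishes quickly.
        self.pending.append((region, delta_amount))

    def refresh(self):
        # Fold all logged changes into the materialized totals.
        for region, delta_amount in self.pending:
            self.totals[region] = self.totals.get(region, 0.0) + delta_amount
        self.pending.clear()

    def query(self, region):
        if self.pending:
            self.refresh()  # the first reader pays the maintenance cost
        return self.totals.get(region, 0.0)


view = DeferredSumView()
view.on_update("west", 100.0)
view.on_update("west", -30.0)
view.on_update("east", 50.0)
stale_contents = dict(view.totals)  # still empty: nothing has been applied yet
west_total = view.query("west")     # triggers the refresh, then answers
```

Reading `view.totals` directly, as `stale_contents` does, bypasses the refresh and shows how a query could observe an out-of-date view.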
Both view maintenance techniques require access to the base
relations in order to maintain the materialized view. In a data warehousing scenario,
accessing base relations can be difficult since these relations are distributed
across different sources. Often the data sources may be unavailable or, even if
available, the cost of accessing them may be prohibitive due to
communication costs. Accessing the base relations, computing the changes and then
reflecting these changes in the warehouse takes considerable time. For these
reasons, the self-maintainability of the view is an important issue in data
warehousing. We call a view self-maintainable if it can be maintained at the
warehouse without accessing the source data.
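As a small illustration of self-maintainability, consider a view that stores a row count and a total per group. The sketch below is a hypothetical example, not the method developed in this thesis, and assumes change records arrive as (operation, group, amount). Such a view can be kept current from the change records alone, with no access to the base relations. By contrast, a view storing only a per-group maximum could not be maintained this way when the current maximum is deleted.

```python
def apply_delta(view, op, group, amount):
    """Maintain a {group: (count, total)} view from a single change record."""
    count, total = view.get(group, (0, 0.0))
    if op == "insert":
        count, total = count + 1, total + amount
    elif op == "delete":
        count, total = count - 1, total - amount
    else:
        raise ValueError(f"unsupported operation: {op}")
    if count == 0:
        view.pop(group, None)  # the group has no remaining base rows
    else:
        view[group] = (count, total)


# View contents before the changes: two 'west' rows totalling 70.0.
sales_view = {"west": (2, 70.0)}
apply_delta(sales_view, "insert", "east", 50.0)
apply_delta(sales_view, "delete", "west", 30.0)
apply_delta(sales_view, "delete", "west", 40.0)
```

Keeping the count next to the total is what allows deletions to be handled without consulting the sources: when the count reaches zero the group is removed.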
Therefore, research is required to optimize the overall view
maintenance process, to reduce the cost of deriving changes and performing
updates to the warehouse site.
1.13 Objectives of the Research
The following objectives have been set for the present study.
1. To study the Data Warehousing System to understand data model,
architecture and components.
2. To understand materialized view maintenance problem and to analyze
different methods used in view maintenance process.
3. To propose the techniques for materialized view maintenance to
overcome problems in integrator component of a data warehouse.
4. To test the proposed methods and compare them with existing methods.
Studies have been carried out on data warehouse design issues, various
challenges in data warehousing system architecture, efficient change detection,
materialized view maintenance and data warehouse view optimization. Different
maintenance algorithms have also been studied.
1.14 Research Methodology
A literature survey was conducted to study the existing
materialized view maintenance methods. IEEE journals, ACM transactions, IEEE
conference proceedings, the IEEE digital library, online papers and textbooks were
consulted to obtain the latest information about the topic. Data warehouse consultants,
developers and BI developers were consulted to understand the requirements and
to identify the exact problems in the relevant area.
Simulation is one of the most widely used techniques in operations research and
management science. Therefore this approach is used to find out the effect of our proposed
technique as well as to compare the results with those of the existing techniques.
1.15 Organization of the Thesis
The thesis is divided into six chapters consisting of Introduction to Data
Warehousing System, Literature Review, Materialized View Maintenance Methods
and Performance Evaluation, Analytical Models for Materialized View Maintenance
Methods, Proposed Simplification and Optimization of View Maintenance Process
and the last chapter is Conclusions, Limitations and Future Work.
Chapter 1 Introduction to Data Warehousing System
The first chapter covers the preliminary concepts and a general
overview of the Data Warehousing (DWH) system: its need, evolution,
problems, goals and applications. It also describes the future of
Data Warehousing, its importance, general architecture and various design issues in
DWH. Business Intelligence (BI) and its need are explained. The chapter also
describes the motivation of the research work, the statement of the problem, the objectives of
the research, the research methodology and the organization of the thesis.
Chapter 2 Literature Review
The second chapter presents a review of the literature survey undertaken
for the research study. The theory presented here has been collected from books,
articles, research papers and the internet. The processes of materialized view update,
maintenance and selection are also described. Data modeling techniques,
multidimensional databases and the detailed DWH architecture are presented in this
chapter. The various materialized view maintenance techniques found in the literature
are classified into appropriate categories and compared in this chapter.
Chapter 3 Materialized View Maintenance Methods and Performance Evaluation
The third chapter deals with the problem of materialized view maintenance
and maintenance overhead in data warehousing. The standard approaches to the
materialized view maintenance process are described. Eager maintenance or
Incremental View Maintenance (IVM) and lazy or Deferred View Maintenance
(DVM) are explained. We have proposed a materialized view maintenance framework
using a maintenance manager which keeps track of active view maintenance tasks.
Scheduling of maintenance tasks is also explained in this chapter. We have also
checked the effect of maintenance tasks on the response time of queries. To verify the
feasibility and effectiveness of these standard view maintenance strategies, an
experimental study is carried out and the results are compared. The results are
calculated using a single view, two views and multiple views.
Chapter 4 Analytical Models for Materialized View Maintenance Methods
In the fourth chapter we have categorized the maintenance methods and
proposed an analytical model. The results are calculated and compared to show the
best performance from these models and methods. The concept of unrestricted base
access and run time view maintenance are also explained in detail with the
advantages and comparison of various analytical models. We have considered the
parameter total amount of space required to store the change at the data warehouse
site.
Chapter 5 Proposed Simplifications and Optimization of View Maintenance
Process
The fifth chapter describes the proposed view maintenance method, which
uses secondary relations to store the intermediate results of the view
at the data warehouse site. Whenever the data sources change, the changes are
computed incrementally and stored in these secondary relations. These changes
are subsequently sent to the higher level secondary relations. At the time of view
maintenance, the entire contents of the final view are integrated into the warehouse
views. A characteristic of this method is that the entire view maintenance process is
hidden from the data warehouse user and does not affect warehouse
performance. An experimental model has been developed and the results are
compared with the existing techniques.
Chapter 6 Conclusions, Limitations and Future Work
The sixth chapter describes the conclusions, limitations and future work.
Here conclusions are drawn and the limitations of the research work are given for the
objectives defined for the research. Directions for future research are also described,
and finally the author's list of publications relevant to the thesis is given.
1.16 Conclusions
In today’s fast-paced and ever-changing economy, information is seen as a
key business resource for gaining market advantage. To compete in today’s turbulent
market, organizations need to do considerable market research to find out what
exactly people want rather than what they need. The last three decades have seen an
exponential growth in the area of information technology, catering to the information
processing needs of business in the form of capturing, storing, analyzing and
transferring data that will help knowledge workers and decision makers make sound
business decisions. This is exactly where Data Warehousing comes into the picture.
Data Warehousing is the foundation of the Decision Support System (DSS),
whose goal is to enable decision makers to make better business decisions based
on analysis of historic data related to the business operation. Data warehousing has
become a major business trend, both for product and service sectors and for
application to daily business in all industries. Without a data warehouse it is difficult
to answer analytical queries because the data sources are distributed and
heterogeneous. In the data warehousing area in general, the problems that receive
the most attention are data integration, extraction and transformation, and data warehouse
design and maintenance.
Before complex analytical queries are executed, the ETL process
needs to be performed on the data warehouse, so that the users of the data warehouse
get the latest, integrated data. Business analysts execute their queries over this
centralized data repository to gain insights into the data.
Traditionally, data warehouses have been used to provide storage and
analysis of large amounts of historical data. In a typical data warehouse, updates
occur in batches at regular time intervals (e.g., every night). At all other times, the
data warehouse is regarded as a “read-only” database, where users can pose
long-running decision support queries.
Chapter Summary
This chapter gives an insight into the area of Data Warehousing, the
problems and the work focused on in the research study. An introduction to the various
research issues in the field of data warehousing is given. We have discussed the
evolution of data warehousing, its architecture and its challenges. An overview
of the applications of data warehousing is also presented. As a data warehouse
becomes a mature part of an organization, it is likely that it will become as
anonymous as any other part of the information system. In this regard the future of
the data warehouse is discussed. As a business grows more complex, sustaining it
requires valuable and concise information. The diagram given in the section on the
importance of data warehousing shows the process of knowledge discovery.
The general architecture of data warehousing is divided into three
parts, namely the data sources, the ETL component and the data warehouse. There are two
broad approaches in data warehouse design: the top-down approach and the bottom-up
approach. Each approach has its pros and cons. In the later part of this
chapter, the motivation, the objectives of our research study and the organization of the
thesis have been presented.