DATA WAREHOUSING & DATA MINING
By
Name: N. Sravan Surya Kumar / T. Pavan Kumar
Branch: III IT (051239) / III IT (051229)
Email: [email protected] / pavankumar_38@yahoo.com
St. Ann's College of Engineering & Technology
ABSTRACT
One may claim that the exponential growth in the amount of data provides great
opportunities for data mining. In many real-world applications, however, the number of sources
over which this information is fragmented grows at an even faster rate, creating barriers to the
widespread application of data mining. A data warehouse is designed especially for decision-support
queries.
Data warehousing is the process of extracting and transforming operational data into
informational data and loading it into a central data store or warehouse.
The idea behind data mining, then, is the "non-trivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data".
Data mining is concerned with the analysis of data and the use of software techniques
for finding patterns and regularities in sets of data. Data mining's potential can be enhanced if
the appropriate data has been collected and stored in a data warehouse.
Data warehousing provides the means to change raw data into information for
making effective business decisions; the emphasis is on information, not data. The data
warehouse is the hub for decision-support data.
This paper also explains the partition algorithm for discovering all frequent item sets from the
data warehouse using data mining, and the relation between operational data, the
data warehouse and data marts.
DATA WAREHOUSE & DATA MINING
Every day organizations, both large and small, generate billions of bytes of data
related to all aspects of their business. But locked up in a variety of systems, most of this data is
extremely difficult to access. Only a very small part of the data captured, processed and stored is
available to decision makers.
INTRODUCTION
What is a data warehouse?
A data warehouse, in its simplest perception, is no more than a collection of
the key pieces of information used to manage and direct the business for the most
profitable outcome.
A large amount of the right information is the key to survival in today's
competitive environment, and this kind of information can be available only if there is a totally
integrated enterprise data warehouse.
A data warehouse is a repository of integrated information, available for queries and
analysis. For such a repository, data and information are extracted from heterogeneous sources
and consolidated into a single source. This makes it much easier and more efficient to query the data.
There are two fundamentally different types of information systems in
enterprises: operational systems and informational systems.
Operational systems run the enterprise's daily operations, for example ERP (enterprise resource
planning) systems. Informational systems analyze the data to make decisions on how the enterprise will
operate. Not only do informational systems have a different focus from operational ones, they often
have a different scope altogether.
There are some specific rules that govern a basic warehouse, namely that such a
structure should be:
Time dependent: that is, containing information collected over time, which implies there must
always be a connection between the information in the warehouse and the time when it was entered.
This is one of the most important aspects of a warehouse as it relates to data mining, because
information can then be stored according to period.
Non-volatile: that is, data in a data warehouse is never updated but used only for queries. Such
data is only loaded from other databases, such as the operational database. End users who want
to update data must use the operational databases, as only the latter can be updated, changed and
deleted. This means that a data warehouse will always be filled with historical data.
Subject oriented: that is, built around the subjects of the business rather than around the existing
applications of the operational data. Not all the information in the operational database is useful
for the data warehouse, since the data warehouse is designed specially for decision support while
the operational database contains day-to-day information.
Integrated: that is, it reflects the business information of the organization. In an operational
environment you will find many types of information being used in a variety of applications, and
some applications will use different names for the same entities. In a data warehouse, however,
it is essential to integrate this information and make it consistent; only one name must exist to
describe each entity.
A data warehouse is designed especially for decision-support queries; therefore only
data that is needed for decision support is extracted from the operational data and stored in the
warehouse.
Need for a DATA WAREHOUSE
1. To summarize large volumes of data.
2. To integrate data from different sources.
3. To allow decision makers to access past data.
4. To enable people to make informed decisions.
Users
From the definition we can infer that the data warehouse users are as follows:
1. This person's job involves drawing conclusions from, and making decisions
based on, large masses of data.
2. This person does not want to get involved with finding and organizing the
data for this purpose.
3. This person also does not want to access a database in a highly technical fashion.
STRUCTURE OF DATA WAREHOUSE
Data warehousing is one of the hottest industry trends, for good reason. The
structure of a data warehouse consists of the following:
Physical data warehouse
Logical data warehouse
Data marts
The physical data warehouse is where all the data for the data warehouse is stored, along
with meta data and processing logic for scrubbing, organizing, packaging and processing the detail
data.
The logical data warehouse also contains meta data but does not contain actual data.
Instead it contains the information necessary to access the data wherever it resides.
A data mart is a subset of an enterprise-wide data warehouse, which typically
supports one enterprise element.
DATA MARTS
Data marts are partitions of the overall data warehouse, and may contain
overlapping data. The task of implementing a data warehouse can be a very big effort,
taking a significant amount of time. One feasible option is to start with a set of data
marts, one for each department. A set of smaller, manageable databases is called data
marts. A data mart can be stand-alone or dependent.
Stand-alone data mart: a data mart with minimal or no impact on the enterprise's
operational databases.
Dependent data mart: similar to a stand-alone data mart, except that
management of the data sources by the enterprise database is required. These data sources include
operational databases and external sources of data.
DATA WAREHOUSE ARCHITECTURE
The architecture of an information system refers to the way its pieces are laid out,
what types of tasks are allocated to each piece, how the pieces interact with each other and how
they interact with the outside world. The architecture of a data warehouse is shown in the figure.
[Figure: data is turned into information and decisions; data dippers and OLAP tools access the warehouse through the query manager.]
FIG. DATA WAREHOUSE ARCHITECTURE
The architecture consists of the following components:
1. Load manager
2. Warehouse manager
3. Query manager
Each component performs a specific process.
Load Manager
It is constructed using a combination of off-the-shelf tools, bespoke coding,
C programs and shell scripts. It performs the following operations:
Extracts the data from the source systems.
Fast-loads the extracted data into a temporary data store.
Performs simple transformations into a structure similar to the one in the data
warehouse.
Warehouse Manager
It is constructed using a combination of third-party systems-management software,
bespoke code, C programs and shell scripts. It supports warehouse management processes,
such as transforming data, and backup and archiving of the data warehouse.
Query Manager
It is constructed using a combination of user access tools, specialist data warehousing
monitoring tools, native database facilities, bespoke coding, C programs and shell
scripts. It performs the following operations:
Directs queries to the appropriate tables.
Schedules the execution of user queries.
DATA WAREHOUSE AND BACK-END PROCESSES
DATA EXTRACTION
- gathers data from multiple heterogeneous and external sources.
DATA CLEANING
- detects errors in the data and rectifies them when possible.
DATA TRANSFORMATION
- converts data from legacy or host format to the data warehouse format.
LOADING
- sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.
REFRESH
- propagates the updates from the data sources to the warehouse.
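The back-end steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL pipeline: the CSV source, the field names (`cust_id`, `amount`, `region`) and the summary schema are all invented for the example.

```python
import csv
import io

# Hypothetical operational source: a small CSV extract.
SOURCE = """cust_id,amount,region
1,100.50,south
2,,north
3,75.00,South
"""

def extract(text):
    """EXTRACTION: gather rows from a source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """CLEANING: detect errors; here, drop rows with missing amounts."""
    return [r for r in rows if r["amount"]]

def transform(rows):
    """TRANSFORMATION: convert to warehouse format (typed, consistent)."""
    return [{"cust_id": int(r["cust_id"]),
             "amount": float(r["amount"]),
             "region": r["region"].lower()} for r in rows]

def load(rows, warehouse):
    """LOADING: store detail rows and compute a summary on the way in."""
    warehouse["detail"] = rows
    warehouse["summary"] = {"total": sum(r["amount"] for r in rows)}

warehouse = {}
load(transform(clean(extract(SOURCE))), warehouse)
print(warehouse["summary"])   # {'total': 175.5}
```

Note how the row with a missing amount is removed before loading, and how `"South"` is normalized to `"south"` so that names from unrelated sources are reconciled.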
DESIGNING A DATA WAREHOUSE
Designing a data warehouse requires specialist knowledge of data design, because
the data model consists of data needed by users who want access at high speed, and so the data
design for a warehouse can differ from that of operational databases.
In a data warehouse, an end user may want to make joins across many tables,
and this can place tremendous demands on the system. For that reason, the data warehouse
requires a high-speed machine and a wide variety of optimization processes.
META DATA
In setting up a data warehouse, the end user and the administrator must have access
to information about all the tables and attributes. They will want to know a number of things,
such as where the data is located, what data exists, what data type or format it is in, how this
data relates to other data in other databases, where the data comes from and goes to, and whom
the data belongs to. For these reasons, another database containing the so-called meta data is
needed, which describes the structure and contents of the databases.
Meta data can exist in any of three forms:
1. Human meta data
2. Computer-based meta data for people to use
3. Computer-based meta data for computers to use
Human meta data: People always have some sort of meta data in their heads or in their files.
Computer-based meta data for people to use: Data warehouse developers often store the
descriptive data in its own database. This provides a comprehensive guide to the data resource.
Computer-based meta data for computers to use: If the meta data items are stored in a
well-structured, computer-readable form, they can be read by a DBMS. This smooths the
interaction between users and the warehouse.
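As a sketch of the third form, one computer-readable meta data record might look like the following. Every name here (the table, the source, the columns) is hypothetical, invented purely to illustrate the kind of questions meta data answers: what data exists, where it is located, what format it is in, and where it comes from.

```python
# One illustrative meta data record for a (hypothetical) warehouse table.
meta = {
    "table": "sales_summary",
    "location": "warehouse.sales_summary",      # where the data is located
    "source": "operational.orders",             # where the data comes from
    "refresh": "nightly",
    "columns": {
        "region": {"type": "text",    "description": "Sales region"},
        "total":  {"type": "decimal", "description": "Total sales"},
    },
}

def describe(meta):
    """Answer the end user's questions from the record alone."""
    cols = ", ".join(f"{c} ({v['type']})" for c, v in meta["columns"].items())
    return f"{meta['table']} at {meta['location']} from {meta['source']}: {cols}"

print(describe(meta))
```

Because the record is structured rather than free text, a DBMS or catalog tool could query it directly, which is exactly the advantage of the computer-readable form over meta data kept in people's heads or files.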
BACK FLUSHING
[Figure: data acquisition — data from databases and other inputs is cleaned and reformed into the data warehouse, which feeds OLAP, DSS, EIS and data mining tools.]
Acquiring data for the warehouse involves the following steps:
The data must be extracted from multiple, heterogeneous sources.
Data must be formatted for consistency within the warehouse. Names, meanings and
domains of data from unrelated sources must be reconciled.
The data must be cleaned to ensure validity. For input data, cleaning must occur before
the data is loaded into the warehouse. Recognizing erroneous and incomplete data is
difficult to automate, and cleaning that requires automatic error correction can be even
tougher. The source systems will likely want to update their data with the cleaned data.
The process of returning cleaned data to the source is called backflushing.
The data must be fitted into the data model of the warehouse. Data from the various sources
must be installed in the warehouse's data model; data may have to be converted from
relational, object-oriented or legacy databases to a multidimensional model.
The data must be loaded into the data warehouse. The sheer volume of data in the warehouse
makes loading the data a significant task.
Two basic techniques are used to build a data warehouse, known as the 'top-down'
and 'bottom-up' approaches. In the 'top-down' approach, we first build a data warehouse and from
that we select the information needed to design data marts. In the 'bottom-up' approach, the data
marts are designed first, and from them we design the data warehouse.
[Figure: the relationship between operational data, the data warehouse and data marts — data is extracted from several operational databases into the data warehouse, from which the data marts are derived.]
Functionality: The data warehouse access component supports enhanced spreadsheet
functionality, efficient query processing, structured queries, ad hoc queries, data mining and
materialized views. In particular, enhanced spreadsheet functionality includes support for
state-of-the-art spreadsheet applications as well as for OLAP application programs.
These offer programmed functionality such as the following:
ROLL-UP: data is summarized with increasing generalization.
DRILL-DOWN: increasing levels of detail are revealed.
PIVOT: cross tabulation is performed.
SELECTION: data is available by value or range.
DERIVED ATTRIBUTES: attributes are computed by operations on stored and
derived values.
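A minimal sketch of how ROLL-UP and DRILL-DOWN behave over a tiny, invented fact table, in plain Python. Real OLAP servers precompute such aggregates; here the same aggregation function simply keeps fewer or more dimensions.

```python
from collections import defaultdict

# Invented fact table: (region, product, quarter, sales).
facts = [
    ("south", "tv",    "Q1", 10),
    ("south", "tv",    "Q2", 15),
    ("south", "radio", "Q1",  5),
    ("north", "tv",    "Q1", 20),
    ("north", "radio", "Q2",  8),
]

def roll_up(facts, dims):
    """Summarize sales, keeping only the dimensions listed in `dims`.
    Fewer dimensions = ROLL-UP; adding dimensions back = DRILL-DOWN."""
    totals = defaultdict(int)
    for region, product, quarter, sales in facts:
        row = {"region": region, "product": product, "quarter": quarter}
        totals[tuple(row[d] for d in dims)] += sales
    return dict(totals)

# Fully detailed view: all three dimensions.
print(roll_up(facts, ["region", "product", "quarter"]))
# ROLL-UP to region level: south totals 30, north totals 28.
print(roll_up(facts, ["region"]))
# DRILL-DOWN from region to region x product reveals more detail.
print(roll_up(facts, ["region", "product"]))
```

A PIVOT would present the same totals as a cross-tabulation, e.g. regions as rows and products as columns, without changing the underlying aggregation.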
PARTITION ALGORITHM TO DISCOVER ALL FREQUENT SETS FROM THE
DATA WAREHOUSE USING DATA MINING
INTRODUCTION TO DATA MINING
Data mining, or knowledge discovery in databases, is the non-trivial extraction of
implicit, previously unknown and potentially useful information from data. This
encompasses a number of technical approaches, such as clustering, data summarization, finding
dependency networks, classification, analyzing changes, and detecting anomalies. Data mining
searches for relationships and global patterns that exist in large databases but are hidden
among the data, such as the relationship between patient data and medical diagnoses. These
relationships represent valuable knowledge about the database and the objects in it, if the
database is a faithful mirror of the real world it registers. Data mining refers to using a
variety of techniques to identify nuggets of information or decision-making knowledge in the
database and extracting them in such a way that they can be put to use in areas such as
decision support, prediction, forecasting and estimation. A particular example is finding
associations between items in a database of customer transactions; this market basket analysis
technique is used to group items together. A rule may contain more than one item in the
antecedent and the consequent. In this paper we concentrate on finding associations, but with a
different slant, i.e. by using the partition algorithm. In the next section we review the basic
concepts of association rules.
BASICS
Let A = {l1, l2, l3, ..., lm} be a set of items, and let T, the transaction database,
be a set of transactions, where each transaction t is a set of items; thus t is a subset of A. A
transaction t is said to support an item li if li is present in t, and t is said to support a
subset X of A if t supports each item l in X. An item set X of A has support s in T, denoted
S(X)T = s, if s% of the transactions in T support X. Support can also be expressed as a
fraction: the proportion of transactions in T supporting X. For a given transaction database T,
an association rule is an expression of the form X => Y, where X and Y are subsets of A. The
rule X => Y holds with confidence c if c% of the transactions in T that support X also support
Y, and has support s in T if s% of the transactions in T support X U Y.
Each rule has a left-hand side and a right-hand side. The left-hand side is called
the antecedent and the right-hand side is called the consequent. In general, both the
left-hand side and the right-hand side may contain multiple items. Confidence (or
predictability) measures how much a particular item is dependent on another. Support does not
depend on the direction (or implication) of the rule; it depends only on the set of items in
the rule.
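The support and confidence definitions above translate directly into code. A small sketch, with an invented basket of transactions:

```python
# Invented transaction database: each transaction is a set of items.
T = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """How often transactions that support X also support Y."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# Rule {bread} => {milk}: support counts transactions with both items,
# so direction does not matter for support, only for confidence.
print(support({"bread", "milk"}, T))       # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, T))  # 2 of the 3 bread transactions
```

Note that `support({"bread","milk"}, T)` equals `support({"milk","bread"}, T)`, confirming that support ignores the rule's direction, while confidence divides by the antecedent's support and so is directional.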
The discovery of association rules is the most well-studied problem in data
mining. Many interesting algorithms have been proposed recently; here we shall discuss the
partition algorithm for finding associations. The features of any efficient algorithm are (a) to
reduce the I/O operations, and (b) at the same time to be efficient in computing.
PARTITION ALGORITHM
The partition algorithm is based on the observation that the frequent sets are normally
very few in number compared to the set of all item sets. The algorithm uses two
scans of the database to discover all frequent sets. The first scan produces a set of candidates
that is a superset of all frequent item sets, i.e. it may contain false positives. The algorithm
executes in two phases. In the first phase, the partition algorithm logically divides the database
into a number of non-overlapping partitions; the partitions are considered one at a time and all
frequent item sets for each partition are generated. These local frequent sets are then merged into
a global candidate set whose actual support is counted in the second scan. The partition algorithm
is as follows.
P = partition_database(T); n = number of partitions
for i = 1 to n do begin                                    // Phase 1
    read_in_partition(Ti in P)
    Li = generate all frequent item sets of Ti using the apriori method in main memory
end
for (k = 2; Lik != empty, i = 1, 2, ..., n; k++) do begin  // Merge phase
    CGk = Union(i = 1..n) Lik
end
for i = 1 to n do begin                                    // Phase 2
    read_in_partition(Ti in P)
    for all candidates c in CG compute S(c)Ti
end
LG = { c in CG | S(c)T >= sigma }
Answer = LG
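A runnable sketch of the two-phase idea in Python. For brevity a brute-force frequent-set enumerator stands in for the apriori step inside each partition, and the tiny transaction database is invented; the partition structure (local generation, merge, global counting) follows the pseudocode above.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """All item sets whose support in `transactions` meets min_sup.
    (Brute-force enumeration stands in for apriori here.)"""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            if sum(set(cand) <= t for t in transactions) / n >= min_sup:
                frequent.add(cand)
    return frequent

def partition_algorithm(transactions, n_parts, min_sup):
    """Phase 1: locally frequent sets per partition; merge into the
    candidate set CG; Phase 2: count each candidate's global support."""
    size = -(-len(transactions) // n_parts)   # ceiling division
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # Phase 1 + merge: union of the locally frequent item sets.
    CG = set().union(*(frequent_itemsets(p, min_sup) for p in parts))
    # Phase 2: a second scan computes the global support of each candidate.
    n = len(transactions)
    return {c for c in CG
            if sum(set(c) <= t for t in transactions) / n >= min_sup}

T = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"c"}]
print(sorted(partition_algorithm(T, 2, 0.4)))
```

The candidate set CG may contain false positives (locally frequent but globally rare sets); Phase 2 filters these out, and no globally frequent set can be missed because such a set must be locally frequent in at least one partition.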
EXAMPLE:
Let us take the database T and, for the sake of illustration, partition T into
three partitions T1, T2, T3, each containing 5 transactions. The first partition T1 contains
transactions 1 to 5, T2 contains transactions 6 to 10, and similarly T3 contains transactions 11
to 15. We fix the local support equal to the given support, that is 20%. Thus, any item set that
appears in at least one of the transactions in a partition is a local frequent set in that partition.
A1 A2 A3 A4 A5 A6 A7 A8 A9
1 0 0 0 1 1 0 0 1
0 1 0 1 0 0 0 1 0
0 0 0 1 1 0 1 0 0
0 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0
0 1 1 1 0 0 0 0 0
0 1 0 0 0 1 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 1 0 1 0 1 0 0
0 0 0 0 1 1 0 1 0
0 1 0 1 0 1 1 0 0
1 0 1 0 1 0 1 0 0
0 1 1 0 0 0 0 0 1
L1 = { {1},{2},{3},{4},{5},{6},{7},{8},{9}, {1,5},{1,6},{1,8}, {2,3},{2,4},
{2,8},{4,5},{4,7},{4,8},{5,6},{5,7},{5,8},{6,7},{6,8}, {1,6,8},{1,5,6},
{1,5,8},{2,4,8},{4,5,7},{5,6,8},{5,6,7}, {1,5,6,8} }
similarly
L2 = { {2},{3},{4},{5},{6},{7},{8},{9}, {2,3},{2,4},{2,7},{2,9},{3,4},{3,5},{3,7},
{5,7},{6,7},{6,9},{7,9}, {2,3,4},{2,6,7},{1,5,8},{2,6,9},{2,7,9},{3,5,7}, {2,6,7,9} }
L3 = { {1},{2},{3},{4},{5},{6},{7},{8},{9}, {1,3},{1,5},{1,7}, {2,3},{2,4},
{2,6},{2,7},{2,9},{3,5},{3,7},{3,9},{4,6},{4,7},{5,6},{5,7},{5,8},{6,7},{6,8},
{1,3,5},{1,5,7},{2,3,9},{2,4,6},{2,4,7},{3,5,7},{4,6,7},{5,6,8}, {2,4,6,7} }
In Phase II, we have the candidate set
CG = L1 U L2 U L3
CG = { {1},{2},{3},{4},{5},{6},{7},{8},{9}, {1,3},{1,5},{1,6},{1,7},{1,8},{2,3},{2,4},{2,6},
{2,7},{2,8},{2,9},{3,4},{3,5},{3,7},{3,9},{4,5},{4,6},{4,7},{4,8},{5,6},{5,7},{5,8},
{6,7},{6,8},{6,9},{7,9}, {1,3,5},{1,3,7},{1,5,6},{1,5,7},{1,5,8},{1,6,8},
{2,3,4},{2,3,9},{2,4,6},{2,4,7},{2,4,8},{2,6,7},{2,6,9},{2,7,9},{3,5,7},{4,5,7},{4,6,7},
{5,6,7},{5,6,8}, {1,5,6,8},{2,6,7,9},{1,3,5,7},{2,4,6,7} }
ADVANTAGES
- Because data warehouses are free from the restrictions of the transactional environment,
there is an increased efficiency in query processing.
- Artificial intelligence techniques, which may include genetic algorithms and neural
networks, are used for classification and are employed to discover knowledge from
the data warehouse that may be unexpected or difficult to specify in queries.
APPLICATIONS
Data warehousing can be a key differentiator in many industries. At present,
some of the most popular data warehouse applications include:
Sales and marketing analysis across all industries.
Inventory turn and product tracking in manufacturing.
Category management, vendor analysis, and marketing program effectiveness
analysis in retail.
Profitability analysis or risk assessment in banking.
Claims analysis or fraud detection in insurance.
Data mining has many and varied fields of application, such as:
a. Retail/Marketing
Identify buying patterns of customers.
Find associations among customer demographic characteristics.
Predict response to mailing campaigns.
Market basket analysis.
b. Banking
Detect patterns of fraudulent credit card use.
Identify 'loyal' customers.
Determine credit card spending by customer groups.
Find hidden correlations between different financial indicators.
c. Medicine
Characterize patient behavior to predict office visits.
Identify successful medical therapies for different illnesses.
d. Transportation
Determine the distribution schedule among outlets.
Analyze loading patterns.
e. Insurance and Health Care
Claims analysis, i.e. which medical procedures are claimed
together.
Predict which customers will buy new policies.
Identify behavior patterns of risky customers.
Identify fraudulent behavior.
HOW DATA WAREHOUSING & DATA MINING ARE USEFUL IN GOVERNMENT
A large number of data warehouses can be identified from existing data sources
within the central government ministries. Potential areas in which data warehouses may be
developed, now and in the future, include:
CENSUS DATA, AGRICULTURE, RURAL DEVELOPMENT, HEALTH PLANNING,
EDUCATION, COMMERCE AND TRADE.
OTHER SECTORS:
Tourism, Programme implementation, Revenue, Economic affairs, Audit and
Accounts.
CRITICAL ISSUES
Data warehousing helps businesses make informed decisions. But there are a few
critical issues that must be faced head-on while designing and implementing a data
warehouse. These issues are as follows:
Capacity planning
Security, backup and recovery
Service level agreements
Performance tuning
Testing
Implementation obstacles
CONCLUSION:
Data warehousing provides the means to change raw data into information for making
effective business decisions; the emphasis is on information, not data. The data warehouse is the
hub for decision-support data. Comprehensive data warehouses that integrate operational data
with customer, supplier, and market information have resulted in an explosion of information.
Competition requires timely and sophisticated analysis on an integrated view of the data.
Data mining tools can enhance the inference process and speed up the design cycle, but they
cannot substitute for statistical and domain expertise. Data mining allows for the creation of a
self-learning organization.
The future of data warehouses lies in their accessibility from the internet. Successful
implementation of a data warehouse and data mining requires a high-performance, scalable
combination of hardware and software which can integrate easily with existing systems, so
customers can use the data warehouse to improve their decision making, and their competitive
advantage.
A good data warehouse provides the RIGHT data... to the RIGHT PEOPLE... at the
RIGHT time... RIGHT now! While data warehousing organizes data for business analysis, the
internet has emerged as the standard for information sharing.
REFERENCES:
Data Mining Techniques – Arun K. Pujari
Data Warehousing, Data Mining and OLAP – Berson & Smith, McGraw-Hill
Data Mining Techniques, Tools and Trends – Bhavani Thuraisingham
Database Systems – Elmasri, Tata McGraw-Hill