58528578-SEM-6-BC0058-1-Data-WareHouse

BCA VI-Sem BC0058-(01) – Data Warehousing

1. What are Data Warehouse requirements? How do you gather the requirements?

Ans:

Requirements analysis is fundamental to the design and development of business

intelligence. The user requirements drive decisions about the data to incorporate in the data

warehouse, how to organize (dimension) the data and how often to refresh the data. User

requirements also determine the type of tool required and how the user will interact with the

data. Data granularity decisions should not be driven simply by the availability of data -- the

users should have a need for the data beyond their current abilities.

Business intelligence requirements analysis has three fundamental steps:

1. Identify the data and granularity the users interact with, or would like to interact with,

via interviews.

2. Develop a data model to provide the data at the appropriate granularity.

3. Determine how the data will be sourced to this data model.

There are other related tasks which also need to be addressed as a result of the

requirements analysis process such as: query/OLAP tool features to support the user's

interactions with the data; hardware infrastructure required to load, store and provide data to

the end user; source data analysis; archiving strategy; and sizing.

Data Warehouse Requirements Gathering

The initial requirements are identified through interviews, with a representative set of end

users. In preparation for the interviews, it's often useful for the end users to collect a sample

of the reports they work with in reporting and analyzing their data as well as screen shots of

any tools they useWhat are your business objectives?

• How would you interpret data set results?

• How should the data you work with be organized? Should it be organized by

customer, product, geography and time? Should it be organized by account, salesperson,

distribution channel and month?

Suresh Kumar Suthar Roll : 52077676 1


• What are the hierarchies, rollups or aggregations used with these dimensions? Do

customers roll up to geographies that roll up to total? Do products roll up to product

groups? Do salespeople roll up to districts that roll up to regions? What types of summary

reports do you work with?

• What are the measures or facts you work with (e.g., revenues, expenses, balances,

variances, percent growth, percent of total)? How are they defined?

• Do you "filter" the data? Do you need data only at the top and bottom accounts? Do

you review the performance of only certain types of products? Do you segment the data

based on demographics?

• How often do you obtain refreshes of the data? Do you obtain them daily, weekly,

monthly or quarterly? Do you need it this often?

• What tools do you use to interact with the data?

• Is the data clean?

• Do you receive the data in a timely fashion?

• Do the tools you use support your requirements?

• What types of things would you like to do that you can't do today?

• What is your tool functionality?

• What is your data availability?

• Do you spend most of your time on analysis or preparation for analysis?



2. What is Data Transformation? Explain the tasks involved in it.

Ans:

Data transformations are often the most complex and, in terms of processing time, the most

costly part of the extraction, transformation, and loading (ETL) process. They can range

from simple data conversions to extremely complex data scrubbing techniques. Many, if not

all, data transformations can occur within an Oracle database, although transformations are

often implemented outside of the database (for example, on flat files) as well.

In metadata, a data transformation converts data from a source data format into destination data.

Data transformation can be divided into two steps:

1. data mapping maps data elements from the source to the destination and captures any transformation that must occur

2. code generation that creates the actual transformation program

Data element to data element mapping is frequently complicated by complex transformations that require one-to-many and many-to-one transformation rules.

The code generation step takes the data element mapping specification and creates an executable program that can be run on a computer system. Code generation can also create transformation in easy-to-maintain computer languages such as Java or XSLT.

When the mapping is indirect via a mediating data model, the process is also called data mediation.

Basic Tasks in Data Transformation

The data transformation contains the following basic tasks:

Selection: This takes place at the beginning of the whole process of data transformation.

We select either whole records or parts of several records from the source system. The task

of selection usually forms part of the extraction function itself.

Splitting/Joining: This task includes the types of data manipulation we need to perform on

the selected parts of source records. Sometimes we will be splitting the selected parts even


http://en.wikipedia.org/wiki/Metadata

http://en.wikipedia.org/wiki/Data_model

http://en.wikipedia.org/wiki/XSLT

http://en.wikipedia.org/wiki/Java_(programming_language)

http://en.wikipedia.org/wiki/Executable_program

http://en.wikipedia.org/wiki/One-to-many

http://en.wikipedia.org/wiki/Transformation

http://en.wikipedia.org/wiki/Code_generation

http://en.wikipedia.org/wiki/Data_element

http://en.wikipedia.org/wiki/Data_mapping


further during data transformation. Joining of parts selected from many source systems is

more widespread in the Data Warehouse environment.

Conversion: This is an all-inclusive task. It includes a large variety of rudimentary

conversions of single fields for two primary reasons – one to standardize among the data

extraction from disparate source systems, and the other to make the fields usable and

understandable to the users.

Summarization: Sometimes we may find that it is not feasible to keep data at the lowest

level of details in your Data Warehouse. It may be that none of our users ever need data at

the lowest granularity for analysis or querying. For example, for a grocery chain, sales data

at the lowest level of details for every transaction at the checkout may not needed. Storing

sales byproduct by store by day in the Data Warehouse may be quite adequate. So, in this

case, the data transformation function includes summarization of daily sales by product and

by store.

Enrichment: This task is the rearrangement and simplification of individual fields to make

them more useful for the Data Warehouse environment. We may use one or more fields

from the same input record to create a better view of the data for the Data Warehouse. This

principle is extended when one or more fields originate from multiple records, resulting in a

single field for the Data Warehouse.



4. What is Data Management? Explain the different topics in Data Management

Ans:

Data Management is a broad field of study, but essentially is the process of managing data

as a resource that is valuable to an organization or business. One of the largest

organizations that deal with data management, DAMA (Data Management Association),

states that data management is the process of developing data architectures, practices and

procedures dealing with data and then executing these aspects on a regular basis.

There are many topics within data management, some of the more popular topics include

data modeling, data warehousing, data movement, database administration and data

mining.

Data Modeling

Data modeling is first creating a structure for the data that you collect and use and then

organizing this data in a way that is easily accessible and efficient to store and pull the data

for reports and analysis. In order to create a structure for data, it must be named

appropriately and show a relationship with other data. It also must fit appropriately in a

class. For instance, if you have a database of media, you might have a hierarchal structure

of objects that include photos, videos, and audio files. Within each category, you can

classify objects accordingly.

Data Warehousing

Data warehousing is storing data effectively so that it can be accessed and used efficiently.

Different organizations collect different types of data, but many organizations use their data

the same way, in order to create reports and analyze their data to make quality business

decisions. Data warehousing is usually an organizational wide repository of data, however

for very large corporations in can encompass just one office or one department.



Data Movement

Data movement is the ability to move data from one place to another. For instance, data

needs to be moved from where it is collected to a database and then to an end user, but this

process takes quite a bit of logistic insight. Not only do all hardware, applications and data

collected need to be compatible with one another, they must also be able to be classified,

stored and accessed with ease within an organization. Moving data can be very expensive

and can require lots of resources to make sure that data is moved efficiently, that data is

secure in transit and that once it reaches the end user it can be used effectively either to be

printed out as a report, saved on a computer or sent as an email attachment.

Database Administration

Database administration is extremely important in managing data. Every organization or

enterprise needs database administrators that are responsible for the database

environment. Database administrators are usually given the authority to do the following

tasks that include recoverability, integrity, security, availability, performance and

development & testing support.

Recoverability is usually defined as a way to store data as a back up and then test the back

ups to make sure that they are valid. The task of integrity means that data that is pulled for

certain records or files are in fact valid and have high data integrity. Data integrity is

extremely important especially when creating reports or when data is used for analysis. If

you have data that is deemed invalid, your results will be worthless.

Database security is an essential task for database administrators. For instance, database

administrators are usually in charge of giving clearance and access to certain databases or

trees in an organization. Another important task is availability. Availability is defined as

making sure a database is up and running. The more up time, usually the higher level of

productivity. Performance is related to availability, it is considered getting the most out of the

hardware, applications and data as possible. Performance is usually in relation to an

organizations budget, physical equipment and resources.



Finally, a database administrator is usually involved in database development and testing

support. Database administrators are always trying to push the envelope, trying to get more

use out of the data and add better performing and more powerful applications, hardware and

resources to the database structure. A database that is administered correctly is not only a

sign of competent database administrator, but it also means that all end users have a huge

resource in the data that is available. This makes it easy to create reports, conduct analysis

and make high quality decisions based on data that is collected and used within the

organization.

Data Mining

Another important topic regarding data management is data mining. Data mining is a

process in which large amounts of data are sifted through to show trends, relationships, and

patterns. Data mining is a crucial component to data management because it exposes

interesting information about the data being collected. It is important to note that data is

primarily collected so it can be used to find these patterns, relationships and trends that can

help a business grow or create profit.

While there are many topics within data management, they all work together from the

beginning where data is collected to the end of the process where it is sifted through;

analyzed and formatted where specialists can then make quality decisions based upon it.



5. Define OLAP. Mention its advantages.

Ans:

OLAP stands for On Line Analytical Processing, a series of protocols used mainly for

business reporting. Using OLAP, businesses can analyze data in all manner of different

ways, including budgeting, planning, simulation, data warehouse reporting, and trend

analysis. A main component of OLAP is its ability to make multidimensional calculations,

allowing a wide and lightning-fast array of possibilities. In addition, the bigger the business,

the bigger its business reporting needs. Multidimensional calculations enable a large

business to complete in seconds what it otherwise would have waited a handful of minutes

to receive.

One main benefit of OLAP is consistency of calculations. No matter how fast data is

processed through OLAP software or servers, the reporting that results is presented in a

consistent presentation, so executives always know what to look for where. This is

especially helpful when comparing information from previous reports to information

contained in new ones and projected future ones. "What if" scenarios are some of the most

popular uses of OLAP software and are made eminently more possible by multidimensional

processing.

Another benefit of multidimensional data presentation is that it allows a manager to pull

down data from an OLAP database in broad or specific terms. In other words, reporting can

be as simple as comparing a few lines of data in one column of a spreadsheet or as

complex as viewing all aspects of a mountain of data. Also, multidimensional presentation

can create an understanding of relationships not previously realized. All of this, of course,

can be done in the blink of an eye.

Producers of OLAP software are familiar, including Oracle, IBM, and Hyperion Solutions.

Oracle, which has a reputation for being different, refers to OLAP software as Business

Intelligence. IBM and Hyperion Solutions, wishing to remain consistent with industry

standards, call their software OLAP.

Following are some of the benefits of OLAP -


http://www.wisegeek.com/what-is-a-spreadsheet.htm

http://www.wisegeek.com/what-is-a-manager.htm

http://www.wisegeek.com/what-is-lightning.htm

http://www.wisegeek.com/what-is-trend-analysis.htm

http://www.wisegeek.com/what-is-trend-analysis.htm

http://www.wisegeek.com/what-is-a-warehouse.htm

http://www.wisegeek.com/what-is-budgeting.htm


1. The main benefit of the OLAP is its steadiness in calculations. The reporting is

always represented in a coherent presentation irrespective of how fast data is deal

with through the OLAP server or software and thus allows the executives and

analysts to know exactly to look for where.

2. The other convenience of OLAP is that it allows the manager to tear down data from

OLAP database in specific or broad terms. In layman's term, the report can be as

simple of comparing two columns or as complex of analyzing a huge amount of data.

Moreover, it also helps to realize relationships that were forgotten earlier.


http://www.tech-faq.com/olap.shtml#%23


6. Explain Indexing on Dimensional Tables and Fact Tables.

Ans:

Indexing in the Dimension Tables:

Columns in the dimension tables are used in the predicates of queries. A query may run like

this: How much are the sales of Product A in the month of March for Northern Division?

Here the columns product, month, and division from three different dimension tables are

candidates for indexing. Inspect the columns or each dimension table carefully and plan the

indexes for these tables. We may be not be able to achieve performance improvement by

indexing the columns in the fact tables but the columns in the dimension tables offer

tremendous possibilities to improve performance through indexing.

Creating Index on Dimension Table:

Create a unique B-Tree index on the single-column primary key. Example the columns that

are commonly used to constrain the queries. These are candidates for bitmapped indexes.

Look for columns that are frequently accessed together in large Dimension Tables.

Determine bow these columns may be arranged and used to create multicolumn indexes.

Remember that the columns that are more frequently accessed or the columns that are at

higher hierarchical levels in the dimension table are placed at the high order of the

multicolumn indexes. Individually index every column likely to be used frequently in join

condition.

Indexing in the Fact Table:

The primary key of the fact table consists of the primary keys of all the connected

dimensions. If we have four dimension tables store, product time and promotion then the full

primary key of the fact table is the concatenation of the primary keys of the store, product,

time and promotion tables. What are the other columns? The other columns are metrics

such as sale units, sale dollars, cost dollars, and so on. These are the types of columns to

be considered for indexing the fact tables.

If the DBMS does not create an index on the primary key, deliberately create a B-Tree

index on the full primary key.



Review the individual components of concatenated key. Create indexes on combinations

based on query processing requirements. Carefully design the order of individual key

elements in the full-concatenated key for indexing. In the high order of the concatenated

key, place the keys of the dimension tables frequently referred to while querying.



7. What is repository? How is it helpful to the data warehouse maintenance?

Ans:

Data Repository is a logical (and sometimes physical) partitioning of data where multiple

databases which apply to specific applications or sets of applications reside. For example,

several databases (revenues, expenses) which support financial applications (A/R, A/P)

could reside in a single financial Data Repository.

A database warehouse is one large Data Repository of all business related information

including all historical data of the business organization implementing the data warehouse.

Data warehousing is a complex process of building a data repository in the form of a

relational database so that the company can support web or text mining in order to leverage

data and transform or aggregate them into useful information.

In all cases, organizations use data warehousing to gain a competitive advantage, support

for decision making processes through comprehensive data analysis.

Some of the key components of data warehousing are Decision Support Systems (DSS)

and Data Mining (DM).

Data volumes in data warehouse could grow at an exponential rate so there should be a

way to handle this tremendous growth. With respect to storage requirements, the critical

needs that need to be seriously considered in a data warehouse are high availability, high

data volume, high performance and scalability, simplification and usability and easy

management.

Partitioning of data into a logical or in some cases physical Data Repository could greatly

help meet the requirement in relation to dealing with the exponential growth of data volumes

in the data warehouse. If all the data in the data warehouse were not partitioned into several

Data Repositories, then there will be profound disadvantage in terms of perfomance and

efficiency.



For one, if the central server fails, the system would come to a halt. This is because data is

just located in one monolithic system, and when the hardware fails, there is no sort back up.

It may take some time to get the server up, depending on the nature of the problem. But in a

business company, even a few minutes of business stoppage can already translate into

thousands of potential dollars lost from the business.

When Data Repository is employed in the data warehouse, the load can be distributed

across many databases or even across many servers. For instance, instead of having one

computer handle the database related to customers, several databases could be handling

the different aspects of customers.

In a very large company such as a company that has several branches around the country,

instead of having all the customers in one database, several databases may be handling

different branch customer database in a data repository. Or as earlier mentioned, several

company departmental database may be broken down into various Data Repository such as

one data repository supporting several databases (revenues, expenses) which support

financial applications (A/R, A/P) could reside in a single financial Data Repository.

Data Repository offers easier and faster access due to the fact that related information are,

to some degree, lumped or clustered together. For instance, in the example with financial

Data Repository, anybody from the financial department or any other data use wanting

information related to financials will not have to dig through the entire volume of the data in

the data warehouse.


Documents

58528578-SEM-6-BC0058-1-Data-WareHouse