18
Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a company’s databases and other data sources and transforms it into a structure conducive to business analysis. Mathematical operations are often performed on the newly structured or organized data to further its usefulness for making business decisions. Finally, the data is made available to the end user for querying and analysis. If the data warehouse is well architected then queries to the data warehouse will return query results quickly (in a matter of seconds). The business decision-maker will have a powerful tool that could never have been effectively used directly from the company’s daily operational systems. We con- sider data analysis to be of two forms. The first requires a person to investigate the data for trends. This method is called On Line Analytical Processing (OLAP). The second form utilizes algorithms to scour the data looking for trends. This method is called Data Mining. Analysis Services 2005 is a business intelligence platform that enables you to use OLAP and Data Mining. Now that you have the big picture of data warehousing, let us look at what you will learn in this chapter. In this chapter you learn what data warehousing really is and how it relates to business intelli- gence. This information comes wrapped in a whole load of new concepts, and you get a look at the best known approaches to warehousing with the introduction of those concepts. We explain data warehousing in several different ways and we are sure you will understand it. You will finally see how Analysis Services 2005 puts it all together in terms of architecture — at both client and server levels — based on a new data abstraction layer called Unified Dimensional Model (UDM). A Closer Look at Data Warehousing In the book Building the Data Warehouse, Bill Inmon described the data warehouse as “a subject oriented, integrated, non-volatile, and time variant collection of data in support of management’s COPYRIGHTED MATERIAL

Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

Introduction to DataWarehousing and Analysis

Services 2005

A data warehouse is a system that takes data from a company’s databases and other data sourcesand transforms it into a structure conducive to business analysis. Mathematical operations areoften performed on the newly structured or organized data to further its usefulness for makingbusiness decisions. Finally, the data is made available to the end user for querying and analysis. Ifthe data warehouse is well architected then queries to the data warehouse will return query resultsquickly (in a matter of seconds). The business decision-maker will have a powerful tool that couldnever have been effectively used directly from the company’s daily operational systems. We con-sider data analysis to be of two forms. The first requires a person to investigate the data for trends.This method is called On Line Analytical Processing (OLAP). The second form utilizes algorithmsto scour the data looking for trends. This method is called Data Mining. Analysis Services 2005 is abusiness intelligence platform that enables you to use OLAP and Data Mining. Now that you havethe big picture of data warehousing, let us look at what you will learn in this chapter.

In this chapter you learn what data warehousing really is and how it relates to business intelli-gence. This information comes wrapped in a whole load of new concepts, and you get a look at thebest known approaches to warehousing with the introduction of those concepts. We explain datawarehousing in several different ways and we are sure you will understand it. You will finallysee how Analysis Services 2005 puts it all together in terms of architecture — at both client andserver levels — based on a new data abstraction layer called Unified Dimensional Model (UDM).

A Closer Look at Data WarehousingIn the book Building the Data Warehouse, Bill Inmon described the data warehouse as “a subject oriented, integrated, non-volatile, and time variant collection of data in support of management’s

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 3

COPYRIG

HTED M

ATERIAL

Page 2: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

decisions.” According to Inmon, the subject orientation of a data warehouse differs from the operational orientation seen in On-Line Transaction Processing (OLTP) systems; so a subject seen in a datawarehouse might relate to customers, whereas an operation in an OLTP system might relate to a specificapplication like sales processing and all that goes with it.

The word integrated means that throughout the enterprise, data points should be defined consistently or there should be some integration methodology to force consistency at the data warehouse level. Oneexample would be how to represent the entity Microsoft. If Microsoft were represented in differentdatabases as MSFT, MS, Microsoft, and MSoft, it would be difficult to meaningfully merge these in adata warehouse. The best-case solution is to have all databases in the enterprise refer to Microsoft as, say,MSFT, thereby making the merger of this data seamless. A less desirable, but equally workable, solutionis to force all the variants into one during the process of moving data from the operational system to thedata warehouse.

A data warehouse is referred to as non-volatile since it differs from operational systems, which are oftentransactional in nature and updated regularly. The data warehouse is generally loaded at some presetinterval, which may be measured in weeks or even months. This is not to say it is never measured indays; but even if updates do occur daily, that is still a sparse schedule compared to the constant changesbeing made to transactional systems.

The final element in this definition regards time variance, which is a sophisticated way of saying how farback the stored data in the system reaches. In the case of operational systems, the time period is quiteshort, perhaps days, weeks, or months. In the case of the warehouse, it is quite long — typically on theorder of years. This last item might strike you as fairly self-evident because you would have a hard timeanalyzing business trends if your data didn’t date back further than two months. So, there you have it,the classic definition that no good book on data warehousing should be without.

Taking the analysis one step closer to the nuts and bolts of working systems, consider that a relationaldatabase can be represented graphically as an Entity-Relationship Diagram (ERD) in a case tool or in SQLServer 2005 itself (see Figure 1-2 for an example). Not only will you see the objects in the database shownin the diagram, but you will also see many join connections which represent the relationships betweenthe objects. Data warehouses can be formed from relational databases or multi-dimensional databases.When your data warehouse is modeled after the relational database model then data is stored in two-dimensional tables and analytical or business queries are normally very slow. When one refers to a data warehouse it is typically OLAP that is being referred to. In the case of OLAP you have a multi-dimensional database with data stored in such a way that business users can view it and efficientlyanswer business questions — all with fast query response times. There is more to come in this chapter onthe differences between relational and OLAP databases.

Data warehousing is the process by which data starting from an OLTP database is transformed andstored so as to facilitate the extraction of business-relevant information from the source data. An OLTPdatabase, like a point-of-sale (POS) database is transaction-based and typically normalized (well opti-mized for storage) to reduce the amount of redundant data storage generated. The result makes for fastupdates, but this speed of update capability is offset by a reduction in speed of information retrieval atquery time. For speed of information retrieval, especially for the purpose of business analytics, an OLAPdatabase is called for. An OLAP database is highly denormalized (not well optimized for storage) andtherefore has rows of data that may be redundant. This makes for very fast query responses because rel-atively few joins are involved. And fast responses are what you want while doing business intelligence

4

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 4

Page 3: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

work. Figure 1-1 shows information extracted from transactional databases and consolidated into multi-dimensional databases; then stored in data marts or data warehouses. Data marts can be thought of asmini–data warehouses and quite often act as part of a larger warehouse. Data marts are subject-orienteddata stores for well-manicured (cleaned) data. Examples include a sales data mart, an inventory datamart, or basically any subject rooted at the departmental level. A data warehouse on the other hand,functions at the enterprise level and typically handles data across the entire organization.

Figure 1-1

TransactionalSource

Databases

RelationalDatabase

DataStaging

Data Marts/Data

Warehouse

DataMart

ClientApplication

BusinessAnalyst

ReportsBusinessDecisionMaker

DataMart

DataWarehouse

RelationalDatabase

CleanTransactional Data

Make sure you haveconformeddimensions

RelationalDatabase

RelationalDatabase

5

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 5

Page 4: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

Figure 1-2

Key Elements of a Data WarehouseLearning the elements of a data warehouse or data mart is, in part, about building a new vocabulary; thevocabulary associated with data warehousing can be less than intuitive, but once you get it, it all makessense. The challenge, of course, is understanding it in the first place. Two kinds of tables form a datawarehouse: fact tables and dimension tables.

Figure 1-3 shows a fact and a dimension table and the relationship between them. A fact table typicallycontains the business fact data such as sales amount, sales quantity, the number of customers, and theforeign keys to dimension tables. A foreign key is a field in a relational table that matches the primary key column of another table. Foreign keys provide a level of indirection between tables that enable you to cross-reference them. One important use of foreign keys is to maintain referential integrity (dataintegrity) within your database. Dimension tables contain detailed information relevant to specificattributes of the fact data, such as details of the product, customer attributes, store information, and so

6

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 6

Page 5: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

on. In Figure 1-3, the dimension table Product contains the information Product SKU and Product Name.The following sections go into more detail about fact and dimension tables.

Figure 1-3

Fact TablesWith the end goal of extracting crucial business insights from your data, you will have to structure yourdata initially in such a way as to facilitate later numeric manipulation. Leaving the data embedded insome normalized database will never do! Your business data, often called detail data or fact data, goes ina de-normalized table called the fact table. Don’t let the term “facts” throw you; it literally refers to thefacts. In business, the facts are things such as number of products sold and amount received for productssold. Yet another way to describe this type of data is to call them measures. Calling the data measuresversus detail data is not an important point. What is important is that this type of data is often numeric(though it could be of type string) and the values are quite often subject to aggregation (pre-calculatingroll-ups of data over hierarchies, which subsequently yield improved query results). A fact table oftencontains columns like the ones shown in the following table:

PPrroodduucctt IIDD DDaattee IIDD SSttaattee IIDD NNuummbbeerr ooff CCaasseess SSaalleess AAmmoouunntt

1 07/01/2005 6 3244 $90,842

1 07/01/2005 33 6439 $184,000

1 07/01/2005 42 4784 $98,399

1 08/01/2005 31 6784 $176,384

1 08/01/2005 6 2097 $59,136

1 08/01/2005 33 7326 $8,635

1 08/01/2005 42 4925 $100,962

1 09/01/2005 31 8548 $176,384

1 09/01/2005 6 945 $26,649

1 09/01/2005 33 8635 $246,961

1 09/01/2005 42 4935 $101,165

1 10/01/2005 31 9284 $257,631

1 10/01/2005 33 9754 $278,965

1 10/01/2005 42 4987 $102,733

... ... ... ... ...

IDProductIDSales QuantitySales Amount

Sales Table

ProductIDProduct SKUProduct Name

Product Table

7

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 7

Page 6: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

This table shows the sales of different varieties of beer between the months of July and October 2005 infour different states. The product id, date id, and state ids together form the primary key of the fact table.The number of cases of beer sold and the sales amount are facts. The product id, date id, and state id areforeign keys that join to the products, date, and state tables. In this table the state ids 6, 31, 33, and 42refer to the states MA, CA, OR, and WA, respectively, and represent the order in which these states joinedthe United States. Building the fact table is an important step towards building your data warehouse.

Dimension TablesThe fact table typically holds quantitative data; for example, transaction data that shows number of unitssold per sale and amount charged to the customer for the unit sold. To provide reference to higher-levelroll-ups based on things like time, a complementary table can be added that provides linkage to thosehigher levels through the magic of the join (how you link one table to another). In the case of time, thefact table might only show the date on which some number of cases of beer was sold; to do businessanalysis at the monthly, quarterly, or yearly level, a time dimension is required. The following tableshows what a beer products dimension table would minimally contain. The product id is the primarykey in this table. The product id of the fact table shown previously is a foreign key that joins to the prod-uct id in the following table:

PPrroodduucctt IIDD PPrroodduucctt SSKKUU PPrroodduucctt NNaammee

1 SBF767 SuperMicro Ale

2 SBH543 SuperMicro Lager

3 SBZ136 SuperMicro Pilsner

4 SBK345 SuperMicro Hefeweizen

... ... ...

For illustrative purposes, assume that you have a dimension table for time that contains monthly, quar-terly, and yearly values. There must be a unique key for each value; these unique key values are calledprimary keys. Meanwhile, back in the fact table you have a column of keys with values mapping to theprimary keys in the dimension table. These keys in the fact table are called foreign keys. For now it isenough if you get the idea that dimension tables connect to fact tables and this connectivity providesyou with the ability to extend the usefulness of your low-level facts resident in the fact table.

A multi-dimensional database is created from fact and dimension tables to form objects called dimen-sions and cubes. Dimensions are objects that are created mostly from dimension tables. Some examplesof dimensions are time, geography, and employee which would typically contain additional informationabout those objects by which users can analyze the fact data. The cube is an object that contains fact dataas well as dimensions so that data analysis can be performed by slicing or dicing dimensions. For exam-ple, you could view the sales information for the year 2005 in the state of Washington. Each of thoseslices of information is a dimension.

DimensionsTo make sense of a cube, which is at the heart of business analysis and discussed in the next section, you must first understand the nature of dimensions. We say that OLAP is based on multidimensional

8

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 8

Page 7: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

databases because it quite literally is. You do business analysis by observing the relationships betweendimensions like Time, Sales, Products, Customers, Employees, Geography, and Accounts. Dimensionsare most often made up of several hierarchies. Hierarchies are logical entities by which a business usermight want to analyze fact data. Each hierarchy can have one or more levels. A hierarchy in the geogra-phy dimension, for example, might have the following levels: Country, State, County, and City.

A hierarchy like the one in the geography dimension would provide a completely balanced hierarchy for the United States. Completely balanced hierarchy means that all leaf (end) nodes for cities would be anequal distance from the top level. Some hierarchies in dimensions can have an unbalanced distributionof leaf nodes relative to the top level. Such hierarchies are called unbalanced hierarchies. An organizationchart is an obvious example of an unbalanced hierarchy. There are different depths to the chain of super-visor to employee; that is, the leaf nodes are different distances from the top-level node. For example, ageneral manager might have unit managers and an administrative assistant. A unit manager might haveadditional direct reports such as a dev and a test manager, while the administrative assistant would nothave any direct reports. Some hierarchies are typically balanced but are missing a unique characteristicof some members in a level. Such hierarchies are called ragged hierarchies. An example of a ragged hierar-chy is a geography hierarchy that contains the levels Country, State, and City. Within the Country USAyou have State Washington and City Seattle. If you were to add the Country Greece and City Athens tothis hierarchy, you would add them to the Country and City levels. However, there are no states in theCountry Greece and hence member Athens is directly related to the Country Greece. A hierarchy inwhich the members descend to members in the lowest level with different paths is referred to as aragged hierarchy. Figure 1-4 shows an example of a Time dimension with the hierarchy Time. In thisexample, Year, Quarter, Month, and Date are the levels of the hierarchy. The values 2005 and 2006 aremembers of the Year level. When a particular level is expanded (indicated by minus sign in the figure)you can see the members of the next level in the hierarchy chain.

Figure 1-4

To sum up, a dimension is a hierarchical structure that has levels that may or may not be balanced. It hasa subject matter of interest and is used as the basis for detailed business analysis.

CubesThe cube is a multidimensional data structure from which you can query for business information. Youbuild cubes out of your fact data and the dimensions. A cube can contain fact data from one or more fact

Year Time−

Quarter

Month

Date

2005+2006−

Q1+Q2+Q3+

+++

Q4−October

NovemberDecember

9

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 9

Page 8: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

tables and often contains a few dimensions. Any given cube usually has a dominant subject under analy-sis associated with it. For example, you might build a Sales cube with which you analyze sales by region,or a Call Processing cube with which you analyze length of call by problem category reported. Thesecubes are what you will be making available to your users for analysis.

Figure 1-5 shows a Beer Sales cube that was created from the fact table data shown previously. Considerthe front face of the cube that shows numbers. This cube has three dimensions: Time, Product Line, andState where the product was sold. Each block of the cube is called a cell and is uniquely identified by amember in each dimension. For example, analyze the bottom-left corner cell that has the values 4,784and $98,399. The values indicate the number of sales and the sales amount. This cell refers to the sales ofBeer type Ale in the state of Washington (WA) for July 2005. This is represented as [WA, Ale, Jul ‘05].Notice that some cells do not have any value; this is because no facts are available for those cells in thefact table.

The whole point of making these cubes involves reducing the query response time for the informationworker to extract knowledge from the data. To make that happen, cubes typically contain pre-calculatedsummary data called aggregations. Querying existing aggregated data is close to instantaneous comparedto doing cold (no cache) queries with no pre-calculated summaries in place. This is really at the heart ofbusiness intelligence, the ability to query data with possibly gigabytes or terabytes of pre-summarizeddata behind it and yet get an instant response from the server. It is quite the thrill when you realize youhave accomplished this feat!

Figure 1-5

Jul ‘05

Ale

Lager

ProductLine

Pilsner

Hefeweizen

CA6,784

$176,3848,548

$236,8229,284

$257,631

2,097$59,136

3,244$90,842

945$26,649

7,326$209,524

6,439$184,000

8,635$246,961

9,754$278,965

4,925$100,962

4,784$98,399

4,935$101,165

4,987$102,733

MA

OR

WA

Aug ‘05

Time

Sta

te

Sep ‘05 Oct ‘05

10

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 10

Page 9: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

You learned about how cubes provide the infrastructure for storing multidimensional data. Well, it doesn’tjust store multidimensional data from fact tables; it also stores something called aggregations of that data.A typical aggregation would be the summing of values up a hierarchy of a dimension. For example,summing of sales figures up from stores level, to district level, to regional level; when querying for thosenumbers you would get an instant response because the calculations would have already been donewhen the aggregations were formed. The fact data does not necessarily need to be aggregated as sum ofthe specific fact data. You can have other ways of aggregating the data such as counting the number ofproducts sold. Again, this count would typically roll up through the hierarchy of a dimension.

The Star SchemaThe entity relationship diagram representation of a relational database shows you a different animalaltogether as compared to the OLAP (multidimensional) database. It is so different in fact, that there is aname for the types of schemas used to build OLAP databases: the star schema and the snowflake schema.The latter is largely a variation on the first. The main point of difference is the complexity of the schema;the OLTP schema tends to be dramatically more complex than the OLAP schema. Now that you knowthe infrastructure that goes into forming fact tables, dimension tables, and cubes, the concept of a starschema should offer little resistance. That is because when you configure a fact table with foreign keyrelationships to one or more of a dimension table’s primary keys, as shown in Figure 1-6, you have a starschema. Looks a little like a star, right?

Figure 1-6

The star schema provides you with an illustration of the relationships between business entities in aclear and easy-to-understand fashion. Further, it enables number crunching of the measures in the facttable to progress at amazing speeds.

Product

PK Product IDProduct SKUProduct NameProduct Sub CategoryProduct Category

Sales Fact

PK,FK1 Product ID

Sales QuantitySales Amount

PK,FK2 Time IDPK,FK3 Store ID

Time

PK

DayMonthYear

Time ID

Store

PK

StoreNameCityCountryPhoneManagerSize in SQFT

Store ID

11

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 11

Page 10: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

The Snowflake SchemaIf you think the star schema is nifty, and it is, there is an extension of the concept called the snowflakeschema. The snowflake schema is useful when one of your dimension tables starts looking as detailed as the fact table it is connected to. With the snowflake, a level is forked off from one of the dimensiontables, so it is separated by one or more tables from the fact table. In Figure 1-7 the Product dimensionhas yielded a Product Category level. The Product Sub Category level is hence one table removed fromthe sales fact table. In turn, the Product Sub Category level yields a final level called the ProductCategory — which has two tables of separation between it and the sales fact table. These levels, whichcan be used to form a hierarchy in the dimension, do not make for faster processing or query responsetimes, but they can keep a schema sensible.

Figure 1-7

You have so far learned the fundamental elements of a data warehouse. The biggest challenge is tounderstand these well and design and implement your data warehouse to cater to your end-users. Thereare two main design techniques for implementing data warehouses. These are the Inmon approach andthe Kimball approach.

Inmon Versus KimballDifferent Approaches

In data warehousing there are two commonly acknowledged approaches to building a decision supportinfrastructure, and both can be implemented using the tools available in SQL Server 2005 with AnalysisServices 2005. It is worth understanding these two approaches and the often-cited difference of views

Product

PK Product IDProduct SKUProduct NameProduct Sub Category ID

Product Sub Category

PK

FK1FK1

FK1

Product Sub Category IDProduct Category NameProduct Category ID

Product Category

PK Product Category IDProduct Category Name

Sales Fact

PK,FK1 Product ID

Sales QuantitySales Amount

PK,FK4 Time IDPK,FK3 Store ID

Time

PK

DayMonthYear

Time ID

Location

PK

City NameCountry Name

Location ID

Store

PK

StoreNameLocation IDPhoneManagerSize in SQFT

Store ID

12

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 12

Page 11: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

that result. These views are expressed most overtly in two seminal works: The Data Warehouse LifecycleToolkit by Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite, and Corporate InformationFactory by Bill Inmon, Claudia Imhoff, and Ryan Sousa.

Kimball identified early on the problem of the stovepipe. A stovepipe is what you get when several inde-pendent systems in the enterprise go about identifying and storing data in different ways. Trying to con-nect these systems or use their data in a warehouse results in something resembling a Rube-Goldbergdevice. To address this problem, Kimball advocates the use of conformed dimensions. Conformed refersto the idea that dimensions of interest — sales, for example — should have the same attributes and roll-ups (covered in the “Aggregations” section earlier in this chapter) in one data mart as another. Or atleast one should be a subset of the other. In this way, a warehouse can be formed from data marts. Thereal gist of Kimball’s approach is that the data warehouse contains dimensional databases for ease ofanalysis and that the user queries the warehouse directly.

The Inmon approach has the warehouse laid out in third normal form (not dimensional) and the users query data marts, not the warehouse. In this approach the data marts are dimensional in nature.However, they may or may not have conformed dimensions in the sense Kimball talks about.

Happily it is not necessary to become a card-carrying member of either school of thought in order to dowork in this field. In fact, this book is not strictly aligned to either approach. What you will find as youwork through this book is that by using the product in the ways in which it was meant to be used andare shown here, certain best practices and effective methodologies will naturally emerge.

Business Intelligence Is Data AnalysisHaving designed a data warehouse the next step is to understand and make business decisions from yourdata warehouse. Business intelligence is nothing but analyzing your data. An example of business analyticsis shown through the analysis of results from a product placed on sale at a discounted price, as commonlyseen in any retail store. If a product is put on sale for a special discounted price, there is an expected out-come: increased sales volume. This is often the case, but whether or not it worked in the company’s favorisn’t obvious. That is where business analytics come into play. We can use Analysis Services 2005 to findout if the net effect of the special sale was to sell more product units. Suppose you are selling organic honeyfrom genetically unaltered bees; you put the 8-ounce jars on special — two for one — and leave the 10- and12-ounce jars at regular price. At the end of the special you can calculate the lift provided by the special sale — the difference in total sales between a week of sales with no special versus a week of sales with thespecial. How is it you could sell more 8-ounce jars on special that week, yet realize no lift? It’s simple —the customers stopped buying your 10- and 12-ounce jars in favor of the two-for-one deal; and you didn’tattract enough new business to cover the difference for a net increase in sales.

You can surface that information using Analysis Services 2005 by creating a Sales cube that has threedimensions: Product, Promotion, and Time. For the sake of simplicity, assume you have only three prod-uct sizes for the organic honey (8-ounce, 10-ounce, and 12-ounce) and two promotion states (“no promo-tion” and a “two-for-one promotion for the 8-ounce jars”). Further, assume the Time dimension containsdifferent levels for Year, Month, Week, and Day. The cube itself contains two measures, “count of prod-ucts sold” and the “sales amount.” By analyzing the sales results each week across the three productsizes you could easily find out that there was an increase in the count of 8-ounce jars of honey sold, butperhaps the total sales across all sizes did not increase due to the promotion. By slicing on the Promotiondimension you would be able to confirm that there was a promotion during the week that caused an

13

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 13

Page 12: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

increase in number of 8-ounce jars sold. When looking at the comparison of total sales for that week(promotion week) to the earlier (non-promotion) weeks, lift or lack of lift is seen quite clearly. Businessanalytics are often easier described than implemented, however.

Analysis Services 2005Analysis Services 2005 is part of Microsoft’s product SQL Server 2005. SQL Server 2005 is the latest SQLServer release from Microsoft in November of 2005. In addition to Analysis Services 2005, SQL Server2005 contains other services such as Integrations Services, Reporting Services, and Notification Servicesamong other things. Integration Services, Analysis Services, and Reporting Services together form thecore of business intelligence platform with SQL Server as the backend. Analysis Services 2005 not onlyprovides you the ability to build dimensions and cubes for data analysis but also supports several datamining algorithms which can provide business insight into your data that are not intuitive. AnalysisServices is part of a greater Business Intelligence platform, which leverages not only the rest of SQLServer 2005, but the .NET Framework (Common Language Runtime) and Visual Studio developmentenvironment as well. Next you will learn about the overall architecture of Analysis Services 2005 fol-lowed by the concept of Unified Dimensional Model (UDM) which helps you to have a unified view ofyour entire data warehouse.

SQL Server Analysis Services 2005 has been re-architected as both scalable and reliable enterprise classsoftware that provides fine-grain security. So, not only is it quite manageable; but also protects your datafrom malicious attacks. The architecture of Analysis Services 2005 provides efficient scalability in termsof scale-out and scale-up features. Several instances of Analysis Services 2005 can be integrated togetherto provide an efficient scale-out solution. On the other hand, the service has been architected with effi-cient algorithms to handle large dimensions and cubes on a single instance. Analysis Services 2005 provides a rich set of tools for creating OLAP databases; efficient and easy manageability, as well as profiling capabilities.

The Business Intelligence Development Studio (BIDS) integrated within Visual Studio is the developmenttool shipped with Analysis Services 2005 used for creating and updating cubes, dimensions, and DataMining models. The SQL Server Management Studio (SSMS) provides an integrated environment for man-aging SQL Server, Analysis Services, Integration Services, and Reporting Services. SQL Profiler in theSQL Server 2005 releases supports profiling Analysis Services 2005, which helps in analyzing the types of commands and queries sent from different users or clients to Analysis Services 2005. You learn moreabout BIDS and SSMS in Chapter 2 with the help of a tutorial. You learn about profiling an instance ofAnalysis Services using SQL Profiler in Chapter 12. In addition to the above-mentioned tools, AnalysisServices 2005 provides two more tools: the Migration Wizard and the Deployment Wizard. The MigrationWizard helps in migrating Analysis Services 2000 databases to Analysis Services 2005. The DeploymentWizard helps in deploying the database files created using BIDS to Analysis Services 2005.

The SSMS provides efficient, enterprise-class manageability features for Analysis Services. Key aspectsof an enterprise class service are availability and reliability. Analysis Services 2005 supports fail-overclustering on Windows clusters through an easy setup scheme and fail-over clustering certainly helpsprovide high availability. In addition, Analysis Services 2005 has the capability of efficiently recoveringfrom failures. You can set up fine-grain security so that you can provide administrative access to anentire service or administrative access to specific databases, process permissions to specific databases,and read-only access to metadata and data. In addition to this, certain features are turned off by defaultso that the Service is protected from hacker attacks.

14

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 14

Page 13: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

Analysis Services 2005 natively supports XML for Analysis specification defined by the XML/A AdvisoryCouncil. What this means is that the communication interface to Analysis Services from a client is XML.This facilitates ease of interoperability between different clients and Analysis Services 2005. The architec-ture of SQL Server Analysis Services 2005 includes various modes of communication to the service as shown in Figure 1-8. Analysis Server 2005 provides three main client connectivity components tocommunicate to the server. The Analysis Management Objects (AMO) is a new object model that helpsyou manage Analysis Server 2005 and the databases resident on it. The OLE DB 9.0 is the client connec-tivity component used to interact with analysis services 2005 instances s for queries that conforms to theOLE DB standard. The ADOMD.Net is dot Net object model support for querying data from AnalysisServices 2005. In addition to the three main client connectivity components, two other components areprovided by Analysis Services 2005. They are DSO 9.0 (Decision Support Object) and HTTP connectivitythrough a data pump. DSO 8.0 is the extension of the management object of Analysis Server 2000 so thatlegacy applications can interact with migrated Analysis Server 2000 databases on Analysis Server 2005.The data pump is a component that is set up with IIS (Internet Information System) to provide connec-tion to Analysis Services 2005 over HTTP (Hypertext Transfer Protocol).

Even though XML/A helps in interoperability between different clients to Analysis Server, it comes with acost on performance. If the responses from the server are large, transmission of XML data across the wiremay take a long time depending on the type of network connection. Typically slow wide area networksmight suffer from performance due to large XML responses. In order to combat this, Analysis Services 2005supports the options for compression and binary XML so that the XML responses from the server could bereduced. These are optional features supported by Analysis Services 2005 that can be enabled or disabledon the Server.

Analysis Services 2005 stores metadata information of databases in the form of XML. Analysis Services2005 provides you with the option of storing the data or aggregated data efficiently in a proprietary for-mat on Analysis Services instance or storing them in the relational database. If you choose the data and/or aggregated data to be stored in the proprietary format you can expect better query performance thanthe case where the data is being retrieved from the relational database.. This proprietary format helpsAnalysis Services 2005 to retrieve the data efficiently and thereby improves the query performance.Based on where the data and/or aggregated fact data is stored you can classify the storage types asMOLAP (Multi-dimensional OLAP), ROLAP (Relational OLAP), or HOLAP (Hybrid OLAP).

MOLAP is the storage mode in which the data and aggregated data are both stored in proprietary for-mat on the Analysis Services instance. This is the default and recommended storage mode for AnalysisServices databases since you get better query performance as compared to the other storage types. Thekey advantages of this storage mode is fast data retrieval while analyzing sections of data and thereforeprovides good query performance and the ability to handle complex calculations. Two potential disad-vantages of MOLAP mode are storage needed for large databases and the inability to see new dataentering your data warehouse.

ROLAP is the storage mode in which the data is left in the relational database. Aggregated or summarydata is also stored in the relational database. Queries against the Analysis Services are appropriatelychanged to queries to the relational database to retrieve the right section of data requested. The keyadvantage of this mode is that the ability to handle large cubes is limited by the relational backend only.The most important disadvantage of the ROLAP storage mode are slow query performance. You willencounter slower query performance in ROLAP mode due to the fact that each query to the AnalysisServices is translated into one or more queries to the relational backend.

15

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 15

Page 14: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

Figure 1-8

The HOLAP storage mode combines the best of MOLAP and ROLAP modes. The data in the relationaldatabase is not touched while the aggregated or summary data is stored on the Analysis Services instancein a proprietary format. If the queries to Analysis Services request aggregated data, they are retrieved fromthe summary data stored on the Analysis Services instance and they would be faster than data beingretrieved from the relational backend. If the queries request detailed data, appropriate queries are sent tothe relational backend and these queries can take a long time based on the relational backend.

BusinessIntelligence

DevelopmentStudio

SQL ServerManagement

Studio

SQL Server Analysis Services Client Connectivity components

Analysis Server

Data Source(Relational database such as SQL

Server, DB2, Oracle, Access,Text data through DTS Pipeline)

Analysis Server 2005

Local Cube

SQLProfiler

Analysis ServicesManagement Object

(AMO)

Analysis ServicesOLEDB 9.0 ADOMD.Net

IIS

AnalysisServicesPUMP

ClientApplications

SQL Server Analysis Server 2005

ClientApplications

LegacyClient

Applications

Decision SupportObjects 9.0

OLE

DB

, .N

et

XML/

A ov

er T

CP/

IP

XML/

A ov

er T

CP/

IP

XML/

A ov

er T

CP/

IP

XML/

A ov

er T

CP/

IP

XML/

A ov

er T

CP/

IP

AMO

OLE

DB

9.0

AMO

OLE

DB

9.0

ADO

MD

.Net

OLE

DB

9.0

XML/

A ov

er H

TTP

AMO

OLE

DB

9.0

ADO

MD

.Net

DS

O

16

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 16

Page 15: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

Based on your requirements and maintainability costs you need to choose the storage mode that isappropriate for your business. Analysis Services 2005 supports all three storage modes.

The Unified Dimensional ModelCentral to the architecture is the concept of the Unified Dimensional Model (UDM) which, by the way, isunique to this release of the product. UDM, as the name suggests, provides you with a way to encapsulateaccess to multiple heterogeneous data sources into a single model. In fact, with the UDM, you will bebuffered from the difficulties previously presented by multiple data sources. Those difficulties were oftenassociated with cross–data-source calculations and queries — so, do not be daunted by projects with lots ofdisparate data sources. The UDM can handle it! The UDM itself is more than a multiple data-source cubeon steroids; it actually defines the relational schema upon which your cubes and dimensions are built.Think of the UDM as providing you with the best of the OLAP and relational worlds. UDM provides youwith the rich metadata needed for analyzing and exploring data along with the functionality like the com-plex calculations and aggregations of the OLAP world. It supports complex schemas, and is capable of sup-porting ad-hoc queries that are needed for reporting in the relational world. Unlike the traditional OLAPworld that allows you to define a single fact table within a cube, the UDM allows you to have multiple facttables. The UDM is your friend and helps you have a single model that will support all your businessneeds. Figure 1-9 shows a UDM within Analysis Services 2005 that retrieves data from heterogeneous datasources and serves various types of clients.

Key elements of the UDM are as follows:

❑ HHeetteerrooggeenneeoouuss ddaattaa aacccceessss ssuuppppoorrtt:: UDM helps you to integrate and encapsulate data fromheterogeneous data sources. It helps you combine various schemas into a single unified modelthat gives end users the capability of sending queries to a single model.

❑ RReeaall--ttiimmee ddaattaa aacccceessss wwiitthh hhiigghh ppeerrffoorrmmaannccee:: The UDM provides end users with real-timedata access. The UDM creates a MOLAP cache of the underlying data. Whenever there arechanges in the underlying relational database, a new MOLAP cache is built. When users querythe model, it provides the results from the MOLAP cache. During the time the cache is beingbuilt, results are retrieved from the relational database. UDM helps in providing real-time dataaccess with the speed of an OLAP database due to the MOLAP cache. This feature is calledproactive caching. You learn more about proactive caching in Chapter 18.

❑ RRiicchh mmeettaaddaattaa,, eeaassee ooff uussee ffoorr eexxpplloorraattiioonn,, aanndd nnaavviiggaattiioonn ooff ddaattaa:: UDM provides a consolidatedview of the underlying data sources with the richness of metadata provided by the OLAP world.Due to rich metadata supported by OLAP, end users are able to exploit this metadata to navigateand explore data in support of making business decisions. UDM also provides you with the abilityto view specific sections of the unified model based on your business analysis needs.

❑ RRiicchh aannaallyyttiiccss ssuuppppoorrtt:: In addition to the rich metadata support, the UDM provides you withthe ability to specify complex calculations to be applied to the underlying data; in this way youcan embed business logic. You can specify the complex calculations by a script-based calculationmodel using the language called MDX (Multi-Dimensional eXpressions). UDM provides richanalytics such as Key Performance Indicators and Actions that help in understanding your busi-ness with ease and automatically take appropriate actions based on changes in data.

❑ MMooddeell ffoorr RReeppoorrttiinngg aanndd AAnnaallyyssiiss:: The UDM provides the best functionality for relating toboth relational and OLAP worlds. UDM provides you with the capability of not only queryingthe aggregated data that are typically used for analysis, but also has the ability to provide fordetailed reporting up to the transaction level across multiple heterogeneous data sources.

17

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 17

Page 16: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

Figure 1-9

Another handy aspect of using the UDM is the storage of foreign language translations for both data andmetadata. This is handled seamlessly by the UDM such that a connecting user gets the metadata anddata of interest customized to his or her locale. Of course, somebody has to enter those translations intothe UDM in the first place; it is not actually a foreign language translation system.

SummaryReading this chapter may have felt like the linguistic equivalent of drinking from a fire hose; it is goodyou hung in there because now you have a foundation from which to build as you work through the restof the book. Now you know data warehousing is all about structuring data for decision support. Thedata is consumed by the business analyst and business decision-maker and can be analyzed throughOLAP and Data Mining techniques.

Administrator

BusinessAnalyst

OLAP ClientTool

ReportAnalyst

OLAPBrowser

SQL ServerManagement

Studio

OracleData Store

ReportingTools

OLAP ClientTool

SQL ServerData Store

DB2Data Store

UnifiedDimensional

Model

NO

TIFI

CAT

ION

S

Analysis Services 2005

Automatic MOLAP Cache

TeradataData Store

XML for Analysis S

ervice

18

Chapter 1

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 18

Page 17: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

OLAP is a multidimensional database format that is a world apart in form and function when comparedto an OLTP relational database system. You saw how OLAP uses a structure called a cube, which in turnrelies on fact tables (which are populated with data called facts) and dimension tables. These dimensiontables can be configured around one or more fact tables to create a star schema. If a dimension table isdeconstructed to point to a chain of sub-dimension tables, the schema is called a snowflake schema.

By choosing Analysis Services 2005 you have chosen a business intelligence platform with awesomeinnovations built right in; like the UDM. Also, there is an advantage that Analysis Services 2005 offers —it comes from a particularly strong and reliable company that had the highest market share with its ear-lier product, Analysis Services 2000. The rest of this book illustrates the power of the platform quiteclearly.

In the unlikely event that you didn’t read the introduction, mention was made that you should read atleast the first three chapters serially before attempting to tackle the rest of the book. So, please do notskip Chapter 2, an introduction to Analysis Services and Chapter 3, an introduction to the technologybehind the most famous acronym in business analytics, MDX.

19

Introduction to Data Warehousing and Analysis Services 2005

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 19

Page 18: Introduction to Data Warehousing and Analysis Services 2005€¦ · Introduction to Data Warehousing and Analysis Services 2005 A data warehouse is a system that takes data from a

05_579185_ch01_3.qxp 12/18/07 2:44 PM Page 20