Data Management
5 November 2012 | TCS Public
Contents
- Data Warehouse Concepts
- Data Modeling
- Dimensional Modeling
- Implementation and Maintenance
- Data Management
- Data Quality Analysis
- Metadata Management
- Data Governance
- Master Data Management
- Data Storage, Movement and Access
Data Warehouse Concepts
Data Warehouse Concepts — Agenda
A. What is a Data Warehouse (DW)?
B. What are the components of a DW?
C. What are the various architectures/formats of a DW?
D. Examples of Data Warehousing tools in use
Need for Data Warehousing — Business View
- Customer centricity: a single view of each customer and his/her activities
- Integrated information from heterogeneous sources
- Adaptability to rapidly changing business needs
- Multiple ways to view business performance
- Low cycle time, faster analytics
- Increased global competition: crunch more and more data, faster and faster
- Mergers and acquisitions: each acquisition brings another set of disparate IT systems, affecting consistency and performance
Need for Data Warehousing — Systemic View
- Performance optimization: OLTP systems get overloaded with large analytical queries
- Data models for OLTP and OLAP are very different
- Reduce reliance on IT to produce reports: report building on OLTP systems is very technical
- OLTP systems are not built to hold historical data
- Data security: prevent unauthorized access to sensitive data
Data Warehouse Defined
A Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data that enables management decision making.
Subject Orientation: Process-Oriented vs. Subject-Oriented
[Diagram: transactional storage organizes fields by data-entry process — sales rep, quantity sold, part number, date, customer name, product description, unit price, mail address; data warehouse storage regroups the same data by subject: Sales, Customers, Products.]
Data Volatility: Volatile vs. Non-Volatile
[Diagram: transactional storage is volatile — record-by-record insert, change, delete, and access; data warehouse storage is non-volatile — mass load and access of data only.]
Time Variance: Current vs. Historical Data
[Diagram: transactional storage holds current data; data warehouse storage holds historical data, illustrated by a chart of quarterly sales (in lakhs) by region — East, West, North — for Q1 of year 1997.]
Data Warehouse Characteristics
- Stores large volumes of data used frequently by DSS
- Is maintained separately from operational databases
- Is relatively static, with infrequent updates
- Contains data integrated from several, possibly heterogeneous, operational databases
- Supports queries that process large data volumes
Three Views of Data Warehousing
- Strategic or business view: define the key business drivers of the data warehouse. How can a business-driven approach achieve high ROI?
- Architectural or technology view: alternative data warehousing architectures. How can the right architecture achieve a high ROI?
- Methodology or implementation view: development and implementation methodology. How can the right methodology achieve a rapid ROI?
Data Warehouse Components
[Diagram: feeder systems FS1…FSn, including legacy systems, feed a staging area via extraction and network transmission; the staging area performs cleansing, transformation, aggregation, and summarization; data flows into an ODS and the DW; data mart population then loads DM1…DMn, which serve OLAP analysis and knowledge discovery. A metadata layer spans the entire pipeline.]
Data Warehouse Build Lifecycle
- Data extraction
- Data cleansing and transformation
- Data load and refresh
- Build derived data and views
- Service queries
- Administer the warehouse
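The lifecycle stages above can be sketched as a minimal pipeline. This is an illustrative sketch only — the function and field names (extract, cleanse_and_transform, region, amount) are hypothetical and not from the deck.

```python
# Minimal sketch of the build lifecycle stages (all names illustrative).

def extract(source_rows):
    """Data extraction: pull raw records from a feeder system."""
    return list(source_rows)

def cleanse_and_transform(rows):
    """Cleansing and transformation: drop bad rows, standardize fields."""
    cleaned = []
    for r in rows:
        if r.get("amount") is None:        # reject incomplete records
            continue
        r = dict(r, region=r["region"].strip().upper())
        cleaned.append(r)
    return cleaned

def load(warehouse, rows):
    """Load and refresh: mass-append into the warehouse store."""
    warehouse.extend(rows)

def build_derived(warehouse):
    """Build derived data: e.g. total sales per region."""
    totals = {}
    for r in warehouse:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

source = [
    {"region": " east ", "amount": 100},
    {"region": "West",   "amount": 250},
    {"region": "East",   "amount": None},   # will be cleansed out
]

warehouse = []
load(warehouse, cleanse_and_transform(extract(source)))
summary = build_derived(warehouse)
print(summary)   # {'EAST': 100, 'WEST': 250}
```

Real ETL tools implement the same shape at scale, with the cleansing and derivation rules driven by metadata rather than hard-coded.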
Data Warehouse Architectures
- Virtual data warehouse
- Enterprise data warehouse
- Data marts
- Distributed data marts
- Multi-tiered warehouse
Virtual Data Warehouse
[Diagram: users run a reporting tool directly against operational systems data — legacy, client/server, OLTP applications, and external sources — with no separate warehouse store.]
Enterprise Data Warehouse
[Diagram: data preparation selects, extracts, transforms, and integrates operational systems data — legacy, client/server, OLTP, external — into a single data warehouse, maintained over time and described by a metadata repository; users access it through a reporting tool via an API.]
Data Marts
[Diagram: the same preparation pipeline — select, extract, transform, integrate, maintain — loads a subject-specific data mart instead of an enterprise warehouse; a metadata repository describes it, and users query it through a reporting tool via an API.]
Distributed Data Marts
[Diagram: the preparation pipeline feeds several independent data marts in parallel; users query across the distributed marts through a reporting tool via an API.]
Multi-tiered Data Warehouse: Option 1
[Diagram: enterprise-wide operational data is selected, extracted, transformed, and integrated into a central data warehouse, from which dependent data marts are populated; a metadata repository describes the warehouse, and users query the marts through a reporting tool via an API.]
Multi-tiered Data Warehouse: Option 2
[Diagram: a variant arrangement of the same components — data preparation, data marts, central data warehouse, metadata repository — in which the marts sit between the preparation layer and the warehouse; users query through a reporting tool via an API.]
Relative Data Sizes in a Data Warehouse
[Diagram: a pyramid of decreasing volume — older detail data at the base, then current detail data, lightly summarized data, and highly summarized data at the top — with metadata describing all levels.]
Data Warehouse — Example
[Diagram: highly summarized — monthly sales by region and by product for 1991–94; lightly summarized — weekly sales by region and by product/sub-product for 1991–94; current detail — sales detail for 1991–94; older detail — sales detail for 1985–90; metadata describes all levels.]
Building a Data Warehouse — Steps
- Identify key business drivers, sponsorship, risks, and ROI
- Survey information needs, identify desired functionality, and define functional requirements for the initial subject area
- Architect the long-term data warehousing architecture
- Evaluate and finalize DW tools and technology
- Conduct a proof of concept
- Design the target database schema
- Build data mapping, extract, transformation, cleansing, and aggregation/summarization rules
- Build the initial data mart as an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases
- Maintain and administer the data warehouse
Representative DW Tools
- ETL tools: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
- OLAP server products: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
- OLAP tools: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
- Data warehouse platforms: Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, RedBrick
- Data mining and analysis: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS tools
Data Warehousing — Insights
- An enabling technology that facilitates improved business decision-making
- A process, not a product
- A technique for assembling and managing a wide variety of data from heterogeneous systems for decision support
Data Modeling
Modeling: ER Model
Definition: a logical and graphical representation of the information needs of an organization.
Process:
- Classifying entities
- Characterizing attributes
- Inter-relating relationships
Modeling: Logical Model
Definition: a representation of a business problem without regard to implementation, technology, or organizational structure.
Features:
- Represents business requirements completely, correctly, and consistently
- Removes redundancy
- Does not presuppose data granularity
- Is not itself implemented
Modeling: Example of an ER Model
[Figure: example entity-relationship diagram — not reproduced.]
Modeling: Physical Model
Definition: a specification of what is actually implemented.
Features:
- Optimized
- Efficient
- Buildable
- Robust
Dimensional Modeling
- A form of analytical design (or physical model) in which data is pre-classified as a fact or a dimension
- Improves performance by matching the data structure to the queries
- Example query: "Give this period's total sales volume and revenue by product, business unit and package"
Dimensional Modeling
Dimensional Modeling — Agenda
A. What is Dimensional Modeling?
B. What are the various types of Dimension Tables?
C. What are the various types of Fact Tables?
D. How do I model a star schema?
A Quick Recap of Data Warehousing
A Data Warehouse is a SUBJECT-ORIENTED, INTEGRATED, TIME-VARIANT, NON-VOLATILE collection of data enabling management decision making.
[Diagram: source systems — RDBMS, ERP, CRM, mainframe and PC databases — are extracted into a staging area for cleansing, transformation, validation, and massaging; data flows over the network into an ODS and the DW; aggregation, summarization, and dimension/fact loading populate the data marts; client browsers consume reports, cubes, analyses, data mining, dashboards, MIS reports, company quarterly reports, etc.]
Dimensional Modeling: In Perspective
Dimensional modeling is an effective, efficient, and proven technique for designing enterprise data warehouse and distributed data mart database schemas for maximum query performance.
[Diagram: the same pipeline as the recap — sources, staging area, ODS, DW, data marts — with the DW and data-mart layer highlighted as the dimensional modeling area.]
Dimensional Model: Strengths
Predictable, standard framework:
- Query tools and user interfaces can be created to provide a consistent way of reporting data
- Most filter conditions fall on dimensional attributes, allowing performance boosting through bit-vector indexes on dimension table columns
- Metadata functionality in query tools can use the known cardinality of dimension values to offer facilities such as value-selection drop-downs and value-selection windows
Resilience to unexpected user querying patterns:
- All dimensions are equivalent entry points into the fact table (the number of joins to the fact table is always one)
- Symmetrical query strategies and SQL
- Logical design can be done independently of query patterns
Dimensional Model: Strengths (continued)
Graceful extensibility:
- New, unanticipated facts can be added as long as they are consistent with the fact table grain
- New dimension tables can be added as long as a single value of that dimension is defined for each existing fact record
- New, unanticipated dimensional attributes can be added
Standard approaches for common modeling situations:
- Slowly changing dimensions (SCDs)
- Heterogeneous products (e.g. savings account vs. current account)
- Pay-in-advance databases
- Event-handling databases (factless fact tables)
What is a Fact?
A fact is a measure — e.g. sales volume, revenue.
Facts and Fact Tables
A fact is a measure — e.g. revenue, cost, number of accounts.
[Diagram: two example fact tables. Sales: Date Key (int), Store Key (int), Product Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float). Billing: Date Key (int), Customer Key (int), Service Line Key (int), Rate Plan Key (int), Number of Total Minutes, Number of Calls (int), Service Charge (float), Taxes (float).]
- The term FACT represents a single business measure, e.g. Sales or Qty Sold.
- Each fact has a GRAIN: the set of perspectives or attributes that define/qualify the fact completely. E.g. the grain of Sales could be per PRODUCT, at each STORE, on each DAY.
- A FACT TABLE is the primary table in a dimensional model, where the business measures, or FACTS, are stored. A business measure, or FACT, is a row in a FACT TABLE.
- All FACTS in a FACT TABLE must be at the SAME GRAIN.
Fact Tables: Some Features
- Fact tables express MANY-TO-MANY RELATIONSHIPS between dimensions: one product can be sold in many stores, while a single store typically sells many different products at the same time; the same customer can visit many stores, and a single store typically has many customers.
- The fact table is typically the MOST NORMALIZED table in a dimensional model: no repeating groups (1NF), no redundant columns (2NF), no columns that are not dependent on keys — all facts depend on the keys (3NF) — and no independent multiple relationships (4NF).
- Fact tables can contain HUGE data volumes, running into millions of rows.
- Facts can be identified by answering the question: what are we measuring?
- The grain of the fact table can be identified by answering the question: how do you describe a single row in the fact table?
- All facts within the same fact table must be at the SAME GRAIN.
Fact Tables: Some Features (continued)
- The grain of the fact table is the LIST OF DIMENSIONS that uniquely defines each row: each row of a sales fact table is uniquely identified by a combination of store, product, and time — which product, which store, when sold.
- Additional foreign keys may exist in the fact table that point to other dimension tables (e.g. payment type) but do not contribute to the grain.
- Every foreign key in a fact table is usually a DIMENSION TABLE PRIMARY KEY, and usually an INTEGER key.
- Fact tables are typically used in GROUP BY SQL queries.
- Every column in a fact table is either a foreign key to a dimension table primary key or a fact; every non-key column is typically used in the SELECT clause of a SQL query.
What is a Dimension?
Recall that a Data Warehouse is a SUBJECT-ORIENTED, integrated, time-variant, non-volatile collection of data in support of management's decisions. The subjects — e.g. Product, Business Unit, Package — become the dimensions by which facts are analyzed.
Dimensions and Dimension Tables
A dimension is a perspective — e.g. customer, geography, time.
[Diagram: two example dimension tables. Store: Store Key (int), Store name (char), Street Address (char), City (char), State (char), Region (char), Country (char). Product: Product Key (int), Product id (char/int), Product name (char), Product Group, Brand, Department.]
- The term DIMENSION represents a single category or perspective by which associated FACTS are interpreted and understood. E.g. Store is a perspective by which sales are understood: it answers the question "Where did the sales occur?"
- A DIMENSION TABLE holds a list of attributes, or qualities, of the dimension most often used in queries and reports. E.g. the Store dimension can have attributes such as the street and block number, the city, the region, and the country where it is located, in addition to its name.
- Every row in the DIMENSION TABLE represents a unique instance of that DIMENSION and has a unique identifier called the DIMENSION KEY.
Dimension Tables: Some Features
- Dimensions can be identified by answering the question: how do business people describe the data that results from a business process?
- Dimension tables are ENTRY POINTS into the fact table: the number of rows selected and processed from the fact table typically depends on the conditions (WHERE clauses) the user applies to the selected dimensional attributes.
- Dimension tables are typically DE-NORMALIZED in order to reduce the number of joins in resulting queries.
- Dimension table attributes are generally STATIC, DESCRIPTIVE fields describing aspects of the dimension.
- Dimension tables are typically designed to hold INFREQUENT CHANGES to attribute values over time, using SCD techniques.
- Dimension tables are typically used in GROUP BY SQL queries, which they serve to simplify.
- Every column in the dimension table is typically either the primary key or a dimensional attribute; every non-key column is typically used in the GROUP BY clause of a SQL query.
Some More Jargon: Hierarchy, Level, Member, Attribute, Grain
[Diagram: the Geography dimension hierarchy — World level (World); Continent level (America, Europe, Asia); Country level (USA, Canada, Argentina); State level (FL, GA, VA, CA, WA); City level (Miami, Tampa, Orlando, Naples). Each city is a dimension member / business entity, with attributes such as Population and Tourist Place.]
Some More Jargon (continued)
- A dimension can have one or more hierarchies. E.g. the Geography dimension can also be organized as: Global level (Global); Economy level (Developed, Developing, Third World); Financial Class level (Upper, Middle, Lower); Regional level (Metro, Suburb, Town, Village); City level (Chennai, Delhi, Mumbai, Kolkata).
- A hierarchy can have one or more levels, or grains.
- Each level can have one or more members.
- Each member can have one or more attributes, e.g. Population, Tourist Place.
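Rolling a measure up a hierarchy level can be sketched with a simple parent map. This is a hypothetical illustration — the member names follow the slide's Geography example, but the population figures and function names are made up.

```python
# Rolling a city-level measure up a Geography hierarchy:
# city -> state -> country -> continent. Figures are illustrative.

parent = {                       # child -> parent in the hierarchy
    "Miami": "FL", "Tampa": "FL", "Orlando": "FL", "Naples": "FL",
    "FL": "USA", "GA": "USA",
    "USA": "America",
}

population = {"Miami": 450, "Tampa": 390, "Orlando": 310, "Naples": 20}

def rollup(measure, to_level_members):
    """Aggregate a leaf-level measure up to the ancestor members listed."""
    totals = dict.fromkeys(to_level_members, 0)
    for member, value in measure.items():
        node = member
        while node is not None:
            if node in totals:           # reached the requested level
                totals[node] += value
                break
            node = parent.get(node)      # climb one level
    return totals

print(rollup(population, ["FL"]))    # {'FL': 1170}
print(rollup(population, ["USA"]))   # {'USA': 1170}
```

The same climb-and-accumulate idea is what OLAP engines do when they aggregate a grain-level fact to a higher hierarchy level.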
The Star Schema: Linking Facts and Dimensions
[Diagram: a Sales fact table — Date Key (int), Store Key (int), Product Key (int), Customer Key (int), Payment Type Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float) — joined to the Date/Time, Store, Product, Customer, and Payment Type dimension tables. Date/Time: Date Key (int), Date (dd/mm/yy), Day of Week (char), Day of Month (int), Month (int), Quarter (int), Year (int). Store: Store Key (int), Store name, Street Address, City, State, Region, Country. Product: Product Key (int), Product id, Product name, Product Group, Brand, Department.]
- The star join schema, or STAR SCHEMA, is a single FACT TABLE joined to a set of DIMENSION TABLES.
- Simple, symmetric, extensible, and optimized!
- The GRAIN of the star schema is the grain of its central fact table.
Star Schema
A particular form of a dimensional model:
- A central fact table containing the measures
- Surrounded by one perimeter of descriptors: the dimensions
Star Schema: Fact Table
This table is the core of the star schema structure and contains the facts, or measures, available through the data warehouse. These facts answer the questions of "what", "how much", or "how many". Some examples: sales dollars, units sold, gross profit, expense amount, net income, unit cost, number of employees, turnover, salary, tenure.
Star Schema: Dimension Tables
These tables describe the facts or measures. They contain the attributes and may also be hierarchical. Dimensions answer the questions of "who", "what", "when", or "where". Some examples:
- Day, Week, Month, Quarter, Year
- Sales Person, Sales Manager, VP of Sales
- Product, Product Category, Product Line
- Cost Center, Unit, Segment, Business, Company
Star Schema: Example
[Diagram: a Sales_Fact table — TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, plus the required business metrics (measures) — joined to Time_Dim (TimeKey, TheDate, …), Employee_Dim (EmployeeKey, EmployeeID, …), Product_Dim (ProductKey, ProductID, …), Customer_Dim (CustomerKey, CustomerID, …), and Shipper_Dim (ShipperKey, ShipperID, …).]
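A star schema of this kind can be exercised end-to-end in a few lines of SQLite. The table and column names below are illustrative (loosely following the earlier Sales example, not the exact diagram), and the query shows the canonical star-join shape: constrain and group on dimension attributes, aggregate the facts.

```python
import sqlite3

# A toy star schema: one fact table joined to two dimension tables by
# integer surrogate keys. Names and data are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT);
CREATE TABLE sales_fact  (store_key INTEGER, product_key INTEGER,
                          qty_sold INTEGER, sales REAL);
""")
con.executemany("INSERT INTO store_dim VALUES (?,?,?)",
                [(1, "Downtown", "East"), (2, "Mall", "West")])
con.executemany("INSERT INTO product_dim VALUES (?,?,?)",
                [(10, "Phone", "Acme"), (11, "Tablet", "Acme")])
con.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)",
                [(1, 10, 3, 300.0), (1, 11, 1, 200.0), (2, 10, 2, 200.0)])

# WHERE on dimension attributes, GROUP BY on dimension attributes,
# SUM over the fact column: the standard star-schema query.
rows = con.execute("""
    SELECT s.region, SUM(f.sales)
    FROM sales_fact f
    JOIN store_dim s   ON f.store_key = s.store_key
    JOIN product_dim p ON f.product_key = p.product_key
    WHERE p.brand = 'Acme'
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(rows)   # [('East', 500.0), ('West', 200.0)]
```

Note how every dimension joins to the fact table by exactly one key, which is the symmetry the "strengths" slides describe.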
Snowflake Schema
- Complex dimensions are re-normalized
- Different levels or hierarchies of a dimension are kept in separate tables
- A given dimension level has relationships to other levels of the same dimension
Star and Snowflake Schemas are De-normalized
- A star schema violates 3NF in its dimensions by collapsing higher-level dimensions into the lowest level, as with Brand and Category.
- It violates 2NF in its facts by collapsing common fact data from the Order Header into the transaction, such as Order Date.
- It often violates Boyce-Codd Normal Form (BCNF) by recording redundant relationships, such as the relationships from both Customer and Customer Demographics to Booked Order.
- However, it supports changing dimensions by preserving 1NF in Customer and Customer Demographics.
A Shot at Dimensional Modeling — STEP 1
- Identify subjects (dimensions)
- Identify the hierarchies of each dimension
- Identify the attributes of the levels in each hierarchy
- Define the grain
[Diagram: example Customer dimension hierarchies — Industry (Industry Segment → Industry Type) and Geography (Country → State → City) — leading to Customer, plus a Financial Class attribute.]
A Shot at Dimensional Modeling — STEP 2
- Use KPIs to identify the facts
- Group the facts into logical sets
Example: financial transactions (transaction amount, number of bonds, number of transactions, service cost, …) versus non-financial transactions (number of cheques cleared, number of visits to a branch, number of DEMAT transactions, …).
A Shot at Dimensional Modeling — STEP 3
Link each group of facts to the dimensions that participate in those facts.
[Diagram: the Financial Transactions fact group linked to the Customer, Product, Time, Organization, and Channel dimensions.]
A Shot at Dimensional Modeling — STEP 4
Define the granularity for each group of facts.
[Diagram: Financial Transactions at the grain of Customer (customer), Product (scheme), Time (day-hour), Organization (branch), and Channel (channel).]
Types of Dimensions
- Primary dimension: contributes to the fact grain; a set of these uniquely defines the associated fact. E.g. the SALES fact is typically completely defined by store, product, and time.
- Secondary dimension: does NOT contribute to the fact grain. Non-primary dimensions such as payment type, customer, and manufacturer are still important for analysis of the SALES fact, and are useful for rich analytic slicing and dicing, e.g. "top 10 customers".
- Degenerate dimension: a dimension without any attributes, but still useful for analysis. Generally included in the associated fact table before the facts — e.g. an invoice number, by itself, in a shipping fact.
Types of Dimensions (continued)
Conformed dimension:
- A dimension used across the enterprise
- Requires a standardized structure and definition
- Must be designed up front, before individual schemas are designed
- Plugs into multiple stars as either a primary or a secondary dimension
- E.g. Customer, Product, Store, Time, Employee
- A Customer could be captured at the store's card-swiping machine (sales fact), be part of a marketing promotion strategy (campaign fact), and also be serviced by a call center for warranty replacements (warranty fact)
- An Employee may be a sales rep claiming credit for sales (sales fact), a finance manager authorizing vendor payments (vendor payment fact), or a call center person taking customer calls (service call fact)
Types of Dimensions (continued)
Slowly changing dimension (SCD):
- Dimensional attributes change over time, and these changing realities must be captured as history
- Requires special design techniques to keep the dimension single-valued for each fact row while still retaining history
- E.g. customer city, marital status, sales rep department
- Type 1: overwrite previous values
- Type 2: create an additional time-stamped dimension record; Type 2 automatically partitions history
- Type 3: create an additional attribute column to retain one previous value (e.g. the first value, or the previous value)
- Requires the dimension key to be generalized (a surrogate key)
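The Type 2 mechanics can be sketched in a few lines: on an attribute change, close the current dimension row and insert a new time-stamped row under a fresh surrogate key. This is a hedged sketch, not a production implementation — the field names (cust_key, valid_from, valid_to) are hypothetical.

```python
from datetime import date

# SCD Type 2 sketch: history is kept as time-stamped row versions.
dim = [  # customer dimension: surrogate key, natural key, attribute, validity
    {"cust_key": 1, "cust_id": "C42", "city": "Pune",
     "valid_from": date(2010, 1, 1), "valid_to": None},  # None = current row
]

def apply_scd2(dim, cust_id, new_city, change_date):
    current = next(r for r in dim
                   if r["cust_id"] == cust_id and r["valid_to"] is None)
    if current["city"] == new_city:
        return current["cust_key"]          # nothing changed
    current["valid_to"] = change_date       # close the old version
    new_row = {"cust_key": max(r["cust_key"] for r in dim) + 1,
               "cust_id": cust_id, "city": new_city,
               "valid_from": change_date, "valid_to": None}
    dim.append(new_row)
    return new_row["cust_key"]              # key to use in new fact rows

key = apply_scd2(dim, "C42", "Mumbai", date(2012, 11, 5))
print(key, len(dim))   # 2 2 -> history retained as two versions
```

Facts loaded before the change keep pointing at key 1, facts loaded after it at key 2, which is exactly how Type 2 "automatically partitions history".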
Types of Dimensions (continued)
Rapidly changing small dimension:
- Same as an SCD, except the frequency of change is higher
- Need to track changes to attributes
- E.g. employee attributes such as appraisal rating; telecom products whose rate plans keep changing
Large dimension:
- Size increases with decreasing level of granularity
- Typical of public utility companies, government agencies, and the shopper records kept by supermarkets (e.g. Shoppers Stop)
- Do NOT create SCDs to address slow changes/history; see the monster dimension for an SCD strategy
- Choose indexing strategies to reduce query run times
- Choose the RDBMS wisely, e.g. SQL Server vs. Oracle vs. Teradata
Types of Dimensions (continued)
Rapidly changing monster dimension:
- Similar to a large dimension, but typical of a large insurance customer dimension: customers and claims are rapidly created and changed, and history must be tracked for credit and legal reasons
- Remove the continuously changing attributes to another dimension table, e.g. a demographics dimension
- Reduce the cardinality of these attributes by banding them, e.g. income_band, credit_band, etc.
- Create all possible combinations of these banded attributes, then assign a dimension key to each unique combination; this is the demographics table
- For each combination that represents the customer's status in a particular period, plug the demographics key into the fact as an additional key
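The banding-and-combinations technique can be sketched directly. The band names, boundaries, and function names below are made up for illustration; only the technique — band the volatile attributes, pre-build every combination, look up a key per period — follows the slide.

```python
from itertools import product

# Demographics mini-dimension sketch: band volatile attributes and
# pre-assign one integer key per unique combination of bands.
income_bands = ["low", "mid", "high"]
credit_bands = ["poor", "fair", "good"]

demographics = {combo: key + 1
                for key, combo in enumerate(product(income_bands, credit_bands))}

def band_income(income):            # illustrative boundaries
    return "low" if income < 30000 else "mid" if income < 90000 else "high"

def band_credit(score):             # illustrative boundaries
    return "poor" if score < 580 else "fair" if score < 700 else "good"

def demographic_key(income, score):
    """The key plugged into the fact row for this period's status."""
    return demographics[(band_income(income), band_credit(score))]

print(len(demographics))            # 9 combinations, fixed up front
print(demographic_key(45000, 710))  # same bands always map to the same key
```

Because the dimension is fully enumerated up front, a customer's changing status never touches the customer dimension itself; only the fact row's demographics key changes from period to period.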
Types of Dimensions (continued)
Junk dimension:
- A convenient grouping of random flags and attributes to get them out of the fact table
- Retain only useful fields; remove fields that make no sense at all, are inconsistently filled, or are of operational interest only
- Design is similar to demographics: enumerate the unique combinations, assign an integer key, and plug it into the fact
- Create new combinations (insert new dimension records) at ETL run time
- E.g. yes/no flags in old retail transaction data
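Unlike the demographics table, a junk dimension is often populated lazily at ETL run time, as the slide notes. A minimal sketch, with hypothetical flag names:

```python
# Junk dimension populated at ETL run time: each unseen combination of
# flags gets a new integer key on first encounter. Flag names invented.
junk_dim = {}   # (gift_wrap, expedited, promo) -> junk_key

def junk_key(flags):
    if flags not in junk_dim:
        junk_dim[flags] = len(junk_dim) + 1   # insert new dimension record
    return junk_dim[flags]

fact_rows = []
for txn in [("Y", "N", "N"), ("Y", "N", "N"), ("N", "Y", "N")]:
    fact_rows.append({"junk_key": junk_key(txn), "amount": 10.0})

print(len(junk_dim))                       # 2 distinct combinations
print([r["junk_key"] for r in fact_rows])  # [1, 1, 2]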
Types of Dimensions (continued)
Role-playing dimension:
- A dimension that appears several times in the same fact table; typically, the Date/Time dimension plays many roles
- E.g. Order Fulfillment is a typical retail fact table with Order Date, Packaging Date, Shipping Date, and Delivery Date dimensions
- Create one fact table key for each role and one SQL view of the dimension for each role; use the view names to run SQL queries
- In Business Objects, this scenario is designed using aliases and contexts
- E.g. (2) the Employee dimension plays the roles of sales rep and manager in a sales compensation fact, and appraiser and appraisee in an employee appraisal fact
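The one-view-per-role pattern can be demonstrated with SQLite. Table, view, and column names are illustrative, not from any specific product.

```python
import sqlite3

# Role-playing date dimension sketch: one physical date_dim, one SQL
# view per role, each joined to the fact by its own foreign key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, cal_date TEXT);
CREATE TABLE order_fact (order_date_key INTEGER, ship_date_key INTEGER,
                         amount REAL);

-- one view per role of the single date dimension
CREATE VIEW order_date AS SELECT date_key, cal_date FROM date_dim;
CREATE VIEW ship_date  AS SELECT date_key, cal_date FROM date_dim;
""")
con.executemany("INSERT INTO date_dim VALUES (?,?)",
                [(1, "2012-11-01"), (2, "2012-11-05")])
con.execute("INSERT INTO order_fact VALUES (1, 2, 99.0)")

row = con.execute("""
    SELECT o.cal_date, s.cal_date, f.amount
    FROM order_fact f
    JOIN order_date o ON f.order_date_key = o.date_key
    JOIN ship_date  s ON f.ship_date_key  = s.date_key
""").fetchone()
print(row)   # ('2012-11-01', '2012-11-05', 99.0)
```

The views cost nothing at storage time but let each role carry its own unambiguous name in queries.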
Dimensional Hierarchy
[Diagram: the Geography dimension hierarchy again — World level (World) → Continent level (America, Europe, Asia) → Country level (USA, Canada, Argentina) → State level (FL, GA, VA, CA, WA) → City level (Miami, Tampa, Orlando, Naples) — with member attributes such as Population and Tourist Place.]
Types of Facts
Value-based classification:
- Numeric facts
- Count/occurrence-based facts (e.g. employees assigned to a project)
- Non-numeric facts (e.g. comments in fact tables)
Summary-based classification:
- Additive (along all dimensions)
- Semi-additive (additive along some dimensions only, most often not along Time)
- Non-additive (cannot be added along any dimension)
In the example discussed earlier, Sales and Number of Total Minutes are value-based, additive facts, as they can be added across all dimensions. Price, on the other hand, is a value-based but semi-additive fact: it cannot be added across the product dimension, as a "total price" does not make sense — an average price across products is more meaningful.
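The additivity distinction can be shown on a tiny fact table. The rows and numbers below are made up for illustration.

```python
# Additive vs. semi-additive facts. Sales can be summed across products;
# price cannot (a "total price" is meaningless, an average is not).
facts = [  # (product, sales, price)
    ("Phone",  300.0, 100.0),
    ("Phone",  200.0, 100.0),
    ("Tablet", 400.0, 200.0),
]

total_sales = sum(sales for _, sales, _ in facts)        # meaningful
avg_price   = sum(p for _, _, p in facts) / len(facts)   # meaningful
total_price = sum(p for _, _, p in facts)                # NOT meaningful

print(total_sales)            # 900.0
print(round(avg_price, 2))    # 133.33
```

A reporting layer should therefore store, per fact column, which aggregation functions are valid along which dimensions.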
Types of Fact Tables
Fact tables are classified based on the type of grain they address — the level of detail they contain — and the way the measurements are taken with respect to time. Thus we have:
- Transaction fact table
- Snapshot fact table
- Summary fact table
[Figure 1: the context of a transaction is modeled as a set of generally independent dimensions — the figure shows seven. The measured transaction amount sits in a fact table that refers to all the dimensions by foreign keys pointing outward to their respective dimension tables. The clean removal of all context detail from the transaction record is an important normalization step, and is why fact tables are highly normalized.]
Types of Fact Tables (continued)
Kimball describes three fundamental types of fact table grain:
- Transaction: a transaction is a set of data fields that record a basic business event — e.g. a point-of-sale transaction in a supermarket, attendance in a classroom, an insurance claim. The measurements group nicely into a single fact table with the same grain.
- Periodic snapshot: a snapshot is a measurement of status at a specific point in time. E.g. in Figure 2, earned premium is the fraction of the total policy premium that the insurance company can book as revenue during a particular reporting period. A periodic-snapshot-grained fact table represents a predefined time span.
- Accumulating snapshot: an accumulating-snapshot-grained fact table represents an indeterminate time span, covering the entire history from when the collision coverage was created for the car in the example (Figure 2) to the present moment. In dramatic contrast to the other fact-table types, accumulating-snapshot fact records are frequently revisited to update the facts: there is generally only one fact record for the collision coverage on a particular customer's car, and as history unfolds the same record is revisited several times to revise the accumulating status.
Visualizing a Dimensional Model
The most popular way to visualize a dimensional model is to draw a cube: a three-dimensional model maps directly onto one. A dimensional model usually has more than three dimensions and is then properly called a hypercube, but since a hypercube is difficult to visualize, "cube" remains the more commonly used term.
[Figure 1: the measurement is the volume of production, determined by the combination of three dimensions — location, product, and time. The location and product dimensions each have two levels of hierarchy; for example, the location dimension has a region level and a plant level. Each dimension has members, such as the east and west regions of the location dimension, and the time dimension has members such as 1996 and 1997. Each sub-cube holds a number representing the production volume; for example, in a specific time period, the Armonk plant in the East region produced 11,000 cell phones of model number 1001.]
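A cube of this kind can be modeled as a mapping from member tuples to a measure, with slicing along any dimension. The members loosely follow the figure's example; the numbers and function names are illustrative.

```python
# A three-dimensional "cube" keyed by (location, product, time).
cube = {
    ("Armonk", "CellPhone-1001", 1997): 11000,
    ("Armonk", "CellPhone-1002", 1997):  4000,
    ("Austin", "CellPhone-1001", 1997):  7000,
    ("Armonk", "CellPhone-1001", 1996):  9000,
}

def slice_sum(cube, location=None, product=None, year=None):
    """Sum production volume over all cells matching the fixed members."""
    return sum(v for (loc, prod, yr), v in cube.items()
               if (location is None or loc == location)
               and (product is None or prod == product)
               and (year is None or yr == year))

print(slice_sum(cube, location="Armonk"))         # 24000
print(slice_sum(cube, product="CellPhone-1001"))  # 27000
print(slice_sum(cube, year=1997))                 # 22000
```

Fixing one member and summing over the rest is exactly a "slice" of the cube; fixing two is a "dice" down to a single row or column.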
Data Warehouse Bus Matrix
All the dimensional models together form the logical design of the data warehouse. To decide which dimensional models to build, we start with a top-down planning approach called the Data Warehouse Bus Architecture Matrix. This matrix forces us to list all the data marts we could possibly build and to name all the dimensions present in those data marts (at a high level). A dimensional model is made up of one or more star schemas, some of which may be snowflaked for better organization and storage.
[Matrix: subject areas (rows) — Accounts, Sales, Quotes, General Ledger, Shipment, Parts/Finance — against dimensions (columns) — Organization, Equipment, Employee, Customer, Accounts, Calendar, Outage, Vendor.]
Dimensional Modeling Approach (CDM → LDM → PDM)
Each star schema has a single fact table at its centre, surrounded by multiple dimension tables. Once the bus matrix is in place, each individual fact table/star schema is designed using a four-step process.
STEP 1: Identify the subject area / business process. Start the model by choosing a single business process or sub-process, so that you have only one fact table — e.g. the SALES business process.
STEP 2: Define the fact table grain. Choose the GRAIN of the central fact table, e.g. (i) each sales transaction is a fact record: the grain is sales by product, by store, by transaction; or (ii) each daily product sales total in each store is a fact record: the grain is sales by product, by store, by day.
STEP 3a: Identify the dimensions:
- Primary dimensions come from the fact grain, e.g. product, store, day (or date)
- Additional dimensions come from user interviews and analysis of reports
- Ensure each dimension is at its lowest possible level of detail while still being single-valued
Dimensional Modeling Approach
STEP 3b: Identify grain of dimension table
Ensure that each dimension table's grain is NOT lower than the central fact table's grain, e.g. the store dimension should have one row for each store. Each store may have departments, but the store dimension's row should represent only the store, not the department.
STEP 3c: Identify all dimensional attributes
For each dimension choose only SINGLE-VALUED attributes, e.g. if Region is an attribute of the store dimension then it should have one and only one value for each store.
STEP 3d: Identify dimension hierarchies and the attributes of levels in hierarchies
Figure: Customer dimension hierarchies (Industry: Industry Segment, Industry Type; Geography: Country, State, City; plus Fin. Class) alongside a Sales fact table with Date Key (int), Store Key (int), Product Key (int), Customer Key (int), Sales (float), Qty Sold (int), Price (float), and Discount (float).
STEP 4a: Choose Facts
Choose each fact for the fact table, making sure that the fact is relevant and has the same grain as the fact table, e.g. for the SALES fact table, typical facts would be price, quantity sold, and sales amount, as these are all dimensioned by product, by store, by day.
Dimensional Modeling Approach
STEP 4b: Connect the fact table to the dimension tables by means of surrogate keys
Figure: a Financial Transactions fact table connected to the Customer (Customer), Product (Scheme), Time (Day-Hour), Organization (Branch), and Channel (Channel) dimensions.
Important Notes:
1. Each dimension table has a MEANINGLESS single-part integer primary key called a surrogate key. This key also occurs as part of the central fact table's primary key.
2. All components of a fact table's primary key should ideally be foreign keys to the corresponding dimension tables.
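The surrogate-key mechanism described in note 1 can be sketched in a few lines of Python; the table and column names (e.g. store_code) are invented for illustration, not taken from the slides:

```python
# Minimal sketch of surrogate-key assignment for a dimension table.
# Names such as "store_code" are illustrative assumptions.

def assign_surrogate_keys(rows, natural_key, next_key=1, lookup=None):
    """Map each natural key (e.g. a store code) to a meaningless integer key."""
    lookup = {} if lookup is None else lookup
    for row in rows:
        nk = row[natural_key]
        if nk not in lookup:            # new dimension member: mint a new key
            lookup[nk] = next_key
            next_key += 1
        row["surrogate_key"] = lookup[nk]
    return lookup

stores = [{"store_code": "NY01"}, {"store_code": "SF02"}, {"store_code": "NY01"}]
keys = assign_surrogate_keys(stores, "store_code")
# The fact table then references stores only via the integer surrogate key.
```

In a real warehouse the lookup table persists across loads, so a returning natural key always maps back to the same surrogate key.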
Implementation and Maintenance
DWH Design, Deployment and Maintenance
Implementation & Maintenance Agenda
A. Physical Design Steps
B. Physical Design Considerations
C. Physical Storage
D. Indexing
E. Performance Enhancement Techniques
F. Deployment Activities
G. Security
H. Backup and Recovery
I. Monitoring the Data Warehouse
J. User Training and Support
K. Managing the Data Warehouse
DWH Physical Design Process
Physical Design Process: Develop Standards
Process standards
Naming standards: database objects, word separators, names in the logical and physical models, physical file naming standards, naming of files and tables in the staging area
Continued
Create an aggregates plan: identify the granularity level
Determine the data partitioning scheme: selecting facts and dimensions, horizontal or vertical partitioning, number of partitions, criteria for partitions
Continued
Establish clustering options: placing and managing related units of data together
Prepare the indexing strategy: identify the columns, identify the sequence
Assign storage structures
Complete the physical model: review all the above activities, then create the physical model
Physical Design Considerations
Improve performance
Ensure scalability
Manage storage
Provide ease of administration
Design flexibility
Assign storage structures
Physical Storage: Types of Storage Structure
Files: facts, dimensions
Indexed data structures
Storage Considerations
Set correct units of database space allocation: data block
Set proper block usage parameters: free and used space
Manage data migration: row chaining, row migration
Manage block utilization: keep less free space per block (warehouse data is rarely updated in place)
Continued
Resolve dynamic extension: occurs when inserting a new record or updating an existing record
Employ file striping techniques: splitting files into multiple physical parts enables concurrent I/O
Indexing the Data Warehouse
What are indexes?
Why are indexes required?
When should I create an index?
What are the different types of indexes?
Types of Indexes
B-tree: the default and most common.
B-tree cluster: defined specifically for a table cluster.
Hash cluster: defined specifically for a hash cluster.
Bitmap: compact; works best for columns with a small set of distinct values.
Bitmap join: an index on one table that involves columns of one or more other tables through a join.
Function-based: contains the pre-computed value of a function or expression.
B-tree Index
B-tree Index - Advantages
All leaf blocks of the tree are at the same depth, so retrieving any record from anywhere in the index takes approximately the same amount of time.
B-tree indexes automatically stay balanced; blocks of the B-tree are on average three-quarters full.
B-trees provide excellent retrieval performance for a wide range of queries, including exact-match and range searches.
Inserts, updates, and deletes are efficient, maintaining key order for fast retrieval.
B-tree performance is good for both small and large tables and does not degrade as a table grows.
Bitmapped Index
Bitmapped Index - Advantages
Reduced response time for large classes of ad hoc queries
Substantial reduction in space use compared with other indexing techniques
Dramatic performance gains even on very low-end hardware
Very efficient parallel DML and loads
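A minimal sketch of why bitmap indexes are compact and fast for low-cardinality columns: one bit-vector per distinct value, one bit per row. This is pure Python for illustration; a real DBMS stores compressed bit-vectors.

```python
# Sketch of a bitmap index over a low-cardinality column.

def build_bitmap_index(column):
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << row_id    # set the bit for this row
    return bitmaps

region = ["East", "West", "East", "East", "West"]
idx = build_bitmap_index(region)

# Bitwise AND/OR of the vectors answers multi-predicate queries
# without scanning the rows themselves:
east_rows = [r for r in range(len(region)) if idx["East"] >> r & 1]
```

Combining several such bitmaps with AND/OR is what makes ad hoc multi-condition queries cheap.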
Clustered Index
Performance Enhancement Techniques
Data partitioning: decomposing tables into smaller, more manageable pieces called partitions; range, list, hash, and composite partitioning
Data clustering
Parallel processing
Summary levels
Referential integrity checks
Initialization parameters
Data arrays
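The range and hash partitioning schemes named above can be sketched as follows; the partition key ("day") and boundaries are invented for illustration:

```python
# Sketch of range vs. hash partitioning of fact rows.
# Partition key and boundaries are illustrative assumptions.

def range_partition(rows, key, boundaries):
    """Assign each row to the first partition whose upper bound exceeds its key."""
    parts = {b: [] for b in boundaries}
    for row in rows:
        for bound in boundaries:
            if row[key] < bound:
                parts[bound].append(row)
                break
    return parts

def hash_partition(rows, key, n):
    """Spread rows evenly across n partitions by hashing the key."""
    parts = {i: [] for i in range(n)}
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

sales = [{"day": 5}, {"day": 17}, {"day": 28}]
by_range = range_partition(sales, "day", [10, 20, 31])
```

Range partitioning keeps related rows (e.g. one month) together for pruning; hash partitioning balances load when no natural range exists.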
Major Deployment Activities
Complete user acceptance
Perform initial loads
Get user desktops ready
Complete initial user training
Deployment Approaches
Top-down approach: deploy the overall enterprise DWH, followed by the dependent data marts, one by one.
Bottom-up approach: gather departmental requirements and deploy the independent data marts, one by one.
Practical/general approach: deploy the subject data marts one by one, with a flexible approach and fully conformed dimensions, following a waterfall model.
Note: it is always advisable to deploy in stages.
Security
Prepare a security policy: it should cover the scope of information, physical security, network and connections, DB access privileges, and the access matrix.
Manage user privileges
Password considerations
Security tools
Backup and Recovery
Why is backup required?
What is data warehouse administration?
What are the roles of a Data Warehouse Administrator (DWA)?
DWA - Roles
Building the data warehouse
Ongoing monitoring and maintenance of the data warehouse
Coordinating usage of the data warehouse
Giving management feedback on successes and failures
Competing for the resources needed to make the data warehouse a reality
Selecting hardware and software platforms
Backup Strategy
Should the data actually be discarded, or should it be moved to lower-cost bulk storage?
What criteria should be applied to determine whether data is a candidate for removal?
Should the data be condensed (profile records, rolling summarization, etc.)? If so, what condensation technique should be used?
How should the data be indexed once it is removed (if there is ever to be any attempt to retrieve it)?
Where and how should metadata be stored once the data is removed?
Continued
Should metadata be allowed to remain in the data warehouse for data that is no longer actively stored in the warehouse?
What release information should be stored for the base technology (i.e., the DBMS) so that the data as stored will not become stale and unreadable?
How physically reliable is the media the data will be stored on?
What seek time to first record applies when the data is retrieved?
Monitoring the Data Warehouse
Collection of statistics
Using statistics for growth planning
Using statistics for fine-tuning
Publishing trends for users
Support
Help desk support
Hotline support
Technical support
User representative
Note: always follow a multi-tier support structure.
User Training
User training contents: should provide enough data content, should cover all the applications involved, and should cover the features and usage of the tools used.
Identifying the users to be trained
Delivering the training program
Managing the Data Warehouse
Platform upgrades
Managing data growth
Storage management
ETL management
Data model revisions
Information delivery enhancements
Ongoing fine-tuning
Data Management
Enterprise Data Management Framework
Figure: the framework comprises Data Governance, Master Data Management, Data Storage, Movement and Access, Data Architecture, Data Quality Management, and Metadata Management.
Enterprise Data Management Framework explained
Data Governance: Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise. It encompasses the people, processes, and procedures required to create a consistent, enterprise-wide view of a company's data.
Data Storage, Movement and Access: Data movement involves translating/moving data from one format or storage device to another. Data security is the means of ensuring that data is kept safe from corruption and that access to it is suitably controlled.
Metadata Management: Metadata is data about data. It describes how, when, and by whom a particular set of data was collected, and how the data is formatted. Metadata management is becoming very important because, as systems become more interdependent, it is vital to know the impact of altering data.
Data Architecture
Data Quality Management: Data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data. Maintaining data quality requires going through the data periodically and scrubbing it. Typically this involves updating it, standardizing it, and de-duplicating records to create a single view of the data, even if it is stored in multiple disparate systems.
Data Quality Analysis
Data Quality Analysis Agenda
A. Why Data Quality Management?
B. Elements for Data Quality
C. Classification of Data Quality Issues
D. Dimensions of Data Quality
E. TCS DQM Approach
F. DQM Architecture Options
G. BIDS™ DQM Methodology
H. Tools and Technology
Why DQM? Typical user complaints:
Why is this NULL?
Where can I get just one view of all the data?
Is empid the same as emp_id?
There are so many duplicate products on this list.
I am still not able to see the latest data.
Returns on investment are below expectations.
Holland??? Is this customer in Europe or the USA?
Elements for Data Quality
Data quality can be hampered by errors in the following elements: definitions, domains, completeness, validity, data flows, structural integrity, business rules, and transformations.
Definitions
This indicates how entities are referenced throughout the enterprise. Definition problems can be categorized as:
Synonyms: fields such as EMP_ID, EMPID, and EM01 may or may not all refer to the same type of data.
Homonyms: fields that are spelled the same but really aren't the same (e.g. id vs. ID).
Relationships: just because a field is named FK_INVOICE doesn't mean that it is really a foreign key to the invoice file.
Domains
Domains describe the range and types of values that can be present in a data set. Some examples of domain errors are:
Unexpected values: e.g. Home State = one of {Kan, Mic, Min,)
Cardinality: a Yes/No field can have only two credible values
Uniqueness: e.g. for a field, 98% of the data is NULL
Constants and outliers
Length of field, precision, scale
Internationalization: date formats, postal codes, time zones, etc.
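A hedged sketch of a few of these domain checks in Python; the valid-value set, length limit, and null-rate threshold are illustrative assumptions, not rules from the slides:

```python
# Sketch of domain checks: valid-value set, null rate, field length.
# The thresholds and valid sets below are invented for illustration.

def domain_errors(values, valid=None, max_len=None, max_null_rate=0.05):
    errors = []
    nulls = sum(1 for v in values if v is None)
    if nulls / len(values) > max_null_rate:
        errors.append("null rate too high")
    for v in values:
        if v is None:
            continue
        if valid is not None and v not in valid:
            errors.append(f"unexpected value: {v}")
        if max_len is not None and len(v) > max_len:   # assumes string values
            errors.append(f"value too long: {v}")
    return errors

states = ["KS", "MI", "Kan", None]
errs = domain_errors(states, valid={"KS", "MI", "MN"}, max_len=2)
```

A profiling tool runs checks like these over every column and reports the violation counts.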
Completeness
This indicates whether or not all of the data is actually present. The completeness of a dataset can be gauged by its:
Integrity: does the actual data map to our definition of the data?
Accuracy: name and address matching, demographics checks
Reliability: e.g. the zip code should match the city and state
Redundancy: data duplication
Consistency: e.g. is the same invoice number referenced with different amounts?
Validity
Validity indicates whether or not the data is valid. Validity checks used to spot data problems include:
Acceptability: e.g. a product part number should be a 7-character alphanumeric string with two letters and 5 digits
Anomalies
Timeliness
Data Flows
These checks relate to the aggregate results of moving data from source to target. Many data quality problems can be traced back to incorrect data loads, missed loads, or system failures that go unnoticed. Data flow checks that help ensure data quality include:
Record counts: reconciliation of source and target record counts
Checksums
Timestamps
Process time
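A minimal sketch of a record-count and checksum reconciliation between a source and a target load; the column name "amount" is an invented example:

```python
# Sketch of a source-to-target data-flow reconciliation: compare the
# record count and a column checksum after each load.

def reconcile(source_rows, target_rows, amount_col="amount"):
    return {
        "count_match": len(source_rows) == len(target_rows),
        "checksum_match": (
            round(sum(r[amount_col] for r in source_rows), 2)
            == round(sum(r[amount_col] for r in target_rows), 2)
        ),
    }

src = [{"amount": 10.0}, {"amount": 5.5}]
tgt = [{"amount": 10.0}, {"amount": 5.5}]
result = reconcile(src, tgt)
```

Running such a reconciliation after every load catches the missed or partial loads the slide mentions before users do.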
Structural Integrity
These checks ensure that when the data is taken as a whole, you get correct results. Structural integrity checks include:
Cardinality: checks between tables
Primary keys: are they unique?
Referential integrity: e.g. a product appears on an invoice but is missing from the product catalog
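The referential-integrity case above (a product on an invoice but missing from the catalog) can be sketched as a simple set difference; the key names are invented for illustration:

```python
# Sketch of an orphan-key check: foreign-key values in the fact/detail
# table that have no matching key in the dimension/master table.

def orphan_keys(fact_rows, dim_keys, fk="product_key"):
    return sorted({r[fk] for r in fact_rows} - set(dim_keys))

invoices = [{"product_key": 1}, {"product_key": 2}, {"product_key": 9}]
catalog = [1, 2, 3]
orphans = orphan_keys(invoices, catalog)   # products invoiced but not in catalog
```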
Business Rules
Business rule checks measure the degree of compliance between actual data and expected data. These checks consist of:
Constraints: does the data comply with a known set of validations?
Computational rules: e.g. is the formula for deriving an amount correct?
Comparisons
Functional dependencies
Conditions
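A sketch of a computational-rule check, recomputing a derived amount and flagging rows that disagree; the formula qty * price - discount is an assumed example rule, not one from the slides:

```python
# Sketch of a computational business-rule check: recompute the derived
# amount and flag rows where stored and derived values disagree.

def rule_violations(rows):
    bad = []
    for i, r in enumerate(rows):
        expected = round(r["qty"] * r["price"] - r["discount"], 2)
        if round(r["amount"], 2) != expected:
            bad.append(i)                  # row index of the violation
    return bad

rows = [
    {"qty": 2, "price": 5.0, "discount": 1.0, "amount": 9.0},
    {"qty": 3, "price": 4.0, "discount": 0.0, "amount": 11.0},  # should be 12.0
]
violations = rule_violations(rows)
```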
Transformations
Transformation checks examine the impact of data transformations as data moves from system to system. The quality of data can be affected by incorrect transformation logic. The only way to identify such problems is to compare the source data set with the target data set and verify the transformations for:
Computations
Merging
Filtering
Relationships
Data Quality Issues - Classification
Figure: data quality issues fall into physical issues, logical issues, and unmanaged data issues; they are addressed through data profiling, business rules (for cleansing), and data parsing (using rules, text mining, etc.).
Dimensions of Data Quality
The FIVE dimensions across which data quality is measured:
Sufficiency: sufficiency for the purpose of business intelligence
Consistency: consistency of the definition of data across the data warehouse
Accuracy: accuracy as defined by business rules
Redundancy: no redundancy across the warehouse
Latency: no major change of data between the instant of data capture and when it is processed
TCS DQM Approach
Analysis of source data quality: user driven, data driven
Characterization of quality data: identification of the necessary and sufficient criteria that define quality data; domain validations and business-rule validations are touched upon here
Feasibility analysis: mapping of data elements to rules, mapping of relationships to rules, assessment of the grain of data
Design and implementation: BIDS DQM Methodology
DQM Architecture
DQM at the source
DQM as part of the ETL processes
DQM in the target
DQM Methodology
Modular approach to building solutions
Clear and well-defined guidelines, checklists, and standards
Supports the onsite-offshore delivery model
Flexibility to adapt to other methodologies
E-T-V-X criteria reinforced by best practices and TCS quality initiatives
Tools and Technology
Common software products for name and address cleansing: Trillium, First Logic, TCS DataClean
Common ETL tools: Informatica, Ab Initio
Tools and Technology
TCS has expertise in industry-standard tools and products ranging from RDBMS, ETL, CGI, and web products, in conjunction with in-house developed tools. The TCS Knowledge Base also includes a number of specialised tools for data cleansing, validation, and trending. The common software products in use:
Trillium Software: used for cleansing name and address data. The software is able to identify and match households, business contacts, and other relationships to eliminate duplicates in large databases using fuzzy matching techniques.
Ab Initio: provides a suite of software packages used for ETL in data warehouses. Its features include parallel data transformation, validation and filtering, real-time data capture, integration with relational DBMS systems, and data profiling capability.
Unitech and Actuate: a set of reporting tools used for trend analysis, point-to-point reconciliation, and detecting data inconsistency.
Case Study: British Telecom Retail : SWIFT
Client Profile and Business Drivers
Client profile:
BT Retail is a significant player in the communications market in the UK.
BT Retail has three main customer groups: consumer, business, and major business/corporate.
Its products and services cover the entire range from traditional telephony to mobile technology, internet access, and web-based services.
Business drivers: to unify the existing marketing systems into a centralized customer repository that caters to:
Developing, targeting, and presenting propositions
Managing customer relationships
Undertaking rapid tactical marketing
Improving campaign effectiveness
Reducing marketing operational costs
Business Objectives
Reduction in marketing operational costs
Reduction of marketing cycle times
360-degree view of customers
Delivery of consistent messages across all customer channels
Increased customer focus; better understanding and segmentation of customers
Event-driven campaigns targeted at focused customer groups
Improvement in campaign effectiveness
Maintaining a large data volume: one of the largest data warehouses in Europe, with 3.36 TB of data and a growth projection of 1% per week
Challenges
Data quality issues in vital data attributes in BT's operational data store
Cleansing a backlog of erroneous information stored in the database
Decommissioning and migration of data from legacy systems; decommissioning of 30 TB of RDBMS
Maintaining data integrity
Proposed Profiling and DQM Solution
As part of the solution, the team deployed a Business Rules Repository (BRR) to store all business rules scattered across the enterprise. This enabled:
Sharing of information among business and IT stakeholders in an effective and efficient manner
Storage of basic information about each business rule, with a history of the changes applied to it over time
Types of business rules:
Format check: numeric, character, or date with a specific pattern
Cross-attribute value check within a dataset (compare multiple attributes in a dataset)
List of values: for a small list of valid values
Lookup: for a large list of values, like a list of country codes, etc.
Uniqueness or duplication check
Data integrity check
Cross-attribute value check across datasets
Data Profiler reports were created to report:
Structure and statistics for each data element
Data value, range, distribution, pattern, and format of each attribute
Relationships of various attributes within and across datasets: join keys, primary keys, potential foreign keys, data dependencies, etc.
A web-based application, Quest, was delivered as a value-add for data quality management.
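A hedged sketch of how rules of the first and third types (format check, list of values) might be stored and applied from a small rules repository; the rule names, pattern, and value list are invented, not BT's actual rules:

```python
# Sketch of a tiny business-rules repository keyed by rule type.
# The pattern and the list of values are illustrative assumptions.
import re

RULES = {
    "format": lambda v: bool(re.fullmatch(r"[A-Z]{2}\d{5}", v)),  # pattern check
    "list_of_values": lambda v: v in {"GB", "US", "IN"},          # small LOV
}

def apply_rule(rule_name, value):
    """Look a rule up by name and evaluate it against one value."""
    return RULES[rule_name](value)

ok = apply_rule("format", "AB12345")
bad = apply_rule("list_of_values", "ZZ")
```

A real BRR would also carry the descriptive metadata and change history the slide describes; only the evaluation step is sketched here.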
Integration of Data Profiler and BRR
Figure: the BRR holds source system rules, common rules, and target system rules. An adaptor embeds the business rules into the Data Profiler, which profiles and analyses the source system test data and, after schema transformation, the target system test data; the two profiles are then compared.
Software and Hardware
Software: database layer - Oracle; application layer - Ab Initio, Trillium, Unitech, and Actuate
Hardware: IBM Sequent NUMA-Q server, 16 quad machine with 2 GB RAM
Application Architecture
Figure: full-volume source data from the legacy system, sales system, ERP, CRM, billing system, and 3rd-party data is profiled; the profiled information feeds data analysts and business owners, design documents, and DQ reports. The profiler output reduces data assumptions, supports requirement analysis and high/low-level solution design, and is used to build the Business Rules Repository (BRR); the rules defined in the BRR are embedded back into the Data Profiler for data audit and Data Quality Monitors (DQM). During the various test phases, source test data and target test data are profiled and the profiler outputs compared to validate the transformation process; after deployment, live data is profiled on the live target system for ongoing data audit.
Benefits to Client
A huge backlog of data quality issues was resolved, leading to savings worth millions of pounds for BT
A generic name and address data cleansing methodology was designed that can be reused, time-effectively, as a prototype for similar requirements
Profiling of live data on an ongoing basis to check compliance over time
The BRR was used to develop Data Quality Monitors (DQM)
Development of a uniform data dictionary for all disparate source systems
Reduced risks and more accurate planning
Client speak: "We have made fantastic progress in managing to roll out some really big, complex deliveries... all thanks to your commitment, and your ability to work as a team in order to resolve issues quickly whilst under a lot of pressure. Well done everyone." - Simon Riley
Metadata Management
Metadata: Data about Data
For every data element: definition, characteristics, relationships to other data elements
Metadata categories: business metadata, technical metadata
Metadata currency: static (slowly changing) metadata, dynamic metadata
Metadata types: control metadata, process metadata
Metadata Products
Candidate products and their vendors:
SuperGlue - Informatica
System Architect - Popkin
MetaStage - Ascential
Platinum Repository - Platinum Technologies
Advantage Repository - Computer Associates
Rochade - Allen Systems Group
Microsoft Repository - Platinum/Microsoft Partnership
MetaCentre - Data Advantage Group
MDM - i2
Data Governance
Data Governance
What is data governance? Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.
Why go in for it?
Increase consistency of, and confidence in, decision making
Decrease the risk of regulatory fines
Improve data security
Master Data Management
What is Master Data Management?
Master Data Management (MDM) is a discipline in Information Technology (IT) that focuses on the management of reference or master data shared by several disparate IT systems and groups. It combines practices, processes, and technologies applied to master data and metadata.
MDM enables consistent computing between diverse system architectures and business functions.
MDM integrates dimensional and master data across BI, data warehouse, and financial and operational systems, providing accurate, consistent, and compliant enterprise reporting.
MDM supplies metadata for aggregating and integrating transactional data.
Typical Requirements for MDM
Role definition support: support for the definition of roles, with access rights enforced depending on the responsibilities assigned to each role
ETL: ETL capabilities for extracting master/reference data files or tables from multiple sources and loading the data into the master data repository
Data cleansing: data cleansing capabilities for de-duplication and matching of master data records
Collaborative platform: a collaborative platform for coordinating decisions on master data reconciliation and rationalization. The platform should be supported by standards, if available, or by industry knowledge of a master data domain; an example is a standard product hierarchy for a particular industry
Data synchronization and replication support: for applying changes established in a central server to each consuming application. Incremental change support is important for performance reasons
Version control and change monitoring: version control at the central policy hub combined with change monitoring across all of the participating systems. This is needed in order to track changes to master data over time
Processes Required for Master Data Management
Master data is managed via a policy hub, as shown in the figure:
The policy hub collects master data from the participating analytical and transactional systems
Collaborative applications run on the central policy hub to coordinate decisions among team members on master data policies
The standard master data is published to each participating system (transactional and analytical) so that they stay synchronized with the hub
Processes Required for Master Data Management
Steps in the process for managing and maintaining master data:
1. Assign business responsibility for each master data domain, such as products, customers, suppliers, and organizational structure
2. Extract master data for a domain from the separate operational and reporting systems to a central server
3. Apply data quality standards, such as de-duplication and matching of master data records, to get a clean set of master data for the domain
4. Reconcile and rationalize the master data records. This entails setting policies pertaining to an optimal product hierarchy, organizational structure, or preferred supplier list
5. Synchronize the participating operational and reporting systems with the centrally managed, canonical master data
6. Monitor changes or updates to master data in each participating system, then repeat the preceding steps for ongoing maintenance. Over time, with the centralization of master data management responsibilities, the origination of master data changes moves from the participating systems to the master data management hub or server
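Step 3 above (de-duplication and matching of master data records) can be sketched as follows; the whitespace-and-case normalization is a simplistic stand-in for real fuzzy matching, and the record names are invented:

```python
# Sketch of master-record de-duplication via a normalized match key.
# Real MDM tools use fuzzy matching; this crude normalization is a stand-in.

def dedupe_masters(records, name_col="name"):
    seen, survivors = set(), []
    for rec in records:
        key = "".join(rec[name_col].lower().split())   # normalize case/spacing
        if key not in seen:                            # first record survives
            seen.add(key)
            survivors.append(rec)
    return survivors

masters = [{"name": "Acme Corp"}, {"name": "ACME  corp"}, {"name": "Globex"}]
clean = dedupe_masters(masters)
```

In practice the surviving "golden record" is merged from the duplicates rather than simply being the first one seen.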
Data Storage, Movement and Access
Data Security and Access
Data security is the means of ensuring that data is kept safe from corruption and that access to it is suitably controlled; it thus helps to ensure privacy and to protect personal data. It is the process of protecting data from unauthorized access, use, disclosure, destruction, modification, or disruption. Protecting confidential information is a business requirement, in many cases a legal requirement, and, some would say, simply the right thing to do.
Question & Answer Session...