
Information as a Service

For a large enterprise to function effectively, all stakeholders and information systems need to have access to accurate data in real time.  Traditionally, this has been a difficult goal to attain, as heterogeneous data formats, application integration costs, and batch processing lag times inhibited the free flow of accurate, timely information amongst systems.  With Information-as-a-Service, sources of enterprise data are exposed as Enterprise Services and made available for consumption by virtually any application that needs them.  This allows you to:

Distribute data across the enterprise as a shared service – With Information-as-a-Service, it is possible for ERP and CRM systems, business intelligence tools, mash-ups, and portals to interact with identical data in real time.

Create a single source of truth for major domains – Information-as-a-Service provides the ability to establish and maintain one trusted source of data for specific work flows, getting everyone on the same page.

Reduce operational problems that stem from batch data updates between systems – Even minor discrepancies in data between out-of-sync batches in enterprise systems can cause serious problems, especially in financial transactions.  Information-as-a-Service enables business processes and users to work with the most up-to-date data in critical applications.

Simplify and streamline the sharing of data between enterprise systems – Information-as-a-Service reduces many of the cost, time, and hassle factors that have inhibited the thorough sharing of back-end data with consuming systems in the past.  By establishing a single, trusted source of data as a shared service, it is possible to set up separate consumers of that data in a number of separate applications with comparatively little effort.

 


Information-as-a-Service is a boon to business agility and operational effectiveness.  However, getting to success with information services requires making sure that the services are planned, built, and operated in alignment with the broader enterprise strategy and IT assets.  This is a matter of governance. 

Planning governance helps ensure that information service candidates are properly aligned with enterprise master data management (MDM) initiatives, and that these services provide the right level of information granularity. 

Effective development governance follows on to validate that schemas and other information service definition artifacts comply with enterprise policies, and are correctly specified and used.  Development governance also provides change management capabilities to minimize the impact that changes to information services might have on their consuming applications.

Operational governance for information services delivers a wide range of capabilities, the most important of which is to provide information assurance and data-protection.  An effective operational governance solution ensures that only authorized applications and users can access sensitive data. This protects data privacy and integrity.


A unified governance solution provides consistent, uniform policy definition, validation, implementation and enforcement throughout the plan-build-run stages of the enterprise service lifecycle.  For example, policy governance for information services will help ensure that enterprise services implement appropriate Common Data Models, and comply with relevant Common Data Element definitions.  The policy governance solution should also provide for comprehensive security policy mediation to ensure that information services can be used securely by the widest possible set of applications, regardless of their technology or platform. 

DSS and Data Warehousing = BI Areas

Business intelligence as it is understood today is said to have evolved from the decision support systems (DSS) that began in the 1960s and developed throughout the mid-1980s. DSS originated in the computer-aided models created to assist with decision making and planning. From DSS, data warehouses, Executive Information Systems, OLAP and business intelligence came into focus beginning in the late 80s.

In 1989, Howard Dresner (later a Gartner Group analyst) proposed "business intelligence" as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems".

Data warehousing

Often BI applications use data gathered from a data warehouse (DW) or from a data mart, and the concepts of BI and DW sometimes combine as "BI/DW" or as "BIDW". A data warehouse contains a copy of analytical data that facilitates decision support. However, not all data warehouses serve for business intelligence, nor do all business intelligence applications require a data warehouse.

To distinguish between the concepts of business intelligence and data warehouses, Forrester Research defines business intelligence in one of two ways:

1. Using a broad definition: "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making." Under this definition, business intelligence also includes technologies such as data integration, data quality, data warehousing, master-data management, text- and content-analytics, and many others that the market sometimes lumps into the "Information Management" segment. Therefore, Forrester refers to data preparation and data usage as two separate but closely linked segments of the business-intelligence architectural stack.

2. Forrester defines the narrower business-intelligence market as, "...referring to just the top layers of the BI architectural stack such as reporting, analytics and dashboards."

The Different Types of Knowledge

Understanding the different forms that knowledge can exist in, and thereby being able to distinguish between various types of knowledge, is an essential step for knowledge management (KM). For example, it should be fairly evident that the knowledge captured in a document would need to be managed (i.e. stored, retrieved, shared, changed, etc.) in a totally different way than that gathered over the years by an expert craftsman.

Within business and KM, two types of knowledge are usually defined, namely explicit and tacit knowledge. The former refers to codified knowledge, such as that found in documents, while the latter refers to non-codified and often personal, experience-based knowledge.


Explicit Knowledge

This type of knowledge is formalized and codified, and is sometimes referred to as know-what. It is therefore fairly easy to identify, store, and retrieve. This is the type of knowledge most easily handled by knowledge management systems (KMS), which are very effective at facilitating the storage, retrieval, and modification of documents and texts.

From a managerial perspective, the greatest challenge with explicit knowledge is similar to that of managing information: it involves ensuring that people have access to what they need, that important knowledge is stored, and that the knowledge is reviewed, updated, or discarded.

Explicit knowledge is found in: databases, memos, notes, documents, etc.


Tacit Knowledge

This type of knowledge was originally defined by Polanyi in 1966. It is sometimes referred to as know-how and refers to intuitive, hard to define knowledge that is largely experience based. Because of this, tacit knowledge is often context dependent and personal in nature. It is hard to communicate and deeply rooted in action, commitment, and involvement.

Tacit knowledge is also regarded as being the most valuable source of knowledge.

Using a reference by Polanyi (1966), imagine trying to write an article that would accurately convey how one reads facial expressions. It should be quite apparent that it would be near impossible to convey our intuitive understanding gathered from years of experience and practice. Virtually all practitioners rely on this type of knowledge. An IT specialist for example will troubleshoot a problem based on his experience and intuition. It would be very difficult for him to codify his knowledge into a document that could convey his know-how to a beginner. This is one reason why experience in a particular field is so highly regarded in the job market.

Tacit knowledge is found in: the minds of human stakeholders. It includes cultural beliefs, values, attitudes, and mental models, as well as skills, capabilities, and expertise. In general, tacit knowledge is knowledge embodied in people.

Knowledge Life cycle

Knowledge Management is the methodology, tools, and techniques used to gather, integrate, and disseminate knowledge. It covers the processes of knowledge creation, acquisition, storage, organization, distribution, sharing, and application. These can be further classified into organization and technology components.

The organization component consists of organization-wide strategy, standard and guidelines, policies, and socio-cultural environment.

The technology component consists of tools and techniques to implement effective knowledge management practice, which provides value to the business, employees, customers, and partners. The tools can be further classified into knowledge creation, knowledge integration, knowledge sharing, and knowledge utilization.


The various steps are described here:

1. Knowledge Creation - Knowledge is created either as explicit or tacit knowledge. Explicit knowledge is put in paper or electronic format; it is recorded and made accessible to others. Tacit knowledge is created in the minds of people and resides within individuals. This knowledge needs to be transformed into explicit knowledge so that it can be recorded and shared with others in the organization.

2. Knowledge Storage - Knowledge is stored and organized in a repository. The decision on how and where lies with the organization, but the objective of this phase is to enable the organization to contribute, organize, and share knowledge.

3. Knowledge Sharing - Knowledge is shared and accessed by people. They can either search or navigate to the knowledge items.

4. Knowledge Utilization - This is the end goal of knowledge management practice. Knowledge management has no value if the knowledge created is not utilized to its potential. The more knowledge is applied and utilized, the more new knowledge is created.

Value of information (VOI or VoI) is the amount a decision maker would be willing to pay for information prior to making a decision. Value-of-information methods determine the worth of acquiring extra information to help the decision maker. From a decision analysis perspective, acquiring extra information is only useful if it has a significant probability of changing the decision maker's currently preferred strategy. The penalty of acquiring more information is usually valued as the cost of that extra information, and sometimes also the delay incurred in waiting for the information.

VOI techniques are based on analyzing the revised estimates of model inputs that come with extra data, together with the costs of acquiring the extra data and a decision rule that can be converted into a mathematical formula to analyse whether the decision would alter.


The usual starting point of a VOI analysis is to consider the value of perfect information (VOPI), i.e. answering the question 'What would be the benefit, in terms we are focusing on (usually money, but it could be lives saved, etc.), of being able to know some parameter(s) perfectly?' If perfect knowledge would not change a decision, the extra information is worthless, and if it does change a decision then the value is the difference between the expected net benefit of the new selected option compared to that previously favoured. VOPI is a useful limiting tool, because it tells us the maximum value that any data may have in better evaluating the input parameter of concern. If the information costs more than that maximum value, we know not to pursue it any further.

After a VOPI check, one then looks at the value of imperfect information (VOII). Usually, the collection of more data will decrease, not eliminate, uncertainty about an input parameter, so VOII focuses on whether the decrease in uncertainty is worth the cost of collecting extra information. In fact, if new data are inconsistent with previous data or beliefs that were used to estimate the parameter, new data may even increase the uncertainty.

An example

Your company wants to develop a new cosmetic but there is some concern that people will have a minor adverse skin reaction to the product. The cost of development of the product to market is $1.8 million. The revenue NPV (including the cost of development) if the product is of the required quality is $3.7 million.

Cosmetic regulations state that you will have to withdraw the product if 2% or more of consumers have an adverse reaction to your product. You have already performed some preliminary trials on 200 random people selected from the target demographic, at a cost per person of $500. Three of those people had an adverse reaction to the product.

Management decides the product will only be developed if they can be 85% confident that the product will affect fewer than the required 2% of the population.  Decision question: Should we test more people or just abandon the product development now? If we should test more people, then how many more?
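A minimal sketch of the confidence check implied by this decision rule, assuming a Beta-Binomial model with a uniform prior (an assumption, not something stated in the example); the trial figures (3 reactions out of 200) and the 85% threshold come from the text above.

```python
from scipy.stats import beta

# Observed trial data from the example above.
n_tested, n_reactions = 200, 3

# Posterior for the true adverse-reaction rate p, assuming a Beta(1, 1)
# (uniform) prior: Beta(1 + reactions, 1 + non-reactions).
posterior = beta(1 + n_reactions, 1 + n_tested - n_reactions)

# Probability that the true reaction rate is below the 2% regulatory limit.
p_below_limit = posterior.cdf(0.02)
print(f"P(reaction rate < 2%) = {p_below_limit:.1%}")

# Management's decision rule: develop only if this confidence reaches 85%.
if p_below_limit >= 0.85:
    print("Confidence threshold met: develop the product.")
else:
    print("Threshold not met: test more people or abandon.")
```

If the threshold is not met, the same posterior machinery can be re-run with hypothetical additional sample sizes to estimate how many more test subjects would be needed.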

Data modeling in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.

There are three different types of data models produced while progressing from requirements to the actual database to be used for the information system.[3] The data requirements are initially recorded as a conceptual data model which is essentially a set of technology independent specifications about the data and is used to discuss initial requirements with the business stakeholders. The conceptual model is then translated into a logical data model, which documents structures of the data that can be implemented in databases. Implementation of one conceptual data model may require multiple logical data models. The last step in data modeling is transforming the logical data model to a physical data model that organizes the data into tables, and accounts for access, performance and storage details. Data modeling defines not just data elements, but also their structures and the relationships between them.


Conceptual Model Design → Logical Model Design → Physical Model Design


Data modeling process

In the context of business process integration (see figure), data modeling complements business process modeling, and ultimately results in database generation.[7]

The process of designing a database involves producing the previously described three types of schemas: conceptual, logical, and physical. The database design documented in these schemas is converted through a Data Definition Language (DDL), which can then be used to generate a database. A fully attributed data model contains detailed attributes (descriptions) for every entity within it. The term "database design" can describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views. In an object database the entities and relationships map directly to object classes and named relationships. However, the term "database design" could also be used to apply to the overall process of designing not just the base data structures, but also the forms and queries used as part of the overall database application within the Database Management System or DBMS.
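As a rough illustration of that logical-to-physical step, the sketch below renders a hypothetical Customer/Order logical model (1:M) as DDL and generates a database from it. The table and column names are invented for the example, and SQLite simply stands in for whatever DBMS is in use.

```python
import sqlite3

# Hypothetical logical model (Customer 1:M Order) rendered as physical DDL.
# All names here are made up for the sketch.
ddl = """
CREATE TABLE Customer (
    CustomerID   INTEGER PRIMARY KEY,
    CustomerName TEXT NOT NULL
);
CREATE TABLE CustomerOrder (
    OrderID    INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customer(CustomerID),
    OrderDate  TEXT NOT NULL
);
-- The physical design adds access-path details such as indexes.
CREATE INDEX idx_order_customer ON CustomerOrder(CustomerID);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)   # the DDL generates the database schema
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```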


Requirement Capturing

The informational requirements of the organization need to be collected by means of a time box. Figure 4 shows the typical means by which those informational requirements are identified and collected.

Typically, informational requirements are collected by looking at:

Reports. Existing reports can usually be gathered quickly and inexpensively. In most cases the information displayed on these reports is easily discerned, but old reports represent yesterday's requirements, and the underlying calculation of the information may not be obvious at all.

Spreadsheets. Spreadsheets can be gathered easily by asking the DSS analyst community. Like standard reports, the information on spreadsheets can be discerned easily. The problems with spreadsheets are that:

they are very fluid; important spreadsheets may have been created several months ago and may no longer be available,

they change with no documentation whatsoever,

they may not be able to be gathered easily unless the analyst who created them wants them to be gathered, and

their structure and usage of data may be obtuse.

Other existing analysis. Through EIS and other channels there is usually quite a bit of other useful information analysis that has been created by the organization. This information is usually very unstructured and very informal (although in many cases it is still very valuable).

Live interviews. Typically through interviews or JAD sessions, the end users can tell what the informational needs of the organization are. Unfortunately, JAD sessions require an enormous amount of energy to conduct and assimilate. Furthermore, the effectiveness of JAD sessions depends in no small part on the imagination and spontaneity of the end users participating in the session.


In any case, gathering the obvious and easily accessed informational needs of the organization should be done and should be factored into the data warehouse data model prior to the development of the first iteration of the data warehouse.

Modelling Techniques

1. ER Modeling
2. Dimensional Modeling

ER Modelling

ER modeling produces a data model of the specific area of interest, using two basic concepts: entities and the relationships between those entities. Detailed ER models also contain attributes, which can be properties of either the entities or the relationships. The ER model is an abstraction tool because it can be used to understand and simplify the ambiguous data relationships in the business world and complex systems environments.

An ER model is represented by an ER diagram, which uses three basic graphic symbols to conceptualize the data: entity, relationship, and attribute.

Entity

An entity is defined to be a person, place, thing, or event of interest to the business or the organization. An entity represents a class of objects, which are things in the real world that can be observed and classified by their properties and characteristics. Figure 12 shows an example of entities in an ER diagram; a rectangle represents an entity. In Figure 12 there are four entities: PRODUCT, PRODUCT MODEL, PRODUCT COMPONENT, and COMPONENT.

Relationship

A relationship is represented by lines drawn between entities. The relationship between two entities can be defined in terms of cardinality: the maximum number of instances of one entity that are related to a single instance of the other, and vice versa. The possible cardinalities are one-to-one (1:1), one-to-many (1:M), and many-to-many (M:M).

Attributes

Attributes describe the characteristics or properties of the entities. In Figure 12, Product ID, Description, and Picture are attributes of the PRODUCT entity. For clarification, attribute naming conventions are very important: an attribute name should be unique within an entity and should be self-explanatory. When an instance has no value for an attribute, the minimum cardinality of the attribute is zero, which means it is nullable or optional. In Figure 12 you can see the characters P, m, o, and F; they stand for primary key, mandatory, optional, and foreign key.
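Since Figure 12 is not reproduced here, the sketch below is only a plausible rendering of the four entities as relational tables, with the P/m/o/F markers shown as comments. The assumed relationships (PRODUCT 1:M PRODUCT_MODEL, and PRODUCT_MODEL M:M COMPONENT resolved through PRODUCT_COMPONENT) are illustrative, not the figure's actual layout.

```python
import sqlite3

# A rough, assumed rendering of the four entities named above as tables.
schema = """
CREATE TABLE PRODUCT (
    ProductID   INTEGER PRIMARY KEY,   -- P: primary key
    Description TEXT NOT NULL,         -- m: mandatory attribute
    Picture     BLOB                   -- o: optional (nullable) attribute
);
CREATE TABLE PRODUCT_MODEL (
    ModelID   INTEGER PRIMARY KEY,
    ProductID INTEGER NOT NULL REFERENCES PRODUCT(ProductID)   -- F: foreign key
);
CREATE TABLE COMPONENT (
    ComponentID INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL
);
-- Associative entity resolving the M:M cardinality between model and component.
CREATE TABLE PRODUCT_COMPONENT (
    ModelID     INTEGER NOT NULL REFERENCES PRODUCT_MODEL(ModelID),
    ComponentID INTEGER NOT NULL REFERENCES COMPONENT(ComponentID),
    PRIMARY KEY (ModelID, ComponentID)
);
"""
sqlite3.connect(":memory:").executescript(schema)
```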

Temporal Modelling

The temporal model is a lot less scary than it sounds. Usually in a database an entity is represented by a row in a table, and when this row is updated the old information is overwritten. The temporal model allows data to be referenced in time, making it possible to query the state of an entity at a given time.

More specifically the temporal aspects usually include valid time and transaction time. These attributes can be combined to form bitemporal data.

Valid time is the time period during which a fact is true with respect to the real world. Transaction time is the time period during which a fact stored in the database is considered to be true. Bitemporal data combines both valid time and transaction time.

For illustration, consider the following short biography of a fictional man, John Doe:

John Doe was born on April 3, 1975 in the Kids Hospital of Medicine County, as son of Jack Doe and Jane Doe who lived in Smallville. Jack Doe proudly registered the birth of his first-born on April 4, 1975 at the Smallville City Hall. John grew up as a joyful boy, turned out to be a brilliant student and graduated with honors in 1993. After graduation he went to live on his own in Bigtown. Although he moved out on August 26, 1994, he forgot to register the change of address officially. It was only at the turn of the seasons that his mother reminded him that he had to register, which he did a few days later on December 27, 1994. Although John had a promising future, his story ends tragically. John Doe was accidentally hit by a truck on April 1, 2001. The coroner reported his date of death on the very same day.


Using a current database

To store the life of John Doe in a current (non-temporal) database we use a table Person (Name, Address). (In order to simplify, Name is defined as the primary key of Person.)

John's father officially reported his birth on April 4, 1975. On this date a Smallville official inserted the following entry in the database: Person(John Doe, Smallville). Note that the date itself is not stored in the database.

After graduation John moves out, but forgets to register his new address. John's entry in the database is not changed until December 27, 1994, when he finally reports it. A Bigtown official updates his address in the database. The Person table now contains Person(John Doe, Bigtown). Note that the information of John living in Smallville has been overwritten, so it is no longer possible to retrieve that information from the database. An official accessing the database on December 28, 1994 would be told that John lives in Bigtown. More technically: if a database administrator ran the query SELECT ADDRESS FROM PERSON WHERE NAME='John Doe' on December 26, 1994, the result would be Smallville. Running the same query 2 days later would result in Bigtown.

Until his death the database would state that he lived in Bigtown. On April 1, 2001 the coroner deletes the John Doe entry from the database. After this, running the above query would return no result at all.

Date | Real-world event | Database action | What the database shows
April 3, 1975 | John is born | Nothing | There is no person called John Doe
April 4, 1975 | John's father officially reports John's birth | Inserted: Person(John Doe, Smallville) | John Doe lives in Smallville
August 26, 1994 | After graduation, John moves to Bigtown, but forgets to register his new address | Nothing | John Doe lives in Smallville
December 26, 1994 | Nothing | Nothing | John Doe lives in Smallville
December 27, 1994 | John registers his new address | Updated: Person(John Doe, Bigtown) | John Doe lives in Bigtown
April 1, 2001 | John dies | Deleted: Person(John Doe) | There is no person called John Doe

Using Valid time

Valid time is the time for which a fact is true in the real world. A valid time period may be in the past, span the current time, or occur in the future.

For the example above, to record valid time the Person table has two fields added, Valid-From and Valid-To. These specify the period when a person's address is valid in the real world. On April 4, 1975 John's father registered his son's birth. An official then inserts a new entry into the database stating that John lives in Smallville from April 3. Note that although the data was inserted on the 4th, the database states that the information is valid since the 3rd. The official does not yet know if or when John will move to another place, so the Valid-To field is set to infinity (∞). The entry in the database is:

Person(John Doe, Smallville, 3-Apr-1975, ∞).

On December 27, 1994 John reports his new address in Bigtown where he has been living since August 26, 1994. A new database entry is made to record this fact:

Person(John Doe, Bigtown, 26-Aug-1994, ∞).

The original entry Person (John Doe, Smallville, 3-Apr-1975, ∞) is not deleted, but has the Valid-To attribute updated to reflect that it is now known that John stopped living in Smallville on August 26, 1994. The database now contains two entries for John Doe

Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994). Person(John Doe, Bigtown, 26-Aug-1994, ∞).

When John dies his current entry in the database is updated stating that John does not live in Bigtown any longer. The database now looks like this

Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994). Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001).
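A minimal valid-time sketch of these rows using SQLite; ISO-format dates and a sentinel end date standing in for ∞ are implementation choices for the illustration, not anything prescribed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Person (
    Name      TEXT,
    Address   TEXT,
    ValidFrom TEXT,            -- start of the real-world validity period
    ValidTo   TEXT             -- '9999-12-31' stands in for infinity
)""")

# The two valid-time rows from the example above.
conn.executemany("INSERT INTO Person VALUES (?,?,?,?)", [
    ("John Doe", "Smallville", "1975-04-03", "1994-08-26"),
    ("John Doe", "Bigtown",    "1994-08-26", "2001-04-01"),
])

# Valid-time query: where did John actually live on 1 January 1990?
as_of = "1990-01-01"
row = conn.execute("""SELECT Address FROM Person
                      WHERE Name = 'John Doe'
                        AND ValidFrom <= ? AND ? < ValidTo""", (as_of, as_of)).fetchone()
print(row[0])   # -> Smallville
```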

Using Transaction time

Transaction time records the time period during which a database entry is accepted as correct. This enables queries that show the state of the database at a given time. Transaction time periods can only occur in the past or up to the current time. In a transaction time table, records are never deleted. Only new records can be inserted, and existing ones updated by setting their transaction end time to show that they are no longer current.

To enable transaction time in the example above, two more fields are added to the Person table: Transaction-From and Transaction-To. Transaction-From is the time a transaction was made, and Transaction-To is the time that the transaction was superseded (which may be infinity if it has not yet been superseded). This makes the table into a bitemporal table.

What happens if the person's address as stored in the database is incorrect? Suppose an official accidentally entered the wrong address or date? Or, suppose the person lied about their address for some reason. Upon discovery of the error, the officials update the database to correct the information recorded.

For example, from 1-Jun-1995 to 3-Sep-2000 John Doe moved to Beachy. But to avoid paying Beachy's exorbitant residence tax, he never reported it to the authorities. Later during a tax investigation it is discovered on 2-Feb-2001 that he was in fact in Beachy during those dates. To record this fact the existing entry about John living in Bigtown must be split into two separate records, and a new record inserted recording his residence in Beachy. The database would then appear as follows:

Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994).
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995).
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000).
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001).

However, this leaves no record that the database ever claimed that he lived in Bigtown during 1-Jun-1995 to 3-Sep-2000. This might be important to know for auditing reasons, or to use as evidence in the official's tax investigation. Transaction time allows capturing this changing knowledge in the database, since entries are never directly modified or deleted. Instead, each entry records when it was entered and when it was superseded (or logically deleted). The database contents then look like this:


Person(John Doe, Smallville, 3-Apr-1975, ∞, 4-Apr-1975, 27-Dec-1994).
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994, 27-Dec-1994, ∞).
Person(John Doe, Bigtown, 26-Aug-1994, ∞, 27-Dec-1994, 2-Feb-2001).
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995, 2-Feb-2001, ∞).
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000, 2-Feb-2001, ∞).
Person(John Doe, Bigtown, 3-Sep-2000, ∞, 2-Feb-2001, 1-Apr-2001).
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001, 1-Apr-2001, ∞).

The database records not only what happened in the real world, but also what was officially recorded at different times.
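To make the bitemporal behaviour concrete, here is a small SQLite sketch that loads the seven rows above and asks what the database claimed on 1 January 1999 about where John lived on 1 January 1996 (the answer is Bigtown, because the Beachy correction was only recorded on 2 February 2001). The column names and the sentinel date standing in for ∞ are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Person (
    Name TEXT, Address TEXT,
    ValidFrom TEXT, ValidTo TEXT,      -- when the fact was true in the real world
    TxFrom    TEXT, TxTo   TEXT        -- when the database believed it
)""")

# The bitemporal rows listed above ('9999-12-31' stands in for infinity).
rows = [
    ("John Doe", "Smallville", "1975-04-03", "9999-12-31", "1975-04-04", "1994-12-27"),
    ("John Doe", "Smallville", "1975-04-03", "1994-08-26", "1994-12-27", "9999-12-31"),
    ("John Doe", "Bigtown",    "1994-08-26", "9999-12-31", "1994-12-27", "2001-02-02"),
    ("John Doe", "Bigtown",    "1994-08-26", "1995-06-01", "2001-02-02", "9999-12-31"),
    ("John Doe", "Beachy",     "1995-06-01", "2000-09-03", "2001-02-02", "9999-12-31"),
    ("John Doe", "Bigtown",    "2000-09-03", "9999-12-31", "2001-02-02", "2001-04-01"),
    ("John Doe", "Bigtown",    "2000-09-03", "2001-04-01", "2001-04-01", "9999-12-31"),
]
conn.executemany("INSERT INTO Person VALUES (?,?,?,?,?,?)", rows)

# Bitemporal query: on 1 January 1999, where did the database *claim*
# John was living on 1 January 1996?
valid_at, known_at = "1996-01-01", "1999-01-01"
print(conn.execute("""SELECT Address FROM Person
                      WHERE Name = 'John Doe'
                        AND ValidFrom <= ? AND ? < ValidTo
                        AND TxFrom    <= ? AND ? < TxTo""",
                   (valid_at, valid_at, known_at, known_at)).fetchone()[0])
```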

Multi-Dimensional Modeling

The multidimensional data model is composed of logical cubes, measures, dimensions, hierarchies, levels, and attributes. The simplicity of the model is inherent because it defines objects that represent real-world business entities. Analysts know which business measures they are interested in examining, which dimensions and attributes make the data meaningful, and how the dimensions of their business are organized into levels and hierarchies.

Figure 2-1 Diagram of the Logical Multidimensional Model


Logical Cubes

Logical cubes provide a means of organizing measures that have the same shape, that is, they have the exact same dimensions.

Logical Measures

Measures populate the cells of a logical cube with the facts collected about business operations. Measures are organized by dimensions, which typically include a Time dimension.

Logical Dimensions

Dimensions contain a set of unique values that identify and categorize data. They form the edges of a logical cube, and thus of the measures within the cube. For example, the Sales measure has four dimensions: Time, Customer, Product, and Channel. A particular Sales value (43,613.50) only has meaning when it is qualified by a specific time period (Feb-01), a customer (Warren Systems), a product (Portable PCs), and a channel (Catalog).

Logical Hierarchies and Levels

A hierarchy is a way to organize data at different levels of aggregation. In viewing data, analysts use dimension hierarchies to recognize trends at one level, drill down to lower levels to identify reasons for these trends, and roll up to higher levels to see what effect these trends have on a larger sector of the business.

Each level represents a position in the hierarchy. Each level above the base (or most detailed) level contains aggregate values for the levels below it. For example, Q1-02 and Q2-02 are the children of 2002; thus 2002 is the parent of Q1-02 and Q2-02.

Logical Attributes

An attribute provides additional information about the data like colors, flavors, or sizes.
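The sketch below pulls these pieces together in miniature: a hypothetical Sales fact (the measure cells), a Time dimension carrying a month → quarter → year hierarchy, and a roll-up from the base level to quarter and then year. All names and figures are invented for the illustration, and the cube is simply stored relationally in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny, made-up Sales cube stored relationally: the fact table holds the
# measure cells; the Time dimension holds the month -> quarter -> year
# hierarchy used for drill-down and roll-up.
conn.executescript("""
CREATE TABLE TimeDim  (Month TEXT PRIMARY KEY, Quarter TEXT, Year INTEGER);
CREATE TABLE SalesFact(Month TEXT REFERENCES TimeDim(Month),
                       Customer TEXT, Product TEXT, Channel TEXT, Sales REAL);
""")
conn.executemany("INSERT INTO TimeDim VALUES (?,?,?)", [
    ("Jan-02", "Q1-02", 2002), ("Feb-02", "Q1-02", 2002), ("Apr-02", "Q2-02", 2002),
])
conn.executemany("INSERT INTO SalesFact VALUES (?,?,?,?,?)", [
    ("Jan-02", "Warren Systems", "Portable PCs", "Catalog", 40000.00),
    ("Feb-02", "Warren Systems", "Portable PCs", "Catalog", 43613.50),
    ("Apr-02", "Warren Systems", "Portable PCs", "Catalog", 50000.00),
])

# Roll up from the base (month) level to quarter, then to year.
for level in ("Quarter", "Year"):
    print(level, conn.execute(
        f"SELECT t.{level}, SUM(f.Sales) FROM SalesFact f "
        f"JOIN TimeDim t ON t.Month = f.Month GROUP BY t.{level}").fetchall())
```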


SNOWFLAKE SCHEMA

A snowflake schema is a model for data configuration in a data warehouse or data mart in which a fact table is linked to multiple dimension tables that in turn are linked to other, related dimension tables, extending outward from the fact table at the center, much like the structure of a snowflake.

Snowflake schemata are similar to star schemata—in fact, the core of a snowflake schema is essentially a star schema. However, unlike a star schema, a dimension table in a snowflake schema is divided out into more than one table, and placed in relation to the center of the snowflake by cardinality. Breaking up dimension tables in this way provides many advantages in querying, as it eliminates redundancies, is more flexible and can take up less space.

In data warehousing, snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables.  A snowflake schema is a variation of the star schema.

Benefits

The snowflake schema is in the same family as the star schema logical model. In fact, the star schema is considered a special case of the snowflake schema. The snowflake schema provides some advantages over the star schema in certain situations, including:

Some OLAP multidimensional database modeling tools are optimized for snowflake schemas.[2]

Normalizing attributes results in storage savings, the tradeoff being additional complexity in source query joins.

Disadvantages


The primary disadvantage of the snowflake schema is that the additional levels of attribute normalization add complexity to source query joins, when compared to the star schema.

Snowflake schemas, in contrast to flat single-table dimensions, have been heavily criticized. Their goal is assumed to be efficient and compact storage of normalized data, but this comes at the significant cost of poor performance when browsing the joins required in this dimension. This disadvantage may have been reduced in the years since it was first recognized, owing to better query performance within the browsing tools.

When compared to a highly normalized transactional schema, the snowflake schema's denormalization removes the data integrity assurances provided by normalized schemas. Data loads into the snowflake schema must be highly controlled and managed to avoid update and insert anomalies.

Example
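The original example figure is not reproduced here, so the sketch below uses a hypothetical retail snowflake instead: a Sales fact at the centre, Product and Store dimensions, and Category and City tables normalized out of them. All names are invented for the illustration; the query at the end shows the extra joins that snowflaking introduces.

```python
import sqlite3

# A hypothetical retail snowflake: the fact table joins to dimension tables,
# which are themselves normalized into further tables extending outward.
snowflake_ddl = """
CREATE TABLE Category (CategoryID INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Product  (ProductID  INTEGER PRIMARY KEY, ProductName TEXT,
                       CategoryID INTEGER REFERENCES Category(CategoryID));
CREATE TABLE City     (CityID     INTEGER PRIMARY KEY, CityName TEXT);
CREATE TABLE Store    (StoreID    INTEGER PRIMARY KEY, StoreName TEXT,
                       CityID     INTEGER REFERENCES City(CityID));
CREATE TABLE SalesFact(ProductID  INTEGER REFERENCES Product(ProductID),
                       StoreID    INTEGER REFERENCES Store(StoreID),
                       SaleDate   TEXT, Amount REAL);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(snowflake_ddl)

# A typical query now needs the extra joins introduced by the normalization.
query = """SELECT c.CategoryName, ci.CityName, SUM(f.Amount)
           FROM SalesFact f
           JOIN Product p  ON p.ProductID  = f.ProductID
           JOIN Category c ON c.CategoryID = p.CategoryID
           JOIN Store s    ON s.StoreID    = f.StoreID
           JOIN City ci    ON ci.CityID    = s.CityID
           GROUP BY c.CategoryName, ci.CityName"""
print(conn.execute(query).fetchall())   # empty until facts are loaded
```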

ERM versus MDDM (entity relationship model versus multidimensional model) - already discussed

Assignment - Star Schema

METADATA

Metadata (metacontent) is defined as the data providing information about one or more aspects of the data, such as:

Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of the data
Location on a computer network where the data were created
Standards used

For example, a digital image may include metadata that describe how large the picture is, the color depth, the image resolution, when the image was created, and other data. A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document.

Metadata is data. As such, metadata can be stored and managed in a database, often called a metadata registry or metadata repository. However, without context and a point of reference, it might be impossible to identify metadata just by looking at it. For example, by itself, a database containing several numbers, all 13 digits long, could be the results of calculations or a list of numbers to plug into an equation; without any other context, the numbers themselves can be perceived as the data. But given the context that this database is a log of a book collection, those 13-digit numbers may now be identified as ISBNs - information that refers to the book, but is not itself the information within the book.

Metadata types

Structural metadata is used to describe the structure of database objects such as tables, columns, keys and indexes. Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language.

Technical metadata correspond to internal metadata, and business metadata correspond to external metadata.

Descriptive metadata is typically used for discovery and identification, as information used to search and locate an object such as title, author, subjects, keywords, publisher.

Structural metadata give a description of how the components of an object are organized. An example of structural metadata would be how pages are ordered to form chapters of a book. Finally, administrative metadata give information to help manage the source. They refer to the technical information including file type or when and how the file was created. Two sub-types of administrative metadata are rights management metadata and preservation metadata. Rights management metadata explain intellectual property rights, while preservation metadata contain information that is needed to preserve and save a resource

There are three broad categories of metadata:

Business metadata includes definitions of data files and attributes in business terms. It may also contain definitions of business rules that apply to these attributes, data owners and stewards, data quality metrics, and similar information that helps business users to navigate the "information ocean." Some reporting and business intelligence tools provide and maintain an internal repository of business-level metadata definitions used by these tools.

Technical metadata is the most common form of metadata. This type of metadata is created and used by the tools and applications that create, manage, and use data. For example, some best-in-class ETL tools maintain internal metadata definitions used to create ETL directives or scripts. Technical metadata is a key metadata type used to build and maintain the enterprise data environment. Technical metadata typically includes database system names, table and column names and sizes, data types and allowed values, and structural information such as primary and foreign key attributes and indices. In the case of CDI architecture, technical metadata will contain subject areas defining attribute and record location reference information.

Operational metadata contains information that is available in operational systems and run-time environments. It may contain data file size, date and time of last load, updates, and backups, names of the operational procedures and scripts that have to be used to create, update, restore, or otherwise access data, etc.

Advantages

One major advantage of metadata is that redundancy and inconsistencies can be identified more easily as the metadata is centralized. For example, the system catalog and data dictionary can help or guide developers at the conceptual or structural phase or for further maintenance.

Data Quality

Data quality is driven by a common set (and common understanding) of data standards, domain standards, business rules, etc. (please refer to Data Quality Assurance for the detailed list). If the systems follow the common standards (creating the same checks, controls, table structures, field definitions, and so on), there can be a big gain in data quality. A metadata repository:

Provides the details of the data standards to follow

Enforces adherence to the standards as defined in the repository

IT systems productivity

Given that data standards, business rules, models, etc. exist in the metadata, one builds productivity on the following counts:

Automatic creation of tables and models: Systems can pick up the details from the metadata repository and build the components. This saves the time and effort of first creating the models and then building them.

Avoid the cost of mistakes and iteration: One may not have to go through the pains of change control if the design is built from common standards.

Avoiding duplication

If data is already available in the system, one does not have to re-create it. If you are looking for a ‘sales productivity MIS’, you may find it (or something close to it) by searching through the metadata. This gives a boost to business effectiveness, as otherwise users would need to wait for their turn in the queue. It also helps in focusing resources on fulfilling new business requirements instead of re-creating old ones.

Avoiding information conflict issues

By using metadata repositories and enforcing common standards and calculation formulae, the reports and dashboards will have a greater probability of reflecting the same figures. This avoids wasting boardroom time on working out which figures are correct.

Regulatory compliance

With all the above benefits, one can expect that business will be able to produce correct reports faster and cheaper.


Business Process Management and its cascading impacts

With every change in business processes, one can find a cascading impact on various components such as policies, business process documentation, business rules, and configuration and set-up changes in IT systems. For example, if a new business process allows a sales manager to manage more than one outlet, it will have a cascading impact on set-ups, software changes, ETL, and dimensional models.

Handling any kind of change management

Whenever anything changes within an organization's environment, the metadata repository helps you understand the impact. For example, if you want to change the ‘maker checker’ control policy, the metadata repository will be able to tell you which systems, databases, and business processes you have to change.

Better estimations and business case management

With the metadata repository telling you the impact of a requirement and also providing some efficiency gains, one can make a better estimate of the cost of making a change.

Making scalable and extensible models

This is not a direct benefit of the metadata repository, but the repository supports it. Smart modelers (with solid business knowledge) can help create models (for example, foundation dimensions and facts in the dimensional model of a data warehouse) that can respond quickly to change. Metadata helps you manage this modeling.

Reduce redundancy

With all the data element maps stored in the metadata repository, one can identify redundant data and processes, and work on their reduction or elimination.

From the benefits described above, one can see that metadata management is core to building intelligent and high-performing enterprises. It benefits all facets of an organization, including business process management, BI, IT management, performance management, and so on. There is a cascading impact on better business performance, employee satisfaction, and customer satisfaction.

Populating the Data Warehouse and Data Transformation

The extraction method is highly dependent on the source system as well as the business needs in the target data warehouse.  Usually there is no possibility of adding additional logic to the source system to help with the incremental extraction of data.

Here are the types of extraction methods:


Full Extraction: All the data is extracted completely from the source system.  Because this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the source data since the last successful extraction.  The source data will be provided as-is and no additional information (i.e., timestamps) is necessary from the source data.  An example for a full extraction may be an export file of a complete table or a SQL SELECT statement that retrieves all the rows from a table.  Many times a full extraction will be used for tables that will be used as dimension tables in a cube.

Incremental Extraction: Only the data that has changed from a specific point in time in history will be extracted.  This point in time may be the time of the last extraction, or a business event like the last day of a fiscal period.  To identify this delta change, there must be the possibility to identify all the changed information since this specific point in time (see “How to determine the data that has changed since the last extraction?” below).  Many times an incremental extraction will be used for tables that will be used as fact tables in a cube.  After the data is extracted, you can use a control table to store the max date of the extracted records, and then for the next run you will get all the rows from the source system since that max date.  Or you can just query the destination table in the data warehouse and return the max date and get all the rows from the source system since that max date, but this method could take a long time if you have a lot of records in the destination table.  Instead of the max date, another option is to use the max ID if the source system has a unique integer key.  A minimal sketch of the control-table approach is shown below.
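A minimal sketch of timestamp-based incremental extraction using a control table, as just described; the table names, column names, and use of SQLite are all illustrative assumptions rather than details from any particular source system.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative source system: an Orders table plus a control table that
# remembers the high-water mark (max LastModified) of the last extraction.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE Orders (OrderID INTEGER, Amount REAL, LastModified TEXT)")
src.execute("CREATE TABLE ExtractControl (TableName TEXT PRIMARY KEY, MaxDate TEXT)")
src.execute("INSERT INTO ExtractControl VALUES ('Orders', '1900-01-01T00:00:00')")

def incremental_extract(conn, table):
    # 1. Read the high-water mark left by the previous extraction.
    (max_date,) = conn.execute(
        "SELECT MaxDate FROM ExtractControl WHERE TableName = ?", (table,)).fetchone()
    # 2. Pull only the delta: rows modified since that point in time.
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE LastModified > ?", (max_date,)).fetchall()
    # 3. Advance the high-water mark for the next run.
    if rows:
        new_max = max(r[-1] for r in rows)
        conn.execute("UPDATE ExtractControl SET MaxDate = ? WHERE TableName = ?",
                     (new_max, table))
    return rows

src.execute("INSERT INTO Orders VALUES (1, 99.0, ?)",
            (datetime.now(timezone.utc).isoformat(),))
print(incremental_extract(src, "Orders"))   # first run returns the new row
print(incremental_extract(src, "Orders"))   # second run returns nothing new
```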

For each of these methods, there are two ways to physically extract the data:

Online Extraction: The data is extracted directly from the source system itself.  The extraction process can connect directly to the source system to access the source tables themselves, or to an intermediate system that stores the data in a preconfigured manner (i.e., transaction logs or change tables).

Offline Extraction: Many times direct access to the source system is not available, so instead the data is staged outside the original source system and created by an extraction routine.  The data is usually in a flat file that is in a defined, generic format.  Additional information about the source object is necessary for further processing.

How to determine the data that has changed since the last extraction?

If a data warehouse extracts data from an operational system on a nightly basis, then the data warehouse requires only the data that has changed since the last extraction (that is, the data that has been modified in the past 24 hours).  When it is possible to efficiently identify and extract only the most recently changed data, the extraction process (as well as all downstream operations in the ETL process) can be much more efficient, because it must extract a much smaller volume of data.  Unfortunately, for many source systems, identifying the recently modified data may be difficult or intrusive to the operation of the system.  Incremental extraction is typically the most challenging technical issue in data extraction.  Below are several techniques for implementing incremental extraction from source systems.  These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems.  Thus, each of these techniques must be carefully evaluated by the owners of the source system prior to implementation.  Each of these techniques can work in conjunction with the data extraction techniques discussed previously.  For example, timestamps can be used whether the data is being pulled from a flat file or accessed through a query to the source system:

Change Data Capture (CDC): Change Data Capture records INSERTs, UPDATEs, and DELETEs applied to SQL Server tables, and makes a record available of what changed, where, and when, in simple relational ‘change tables’.  The source of change data for change data capture is the SQL Server transaction log.  As inserts, updates, and deletes are applied to tracked source tables, entries that describe those changes are added to the log.  The log then serves as input to the change data capture process.  CDC reads the transaction log and adds information about changes to the tracked table’s associated change table.

Timestamps: The tables in some operational systems have timestamp columns.  The timestamp specifies the time and date that a given row was last modified, making it easy to identify the latest data.  This is usually the preferable option.  In SQL Server, many times this column is given a timestamp data type, along with a column name of “Timestamp”.  Or, the column is given a datetime data type, and a column name of “Last Modified”.  You can also add database triggers to populate the “Last Modified” column.

Partitioning: Some source systems might use range partitioning, such that the source tables are partitioned along a date key, which allows for easy identification of new data.  For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week’s data.

Database Triggers: Adding a trigger for INSERT, UPDATE, and DELETE on a single table and having those triggers write the information about the record change to ‘change tables’. 

MERGE Statement: The least preferable option is to extract an entire table from the source system to the data warehouse or staging area, and compare these tables with a previous extract from the source system to identify the changed data (using the MERGE statement in SQL Server).  You will need to compare all the fields in the source with all the fields in the destination to see if a record has changed.  This approach will likely not have a significant impact on the source system, but it can place a considerable burden on the data warehouse, particularly if the data volumes are large.  This option is usually the last resort if none of the other options are possible.

Column DEFAULT value: If you have a source table that won’t have updates to any rows, only inserts, you can add a “Created Date” column that has a default value of the current date.  Of course this is only an option if you have permissions to add columns to the source system.

Anomalies in data fields

An update anomaly is a data inconsistency that results from data redundancy and a partial update. For example, each employee in a company has a department associated with them as well as the student group they participate in.

Employee_ID | Name | Department | Student_Group
143 | Sambu | ssds | dsds
123 | J. Longfellow | Accounting | Beta Alpha Psi
234 | B. Rech | Marketing | Marketing Club
345 | B. Rech | Marketing | Management Club
456 | A. Bruchs | CIS | Technology Org.
567 | A. Bruchs | CIS | Beta Alpha Psi

If A. Bruchs’ department is recorded in error, it must be updated in at least two places, or there will be inconsistent data in the database. If the user performing the update does not realize the data is stored redundantly, the update will not be done properly.


A deletion anomaly is the unintended loss of data due to deletion of other data. For example, if the student group Beta Alpha Psi disbanded and was deleted from the table above, J. Longfellow and the Accounting department would cease to exist. This results in database inconsistencies and is an example of how combining information that does not really belong together into one table can cause problems.

An insertion anomaly is the inability to add data to the database due to absence of other data. For example, assume Student_Group is defined so that null values are not allowed. If a new employee is hired but not immediately assigned to a Student_Group then this employee could not be entered into the database. This results in database inconsistencies due to omission.

Update, deletion, and insertion anomalies are very undesirable in any database. Anomalies are avoided by the process of normalization.
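As an illustrative sketch (one plausible normalized decomposition, using invented table names and SQLite), the code below splits the employee table so that each fact is stored exactly once and then demonstrates why the three anomalies no longer arise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized layout: each fact is stored once, so the anomalies above go away.
conn.executescript("""
CREATE TABLE Department   (DeptName  TEXT PRIMARY KEY);
CREATE TABLE StudentGroup (GroupName TEXT PRIMARY KEY);
CREATE TABLE Employee     (EmployeeID INTEGER PRIMARY KEY, Name TEXT,
                           DeptName TEXT REFERENCES Department(DeptName));
CREATE TABLE GroupMembership (EmployeeID INTEGER REFERENCES Employee(EmployeeID),
                              GroupName  TEXT    REFERENCES StudentGroup(GroupName),
                              PRIMARY KEY (EmployeeID, GroupName));
""")
conn.executemany("INSERT INTO Department VALUES (?)", [("Accounting",), ("CIS",)])
conn.executemany("INSERT INTO StudentGroup VALUES (?)",
                 [("Beta Alpha Psi",), ("Technology Org.",)])
conn.execute("INSERT INTO Employee VALUES (456, 'A. Bruchs', 'CIS')")
conn.executemany("INSERT INTO GroupMembership VALUES (?, ?)",
                 [(456, "Technology Org."), (456, "Beta Alpha Psi")])

# No update anomaly: correcting A. Bruchs' department is a single-row update.
conn.execute("UPDATE Employee SET DeptName = 'Accounting' WHERE EmployeeID = 456")
# No insertion anomaly: a new hire with no student group can still be inserted.
conn.execute("INSERT INTO Employee VALUES (789, 'New Hire', 'CIS')")
# No deletion anomaly: disbanding a group removes only membership rows.
conn.execute("DELETE FROM GroupMembership WHERE GroupName = 'Beta Alpha Psi'")
conn.execute("DELETE FROM StudentGroup   WHERE GroupName = 'Beta Alpha Psi'")
```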

Data consolidation

Data consolidation refers to the collection and integration of data from multiple sources into a single destination. During this process, different data sources are put together, or consolidated, into a single data store. Because data comes from a broad range of sources, consolidation allows organizations to more easily present data, while also facilitating effective data analysis. Data consolidation techniques reduce inefficiencies such as data duplication, costs related to reliance on multiple databases, and multiple data management points.

Data Standards

A data standard depicts the required content and format in which particular types of data are to be presented and exchanged. Data Standards are documented agreements on representation, format, definition, structuring, tagging, transmission, manipulation, use, and management of data.

Data Federation

Data federation (also known as data virtualization) is a process whereby data is collected from distinct databases without ever copying or transferring the original data itself.

Rather than collect all the information in a database, data federation collects metadata—data that describes the structure of the original data—and places it into a single database. Data federation gives users access to third-party data for organization, analysis etc., without having to go to the trouble and expense of full data integration or data warehouse creation.

Data federation technology can be used in place of a data warehouse to save the cost of creating a permanent, physical relational database. It can also be used as an enhancement to add fields or attributes that are not supported by the data warehouse application programming interface (API).

Making a single call to multiple data sources and then integrating and organizing the data in a middleware layer is also called data virtualization, enterprise information integration (EII) and information-as-a-service, depending on the vendor.