Rando Veizi: Data warehouse and Pentaho suite

Innovation and New Technologies

Professor: Carlo Vaccari

2/4/2014

Student: Rando Veizi

Contents

Data Warehouses
    History
    Introduction
    Why DW?
    DW environment
    Bottom-up Design
    Top-down Design
    Top-down vs bottom-up
    The hybrid design
    DW vs OS
Pentaho Suite
    Introduction
    Installing Pentaho Suite
        Starting the BI Platform
        How to Log Into the Pentaho User Console
    Trying some tools…
        Community Dashboard Editor (CDE)
        Saiku
    Data warehouses and Pentaho Suite

Data Warehouses

History

The DW notion dates to the late 1980s, when IBM researchers developed the “business data warehouse”. The original idea was to provide an architectural model for the flow of data from operational systems to decision support environments.

The concept aimed to address several problems associated with this flow, chiefly its high cost. Without a DW, a large amount of redundancy was needed to support multiple decision support environments. In larger companies it was normal for multiple decision support environments to operate independently.

Even though each environment served different users, they usually needed much of the same stored data. The process of gathering, cleaning and integrating data from different sources, in most cases long-standing operational systems (legacy systems), was partially replicated for each environment. Moreover, the operational systems were frequently re-examined as new decision support requirements emerged. New requirements often necessitated gathering, cleaning and integrating new data from DM (data marts) that were tailored for ready access by users.

Introduction

Figure 1: All data warehouse processes in one picture

A data warehouse (DW, DWH), or enterprise data warehouse (EDW), is a database used for reporting and data analysis.

A DW is a central repository of data created by integrating data from multiple disparate sources. It stores historical data and can be used to create trend reports for senior management, such as annual and quarterly comparisons. The data stored in the DW is uploaded from operational systems such as sales or marketing. The data may (but does not always) pass through an operational data store for additional operations before it is used in the DW for reporting. An ETL-based DW uses staging, data integration and access layers to house its key functions.

-The staging layer (staging database) stores the raw data extracted from each of the source data systems.

-The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing the result in an operational data store (ODS) database.

-The integrated data is then moved to yet another database, the data warehouse database, where it is arranged into hierarchical groups (called dimensions) and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema.

-The access layer helps users retrieve data.

Source: https://en.wikipedia.org/wiki/Extract,_transform,_load
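As an illustration, the layers described above can be sketched with Python's built-in sqlite3 module. This is only a minimal sketch under stated assumptions: the source rows, the table names (staging, ods, dim_product, fact_sales) and the functions run_etl and sales_by_product are all hypothetical and do not come from any real system.

```python
import sqlite3

# Hypothetical source rows standing in for two operational systems.
SALES_SYSTEM = [("2014-01-15", "laptop", "1200.50"), ("2014-01-15", "mouse", "19.90")]
WEB_SHOP = [("2014-01-16", "LAPTOP ", "1150.00")]


def run_etl(conn):
    cur = conn.cursor()
    # Staging layer: raw rows copied as-is from each source system.
    cur.execute("CREATE TABLE staging (source TEXT, day TEXT, product TEXT, amount TEXT)")
    cur.executemany("INSERT INTO staging VALUES ('sales', ?, ?, ?)", SALES_SYSTEM)
    cur.executemany("INSERT INTO staging VALUES ('web', ?, ?, ?)", WEB_SHOP)

    # Integration layer: clean and conform the staged rows into an ODS table.
    cur.execute("CREATE TABLE ods (day TEXT, product TEXT, amount REAL)")
    cur.execute(
        "INSERT INTO ods "
        "SELECT day, lower(trim(product)), CAST(amount AS REAL) FROM staging"
    )

    # Warehouse database: a star schema with one dimension and one fact table.
    cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    cur.execute("INSERT INTO dim_product (name) SELECT DISTINCT product FROM ods")
    cur.execute("CREATE TABLE fact_sales (product_id INTEGER, day TEXT, amount REAL)")
    cur.execute(
        "INSERT INTO fact_sales "
        "SELECT d.product_id, o.day, o.amount "
        "FROM ods o JOIN dim_product d ON d.name = o.product"
    )
    conn.commit()


def sales_by_product(conn):
    # Access layer: retrieve facts aggregated by a dimension attribute.
    rows = conn.execute(
        "SELECT d.name, ROUND(SUM(f.amount), 2) "
        "FROM fact_sales f JOIN dim_product d ON d.product_id = f.product_id "
        "GROUP BY d.name ORDER BY d.name"
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn)
print(sales_by_product(conn))
```

The two laptop rows come from different sources with different spellings, and only after the integration step do they fall under the same dimension row. In a real DW each layer would normally live in its own database and be loaded by an ETL tool rather than by hand-written SQL.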

If a DW is built on integrated data source systems, it requires neither ETL nor staging databases nor ODS databases.

The integrated data source systems can be considered part of a distributed operational data store layer. Unlike in the ETL-based DW, the integrated data source systems and the DW are all integrated, since no transformation of dimensional or reference data takes place.

This integrated DW architecture supports drilling down from the aggregate data of the DW to the transactional data of the integrated source data systems.

A data mart (DM) is a DW in “miniature”, focused on a specific area of interest. A DW can be subdivided into data marts for better performance and ease of use within that area. An organization can therefore create one or more data marts as first steps towards a larger and more complex enterprise DW.

This definition of the DW focuses on data storage. The data from the main sources is:

-cleaned

-transformed

-catalogued

-made available for use by managers and other business professionals, for data mining or analytical processing

Why DW?

A DW always keeps a copy of the data from the source transaction systems. This architecture makes it possible to:

1. Group data from different sources into a single database, so that a single query engine can be used to present the data.

2. Mitigate the isolation-level lock contention in transaction processing systems that is caused by running large analysis queries against transaction processing databases.

3. Maintain data history, even if the source transaction systems do not.

4. Integrate data from many source systems, creating a central view across the enterprise.

5. Provide consistent codes and descriptions, which improves the quality of the data.

6. Restructure the data so that it makes sense to business users.

7. Restructure the data so that it delivers very good query performance without impacting the operational systems (OS).

8. Make decision-support queries easier to write.

DW environment

The environment for DW and DM comprises the following:

-Source systems that provide data to the DW or DM

-Technologies and processes that prepare the data for use

-Different architectures for storing data in an organization’s DW or DM

-Different tools and applications for a variety of users

-Metadata, data quality and governance processes, which must be in place to ensure that the DW or DM meets its purposes

These days the most successful companies are those that can respond quickly and flexibly to market changes and new opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers.

Bottom-up Design

Figure 2: Bottom-Up

By building a series of data marts to an agreed architecture, the enterprise data warehouse can be

assembled slice by slice, until it is complete enough to regard the data marts as subsets of the now

much greater whole. Architecture is key to success, as the data marts must not be built in isolation.

Users need therefore to design data marts in the knowledge that each will eventually form part of a

larger enterprise data warehouse.

Such an approach can prove attractive to businesses. Each data mart can be implemented within six

to nine months. Each can tackle an identifiable business problem making it possible to calculate

returns on investment (ROI). The approach also offers a valuable learning curve for the build team,

who can test out products and processes until they get it right.

The bottom-up approach to data warehouse design was developed by Ralph Kimball.

In this approach, DM are first created to provide reporting and analytical capabilities for specific business processes. A DM primarily contains dimensions and facts. Facts can contain either atomic data or, if necessary, summarized data. A single DM often models a specific business area such as sales or production. All these DM can eventually be integrated to create a comprehensive DW. The DW bus architecture is primarily an implementation of “the bus”, a collection of conformed dimensions and conformed facts, that is, dimensions that are shared between facts in two or more DM.

The integration of the DM in the DW is centered on the conformed dimensions, which define the possible integration points between DM. The actual integration of two or more DM is done by a process called drill-across (DA). A DA works by summarizing the data along the keys of the conformed dimensions of each fact that participates in the DA, followed by a join on the keys of these grouped facts.

The most important management task is to make sure that the dimensions are consistent across DM.

Business value can be returned as quickly as the first data marts can be created, and the method

lends itself well to an exploratory and iterative approach to building data warehouses.

https://en.wikipedia.org/wiki/Data_mart

Example: a DW effort can start in the sales department by building a sales DM. Once this DM is complete, the effort can be extended with another DM, for example one for production. For the DM to be integrable with each other, they need to share the same bus.

If the integration succeeds, the DW can deliver, through these two DM, integrated information about sales and production, which is usually very valuable to the business.
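The drill-across between a sales DM and a production DM can be sketched in a few lines of Python. This is a minimal sketch, not Kimball's reference implementation: the fact rows, the measure names units_sold and units_built, and the functions summarize and drill_across are hypothetical, and product and month are assumed to be the conformed dimension keys shared by both DM.

```python
from collections import defaultdict

# Hypothetical fact rows from two data marts. "product" and "month" are
# assumed to be the conformed dimension keys shared by both marts.
SALES_FACTS = [
    {"product": "laptop", "month": "2014-01", "units_sold": 3},
    {"product": "laptop", "month": "2014-01", "units_sold": 2},
    {"product": "mouse", "month": "2014-01", "units_sold": 10},
]
PRODUCTION_FACTS = [
    {"product": "laptop", "month": "2014-01", "units_built": 8},
    {"product": "mouse", "month": "2014-01", "units_built": 7},
    {"product": "mouse", "month": "2014-01", "units_built": 5},
]
CONFORMED_KEYS = ("product", "month")


def summarize(facts, measure):
    """Group one fact table by the conformed keys and sum the measure."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[k] for k in CONFORMED_KEYS)
        totals[key] += row[measure]
    return dict(totals)


def drill_across(sales, production):
    """Join the two summarized fact sets on the conformed keys."""
    sold = summarize(sales, "units_sold")
    built = summarize(production, "units_built")
    result = {}
    # Inner join: keep only the keys that appear in both summaries.
    for key in sorted(sold.keys() & built.keys()):
        result[key] = {"units_sold": sold[key], "units_built": built[key]}
    return result


report = drill_across(SALES_FACTS, PRODUCTION_FACTS)
for key, measures in report.items():
    print(key, measures)
```

Each fact table is summarized separately before the join, so rows from one DM are never multiplied by rows from the other. The join only works because both DM spell the product and month keys the same way, which is why consistent dimensions matter so much.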

Top-down Design

Figure 3: Top-down

The opposite of starting with individual business issues and expanding up the organisation hierarchy is to start at the top. A top-down strategy of an enterprise data warehouse with subset data marts is "the most elegant design approach", says Doug Hackney of business intelligence systems specialist the Enterprise Group. He says that such an approach would vastly ease maintenance, summarisation, metadata management and extraction, transformation and loading (ETL) of data.

The top-down approach to data warehouse design was developed by Bill Inmon.

In this approach the DW is designed using a normalized enterprise data model. “Atomic” data, that is, data at the lowest level of detail, is stored in the DW. Dimensional DM containing the data needed for specific business processes or specific departments are then created from the DW. According to Inmon, the DW is at the center of the Corporate Information Factory (CIF), which provides a logical framework for delivering business intelligence and business management capabilities.

Top-down vs bottom-up

All in one picture:

Figure 4: T-D vs B-U

The hybrid design

The Hybrid Data Warehouse (Hybrid) is uniquely suited to support both EDW and datamart

applications in one database. It can accommodate large volumes of historical data typically found in

the EDW, while also performing well for OLAP queries typically done in datamarts. The Hybrid

database structure contains both normalized snowflakes and de-normalized star schemas. The

controlled redundancy inherent in this design provides good response time for a variety of queries.

The Hybrid architecture can also be used to implement the ODS in the same database, as long as

sub-second response times are not a requirement. Because the ODS can be used by operational

systems, the response time of the database can become an issue.

Because there is only one database schema, the Hybrid model significantly reduces the cost of

developing the ETL processes. Real-time (or near real-time) updates can be supported by pushing

data updates out immediately to the Hybrid Warehouse directly from the operational system, or by

connecting the ETL engine to an Enterprise Service Bus (ESB).

The Hybrid model was used to develop one of the largest databases in Canada. It includes 34

dimensional roles with multiple hierarchies, has over 1500 attributes, and handles 40 million

transactions per day in near real time, which translates into one billion rows per month.

The Hybrid model may not be able to fully replace an ODS requirement for sub-second response

time. But it can offer a one stop solution for organizations that have very large data volumes and are

looking for a cost effective way to support a variety of BI requirements across the organization.

DW vs OS

The fundamental difference between operational systems (OS) and DW systems is that operational systems are designed to support transaction processing, whereas data warehousing systems are designed to support online analytical processing (OLAP).

Based on this fundamental difference, data usage patterns associated with operational systems are significantly different from usage patterns associated with data warehousing systems. As a result, data warehousing systems are designed and optimized using methodologies that differ drastically from those of operational systems.

The table below summarizes many of the differences between operational systems and data warehousing systems.

Operational Systems | Data Warehousing Systems

Operational systems are generally designed to support high-volume transaction processing with minimal back-end reporting. | Data warehousing systems are generally designed to support high-volume analytical processing (i.e. OLAP) and subsequent, often elaborate report generation.

Operational systems are generally process-oriented or process-driven, meaning that they are focused on specific business processes or tasks. Example tasks include billing, registration, etc. | Data warehousing systems are generally subject-oriented, organized around business areas that the organization needs information about. Such subject areas are usually populated with data from one or more operational systems. As an example, revenue may be a subject area of a data warehouse that incorporates data from operational systems that contain student tuition data, alumni gift data, financial aid data, etc.

Operational systems are generally concerned with current data. | Data warehousing systems are generally concerned with historical data.

Data within operational systems are generally updated regularly according to need. | Data within a data warehouse is generally non-volatile, meaning that new data may be added regularly, but once loaded, the data is rarely changed, thus preserving an ever-growing history of information. In short, data within a data warehouse is generally read-only.

Operational systems are generally optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are generally optimized to perform fast retrievals of relatively large volumes of data.

Operational systems are generally application-specific, resulting in a multitude of partially or non-integrated systems and redundant data (e.g. billing data is not integrated with payroll data). | Data warehousing systems are generally integrated at a layer above the application layer, avoiding data redundancy problems.

Operational systems generally require a non-trivial level of computing skills amongst the end-user community. | Data warehousing systems generally appeal to an end-user community with a wide range of computing skills, from novice to expert users.

Table 1: DW vs OS
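The difference between current, regularly updated operational data and non-volatile historical warehouse data can be made concrete with a small sketch using Python's built-in sqlite3 module. The tables op_account and dw_account_history, the functions record_transaction and load_snapshot, and the account data are hypothetical and only serve to illustrate the two access patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table: one row per account, overwritten in place.
conn.execute("CREATE TABLE op_account (account_id INTEGER PRIMARY KEY, balance REAL)")
# Warehouse table: one row per account per load date, only ever appended.
conn.execute(
    "CREATE TABLE dw_account_history (account_id INTEGER, load_date TEXT, balance REAL)"
)


def record_transaction(account_id, new_balance):
    # Operational write: small and fast, keeps only the current state.
    conn.execute(
        "INSERT OR REPLACE INTO op_account VALUES (?, ?)", (account_id, new_balance)
    )


def load_snapshot(load_date):
    # Warehouse load: the current state is added as new rows, nothing is updated.
    conn.execute(
        "INSERT INTO dw_account_history "
        "SELECT account_id, ?, balance FROM op_account",
        (load_date,),
    )


record_transaction(1, 100.0)
load_snapshot("2014-01-31")
record_transaction(1, 250.0)
load_snapshot("2014-02-28")

current = conn.execute(
    "SELECT balance FROM op_account WHERE account_id = 1"
).fetchall()
history = conn.execute(
    "SELECT load_date, balance FROM dw_account_history "
    "WHERE account_id = 1 ORDER BY load_date"
).fetchall()
print(current)
print(history)
```

After the second transaction the operational table only knows the latest balance, while the warehouse table still holds both monthly snapshots and can answer questions about the past.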

Pentaho Suite

Introduction

Pentaho was founded in 2004 and is headquartered in Orlando, FL, USA. One of its most important advantages is that it offers a suite of open source business intelligence (BI) products. These products, called Pentaho Business Analytics, provide data integration, OLAP (online analytical processing) services, reporting, dashboarding, data mining and ETL capabilities.

Pentaho is an open source business intelligence development platform with several components integrated into it. Both open source and commercial versions are available to support BI needs. This report aims to help open source BI developers integrate CTools on CDF to fulfil their dashboard development needs.

Figure 5: Pentaho Community Edition vs Pentaho Enterprise Edition

Installing Pentaho Suite

Now I will show how to install the Pentaho Suite community edition (CE), along with some tools, and explain their purpose.

a) Download Pentaho Server from http://community.pentaho.com/. Choose the zip or tar.gz archive according to preference

b) Install Tomcat

c) Set up MySQL

d) Configure the BI Server

Starting the BI Platform:

In order to use and configure the Pentaho BI Platform, you must start the BI Server, then the

Pentaho Administration Console.

1. To start the BI Server, run the start-pentaho script in the /biserver-ce/ directory.

2. To start the Pentaho Administration Console, run the start script (on Windows) or the startup script (on Linux) in the /biserver-ce/administration-console/ directory.

How to Log Into the Pentaho User Console

1. Open a Web browser and type in the Web or IP address of the Pentaho server, which is http://localhost:8080/pentaho/ by default. You'll see an introductory screen with some Pentaho-related information and a Login button in the center of the screen.

2. Click Login. The login dialog will appear.

3. For the locally installed version of the BI Suite, select Joe from the user drop-down box, type "password" into the password field, then click Login. For hosted demo users, select Guest and type "guest" as the password instead.

You are now logged into the Pentaho User Console and ready to start creating and running reports.

Figure 6: Pentaho’s Login interface
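Before opening the browser it can be handy to check from a script whether the BI Server has finished starting. The following is a minimal sketch using only the Python standard library. The function name console_is_reachable is hypothetical, and the only thing taken from the steps above is the default address http://localhost:8080/pentaho/. It does not log in, it only checks that the address answers.

```python
import urllib.error
import urllib.request

# Default address given in the login steps; adjust host and port if needed.
DEFAULT_CONSOLE_URL = "http://localhost:8080/pentaho/"


def console_is_reachable(url=DEFAULT_CONSOLE_URL, timeout=5.0):
    """Return True if an HTTP request to the console URL succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


if __name__ == "__main__":
    state = "reachable" if console_is_reachable() else "not reachable"
    print("Pentaho User Console is", state)
```

If the script reports that the console is not reachable right after running the start script, the server may simply still be starting, so it is worth retrying after a short wait.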

Trying some tools…

Community Dashboard Editor (CDE)

Community Dashboard Editor (CDE) is one of the plugins designed for the Pentaho BI Server, contributed and maintained by Pentaho partner Webdetails.

-The purpose of this tool is to create dashboards.

-CDE was created to simplify the creation, editing and rendering of CTools dashboards.

-CDE is a very powerful and complete tool, combining the front end with data sources and custom components in a seamless way.

To create a dashboard I followed the examples at these addresses:

http://type-exit.org/adventures-with-open-source-bi/2011/06/creating-dashboards-with-cde/comment-page-2/#comments

https://hernandezpaul.wordpress.com/2012/02/10/dashboard-creation-with-pentaho-and-cde-step-by-step-screencast-tutorial/

After CDE is installed, the Pentaho interface changes and a new CDE icon is added to it.

By experimenting and following the guides I was able to create a dashboard (screenshots below) that shows how many exams I took in each year of my bachelor's degree.

Saiku

Another tool that I studied is Saiku. Saiku is a modular open-source analysis suite offering lightweight OLAP which remains easily embeddable, extendable and configurable. It is similar in form and function to the Pentaho Analyzer plugin. It allows a user to create queries visually by dragging parts of a previously defined OLAP schema onto a canvas, where other activities can take place, such as filtering, sorting, creating calculated members from other measures, exporting the result table to PDF or MS Excel, and optionally graphing the data. A RESTful server connects to existing OLAP systems, which then powers user-friendly, intuitive analytics via a lightweight jQuery-based frontend.

Turning data into information shouldn't be hard, it should be easy and fun. The Saiku project is all

about creating tools that are easy-to-use by anyone who wants to crunch numbers, visualize

information, gain insight from data and act on it.

The Saiku page at http://meteorite.bi/saiku makes it much easier to understand how Saiku works.

If you want to learn more, you can go to these web addresses: http://pedroalves-bi.blogspot.it/2011/06 or http://codeissue.com/articles/a04e87158bb8552/pentaho-bi-ctools-cdf-cda-cde-saiku-analytics-etc-using-cygwin

Data warehouses and Pentaho Suite

Open-source Pentaho provides business intelligence (BI) and data warehousing solutions at a fraction of the cost of proprietary solutions. To learn more about combining data warehouses with the Pentaho suite, you might like to buy (or download) and read Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL.

http://meteorite.bi/saiku

http://pedroalves-bi.blogspot.it/2011/06

http://codeissue.com/articles/a04e87158bb8552/pentaho-bi-ctools-cdf-cda-cde-saiku-analytics-etc-using-cygwin

http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/dp/0470484322
