Data Warehouse Modernization with Big Data
Julio Pereira, Advisory Consultant, Dell EMC ([email protected])
Juliano Carniel, Advisory Consultant, Dell EMC ([email protected])
Ricardo Sugawara, Advisory Consultant, Dell EMC ([email protected])
Knowledge Sharing Article © 2017 Dell Inc. or its subsidiaries.
2017 Dell EMC Proven Professional Knowledge Sharing 2
Abstract
Big Data projects are usually associated with analytics (e.g., sentiment analysis and social
networks); however, their inherent ability to deal with huge amounts of data and to handle
many different source types makes them well suited to data warehouse applications.
By using Big Data tools and techniques, very complex tasks related to ETL/ELT and to RDBMS
schema modeling and maintenance can be simplified. Data visualization and reporting also
benefit from this approach, since the Big Data ecosystem is agnostic, providing a framework
that makes data access painless.
Putting in place a data lake where source data will land, and a sandbox for business users,
are important aspects underlying this approach, which will minimize complexity and costs.
Table of Contents
1 PROBLEM DESCRIPTION
1.1 DBMS Schema Limitations
1.2 New Data, New Interfaces
1.3 Long Development Life Cycles
1.4 Bad Infrastructure Approach
2 OUR APPROACH
2.1 ETL vs. ELT
2.2 Data Lake vs. Data Swamp
2.3 Creating and Structuring a Data Lake
2.3.1 Business Rules Review
2.3.2 Extracting or Receiving Data
2.3.3 Loading Data
2.3.4 Transforming or Enriching Data
2.3.5 Reporting and KPIs
2.3.6 Consuming and Exploring – Sandboxes
2.4 Data Temperature: Volume and Latency
3 REFERENCE ARCHITECTURE
3.1 Underlying Infrastructure
3.2 Structuring the Data Flow in the Data Lake
3.3 Latency vs. Volume – Decision matrix
3.4 Transactional data
4 CONCLUSIONS
Index of Figures
Figure 1: A simplified block diagram of an ETL-based architecture
Figure 2: Data Temperature
Figure 3: Reference Architecture – Infrastructure
Figure 4: Reference Architecture – Components
Figure 5: Data Layers
Figure 6: Latency vs. Volume
Figure 7: CDC or Extraction?
Disclaimer: The views, processes or methodologies published in this article are those of the
authors. They do not necessarily reflect Dell EMC’s views, processes or methodologies.
1 PROBLEM DESCRIPTION
Traditional Data Warehouse (DW) techniques, combined with Business Intelligence (BI) tools,
provide enterprises with analysis based on their private data.
As data grows in volume, diversity and velocity, the infrastructure must also receive
significant investment to accommodate the volume and the different types of data.
1.1 DBMS Schema Limitations
Database Management System (DBMS) development techniques impose that data must
fit into predetermined tables organized into columns of well-defined types.
This limits the amount of data that can be uploaded into the databases, either through
transformations that can be quite complex, or simply by skipping some fields, since loading
jobs consume time and resources ("load and transform only what one will make use of").
As a result, original data may be lost forever; occasionally it may be recovered, but at a
high cost (backups that need to be restored, lack of disk space to reprocess data, and so on).
1.2 New Data, New Interfaces
As systems change or get upgraded, or when business users need fields that were not
accounted for in previous development, DBMS schemas must be reengineered and interfaces
must be rewritten. Due to time and money constraints, this impacts not only source
systems, but analysis and reporting as well.
Another important limitation arises when data from social networks become necessary in
combination with traditional transactional systems: how will this “new” data be ingested and
used?
1.3 Long Development Life Cycles
By limiting the frequency and types of data that are ingested, a "boxed" ecosystem is put in place,
and the common issues of long development life cycles produce well-known side effects, such
as silos across the enterprise leading to inconsistent data.
As business needs change, priorities shift, and the backlog grows to a point where requests must
be reviewed, developments are suspended and efforts are wasted.
1.4 Bad Infrastructure Approach
Using the wrong approach to data extraction and data exploration can significantly impact the
infrastructure that serves the data.
When extracting data, one needs to be very careful about how and when to do it, to avoid
impacting the transactional databases.
For data exploration, one needs to be very conservative when sizing the hardware that
serves this data, and must know in advance variables such as the number of users consuming
the data and how much data is going to be stored.
Normally this infrastructure will be a single-node relational database, which can only grow
vertically up to a certain point and will require a huge enterprise storage system to hold all the
historical data that is not constantly used.
When faced with a huge amount of data, users will typically try to access it all
through the network and save it in another application, or even in a parallel database, to explore
it and merge it with other manually input data.
2 OUR APPROACH
The following topics describe the most relevant aspects of our approach to deploying a modern
DW design on a Data Lake, organizing data in such a way that business users derive
the greatest value. It enables direct data exploration and consumption by end users, without
long and costly Information Technology (IT) processes.
We will start by describing some of the key concepts of a Data Lake formation and then
elaborate on the approach.
2.1 ETL vs. ELT
There are generally two approaches to enterprise data processing for BI and analytics: Extract
Transform and Load (ETL), which is a traditional approach, and Extract Load and Transform
(ELT), which is a mechanism enabled by big data tools and architectures.
In a traditional BI environment, data is first extracted from the source database or file server,
passes through a possibly complex transformation pipeline, which includes filtering and data
enrichment, and is then loaded into a BI-dedicated database, from which reports are
generated. These pipelines and their destination databases are usually tied to a specific
business context.
The figure below illustrates a simple block diagram of a traditional BI ETL and DW architecture.
The diagram also identifies a set of challenges which are common to each component of the
architecture:
1. Ingestion speed is usually a challenge. Traditional BI components, both for data
processing and storage, are based on large machines, scaling vertically.
2. As a consequence, all data management infrastructure relies on enterprise
grade hardware and high cost software.
3. Data is moved across components which cannot share data seamlessly, thus
multiple redundant copies of the data are present across the data flow path.
4. Data is isolated in silos. This makes it hard to obtain answers related to
different sets of data. Also, data filtering is ubiquitous during the ETL process,
tying the reports and possible insights to a very limited spectrum related to the
data transformation pipeline.
5. After being archived, data is usually not available online. Historical data analysis
usually requires restoring backups in a cumbersome and costly process.
6. Because of the limited scalability, high cost, and limited spectrum of traditional
BI offerings, volume and variety of data actually processed is usually very low,
leaving important information unprocessed and unavailable to generate business
insights.
Figure 1: A simplified block diagram of an ETL-based architecture
In a Big Data environment, it is possible to extract data from the source systems and
load it into the data lake without any filtering or destructive transformation. Since the data
storage is highly integrated with the data processing components, this step may be deferred
and performed as required. This technique establishes the so-called ELT approach, as
data is first extracted, loaded in its entirety, and only then processed.
This approach makes it easier to guarantee data quality and process integrity, by validating
successful ingestion and transformation processes as well as providing complete,
non-truncated information. This is a fundamental tenet that enables data consumption across a much
broader spectrum. It also ensures that no further rework of the extraction processes is
needed when new data requirements appear, resulting in shorter delivery times.
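The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration, not tied to any specific tool: the "lake" is just a list, and the names `load_raw` and `transform` are hypothetical.

```python
import json

# Raw records are loaded into the "lake" exactly as extracted -- no
# filtering or destructive transformation at ingestion time (the "EL" part).
lake = []

def load_raw(record: dict) -> None:
    """Append the record to the lake unchanged (defensive copy)."""
    lake.append(json.loads(json.dumps(record)))

# Transformation is deferred and applied on read (the "T" part), so a
# new requirement never forces a re-extraction from the source system.
def transform(records, fields):
    """Project only the fields a given consumer needs, at query time."""
    return [{f: r.get(f) for f in fields} for r in records]

load_raw({"id": 1, "amount": 10.5, "channel": "web", "raw_note": "x"})
load_raw({"id": 2, "amount": 7.0, "channel": "store", "raw_note": "y"})

# A new report that needs "channel" requires no rework upstream:
report = transform(lake, ["id", "channel"])
```

Note that the untouched `raw_note` field is still in the lake: a future use case can pick it up without any change to the ingestion interface.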
2.2 Data Lake vs. Data Swamp
Even though the whole idea of big data is to bring in everything from various sources, storing
it and thereby forming a Data Lake that enables data processing and exploration in a flexible
paradigm, some precautions must be taken to avoid creating a "Data Swamp".
A well-formed Data Lake normally observes the following aspects of data management
and governance:
Data Catalog
Data Lineage
Data Provenance
Data Classification
Data Security
2.3 Creating and Structuring a Data Lake
Ultimately, a data lake is a huge area where data lands with no limitations on layout, format,
or frequency; the latter is much more related to the capacity of source systems to
generate data than to the receiving infrastructure.
Data should land with all possible information and since no transformations are performed
upfront, interfacing tends to be quite stable regardless of changes in business requirements.
The Data Lake formation and organization, and the consumption of the data normally follow
some basic steps as previously mentioned:
1. Business rules review
2. Extraction or reception
3. Loading
4. Transforming or enriching
5. Generating reports and Key Performance Indicators (KPIs)
6. Consuming or exploring by means of sandboxes
2.3.1 Business Rules Review
Instead of extracting, transforming and finally loading, data can now be extracted (or received),
loaded and finally transformed, which means there are virtually no limitations on what can be
used to create reports or business use cases: the interface is no longer the source of complexity.
Since legacy systems usually lack clear documentation, the complexity now resides in the task
of elucidating the business rules used to create reports. One may argue this approach is too
risky; however, with a functional data lake, enterprises can make use of data that is available
sooner and richer, so the time spent on this assessment normally pays off, since it allows the
business to revisit some old truths and sometimes change them altogether.
This is the key point that will enable DW/BI modernization: rather than reenacting a traditional
construct with a different infrastructure, enterprises can now start building a new understanding
of what could be achieved by crossing information from diverse sources.
2.3.2 Extracting or Receiving Data
Having at least a minimal understanding of how the data works in the source system is
crucial to determining how it is going to be extracted or received, since there are technologies
suited to each characteristic and business requirement of the data.
For example, if Near Real Time (NRT) data processing is needed, a tool that supports streaming
must be put in place. If the source system cannot push the data, a dedicated extractor needs to
be implemented. In other cases, there might be a functional or technical requirement that
prevents early extraction of the data; even though the data is desired earlier, it cannot be obtained.
Independently of the technology used to bring in or receive data, defining a pre-configured landing
zone is also crucial for organizing data and supporting later maintenance, which belongs to the Data
Governance discipline. This landing zone is also important to the underlying infrastructure, as
standardization simplifies backup policies, storage tiering and data purging.
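A standardized landing zone is easiest to picture as a path convention. The sketch below is a hypothetical layout (the `/landing/<source>/<dataset>/<yyyy>/<mm>/<dd>` scheme is our illustration, not prescribed by any tool); the point is that a predictable structure makes backup, tiering and purge policies trivial to express.

```python
from datetime import date

def landing_path(source: str, dataset: str, d: date) -> str:
    """Build a standardized landing-zone path for one daily drop.

    A predictable, date-partitioned layout lets retention jobs purge
    or re-tier whole directories instead of inspecting individual files.
    """
    return f"/landing/{source}/{dataset}/{d:%Y/%m/%d}"

print(landing_path("billing", "invoices", date(2017, 3, 1)))
# /landing/billing/invoices/2017/03/01
```

A purge job, for instance, can then simply delete every directory older than the retention window for that dataset.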
2.3.3 Loading Data
Main actions are:
Data Validation. Ingested or extracted data is validated in format and content
according to predefined rules to decide if the raw data is to be processed, or
diverted to manual validation. This is a simple first action to ensure a minimal
degree of data quality throughout the cluster.
Raw Data Archiving. A compressed copy is archived with a specific
retention period. This data is mostly used to enable reprocessing and ad-hoc
analysis.
These steps take place in a smaller cluster which operates independently from the main
production cluster.
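The two actions above can be sketched as follows. This is a minimal illustration in plain Python: the required-field rule and file names are assumptions, and a real pipeline would run on the ingestion cluster rather than in-process.

```python
import gzip
import json
import os
import tempfile

REQUIRED_FIELDS = {"id", "timestamp"}  # illustrative validation rule

def validate(record: dict) -> bool:
    """Check content against predefined rules; failures are diverted
    to manual validation instead of entering the cluster."""
    return REQUIRED_FIELDS.issubset(record)

def archive_raw(records, directory: str) -> str:
    """Write a compressed copy of the raw batch, kept for a retention
    period to enable reprocessing and ad-hoc analysis."""
    path = os.path.join(directory, "raw_batch.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for r in records:
            fh.write(json.dumps(r) + "\n")
    return path

batch = [{"id": 1, "timestamp": "2017-03-01T00:00:00"}, {"id": 2}]
valid = [r for r in batch if validate(r)]
diverted = [r for r in batch if not validate(r)]  # to manual validation

with tempfile.TemporaryDirectory() as tmp:
    archived = archive_raw(batch, tmp)
    with gzip.open(archived, "rt", encoding="utf-8") as fh:
        restored = [json.loads(line) for line in fh]  # reprocessing path
```

Note that the archive stores the full batch, including the diverted record: reprocessing must be able to start from the complete raw input.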
2.3.4 Transforming or Enriching Data
After the data is validated and archived, it is sent to be enriched with external and auxiliary
information and finally ingested into the semantic layer, based on Hive. Great attention is
dedicated to the metadata describing this layer, especially the definition of complete and
accurate metadata with a detailed description of each field.
Data in this layer is the foundation for all other processes that generate reports and KPIs,
being structured, partitioned and de-normalized to optimize the consumer processes as much as
possible.
Given the importance of this layer and different business requirements, retention period for
each type of data is independently controlled in order to allow periods from a few days to
years.
This layer and cluster are dedicated to production processes and ad-hoc reporting. For
end-user access and exploration in sandboxes, it is common to provide a small portion of the data.
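A partitioned semantic-layer table can be illustrated by the DDL it would require. The sketch below renders hypothetical HiveQL from Python; the schema, table and column names are our own examples, not from the text.

```python
def semantic_table_ddl(table: str, columns: dict, partitions: dict) -> str:
    """Render a (hypothetical) Hive DDL string for an external,
    partitioned semantic-layer table. Partitioning by ingestion date
    lets consumer queries prune data and keeps retention per-partition."""
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns.items())
    parts = ", ".join(f"{name} {typ}" for name, typ in partitions.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS ORC"
    )

ddl = semantic_table_ddl(
    "semantic.sales",                              # hypothetical table
    {"order_id": "BIGINT", "amount": "DOUBLE"},    # hypothetical columns
    {"ingest_date": "STRING"},
)
print(ddl)
```

Per-partition retention is what allows the independently controlled periods mentioned above, from a few days to years, for each type of data.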
2.3.5 Reporting and KPIs
After the data is ingested into the semantic layer, it can be consumed by reporting and use case
jobs. Jobs in this phase fetch data from the semantic layer and derive additional data, which
is also stored in the Data Lake for later recovery if needed. Further processing or aggregation
is also possible.
The data generated by these jobs is also sent to the Greenplum Database (GPDB) to be
accessed by external applications or explored by users in their sandboxes.
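A KPI job of this kind is, at its core, an aggregation over semantic-layer rows. The sketch below is a deliberately simple pure-Python stand-in (the row fields and KPI are invented for illustration); in practice the same shape would be a Hive query or a distributed job.

```python
from collections import defaultdict

def kpi_by_key(rows, key: str, value: str) -> dict:
    """Derive a KPI (here a plain sum) grouped by a dimension.
    The result would be stored back in the Data Lake and exported
    to GPDB for external applications and sandboxes."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

# Hypothetical semantic-layer rows:
semantic_rows = [
    {"region": "south", "revenue": 100.0},
    {"region": "north", "revenue": 40.0},
    {"region": "south", "revenue": 60.0},
]
kpis = kpi_by_key(semantic_rows, "region", "revenue")
```

Because the derived result is also persisted in the lake, a later job can aggregate it further without re-reading the full semantic layer.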
2.3.6 Consuming and Exploring – Sandboxes
Instead of limiting what can be constructed, why not make available to end users a set of
data that can be used as a playground?
Rather than defining what a sandbox is, let us define what it is not:
Unsupervised
Unlimited
Used for production
Widely open
A sandbox is a playground area which users can access to explore the data, test new ideas,
and prototype new projects, which will later be properly defined and documented; as a side
benefit, it shortens delivery cycles and lowers user frustration.
The concept of sandboxes can be spread throughout any company, and any business area
can profit from its own sandbox. Data present in sandboxes can be shared
across different areas, so it is not siloed.
The main points to be considered when constructing a sandbox are:
Limited and monitored space given to each area
Limited concurrent queries by area
Limited memory allocation by query
Well-defined data access
Workload management
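These guardrails can be modeled in a few lines. The sketch below is a toy in-process model; the default quota, concurrency and memory numbers are illustrative assumptions, and a real deployment would enforce them through the database's workload management.

```python
class Sandbox:
    """Toy model of the sandbox guardrails listed above: a space quota,
    a concurrency limit and a per-query memory cap (illustrative defaults)."""

    def __init__(self, area, quota_gb=100, max_concurrent=5, max_query_mem_gb=8):
        self.area = area
        self.quota_gb = quota_gb
        self.max_concurrent = max_concurrent
        self.max_query_mem_gb = max_query_mem_gb
        self.used_gb = 0.0
        self.running = 0

    def write_data(self, gb: float) -> bool:
        """Accept new data only while the area stays within its quota."""
        if self.used_gb + gb > self.quota_gb:
            return False
        self.used_gb += gb
        return True

    def admit_query(self, mem_gb: float) -> bool:
        """Admit a query only if concurrency and memory limits allow it."""
        if self.running >= self.max_concurrent or mem_gb > self.max_query_mem_gb:
            return False
        self.running += 1
        return True

sb = Sandbox("marketing", quota_gb=50, max_concurrent=2)
assert sb.write_data(30)          # within quota
assert not sb.write_data(30)      # would exceed the 50 GB quota
assert sb.admit_query(4) and sb.admit_query(4)
assert not sb.admit_query(4)      # concurrency limit reached
```

The key design point is that limits are per business area, so one team's exploration cannot starve another's.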
2.4 Data Temperature: Volume and Latency
Although the ELT approach makes it much more flexible to manage data in a Data Lake, new
analytical systems are required to respond in ever-diminishing time frames. Everyone wants
"real-time insights". To provide a very simple framework for approaching these often ill-specified
requirements, we employ an analogy based on "data temperatures".
Not only has it helped users grasp the overall tenets of data velocity, it also guides the
system's designers when determining which technical solution is best suited.
The concept is based on the latency of the processing and querying requirements. The
“hotter” the data, the lower its latency requirements. Hot data refers to near real time data
ingestion and processing. It is usually employed to implement operational use cases which
will generate insights and actions on real time events, usually related to relevant business
transactions or consumer activity.
Unless justified by some business requirement, it usually makes little sense to implement a real-
time architecture for updates to auxiliary databases that are by nature updated in batches.
Because of their business-context-related nature, these real-time applications usually
operate on limited data volumes, which are filtered and aggregated at the beginning of
the ingestion process.
An intermediate data temperature is usually associated with interactive queries and ingestion
organized in micro-batches across the day. This makes data available to the user faster, both
in terms of its availability in the data lake after ingestion and in query response times, providing
an interactive experience. If the user has to wait several hours for a query
to finish, the data cannot be classified as warm. Warm data requires high-performance
query processing and reasonably quick data ingestion pipelines, while providing the user a
broader context and complete data sets.
For historical data in long-term storage, we require a cost-efficient infrastructure which will
support data model training, report generation within a processing window, and so on. This is
achieved with cold data, which, unlike in a traditional BI system, is readily available for queries.
The figure below illustrates this data temperature concept.
Figure 2: Data Temperature
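The tiers in Figure 2 can be sketched as a trivial classifier over a latency requirement. The thresholds below are illustrative assumptions of ours, not values from the text: hot means near real time, warm means interactive or micro-batch, cold means long-running batch over historical data.

```python
def data_temperature(latency_seconds: float) -> str:
    """Map a latency requirement to a temperature tier.

    Thresholds are illustrative: under a minute is treated as near
    real time (hot), up to a few hours as interactive/micro-batch
    (warm), anything slower as batch over historical data (cold).
    """
    if latency_seconds < 60:
        return "hot"
    if latency_seconds < 4 * 3600:
        return "warm"
    return "cold"

print(data_temperature(5))       # hot
print(data_temperature(1800))    # warm
print(data_temperature(86400))   # cold
```

Even this crude mapping is useful in requirement workshops: forcing a concrete latency number out of "we want real-time insights" usually reveals that warm or cold is sufficient.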
With the data temperature concept, we can also ask which types of use cases and data
usage patterns each temperature is usually associated with. Two immediate usage modes come to
mind: operational and exploratory/analytical.
Operational data usage typically involves:
Limited data sets
Low latency and continuous flow
Integration with applications used to generate business drivers
Input generation to provide feedback to analytical models
In contrast, exploratory and analytical usage typically involves:
Complete data sets
Batch processing and ingestion
Modeling and refinement through exploration iterations
A resulting model that will be integrated into new or existing applications
Though one may say that the typical DW/BI falls into the "operational" classification, our big data
solution philosophy expands this definition.
3 REFERENCE ARCHITECTURE
This section describes the architectural decisions and components employed when designing
the Big Data DW solution. The architecture presents decision points ranging from the very
basic network and compute infrastructure up to the data modeling and consumption levels.
3.1 Underlying Infrastructure
The solution presented is based on the following components:
Isilon as the main storage
Commodity nodes for computing
Hadoop as the underlying software suite providing cluster computing and data
management
Hive as a massive data warehousing infrastructure, enabling data reporting
and KPI calculation, as well as serving as an exploratory layer, including
data discovery and ad-hoc queries
GPDB as a high performance exploratory layer, sandbox areas and serving
as a primary user interface
Analytical tools such as SAS, Tableau or QlikView
Other tools/technologies can be used together within the Hadoop ecosystem
to support specific use cases, such as NoSQL databases, micro services-
based APIs, etc.
The figure below illustrates at a high level a potential architecture and the interaction between
the infrastructure components. The architecture is divided into three main verticals, with their
interactions highlighted in the illustration. Notice that the complete cycle of data ingestion
and processing takes place inside the Storage/Data Lake layer, which is provided by Isilon
and Hadoop. A small portion of this data is also copied to the exploratory layer, provided
by GPDB, in order to offer specialized data mart and sandbox areas. In this case there is
duplication of a fraction of the data available in the Data Lake.
Figure 3: Reference Architecture – Infrastructure
A logical diagram of the architecture is presented below. The reference architecture
associates the data temperature analogy with the components and data flow of a traditional
fast data and big data architecture. We identify the operational and exploratory usage
patterns and their supporting components.
Figure 4: Reference Architecture – Components
3.2 Structuring the Data Flow in the Data Lake
The overall data flow process is based on landing zones, ingestion into a semantic layer
forming the DW, and additional processing steps to generate sandboxes and contextualized
data marts, reports and KPIs. The data flow organization is presented below:
1. Raw data arrives at a landing area.
2. After basic validation, data is ready to be processed. When a
determined time period is reached, valid raw data is compressed and
archived. At this point, the schema is flexible, the context is open,
and no aggregation is applied.
3. Enrichment is applied to valid data, simulating an ODS-like structure
(semantic layer). After a first retention period, data is archived into a lower
tier in the Isilon storage. Schema is still flexible in this level, and context is
open as no aggregations or filters are applied.
4. From the previous layer, specific aggregation and filtering processes may
be applied to the data, which is now ready to be used by reporting or
business use cases. Schema and context become dedicated after such processes.
5. By exporting to a GPDB instance (by means of external tables residing in
Hive), resulting data, either from the semantic layer or the contextualized
store, becomes accessible to SAS/Guide or SAS/VA servers.
Figure 5: Data Layers
3.3 Latency vs. Volume – Decision matrix
Some key aspects to consider when it comes to choosing technologies and components to
extract/receive and process the data include:
Volume
Velocity
Business requirements
Functional dependencies
Data source characteristics
While the first two items in the list are very technical, there are also non-technical
characteristics, such as functional requirements and/or dependencies and non-functional
requirements from the end users.
Although non-technical requirements related to business value and user experience may be
the most important factors, it is very difficult to define a general decision-making process
for choosing the architecture and components.
This requires detailed analysis, which is usually only possible when dealing with real data,
after a careful definition of the use case, including its data processing and workload requirements.
It is possible, however, to establish a few general ideas to guide which technology may be
applied by looking into simple variables (such as data volume and velocity), as shown in the
figure below.
Figure 6: Latency vs. Volume
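The matrix in Figure 6 can be reduced to a rule of thumb: only when volume or velocity is high does the choice of tool and paradigm become critical. The helper below is a hedged sketch; the thresholds (500 GB/day, 5 minutes) and the recommendation labels are illustrative assumptions, not values from the figure.

```python
def suggested_approach(volume_gb_per_day: float, latency_seconds: float) -> str:
    """Rule-of-thumb guidance mirroring the latency-vs-volume matrix.
    High volume or low latency pushes toward the lower diagonal, where
    architecture and tool choice strongly affect performance."""
    high_volume = volume_gb_per_day > 500   # illustrative threshold
    low_latency = latency_seconds < 300     # illustrative threshold
    if high_volume and low_latency:
        return "distributed streaming pipeline, carefully tuned"
    if high_volume:
        return "distributed batch processing"
    if low_latency:
        return "lightweight streaming ingestion"
    return "any simple batch tool will do"

print(suggested_approach(10, 86400))   # upper diagonal: choice not critical
print(suggested_approach(1000, 10))    # lower diagonal: choice critical
```

The "upper diagonal" of the figure corresponds to the last branch, where no specific technical guideline is needed.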
When dealing with larger processing windows, especially with lower data volumes, the choice
of tool and general architecture, as well as the programming paradigm, is not as critical to
overall performance as it is in the case of higher volumes or velocities. It may therefore be
observed in the illustration that the upper diagonal of the graph does not contain any specific
guidelines on technical options (such as when a very low volume of data must be ingested
within large time frames). The lower diagonal does contain guidelines, as data volume and/or
velocity will make system performance highly sensitive to the architecture and tool choices.
3.4 Transactional data
Normally, there are not many architectural options for extracting data from transactional
systems. Either the system itself supports some kind of streaming, or extractions must
be performed directly from the database. When the system is not event-driven or does not
provide facilities to extract events, there are basically two options left: Change Data Capture
(CDC) and Custom Extraction (CE).
CDC can be achieved with a tool that tracks the transaction logs of the database. These
tools normally support capturing data and either sending it as a near-real-time stream or
consolidating a batch and sending it to the destination. In the case of Oracle databases,
the most common tool is Oracle GoldenGate, which is used as the illustration in the
next figure.
CE is also an option when there is no latency requirement. In this case, the source
system/database limits the frequency of the extractions based on many factors, which can
be summarized in the questions below:
Can the extraction affect the load on the database?
Is there any functional requirement that prevents the data from being loaded more
often?
How frequently is the data updated in the database?
Should the data be extracted in an intermediate state?
Considering all of this, the decision on which approach to use to extract the data can be
summarized in the figure below.
Figure 7: CDC or Extraction?
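The decision in Figure 7 can be sketched as a small function. This is our simplified reading of the choice, with invented parameter names: prefer the source's own streaming, fall back to log-based CDC when low latency is required, and otherwise schedule a custom extraction.

```python
def extraction_approach(source_streams: bool, near_real_time: bool,
                        can_read_tx_logs: bool) -> str:
    """Sketch of the CDC-or-extraction decision. If the source can
    push events itself, use that; if low latency is needed and the
    transaction logs are accessible, use CDC; otherwise a scheduled
    custom extraction (CE) is enough."""
    if source_streams:
        return "native streaming"
    if near_real_time and can_read_tx_logs:
        return "CDC (log-based capture)"
    return "custom extraction (batch)"

print(extraction_approach(False, True, True))    # CDC (log-based capture)
print(extraction_approach(False, False, False))  # custom extraction (batch)
```

In the batch branch, the questions listed above (load impact, update frequency, intermediate states) then determine how often the extraction may run.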
4 CONCLUSIONS
By changing the way a traditional DW/BI is conceived, enterprises are now able to produce
business cases that were not possible in the past, either because business users were
limited by existing techniques, or simply because of a lack of computational power.
Bringing all the data from source systems in different formats, preparing it and exposing it
to the business enables the company to be more agile, generate value faster and create a
unique partnership between business and IT.
The concept of a sandbox applied to DW/BI brings the data closer to those who really need it,
eliminating the "middle man" represented by DW/BI developers. This represents high value
delivered to the business specialist in a fast and consistent way.
These newly empowered business users will not need to "explain" to others what is really
needed, and in return they will realize what can really be implemented.
Another upside of this approach is that data accuracy and business rules are validated faster,
and the use cases implemented are much more reliable, because the user can validate them
down to the detail using the sandbox data directly.
Glossary
BI Business Intelligence
CDC Change Data Capture
CE Custom Extraction
DW Data Warehouse
ETL Extract Transform and Load
ELT Extract Load and Transform
IT Information Technology
(R)DBMS (Relational) Database Management System
GPDB Greenplum Database
KPI Key Performance Indicator
NRT Near Real Time
Dell EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." DELL EMC MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE
INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying and distribution of any Dell EMC software described in this publication requires
an applicable software license.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.