Data Warehouse Modernization with Big Data
Julio Pereira, Advisory Consultant, Dell EMC ([email protected])
Juliano Carniel, Advisory Consultant, Dell EMC ([email protected])
Ricardo Sugawara, Advisory Consultant, Dell EMC ([email protected])
Knowledge Sharing Article © 2017 Dell Inc. or its subsidiaries.
2017 Dell EMC Proven Professional Knowledge Sharing 2
Abstract
Big Data projects are usually associated with analytics (e.g., sentiment analysis and social
networks); however, their inherent ability to deal with huge amounts of data and to handle
many different source types makes them well suited to data warehouse applications.
By using Big Data tools and techniques, very complex tasks related to ETL/ELT and to RDBMS
schema modeling and maintenance can be simplified. Data visualization and reporting also
benefit from this approach, since the Big Data ecosystem is agnostic, providing a framework
that makes data access painless.
Putting in place a data lake where source data will land, and a sandbox for business users,
are important aspects underlying this approach, which will minimize complexity and costs.
Table of Contents
1 PROBLEM DESCRIPTION
1.1 DBMS Schema Limitations
1.2 New Data, New Interfaces
1.3 Long Development Life Cycles
1.4 Bad Infrastructure Approach
2 OUR APPROACH
2.1 ETL vs. ELT
2.2 Data Lake vs. Data Swamp
2.3 Creating and Structuring a Data Lake
2.3.1 Business Rules Review
2.3.2 Extracting or Receiving Data
2.3.3 Loading Data
2.3.4 Transforming or Enriching Data
2.3.5 Reporting and KPIs
2.3.6 Consuming and Exploring – Sandboxes
2.4 Data Temperature: Volume and Latency
3 REFERENCE ARCHITECTURE
3.1 Underlying Infrastructure
3.2 Structuring the Data Flow in the Data Lake
3.3 Latency vs. Volume – Decision matrix
3.4 Transactional data
4 CONCLUSIONS
Index of Figures
Figure 1: A simplified block diagram of an ETL-based architecture
Figure 2: Data Temperature
Figure 3: Reference Architecture – Infrastructure
Figure 4: Reference Architecture – Components
Figure 5: Data Layers
Figure 6: Latency vs. Volume
Figure 7: CDC or Extraction?
Disclaimer: The views, processes or methodologies published in this article are those of the
authors. They do not necessarily reflect Dell EMC’s views, processes or methodologies.
1 PROBLEM DESCRIPTION
Traditional Data Warehouse (DW) techniques, combined with Business Intelligence (BI) tools,
provide enterprises with analysis based on their private data.
As data grows in volume, diversity and velocity, the infrastructure must also receive
significant investment to accommodate the volume and the different types of data.
1.1 DBMS Schema Limitations
Database Management System (DBMS) development techniques impose that data must
fit into predetermined tables organized into columns of well-defined types.
This limits the amount of data that can be uploaded into the databases, either through
transformations that can be quite complex, or simply by skipping some fields, since loading
jobs consume time and resources ("load and transform only what one will make use of").
As a result, original data may be lost forever; occasionally it may be recovered, but at a
high cost (backups that need to be restored, lack of disk space to reprocess data, and so on).
1.2 New Data, New Interfaces
As systems change or get upgraded, or when business users need fields that were not
accounted for in previous development, DBMS schemas must be reengineered and interfaces
must be rewritten. Due to time and money constraints, this impacts not only source
systems, but analysis and reporting as well.
Another important limitation arises when data from social networks become necessary in
combination with traditional transactional systems: how will this “new” data be ingested and
used?
1.3 Long Development Life Cycles
By limiting the frequency and types of data that are ingested, a "boxed" ecosystem is put in place,
and the common issues of long development life cycles produce well-known side effects, such
as silos across the enterprise leading to inconsistent data.
As business needs change, priorities shift, and the backlog grows to a point where requests must
be reviewed, developments are suspended and efforts are wasted.
1.4 Bad Infrastructure Approach
Using the wrong approach to data extraction and data exploration can significantly impact the
infrastructure that serves the data.
When extracting data, one needs to be very careful about how and when to do it, to avoid
impacting the transactional databases.
For data exploration, one needs to be very conservative when sizing the hardware that
serves this data, and must know in advance variables such as the number of users consuming
the data and how much data is going to be stored.
Normally this infrastructure will be a single-node relational database, which can only grow
vertically up to a certain point and will require a huge enterprise storage system to hold all the
historical data that is not constantly used.
When faced with a huge amount of data, users will typically try to access it all
through the network and save it in another application, or even in a parallel database, to explore
it and merge it with other manually input data.
2 OUR APPROACH
The following topics describe the most relevant aspects of our approach to deploying a modern
DW design on a Data Lake, organizing data in such a way that business users derive
the greatest value. It enables direct data exploration and consumption by end users, without
long and costly Information Technology (IT) processes.
We will start by describing some of the key concepts of a Data Lake formation and then
elaborate on the approach.
2.1 ETL vs. ELT
There are generally two approaches to enterprise data processing for BI and analytics: Extract
Transform and Load (ETL), which is a traditional approach, and Extract Load and Transform
(ELT), which is a mechanism enabled by big data tools and architectures.
In a traditional BI environment, data is first extracted from the source database or file server,
passes through a possibly complex transformation pipeline, which includes filtering and data
enrichment, and is then loaded into a BI-dedicated database, from which reports are
generated. These pipelines and their destination databases are usually tied to a specific
business context.
The figure below illustrates a simple block diagram of a traditional BI ETL and DW architecture.
The diagram also identifies a set of challenges which are common to each component of the
architecture:
1. Ingestion speed is usually a challenge. Traditional BI components, both for data
processing and storage, are based on large machines, scaling vertically.
2. As a consequence, all data management infrastructure relies on enterprise
grade hardware and high cost software.
3. Data is moved across components which cannot share data seamlessly, thus
multiple redundant copies of the data are present across the data flow path.
4. Data is isolated in silos. This makes it hard to obtain answers related to
different sets of data. Also, data filtering is ubiquitous during the ETL process,
tying the reports and possible insights to a very limited spectrum related to the
data transformation pipeline.
5. After being archived, data is usually not available online. Historical data analysis
usually requires restoring backups in a cumbersome and costly process.
6. Because of the limited scalability, high cost, and limited spectrum of traditional
BI offerings, volume and variety of data actually processed is usually very low,
leaving important information unprocessed and unavailable to generate business
insights.
Figure 1: A simplified block diagram of an ETL-based architecture
In a Big Data environment, it is possible to extract data from the source systems and
load it into the data lake without any filtering or destructive transformation. Since the data
storage is highly integrated with the data processing components, this step may be deferred
and performed as required. This technique establishes the so-called ELT approach, as
data is first extracted, loaded in its entirety, and only then processed.
This approach makes it easier to guarantee data quality and process integrity, by validating
successful ingestion and transformation processes as well as providing complete,
non-truncated information. This is a fundamental tenet that enables data consumption across a much
broader spectrum. It also ensures that no further rework of the extraction processes is
needed when new data requirements appear, resulting in shorter delivery times.
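The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration, not tied to any specific tool: the "lake" is just a list, and the names `load_raw` and `transform` are hypothetical.

```python
import json

# Raw records are loaded into the "lake" exactly as extracted -- no
# filtering or destructive transformation at ingestion time (the "EL" part).
lake = []

def load_raw(record: dict) -> None:
    """Append the record to the lake unchanged (defensive copy)."""
    lake.append(json.loads(json.dumps(record)))

# Transformation is deferred and applied on read (the "T" part), so a
# new requirement never forces a re-extraction from the source system.
def transform(records, fields):
    """Project only the fields a given consumer needs, at query time."""
    return [{f: r.get(f) for f in fields} for r in records]

load_raw({"id": 1, "amount": 10.5, "channel": "web", "raw_note": "x"})
load_raw({"id": 2, "amount": 7.0, "channel": "store", "raw_note": "y"})

# A new report that needs "channel" requires no rework upstream:
report = transform(lake, ["id", "channel"])
```

Note that the untouched `raw_note` field is still in the lake: a future use case can pick it up without any change to the ingestion interface.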
2.2 Data Lake vs. Data Swamp
Even though the whole idea of big data is to bring in everything from various sources, storing
it and thereby forming a Data Lake that enables data processing and exploration in a flexible
paradigm, some precautions must be taken to avoid creating a "Data Swamp".
A well-formed Data Lake normally observes the following aspects of data management
and governance:
Data Catalog
Data Lineage
Data Provenance
Data Classification
Data Security
2.3 Creating and Structuring a Data Lake
Ultimately, a data lake is a huge area where data lands with no limitations on layout, format,
or frequency; the latter is much more related to the capacity of source systems to
generate data than to the receiving infrastructure.
Data should land with all possible information and since no transformations are performed
upfront, interfacing tends to be quite stable regardless of changes in business requirements.
The Data Lake formation and organization, and the consumption of the data normally follow
some basic steps as previously mentioned:
1. Business rules review
2. Extraction or reception
3. Loading
4. Transforming or enriching
5. Generating reports and Key Performance Indicators (KPIs)
6. Consuming or exploring by means of sandboxes
2.3.1 Business Rules Review
Instead of extracting, transforming and finally loading, data can now be extracted (or received),
loaded and finally transformed, which means there are virtually no limitations on what can be
used to create reports or business use cases: the interface is no longer the source of complexity.
Since legacy systems usually lack clear documentation, the complexity now resides in the task
of elucidating the business rules used to create reports. One may argue this approach is too
risky; however, with a functional data lake, enterprises can make use of data that is available
sooner and richer, so the time spent on this assessment normally pays off, since it allows the
business to revisit some old truths and sometimes change them altogether.
This is the key point that will enable DW/BI modernization: rather than reenacting a traditional
construct with a different infrastructure, enterprises can now start building a new understanding
of what could be achieved by crossing information from diverse sources.
2.3.2 Extracting or Receiving Data
Having at least a minimal understanding of how the data works in the source system is
crucial to determining how it is going to be extracted or received, since there are technologies
suited to each characteristic and business requirement of the data.
For example, if Near Real Time (NRT) data processing is needed, a tool that supports streaming
must be put in place. If the source system cannot push the data, a dedicated extractor needs to
be implemented. In other cases, there might be a functional or technical requirement that
prevents early extraction of the data; even though the data is desired earlier, it cannot be obtained.
Independently of the technology used to bring in or receive data, defining a pre-configured landing
zone is also crucial for organizing data and supporting later maintenance, which belongs to the Data
Governance discipline. This landing zone is also important to the underlying infrastructure, as
standardization simplifies backup policies, storage tiering and data purging.
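A standardized landing zone is easiest to picture as a path convention. The sketch below is a hypothetical layout (the `/landing/<source>/<dataset>/<yyyy>/<mm>/<dd>` scheme is our illustration, not prescribed by any tool); the point is that a predictable structure makes backup, tiering and purge policies trivial to express.

```python
from datetime import date

def landing_path(source: str, dataset: str, d: date) -> str:
    """Build a standardized landing-zone path for one daily drop.

    A predictable, date-partitioned layout lets retention jobs purge
    or re-tier whole directories instead of inspecting individual files.
    """
    return f"/landing/{source}/{dataset}/{d:%Y/%m/%d}"

print(landing_path("billing", "invoices", date(2017, 3, 1)))
# /landing/billing/invoices/2017/03/01
```

A purge job, for instance, can then simply delete every directory older than the retention window for that dataset.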
2.3.3 Loading Data
Main actions are:
Data Validation. Ingested or extracted data is validated in format and content
according to predefined rules to decide if the raw data is to be processed, or
diverted to manual validation. This is a simple first action to ensure a minimal
degree of data quality throughout the cluster.
Raw Data Archiving. A compressed copy is archived with a specific
retention period. This data is mostly used to enable reprocessing and ad-hoc
analysis.
These steps take place in a smaller cluster which operates independently from the main
production cluster.
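The two actions above can be sketched as follows. This is a minimal illustration in plain Python: the required-field rule and file names are assumptions, and a real pipeline would run on the ingestion cluster rather than in-process.

```python
import gzip
import json
import os
import tempfile

REQUIRED_FIELDS = {"id", "timestamp"}  # illustrative validation rule

def validate(record: dict) -> bool:
    """Check content against predefined rules; failures are diverted
    to manual validation instead of entering the cluster."""
    return REQUIRED_FIELDS.issubset(record)

def archive_raw(records, directory: str) -> str:
    """Write a compressed copy of the raw batch, kept for a retention
    period to enable reprocessing and ad-hoc analysis."""
    path = os.path.join(directory, "raw_batch.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for r in records:
            fh.write(json.dumps(r) + "\n")
    return path

batch = [{"id": 1, "timestamp": "2017-03-01T00:00:00"}, {"id": 2}]
valid = [r for r in batch if validate(r)]
diverted = [r for r in batch if not validate(r)]  # to manual validation

with tempfile.TemporaryDirectory() as tmp:
    archived = archive_raw(batch, tmp)
    with gzip.open(archived, "rt", encoding="utf-8") as fh:
        restored = [json.loads(line) for line in fh]  # reprocessing path
```

Note that the archive stores the full batch, including the diverted record: reprocessing must be able to start from the complete raw input.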
2.3.4 Transforming or Enriching Data
After the data is validated and archived, it is sent to be enriched with external and auxiliary
information and finally ingested into the semantic layer, based on Hive. Great attention is
dedicated to the metadata describing this layer, especially the definition of complete and
accurate metadata with a detailed description of each field.
Data in this layer is the foundation for all other processes that generate reports and KPIs,
being structured, partitioned and de-normalized to optimize the consumer processes as much as
possible.
Given the importance of this layer and different business requirements, retention period for
each type of data is independently controlled in order to allow periods from a few days to
years.
This layer and cluster are dedicated to production processes and ad-hoc reporting. For
end-user access and exploration in sandboxes, it is common to provide a small portion of the data.
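A partitioned semantic-layer table can be illustrated by the DDL it would require. The sketch below renders hypothetical HiveQL from Python; the schema, table and column names are our own examples, not from the text.

```python
def semantic_table_ddl(table: str, columns: dict, partitions: dict) -> str:
    """Render a (hypothetical) Hive DDL string for an external,
    partitioned semantic-layer table. Partitioning by ingestion date
    lets consumer queries prune data and keeps retention per-partition."""
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns.items())
    parts = ", ".join(f"{name} {typ}" for name, typ in partitions.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS ORC"
    )

ddl = semantic_table_ddl(
    "semantic.sales",                              # hypothetical table
    {"order_id": "BIGINT", "amount": "DOUBLE"},    # hypothetical columns
    {"ingest_date": "STRING"},
)
print(ddl)
```

Per-partition retention is what allows the independently controlled periods mentioned above, from a few days to years, for each type of data.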
2.3.5 Reporting and KPIs
After the data is ingested into the semantic layer, it can be consumed by reporting and use case
jobs. Jobs in this phase fetch data from the semantic layer and derive additional data, which
is also stored in the Data Lake for later recovery if needed. Further processing or aggregation
is also possible.
The data generated by these jobs is also sent to the Greenplum Database (GPDB) to be
accessed by external applications or explored by users in their sandboxes.
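A KPI job of this kind is, at its core, an aggregation over semantic-layer rows. The sketch below is a deliberately simple pure-Python stand-in (the row fields and KPI are invented for illustration); in practice the same shape would be a Hive query or a distributed job.

```python
from collections import defaultdict

def kpi_by_key(rows, key: str, value: str) -> dict:
    """Derive a KPI (here a plain sum) grouped by a dimension.
    The result would be stored back in the Data Lake and exported
    to GPDB for external applications and sandboxes."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

# Hypothetical semantic-layer rows:
semantic_rows = [
    {"region": "south", "revenue": 100.0},
    {"region": "north", "revenue": 40.0},
    {"region": "south", "revenue": 60.0},
]
kpis = kpi_by_key(semantic_rows, "region", "revenue")
```

Because the derived result is also persisted in the lake, a later job can aggregate it further without re-reading the full semantic layer.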
2.3.6 Consuming and Exploring – Sandboxes
Instead of limiting what can be constructed, why not make available to end users a set of
data that can be used as a playground?
Rather than defining what a sandbox is, let us define what it is not:
Unsupervised
Unlimited
Used for production
Widely open
A sandbox is a playground area which users can access to explore the data, test new ideas,
and prototype new projects, which will later be properly defined and documented; as a side
benefit, it shortens delivery cycles and lowers user frustration.
The concept of sandboxes can be spread throughout any company, and any business area
can profit from its own sandbox. Data present in sandboxes can be shared
across different areas, so it is not siloed.
The main points to be considered when constructing a sandbox are:
Limited and monitored space given to each area
Limited concurrent queries by area
Limited memory allocation by query
Well-defined data access
Workload management
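These guardrails can be modeled in a few lines. The sketch below is a toy in-process model; the default quota, concurrency and memory numbers are illustrative assumptions, and a real deployment would enforce them through the database's workload management.

```python
class Sandbox:
    """Toy model of the sandbox guardrails listed above: a space quota,
    a concurrency limit and a per-query memory cap (illustrative defaults)."""

    def __init__(self, area, quota_gb=100, max_concurrent=5, max_query_mem_gb=8):
        self.area = area
        self.quota_gb = quota_gb
        self.max_concurrent = max_concurrent
        self.max_query_mem_gb = max_query_mem_gb
        self.used_gb = 0.0
        self.running = 0

    def write_data(self, gb: float) -> bool:
        """Accept new data only while the area stays within its quota."""
        if self.used_gb + gb > self.quota_gb:
            return False
        self.used_gb += gb
        return True

    def admit_query(self, mem_gb: float) -> bool:
        """Admit a query only if concurrency and memory limits allow it."""
        if self.running >= self.max_concurrent or mem_gb > self.max_query_mem_gb:
            return False
        self.running += 1
        return True

sb = Sandbox("marketing", quota_gb=50, max_concurrent=2)
assert sb.write_data(30)          # within quota
assert not sb.write_data(30)      # would exceed the 50 GB quota
assert sb.admit_query(4) and sb.admit_query(4)
assert not sb.admit_query(4)      # concurrency limit reached
```

The key design point is that limits are per business area, so one team's exploration cannot starve another's.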
2.4 Data Temperature: Volume and Latency
Although the ELT approach makes it much more flexible to manage data in a Data Lake, new
analytical systems are required to respond in ever-diminishing time frames. Everyone wants
"real-time insights". To provide a very simple framework for approaching these often ill-specified
requirements, we employ an analogy based on "data temperatures".
Not only has it helped users grasp the overall tenets of data velocity, it also guides the
system's designers when determining which technical solution is best suited.
The concept is based on the latency of the processing and querying requirements. The
“hotter” the data, the lower its latency requirements. Hot data refers to near real time data
ingestion and processing. It is usually employed to implement operational use cases which
will generate insights and actions on real time events, usually related to relevant business
transactions or consumer activity.
Unless justified by some business requirement, it usually makes little sense to implement a real-
time architecture for updates to auxiliary databases that are by nature updated in batches.
Because of their business-context-related nature, these real-time applications usually
operate on limited data volumes, which are filtered and aggregated at the beginning of
the ingestion process.
An intermediate data temperature is usually associated with interactive queries and ingestion
organized in micro-batches across the day. This makes data available to the user faster, both
in terms of its availability in the data lake after ingestion and in query response times, providing
an interactive experience. If the user has to wait several hours for a query
to finish, the data cannot be classified as warm. Warm data requires high-performance
query processing and reasonably quick data ingestion pipelines, while providing the user a
broader context and complete data sets.
For historical data in long-term storage, we require a cost-efficient infrastructure which will
support data model training, report generation within a processing window, and so on. This is
achieved with cold data, which, unlike in a traditional BI system, is readily available for queries.
The figure below illustrates this data temperature concept.
Figure 2: Data Temperature
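The tiers in Figure 2 can be sketched as a trivial classifier over a latency requirement. The thresholds below are illustrative assumptions of ours, not values from the text: hot means near real time, warm means interactive or micro-batch, cold means long-running batch over historical data.

```python
def data_temperature(latency_seconds: float) -> str:
    """Map a latency requirement to a temperature tier.

    Thresholds are illustrative: under a minute is treated as near
    real time (hot), up to a few hours as interactive/micro-batch
    (warm), anything slower as batch over historical data (cold).
    """
    if latency_seconds < 60:
        return "hot"
    if latency_seconds < 4 * 3600:
        return "warm"
    return "cold"

print(data_temperature(5))       # hot
print(data_temperature(1800))    # warm
print(data_temperature(86400))   # cold
```

Even this crude mapping is useful in requirement workshops: forcing a concrete latency number out of "we want real-time insights" usually reveals that warm or cold is sufficient.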
With the data temperature concept, we can also ask which types of use cases and data
usage patterns each temperature is usually associated with. Two immediate usage modes come to
mind: operational and exploratory/analytical.
Operational data usage typically involves:
Limited data sets
Low latency and continuous flow
Integration with applications used to generate business drivers
Input generation to provide feedback to analytical models
In contrast, exploratory and analytical usage typically involves:
Complete data sets
Batch processing and ingestion
Modeling and refinement through exploration iterations
A resulting model that will be integrated into new or existing applications
Though one may say that the typical DW/BI falls into the "operational" classification, our big data
solution philosophy expands this definition.
3 REFERENCE ARCHITECTURE
This section describes the architectural decisions and components employed when designing
the Big Data DW solution. The architecture presents decision points ranging from the very
basic network and compute infrastructure up to the data modeling and consumption levels.
3.1 Underlying Infrastructure
The solution presented is based on the following components:
Isilon as the main storage
Commodity nodes for computing
Hadoop as the underlying software suite providing cluster computing and data
management
Hive as a massive data warehousing infrastructure, enabling data reporting
and KPI calculation, as well as serving as an exploratory layer, including
data discovery and ad-hoc queries
GPDB as a high performance exploratory layer, sandbox areas and serving
as a primary user interface
Analytical tools such as SAS, Tableau or QlikView
Other tools/technologies can be used together within the Hadoop ecosystem
to support specific use cases, such as NoSQL databases, micro services-
based APIs, etc.
The figure below illustrates at a high level a potential architecture and the interaction between
the infrastructure components. The architecture is divided into three main verticals, with their
interactions highlighted in the illustration. Notice that the complete cycle of data ingestion
and processing takes place inside the Storage/Data Lake layer, which is provided by Isilon
and Hadoop. A small portion of this data is also copied to the exploratory layer, provided
by GPDB, in order to offer specialized data mart and sandbox areas. In this case there is
duplication of a fraction of the data available in the Data Lake.
Figure 3: Reference Architecture – Infrastructure
A logical diagram of the architecture is presented below. The reference architecture
associates the data temperature analogy with the components and data flow of a traditional
fast data and big data architecture. We identify the operational and exploratory usage
patterns and their supporting components.
Figure 4: Reference Architecture – Components
3.2 Structuring the Data Flow in the Data Lake
The overall data flow process is based on landing zones, ingestion into a semantic layer
forming the DW, and additional processing steps to generate sandboxes and contextualized
data marts, reports and KPIs. The data flow organization is presented below:
1. Raw data arrives at a landing area.
2. After basic validation, data is ready to be processed. When a
determined time period is reached, valid raw data is compressed and
archived. At this point, the schema is flexible, the context is open,
and no aggregation is applied.
3. Enrichment is applied to valid data, simulating an ODS-like structure
(semantic layer). After a first retention period, data is archived into a lower
tier in the Isilon storage. Schema is still flexible in this level, and context is
open as no aggregations or filters are applied.
4. From the previous layer, specific aggregation and filtering processes may
be applied to the data, which is now ready to be used by reporting or
business use cases. Schema and context become dedicated after such processes.
5. By exporting to a GPDB instance (by means of external tables residing in
Hive), resulting data, either from the semantic layer or the contextualized
store, becomes accessible to SAS/Guide or SAS/VA servers.
Figure 5: Data Layers
3.3 Latency vs. Volume – Decision matrix
Some key aspects to consider when it comes to choosing technologies and components to
extract/receive and process the data include:
Volume
Velocity
Business requirements
Functional dependencies
Data source characteristics
While the first two items in the list are very technical, there are also non-technical
characteristics, such as functional requirements and/or dependencies and non-functional
requirements from the end users.
Although non-technical requirements related to business value and user experience may be
the most important factors, it is very difficult to define a general decision-making process
for choosing the architecture and components.
This requires detailed analysis, which is usually only possible when dealing with real data,
after a careful definition of the use case, including its data processing and workload requirements.
It is possible, however, to establish a few general ideas to guide which technology may be
applied by looking into simple variables (such as data volume and velocity), as shown in the
figure below.
Figure 6: Latency vs. Volume
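The matrix in Figure 6 can be reduced to a rule of thumb: only when volume or velocity is high does the choice of tool and paradigm become critical. The helper below is a hedged sketch; the thresholds (500 GB/day, 5 minutes) and the recommendation labels are illustrative assumptions, not values from the figure.

```python
def suggested_approach(volume_gb_per_day: float, latency_seconds: float) -> str:
    """Rule-of-thumb guidance mirroring the latency-vs-volume matrix.
    High volume or low latency pushes toward the lower diagonal, where
    architecture and tool choice strongly affect performance."""
    high_volume = volume_gb_per_day > 500   # illustrative threshold
    low_latency = latency_seconds < 300     # illustrative threshold
    if high_volume and low_latency:
        return "distributed streaming pipeline, carefully tuned"
    if high_volume:
        return "distributed batch processing"
    if low_latency:
        return "lightweight streaming ingestion"
    return "any simple batch tool will do"

print(suggested_approach(10, 86400))   # upper diagonal: choice not critical
print(suggested_approach(1000, 10))    # lower diagonal: choice critical
```

The "upper diagonal" of the figure corresponds to the last branch, where no specific technical guideline is needed.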
When dealing with larger processing windows, especially with lower data volumes, the choice
of tool and general architecture, as well as the programming paradigm, is not as critical to
overall performance as it is in the case of higher volumes or velocities. It may therefore be
observed in the illustration that the upper diagonal of the graph does not contain any specific
guidelines on technical options (such as when a very low volume of data must be ingested
within large time frames). The lower diagonal does contain guidelines, as data volume and/or
velocity will make system performance highly sensitive to the architecture and tool choices.
3.4 Transactional data
Normally, there are not many architectural options for extracting data from transactional
systems. Either the system itself supports some kind of streaming, or extractions must
be performed directly from the database. When the system is not event-driven or does not
provide facilities to extract events, there are basically two options left: Change Data Capture
(CDC) and Custom Extraction (CE).
CDC can be achieved with a tool that tracks the transaction logs of the database. These
tools normally support capturing data and either sending it as a near-real-time stream or
consolidating a batch and sending it to the destination. In the case of Oracle databases,
the most common tool is Oracle GoldenGate, which is used as the illustration in the
next figure.
CE is also an option when there is no latency requirement. In this case, the source
system/database limits the frequency of the extractions based on many factors, which can
be summarized in the questions below:
Can the extraction affect the load on the database?
Is there any functional requirement that prevents the data from being loaded more
often?
How frequently is the data updated in the database?
Should the data be extracted in an intermediate state?
Considering all of this, the decision on which approach to use to extract the data can be
summarized in the figure below.
Figure 7: CDC or Extraction?
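The decision in Figure 7 can be sketched as a small function. This is our simplified reading of the choice, with invented parameter names: prefer the source's own streaming, fall back to log-based CDC when low latency is required, and otherwise schedule a custom extraction.

```python
def extraction_approach(source_streams: bool, near_real_time: bool,
                        can_read_tx_logs: bool) -> str:
    """Sketch of the CDC-or-extraction decision. If the source can
    push events itself, use that; if low latency is needed and the
    transaction logs are accessible, use CDC; otherwise a scheduled
    custom extraction (CE) is enough."""
    if source_streams:
        return "native streaming"
    if near_real_time and can_read_tx_logs:
        return "CDC (log-based capture)"
    return "custom extraction (batch)"

print(extraction_approach(False, True, True))    # CDC (log-based capture)
print(extraction_approach(False, False, False))  # custom extraction (batch)
```

In the batch branch, the questions listed above (load impact, update frequency, intermediate states) then determine how often the extraction may run.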
4 CONCLUSIONS
By changing the way a traditional DW/BI is conceived, enterprises are now able to produce
business cases that were not possible in the past, either because business users were
limited by existing techniques, or simply because of a lack of computational power.
Bringing all the data from source systems in different formats, preparing it and exposing it
to the business enables the company to be more agile, generate value faster and create a
unique partnership between business and IT.
The concept of a sandbox applied to DW/BI brings the data closer to those who really need it,
eliminating the "middle man" represented by DW/BI developers. This represents high value
delivered to the business specialist in a fast and consistent way.
These newly empowered business users will not need to "explain" to others what is really
needed, and in return they will realize what can really be implemented.
Another upside of this approach is that data accuracy and business rules are validated faster,
and the use cases implemented are much more reliable, because the user can validate them
down to the detail using the sandbox data directly.
Glossary
BI Business Intelligence
CDC Change Data Capture
CE Custom Extraction
DW Data Warehouse
ETL Extract Transform and Load
ELT Extract Load and Transform
IT Information Technology
(R)DBMS (Relational) Database Management System
GPDB Greenplum Database
KPI Key Performance Indicator
NRT Near Real Time
Dell EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." DELL EMC MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE
INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying and distribution of any Dell EMC software described in this publication requires
an applicable software license.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.