Data Preparation in the Hadoop Data Lake
Michael Lang, Teradata
White Paper, 10.14 EB8458
Table of Contents
Welcome to the Hadoop Data Lake
Data Preparation in the Hadoop Data Lake
Data Preparation with Teradata Loom and Weaver
Structure
Explore
Transform
Conclusion
For More Information
Welcome to the Hadoop Data Lake
Enterprises are looking at Apache™ Hadoop® for many
reasons, such as low cost, scalability, and flexibility. The
latter in particular holds out new possibilities for data
scientists and other users across the enterprise. The
Hadoop Distributed File System (HDFS) accepts files of
any type and format, accelerating a new and revolutionary
use case known as the “data lake”.
In the data lake, enterprises use HDFS to store and
process previously unused data and combine legacy
data in new ways. Data scientists use the tools of the
Hadoop ecosystem, such as Hive, Pig, and MapReduce,
to explore the data and investigate relationships,
looking for patterns and trends in data of all sizes,
from megabytes to petabytes. This process of discovery,
preparation, analysis, and reporting is the workflow of
data science. In the data lake, analysts can study log
files and geolocation data, social media feeds and
sensor data. They can crunch through neat tabular
data, completely unstructured text, and everything in
between. This data preparation phase is highly iterative
and exploratory, and the aim is to transform data into
forms suitable for meaningful statistical analysis. All the
preparatory work leads up to more formal descriptive
analyses, predictive models and visualizations for internal
and external audiences. Ultimately, the data lake and the
data science workflow form the basis for data-driven,
company-wide decisions.
The question is, how do the enterprise and the analyst
actually make sense of the files pouring into the data lake
and manage the data effectively? The same flexible file
system that makes the data lake possible can create a
confusing warren of directories with proliferating file
types and unknown provenance. The extensible registry
of Teradata Loom® and the Teradata Loom Activescan
service provide part of the solution with metadata
management capabilities found nowhere else in the
Hadoop ecosystem. The Teradata Loom framework of
sources, datasets, transforms, and jobs gives the data
scientist an integrated view of the workflow.
For the analyst and data scientist, Teradata Loom allows
for faster discovery, and the statistics calculated by
Teradata Loom Activescan provide a solid starting point
for further analysis. Once an analyst has the right data for
the task, much of the remaining time in the data science
workflow is spent on data preparation. Practitioners
testify that getting the data in the right form often
takes up seventy, eighty, or even ninety percent of their
time. In addition to exploring the data and developing
an approach, it can also be time-consuming just to find
the right tool for the job. Reducing time required for
data preparation and enabling analysts to work more
effectively with “big data” is the next frontier for Hadoop,
and Teradata Loom is blazing the trail.
Data Preparation in the Hadoop Data Lake
There is no settled terminology for the set of activities
between acquiring and modeling data. We use the phrase
“data preparation” to describe these activities. Data
preparation seeks to turn newly-acquired “raw” data
into clean data that can be analyzed and modeled in a
meaningful way. This phase of the data science workflow,
and subsets of it, have been variously labeled wrangling,
munging, reduction, and cleansing. Teradata uses “data
preparation” to avoid jargon.
Data preparation has not always been a focus for business
analysts. Because traditional data warehouses require
orderly, fully-defined data up front, business analysts
step in, mostly for descriptive analysis, after much of
the necessary data preparation is done. Data engineers
are responsible for acquiring new data as well as turning
the data into a form suitable for analysis. Here, the
mantra for data engineering is “extract, transform, and
load”. In this paradigm, data engineers and business
analysts split responsibilities for the data science
workflow. This creates gaps in the data lake environment,
because data engineers lack a foundation in analytics,
while business analysts are unfamiliar with tools and
approaches for data preparation.
With the data lake, the mantra becomes “extract, load, and
transform”, and the data scientist bridges the gaps between
the data engineer and business analyst. The data engineer
imports and manages data of varied size, provenance, and
frequency in the data lake. The data scientist prepares data
in the data lake and conducts advanced analysis, such as
data mining and predictive modeling. The business analyst
produces final visualizations and reports from prepared data.
Of course, these roles represent a set of competencies,
and they are not mutually exclusive. Enterprises may
complete the data science workflow by hiring data
scientists with the skills necessary to bridge data
preparation and analysis. Alternatively, they can empower
data engineers and business analysts to work effectively
in the data lake. In either case, these data workers need
the right kind of tool to get through the data science
workflow and produce insights for the enterprise.
The central challenge for a data preparation tool in the
data lake is interactivity. When an analyst prepares
data in-memory on a single machine or server,
transformations often take place in near real-time.
The analyst obtains the transformed dataset almost
immediately and continues to iterate with additional
exploration and transformations.

Figure 1. The data science workflow in the data lake: Acquire, Prepare (Transform, Join), Analyze (Describe, Predict), and Report (Visualize).

In a data lake, where the
sheer volume of some data means that transformations
cannot proceed in real-time, data preparation calls for a
new approach. The key to this approach is an intuitive
user interface that balances the ultimate need for batch
processing with interactive sampling and iterative review
of transformations. This interface should be designed to
supplement, rather than replace, the existing ecosystem
of tools for preparing data.
Having established a strong foundation in metadata
management for Hadoop datasets and transformations,
Teradata Loom now provides a new approach for data
preparation with a feature called Weaver. Data scientists
finally have a power tool for the data lake: an interactive
method for preparing big data incrementally and iteratively.
Data Preparation with Teradata Loom and Weaver
Data preparation has three essential competencies:
structuring, exploring, and transforming. These
competencies, especially exploring and transforming, are
iterative and overlapping. Teradata Loom Activescan and
Weaver support the full spectrum of data preparation
tasks, while giving the user the flexibility to incorporate
other in-Hadoop and in-memory tools.
Structure

After finding the right data for a given task, a data
scientist must structure it. In the context of data
preparation, structuring the data typically means creating
a tabular structure from a flat file or collection of files.
Tools for transformation and analysis tend to expect a
tabular or matrix-like structure with observations in the
rows and fields in the columns, although the contents of
any particular cell may be arbitrarily complex. Teradata
Loom Activescan provides the framework and mechanism
for structuring data in the data lake.
Depending on the task at hand, data may or may not be
available in an easily accessible form. Many data sources
have easy-to-read formats, such as delimited text or
fixed-width text. In addition to tabular sources, data may
be available in nested formats, such as XML or JSON. Data
may be compressed or stored in binary or proprietary
formats. Finally, data may reside in “unstructured” text.
In the data lake, all of these file formats can coexist.
Structuring data may involve extracting particular
elements from the raw data. For example, nested
structures can be flattened, but some data may have to
be ignored to simplify and reduce the dimensions of the
resulting table.
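To make the idea concrete, here is a minimal sketch, outside of Teradata Loom, of flattening nested JSON records into a table with pandas; the records and field names are hypothetical, and the list-valued field is simply dropped to reduce the dimensions of the result.

```python
# A hedged sketch of structuring nested data: hypothetical JSON records
# are flattened into a tabular form suitable for transformation tools.
import pandas as pd

records = [
    {"id": 1, "user": {"name": "Ann", "city": "Dayton"}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Bob", "city": "Akron"}, "tags": ["c"]},
]

# json_normalize flattens the nested "user" object into user.name and
# user.city columns; the list-valued "tags" column is dropped here to
# keep the resulting table simple.
table = pd.json_normalize(records).drop(columns=["tags"])
print(table)
```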
Teradata Loom Activescan helps the user find the right kind
of data and structure it appropriately. Based on user
settings, Activescan identifies new files in specified directories
at a specified interval. To supplement standard formats
such as delimited text and Hive databases, Activescan
applies custom plugins to recognize, parse, and format
data. For example, Activescan uses text patterns known
as regular expressions to recognize log files and parse the
files accordingly. The resulting tables are cleanly formatted
for subsequent transformations. Similarly, users might use
Activescan to recognize sequences of ten digits separated
by two hyphens and create a column in the resulting table
for phone numbers. Activescan can also leverage Hive
SerDes for complex structuring tasks.
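As an illustration of the kind of pattern matching such plugins perform, the sketch below uses Python regular expressions to parse hypothetical log lines into columns and to recognize the ten-digit phone-number pattern; the log format and both expressions are assumptions, not Activescan's actual plugins.

```python
# A hedged sketch of regex-based recognition and parsing; the log layout
# and patterns are illustrative assumptions.
import re
import pandas as pd

log_lines = [
    "2014-10-01 12:00:03 INFO  job started",
    "2014-10-01 12:00:09 ERROR disk full",
]

# Parse each line into date, time, level, and message columns.
log_re = re.compile(r"^(?P<date>\S+) (?P<time>\S+) (?P<level>\w+)\s+(?P<message>.*)$")
rows = [log_re.match(line).groupdict() for line in log_lines]
print(pd.DataFrame(rows))

# Ten digits separated by two hyphens, as in the phone-number example above.
phone_re = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
print(phone_re.findall("call 555-123-4567 or 555-987-6543"))
```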
Figure 2. Roles in the data science workflow (Acquire, Prepare, Analyze, Report): the data engineer acquires, the data scientist prepares and analyzes, and the business analyst reports.
Explore

Having created a table or collection of tables from a data
source, a data scientist must learn more about the now-
structured data. The aim is to learn enough about the
data to know what transformations will make it suitable
for statistical analysis. The analyst’s understanding
of a dataset is founded on three things: descriptive
statistics, data samples, and visualizations. Teradata
Loom Activescan helps the user understand important
aspects of the data up front, while Teradata Loom Weaver
provides an intuitive interface for viewing samples and
planning changes necessary for analysis.
When a new table is created, Activescan automatically
calculates descriptive statistics, which indicate data
quality and guide exploration of the data. For numeric
columns, statistics such as minimum, maximum, and
mean provide a sanity check. For example, a numeric
column that contains data on a person’s age should not
have negative values. For string columns, the number
and distribution of distinct values or categories are often
statistics of interest. Across column types, Activescan
informs the user of missing or null values.
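The pandas sketch below computes the same kinds of statistics on a small, hypothetical table; it is a stand-in for Activescan, not Activescan itself.

```python
# A minimal sketch of the descriptive statistics described above,
# computed with pandas on hypothetical data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, -2, 28, np.nan],        # -2 should fail the sanity check
    "state": ["OH", "OH", "TX", None, "CA"],
})

print(df["age"].agg(["min", "max", "mean"]))   # numeric sanity checks
print((df["age"] < 0).sum(), "negative ages")  # age should not be negative
print(df["state"].value_counts(dropna=False))  # distinct-value distribution
print(df.isna().sum())                         # missing/null counts per column
```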
Beyond the descriptive statistics, the user needs to engage
directly with the data. When tables have many rows and
columns, the ability to navigate the data effectively is
essential. To start, Teradata Loom provides a flexible data
preview with a variety of subsetting functions. In addition,
Weaver gives the user access to built-in samples. Samples
can be taken from the first or last rows of the table, from
somewhere in the middle, or as a random selection of rows.
Filters allow the user to scrutinize, and ultimately transform,
subsets of the data based on the values of one or more
columns or fields.
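For readers who think in code, the pandas stand-ins below mirror these sampling and filtering strategies; the table, sample sizes, and filter condition are arbitrary, and Weaver's own sampling functions are not shown.

```python
# A hedged sketch of built-in sampling strategies and value-based filters.
import pandas as pd

df = pd.DataFrame({"x": range(1000), "y": [i % 7 for i in range(1000)]})

head = df.head(50)                      # first rows of the table
tail = df.tail(50)                      # last rows of the table
middle = df.iloc[475:525]               # somewhere in the middle
rand = df.sample(n=50, random_state=1)  # random selection of rows

# A filter scopes exploration (and later transforms) to a subset of rows
# based on the values of one or more columns.
subset = df[df["y"] == 3]
```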
Lastly, visualizations are essential to highlight meaningful
patterns in data and metadata. Weaver builds on statistics
calculated by Activescan with simple visualizations, such
as histograms and bar plots for viewing the distribution of
values in numeric and string columns, respectively.
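A minimal matplotlib sketch of these two views, on hypothetical data, looks like the following; Weaver's actual charts are not reproduced here.

```python
# A hedged sketch of the distribution views described above: a histogram
# for a numeric column and a bar plot for a string column.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "amount": [10, 12, 13, 40, 41, 42, 43, 90],
    "region": ["east", "east", "west", "west", "west", "south", "east", "east"],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
df["amount"].plot.hist(ax=ax1, bins=5, title="amount")        # numeric column
df["region"].value_counts().plot.bar(ax=ax2, title="region")  # string column
plt.tight_layout()
plt.show()
```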
Transform

Having explored the data, the data scientist proceeds
iteratively to transform columns and tables until the
data is ready for final statistical analysis. Teradata Loom
Weaver is a power tool for transformations, or transforms,
including built-in functions for column- and row-based
transforms of strings, numbers, and date/time objects.
In addition, Weaver enables the user to transform the
structure of tables. To create new tables from multiple
tables through join or union operations, Loom leverages
SQL/HiveQL. Teradata Loom automatically tracks and
displays the lineage of these transforms.
The following examples illustrate the range of transforms.
String transforms create new or coherent categorical
variables. For instance, the first three digits of a column
containing phone numbers can be split out to make a
new column for area codes. Other string transforms, such
as capitalization, substitution, and trimming whitespace,
can clean up inconsistent data. For example, the strings
“usa”, “U.S.A.”, and “ USA” can be standardized as “USA”.
Numeric transforms are mathematical or statistical
functions, such as taking the logarithm of a numeric
column. Date/time operations take a string or numeric
value as input and produce an object with specific
information about date and time. The input string
may be something like “January 1, 2013 10:35:00” or
“20130101103500”. The converted date/time object
allows the user to extract elements that might not appear
in the original string, such as day of the week. Table-level
transforms change the layout of rows and columns to
facilitate exploration, cleaning, and analysis. Columns
can be reordered, removed, or renamed. More intensive
operations include filling values and transposing rows
into columns.
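The sketch below reproduces several of these transforms in pandas on hypothetical data; Weaver's built-in functions are not shown, and the column names and formats are illustrative.

```python
# Hedged pandas equivalents of the string, numeric, date/time, and
# table-level transforms described above.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "phone": ["555-123-4567", "937-555-0100"],
    "country": ["usa", " U.S.A."],
    "revenue": [1000.0, 250000.0],
    "ts": ["January 1, 2013 10:35:00", "20130101103500"],
})

df["area_code"] = df["phone"].str.slice(0, 3)          # split out area codes
df["country"] = (df["country"].str.replace(".", "", regex=False)
                              .str.strip().str.upper())  # "usa", " U.S.A." -> "USA"
df["log_revenue"] = np.log(df["revenue"])              # numeric transform

# Convert both timestamp styles to date/time objects, then extract an
# element that does not appear in the original string.
df["when"] = pd.to_datetime(df["ts"], format="%B %d, %Y %H:%M:%S", errors="coerce")
mask = df["when"].isna()
df.loc[mask, "when"] = pd.to_datetime(df.loc[mask, "ts"], format="%Y%m%d%H%M%S")
df["weekday"] = df["when"].dt.day_name()

df = df.rename(columns={"ts": "raw_timestamp"})        # table-level transform
print(df)
```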
A Weaver session starts with a sample of a larger table.
The user applies transforms to the sample iteratively
until the sample reflects all of the necessary changes.
Weaver assists the user by providing suggestions for the
next transform, such as eliminating non-numeric or non-
matching values to turn a string column into a numeric
or date/time column. All of the executed transforms are
recorded in the Weaver session history. When the user
is satisfied with the results of the transforms, Weaver
executes the same transforms over the full table by
initiating MapReduce jobs in HDFS. The user reviews the
new table with reference to Activescan statistics and
updated Weaver samples and continues with additional
transformations as necessary. Metadata associated with
this iterative process is fully captured in the Teradata
Loom registry, and data transformations are reflected in
the lineage graph.
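Stripped of Loom's actual interface, the sample-then-scale pattern looks roughly like the sketch below: transforms developed against a sample are recorded in order and then replayed over the full table. The function names and data are hypothetical.

```python
# A hedged sketch of the iterate-on-a-sample, then apply-at-scale pattern.
import pandas as pd

def trim_whitespace(df):
    df["country"] = df["country"].str.strip()
    return df

def uppercase_country(df):
    df["country"] = df["country"].str.upper()
    return df

# The "session history": every executed transform, in order.
session_history = [trim_whitespace, uppercase_country]

def replay(df, history):
    for transform in history:
        df = transform(df)
    return df

full_table = pd.DataFrame({"country": [" usa", "U.S.A. ", "usa"]})
sample = full_table.sample(n=2, random_state=0)

sample = replay(sample, session_history)          # iterate until satisfied
full_table = replay(full_table, session_history)  # then run over the full table
```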
To get the most out of the data lake, data scientists also
need to combine tables. To create and execute joins and
unions, Teradata Loom provides a direct interface to
SQL/HiveQL. These query languages provide a familiar
abstraction over MapReduce for relational transforms.
The user can add descriptions, keywords, and other
metadata to the transforms as needed. As with Weaver
transforms, inputs and outputs are tracked automatically
in the Teradata Loom lineage graph.
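The sketch below shows the kind of HiveQL a user might issue for a join and a union; the PyHive client, connection settings, and table names are assumptions for illustration, not Loom's own interface.

```python
# A hedged sketch of relational transforms expressed in HiveQL and
# submitted from Python; all identifiers here are hypothetical.
from pyhive import hive

join_sql = """
    SELECT o.order_id, o.amount, c.region
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
"""

union_sql = """
    SELECT * FROM orders_2013
    UNION ALL
    SELECT * FROM orders_2014
"""

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()
for sql in (join_sql, union_sql):
    cursor.execute(sql)          # Hive compiles each query to MapReduce jobs
    print(cursor.fetchmany(5))
```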
Figure 4. The Teradata Loom data preparation workflow: datasets in HDFS flow through Activescan (table and column statistics, provenance, suitability), Weaver (samples, transforms), and HiveQL (joins, unions), with lineage tracking inputs, outputs, and reporting.
Conclusion
Teradata Loom provides the first complete data
management solution for Hadoop. Data engineers,
business analysts, and data scientists have the right
tools to work effectively and efficiently in the data lake.
Teradata Loom enables data workers to find, structure,
explore, and transform data faster while maintaining clear
records of provenance, lineage, and other metadata.
As a result, enterprises receive better and faster
insights from a continuous data science workflow. Hadoop
has never been more enterprise-ready.
For More Information
To find out more about data and metadata management
in Hadoop and Teradata Loom and how Teradata can help
you drive more value out of your Hadoop investments,
please contact your local Teradata representative, or visit
Teradata.com/loom.