
Data Preparation in the Hadoop Data Lake

Michael Lang, Teradata
10.14 EB8458 WHITE PAPER


Table of Contents

Welcome to the Hadoop Data Lake
Data Preparation in the Hadoop Data Lake
Data Preparation with Teradata Loom and Weaver
  Structure
  Explore
  Transform
Conclusion
For More Information

Welcome to the Hadoop Data Lake

Enterprises are looking at Apache™ Hadoop® for many reasons, such as low cost, scalability, and flexibility. The latter in particular holds out new possibilities for data scientists and other users across the enterprise. The Hadoop Distributed File System (HDFS) accepts files of any type and format, accelerating a new and revolutionary use case known as the “data lake”.

In the data lake, enterprises use HDFS to store and process previously unused data and combine legacy data in new ways. Data scientists use the tools of the Hadoop ecosystem, such as Hive, Pig, and MapReduce, to explore the data and investigate relationships, looking for patterns and trends in data of all sizes, from megabytes to petabytes. This process of discovery, preparation, analysis, and reporting is the workflow of data science. In the data lake, analysts can study log files and geolocation data, social media feeds and sensor data. They can crunch through neat tabular data, completely unstructured text, and everything in between. This data preparation phase is highly iterative and exploratory, and the aim is to transform data into forms suitable for meaningful statistical analysis. All the preparatory work leads up to more formal descriptive analyses, predictive models and visualizations for internal and external audiences. Ultimately, the data lake and the data science workflow form the basis for data-driven, company-wide decisions.

The question is, how do the enterprise and the analyst actually make sense of the files pouring into the data lake and manage the data effectively? The same flexible file system that makes the data lake possible can create a confusing warren of directories with proliferating file types and unknown provenance. The extensible registry of Teradata Loom® and the Teradata Loom Activescan service provide part of the solution, with metadata management capabilities found nowhere else in the Hadoop ecosystem. The Teradata Loom framework of sources, datasets, transforms, and jobs gives the data scientist an integrated view of the workflow.

For the analyst and data scientist, Teradata Loom allows for faster discovery, and the statistics calculated by Teradata Loom Activescan provide a solid starting point for further analysis. Once an analyst has the right data for the task, much of the remaining time in the data science workflow is spent on data preparation. Practitioners testify that getting the data into the right form often takes up seventy, eighty, or even ninety percent of their time. In addition to exploring the data and developing an approach, it can be time-consuming just to find the right tool for the job. Reducing the time required for data preparation and enabling analysts to work more effectively with “big data” is the next frontier for Hadoop, and Teradata Loom is blazing the trail.

Data Preparation in the Hadoop Data Lake

There is no settled terminology for the set of activities between acquiring and modeling data. We use the phrase “data preparation” to describe these activities. Data preparation seeks to turn newly-acquired “raw” data into clean data that can be analyzed and modeled in a meaningful way. This phase of the data science workflow, and subsets of it, have been variously labeled wrangling, munging, reduction, and cleansing. Teradata uses “data preparation” to avoid jargon.

Data preparation has not always been a focus for business analysts. Because traditional data warehouses require orderly, fully-defined data up front, business analysts pick up the workflow with mainly descriptive analysis after much of the necessary data preparation is done. Data engineers are responsible for acquiring new data as well as turning the data into a form suitable for analysis. Here, the mantra for data engineering is “extract, transform, and load”. In this paradigm, data engineers and business analysts split responsibilities for the data science workflow. This division creates gaps in the data lake environment, because data engineers lack a foundation in analytics, while business analysts are unfamiliar with tools and approaches for data preparation.

With the data lake, the mantra becomes “extract, load, and transform”, and the data scientist bridges the gaps between the data engineer and business analyst. The data engineer imports and manages data of varied size, provenance, and frequency in the data lake. The data scientist prepares data in the data lake and conducts advanced analysis, such as data mining and predictive modeling. The business analyst produces final visualizations and reports from prepared data.

Of course, these roles represent a set of competencies, and they are not mutually exclusive. Enterprises may complete the data science workflow by hiring data scientists with the skills necessary to bridge data preparation and analysis. Alternatively, they can empower data engineers and business analysts to work effectively in the data lake. In either case, these data workers need the right kind of tool to get through the data science workflow and produce insights for the enterprise.

The central challenge for a data preparation tool in the data lake is interactivity. When an analyst prepares data in-memory on a single machine or server, transformations often take place in near real-time. The analyst obtains the transformed dataset almost immediately and continues to iterate with additional exploration and transformations. In a data lake, where the sheer volume of some data means that transformations cannot proceed in real-time, data preparation calls for a new approach. The key to this approach is an intuitive user interface that balances the ultimate need for batch processing with interactive sampling and iterative review of transformations. This interface should be designed to supplement, rather than replace, the existing ecosystem of tools for preparing data.

Figure 1. The data science workflow in the data lake: acquire, prepare (transform, join), analyze (describe, predict), and report (visualize).

Having established a strong foundation in metadata management for Hadoop datasets and transformations, Teradata Loom now provides a new approach for data preparation with a feature called Weaver. Data scientists finally have a power tool for the data lake: an interactive method for preparing big data incrementally and iteratively.

Data Preparation with Teradata Loom and Weaver

Data preparation has three essential competencies: structuring, exploring, and transforming. These competencies, especially exploring and transforming, are iterative and overlapping. Teradata Loom Activescan and Weaver support the full spectrum of data preparation tasks, while giving the user the flexibility to incorporate other in-Hadoop and in-memory tools.

Structure

After finding the right data for a given task, a data scientist must structure it. In the context of data preparation, structuring the data typically means creating a tabular structure from a flat file or collection of files. Tools for transformation and analysis tend to expect a tabular or matrix-like structure with observations in the rows and fields in the columns, although the contents of any particular cell may be arbitrarily complex. Teradata Loom Activescan provides the framework and mechanism for structuring data in the data lake.

Depending on the task at hand, data may or may not be available in an easily-accessible form. Many data sources have easy-to-read formats such as delimited text or fixed-width text. In addition to tabular sources, data may be available in nested formats, such as XML or JSON. Data may be compressed or stored in binary or proprietary formats. Finally, data may reside in “unstructured” text. In the data lake, all of these file formats can coexist. Structuring data may involve extracting particular elements from the raw data. For example, nested structures can be flattened, but some data may have to be ignored to simplify and reduce the dimensions of the resulting table.
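
To make the flattening step concrete, here is a minimal HiveQL sketch of the general technique (not a Loom feature); the events_raw table and its JSON paths are hypothetical, standing in for a staging table with one JSON document per row:

  -- Hypothetical sketch: flatten nested JSON documents into a tabular form.
  -- Assumes a staging table created as: CREATE TABLE events_raw (json STRING);
  CREATE TABLE events AS
  SELECT
    get_json_object(json, '$.user.id')    AS user_id,
    get_json_object(json, '$.event.type') AS event_type,
    get_json_object(json, '$.event.ts')   AS event_ts
  FROM events_raw;

Fields that are not selected, such as deeply nested arrays, are simply ignored, which is the dimension reduction described above.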

Teradata Loom Activescan helps the user find the right kind of data and structure it appropriately. Based on user settings, Activescan identifies new files in specified directories at a specified interval. To supplement standard formats such as delimited text and Hive databases, Activescan applies custom plugins to recognize, parse, and format data. For example, Activescan uses text patterns known as regular expressions to recognize log files and parse the files accordingly. The resulting tables are cleanly formatted for subsequent transformations. Similarly, users might use Activescan to recognize sequences of ten digits separated by two hyphens and create a column in the resulting table for phone numbers. Activescan can also leverage Hive SerDes for complex structuring tasks.
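
For a sense of what such a pattern does, here is a minimal HiveQL sketch (not Activescan's internal implementation) that pulls a ten-digit, hyphen-separated phone number out of raw text; the raw_lines table is hypothetical:

  -- Hypothetical sketch: extract hyphen-separated phone numbers by regular expression.
  SELECT regexp_extract(line, '(\\d{3}-\\d{3}-\\d{4})', 1) AS phone_number
  FROM raw_lines;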

Figure 2. Roles in the data science workflow: the data engineer acquires, the data scientist prepares and analyzes, and the business analyst reports.


Explore

Having created a table or collection of tables from a data source, a data scientist must learn more about the now-structured data. The aim is to learn enough about the data to know what transformations will make it suitable for statistical analysis. The analyst’s understanding of a dataset is founded on three things: descriptive statistics, data samples, and visualizations. Teradata Loom Activescan helps the user understand important aspects of the data up front, while Teradata Loom Weaver provides an intuitive interface for viewing samples and planning changes necessary for analysis.

When a new table is created, Activescan automatically calculates descriptive statistics, which indicate data quality and guide exploration of the data. For numeric columns, statistics such as minimum, maximum, and mean provide a sanity check. For example, a numeric column that contains data on a person’s age should not have negative values. For string columns, the number and distribution of distinct values or categories are often statistics of interest. Across column types, Activescan informs the user of missing or null values.
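
Activescan calculates these statistics automatically; written out by hand in HiveQL, the equivalent sanity checks might look like the following, where the customers table and its age and state columns are hypothetical:

  -- Hypothetical sketch: the kind of per-column statistics described above.
  SELECT
    MIN(age)              AS min_age,        -- should not be negative
    MAX(age)              AS max_age,
    AVG(age)              AS mean_age,
    COUNT(*) - COUNT(age) AS missing_ages,   -- COUNT(age) skips NULLs
    COUNT(DISTINCT state) AS distinct_states
  FROM customers;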

Beyond the descriptive statistics, the user needs to engage directly with the data. When tables have many rows and columns, the ability to navigate the data effectively is essential. To start, Teradata Loom provides a flexible data preview with a variety of subsetting functions. In addition, Weaver gives the user access to built-in samples. Samples are taken from the first or last rows of the table, somewhere in the middle, or a random selection of rows. Filters allow the user to scrutinize and ultimately transform subsets of the data based on the value of one or more columns or fields.
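
Weaver’s samples are built into the interface; expressed by hand in HiveQL, comparable samples and filters might look like this, with a hypothetical weblogs table:

  -- Hypothetical sketches of the sampling and filtering strategies described above.
  SELECT * FROM weblogs LIMIT 100;                                      -- first rows
  SELECT * FROM weblogs DISTRIBUTE BY rand() SORT BY rand() LIMIT 100;  -- random rows
  SELECT * FROM weblogs WHERE status_code >= 500;                       -- filtered subset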

Lastly, visualizations are essential to highlight meaningful patterns in data and metadata. Weaver builds on statistics calculated by Activescan with simple visualizations, such as histograms and bar plots for viewing the distribution of values in numeric and string columns, respectively.

Transform

Having explored the data, the data scientist proceeds iteratively to transform columns and tables until the data is ready for final statistical analysis. Teradata Loom Weaver is a power tool for transformations, or transforms, including built-in functions for column- and row-based transforms of strings, numbers, and date/time objects. In addition, Weaver enables the user to transform the structure of tables. To create new tables from multiple tables through join or union operations, Loom leverages SQL/HiveQL. Teradata Loom automatically tracks and displays the lineage of these transforms.

Figure 3.

The following examples illustrate the range of transforms. String transforms create new or coherent categorical variables. For instance, the first three digits of a column containing phone numbers can be split out to make a new column for area codes. Other string transforms, such as capitalization, substitution, and trimming whitespace, can clean up inconsistent data. For example, the strings “usa”, “U.S.A.”, and “ USA” can be standardized as “USA”.
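
Written out in HiveQL, these two string transforms might be sketched as follows, against a hypothetical contacts table:

  -- Hypothetical sketch of the string transforms described above.
  SELECT
    substr(phone_number, 1, 3)                      AS area_code,    -- split out area code
    upper(trim(regexp_replace(country, '\\.', ''))) AS country_std   -- "usa", "U.S.A.", " USA" -> "USA"
  FROM contacts;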

Numeric transforms are mathematical or statistical functions, such as taking the logarithm of a numeric column. Date/time operations take a string or numeric value as input and produce an object with specific information about date and time. The input string may be something like “January 1, 2013 10:35:00” or “20130101103500”. The converted date/time object allows the user to extract elements that might not appear in the original string, such as day of the week.
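
In HiveQL terms, a numeric transform and a date/time conversion of the “20130101103500” style of input might be sketched as follows, with a hypothetical sales table:

  -- Hypothetical sketch: natural-log transform plus timestamp parsing
  -- and day-of-week extraction.
  SELECT
    ln(revenue)                                                     AS log_revenue,
    from_unixtime(unix_timestamp(ts_str, 'yyyyMMddHHmmss'))         AS event_time,
    from_unixtime(unix_timestamp(ts_str, 'yyyyMMddHHmmss'), 'EEEE') AS day_of_week
  FROM sales;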

Table-level transforms change the layout of rows and columns to facilitate exploration, cleaning, and analysis. Columns can be reordered, removed, or renamed. More intensive operations include filling values and transposing rows into columns.
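
Transposing rows into columns, for example, can be expressed in HiveQL with conditional aggregation; the metrics table here, with one row per host and metric, is hypothetical:

  -- Hypothetical sketch: transpose metric rows into columns, one row per host.
  SELECT
    host,
    MAX(CASE WHEN metric = 'cpu'  THEN value END) AS cpu,
    MAX(CASE WHEN metric = 'mem'  THEN value END) AS mem,
    MAX(CASE WHEN metric = 'disk' THEN value END) AS disk
  FROM metrics
  GROUP BY host;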

A Weaver session starts with a sample of a larger table. The user applies transforms to the sample iteratively until the sample reflects all of the necessary changes. Weaver assists the user by providing suggestions for the next transform, such as eliminating non-numeric or non-matching values to turn a string column into a numeric or date/time column. All of the executed transforms are recorded in the Weaver session history. When the user is satisfied with the results of the transforms, Weaver executes the same transforms over the full table by initiating MapReduce jobs in HDFS. The user reviews the new table with reference to Activescan statistics and updated Weaver samples and continues with additional transformations as necessary. Metadata associated with this iterative process is fully captured in the Teradata Loom registry, and data transformations are reflected in the lineage graph.

To get the most out of the data lake, data scientists also need to combine tables. To create and execute joins and unions, Teradata Loom provides a direct interface to SQL/HiveQL. These query languages provide a familiar abstraction over MapReduce for relational transforms. The user can add descriptions, keywords, and other metadata to the transforms as needed. As with Weaver transforms, inputs and outputs are tracked automatically in the Teradata Loom lineage graph.
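
A join of two prepared tables, expressed in the kind of HiveQL that Loom passes through to Hadoop, might look like this; the orders and customers tables are hypothetical:

  -- Hypothetical sketch: join two prepared tables into a new table.
  CREATE TABLE orders_enriched AS
  SELECT o.order_id, o.order_ts, c.state, c.segment
  FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id;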

Figure 4. The Teradata Loom data preparation workflow: Activescan (table and column statistics, suitability), Weaver (samples and transforms), HiveQL (joins and unions), and dataset lineage (provenance, input/output, reporting), all over data in HDFS.


Conclusion

Teradata Loom provides the first complete data management solution for Hadoop. Data engineers, business analysts, and data scientists have the right tools to work effectively and efficiently in the data lake. Teradata Loom enables data workers to find, structure, explore, and transform data faster while maintaining clear records of provenance, lineage, and other metadata. As a result, enterprises receive better and faster insights from a continuous data science workflow. Hadoop has never been more enterprise-ready.

For More Information

To find out more about data and metadata management in Hadoop and Teradata Loom, and how Teradata can help you drive more value out of your Hadoop investments, please contact your local Teradata representative, or visit Teradata.com/loom.

10000 Innovation Drive, Dayton, OH 45342 Teradata.com

Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.

Copyright © 2014 by Teradata Corporation. All Rights Reserved. Produced in U.S.A.

10.14 EB8458