
Page 1: Datalake Architecture

DATA LAKE ARCHITECTURE

Monojit Basu, Founder & Director TechYugadi IT Solutions & Consulting

OSI DAYS 2016, BANGALORE

Page 2: Datalake Architecture

Data Never Sleeps

Every minute

Facebook users share 216,302 photos

Dropbox users upload 833,333 new files

YouTube users share 400 hours of new video

Twitter users send 350,000 tweets

A Boeing 737 aircraft in flight generates 40 TB of data

Page 3: Datalake Architecture

EDW vs Data Lake

A Data Lake is built on the premise that every drop of data is valuable

It’s a place for capturing and exploring the huge volumes of raw data that a business generates

Explorers are diverse: business analysts, data scientists, …

even business managers (using self-service)

Goals of exploration may be loosely defined

Page 4: Datalake Architecture

EDW vs Data Lake

EDW stores filtered and processed data

For pre-meditated usage scenarios

Traditionally structured in the form of ‘cubes’

Analogy: the difference between a college library (focused on the curriculum) and the US Library of Congress

Page 5: Datalake Architecture

EDW vs Data Lake

[Diagram: Schema-on-Read vs. Schema-on-Write. On the Data Lake side, raw XML, JSON, CSV, and PDF data from a trading partner, a REST API, an invoicing system, and an orders DB lands as-is; each consumer (CRM analytics, SCM analytics, a recommendation engine) applies its own schema at read/extract time. On the Enterprise Data Warehouse side, the same sources pass through ETL into structured Sales, Operations, and Marketing stores before any use.]
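To make the contrast concrete, here is a minimal PySpark sketch of the two approaches; the HDFS paths and column names are illustrative, not from the talk.

```python
# A minimal sketch contrasting schema-on-read and schema-on-write.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-read (Data Lake style): dump raw JSON into the lake as-is;
# a schema is inferred only when an analyst actually reads the data.
raw_orders = spark.read.json("hdfs:///lake/raw/orders/")  # schema inferred at read time
raw_orders.printSchema()

# Schema-on-write (EDW style): a fixed schema is enforced before the
# data is persisted, so only conforming records reach the warehouse.
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
])
curated = spark.read.schema(orders_schema).json("hdfs:///lake/raw/orders/")
curated.write.mode("overwrite").parquet("hdfs:///warehouse/orders/")
```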

Page 6: Datalake Architecture

Why Think of a Data Lake

Business Drivers

Diverse sources of data: transactions, interactions, human and machine-generated

Routine analysis not enough – deeper insights lead to differentiation

Agile and Adaptive Business Models

Technology Drivers

Fast, cheap and scalable storage (e.g. HDFS)

Diverse data-processing engines (e.g. NoSQL)

Infinitely elastic processing power (cluster of commodity servers)

Page 7: Datalake Architecture

Application Domains

Healthcare

IoT

E-Governance

Insurance

Page 8: Datalake Architecture

What Features Should It Support

Scalable Storage Layer

3 V’s of Data Inflow

Data Discovery

Data Governance

Pluggable and Extensible Analytics

Elastic Processing Power

Multi-stakeholder and Multi-tenant Access

Page 9: Datalake Architecture

Building It On Top Of Hadoop

Data Lake doesn’t have to be Hadoop

But Hadoop has proven its prowess on planet-scale data, in terms of:

Data Volumes

Elastic Data Processing Power

The idea of a Data Lake was probably inspired by Hadoop

Naturally, a Data Lake architecture is most often built around Hadoop

Page 10: Datalake Architecture

Storage Capacity: Metrics

Normally HDFS scales even with one NameNode

Unless you have hundreds of petabytes of data

But you need to monitor the usage pattern

Are you creating too many small files (what’s the average number of blocks per file)?

How much RAM would you need for the NameNode? (a high value could mean larger GC pauses)

Internal Load (heartbeats and block reports) vs External Get and Create Requests
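As a rough illustration of the sizing questions above, here is a back-of-the-envelope sketch in Python. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the counts below are made up.

```python
# Back-of-the-envelope NameNode sizing, assuming ~150 bytes of heap
# per namespace object (file, directory, or block).
def namenode_heap_estimate(num_files, num_dirs, num_blocks,
                           bytes_per_object=150):
    """Rough NameNode heap needed to hold the namespace in memory."""
    objects = num_files + num_dirs + num_blocks
    return objects * bytes_per_object

num_files, num_dirs, num_blocks = 200_000_000, 10_000_000, 260_000_000

print("Avg blocks per file: %.2f" % (num_blocks / num_files))
gb = namenode_heap_estimate(num_files, num_dirs, num_blocks) / 2**30
print("Estimated NameNode heap: %.1f GB" % gb)
# A blocks-per-file ratio near 1 combined with a huge file count is the
# 'too many small files' symptom called out above, and a large heap
# estimate warns of longer GC pauses.
```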

Page 11: Datalake Architecture

Storage Capacity: HDFS Federation

[Diagram: Single NameNode vs. NameNode Federation. Left: one NameNode serves Get/Create requests from MR clients while also absorbing the internal load (heartbeats and block reports) from DataNodes 1..N. Right: under federation, NameNode1 and NameNode2 each manage their own block pool (Block Pool 1, Block Pool 2) spread across the same set of DataNodes.]

Page 12: Datalake Architecture

Storage Capacity: Availability

NameNode Federation does not ensure HA

Even if you don’t go for Federation, configuring high availability is recommended

Essentially set up a Standby NameNode

Active NameNode shares state with the Standby

Using a shared journal (the Quorum Journal Manager), or

Simply using an NFS-mounted shared file directory

Synchronization frequency is configurable
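For illustration, these are the main hdfs-site.xml properties involved in such an HA setup, shown here as a Python dict for brevity; the nameservice ID (‘mycluster’), host names, and ports are placeholders.

```python
# Sketch of the hdfs-site.xml properties for NameNode HA.
ha_properties = {
    # Logical name for the HA pair, and its two NameNodes
    "dfs.nameservices": "mycluster",
    "dfs.ha.namenodes.mycluster": "nn1,nn2",
    "dfs.namenode.rpc-address.mycluster.nn1": "nn-host1:8020",
    "dfs.namenode.rpc-address.mycluster.nn2": "nn-host2:8020",

    # Option 1: shared edits via the Quorum Journal Manager
    "dfs.namenode.shared.edits.dir":
        "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster",
    # Option 2 (instead): an NFS-mounted shared directory
    # "dfs.namenode.shared.edits.dir": "file:///mnt/shared/hdfs-edits",

    # Client-side failover between the active and standby NameNodes
    "dfs.client.failover.proxy.provider.mycluster":
        "org.apache.hadoop.hdfs.server.namenode.ha."
        "ConfiguredFailoverProxyProvider",
}
```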

Page 13: Datalake Architecture

Compute Capacity

Hadoop 1.0 supported one type of job (MapReduce)

MR jobs were scheduled by a ‘JobTracker’ process

Hadoop 2.0 offers a Resource Manager (YARN)

It is intended to replace the JobTracker and raise the practical cluster size limit from about 3,000 to 10,000 nodes

But more important: YARN supports different types of jobs, not just MR, running on Hadoop (see the sketch below)

Hence a Data Lake should preferably be built on YARN
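As a quick sketch of that point, the same YARN cluster accepts both MapReduce and Spark work; the jar and script names below are placeholders.

```python
# Submitting two different job types to one YARN cluster.
import subprocess

# A classic MapReduce job, scheduled by YARN rather than a JobTracker
subprocess.run([
    "hadoop", "jar", "hadoop-mapreduce-examples.jar", "wordcount",
    "/lake/raw/text", "/lake/out/wordcount",
], check=True)

# A Spark job submitted to the very same cluster via YARN
subprocess.run([
    "spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
    "my_analysis.py",
], check=True)
```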

Page 14: Datalake Architecture

Compute Capacity: YARN

[Diagram: YARN architecture. MR and Spark clients submit jobs to the Resource Manager; Node Managers on Node 1 and Node 2 host the containers. In the example, Node 1 runs the MR App Master alongside a Spark task, while Node 2 runs the Spark App Master alongside an MR task, showing that each job’s containers may be placed on any node.]

Page 15: Datalake Architecture

Data Inflow

The goal is to build a pipeline into Hadoop-native data stores

HDFS, mandatorily

Hive and HBase, preferably

Considering the variety of data formats that a Data Lake must accommodate:

A general-purpose Data Integration Tool must be chosen

For example, Pentaho Data Integration (PDI)

Page 16: Datalake Architecture

Data Inflow

Pipelines specialized for specific data formats may also be plugged in

[Diagram: a flat-file input connector (.txt) and a web-service input connector (.json) feed an HDFS output connector, while Sqoop imports database tables and Flume ships logs directly into HDFS.]
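A minimal sketch of the two generic connectors in the figure, assuming the HdfsCLI Python package (pip install hdfs); the NameNode URL, user, API endpoint, and paths are placeholders.

```python
# Flat-file and web-service inflow into HDFS, sketched with HdfsCLI.
import json
import requests
from hdfs import InsecureClient

client = InsecureClient("http://namenode:50070", user="etl")

# Flat-file input connector -> HDFS output connector
client.upload("/lake/raw/invoices/invoices.txt", "invoices.txt")

# Web-service input connector -> HDFS output connector
payload = requests.get("https://api.example.com/orders").json()
with client.write("/lake/raw/orders/orders.json", encoding="utf-8",
                  overwrite=True) as writer:
    json.dump(payload, writer)

# Database tables and server logs would instead flow in through
# Sqoop and Flume, as in the figure.
```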

Page 17: Datalake Architecture

Data Inflow: Streaming Data

Streaming Data may be processed in two ways

Simply store in the Data Lake for future analysis

Interesting tweets for building a sentiment analysis model

Store and Forward to a Real-time Analytics Engine

Even as real-time processing occurs, the source data in raw format may be useful in the future

To build / update machine learning models, for example in fraud analytics

[Diagram: incoming streams are either stored in HDFS, or stored and forwarded to a real-time analytics engine.]
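A sketch of store vs. store-and-forward for a tweet stream, assuming Kafka as the transport (kafka-python package) and HdfsCLI for the lake; topic names, hosts, and paths are placeholders.

```python
# Store-and-forward for streaming data, sketched with Kafka + HdfsCLI.
from kafka import KafkaConsumer, KafkaProducer
from hdfs import InsecureClient

hdfs = InsecureClient("http://namenode:50070", user="etl")
consumer = KafkaConsumer("tweets", bootstrap_servers="kafka:9092")
producer = KafkaProducer(bootstrap_servers="kafka:9092")

for msg in consumer:
    # STORE: always land the raw event in the lake, so it stays
    # available for future model building (sentiment, fraud, ...).
    # (Appending assumes the target file already exists.)
    hdfs.write("/lake/raw/tweets/stream.jsonl",
               data=msg.value + b"\n", append=True)
    # FORWARD: also push it on to the real-time analytics engine
    producer.send("realtime-analytics", msg.value)
```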

Page 18: Datalake Architecture

Data Analytics

A Data Lake built on HDFS will most likely use a Hadoop cluster to analyze data

Sometimes the result of the analysis may be stored back into HDFS (or possibly Hive / Hbase)

But Data Visualization and Reporting / Dashboards may work only on structured data cubes

Hence on the Analytics side, a Data Lake may need outflow paths from HDFS into structured data stores
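One such outflow path, sketched in PySpark with a placeholder JDBC target; the paths, table, and credentials are illustrative.

```python
# Pushing analyzed lake data out to a structured store for reporting.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-outflow").getOrCreate()

# Analyze raw data inside the lake...
sales = spark.read.parquet("hdfs:///lake/analyzed/sales/")
summary = sales.groupBy("region", "quarter").sum("revenue")

# ...then write the structured result to a relational store,
# where cubes and dashboards can consume it
summary.write.mode("overwrite").jdbc(
    url="jdbc:postgresql://reporting-db:5432/dw",
    table="sales_summary",
    properties={"user": "etl", "password": "secret"},
)
```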

Page 19: Datalake Architecture

Plugging In Data Analytics Engine

Jaspersoft Reporting with HDFS

[Diagram: analyzed data in HDFS flows through Jaspersoft ETL (via an HDFS input connector) into an OLAP cube, which feeds the Jaspersoft reporting engine.]

Page 20: Datalake Architecture

Data Governance

A Data Lake does not conform to a schema

Data Governance makes it possible to make sense of the data

To both analysts and administrators

Data Governance is a fairly open-ended subject

Vendors offer different techniques to solve each governance use case

But common patterns are emerging across the landscape

Page 21: Datalake Architecture

Data Governance: Analyst Use Cases

To search and retrieve ‘relevant’ data for analysis

Common Techniques:

Metadata Management

Data tagging

Text Search

Data Classification

Metadata can include technical as well as business information (linked to a Business Glossary)

Data tags are often created by users collaboratively

Page 22: Datalake Architecture

Data Governance: Admin Use Cases

Track data flow from source to end applications

Retain, replicate and archive based on usage

Track access and usage information for compliance

Lineage

Data Life-cycle Management

Auditing

Page 23: Datalake Architecture

Automated Metadata Generation

As data is ingested, suitable attributes are extracted and stored in a metadata repository

Data type (XML, PDF, text, etc.)

Data size

Creation and last access time, etc.

Even data tags can be inserted at the time of ingest

Unconditionally, e.g. ‘sales’

Conditionally, e.g. ‘holiday_sales’
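A toy sketch of this ingest-time metadata extraction, using only the Python standard library; in practice the repository would be a system like Apache Atlas, and the tagging rule is a made-up example.

```python
# Automated metadata generation at ingest time.
import mimetypes
import os
from datetime import datetime

metadata_repo = []  # stand-in for a real metadata repository

def ingest(path, source="sales"):
    stat = os.stat(path)
    entry = {
        "name": os.path.basename(path),
        "type": mimetypes.guess_type(path)[0] or "unknown",  # XML, PDF, text, ...
        "size_bytes": stat.st_size,
        "created": datetime.fromtimestamp(stat.st_ctime).isoformat(),
        "last_access": datetime.fromtimestamp(stat.st_atime).isoformat(),
        "tags": [source],              # unconditional tag, e.g. 'sales'
    }
    if datetime.now().month == 12:     # conditional tag, e.g. 'holiday_sales'
        entry["tags"].append("holiday_sales")
    metadata_repo.append(entry)
    return entry
```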

Page 24: Datalake Architecture

Apache Atlas For Data Governance

Source: http://atlas.incubator.apache.org/Architecture.html

Page 25: Datalake Architecture

Data Access And Security

By default HDFS is secured using

Kerberos for authentication, and

Unix-style file permissions for authorization

In a large data repository with diverse stakeholders you may need more control

If so, a couple of products may be considered for augmenting Data Security:

Apache Knox

Apache Ranger

Page 26: Datalake Architecture

Data Access And Security

[Diagram: Apache Knox provides perimeter security in front of the cluster; Kerberos handles authentication and Unix-style (rwx) permissions handle authorization on HDFS nodes 1..N; Apache Ranger adds federated access control on top.]

Page 27: Datalake Architecture

Why Use Ranger

Supports Federated Access Control

Can fall back on default HDFS file permissions

Manages Access Control over several Hadoop-based components, like Hive, Storm, etc.

Advanced fine-grained access control, like Deny policies for a user or group

Tag-based access control, where a collection of resources share a common access tag

For example, a few columns in a Hive table and certain files in HDFS could share a tag: ‘internal_audit’

Page 28: Datalake Architecture

Steps To Build A Data Lake

Set up a scalable data storage layer

Set up a Compute Cluster capable of running a diverse mix of Jobs

Create data flow pipeline(s) for batch jobs

Create data flow pipeline(s) for streaming data

Page 29: Datalake Architecture

Steps To Build A Data Lake

Plug in one or more Analytics Engine(s)

Set up mechanisms for efficient data discovery and data governance

Implement Data Access Controls

Design a Monitoring Infrastructure for Jobs and Resources (not covered today)

Page 30: Datalake Architecture

Building A Data Lake: Starting Points

Set up a scalable data storage layer: HDFS

Set up a Compute Cluster capable of running a diverse mix of Jobs: YARN

Create data flow pipeline(s) for batch jobs: Pentaho HDFS Connector

Create data flow pipeline(s) for streaming data: Pentaho Messaging Connector

Page 31: Datalake Architecture

Steps To Build A Data Lake

Plug in one or more Analytics Engine(s): Pentaho Reporting and Spark MLlib

Set up mechanisms for efficient data discovery and data governance: Apache Atlas

Implement Data Access Controls: Apache Ranger

Design a Monitoring Infrastructure for Jobs and Resources: Apache Ambari

Page 32: Datalake Architecture

Taking The Plunge

Do you need to plan for and build a Data Lake?

Ask yourself: what fraction of your data are you analyzing today?

What value might the unused data offer?

For marketing campaigns

For product lifecycle management

For regulatory compliance, and so on …

Engage your stakeholders from different LoBs

Is decision making being hampered by lack of data?

Page 33: Datalake Architecture

Taking The Plunge

Start small: There is a learning curve

Storing data is not enough – maintaining and stewarding the data is all-important

Design for extensibility and pluggability

Minimize vendor lock-in

Be open to change as you scale your infrastructure