Destroying Data Silos Hellmar Becker Senior IT Specialist Hadoop Summit 2015, Brussels

Page 1: Destroying Data Silos

Destroying Data Silos

Hellmar Becker

Senior IT Specialist

Hadoop Summit 2015, Brussels

Page 2: Destroying Data Silos

Who am I?


Page 3: Destroying Data Silos


Datalake in ING NL

Integrate all data sources within the bank into one processing platform

• Batch data streams

• Live transactions

• Model building for customer interaction

Open source software where possible!

Page 4: Destroying Data Silos

Zoom in: Datalake Archive


Today, let’s focus on one specific part of the story:

• Collect data in a unified format

• Store the data securely, protected against manipulation and unauthorized access

• Make data available to analytical applications (Business Intelligence, Data Science)

A Hadoop-based cluster is a good solution for addressing these targets
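The first two targets can be illustrated with a minimal sketch. It is not taken from the talk: the record layout, the function names `to_unified_record` and `digest`, and the choice of JSON lines with a SHA-256 digest are all assumptions made for illustration. The idea is that records from different sources are written in one canonical form, and a digest stored separately makes later manipulation detectable.

```python
import hashlib
import json


def to_unified_record(source: str, payload: dict) -> str:
    """Serialize a record as one canonical JSON line (sorted keys, compact)."""
    record = {"source": source, "payload": payload}
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


def digest(lines: list[str]) -> str:
    """Return a SHA-256 digest over all lines, in order."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


# Hypothetical records from a batch source and a live source.
batch = [
    to_unified_record("batch_file", {"account": "A1", "amount": 100}),
    to_unified_record("live_feed", {"amount": 25, "account": "B2"}),
]
stored_digest = digest(batch)

# Any later change to the stored lines changes the digest.
tampered = [batch[0].replace("100", "900"), batch[1]]
print(digest(batch) == stored_digest)     # unchanged data verifies
print(digest(tampered) == stored_digest)  # manipulated data does not
```

A digest only detects manipulation; preventing unauthorized access is a separate concern handled by authentication and authorization.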

Page 5: Destroying Data Silos

Circa 2000: Data Warehouse

• Based on relational database technology (Oracle, DB2, …)

• Challenge 1: Data model is difficult to adapt after the fact

• Challenge 2: Resilience and fault tolerance are not built in

• Challenge 3: Scaling proves difficult and expensive (specialized hardware)

• Challenge 4: RDBMS brings a lot of overhead – e.g. referential integrity

Modern data platforms (Hadoop, Spark, Cassandra) address many of these issues

Old World vs. New World


[Figure: data warehouse architecture diagram. Components shown: operational data, staging files, ETL, operational data, three data marts, metadata, detail data, aggregated data, reporting, analytics, predictive modeling.]

Page 6: Destroying Data Silos


Target: Data Lake Architecture

Page 7: Destroying Data Silos

Pick your battles

• Toolset in the bank has grown around RDBMS and mainframe

• We cannot sweep out everything; we have to handle legacy systems

• Plant a seed: Replace one component and connect it to all legacy interfaces

• Grow from there!


[Figure: data warehouse architecture diagram. Components shown: operational data, staging files, ETL, operational data, three data marts, metadata, detail data, aggregated data, reporting, analytics, predictive modeling.]

Page 8: Destroying Data Silos

Challenges

• Zero Touch Deployment

• Risk issues with deployment tools that require admin (root) access to servers

• Policies within the organization

• Example: The unit of consideration is a single server, but we need to look at entire clusters

• Legacy protocols – Mainframe data formats, e.g. character sets

• Security is paramount – protect sensitive data
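The character set challenge can be sketched as follows. Mainframe files are commonly encoded in an EBCDIC code page rather than ASCII or UTF-8, so they need to be converted on ingest. The code page `cp037` and the function name `ebcdic_to_utf8` are assumptions for illustration; the code page actually used on a given mainframe must be confirmed with its operators.

```python
def ebcdic_to_utf8(raw: bytes, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes with the given code page and re-encode as UTF-8.

    The default code page cp037 is an assumption; other EBCDIC variants
    (for example cp500) map some characters differently.
    """
    text = raw.decode(codepage)
    return text.encode("utf-8")


# Simulate a mainframe field by encoding a string to EBCDIC first.
sample = "HELLO".encode("cp037")
print(sample.hex())                            # EBCDIC bytes differ from ASCII
print(ebcdic_to_utf8(sample).decode("utf-8"))  # readable text after conversion
```

Real mainframe records often also contain packed decimal or binary fields, which cannot be converted with a character codec and need field-level parsing.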


Page 9: Destroying Data Silos

Security Concept

Authentication Management

• Using Kerberos – proven technology, secure but hard to configure

• Need to align access with HR database – connect to corporate directory

Authorization Management

• Uniform views across all components of a cluster

• Using Ranger to secure all services with a uniform set of policies

Auditing

• Ranger logs all interactions so that threats can be identified and eliminated

Connecting the Pieces

• Sideline challenge: Linux world and Windows world need to be connected
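The combination of uniform policies and auditing can be illustrated with a small in-memory model. This is not the Ranger API: the `Policy` and `PolicyEngine` classes, their fields, and the prefix-matching rule are hypothetical and only show the idea of one policy set applied across services, with every decision logged.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    service: str        # e.g. "hdfs" or "hive" (illustrative names)
    resource: str       # resource prefix the policy covers
    group: str          # directory group that is granted access
    actions: frozenset  # actions the group may perform


@dataclass
class PolicyEngine:
    policies: list
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, groups, service, resource, action) -> bool:
        """Allow only if some policy matches; deny by default."""
        allowed = any(
            p.service == service
            and resource.startswith(p.resource)
            and p.group in groups
            and action in p.actions
            for p in self.policies
        )
        # Every decision, allowed or denied, is recorded for auditing.
        self.audit_log.append((user, service, resource, action, allowed))
        return allowed


policies = [Policy("hdfs", "/archive/", "analysts", frozenset({"read"}))]
engine = PolicyEngine(policies)
print(engine.is_allowed("alice", {"analysts"}, "hdfs", "/archive/2015/tx.csv", "read"))
print(engine.is_allowed("bob", {"interns"}, "hdfs", "/archive/2015/tx.csv", "read"))
print(len(engine.audit_log))
```

In this sketch the group memberships are passed in by the caller; in the setup described on the slide they would come from the corporate directory.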


Page 10: Destroying Data Silos

Security Concept


Page 11: Destroying Data Silos

Agile Working


• Setup of this kind of project requires interdisciplinary cooperation

• DevOps teams provide a lot of the required skills with short communication paths

• Cooperation across department boundaries can be a challenge

• Agile delivery vs. expectations and timelines

• Manage external dependencies in a Scrum setting

Page 12: Destroying Data Silos

Shaping the Future


Existing standards do not always fit our goals and tools

Work with interdepartmental teams – DevOps, Infra, DBAs, Business, Risk, Legal

We are influencing the standards that the bank will set for coming systems!

Page 13: Destroying Data Silos

Attributions

• Hellmar in Nîmes / With Python in Mindanao, by the author

• Domtoren in het oranje licht by helena_is_here is licensed under CC BY 2.0

• Data Pipeline, ING OIB Image Bank

• Data Pipeline, ING OIB Image Bank, edited (cropped) by the author

• Baby Elephant with mother by David Rosen is licensed under CC BY 2.0

• Bruarfoss Waterfall in winter, Iceland by Diana Robinson is licensed under CC BY-ND 2.0

• Elephants at Pinnawala by Jan Arendtsz is licensed under CC BY-NC 2.0
