
DAMA Chicago - Ensuring your data lake doesn’t become a data swamp

Page 1: DAMA Chicago - Ensuring your data lake doesn’t become a data swamp

AGENDA

• Introduction

• Data Lake discussion

• Data Governance & Prevention of Data Swamps

• Q & A

ENSURING YOUR DATA LAKE DOESN’T BECOME A DATA SWAMP

DAMA CHICAGO – 2.17.2016


DATA LAKE DEFINITION

NVISIA® Confidential 2016

What is a “data lake”?

Big data has been around long enough that nearly everybody in the field can rattle off a list of tools, vendors, and terms from the Big Data world: Hadoop, NoSQL, Hortonworks, Spark, Pig, Hive, Cassandra, Cloudera, Storm, HBase, and Data Lake, to name a few. One term that caught my eye recently, and that never came up in my research on Big Data, was Data Swamp.

“A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.” – TechTarget

“‘Data Lake’: centrally managed repository using low cost technologies to land any and all data that might potentially be valuable for analysis and operationalizing that insight.” – O’Reilly

“The data lake dream is of a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment. Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.” – Forbes

“A data lake, as opposed to a data warehouse, contains the mess of raw unstructured or multi-structured data that for the most part has unrecognized value for the firm. While traditional data warehouses will clean up and convert incoming data for specific analysis and applications, the raw data residing in lakes are still waiting for applications to discover ways to manufacture insights.” – Wall Street & Technology

“A data lake is a massive, easily accessible, centralized repository of large volumes of structured and unstructured data.” – Techopedia

And you cannot forget everyone’s go-to for information: Wikipedia.

“A Data lake is a large storage repository that ‘holds data until it is needed’”


DATA LAKE PROMISE


Data Lake – Promise

The promise of a data lake is a place where you can store data in its raw form, unencumbered by validation, mastering, or quality processes, so that consumers can choose which data is of value to them, with a quick time to market.


DATA LAKE TYPICAL REALIZATION


Data Lake – Typical Realization, aka Data Swamp

Unfortunately, the best-laid plans can go awry, especially with encroaching delivery deadlines, an ill-defined purpose for the data lake, no definition of the desired analytics, ill-defined data sources…

“Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.” (source: Gartner, “Beware of the Data Lake Fallacy”)
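
As a rough illustration of the metadata point above, the sketch below records descriptive metadata when a dataset lands and lets analysts search it instead of starting from scratch. It is a minimal Python sketch, not part of the original talk: the `DatasetMetadata` fields, the `Catalog` class, and the sample dataset are all hypothetical, and a real lake would persist this in a catalog tool rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetMetadata:
    """Descriptive metadata captured at landing time (hypothetical fields)."""
    name: str
    source_system: str
    owner: str
    description: str
    landed_on: date
    tags: list[str] = field(default_factory=list)


class Catalog:
    """In-memory stand-in for a metadata catalog."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetMetadata] = {}

    def register(self, meta: DatasetMetadata) -> None:
        # Refuse to land data that nobody can describe or own.
        if not meta.description.strip() or not meta.owner.strip():
            raise ValueError(f"{meta.name}: description and owner are required")
        self._entries[meta.name] = meta

    def search(self, term: str) -> list[str]:
        # Case-insensitive match against names, descriptions, and tags.
        term = term.lower()
        return sorted(
            m.name
            for m in self._entries.values()
            if term in m.name.lower()
            or term in m.description.lower()
            or any(term in t.lower() for t in m.tags)
        )


catalog = Catalog()
catalog.register(DatasetMetadata(
    name="sales_orders_raw",
    source_system="crm",
    owner="sales-ops",
    description="Raw order exports, one file per day",
    landed_on=date(2016, 2, 17),
    tags=["sales", "orders"],
))
print(catalog.search("orders"))
```

Rejecting a dataset with no owner or description at registration time is one possible "mechanism to maintain" metadata; whether to block the landing or only flag it is a governance choice.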


DATA SWAMP CHARACTERISTICS AND MY DEFINITION


Data Swamp – Characteristics

• Large volume of data

• Unrestrained data structures

• Lack of governance around the data (“until it is needed”)

Data Swamp – My Definition

• Unstructured, ungoverned, and out-of-control data lake

• …where data is hard to find, hard to use, and is consumed out of context


DATA SWAMP PREVENTION


Techniques to prevent your Data Lake from becoming “Swamp-ish”

• Keep up the velocity of delivering data to your data lake so that potential consumers can evaluate its use – lest the data show up in shadow IT instances

• Develop safe zones, where data can be guaranteed fit-for-use, complete with validation and mastering processes – in short, “governed”

• Focus on giving consumers choices that are in their self-interest – encourage use of “trusted” data in safe zones, as opposed to “use at your own risk” data that will lead to decisions based on inconsistent, ill-defined, unmanaged data


DATA SWAMP CLEANING TECHNIQUES


Techniques to clean your Data Swamp

• Work with your consumers and integration teams early in their Data Lake integration initiatives (using a sprint-ahead approach)

• Introduce data governance processes that address their consumption scenarios

• Collaborate early and often with data scientists and analysts to operationalize new consumption ideas

• Evangelize safe zones where “trusted” data lives – partner with business consumers early and often

[Figure: Finance safe zone and Sales safe zone, each applying Quality, Mastering, and Validation]
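
To make the “Mastering” step in a safe zone concrete, here is a minimal Python sketch that merges duplicate customer records from two sources into one golden record. Everything in it is assumed for illustration: the record fields, the match on normalized email, and the newest-value-wins rule are hypothetical choices, not the approach presented in the talk.

```python
from collections import defaultdict


def normalize_email(email: str) -> str:
    # Treat emails that differ only by case or padding as the same customer.
    return email.strip().lower()


def master_customers(records):
    """Group records by normalized email and build one golden record per group."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_email(record["email"])].append(record)

    mastered = []
    for key, group in sorted(groups.items()):
        # Newest record wins per field; older records fill in missing values.
        group.sort(key=lambda r: r["updated"], reverse=True)
        golden = {"email": key, "sources": sorted({r["source"] for r in group})}
        for field_name in ("name", "phone"):
            golden[field_name] = next(
                (r[field_name] for r in group if r.get(field_name)), None
            )
        mastered.append(golden)
    return mastered


# Hypothetical sample data; "updated" uses ISO dates so string order is date order.
records = [
    {"source": "crm", "email": "Pat@Example.com ", "name": "Pat Lee",
     "phone": None, "updated": "2016-01-10"},
    {"source": "billing", "email": "pat@example.com", "name": "P. Lee",
     "phone": "555-0100", "updated": "2015-11-02"},
]
mastered = master_customers(records)
print(mastered)
```

Keeping the list of contributing sources on the golden record preserves some context, which helps consumers judge how far to trust each value.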


DATA SWAMP SAFE ZONES


Data Swamp “safe zones”

• Subject area / consumer focused locations where data can be guaranteed fit-for-use – “trusted”.

• Data governance processes (including validation, mastering, and quality) are applied to give context and consistency to data, converting it to trustworthy information

• To maintain time-to-market and relevancy to changing business objectives, these processes should be applied using an agile, sprint-ahead approach

• Early participation with business consumers is key to minimizing the impact to delivery velocity

[Figure: Finance safe zone and Sales safe zone]
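
One way to picture a safe zone is as a gate between raw and “trusted” data. The Python sketch below, assuming a hypothetical set of finance validation rules and record fields, promotes records that pass every rule and keeps the rejects together with the reasons they failed, so the raw data stays available on a “use at your own risk” basis.

```python
from typing import Callable

Record = dict[str, object]
Rule = Callable[[Record], bool]


def amount_is_number(record: Record) -> bool:
    amount = record.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(amount, (int, float)) and not isinstance(amount, bool)


# Hypothetical validation rules for a finance safe zone.
FINANCE_RULES: dict[str, Rule] = {
    "has_account_id": lambda r: bool(r.get("account_id")),
    "amount_is_number": amount_is_number,
    "currency_is_known": lambda r: r.get("currency") in {"USD", "EUR"},
}


def promote(records: list[Record], rules: dict[str, Rule]):
    """Split raw records into trusted ones and rejects with failure reasons."""
    trusted, rejected = [], []
    for record in records:
        failures = [name for name, rule in rules.items() if not rule(record)]
        if failures:
            rejected.append((record, failures))
        else:
            trusted.append(record)
    return trusted, rejected


raw_zone = [
    {"account_id": "A-100", "amount": 250.0, "currency": "USD"},
    {"account_id": "", "amount": "n/a", "currency": "USD"},
]
trusted, rejected = promote(raw_zone, FINANCE_RULES)
print(len(trusted), "trusted;", len(rejected), "rejected")
for record, failures in rejected:
    print("rejected:", failures)
```

Because the rules are plain named functions, a governance team could add or retire one per sprint without touching the promotion logic, which fits the sprint-ahead approach described above.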


DATA SWAMP CLEANING PROCESSES


Cleaning your Data Swamp (in a hurry)

The key to ensuring you actually get to provide “trusted” data is to deliver timely, relevant solutions without significantly slowing the time to market.

• Establish expectations of “trusted data” for stakeholders

• Gather information on how data is currently managed

• Align with stakeholders on the value and implementation approach for pragmatic Data Governance

• Architect a pragmatic solution that produces “trusted” data, without significantly affecting delivery velocity

• Validate that changes to people, processes and artifacts align with stakeholder goals

• Reach consensus on Data Governance implementation strategy and approach

… and do so in a way that’s palatable to your organization

… in a timely fashion (to ensure relevancy to business stakeholders)

[Figure: Quality, Mastering, Validation]


DATA GOVERNANCE IN A HURRY (SHAMELESS PLUG)




DATA GOVERNANCE IN A HURRY



DATA SWAMP ALTERNATIVES TO CLEANING


Ungoverned data encourages people to interpret their data out of context


QUESTIONS?

THANKS FOR YOUR TIME

Michael Vogt

Managing Director, Data Management

NVISIA

[email protected]
