Practical Guide to Architecting Data Lakes Presented By Avinash Ramineni

Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data Conference 2016

Agenda
• About Clairvoyant
• What is a Data Lake?
• Features of a Data Lake
• Tools
• Implementation Challenges
• Questions

Clairvoyant

Clairvoyant Services

What is a Data Lake?

“A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats.”

“A data lake is a central location in which to store all your data, regardless of its source or format.”

“Is a data lake a replacement for, or complementary to, an EDW?”

“Is a data lake just a storage layer?”

“Is just having a Hadoop environment a data lake?”

Data Lake Attributes

• Data Democratization

• Data Discovery

• Data Lineage

• Self-Service capabilities

• Metadata Management

Data Lake

Self Service Analytics

Data Governance
• Data Acquisition – the what, when, and where of data
• Data Organization – structure, format
• Data Catalog – what data exists in the lake
• Capturing Metadata
    • Data Lineage
    • Data Quality
    • Data Profile
    • Provenance of data at file and record levels
    • Business names, descriptions
• Data Provisioning
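
The catalog and metadata items above can be pictured as one record per data set. The following is a minimal sketch, assuming a simple in-memory catalog; the `CatalogEntry` fields, the `lineage` helper, and the sample data set names and paths are all hypothetical and are not taken from any of the products named in this deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One data set registered in a hypothetical data lake catalog."""
    name: str             # technical name of the data set
    business_name: str    # business-friendly name
    description: str      # business description
    location: str         # where the data lives in the lake (assumed path)
    source_system: str    # provenance: system the data came from
    upstream: List[str] = field(default_factory=list)  # lineage: direct inputs


def lineage(catalog: Dict[str, CatalogEntry], name: str) -> List[str]:
    """Return every upstream data set of `name`, nearest first."""
    seen: List[str] = []
    queue = list(catalog[name].upstream)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        if current in catalog:
            queue.extend(catalog[current].upstream)
    return seen


# Illustrative entries only.
catalog = {
    "orders_raw": CatalogEntry(
        name="orders_raw", business_name="Orders (raw)",
        description="Unmodified order extracts",
        location="/lake/raw/orders", source_system="mysql"),
    "orders_clean": CatalogEntry(
        name="orders_clean", business_name="Orders (standardized)",
        description="Cleansed orders", location="/lake/clean/orders",
        source_system="lake", upstream=["orders_raw"]),
    "orders_daily": CatalogEntry(
        name="orders_daily", business_name="Daily order totals",
        description="Orders aggregated per day",
        location="/lake/marts/orders_daily",
        source_system="lake", upstream=["orders_clean"]),
}

orders_lineage = lineage(catalog, "orders_daily")
print(orders_lineage)
```

Keeping lineage as a list of upstream names on each entry is only one possible design; it makes "where did this data set come from" a simple graph walk.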

Data Lineage

Data Lake Challenges

Guidelines

• Expect structured, semi-structured, and unstructured data

• Store metadata or a tag for the location of the schema, unstructured data included

• Store a copy of the raw input

• Keep a raw, first-mile copy of the data so that we can recover our business, or nearly so

• Replay the business if we need to

• Data Standardization – data cleansing as a workflow after ingest

• Use a format that supports your data

• Automate metadata management
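
The "store a copy of raw input" and "automate metadata management" guidelines can be sketched together. This is a minimal sketch, assuming a file-based raw zone; the `land_raw_file` function, the directory layout (source system, then ingest date), and the `.meta.json` sidecar convention are all hypothetical choices made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw_file(source: Path, raw_zone: Path, source_system: str) -> dict:
    """Copy `source` unchanged into the raw zone and write a metadata sidecar."""
    ingested_at = datetime.now(timezone.utc)
    # Assumed layout: <raw_zone>/<source_system>/<ingest date>/<file name>
    target_dir = raw_zone / source_system / ingested_at.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # byte-for-byte copy; cleansing happens later

    metadata = {
        "source_system": source_system,
        "original_path": str(source),
        "raw_path": str(target),
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "ingested_at": ingested_at.isoformat(),
        "schema_location": None,  # tag to fill in once a schema is known
    }
    # The sidecar is written automatically, so metadata never lags the data.
    sidecar = target.with_name(target.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return metadata


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    incoming = tmp_path / "orders.csv"
    incoming.write_bytes(b"id,amount\n1,9.99\n")
    meta = land_raw_file(incoming, tmp_path / "raw", source_system="mysql")
    raw_bytes = Path(meta["raw_path"]).read_bytes()
    sidecar_exists = Path(meta["raw_path"] + ".meta.json").exists()
```

Because the raw copy is never modified, later standardization steps can be rerun from it, which is what makes replaying the business possible.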

Data Lake Security

Data Security

Implementation Challenges
• Change Data Capture
    • MySQL – binlog readers
    • Oracle – Tungsten
• Updating the deltas on the data lake
• Reusable data movement workflows
    • One workflow per table? (Generate dynamic workflows based on metadata)
    • Needs to be driven off metadata
• Schema changes on the source end
• Streaming data
• Partitioning strategies on the data lake
    • Configure them in metadata
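
Two of the points above, metadata-driven workflows and applying deltas, can be illustrated in a few lines. This is a minimal sketch under stated assumptions: the `TABLE_METADATA` records, the step strings produced by `build_workflow`, and the change-event shape consumed by `apply_deltas` are all hypothetical and do not reflect the output format of any particular binlog reader or replication tool.

```python
from typing import Dict, List

# Hypothetical table metadata; in practice this would live in a metadata store.
TABLE_METADATA = [
    {"table": "orders", "source": "mysql", "key": "order_id",
     "load": "cdc", "partition_by": ["order_date"]},
    {"table": "customers", "source": "oracle", "key": "customer_id",
     "load": "full", "partition_by": []},
]


def build_workflow(meta: dict) -> List[str]:
    """Generate the ordered steps of one data movement workflow from metadata."""
    steps = [f"extract:{meta['source']}.{meta['table']}:{meta['load']}"]
    steps.append(f"land_raw:{meta['table']}")
    if meta["load"] == "cdc":
        steps.append(f"merge_deltas:{meta['table']}:key={meta['key']}")
    else:
        steps.append(f"overwrite:{meta['table']}")
    if meta["partition_by"]:
        columns = ",".join(meta["partition_by"])
        steps.append(f"partition:{meta['table']}:by={columns}")
    return steps


def apply_deltas(snapshot: Dict[int, dict], changes: List[dict],
                 key: str) -> Dict[int, dict]:
    """Apply ordered change events (insert/update/delete) to a keyed snapshot."""
    result = dict(snapshot)  # leave the input snapshot untouched
    for change in changes:
        row_key = change["row"][key]
        if change["op"] == "delete":
            result.pop(row_key, None)
        else:
            # Inserts and updates are both treated as an upsert.
            result[row_key] = change["row"]
    return result


# One generic generator replaces a hand-written workflow per table.
workflows = {m["table"]: build_workflow(m) for m in TABLE_METADATA}

snapshot = {
    1: {"order_id": 1, "status": "new"},
    2: {"order_id": 2, "status": "new"},
}
changes = [
    {"op": "update", "row": {"order_id": 1, "status": "shipped"}},
    {"op": "delete", "row": {"order_id": 2}},
    {"op": "insert", "row": {"order_id": 3, "status": "new"}},
]
merged = apply_deltas(snapshot, changes, key="order_id")
```

Adding a table or changing its partitioning then becomes a metadata change rather than a new workflow.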

Tools / Products
• Smart Catalogs
    • Waterline Data Inventory
    • Collibra Catalog
• Data Lake Management
    • Zaloni Bedrock
    • Informatica Intelligent Data Lake
• Data Governance and Metadata Management
    • Cloudera Navigator
    • Apache Atlas
    • Collibra Data Governance
    • Oracle BigData Catalog

Data Lake Trends
• Data Lakes on Cloud
• IoT Data Lakes
• Logical Data Lakes
    • Unified view of data that exists across data stores
• Data Discovery Portals

Questions

• Principal @ Clairvoyant
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/avinashramineni