Introducing
Data Lakes
Pravin Singh
Why?
• Once upon a time, there was a Data Warehouse
  – Data pre-categorized at the point of entry
  – Data well organized, but in silos
  – Common, predetermined data model for “optimal” analysis
  – Upfront DB modelling and ETL effort
  – A single source of truth, but at the cost of flexibility
  – Complex system with low tolerance for human error; IT help required for even the smallest enhancements
  – Not to forget, the high costs
• Then came the Big Bang, of Information!
• Data Lake to the Rescue
What?
Source: PwC
Benefits
• Breaks the silos
• Flexible Data Model (Schema on Read)
• Data Provenance
• No upfront modeling and data cleansing
• Low cost of ownership
• Focused on exploration, not on operations
• Can work as staging area for ETL
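
Schema on Read, sketched: the snippet below is a minimal illustration, not part of the original deck. It assumes raw events are kept as JSON lines exactly as they arrived; the field names (`user`, `amount`, `channel`, `coupon`) and the helper `read_with_schema` are hypothetical. The point is that each reader applies its own schema when it reads, so nothing has to be modelled or cleansed up front.

```python
import json

# Raw events land in the lake exactly as produced; no schema is enforced on write.
# (Hypothetical sample data for illustration.)
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "channel": "web"}',
    '{"user": "b2", "amount": 5, "coupon": "SPRING"}',
    '{"user": "c3"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select the fields and cast the types the reader wants.

    `schema` maps field name -> cast function. Fields missing from a record become None.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two readers, two schemas, same raw data.
billing_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
marketing_view = read_with_schema(
    RAW_EVENTS, {"user": str, "channel": str, "coupon": str}
)
```

Here `billing_view` and `marketing_view` are two different shapes of the same stored records. A warehouse would have forced one of those shapes (or a compromise) at load time.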
Pitfalls and Challenges
• Data Lake as Data Graveyard
• Metadata
• Governance
• Information Lifecycle Management (ILM)
• Security and Privacy
• Training
Lake Maturity
Source: PwC
Four Stages of Data Lake Adoption
1: Life Before Hadoop
– Applications stand alone with their databases
– Some applications contribute data to a data warehouse
– Analysts run reporting and analytics in data warehouse
Four Stages of Data Lake Adoption
2: Hadoop is Introduced
– Applications contribute data to Hadoop
– Hadoop runs batch MapReduce jobs
– Hadoop used for ETL into warehouse or analytic databases
– Hadoop data reintroduced into applications
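
What a batch MapReduce job does, sketched: the snippet below is a single-process illustration of the map, shuffle, and reduce phases using a word count. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["data lake", "Data warehouse", "lake of data"])
```

The same three-phase shape applies to the ETL use on this slide: map extracts and normalizes records, shuffle groups them by a key, and reduce aggregates them before loading into the warehouse.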
Four Stages of Data Lake Adoption
3: Growing the Data Lake
– Newly built systems center around Hadoop by default
– Applications use each other’s data via Hadoop
– Hadoop becomes a default data destination; governance and metadata become important
– Data warehouse use becomes the exception, where legacy or special requirements dictate
Four Stages of Data Lake Adoption
4: Data Lake and Application Cloud
– New applications are built on a Hadoop application platform around the data lake
– Hadoop matures as an elastic distributed data computing platform
– Data lake adds security and governance layers
– Data availability increases, application deployment time decreases
– Some apps still have special or legacy needs and execute independently
Questions?