The Data Lake: Managing Big Data Variety & Velocity (kirkborne.net/BDE2016/KirkBorne-BDE2016.pdf)

The Data Lake:

Managing Big Data Variety & Velocity (presented by Kirk Borne, Principal Data Scientist)

DISCOVERY AT THE SPEED OF BUSINESS

Booz | Allen | Hamilton


Why a Data Lake?

• “Data Lakes are Marketing B.S.”, says 2014 Turing Award winner Michael Stonebraker*

– He does believe the concept is real and absolutely necessary (due to outrageous growth in unstructured data), but the marketing is hyped!

– *Source: http://www.storagereview.com/key_takeaways_from_hp_big_data_2015

• Therefore, we present here the scientific rationale for the Data Lake!

Data Lake – The Business Value Proposition

Information management systems must increasingly cope with very large data volumes from many heterogeneous sources, most of which stream into a variety of organizational decision systems and thus require rapid analytics by diverse teams of analysts.

The Data Lake provides a framework for data analytics, including data storage for quick, end-user-friendly data exploration and data exploitation:

1. User-friendly data analytics: Aggregation of multiple disparate data sets into a single “system”, capable of responding to rapid on-the-fly queries based on evolving user-defined criteria

2. Data storage: Real-time integration of large and disparate data sets

Why a Data Lake?

• The 3 V’s of Big Data are not just hype – they represent really big challenges:

1. Volume

2. Velocity

3. Variety

• But… Volume is not the problem! Storage is manageable!

• Analytics (integrating and combining disparate data sources for discovery and data science) is hard…

• … especially on complex (diverse, high-Variety) and fast-moving (real-time, high-Velocity) data!

• So, collect all of your data in a large storage cluster (HDFS) and focus on making the hard stuff easier.

Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/



















Why a Data Lake?

• The Data Lake smashes silos and stores all of your data in their raw form (CSV, Excel, XML, JSON, NoSQL,…) in one big repository (e.g., HDFS, the Hadoop Distributed File System).

• Any data can be added at any time, and then becomes immediately available, because there is no data modeling or schema design or structure imposed…

• Data storage is often called schema-less, but that is not exactly right: the Data Lake is “Schema On Read” (structure is applied when the data are queried), whereas the DBMS and Enterprise Data Warehouse are “Schema On Write”.

• Typical Data Warehouse ETL (Extract, Transform, Load) is handcrafted, fragile, expensive, time-consuming, and painful to maintain!

• The Data Lake supports agile, near real-time data query, processing, and analytics.

• The Data Lake brings together all of the disparate data sources into one data hub for multiple organizational units and programs, with multi-tenancy and security:

– Multi-tenancy (single instance of the application serves multiple groups) helps segregate data, users, and groups for each program & application.

– Security is provided at the cell (data object) level (e.g., Accumulo in Hadoop).

• The Analytics tasks are unrestricted, agile, transformative, game-changing.

• Source: https://www.mapr.com/definitive-guide-data-lake
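The “Schema On Read” idea from this slide can be sketched in a few lines of Python. This is only a minimal illustration, not how HDFS or any particular query engine works: the in-memory `raw_store` list stands in for the lake, and the record fields (`sensor_id`, `temp_c`, `site`) and the helpers `ingest` and `read_with_schema` are hypothetical names introduced for the example.

```python
import csv
import io
import json

# Stand-in for the lake: raw payloads are kept exactly as they arrive,
# with no schema enforced at write time (hypothetical example data).
raw_store = []

def ingest(fmt, payload):
    """Store the raw payload as-is; a schema-on-write system would validate here."""
    raw_store.append((fmt, payload))

def read_with_schema(schema):
    """Parse every raw payload and project it onto `schema` at query time.

    `schema` maps a field name to a converter; missing or empty fields become None.
    """
    rows = []
    for fmt, payload in raw_store:
        if fmt == "json":
            records = [json.loads(payload)]
        elif fmt == "csv":
            records = list(csv.DictReader(io.StringIO(payload)))
        else:
            continue  # unknown formats simply stay in the lake, untouched
        for rec in records:
            rows.append({
                name: (convert(rec[name]) if rec.get(name) not in (None, "") else None)
                for name, convert in schema.items()
            })
    return rows

ingest("json", '{"sensor_id": "a1", "temp_c": 21.5, "site": "north"}')
ingest("csv", "sensor_id,temp_c\nb2,19\nc3,")

# The same raw data can be read under different schemas, decided at read time.
readings = read_with_schema({"sensor_id": str, "temp_c": float})
sites = read_with_schema({"site": str})
```

Nothing had to be re-ingested to ask the second question (`sites`); only the schema applied at read time changed, which is the agility the slide is pointing at.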










IoT and Data Lake – A Perfect Union

• Three E-characteristics of the Internet of Things (IoT):

– Everywhere a sensor

– Everything quantified and tracked (temporally and spatially)

– Enormous variety of data types, sources, and applications

• The Data Analytics Triage on IoT data streams:

– Discovery: Offline analytics, classic Hadoop + Spark (or MapReduce)

– Decision: Near-real-time, interactive queries (Drill) and fast analytics (Spark)

– Action: Embedded analytics, at the point of data collection (encoded IFTTT rules)

• The Data Lake:

– The launchpad for IoT inquiry, advanced analytics, and business insights

– Enables Discovery and Decision across all data collections, effectively and efficiently

– Becomes the hub and repository of all business rules (IFTTT = IF-This-Then-That) for Actions needed, Actions taken, and results of Actions
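As a rough sketch of what “encoded IFTTT” business rules with a record of Actions taken might look like, the Python below keeps a list of IF-This-Then-That rules and a log of their results. The `Rule` and `RuleHub` classes, the `overheat` rule, and the event fields are all hypothetical and chosen for illustration; the slides do not prescribe any particular rule engine.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # the "IF this" part
    action: Callable[[dict], str]      # the "THEN that" part

@dataclass
class RuleHub:
    """Holds the rules plus a log of actions taken and their results."""
    rules: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def process(self, event):
        # Evaluate every rule against the incoming event and record what fired.
        for rule in self.rules:
            if rule.condition(event):
                result = rule.action(event)
                self.log.append({"rule": rule.name, "event": event, "result": result})
        return self.log

hub = RuleHub()
hub.rules.append(Rule(
    name="overheat",
    condition=lambda e: e.get("temp_c", 0) > 90,   # assumed threshold
    action=lambda e: f"shutdown requested for {e['device']}",
))

hub.process({"device": "pump-7", "temp_c": 95})  # fires the rule
hub.process({"device": "pump-8", "temp_c": 40})  # no rule fires
```

Keeping the rules and the action log side by side mirrors the slide's point that the lake can hold Actions needed, Actions taken, and their results in one place.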


IoT and Data Lake – sample use cases

• General example of Data Analytics Triage: IoT Event Mining in the Data Lake for Actionable Intelligence:

– Behavior modeling (anomaly & trend detection) and ad hoc inquiry for Discovery

– Identifying, characterizing, & responding to events for data-driven Decisions

– Deciding which events need immediate investigation and/or intervention = Action

• Many other examples:

– Predictive Maintenance alerts (from machine / engine sensors)

– Infrastructure Monitoring alerts (from ubiquitous sensors)

– Supply chain monitoring (from manufacturing & shipping sensors)

– Web user engagement & recommendations (from web analytics data)

– Cybersecurity alerts (from network logs)

– Preventive Fraud alerts (from financial applications)

– Health alerts (from EHRs and health systems)

– Tsunami alerts (from geo sensors everywhere)

– Social event alerts or early warnings (from social media)
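One simple way to picture the Discovery, Decision, Action triage for event mining is a z-score check on sensor readings, sketched below in Python. The z-score technique, the threshold of 3.0, the function names, and the sample readings are all assumptions made for illustration; the slides do not say which anomaly-detection method is used.

```python
import statistics

def fit_baseline(history):
    """Discovery: summarize normal behavior from historical readings."""
    return statistics.mean(history), statistics.stdev(history)

def score(value, baseline):
    """Decision: how many standard deviations from normal is this event?"""
    mean, stdev = baseline
    return abs(value - mean) / stdev

def needs_action(value, baseline, threshold=3.0):
    """Action: flag events whose score exceeds the (assumed) threshold."""
    return score(value, baseline) > threshold

# Hypothetical historical readings from one sensor.
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
baseline = fit_baseline(history)

# New events arriving on the stream; only the clear outlier is flagged.
alerts = [v for v in [20.0, 20.4, 27.5] if needs_action(v, baseline)]
```

A production version would need to handle a zero standard deviation and drifting baselines; this sketch only shows how the three triage stages hand off to one another.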

Drilling Across Data Silos: SQL-on-Hadoop

• Apache Drill is perfectly suited for low-latency, fast-turnaround, performance-demanding, high-volume, hypothesis-driven tasks such as data discovery, exploration, ad hoc BI queries, and especially “day zero” analytics across all data sources, particularly large unstructured data collections.

• Apache Drill supports interactive queries rather than batch-oriented requests.

• Apache Drill can quickly serve up a snapshot of a statistics set to initiate an extended, exploratory analysis of a Data Lake, perhaps with Apache Spark (e.g., MLlib, GraphX, or Spark Streaming libraries).

• “No more coffee runs while waiting for your query to run!”

• Apache Drill (like every other Apache product) is open source.

• https://drill.apache.org/ + https://www.mapr.com/products/apache-drill
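A “fast snapshot of a statistics set” might be requested from Drill along the lines of the Python sketch below. It assumes a Drill instance whose REST interface is reachable at `localhost:8047` and accepts a JSON body with `queryType` and `query` fields, which is worth checking against the Drill documentation for the version in use. The file path, the column names (`device`, `temp_c`), and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed address of a local Drill instance; adjust for a real deployment.
DRILL_URL = "http://localhost:8047/query.json"

def build_request(sql, url=DRILL_URL):
    """Package a SQL string as a POST request for Drill's REST interface."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows (needs a running Drill server)."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.load(resp).get("rows", [])

# Hypothetical raw JSON file in the lake, queried in place with no load step.
SNAPSHOT_SQL = """
SELECT device, COUNT(*) AS n_events, AVG(temp_c) AS avg_temp
FROM dfs.`/lake/raw/sensor_events.json`
GROUP BY device
"""

request = build_request(SNAPSHOT_SQL)
# rows = run_query(SNAPSHOT_SQL)  # uncomment when a Drill server is available
```

The summary rows returned by such a query could then seed a deeper analysis in Spark, as the slide suggests.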






Agile Data Objects in the Data Lake – 1

Migrate from this:

…to this:

Source: https://www.mssqltips.com/sqlservertip/3038/compare-big-data-platforms-vs-sql-server/
















[Figure: DATA VALUE]


What a Data Lake is not… your Enterprise Data Warehouse!

Source: http://www.smartdatacollective.com/tamaradull/317681/big-data-cheat-sheet-what-executives-want-know



































More data means less uncertainty, and more laser-focused intelligence!

[Figure panels: 5, 10, 50, 100, 1000, and 10000 data points]

Source for graphics: https://rexplorations.wordpress.com/2015/09/05/animated-mean-and-sample-size/
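The claim that more data means less uncertainty can be checked with a small simulation. For independent samples with standard deviation σ, the standard error of the mean is σ/√n, so the spread of the estimated mean should shrink as the sample size grows. The Python sketch below uses the same sample sizes as the figure panels; the standard normal distribution, the number of trials, and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

def spread_of_sample_means(n, trials=100, seed=0):
    """Draw `trials` samples of size n and return the spread of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

sizes = [5, 10, 50, 100, 1000, 10000]
spreads = {n: spread_of_sample_means(n) for n in sizes}

for n, s in spreads.items():
    # Compare the simulated spread with the theoretical 1/sqrt(n) for sigma = 1.
    print(f"{n:>6} data points: spread of the mean ~ {s:.3f} (1/sqrt(n) = {n ** -0.5:.3f})")
```

Because the uncertainty falls only as 1/√n, going from 100 to 10000 data points narrows the estimate by a factor of about ten, not a hundred.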










Example: KDD (Knowledge Discovery from Data) in the Data Lake Environment

Use case: Early Warning and Monitoring Systems for Geospatial Event Discovery

Data sources:

• Data Source #1: Satellite (LANDSAT)

• Data Source #2: Aerial photos

• Data Source #3: in situ sensors

• Data Source #4: Models

Information Extracted:

• Regional events (drought)

• Local events (land use)

• Situational data (development activities)

• Predictions & Forecasts (e.g., changes in climate, forestation, agriculture,…)

KDD tools (supervised and unsupervised methods):

• Association & Link Analysis

• Correlation Discovery

• Anomaly/Novelty Discovery

• Clustering Analysis

• Principal Components

• Neural Networks / Deep Learning

• Support Vector Machines

• Bayesian Networks

• Markov Models

• Decision Trees (Random Forests)

Understanding: Develop new knowledge on causal connections and interdependencies between geospatial events at the Human-Earth system interface

ASK your data = Applications, Services, Knowledge Delivery from your Data Lake

Caption: The flow from data to information to knowledge accrues enhanced value and utility at each stage of the ASK pipeline in the Data Lake.

Data Collections: Landsat, EO-1, MODIS, AVHRR, ASTER, SRTM, NEON, NLIP, Aerial, Satellite, Maps, LIDAR, Sensor Networks, Multi-mission, Elevation Datasets & Models,…

Applications: Change-monitoring (Land Use, Drought, Environmental), Carbon Tracking, Vegetation Stress, Emergency Response to Natural Hazard Events, Human Needs, Education, Planning & Development…

Knowledge: Essential Climate Variables Land Science Agriculture Forecasting Climate Change & Variability Environmental Science Famine, Forestation, Energy

Data Services: Visual Analytics, Predictive & Prescriptive Analytics, Ad hoc data mining & analysis, Knowledge Discovery from Data, Queryable information products, Enhanced imagery products, Data product recommendations

INFORMATION PRODUCTS

Algorithms: Machine Learning, Data Mining, Visual Analytics, Machine Vision, Data Characterization

Standards & Frameworks: BPEL, PMML, SIEM, VOevent, DDDAS

Scientific Expertise: User-generated content, User-annotations data, User groups, Academic researchers, Professional societies

New Technologies: Data Lakes, Cloud / PaaS, Linked Data (RDF), Data-as-a-Service, SQL-on-Hadoop




Data Lake in Healthcare – Success Story

• Their goals were achieved:

– to have a 360-degree view of the patient in near real-time so they can consistently offer high levels of care and service, and

– to detect erroneous or fraudulent claims before payment.

• Their technical challenges for these two business needs were (before the data lake solution):

– They had siloed data sources and no real-time access to the data for their multiple business units.

– This was made even more difficult with the organization’s constant growth.

– Every time there was an acquisition, it added another data source and more complexity.

Source for graphic: http://www.datanami.com/2015/08/26/medical-insight-set-to-flow-from-semantic-data-lakes/


















Summary – The Benefits of the Data Lake

• Relevance: the biggest challenge of large data sets is not the Volume, but the data’s Velocity and Variety! Storage is affordable, but fast, complex data are hard to integrate & make accessible to analysts.

• Agility: add new complex diverse data sources on-demand; perform ad hoc queries at the speed of new questions; test new hypotheses independent of data model (e.g., no more 16-way joins!)

• Disruptive (game-changer): ingest, integrate, and gather insights from new data sources on Day Zero! (no more waiting for data model changes, schema re-engineering, and re-building the DB indices)

• Performance: Dramatically lower cost of data storage, access, and discovery, with accompanying significant performance increases in taking data-to-knowledge/intelligence (from weeks to minutes!)

• Multi-domain: Revolutionize capabilities to enhance insights and communications across multi-domain business operations.

• Maturing: strategically placed right now on a rapid growth and technology development curve (its brilliant future is still to come!).

