104
Introduction to Big Data William El Kaim Oct. 2016 V3.0

Introduction to big data

Embed Size (px)

Citation preview

Page 1: Introduction to big data

Introduction to Big Data

William El Kaim

Oct. 2016 – V3.0

Page 2: Introduction to big data

This Presentation is part of the

Enterprise Architecture Digital Codex

http://www.eacodex.com/Copyright © William El Kaim 2016 2

Page 3: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 3

Page 4: Introduction to big data

Taming the Data Deluge

Copyright © William El Kaim 2016 4Source: Domo

Page 5: Introduction to big data

Copyright © William El Kaim 2016 5

Page 6: Introduction to big data

Taming the Data Deluge

Copyright © William El Kaim 2016 6

Page 7: Introduction to big data

Taming the Data Deluge

Copyright © William El Kaim 2016 7

Page 8: Introduction to big data

At The Same Time …

Copyright © William El Kaim 2016 8

Page 10: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 10

Page 11: Introduction to big data

What is Big Data?

• A collection of data sets so large and complex that it becomes difficult to

process using on-hand database management tools or traditional data

processing applications

• Due to its technical nature, the same challenges arise in Analytics at much lower

volumes than what is traditionally considered Big Data.

• Big Data Analytics is:

• The same as ‘Small Data’ Analytics, only with the added challenges (and potential) of

large datasets (~50M records or 50GB size, or more)

• Challenges :

• Data storage and management

• De-centralized/multi-server architectures

• Performance bottlenecks, poor responsiveness

• Increasing hardware requirements

Copyright © William El Kaim 2015 11Source: SiSense

Page 12: Introduction to big data

12Copyright © William El Kaim 2015

Page 13: Introduction to big data

What is Big Data: the “Vs” to Nirvana

Copyright © William El Kaim 2016 13

Visualization

Source: James Higginbotham

Big Data: A collection of data sets so large and complex that it becomes difficult to process

using on-hand database management tools or traditional data processing applications

Big Data: When the data could not fit in Excel. Used to be 65,536 lines, Now 1,048,577.

Big Data: When it's cheaper to keep everything than spend the effort to decide what to throw

away (David Brower @dbrower)

Page 14: Introduction to big data

Six V to Nirvana

Copyright © William El Kaim 2016 14Source: Bernard Marr

Page 15: Introduction to big data

Six V to Nirvana

Copyright © William El Kaim 2016 15Source: IBM

Page 16: Introduction to big data

Six V to Nirvana

Copyright © William El Kaim 2016 16Source: IBM

Page 17: Introduction to big data

Six V to Nirvana

Copyright © William El Kaim 2016 17Source: IBM

Page 18: Introduction to big data

Six V to Nirvana

Copyright © William El Kaim 2016 18Source: IBM

Page 19: Introduction to big data

Six V to Nirvana

Copyright © William El Kaim 2016 19Source: Bernard Marr

Visualization

Page 20: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 20

Page 21: Introduction to big data

Hadoop Genesis

• Scalability issue when running jobs processing terabytes of data

• Could span dozen days just to read that amount of data on 1 computer

• Need lots of cheap computers

• To Fix speed problem

• But lead to reliability problems

• In large clusters, computers fail every day

• Cluster size is not fixed

• Need common infrastructure

• Must be efficient and reliable

Copyright © William El Kaim 2016 21

Page 22: Introduction to big data

Hadoop Genesis

Copyright © William El Kaim 2016 22

Apache Nutch

Doug Cutting

“Map-reduce”2004

“It is an important technique!”

Extended

Source: Xiaoxiao Shi

Page 23: Introduction to big data

Hadoop Genesis

Copyright © William El Kaim 2016 23

“Hadoop came from the name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.”Doug Cutting, Hadoop project creator

• Open Source Apache Project

• Written in Java

• Running on Commodity hardware and all major OS

• Linux, Mac OS/X, Windows, and Solaris

Page 24: Introduction to big data

Enters Hadoop the Big Data Refinery

• Hadoop is not replacing anything.

• Hadoop has become another component in an organizations enterprise data platform.

• Hadoop (Big Data Refinery) can ingest data from all types of different sources.

• Hadoop then interacts and has data flows with traditional systems that provide transactions and interactions (relational databases) and business intelligence and analytic systems (data warehouses).

Source: DBA Journey Blog 24Copyright © William El Kaim 2015

Page 25: Introduction to big data

Hadoop Platform

Copyright © William El Kaim 2016 25Source: Octo Technology

Page 26: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 26

Page 27: Introduction to big data

When to Use Hadoop?Two Main Use Cases

2

1

Data Mgt. And Storage

Big Data Analytics and Use

Copyright © William El Kaim 2016 27

Page 28: Introduction to big data

When to Use Hadoop?

Copyright © William El Kaim 2016 28Source: Octo Technology

1

2

Data Mgt and Storage

Data Analytics and Use

1

2

Page 29: Introduction to big data

Hadoop for Data Mgt and StorageETL Pre-processor

Copyright © William El Kaim 2016 29

ETL Pre-processor

• Shift the pre-processing of ETL in staging data warehouse to Hadoop

• Shifts high cost data warehousing to lower cost Hadoop clusters

Source: Microsoft

1

Page 30: Introduction to big data

Hadoop for Data Mgt and StorageMassive Storage

Copyright © William El Kaim 2016 30

Massive Storage

• Offloading large volume of historical data into cold storage with Hadoop

• Keep data warehouse for hot data to allow BI and analytics

• When data from cold storage is needed, it can be moved back into the warehouse

Source: Microsoft

1

Page 31: Introduction to big data

Hadoop for Data Analytics and UseData Discovery

Copyright © William El Kaim 2016 31

Data Discovery

• Keep data warehouse for operational BI and analytics

• Allow data scientists to gain new discoveries on raw data (no format or structure)

• Operationalize discoveries back into the warehouse

Source: Microsoft

2

Page 32: Introduction to big data

Hadoop for Data Analytics and UseFrom Hindsight to Insight to Foresight

• Traditional analytics tools are not well suited to capturing the full value of big data.• The volume of data is too large for

comprehensive analysis.

• The range of potential correlations and relationships between disparate data sources are too great for any analyst to test all hypotheses and derive all the value buried in the data.

• Basic analytical methods used in business intelligence and enterprise reporting tools reduce to reporting sums, counts, simple averages and running SQL queries.

• Online analytical processing is merely a systematized extension of these basic analytics that still rely on a human to direct activities specify what should be calculated.

Copyright © William El Kaim 2016 32

2

Page 33: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 33

Page 34: Introduction to big data

How to Implement Hadoop?The Rise of Big Data Fabric

• Definition:

• Bringing together disparate big data sources automatically, intelligently, and securely,

and processing them in a big data platform technology, such as Hadoop and Apache

Spark, to deliver a unified, trusted, and comprehensive view of customer and business

data.

• Big data fabric focuses on automating the process of ingestion, curation, and

integrating big data sources to deliver intelligent insights that are critical for

businesses to succeed.

• The platform minimizes complexity by automating processes, generating big data

technology and platform code automatically, and integrating workflows to simplify the

deployment.

• Big data fabric is not just about Hadoop or Spark

• It comprises several components, all of which must work in tandem to deliver a flexible,

integrated, secure, and scalable platform.

Copyright © William El Kaim 2016 34Source: Forrester

Page 35: Introduction to big data

How to Implement Hadoop?Make Big Data Fabric Part Of Your Big Data Strategy!

• Enterprise architects whose companies are pursuing a big data strategy can benefit from a big data fabric implementation that automates, secures, integrates, and curates big data sources intelligently.• Most enterprises that have a big data fabric platform are building it themselves by

integrating various core open source technologies, however it requires significant time and effort.

• Your big data fabric strategy should:• Integrate only a few big data sources at first.

• Start top-down rather than bottom-up, keeping the end in mind.

• Separate analytics from data management. Analytics tools should focus primarily on data visualization and advanced statistical/data mining algorithms with limited dependence on data management functions. Decoupling data management from data analytics reduces the time and effort needed to deliver trusted analytics.

• Create a team of experts to ensure success.

• Use automation and machine learning to accelerate deployment.

Copyright © William El Kaim 2016 35Source: Forrester

Page 36: Introduction to big data

How to Implement Hadoop?Typical Project Development Steps

Copyright © William El Kaim 2016 36

Dashboards,

Reports,

Visualization, …

CRM, ERP

Web, Mobile

Point of saleBig Data

Platform

Business

Transactions

& Interactions

Business

Intelligence

& Analytics

UnstructuredData

Log files

DB data

Exhaust Data

Social Media

Sensors, devices

Classic Data

Integration & ETL

Capture Big DataCollect data from all sources

structured &unstructured

ProcessTransform, refine,

aggregate, analyze, report

Distribute ResultsInteroperate and share data

with applications/analytics

FeedbackUse operational data w/in

the big data platform1 2 3 4

Source: HortonWorks

Page 37: Introduction to big data

How to Implement Hadoop?Typical Three Steps Process

Copyright © William El Kaim 2016 37Source: HortonWorks

Collect Explore Enrich

Page 38: Introduction to big data

How to Implement Hadoop?Typical Project Development Steps

Copyright © William El Kaim 2016 38

1. Data Source

Layer

3. Data Processing /

Analysis Layer

2. Data Storage

Layer

4. Data Output

Layer

LakeshoreRaw Data Data Lake

Bus. or Usage

oriented

extractions

Data Science

Correlations,

Analytics,

Machine

Learning

Data

Visualization

& Bus.

intelligence

Internal Data

External Data

1 2 3 3 4

Page 39: Introduction to big data

How to Implement Hadoop?CRISP Methodology

Copyright © William El Kaim 2016 39Source: The Modeling Agency and sv-europe

Page 40: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 40

Page 41: Introduction to big data

What is a Data Lake?

• A data lake is:

• Typically built using Hadoop.

• Supports structured or

unstructured data.

• Benefiting from a variety of

storage and processing tools to

extract value quickly.

• Requiring little or no processing

for adapting the structure to an

enterprise schema

• A central location in which to

store all your data in its native

form, regardless of its source or

format.

Copyright © William El Kaim 2016 41Source: Zaloni

Page 42: Introduction to big data

Who is Using Data Lakes?

Copyright © William El Kaim 2016 42Source: PWC

Page 43: Introduction to big data

Data Flow In The Data Lake

Copyright © William El Kaim 2016 43Source: PWC

Page 44: Introduction to big data

Schema On Read or Schema On Write?

• The Hadoop data lake concept can be summed up as, “Store it all in one

place, figure out what to do with it later.”

• But while this might be the general idea of your Hadoop data lake, you won’t

get any real value out of that data until you figure out a logical structure for it.

• And you’d better keep track of your metadata one way or another. It does no

good to have a lake full of data, if you have no idea what lies under the shiny

surface.

• At some point, you have to give that data a schema, especially if you want to

query it with SQL or something like it. The eternal Hadoop question is

whether to apply the brave new strategy of schema on read, or to stick with

the tried and true method of schema on write.

Copyright © William El Kaim 2016 44Source: AdaptiveSystems

Page 45: Introduction to big data

Schema On Write …

• Before any data is written in the database, the structure of that data is strictly

defined, and that metadata stored and tracked.

• Irrelevant data is discarded, data types, lengths and positions are all

delineated.

• The schema; the columns, rows, tables and relationships are all defined first for the

specific purpose that database will serve.

• Then the data is filled into its pre-defined positions. The data must all be cleansed,

transformed and made to fit in that structure before it can be stored in a process

generally referred to as ETL (Extract Transform Load).

• That is why it is called “schema on write” because the data structure is

already defined when the data is written and stored.

• For a very long time, it was believed that this was the only right way to

manage data.

Copyright © William El Kaim 2016 45Source: AdaptiveSystems

Page 46: Introduction to big data

Schema On Read …

• Revolutionary concept: “You don’t have to know what you’re going to do with

your data before you store it.”

• Data of many types, sizes, shapes and structures can all be thrown into the Hadoop

Distributed File System, and other Hadoop data storage systems.

• While some metadata needs to be stored, to know what’s in there, no need yet to know

how it will be structured!

• Therefore, the data is stored in its original granular form, with nothing thrown

away

• In fact, no structural information is defined at all when the data is stored.

• So “schema on read” implies that the schema is defined at the time the data

is read and used, not at the time that it is written and stored.

• When someone is ready to use that data, then, at that time, they define what pieces are

essential to their purpose, where to find those pieces of information that matter for that

purpose, and which pieces of the data set to ignore.

Copyright © William El Kaim 2016 46Source: AdaptiveSystems

Page 47: Introduction to big data

Schema On Read or Schema On Write?

• Schema on read options tend to be a better choice:

• for exploration, for “unknown unknowns,” when you don’t know what kind of questions

you might want to ask, or the kinds of questions might change over time.

• when you don’t have a strong need for immediate responses. They’re ideal for data

exploration projects, and looking for new insights with no specific goal in mind.

• Schema on write options tend to be very efficient for “known unknowns.”

• When you know what questions you’re going to need to ask, especially if you will need

the answers fast, schema on write is the only sensible way to go.

• This strategy works best for old school BI types of scenarios on new school big data

sets.

Copyright © William El Kaim 2016 47Source: AdaptiveSystems

Page 49: Introduction to big data

Data Lake Tools: Platfora

Copyright © William El Kaim 2016 49Source: Platfora

Page 50: Introduction to big data

Data Lake Tool: Waterline

Copyright © William El Kaim 2016 50Source: Waterline

Page 51: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 51

Page 52: Introduction to big data

What is BI on Hadoop?Three Options

Source: DremioCopyright © William El Kaim 2016 52

Page 53: Introduction to big data

What is BI on Hadoop?Three Options: ETL to Data Warehouse

• Pros

• Relational databases and their BI integrations are very mature

• Use your favorite tools

• Tableau, Excel, R, …

• Cons

• Traditional ETL tools don’t work well with modern data

• Changing schemas, complex or semi-structured data, …

• Hand-coded scripts are a common substitute

• Data freshness

• How often do you replicate/synchronize?

• Data resolution

• Can’t store all the raw data in the RDBMS (due to scalability and/or cost)

• Need to sample, aggregate or time-constrain the data

Source: DremioCopyright © William El Kaim 2016 53

Page 54: Introduction to big data

What is BI on Hadoop?Three Options: Monolithic Tools

• Indexima

• Jethro

• Looker

• Arcadia

• Atscale

• Datameer

• Platfora

• Tamr

• ZoomData

Source: Modified from Platfora

• Single piece of software on top of

Big Data

• Performs both data

visualization (BI) and execution

• Utilize sampling or manual pre-

aggregation to reduce the data

volume that the user is

interacting with

Copyright © William El Kaim 2016 54

Page 55: Introduction to big data

What is BI on Hadoop?Three Options: Monolithic Tools

• Pros

• Only one tool to learn and operate

• Easier than building and maintain ETL-to-RDBMS pipeline

• Integrated data preparation in some solutions

• Cons

• Can’t analyze the raw data

• Rely on aggregation or sampling before primary analysis

• Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)

• Can’t run arbitrary SQL queries

Source: DremioCopyright © William El Kaim 2016 55

Page 56: Introduction to big data

What is BI on Hadoop?Three Options: SQL-on-Hadoop

• The combination of a familiar interface (SQL) along with a modern computing

architecture (Hadoop) enables people to manipulate and query data in new

and powerful ways.

• There’s no shortage of SQL on Hadoop offerings, and each Hadoop

distributor seems to have its preferred flavor.

• Not all SQL-on-Hadoop tools are equal, so picking the right tool is a challenge.

Source: Datanami & Cloudera & DremioCopyright © William El Kaim 2016 56

Page 57: Introduction to big data

What is BI on Hadoop?Three Options: SQL-on-Hadoop

• SQL on Hadoop tools could be categorized as

• Interactive or Native SQL

• Batch & Data-Science SQL

• OLAP Cubes (In-memory) on Hadoop

Copyright © William El Kaim 2016 57

Page 58: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: Native SQL

• When to use it?

• Excel at executing ad-hoc SQL queries and performing self-service data exploration

often used directly by data analysts or at executing the machine-generated SQL code

from BI tools like Qlik and Tableau.

• Latency is usually measured in seconds to minutes.

• One of the key differentiator among the interactive SQL-on-Hadoop tools is how they

were built.

• Some of the tools, such as Impala and Drill, were developed from the beginning to run on

Hadoop clusters, while others are essentially ports of existing SQL engines that previously ran

on vendors’ massively parallel processing (MPP) databases

Source: DatanamiCopyright © William El Kaim 2016 58

Page 59: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: Native SQL

• Pros

• Highest performance for Big Data

workloads

• Connect to Hadoop and also NoSQL

systems

• Make Hadoop “look like a database”

• Cons

• Queries may still be too slow for

interactive analysis on many TB/PB

• Can’t defeat physics

Source: Datanami & Dremio

• Interactive• In 2012, Cloudera rolled out the first

release of Apache Impala

• MapR has been pushing the schema-less bounds of SQL querying with Apache Drill, which is based on Google‘s Dremel.

• Presto (created by Facebook, now backed by Teradata)

• VectorH (backed by Actian)

• Apache Hawq (backed by Pivotal)

• Apache Phoenix.

• BigSQL (backed by IBM)

• Big Data SQL (backed by Oracle)

• Vertica SQL on Hadoop (backed by Hewlett-Packard).

Copyright © William El Kaim 2016 59

Page 60: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: Batch & Data Science SQL

• When to use it?• Most often used for running big and complex jobs, including ETL and production data

“pipelines,” against massive data sets.

• Apache Hive is the best example of this tool category. The software essentially recreates a relational-style database atop HDFS, and then uses MapReduce (or more recently, Apache Tez) as an intermediate processing layer.

• Tools• Apache Hive, Apache Tez, Apache Spark SQL

• Pros• Potentially simpler deployment (no daemons)

• New YARN job (MapReduce/Spark) for each query

• Check-pointing support enables very long-running queries• Days to weeks (ETL work)

• Works well in tandem with machine learning (Spark)

• Cons• Latency prohibitive for for interactive analytics

• Tableau, Qlik Sense, …

• Slower than native SQL engines

Source: Datanami & DremioCopyright © William El Kaim 2016 60

Page 61: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop

• When to use it?• Data scientists doing self-service data exploration needing performance (in milliseconds to

seconds).• Apache Spark SQL pretty much owns this category, although Apache Flink could provide Spark SQL with

competition in this category.

• Often require an In-memory computing architecture,

• Tools• Apache Kylin, AtScale, Kyvos Insights,

• In-memory: Spark SQL, Apache Flink

• Pros• Fast queries on pre-aggregated data

• Can use SQL and MDX tools

• Cons• Explicit cube definition/modeling phase

• Not “self-service”

• Frequent updates required due to dependency on business logic

• Aggregation create and maintenance can be long (and large)

• User connects to and interacts with the cube• Can’t interact with the raw data

Source: DremioCopyright © William El Kaim 2016 61

Page 62: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop

• Apache Kylin lets you query massive data set at sub-second latency in 3 steps.

1. Identify a Star Schema data on Hadoop.

2. Build Cube on Hadoop.

3. Query data with ANSI-SQL and get results via ODBC, JDBC or RESTful API.

Copyright © William El Kaim 2016 62Source: Apache Kylin

Page 63: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: Synthesis

• Pros

• Continue using your favorite BI tools and SQL-based clients

• Tableau, Qlik, Power BI, Excel, R, SAS, …

• Technical analysts can write custom SQL queries

• Cons

• Another layer in your data stack

• May need to pre-aggregate the data depending on your scale

• Need a separate data preparation tool (or custom scripts)

Source: DremioCopyright © William El Kaim 2016 63

Page 64: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: Synthesis

Source: DremioCopyright © William El Kaim 2016 64

Page 65: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: Encoding Formats

• The different encoding standards result in different block sizes, and that can

impact performance.

• ORC files compress smaller than Parquet files which can be a decisive choice factor.

• Impala, for example, accesses HDFS data that’s encoded in the Parquet format, while

Hive and others support optimized row column (ORC) files, sequence files, or plain text.

• Semi-structured data format like JSON is gaining traction

• Previously Hadoop users were using MapReduce to pound unstructured data into a

more structured or relational format.

• Drill opened up SQL-based access directly to semi-structured data, such as JSON,

which is a common format found on NoSQL and SQL databases. Cloudera also recently

added support for JSON in Impala.

Source: DatanamiCopyright © William El Kaim 2016 65

Page 66: Introduction to big data

What is BI on Hadoop?SQL-on-Hadoop: Decision Tree

Source: DremioCopyright © William El Kaim 2016 66

Page 67: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 67

Page 68: Introduction to big data

What is Data Science?

• Data Science is an interdisciplinary field about processes and systems to

extract knowledge or insights from data in various forms

Copyright © William El Kaim 2016 68

Page 69: Introduction to big data

From Hindsight to Insight to Foresight

Copyright © William El Kaim 2016 69

Page 70: Introduction to big data

What is Data Science?

Copyright © William El Kaim 2016 70Source: wikipedia

Source: Forrester

Page 71: Introduction to big data

What is Data Science?Algorithms In Decision Making

Copyright © William El Kaim 2016 71

Page 72: Introduction to big data

What is Data Science?Tools: Anaconda

Copyright © William El Kaim 2016 72https://www.continuum.io/

Page 73: Introduction to big data

What is Data Science?Tools: Dataiku DSS

http://www.dataiku.com/dss/

Create Machine Learning ModelsCombine and Join Datasets

Copyright © William El Kaim 2016 73

Page 74: Introduction to big data

What is Data Science?Tools: IBM Data Science Experience

• Cloud-based development

environment for near real-time,

high performance analytics• Available on IBM Cloud Bluemix

platform

• Provides

• 250 curated data sets

• Open source tools and a

collaborative workspace like H2O,

RStudio, Jupyter Notebooks on

Apache Spark

Copyright © William El Kaim 2016 74http://datascience.ibm.com/

Page 75: Introduction to big data

What is Data Science?Tools: Tamr

http://www.tamr.com/Copyright © William El Kaim 2016 75

Page 76: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 76

Page 77: Introduction to big data

Hadoop Processing Paradigms

Batch processing

• Large amount of statics data

• Generally incurs a high-latency / Volume

Real-time processing

• Compute streaming data

• Low latency

• Velocity

Hybrid computation

• Lambda Architecture

• Volume + Velocity

Copyright © William El Kaim 2016 77Source: Rubén Casado & Cloudera

Page 78: Introduction to big data

Storage Technologies: Cost & Speed

Copyright © William El Kaim 2016 78

Page 79: Introduction to big data

Storage Technology: Encoding Format

• Apache Parquet vs. Apache Orc

Copyright © William El Kaim 2016 79

Page 80: Introduction to big data

Hadoop Batch processing

Copyright © William El Kaim 2016 80

• Scalable

• Large amount of static data

• Distributed

• Parallel

• Fault tolerant

• High latency

Volume

Source: Rubén Casado

Page 81: Introduction to big data

• MapReduce was designed by

Google as a programming model for

processing large data sets with a

parallel, distributed algorithm on a

cluster.

• Key Terminology

• Job: A “full program” - an execution of a

Mapper and Reducer across a data set

• Task: An execution of a Mapper or a

Reducer on a slice of data – a.k.a. Task-

In-Progress (TIP)

• Task Attempt: A particular instance of an

attempt to execute a task on a machine

Hadoop – Batch Processing - Map Reduce

81Copyright © William El Kaim 2016

Page 82: Introduction to big data

• Processing can occur on data stored either in a filesystem (unstructured) or

in a database (structured).

• MapReduce can take advantage of the locality of data, processing it near the

place it is stored in order to reduce the distance over which it must be

transmitted.

• "Map" step

• Each worker node applies the "map()" function to the local data, and writes the output to a

temporary storage.

• A master node ensures that only one copy of redundant input data is processed.

• "Shuffle" step

• Worker nodes redistribute data based on the output keys (produced by the "map()" function),

such that all data belonging to one key is located on the same worker node.

• "Reduce" step

• Worker nodes now process each group of output data, per key, in parallel.

Hadoop – Batch Processing - Map Reduce

82Copyright © William El Kaim 2016

Page 83: Introduction to big data

Hadoop – Batch Processing - Map Reduce

Copyright © William El Kaim 2016 83

Page 84: Introduction to big data

Real-time Processing

Copyright © William El Kaim 2016 84

• Low latency

• Continuous

unbounded

streams of data

• Distributed

• Parallel

• Fault-tolerant

Velocity

Source: Rubén Casado

Page 85: Introduction to big data

• Computational model and Infrastructure for continuous data processing, with

the ability to produce low-latency results

• Data collected continuously is naturally processed continuously (Event Processing or

Complex Event Processing -CEP)

Real-time (Stream) Processing

85Copyright © William El Kaim 2016 Source: Trivadis

Page 86: Introduction to big data

Real-time (Stream) Processing

• (Event-) Stream Processing

• A one-at-a-time processing model

• A datum is processed as it arrives

• Sub-second latency

• Difficult to process state data efficiently

• Micro-Batching

• A special case of batch processing with very small batch sizes (tiny)

• A nice mix between batching and streaming

• At cost of latency

• Gives statefull computation, making windowing an easy task

Copyright © William El Kaim 2016 86Source: Trivadis

Page 88: Introduction to big data

Real-time (Stream) Processing Arch. Pattern

Copyright © William El Kaim 2016 88Source: Cloudera

Page 89: Introduction to big data

Hybrid Computation Model

Copyright © William El Kaim 2016 89

• Low latency

• Massive data + Streaming data

• Scalable

• Combine batch and real-time results

Volume Velocity

Source: Rubén Casado

Page 90: Introduction to big data

Hybrid Computation: Lambda Architecture

• Data-processing architecture designed to handle massive quantities of data

by taking advantage of both batch- and stream-processing methods.

• A system consisting of three layers: batch processing, speed (or real-time)

processing, and a serving layer for responding to queries.

• This approach to architecture attempts to balance latency, throughput, and fault-

tolerance by using batch processing to provide comprehensive and accurate views of

batch data, while simultaneously using real-time stream processing to provide views of

online data.

• The two view outputs may be joined before presentation.

• Lambda Architecture case stories via lambda-architecture.net

Copyright © William El Kaim 2016 90Source: Kreps

Page 91: Introduction to big data

• Batch layer• Receives arriving data, combines it with historical data and recomputes results by

iterating over the entire combined data set.

• The batch layer has two major tasks:

• managing historical data; and recomputing results such as machine learning models.

• Operates on the full data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time.

• The speed layer• Is used in order to provide results in a low-latency, near real-time fashion.

• Receives the arriving data and performs incremental updates to the batch layer results.

• Thanks to the incremental algorithms implemented at the speed layer, computation cost is significantly reduced.

• The serving layer enables various queries of the results sent from the batch and speed layers.

Hybrid Computation: Lambda Architecture

91Copyright © William El Kaim 2016 Source: Kreps

Page 92: Introduction to big data

Hybrid computation: Lambda Architecture

Copyright © William El Kaim 2016 92Source: Mapr

Page 93: Introduction to big data

Hybrid computation: Kappa Architecture

• Proposal from Jay Kreps (LinkedIn) in this article.

• Then talk “Turning the database inside out with Apache Samza” by Martin Kleppmann

• Main objective

• Avoid maintaining two separate code bases for the batch and speed layers (lambda).

• Key benefits

• Handle both real-time data processing and continuous data reprocessing using a single

stream processing engine.

• Data reprocessing is an important requirement for making visible the effects of code

changes on the results.

Copyright © William El Kaim 2016 93Source: Kreps

Page 94: Introduction to big data

Hybrid computation: Kappa Architecture

• Architecture is composed of only two layers:

• The stream processing layer runs the stream processing jobs.

• Normally, a single stream processing job is run to enable real-time data processing.

• Data reprocessing is only done when some code of the stream processing job needs to be

modified.

• This is achieved by running another modified stream processing job and replying all previous data.

• The serving layer is used to query the results (like the Lambda architecture).

Copyright © William El Kaim 2016 94Source: O’Reilly

Page 95: Introduction to big data

Hybrid computation: Lambda vs. Kappa

Copyright © William El Kaim 2016 95

Lambda

Kappa

Source: Kreps

Used to value all data in a unique treatment chain

Used to provide the freshest data to customers

Page 96: Introduction to big data

Hadoop Processing Paradigms Evolutions

Copyright © William El Kaim 2016 96Source: Rubén Casado Tejedor

Page 97: Introduction to big data

Hadoop Architecture

Copyright © William El Kaim 2016 97

Metadata

Source Data

Computed Data

Data Lake

Data Preparation

Data Science Tools & Platforms

Open Data

Operational Systems

(ODS, IoT)

Existing Sources of Data

(Databases, DW, DataMart)

Data Sourcing

Dataiku, Tableau, Python, R, etc.

Batch

Streaming

Data IngestionData Sources

Data DrivenBusiness Process,

Applications and Services

Lakeshore

SQL & NoSQLDatabase

HadoopSpark

BI Tools & Platforms

Qlik, Tibco, IBM, SAP, BIME, etc.

App. Services

Cascading, Crunch, Hfactory, Hunk, Spring

for Hadoop

Feature Preparation

Page 98: Introduction to big data

Hadoop Technologies

Copyright © William El Kaim 2016

98

NoSQL Databases: Cassandra, Ceph, DynamoDB, Hbase, Hive, Impala, Ring, OpenStack Swift, etc.

Data Lake

Data PreparationData Sourcing

Data Science: Dataiku, Datameer, Tamr, R, SaS, Python, RapidMiner, etc.

Data IngestionData Sources

BI Tools & Platforms

App. Services

Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop, D3.js, Leaflet

Feature Preparation

Ingestion Technologies: Apex, Flink, Flume, Kafka, Amazon Kinesis, Nifi, Samza, Spark, Sqoop, Scribe, Storm, NFS Gateway, etc.

Distributed File System: GlusterFS, HDFS, Amazon S3, MapRFS, ElasticSearch

Batch

Streaming

Encoding Format: JSON, Rcfile, Parquet, ORCfile

Map Reduce

Event Stream & Micro Batch

Open Data

Operational Systems

(ODS, IoT)

Existing Sources of Data

(Databases, DW, DataMart)

Distributions: Cloudera, HortonWorks, MapR, SyncFusion, Amazon EMR, Azure HDInsight, Altiscale, Pachyderm, Qubole, etc.

Data Warehouse

Lakeshore & Analytics

Qlik, Tableau, Tibco, Jethro, Looker, IBM, SAP, BIME, etc.

Cassandra, Druid, DynamoDB, MongoDB, Redshift, Google BigQuery, etc.

Machine Learning: BigML, Mahout, Predicsys, Azure ML, TensorFlow, H2O, etc.

Analytics App and Services

Copyright © William El Kaim 2016

Page 99: Introduction to big data

Plan

• Taming The Data Deluge

• What is Big Data?

• What is Hadoop?

• When to use Hadoop?

• How to Implement Hadoop?

• What is a Data Lake?

• What is BI on Hadoop?

• What is Data Science?

• Hadoop Processing Paradigms

• Hadoop Tools Landscape

Copyright © William El Kaim 2016 99

Page 100: Introduction to big data

Big Data LandscapeHadoop Distributors

• Three Main pure-play Hadoop distributors

• Cloudera, Hortonworks, and MapR Technologies

• Other Hadoop distributors

• SyncFusion: Hadoop for Windows,

• Pivotal Big Data Suite

• Pachyderm

• Hadoop Cloud Provider:

• Altiscale, Amazon EMR, BigStep, Google Cloud

DataProc, HortonWorks SequenceIQ, IBM

BigInsights, Microsoft HDInsight, Oracle Big

Data, Qubole, Rackspace

• Hadoop Infrastructure as a Service

• BlueData, Packet

Copyright © William El Kaim 2016 100

Forrester Big Data Hadoop Cloud, Q1 2016

Page 101: Introduction to big data

Big Data Landscape Data Preparation

Copyright © William El Kaim 2016 101

Source: Bloor Research

Page 102: Introduction to big data

Big Data LandscapeBusiness Intelligence and Analytics Platforms

Copyright © William El Kaim 2016 102

Page 103: Introduction to big data

Big Data LandscapeHadoop Distributions To Start With

• Apache Hadoop

• Cloudera Live

• Dataiku

• Hortonworks Sandbox

• IBM BigInsights

• MapR Sandbox

• Microsoft Azure HDInsight

• Syncfusion Hadoop for Windows

• W3C Big Data Europe Platform

Copyright © William El Kaim 2016 103

Page 104: Introduction to big data

104Copyright © William El Kaim 2016

Twitter

http://www.twitter.com/welkaim

SlideShare

http://www.slideshare.net/welkaim

EA Digital Codex

http://www.eacodex.com/

Linkedin

http://fr.linkedin.com/in/williamelkaim

Claudine O'Sullivan