A short introduction to Spark and its benefits

Big Data & Data Science, 20 March 2017

Big Data & Data Science: Agenda, 18:30 to 20:15

1/ The Apache Spark ecosystem. Johan Picard, Big Data Expert

2/ SQL on Hadoop at scale: SparkSQL 2.1 & BigSQL 4.3 on 100 TB Hadoop-DS. Victor Hatinguais, Big Data Architect

3/ Social Data: Machine Learning for a social-impact project. Samed Atouati & Abdellah Lamrani Alaoui, aspiring Data Scientists, students at Ecole Centrale Paris

4/ Data Science Experience. Zied Abidi, Data Scientist

5/ How can we make data speak to detect anomalies? Pauline Clavelloux, Data Scientist

Questions & Answers, Closing

IBM | Spark 3

Power of data. Simplicity of design. Speed of innovation.

Apache Spark in 15 minutes


Apache Spark

Apache Spark is a fast and general engine for large-scale data processing.

https://spark.apache.org/


Spark History: one of the most active open-source projects

2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level
2014 – 1.2.0 released in December
2015 – 1.3.0 released in March
2015 – 1.4.0 released in June
2015 – 1.5.0 released in September
2016 – 1.6.0 released in January
2016 – 2.0.0 released in July
2016 – 2.1.0 released in December

Spark is HOT!!!
• Most active project in Hadoop ecosystem
• One of top 3 most active Apache projects
• Databricks founded by the creators of Spark from UC Berkeley’s AMPLab


Spark is the most active open source project in Big Data

Source: Syncsort – Hadoop Perspectives for 2016

2016: 900

Now 1039 contributors…


Why Spark? In-memory performance and code compactness


Spark RDD: in-memory distribution

HDFS: on-disk distribution

Why Spark? A distributed framework


Resilient Distributed Dataset

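• Create RDDs: parallelize, textFile
• Transformations: define a new RDD from an existing one; nothing is computed yet
• Get results: Actions, which trigger the computation and return a value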
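The create / transform / act cycle above can be illustrated with a minimal sketch in plain Python. This is a toy, single-machine model and not the Spark API: the class `ToyRDD` and its internals are hypothetical names invented for illustration. Only the method names (`parallelize`, `map`, `filter`, `collect`, `count`, `reduce`) are chosen to mirror operations that exist in Spark. The point it tries to show is that transformations only record work, and an action replays that recorded lineage.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (NOT the Spark API)."""

    def __init__(self, source, ops=()):
        self._source = source    # zero-argument callable yielding the input records
        self._ops = tuple(ops)   # recorded transformations, not yet executed

    @classmethod
    def parallelize(cls, items):
        # Creation: wrap an in-memory collection.
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def _run(self):
        # Replay the recorded lineage over the source records.
        stream = self._source()
        for kind, fn in self._ops:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    # Actions: trigger the computation and return a plain Python value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return reduce(fn, self._run())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


pipeline = ToyRDD.parallelize(range(1, 6)).map(square).filter(lambda v: v % 2 == 0)
ran_before_action = list(calls)   # still empty: only the lineage was recorded
result = pipeline.collect()       # the action runs map and filter
print(ran_before_action, result)  # prints: [] [4, 16]
```

Because each action replays the lineage from the source, a lost intermediate result can simply be recomputed. That is the sense in which a real RDD is "resilient", although Spark does this per partition across a cluster rather than on one list.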


Why Spark? A set of convenient APIs


Spark Programming Languages


Distributed File System, Data Preparation, SQL Engine, Stream Processing, Graph Engine, Machine Learning, Distributed R

Spark SQL, Spark Streaming, GraphX, MLlib, SparkR

Why Spark? A unified framework


Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users

Enterprise Platform
• Reliability
• Resiliency
• Security

Wide Range of Data Formats
• Files
• Semi-structured
• Databases

Spark complements Hadoop (1/3): Hadoop Strengths


In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle

Ease of Development
• Need deep Java skills
• Few abstractions available for analysts

Combine Workflows
• Only suitable for batch workloads
• Rigid processing model

Spark complements Hadoop (2/3): MapReduce Weaknesses


In-Memory Performance
• Resilient Distributed Datasets
• Unify processing

Ease of Development
• Easier APIs
• Python, Scala, Java

Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch

Spark complements Hadoop (3/3): Spark Advantages


Ease of Development

Combine Workflows

Unlimited Scale

Enterprise Platform

Wide Range of Data Formats

The Flexibility of Spark on a Stable Hadoop Platform


• Spark Shell: interactive Scala
• PySpark: interactive Python
• Spark Submit: compiled
• Notebooks: Jupyter, Zeppelin

How to develop and run a Spark job?


What Spark Is Not!

Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system

Not a data store – Spark attaches to other data stores but does not provide its own

Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well

Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle real-time complex event processing

Not a language!!!
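The micro-batching point above can be made concrete with a small sketch in plain Python. It does not use the Spark Streaming API: the function `micro_batches`, the sample events and the one-second interval are all hypothetical, chosen only to show that events are grouped into fixed time windows and processed one window at a time instead of one event at a time.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds.

    Hypothetical helper for illustration only; not part of Spark.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into the window that contains its timestamp.
        buckets[int(timestamp // interval)].append(value)
    # Emit the non-empty windows in time order, labelled by window start time.
    return [(index * interval, buckets[index]) for index in sorted(buckets)]


# Made-up sample stream: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, interval=1)
print(batches)  # prints: [(0, ['a', 'b']), (1, ['c']), (2, ['d', 'e'])]
```

In this model, event "a" arrives at 0.2 s but is only handled once its window closes at 1 s, so latency cannot drop below the batch interval. That delay is the reason the slide separates micro-batching from per-event stream processing.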


Spark and IBM


IBM has the largest investment in Spark of any company in the world

Visit www.spark.tc for more information

IBM Spark Technology Center

https://ibm.biz/hadoop-jira
https://ibm.biz/spark-jira

• One of the top committers/contributors
• 300+ inventors
• Commitment to educate 1 million data scientists
• Contributed SystemML
• Founding member of AMPLab
• Partnerships in the ecosystem


Leadership in Spark

The Spark Technology Center (STC) has contributed 829 code changes to Spark components since we started around the middle of 2015.

STC contributions break down as follows: 52% to Spark SQL, 16% to PySpark, 26% to ML and MLlib. For more details, use this dashboard: https://www.ibm.biz/spark-jira


Data Science Experience (DSX)


ALL YOUR TOOLS IN ONE PLACE

IBM Data Science Experience is an environment that brings together everything that a Data Scientist needs. It includes the most popular open source tools and IBM's unique value-add functionalities, with community and social features integrated as first-class citizens, to make Data Scientists more successful.

datascience.ibm.com


Power of data. Simplicity of design. Speed of innovation.

PoT IBM on Google

May 9: Handling massive data with Spark
May 10: Machine learning training using DSX
