Big Data & Data Science, 20 March 2017
Big Data & Data Science: Agenda – 18:30 / 20:15
1/ The Apache Spark ecosystem. Johan Picard, Big Data Expert
2/ SQL on Hadoop at scale – SparkSQL 2.1 & BigSQL 4.3 on 100TB Hadoop-DS. Victor Hatinguais, Big Data Architect
3/ Social Data: Machine Learning for a project with a social purpose. Samed Atouati & Abdellah Lamrani Alaoui, aspiring Data Scientists, students at Ecole Centrale Paris
4/ Data Science Experience. Zied Abidi, Data Scientist
5/ How to make data speak to detect anomalies? Pauline Clavelloux, Data Scientist
Questions & Answers - Closing
IBM | Spark 3
Power of data. Simplicity of design. Speed of innovation.
Apache Spark in 15 minutes
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
https://spark.apache.org/
Spark History: one of the most active open-source projects
2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level project
2014 – 1.2.0 released in December
2015 – 1.3.0 released in March
2015 – 1.4.0 released in June
2015 – 1.5.0 released in September
2016 – 1.6.0 released in January
2016 – 2.0.0 released in July
2016 – 2.1.0 released in December
Spark is HOT!!!
• Most active project in the Hadoop ecosystem
• One of the top 3 most active Apache projects
• Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
Spark is the most active open source project in Big Data
Source: Syncsort – Hadoop Perspectives for 2016
[Chart comparing 2014, 2015 and 2016; legible value: 900]
Now 1039 contributors…
Why Spark? In-memory performance and code compactness
Why Spark? A distributed framework
• Spark RDD: in-memory distribution
• HDFS: on-disk distribution
Resilient Distributed Dataset
Create RDDs: parallelize, textFile, Transformations
Get results: Actions
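The split between transformations and actions on this slide can be sketched in a few lines of plain Python. This is a toy model, not Spark's API: the class `LazyDataset` is hypothetical, and only the method names (`parallelize`, `map`, `filter`, `collect`, `count`) are borrowed from Spark's RDD vocabulary. It assumes the usual RDD behaviour that transformations only record lineage and that work happens when an action is called.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not part of Spark)."""

    def __init__(self, source: Callable[[], Iterable], lineage: Tuple = ()):
        self._source = source    # how to (re)read the base data
        self._lineage = lineage  # ordered list of recorded transformations

    @classmethod
    def parallelize(cls, items: Iterable) -> "LazyDataset":
        data = list(items)
        return cls(lambda: data)

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def _iterate(self) -> Iterator:
        # Replay the recorded lineage over the source data.
        stream = iter(self._source())
        for kind, fn in self._lineage:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    # Actions: trigger the computation and return a result.
    def collect(self) -> List:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


calls = []  # records every element the map step actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset.parallelize(range(1, 6))
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has run yet

result = even_squares.collect()   # first action: runs the pipeline
total = even_squares.count()      # second action: replays the lineage
```

In this sketch each action replays the whole lineage, so `square` runs once per element per action. That recomputation is the cost that keeping an intermediate dataset in memory is meant to avoid.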
Why Spark? A set of convenient APIs
Spark Programming Languages
Why Spark? A unified framework
Distributed File System, Data Preparation, SQL Engine, Stream Processing, Graph Engine, Machine Learning, Distributed R
Spark SQL, Spark Streaming, GraphX, MLlib, SparkR
Spark complements Hadoop (1/3): Hadoop Strengths
Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users
Enterprise Platform
• Reliability
• Resiliency
• Security
Wide Range of Data Formats
• Files
• Semi-structured
• Databases
Spark complements Hadoop (2/3): MapReduce Weaknesses
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
Spark complements Hadoop (3/3): Spark Advantages
In-Memory Performance
• Resilient Distributed Datasets
• Unify processing
Ease of Development
• Easier APIs
• Python, Scala, Java
Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch
The Flexibility of Spark on a Stable Hadoop Platform
• In-Memory Performance
• Ease of Development
• Combine Workflows
• Unlimited Scale
• Enterprise Platform
• Wide Range of Data Formats
How to develop and run a Spark job?
• Spark Shell: interactive Scala
• PySpark: interactive Python
• Spark Submit: compiled
• Notebooks: Jupyter, Zeppelin
What Spark Is Not!
Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system
Not a data store – Spark attaches to other data stores but does not provide its own
Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well
Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle real-time complex event processing
Not a language!!!
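The micro-batching point in the list above can be illustrated with a small plain-Python sketch. It does not use Spark Streaming. The helper names `micro_batches` and `waiting_times`, the sample events and the 1000 ms interval are all assumptions made for illustration. The idea shown is that events are grouped into fixed time intervals and only become visible to processing when their interval closes, so each event waits up to one full interval.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (timestamp in milliseconds, payload)


def micro_batches(events: List[Event], interval_ms: int) -> List[Tuple[int, List[str]]]:
    """Group events into fixed-width intervals; return (close_time_ms, payloads).

    Intervals that received no event are skipped in this sketch.
    """
    grouped: Dict[int, List[str]] = defaultdict(list)
    for timestamp_ms, payload in events:
        grouped[timestamp_ms // interval_ms].append(payload)
    return [((index + 1) * interval_ms, grouped[index]) for index in sorted(grouped)]


def waiting_times(events: List[Event], interval_ms: int) -> List[int]:
    """Delay between each event's arrival and the close of its batch."""
    return [
        (timestamp_ms // interval_ms + 1) * interval_ms - timestamp_ms
        for timestamp_ms, _ in events
    ]


# Hypothetical sample data: four events over roughly three seconds.
events = [(200, "a"), (900, "b"), (1100, "c"), (2700, "d")]

batches = micro_batches(events, interval_ms=1000)
delays = waiting_times(events, interval_ms=1000)
worst_delay_ms = max(delays)
```

With these sample events, `c` arrives just after a batch boundary and waits almost a full interval before its batch closes. A record-at-a-time engine would not impose that floor on latency, which is the distinction the slide draws.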
Spark and IBM
IBM has the largest investment in Spark of any company in the world
IBM Spark Technology Center
• One of the top committers/contributors
• 300+ inventors
• Commitment to educate 1 million data scientists
• Contributed SystemML
• Founding member of AMPLab
• Partnerships in the ecosystem
https://ibm.biz/hadoop-jira
https://ibm.biz/spark-jira
Visit www.spark.tc for more information
Leadership in Spark
The Spark Technology Center (STC) has contributed 829 code changes to Spark components since we started around the middle of 2015.
STC contributions break down as 52% to Spark SQL, 16% to PySpark, and 26% to ML and MLlib. For more details, see this dashboard: https://www.ibm.biz/spark-jira
Data Science Experience (DSX)
ALL YOUR TOOLS IN ONE PLACE
IBM Data Science Experience is an environment that brings together everything that a Data Scientist needs. It includes the most popular open source tools and IBM's unique value-add functionalities, with community and social features integrated as first-class citizens, to make Data Scientists more successful.
datascience.ibm.com
Power of data. Simplicity of design. Speed of innovation.
IBM PoT on Google
9 May: Handling massive data with Spark
10 May: Machine learning training using DSX