1. Introduction to Apache Spark and Machine Learning
Ezekiel Awoyemi, Data Engineer, Andela
2. What is Apache Spark?
Apache Spark is an open-source cluster-computing framework built around
speed, ease of use, and sophisticated analytics, relative to other big
data processing frameworks such as MapReduce and Storm.
3. Spark-stack
4. What is Big Data and where does it come from?
- Ad impressions
- Fast forward, pause and rewind of videos
- Transactions
- Social networks
- Telecommunication networks
5. Data Science
Data Science aims to derive knowledge from big data, efficiently and
intelligently.
- Nowcasting, for example Google Flu Trends in February 2010
- Forecasting, for example Princeton University's epidemiological
  modelling of online social network dynamics
6. Database/Data Science
ELEMENTS    | DATABASE                                  | DATA SCIENCE
PRIORITIES  | Consistency, Error recovery, Auditability | Speed, Availability, Query richness
DATA VALUE  | Precious                                  | Cheap
DATA VOLUME | Modest                                    | Massive
STRUCTURE   | Strong (Schema)                           | Weak or none (Text)
EXAMPLES    | Bank records, Medical records, Census,    | Online clicks, GPS logs, Tweets, etc.
            | Personal records                          |
            | Querying the past                         | Querying the future
7. Spark Program Lifecycle
- Create RDDs from external data, or parallelize a collection in your
  driver program
- Lazily transform them into new RDDs
- cache() some RDDs for reuse
- Perform actions to execute parallel computation and produce results
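
The four lifecycle steps can be illustrated without a cluster. The sketch below is plain Python, not Spark's API: a generator stands in for a lazy transformation, a materialised list stands in for a cached RDD, and sum() and max() stand in for actions. The names double, log, and cached are made up for this illustration.

```python
# Plain-Python stand-in for the RDD lifecycle; this does NOT use Spark.
log = []  # records each element as it is actually processed


def double(x):
    log.append(x)  # side effect so we can observe when evaluation happens
    return x * 2


data = [1, 2, 3, 4]                    # step 1: a collection in the driver program
doubled = (double(x) for x in data)    # step 2: lazy transformation, nothing runs yet
evaluated_before_action = len(log)     # no element has been processed so far

# Step 3: keep the computed values for reuse. In Spark, cache() only marks
# the RDD; the data is stored when the first action runs.
cached = list(doubled)

total = sum(cached)     # step 4: an action that produces a result
largest = max(cached)   # second action reuses the cached values; double() is not re-run

print(evaluated_before_action, total, largest, log)
```

The point of the log list is that double() runs only once per element, even though two results are computed from the same data.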
8. Machine Learning
Machine Learning is used to solve supervised classification problems:
give the machine examples and it learns from them.
Techniques we can use include:
- Collaborative filtering, which is commonly used for recommender systems
- Naive Bayes
- Other principles/algorithms, etc.
9. Examples of machine learning
- Classification of email as spam
- Self-driving cars
- Recommending new songs, movies, etc.
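
As one possible illustration of the spam example, here is a minimal Naive Bayes text classifier in plain Python. It is a sketch under stated assumptions, not the talk's implementation: the training messages are invented, the labels "spam" and "ham" are a convention chosen here, and add-one (Laplace) smoothing is an assumption made so that unseen words do not zero out a score.

```python
import math
from collections import Counter

# Hypothetical toy training data (not from the talk).
training = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def train(examples):
    """Count words per label and messages per label."""
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of training messages
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocab = model
    total_msgs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_msgs)  # log prior
        n_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing: every word gets a non-zero probability.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training)
prediction_spam = classify("win a cash prize", model)
prediction_ham = classify("agenda for lunch", model)
print(prediction_spam, prediction_ham)
```

This is the "give machines examples and they will learn" idea in miniature: the model is nothing more than word counts gathered from labeled examples.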
10. Coding Example 1
The text file is the complete works of William Shakespeare.
- Count the number of lines in the file
- Print the first line (the first item in the RDD)
- How many lines contain the word "come"?
- Count the number of words in the file
- How many times does the word "come" now appear?
- Print the first item in the RDD
- Print a (word, count) pair
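
The steps of this exercise can be sketched in plain Python on a small in-memory sample. This is an illustration only: the three lines below are a hypothetical stand-in for the text file, the tokenize helper is made up here, and the comments name the RDD operations (count, first, filter, flatMap, map, reduceByKey) that would play the same role in a Spark version.

```python
from collections import Counter

# Hypothetical stand-in for the text file: a small in-memory sample.
lines = [
    "Come, come, you answer with an idle tongue.",
    "To be, or not to be, that is the question.",
    "Come what come may,",
]

line_count = len(lines)   # role of count() on the lines RDD
first_line = lines[0]     # role of first()

# Role of filter(...).count(): lines that contain "come" (substring match).
lines_with_come = sum(1 for line in lines if "come" in line.lower())


def tokenize(line):
    """Split a line into lowercase words with surrounding punctuation removed."""
    return [w.strip(".,;:!?").lower() for w in line.split()]


# Role of flatMap: one flat list of words instead of one list per line.
words = [w for line in lines for w in tokenize(line) if w]
word_total = len(words)
come_total = words.count("come")

# Role of map(word -> (word, 1)) followed by reduceByKey(add).
pairs = Counter(words)
first_pair = ("come", pairs["come"])

print(line_count, first_line)
print(lines_with_come, word_total, come_total, first_pair)
```

Counting lines that contain "come" and counting occurrences of the word "come" give different numbers, which is why the exercise asks both questions.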
11. Coding Example 2
We have 1,000,209 ratings from 6,040 users on 3,706 movies, collected by
MovieLens.
- Use a small set of movies that have received the most ratings from
  users in the MovieLens dataset
- Get a fellow to rate the movies: 1 (poor) to 5 (best), or 0 if not seen
- Make movie recommendations
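
The slide does not say which algorithm produces the recommendations. As a rough sketch, the example below swaps in a simple user-based collaborative filter in plain Python: it scores each unseen movie by a similarity-weighted average of other users' ratings. The user names, movie titles, and ratings are all hypothetical, and a Spark version would more likely rely on a library routine such as MLlib's ALS.

```python
import math

# Hypothetical ratings on the slide's scale: 1 (poor) to 5 (best), 0 = not seen.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 0, "Movie D": 1, "Movie E": 0},
    "bob":   {"Movie A": 5, "Movie B": 5, "Movie C": 4, "Movie D": 0, "Movie E": 2},
    "carol": {"Movie A": 1, "Movie B": 0, "Movie C": 2, "Movie D": 5, "Movie E": 5},
}


def cosine(u, v):
    """Cosine similarity over the movies both users have actually rated."""
    shared = [m for m in u if u[m] > 0 and v.get(m, 0) > 0]
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)


def recommend(user, all_ratings):
    """Return (movie, predicted rating) for unseen movies, best first."""
    mine = all_ratings[user]
    scores = {}
    for movie, rating in mine.items():
        if rating != 0:
            continue  # only predict for movies the user has not seen
        weighted_sum = 0.0
        weight_total = 0.0
        for other, theirs in all_ratings.items():
            if other == user or theirs.get(movie, 0) == 0:
                continue
            sim = cosine(mine, theirs)
            weighted_sum += sim * theirs[movie]
            weight_total += sim
        if weight_total > 0:
            scores[movie] = weighted_sum / weight_total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


recommendations = recommend("alice", ratings)
titles = [movie for movie, _ in recommendations]
print(recommendations)
```

Here alice's tastes are closer to bob's than to carol's, so bob's ratings carry more weight in her predictions.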