INTRO TO APACHE SPARK
BIG DATA FOR THE BUSINESS ANALYST
Created by Gus Cavanaugh / @GusCavanaugh
WHY ARE WE HERE?
Business analysts use data to inform business decisions.
Spark is one of many tools that can help you do that.
SO LET'S DIVE RIGHT IN
val input = sc.textFile("file:///test.csv")
input.collect().foreach(println)
This code just loads a text file and prints each line to the screen
HERE'S HOW I KNOW...
Excel formulas are super hard
=VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE)
=SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))
If you learned how to write VLOOKUPs, you can learn to code
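To make that concrete, here is a rough Python sketch of what those two formulas do. The data and column layout are made up for illustration:

```python
# Hypothetical rows standing in for the 'Raw Data' worksheet:
# each key maps to (region, units), like columns B:D in the VLOOKUP.
raw_data = {
    "A-100": ("East", 12),
    "A-200": ("West", 7),
}

# =VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE) -- exact-match lookup,
# returning the third column of the matched row
lookup_key = "A-200"
units = raw_data[lookup_key][1]
print(units)  # 7

# =SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))
# -- a conditional sum over three parallel columns
makes = ["Ford", "Ford", "Chevy"]
months = ["June", "July", "June"]
sales = [100, 50, 75]

total = sum(s for mk, mo, s in zip(makes, months, sales)
            if mk == "Ford" and mo == "June")
print(total)  # 100
```

Same logic as the formulas: one exact-match lookup, one filtered sum.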
DISTINCTION: WE ARE NOT ENGINEERS
We are not building production applications
We just want to answer questions with data rather than with speculation
WE MAY SHARE TOOLS WITH ENGINEERS, BUT OUR PROCESS IS DIFFERENT
Principally, we emphasize interactive analysis
This means we want the flexibility to change the questions we ask as we work
AND THE ABILITY TO STOP OUR ANALYSIS AT ANY POINT
We are not doing analysis for the sake of doing analysis
Good may be the enemy of great, but better is the enemy of done
OUR ANALYTIC PROCESS
Don't measure, just cut
Google is your best friend
You don't have to know how to do anything
You just have to be able to find out
WHAT IS SPARK?
Spark is an open-source processing framework designed for cluster computing
WHY IS IT POPULAR?
Super fast...
Plays well with Hadoop
Native APIs for analyst-friendly languages like Python and R
FAST REVIEW OF HADOOP
Google was indexing the web every day
They wrote some custom software to store and process those documents (web pages)
The open source version of that software is called Hadoop
HADOOP CONSISTS OF TWO MAIN PIECES
The Hadoop Distributed File System: HDFS
And a processing framework called MapReduce
HDFS enabled fault-tolerant storage on commodity servers at scale
And MapReduce allowed you to process what you stored in parallel
THIS IS A BIG DEAL...
Companies storing ever-increasing amounts of data could:
Do so much cheaper
With more flexibility
HADOOP CAME WITH A COST
Parallel processing, but not necessarily fast (batch processing)
Difficult to program:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // ...and this is only the start of the boilerplate
    }
}
NOT INTERACTIVE
Writing MapReduce jobs in Java is an inefficient way for business analysts to process data in parallel
We get the parallel processing speed, but the development time is long (or the time spent asking a dev to write it...)
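For contrast, here is the same word-count idea sketched in a few lines of plain Python (toy data, single machine — the point is development time, not scale):

```python
from collections import Counter

# Stand-in for lines read from a distributed file system
lines = ["the quick brown fox", "the lazy dog"]

# "map" step: emit (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# "reduce" step: sum the counts for each word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["the"])  # 2
```

Same map-then-reduce shape as the Java class above, minus the boilerplate.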
BUT WHAT ABOUT PIG..?
Pig is a sort of scripting language for Hadoop with friendly syntax that lets you read from any data source:

A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './wordcount';
While it works well, it's another language to learn and it is only used in Hadoop
BUT WHAT ABOUT SQL-ON-HADOOP?
A few options: Hive, Impala, Big SQL
If you have these options, use them
But they all involve substantial ETL and (maybe) additional hardware
In D.C. we know what that means: you get it on next year's contract
WHAT IS ETL? AND WHY WOULD WE NEED IT?
Because unlike in most Hadoop tutorials, the data analysts access is not sitting in flat files
For analytics, it is very likely you'll want data from your Hadoop application's database
But what is your Hadoop application's database?
HBASE - THE HADOOP DATABASE
One big freakin' table
No joins - row keys are everything
Great for applications, terrible for analysts
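A toy Python sketch of why that is — the row-key scheme and column names here are invented for illustration, not real HBase API:

```python
# Toy sketch of an HBase-style table: one big map from row key to columns.
# Row keys encode customer and date, because keys are all you can seek on.
orders = {
    "cust42#2016-01-03": {"item": "widget", "qty": 3},
    "cust42#2016-02-17": {"item": "gadget", "qty": 1},
    "cust99#2016-01-05": {"item": "widget", "qty": 2},
}

# Fast for the application: fetch one customer's orders by key prefix.
cust42 = {k: v for k, v in orders.items() if k.startswith("cust42#")}
print(len(cust42))  # 2

# Painful for the analyst: "total widgets sold in January" means
# scanning every row -- there is no join or GROUP BY to lean on.
widgets_jan = sum(v["qty"] for k, v in orders.items()
                  if v["item"] == "widget" and "#2016-01" in k)
print(widgets_jan)  # 5
```

If the question wasn't anticipated in the row-key design, you scan the whole table.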
WHY AM I TALKING ABOUT HBASE DURING A SPARK PRESENTATION?
Because I want you to know that your data will not be in the format you want
ETL (Extract, Transform, Load) is a real process that engineers will have to spend time on to get your data into a SQL-friendly environment
This will not be an application feature, but an analytics one (so don't be surprised if this gets skipped)
MY RAMBLING POINT IS THAT YOU WILL HAVE MESSY DATA
Neither Hadoop, Spark, Tableau, nor anything else will solve that
You still have to rely on the tools you use for data wrangling
Like Python and R
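As a small, hypothetical example of the kind of wrangling you end up doing in Python (standard library only; the messy values are made up):

```python
import csv
import io

# Hypothetical messy export: inconsistent case, stray whitespace, blanks.
raw = "region,amount\n East ,100\nWEST,\neast,25\n"

cleaned = []
for row in csv.DictReader(io.StringIO(raw)):
    region = row["region"].strip().lower()   # normalize the label
    amount = row["amount"].strip()
    if not amount:                           # drop rows missing an amount
        continue
    cleaned.append((region, int(amount)))

print(cleaned)  # [('east', 100), ('east', 25)]
```

Nothing fancy — but no amount of cluster hardware does this step for you.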
TOOL COMPARISON
Tool       Powerful?  Friendly?
Excel      No         Hell yes
Python/R   Meh...     Yes
Hadoop     Yes        Hell no
Spark      Hell yes   Just right
SPARK IS OUR BEST ANSWER
You can write Python, and iterative computations are processed in memory, so they are easier to write and much faster than MapReduce
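To give a feel for that style without assuming a Spark install, here is a toy, single-machine imitation of the chained transformations Spark encourages (this is not PySpark — just a sketch of the API flavor):

```python
class ToyRDD:
    """Single-machine stand-in for a Spark RDD, only to show the API style."""

    def __init__(self, data):
        self.data = list(data)  # held in memory, like a cached RDD

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def filter(self, f):
        return ToyRDD(x for x in self.data if f(x))

    def collect(self):
        return self.data


# Chain transformations, then pull results back -- the Spark workflow shape.
lines = ToyRDD(["10", "20", "oops", "30"])
total = sum(lines.filter(str.isdigit).map(int).collect())
print(total)  # 60
```

In real PySpark the chaining looks much the same, but each step runs in parallel across the cluster.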
HOW YOU CAN GET STARTED
Big Data University
Spark on Bluemix
EXTRAS
My video on Docker install
Spark paper