INTRO TO APACHE SPARK
BIG DATA FOR THE BUSINESS ANALYST
Created by Gus Cavanaugh / @GusCavanaugh
WHY ARE WE HERE?
Business analysts use data to inform business decisions.
Spark is one of many tools that can help you do that.
SO LET'S DIVE RIGHT IN
val input = sc.textFile("file:///test.csv")
input.collect().foreach(println)
This code just loads a text file and prints each line to the screen
HERE'S HOW I KNOW...
Excel formulas are super hard
=VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE)
=SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))
If you learned how to write VLOOKUPs, you can learn to code
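To make that concrete, here is a rough Python sketch of what those two formulas do. The data and column layout are made up for illustration:

```python
# Hypothetical rows standing in for the 'Raw Data' worksheet:
# each key maps to (region, units), like columns B:D in the VLOOKUP.
raw_data = {
    "A-100": ("East", 12),
    "A-200": ("West", 7),
}

# =VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE) -- exact-match lookup,
# returning the third column of the matched row
lookup_key = "A-200"
units = raw_data[lookup_key][1]
print(units)  # 7

# =SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))
# -- a conditional sum over three parallel columns
makes = ["Ford", "Ford", "Chevy"]
months = ["June", "July", "June"]
sales = [100, 50, 75]

total = sum(s for mk, mo, s in zip(makes, months, sales)
            if mk == "Ford" and mo == "June")
print(total)  # 100
```

Same logic as the formulas: one exact-match lookup, one filtered sum.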
DISTINCTION: WE ARE NOT ENGINEERS
We are not building production applications
We just want to answer questions with data rather than with speculation
WE MAY SHARE TOOLS WITH ENGINEERS, BUT OUR PROCESS IS DIFFERENT
Principally, we emphasize interactive analysis
This means we want the flexibility to change the questions we ask as we work
AND THE ABILITY TO STOP OUR ANALYSIS AT ANY POINT
We are not doing analysis for the sake of doing analysis
Good may be the enemy of great, but better is the enemy of done
OUR ANALYTIC PROCESS
Don't measure, just cut
Google is your best friend
You don't have to know how to do anything
You just have to be able to find out
WHAT IS SPARK?
Spark is an open-source processing framework designed for cluster computing
WHY IS IT POPULAR?
Super fast...
Plays well with Hadoop
Native APIs for analyst-friendly languages like Python and R
FAST REVIEW OF HADOOP
Google was indexing the web every day
They wrote some custom software to store and process those documents (web pages)
The open source version of that software is called Hadoop
HADOOP CONSISTS OF TWO MAIN PIECES
The Hadoop Distributed File System: HDFS
And a processing framework called MapReduce
HDFS enabled fault-tolerant storage on commodity servers at scale
And MapReduce allowed you to process what you stored in parallel
THIS IS A BIG DEAL...
Companies storing ever-increasing amounts of data could:
Do so much cheaper
With more flexibility
HADOOP CAME WITH A COST
Parallel processing, but not necessarily fast (batch processing)
Difficult to program:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // ...and this is only the start of the boilerplate
    }
}
NOT INTERACTIVE
Writing MapReduce jobs in Java is an inefficient way for business analysts to process data in parallel
We get the parallel processing speed, but the development time is long (or the time spent asking a dev to write it...)
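For contrast, here is the same word-count idea sketched in a few lines of plain Python (toy data, single machine — the point is development time, not scale):

```python
from collections import Counter

# Stand-in for lines read from a distributed file system
lines = ["the quick brown fox", "the lazy dog"]

# "map" step: emit (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# "reduce" step: sum the counts for each word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["the"])  # 2
```

Same map-then-reduce shape as the Java class above, minus the boilerplate.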
BUT WHAT ABOUT PIG..?
Pig is a sort of scripting language for Hadoop with friendly syntax that lets you read from any data source:

A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './wordcount';
While it works well, it's another language to learn and it is only used in Hadoop
BUT WHAT ABOUT SQL-ON-HADOOP?
A few options: Hive, Impala, Big SQL
If you have these options, use them
But they all involve substantial ETL and (maybe) additional hardware
In D.C. we know what that means: you get it on next year's contract
WHAT IS ETL? AND WHY WOULD WE NEED IT?
Because unlike in most Hadoop tutorials, the data analysts access is not sitting in flat files
For analytics, it is very likely you'll want data from your Hadoop application's database
But what is your Hadoop application's database?
HBASE - THE HADOOP DATABASE
One big freakin' table
No joins - row keys are everything
Great for applications, terrible for analysts
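A toy Python sketch of why that is — the row-key scheme and column names here are invented for illustration, not real HBase API:

```python
# Toy sketch of an HBase-style table: one big map from row key to columns.
# Row keys encode customer and date, because keys are all you can seek on.
orders = {
    "cust42#2016-01-03": {"item": "widget", "qty": 3},
    "cust42#2016-02-17": {"item": "gadget", "qty": 1},
    "cust99#2016-01-05": {"item": "widget", "qty": 2},
}

# Fast for the application: fetch one customer's orders by key prefix.
cust42 = {k: v for k, v in orders.items() if k.startswith("cust42#")}
print(len(cust42))  # 2

# Painful for the analyst: "total widgets sold in January" means
# scanning every row -- there is no join or GROUP BY to lean on.
widgets_jan = sum(v["qty"] for k, v in orders.items()
                  if v["item"] == "widget" and "#2016-01" in k)
print(widgets_jan)  # 5
```

If the question wasn't anticipated in the row-key design, you scan the whole table.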
WHY AM I TALKING ABOUT HBASE DURING A SPARK PRESENTATION?
Because I want you to know that your data will not be in the format you want
ETL (Extract, Transform, Load) is a real process that engineers will have to spend time on to get your data into a SQL-friendly environment
This will not be an application feature, but an analytics one (so don't be surprised if this gets skipped)
MY RAMBLING POINT IS THAT YOU WILL HAVE MESSY DATA
Neither Hadoop, Spark, Tableau, nor anything else will solve that
You still have to rely on the tools you use for data wrangling
Like Python and R
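As a small, hypothetical example of the kind of wrangling you end up doing in Python (standard library only; the messy values are made up):

```python
import csv
import io

# Hypothetical messy export: inconsistent case, stray whitespace, blanks.
raw = "region,amount\n East ,100\nWEST,\neast,25\n"

cleaned = []
for row in csv.DictReader(io.StringIO(raw)):
    region = row["region"].strip().lower()   # normalize the label
    amount = row["amount"].strip()
    if not amount:                           # drop rows missing an amount
        continue
    cleaned.append((region, int(amount)))

print(cleaned)  # [('east', 100), ('east', 25)]
```

Nothing fancy — but no amount of cluster hardware does this step for you.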
TOOL COMPARISON
Tool       Powerful?  Friendly?
Excel      No         Hell yes
Python/R   Meh...     Yes
Hadoop     Yes        Hell no
Spark      Hell yes   Just right
SPARK IS OUR BEST ANSWER
You can write Python, and iterative computations are processed in memory, so they are easier to write and much faster than MapReduce
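To give a feel for that style without assuming a Spark install, here is a toy, single-machine imitation of the chained transformations Spark encourages (this is not PySpark — just a sketch of the API flavor):

```python
class ToyRDD:
    """Single-machine stand-in for a Spark RDD, only to show the API style."""

    def __init__(self, data):
        self.data = list(data)  # held in memory, like a cached RDD

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def filter(self, f):
        return ToyRDD(x for x in self.data if f(x))

    def collect(self):
        return self.data


# Chain transformations, then pull results back -- the Spark workflow shape.
lines = ToyRDD(["10", "20", "oops", "30"])
total = sum(lines.filter(str.isdigit).map(int).collect())
print(total)  # 60
```

In real PySpark the chaining looks much the same, but each step runs in parallel across the cluster.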
HOW YOU CAN GET STARTED
Big Data University
Spark on Bluemix
EXTRAS
My video on Docker install
Spark paper