The Nuts and Bolts of Hadoop and Its Ever-changing Ecosystem
along with what I wish someone would have explained to me before I started working with Hadoop

Jeff Crawford, Associate Professor, Lipscomb University
Philip Best, Data Science Architect, HCA
https://www.linkedin.com/in/crawdoctor

Presented at the 2014 Analytics Summit (Franklin, TN), September 10, 2014


Originally presented at the 2014 Nashville Analytics Summit


Page 2: About Lipscomb’s CCT

Lipscomb’s College of Computing and Technology offers the following graduate programs:
– MS in Information Technology
– MS in Informatics & Analytics
– MS in Software Engineering

Programs are designed with working professionals in mind. Earn an MS degree in as little as 12 months. The GRE is waived for those with 5 or more years of work experience in their area of study. See one of the Lipscomb folks for more information.

Visit http://technology.lipscomb.edu/ to learn more and apply

Page 3: A Spineless Disclaimer

All the thoughts in this presentation are:
– The result of learning a speaker dropped from the conference on Monday morning…
– Intended for good and not harm
– Derived from lots of reading, research, discussions and personal experience
– Review / old hat for some of you
– Not intended to be wholly inclusive
– Most likely correct

Page 4: Presentation Agenda

1. What Hadoop is
2. What Hadoop isn’t
3. What Hadoop looks like
4. Why didn’t anyone tell me? Things I wish I would have known when getting started
5. How to get started and get experience

Page 5: What Hadoop is

Definition somewhat depends on where you sit…
• Infrastructure
  – Concerned with cluster implementation, including but not limited to issues of performance, availability, scalability, etc.
• Data Science Proper
  – Concerned with extracting meaning from large and messy data sources
• Business Intelligence / Reporting
  – Concerned with delivering actionable information to the right people at the right time
• Management
  – Concerned with economically pursuing business objectives

Page 6: What Hadoop is

• Designed to solve a specific type of problem
  – How do I provide structure and meaning to a large and/or rapidly changing and/or unstructured set of data?
• Designed to address several limitations of traditional RDBMSs

Page 7: What Hadoop is

• A distributed storage and processing framework that is abstracted from “users”
  – HDFS
  – MapReduce
• Open-source software (Apache Software Foundation) derived from organizations that (sometimes) like to share, such as
  – Google
  – Yahoo!
  – Facebook
  – Cloudera / Hortonworks

Page 8: What Hadoop is

• Java-based
• Batch processing
  – HDFS designed for “write once, read many” operations
• Flexible
  – Can work with all types of data, constrained by your ability to program structure
  – Can use a variety of languages beyond Java to interact with Hadoop
    • Python, Perl, C++, R, etc.
• Resilient
  – Built with a “design to fail” mentality
  – “Rack aware” storage

Page 9: What Hadoop is

• Designed to utilize (mostly) commodity (COTS) hardware
• Linear-ish scalability
• Extensible
  – Ever-moving, ever-changing, ever-evolving

Page 10: What Hadoop is

Image pulled from http://techblog.baghel.com/index.php?itemid=132, details at http://hadoop.apache.org/

Page 11: What Hadoop is

Image pulled from http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html

Page 12: What Hadoop is

In summary, Hadoop provides a distributed storage and computation platform
– From existing hardware of varying quality
  • Scaling out, not up
  • Whatever hardware you can get your hands on
– Which handles data storage and resiliency
  • Using HDFS to store files
  • Built-in redundancy factor
– With a unified computation framework
  • Making the traditionally hard task of parallel programming more attainable
  • Automatically leverages locality of data

Page 13: What Hadoop IS NOT

• A replacement for RDBMSs
• A solution for every type of problem
  – Batch processing
  – Expectation of “large” files
• Free
• Straightforward to administer / manage / work with
• A means of simplifying the definition of business objectives
• The path to operational zen

Page 14: What Hadoop looks like

HDFS
• A file system
  – Files are stored in blocks (typically 64 MB or 128 MB)
  – Blocks are stored across multiple devices to address fault tolerance and performance issues
    • Can be “rack aware”
• Utilizes two types of machines (aka nodes)
  – Namenode: Contains information on the location of all files in the filesystem (metadata)
    • Potential single point of failure, which is where HDFS HA (High Availability) comes in
    • Can use a secondary namenode, but it is a simple backup using checkpoints
  – Datanode: Contains the actual file data
• POSIX-like commands
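To make the block and redundancy points above concrete, here is a small back-of-the-envelope sketch in Python 3. The function name `hdfs_footprint` is made up for illustration, and the replication factor of 3 is an assumption (it is the usual HDFS default, but the slides do not state a value). The sketch also assumes the last block of a file only occupies as much disk as it actually holds.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a single file is laid out in HDFS.

    Returns (blocks, block_replicas, raw_storage_mb).
    replication=3 is an assumed default, not a value from the slides.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is copied `replication` times across datanodes.
    block_replicas = blocks * replication
    # A partial block only uses the space it needs, so raw usage is
    # simply the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    for block_size in (64, 128):
        print(block_size, hdfs_footprint(1000, block_size_mb=block_size))
```

Under these assumptions a 1,000 MB file needs 8 blocks at 128 MB (16 at 64 MB), and either way consumes roughly 3,000 MB of raw cluster storage. The namenode tracks one metadata entry per block, which is one reason the "expectation of large files" matters.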

Page 15: What Hadoop looks like

MapReduce
• A logical framework for distributed computation
• Genius is that you perform compute processes as close as possible to the actual data
  – Minimize network costs
• Two versions currently: MRv1 and YARN (MRv2)

Page 16: What Hadoop looks like

MapReduce
• At a high level, the process involves…
  1. Prepare the Mapper environment – identify initial key/value pair to address within the dataset and distribute Mapper to appropriate nodes
  2. Run Mapper code on data to produce key/value pairs
  3. Organize (i.e., “shuffle”) the Mapper output and send to identified Reducers for further processing
  4. Run Reducer code on data to produce key/value pairs
  5. Collect all the Reducer output (sorted by final key)
• Canonical example when getting started with Hadoop is writing a MapReduce job that will count how often each word occurs in a given corpus (e.g., set of files)
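The steps above can be imitated in a single Python process to see how the key/value pairs flow. This is only a toy, in-memory sketch: the names `mapper`, `reducer` and `run_job` are illustrative, there is no distribution across nodes, and the line number stands in for the initial key that Hadoop would supply.

```python
from collections import defaultdict


def mapper(_key, line):
    """Step 2: turn one input record into (word, 1) pairs."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Step 4: combine all values seen for one key."""
    yield word, sum(counts)


def run_job(records, map_fn, reduce_fn):
    """Toy stand-in for the framework (steps 1, 3 and 5)."""
    # Steps 1-2: hand every (key, value) record to the mapper.
    mapped = [pair for key, value in records for pair in map_fn(key, value)]
    # Step 3: "shuffle" by grouping mapper output on the key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Steps 4-5: reduce each group and collect output sorted by key.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


records = list(enumerate(["the quick brown fox", "the lazy dog", "the fox"]))
result = run_job(records, mapper, reducer)
print(result)
```

In a real cluster the grouping in step 3 happens over the network between mapper and reducer nodes, which is why keeping mapper output small tends to matter.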

Page 17: What Hadoop looks like

MapReduce

Figure from http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/

Page 18: WordCount Example – MapReduce via Python

Page 19: WordCount Example – MapReduce via Python

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -mapper ./mapper.py -reducer ./reducer.py \
    -input books/* -output WordCount/v1 \
    -file ./mapper.py -file ./reducer.py
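A minimal sketch of what the `mapper.py` and `reducer.py` named in the command above could look like. This is not the presenters' code. It assumes the usual Hadoop Streaming contract: each script reads lines from stdin and writes tab-separated key/value lines to stdout, and the reducer receives its input sorted by key. Lowercasing the words is an extra assumption. Both halves are shown as functions in one block so the logic can be tried locally.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().lower().split():
            yield "%s\t1" % word


def reduce_lines(sorted_lines):
    """Reducer logic: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (word, total)


# In mapper.py the body would be roughly:
#     import sys
#     for out in map_lines(sys.stdin): print(out)
# and in reducer.py:
#     import sys
#     for out in reduce_lines(sys.stdin): print(out)

# Local dry run: sorted() plays the role of Hadoop's shuffle/sort.
sample = ["The quick brown fox", "the lazy dog"]
local_result = list(reduce_lines(sorted(map_lines(sample))))
print(local_result)
```

The same dry run works from a shell with `cat somefile | ./mapper.py | sort | ./reducer.py`, which is a handy way to debug before submitting a job to the cluster.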

Page 20: What Hadoop is

Image pulled from http://techblog.baghel.com/index.php?itemid=132, details at http://hadoop.apache.org/

Page 21: What Hadoop looks like

Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
  – Java? Seriously? Do I look like a sadist?
    • Python, Perl, Ruby, PHP, etc. via Hadoop Streaming
      – “Any language able to read from stdin, write to stdout and parse tab and new line characters will work”
    • Apache Pig – scripting that simplifies the MapReduce process
    • Apache Hive – SQL-ish code that allows you to generate MapReduce programs

Page 22: What Hadoop looks like

Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
  – All work and no play makes Jack a dull boy
    • Apache Spark – provides in-memory processing capabilities
    • Apache HBase – provides random, real-time read/write access to large datasets
      – Hadoop’s NoSQL column-oriented data store
    • Cloudera’s Impala – Hive-like but provides a better data warehouse-ish experience
  – What if I wanted to MapReduce my MapReduce job?
    • Apache Oozie – provides a mechanism for scheduling MapReduce jobs

Page 23: What Hadoop looks like

Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
  – And the data will come from where?
    • Apache Flume – allows pulling log-type files from external sources
    • Apache Sqoop – allows back-and-forth transfer of data between Hadoop and most RDBMSs
  – What about data mining?
    • Mahout – provides a machine learning library for Hadoop
    • R Connectors – allow you to utilize R as a front end for working with Hadoop

Page 24: Revisiting WordCount with Pig

A = load 'books/*';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group as word, COUNT(B) as word_count;
E = order D by word_count desc;
store E into 'pig/wordcount_v1';

Page 25: WordCount Example - Pig

Page 26: Utilizing Hive

Consider a dataset that contains airline departure / arrival data for major US airports. We would like to generate some simple descriptive statistics for the large dataset. Simple in SQL, not so much in MapReduce…

select carrier,
       count(carrier) as carrier_count,
       sum(if(departuredelay IS NULL, 1, 0)) as dep_delay_null,
       max(departuredelay) as dep_delay_max,
       min(departuredelay) as dep_delay_min,
       avg(departuredelay) as dep_delay_avg,
       stddev(departuredelay) as dep_delay_stddev
from flightdata
group by carrier
order by carrier;
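For readers who want to check what the query above computes, here is a plain Python 3 sketch of the same per-carrier summary on a tiny in-memory sample. The sample rows and the function name `describe_by_carrier` are hypothetical. Two assumptions are baked in: NULL delays are skipped by the max/min/avg/stddev aggregates, and `stddev` is treated as the population standard deviation (which is my understanding of Hive's behavior, so verify against your Hive version).

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sample rows: (carrier, departuredelay in minutes or None).
flightdata = [
    ("AA", 10), ("AA", None), ("AA", 20),
    ("DL", 5), ("DL", 15), ("DL", -5),
]


def describe_by_carrier(rows):
    """Mimic the GROUP BY carrier ... ORDER BY carrier summary."""
    delays = defaultdict(list)
    for carrier, delay in rows:
        delays[carrier].append(delay)

    summary = []
    for carrier in sorted(delays):              # order by carrier
        values = delays[carrier]
        known = [d for d in values if d is not None]
        summary.append({
            "carrier": carrier,
            "carrier_count": len(values),
            "dep_delay_null": len(values) - len(known),
            # Aggregates ignore missing delays; None if nothing is known.
            "dep_delay_max": max(known) if known else None,
            "dep_delay_min": min(known) if known else None,
            "dep_delay_avg": mean(known) if known else None,
            "dep_delay_stddev": pstdev(known) if known else None,
        })
    return summary


summary = describe_by_carrier(flightdata)
for row in summary:
    print(row)
```

The point of the slide still stands: the SQL version is a single statement, while a hand-written MapReduce job would need custom mapper and reducer code to carry several running aggregates per key.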

Page 27: Why didn’t anyone tell me?

Hadoop Things to Know
• The environment is fragile
  – LOTS of moving, changing parts
• Everything is configurable
• Environment is not intuitive to most IT professionals
• All roads start and end with Java
• MapReduce jobs can get complex… keep them simple and chain when necessary
• All nodes must have all tools available that are referenced in the MapReduce code
• Tasks run in their own Java Virtual Machine
  – Can cause unnecessary overhead when there are many tasks
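As an illustration of the "keep them simple and chain when necessary" advice, the sketch below runs two small stages back to back instead of one complicated job: the first counts words, the second takes that output and groups words by how often they occur. It is an in-memory Python 3 toy with made-up names (`run_stage`, `count_map`, `invert_map`, and so on), not a Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def run_stage(pairs, map_fn, reduce_fn):
    """One simplified MapReduce pass: map, sort and group by key, reduce."""
    mapped = sorted(
        (kv for pair in pairs for kv in map_fn(*pair)),
        key=itemgetter(0),
    )
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reduce_fn(key, [value for _, value in group]))
    return output


# Stage 1: classic word count.
def count_map(_offset, line):
    for word in line.split():
        yield word, 1


def count_reduce(word, ones):
    yield word, sum(ones)


# Stage 2: flip (word, count) into (count, word) and collect words per count.
def invert_map(word, count):
    yield count, word


def invert_reduce(count, words):
    yield count, sorted(words)


lines = list(enumerate(["a b a", "c b a"]))
counts = run_stage(lines, count_map, count_reduce)
by_freq = run_stage(counts, invert_map, invert_reduce)
print(counts)
print(by_freq)
```

Each stage stays easy to test on its own, and the output of one simply becomes the input of the next, which is the same idea a scheduler such as Oozie applies to real jobs.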

Page 28: Getting started: Part 1

• Get Hadoop: The Definitive Guide
• Learn Java or Python basics
  – For those with OO experience, Java might be best
  – All others, give Python a shot
  – Find a good IDE…
• Use Cloudera’s QuickStart VM
  – Requires VirtualBox or VMware and sufficient resources on your computer
• Find some peers to help you navigate through the questions that will arise

Page 29: Getting started: Part 2

If you want to run Hadoop on your own hardware…
• Install Cloudera in pseudo-distributed mode
  – Utilize virtualization if you don’t have a dedicated machine
• Install CDH 5 in a cluster configuration
  – At least 3 machines required

If you don’t want to use your own hardware…
• Sign up for a free Amazon AWS account and follow the docs on Elastic MapReduce (EMR)
  • http://aws.amazon.com/elasticmapreduce/

Page 30: Data Science is a Team Sport

Hacker

Scientist

Trusted advisor

Quantitative analyst

Business expert

Technologist

Project manager

Thoughts from Chapter 4 of Davenport (2014) & a bit of Crawford