78
Hadoop Technical Workshop Academic Hadoop Usage

Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Hadoop Technical Workshop

Academic Hadoop Usage

Page 2: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Overview

• University of Washington Curriculum– Teaching Methods– Reflections– Student Background– Course Staff Requirements

Page 3: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

UW: Course Summary

• Course title: “Problem Solving on Large Scale Clusters”

• Primary purpose: developing large-scale problem solving skills

• Format: 6 weeks of lectures + labs, 4 week project

Page 4: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

UW: Course Goals

• Think creatively about large-scale problems in a parallel fashion; design parallel solutions

• Manage large data sets under memory, bandwidth limitations

• Develop a foundation in parallel algorithms for large-scale data

• Identify and understand engineering trade-offs in real systems

Page 5: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Lectures

• 2 hours, once per week• Half formal lecture, half discussion• Mostly covered systems &

background• Included group activities for

reinforcement

Page 6: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Classroom Activities

• Worksheets included pseudo-code programming, working through examples– Performed in groups of 2—3

• Small-group discussions about engineering and systems design– Groups of ~10– Course staff facilitated, but mostly open-

ended

Page 7: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Readings

• No textbook• One academic paper per week

– E.g., “Simplified Data Processing on Large Clusters”

– Short homework covered comprehension

• Formed basis for discussion

Page 8: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Lecture Schedule

• Introduction to Distributed Computing

• MapReduce: Theory and Implementation

• Networks and Distributed Reliability• Real-World Distributed Systems• Distributed File Systems• Other Distributed Systems

Page 9: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Intro to Distributed Computing

• What is distributed computing?• Flynn’s Taxonomy• Brief history of distributed computing• Some background on synchronization

and memory sharing

Page 10: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

MapReduce

• Brief refresher on functional programming

• MapReduce slides – More detailed version of module I

• Discussion on MapReduce

Page 11: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Networking and Reliability

• Crash course in networking• Distributed systems reliability

– What is reliability?– How do distributed systems fail?– ACID, other metrics

• Discussion: Does MapReduce provide reliability?

Page 12: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Real Systems

• Design and implementation of Nutch• Tech talk from Googler on Google

Maps

Page 13: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Distributed File Systems

• Introduced GFS• Discussed implementation of NFS

and AndrewFS for comparison

Page 14: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Other Distributed Systems

• BOINC: Another platform• Broader definition of distributed

systems– DNS– One Laptop per Child project

Page 15: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Labs

• Also 2 hours, once per week• Focused on applications of

distributed systems• Four lab projects over six weeks

Page 16: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Lab Schedule

• Introduction to Hadoop, Eclipse Setup, Word Count

• Inverted Index• PageRank on Wikipedia• Clustering on Netflix Prize Data

Page 17: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Design Projects

• Final four weeks of quarter• Teams of 1—3 students• Students proposed topic, gathered

data, developed software, and presented solution

Page 18: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Example: Geozette

Image © Julia Schwartz

Page 19: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Example: Galaxy Simulation

Image © Slava Chernyak, Mike Hoak

Page 20: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Other Projects

• Bayesian Wikipedia spam filter• Unsupervised synonym extraction• Video collage rendering

Page 21: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Ongoing research: traceroutes

• Analyze time-stamped traceroute data to model changes in Internet router topology– 4.5 GB of data/day * 1.5 years– 12 billion traces from 200 PlanetLab

sites

• Calculates prevalence and persistence of routes between hosts

Page 22: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Ongoing research: dynamic program traces

• Dynamic memory trace data from simulators can reach hundreds of GB

• Existing work focuses on sampling• New capability: record all accesses

and post-process with Hadoop

Page 23: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Common Features

• Hadoop!• Used publicly-available web APIs for

data• Many involved reading papers for

algorithms and translating into MapReduce framework

Page 24: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Background Topics

• Programming Languages• Systems:

– Operating Systems – File Systems– Networking

• Databases

Page 25: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Programming Languages

• MapReduce is based on functional programming map and fold

• FP is taught in one quarter, but not reinforced– “Crash course” necessary– Worksheets to pose short problems in

terms of map and fold– Immutable data a key concept

Page 26: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Multithreaded programming

• Taught in OS course at Washington– Not a prerequisite!

• Students need to understand multiple copies of same method running in parallel

Page 27: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

File Systems

• Necessary to understand GFS• Comparison to NFS, other distributed

file systems relevant

Page 28: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Networking

• TCP/IP• Concepts of “connection,” network

splits, other failure modes• Bandwidth issues

Page 29: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Other Systems Topics

• Process Scheduling• Synchronization• Memory coherency

Page 30: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Databases

• Concept of shared consistency model• Consensus• ACID characteristics

– Journaling– Multi-phase commit processes

Page 31: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Course Staff

• Instructor (me!)• Two undergrad teaching assistants

– Helped facilitate discussions, directed labs

• One student sys admin– Worked only about three hours/week

Page 32: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Preparation

• Teaching assistants had taken previous iteration of course in winter

• Lectures retooled based on feedback from that quarter– Added reasonably large amount of

background material

• Ran & solved all labs in advance

Page 33: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

The Course: What Worked

• Discussions– Often covered broad range of subjects

• Hands-on lab projects• “Active learning” in classroom• Independent design projects

Page 34: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Things to Improve: Coverage

• Algorithms were not reinforced during lecture– Students requested much more time be

spent on “how to parallelize an iterative algorithm”

• Background material was very fast-paced

Page 35: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Things to Improve: Projects

• Labs could have used a moderated/scripted discussion component– Just “jumping in” to the code proved difficult– No time was devoted to Hadoop itself in

lecture– Clustering lab should be split in two

• Design projects could have used more time

Page 36: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Future Course Ideas Overview

• Systems course• Web application design• Integration in other applications

courses• Misc. content ideas• Making your own data sets

Page 37: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Systems Course

• Focused on parallel & distributed systems

• Hadoop included in comparison to other cluster techniques

• Emphasis on performance, profiling, and management

Page 38: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Topic Map

MapReduce

Distributed FS

n-phase commit

Communication networksand topologies

IPC mechanisms

sockets

Definitions of failure and reliability

consensus

MPI

Condor

Page 39: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Introductory Material

• Networking basics• Multithreading

Page 40: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Distributed Reliability

• Reliability metrics• Methods of failure• Techniques to combat failure

– Journaling, n-phase commit

• Techniques to achieve consensus– Leader election, voting

Page 41: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Parallel Processing

• How to parallelize algorithms• Parallelization in one machine vs.

across several machines– Techniques applicable to one vs. other– Cache coherency– Memory distribution

Page 42: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Parallelization Frameworks

• Multithreading on one machine• RPC, MPI, PVM• Higher-level scheduling

– Condor vs. Hadoop

• Tradeoffs in design

Page 43: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Algorithm Design Comparison

• Matrix multiplication, sorting, searching, PageRank, etc– … For a standard distributed system– … For Hadoop

Page 44: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Distributed Storage

• NFS, AFS, GFS• Database clustering techniques

– Distributed SQL databases– HBase– Distributed memory caches, object

stores

Page 45: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Lab Focus

• Implementing parallel and distributed algorithms

• Experiment with different frameworks

• Perform measurements– Bandwidth consumption– Latency & performance

• Code analysis

Page 46: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Final Thoughts

• Lots of low-level programming involved

• Appropriate mostly for last-year students

• Hadoop community would find scholarly benchmarks useful– wiki.apache.org/hadoop/

ProjectSuggestions– “JIRA” bug/feature request database

Page 47: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Web Application Design

Page 48: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Basic Web Development Topics

LAMP Stack / Ruby on Rails

3-tier apps, e-commerce requirements

XML

HTTP

HTML

Page 49: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Large-Scale Web Server Technology

Nutch/Lucene, Hadoop

Amazon Web Services

SOAP/XMLHttpRequest

Distributed System Basics

RPC Systems

Page 50: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Next Steps

• RPC – Internal RPC; message queues and

distributed back-ends– Thrift, ProtocolBuffers– SOAP and XMLHttpRequest

Page 51: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Scaling Really Big

• Nutch/Lucene• Hadoop• Amazon Web Services

Page 52: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Data Aggregation and Analysis

• How to crawl and parse web pages• Generate link graphs• Perform analyses (e.g., PageRank)• Semantic analysis

Page 53: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Web Site Tuning

• Web page layout optimization– Speed– Accessibility– Ease-of-use

• Server log analysis– User-targeted site features

• Service replication– Consistency, latency issues

Page 54: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Security and the Web

• Data sanitization• SQL injection attacks• DOS attacks• Data collection methods & ethics

– User data privacy

Page 55: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Projects

• Code labs in Python, PHP, Ruby• Simple database design• Building a small search engine with

Nutch/Lucene• Design scalable architecture and run

on Amazon EC2• Web site design project

– Security/penetration analysis of other teams’ sites

Page 56: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Final Course Thoughts

• Web-based services are increasingly relevant – Exciting new opportunity for students– Example course in action:

www.cs.washington.edu/education/courses/

cse454/07au/

Page 57: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Using Hadoop in Other Courses

• Hadoop is a natural component for many existing courses– Artificial intelligence– Web search– Data mining / information retrieval– Databases (HBase)– Networking – Computational biology? Graphics?

Page 58: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Low Level Module

• “MapReduce in a week:” code.google.com/edu/content/parallel.html

• 3-lecture series on distributed processing and Hadoop; enough to get students started

• … more discussion of online resources next

Page 59: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

AI/Data Mining Ideas

• Use Nutch to perform a web crawl and classify pages using Bayesian analysis

• Hadoop makes processing easy– Data sanitization– Classifier engine (Use WEKA right in

Hadoop)– HDFS for document

storage/retrieval/search

Page 60: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

AI/Data Mining Ideas

• Extract semantically valuable data from web pages– E.g., match names to phone numbers,– News articles to locations

• Hadoop allows students to explore a much broader scale than previously possible

Page 61: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Graphics Examples

• Re-encode a render pipeline as a set of MapReduce tasks

• Use feature detection + clustering on a corpus of images to find images with similar shapes/features

Page 62: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Student-Generated Ideas

• Data processing with Yahoo Pig• Distributed SQL databases• Distributed systems “ground-up”

projects:– Sockets, then RPC, then Hadoop

• Other concepts: Bittorrent, DHTs, P2P• Other frameworks: e.g., BOINC

projects

Page 63: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Making Datasets

• Your department is full of data!– Graphics data– Sensor data from RFID, Ubicomp,

robotics…– Measurements from networking lab– Ask around: Someone has a few dozen

gigs of log files to donate– (What happens if you leave Ethereal in

promiscuous mode for a week straight?)

Page 64: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Making Datasets

• Other departments are full of data!– Biology– Chemistry– Physics (campus particle accelerator?)

Page 65: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Making Datasets

• The web is full of data!– Use Nutch to crawl web sites– Wikis are especially good (hmm..)

Page 66: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Conclusions

• Hadoop isn’t a full course in itself– But it combines well with a lot of other

ideas

• Can be used for at least a half a course

• … Or as little as a week or two• Look around you – Hadoop can be

applied to more areas than you might think

Page 67: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Open Source Tools for Teaching

Page 68: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Overview

• Slides• Lab Materials• Readings• Video Lectures• Datasets

http://code.google.com/edu

Page 69: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Slides

• Multiple short course outlines available:

• “MapReduce in a week”

• “Introduction to Problem Solving on Large Scale Clusters”

• “MapReduce Mini Lecture Series”

Page 70: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Labs

• Lab designs from UW course available– “Introduction to MapReduce”– “A Simple Inverted Index”– “PageRank on the Wikipedia Corpus”– “Clustering the Netflix Movie Data”

Page 71: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Readings

• Google has several papers available– “Introduction to Distributed Systems”– “MapReduce: Simplified Data Processing on

Large Scale Clusters”– “The Google File System”– “BigTable: A Distributed Storage System

for Structured Data”

http://research.google.com/pubs/papers.html

Page 72: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Lecture Videos

MapReduce Mini-series

Page 73: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Datasets: Wikipedia

• Wikipedia supports free “bulk download” of data– Current site snapshot (big)– Entire revision history (massive)

• Eliminates need for Nutch crawls• Good for indexing, search labs

http://download.wikimedia.org

Page 74: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Datasets: Netflix

• Netflix’s web site provides recommendations

• Theory: Other people watched movie X, then Y. You watched X, you might like Y.

• Open question: Can you provide more useful recommendations than their current system?

Page 75: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Datasets: Netflix

• The Netflix Prize: $1,000,000 if you can find a better algorithm, based on their criteria

• They provide you with a large dataset of existing rental associations to work with

www.netflixprize.com

Page 76: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Conclusions

• Lots of starter materials available on the web– Good for reference– Get teaching assistants up to speed

• Readings, sample worksheets and other resources are open content & ready to use

Page 77: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course

Aaron Kimball

[email protected]

Page 78: Hadoop Technical Workshop Academic Hadoop Usage. Overview University of Washington Curriculum –Teaching Methods –Reflections –Student Background –Course