Hadoop Technical Workshop
Academic Hadoop Usage
Overview
• University of Washington Curriculum
– Teaching Methods
– Reflections
– Student Background
– Course Staff Requirements
UW: Course Summary
• Course title: “Problem Solving on Large Scale Clusters”
• Primary purpose: developing large-scale problem solving skills
• Format: 6 weeks of lectures + labs, 4-week project
UW: Course Goals
• Think creatively about large-scale problems in a parallel fashion; design parallel solutions
• Manage large data sets under memory, bandwidth limitations
• Develop a foundation in parallel algorithms for large-scale data
• Identify and understand engineering trade-offs in real systems
Lectures
• 2 hours, once per week
• Half formal lecture, half discussion
• Mostly covered systems & background
• Included group activities for reinforcement
Classroom Activities
• Worksheets included pseudo-code programming, working through examples
– Performed in groups of 2 to 3
• Small-group discussions about engineering and systems design
– Groups of ~10
– Course staff facilitated, but mostly open-ended
Readings
• No textbook
• One academic paper per week
– E.g., “MapReduce: Simplified Data Processing on Large Clusters”
– Short homework covered comprehension
• Formed basis for discussion
Lecture Schedule
• Introduction to Distributed Computing
• MapReduce: Theory and Implementation
• Networks and Distributed Reliability
• Real-World Distributed Systems
• Distributed File Systems
• Other Distributed Systems
Intro to Distributed Computing
• What is distributed computing?
• Flynn’s Taxonomy
• Brief history of distributed computing
• Some background on synchronization and memory sharing
MapReduce
• Brief refresher on functional programming
• MapReduce slides
– More detailed version of module I
• Discussion on MapReduce
Networking and Reliability
• Crash course in networking
• Distributed systems reliability
– What is reliability?
– How do distributed systems fail?
– ACID, other metrics
• Discussion: Does MapReduce provide reliability?
Real Systems
• Design and implementation of Nutch
• Tech talk from Googler on Google Maps
Distributed File Systems
• Introduced GFS
• Discussed implementation of NFS and AndrewFS for comparison
Other Distributed Systems
• BOINC: Another platform
• Broader definition of distributed systems
– DNS
– One Laptop per Child project
Labs
• Also 2 hours, once per week
• Focused on applications of distributed systems
• Four lab projects over six weeks
Lab Schedule
• Introduction to Hadoop, Eclipse Setup, Word Count
• Inverted Index
• PageRank on Wikipedia
• Clustering on Netflix Prize Data
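For orientation, the word count from the first lab can be sketched without Hadoop at all. The block below is a minimal single-process Python simulation of the map, shuffle, and reduce phases. The function names (`map_words`, `reduce_counts`, `run_job`) are illustrative assumptions, not the Hadoop API and not the course's lab code.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce phase: sum all the counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            groups[word].append(count)  # the "shuffle": group values by key
    return dict(reduce_counts(word, counts) for word, counts in groups.items())
```

Calling `run_job(["the quick brown fox", "The lazy dog"])` counts "the" twice and every other word once. In a real cluster the grouping step is done by the framework across machines; here it is an in-memory dictionary.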
Design Projects
• Final four weeks of quarter
• Teams of 1 to 3 students
• Students proposed topic, gathered data, developed software, and presented solution
Example: Geozette
Image © Julia Schwartz
Example: Galaxy Simulation
Image © Slava Chernyak, Mike Hoak
Other Projects
• Bayesian Wikipedia spam filter
• Unsupervised synonym extraction
• Video collage rendering
Ongoing research: traceroutes
• Analyze time-stamped traceroute data to model changes in Internet router topology
– 4.5 GB of data/day * 1.5 years
– 12 billion traces from 200 PlanetLab sites
• Calculates prevalence and persistence of routes between hosts
Ongoing research: dynamic program traces
• Dynamic memory trace data from simulators can reach hundreds of GB
• Existing work focuses on sampling
• New capability: record all accesses and post-process with Hadoop
Common Features
• Hadoop!
• Used publicly-available web APIs for data
• Many involved reading papers for algorithms and translating into MapReduce framework
Background Topics
• Programming Languages
• Systems:
– Operating Systems
– File Systems
– Networking
• Databases
Programming Languages
• MapReduce is based on functional programming map and fold
• FP is taught in one quarter, but not reinforced
– “Crash course” necessary
– Worksheets to pose short problems in terms of map and fold
– Immutable data a key concept
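A worksheet-sized illustration of map and fold, written in Python as an assumption (the slides do not say which language the course used for its functional programming refresher). Python's `functools.reduce` plays the role of fold, and the helper names are hypothetical.

```python
from functools import reduce


def square(x):
    """Function applied independently to each element by map."""
    return x * x


def add(accumulator, x):
    """Combining function used by the fold."""
    return accumulator + x


def sum_of_squares(values):
    """map transforms each element; fold (reduce) combines the results."""
    return reduce(add, map(square, values), 0)


def fold_reverse(values):
    """A fold that builds a new tuple rather than mutating its input."""
    return reduce(lambda acc, x: (x,) + acc, values, ())
```

Neither function modifies its argument, which is the property that lets a framework run the map step on many machines without coordination.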
Multithreaded programming
• Taught in OS course at Washington
– Not a prerequisite!
• Students need to understand multiple copies of same method running in parallel
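The idea of several copies of one method running at once can be shown with a few lines of Python threading. This is a sketch for intuition only; `count_words` and `parallel_word_count` are hypothetical names, and the example gives each thread its own result slot so that no locking is needed.

```python
import threading


def count_words(chunk, results, index):
    """The same function body runs once per thread, each on its own chunk."""
    results[index] = len(chunk.split())


def parallel_word_count(chunks):
    """Start one thread per chunk, wait for all of them, then combine."""
    results = [0] * len(chunks)  # one slot per thread, so writes never collide
    threads = [
        threading.Thread(target=count_words, args=(chunk, results, i))
        for i, chunk in enumerate(chunks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # block until every copy has finished
    return sum(results)
```

Each thread has its own `chunk` and `index` but shares the `results` list, which is the distinction between per-invocation state and shared state that students need before reading mapper code.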
File Systems
• Necessary to understand GFS
• Comparison to NFS, other distributed file systems relevant
Networking
• TCP/IP
• Concepts of “connection,” network splits, other failure modes
• Bandwidth issues
Other Systems Topics
• Process Scheduling
• Synchronization
• Memory coherency
Databases
• Concept of shared consistency model
• Consensus
• ACID characteristics
– Journaling
– Multi-phase commit processes
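As a concrete instance of a multi-phase commit, the following is a minimal in-memory sketch of two-phase commit. It is an illustration under strong assumptions: the `Participant` class and `two_phase_commit` function are hypothetical, and the sketch ignores timeouts, coordinator failure, and the durable journaling a real implementation would need.

```python
class Participant:
    """A resource that can vote on, then commit or abort, a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Phase 1: vote yes (True) or no (False)."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes in phase 1."""
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    if all(votes):
        for p in participants:
            p.commit()  # phase 2: unanimous yes, so commit everywhere
        return True
    for p in participants:
        p.abort()  # phase 2: at least one no, so abort everywhere
    return False
```

The point for students is atomicity: a single "no" vote leaves every participant in the aborted state, so the transaction takes effect everywhere or nowhere.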
Course Staff
• Instructor (me!)
• Two undergrad teaching assistants
– Helped facilitate discussions, directed labs
• One student sys admin
– Worked only about three hours/week
Preparation
• Teaching assistants had taken previous iteration of course in winter
• Lectures retooled based on feedback from that quarter
– Added reasonably large amount of background material
• Ran & solved all labs in advance
The Course: What Worked
• Discussions
– Often covered broad range of subjects
• Hands-on lab projects
• “Active learning” in classroom
• Independent design projects
Things to Improve: Coverage
• Algorithms were not reinforced during lecture
– Students requested much more time be spent on “how to parallelize an iterative algorithm”
• Background material was very fast-paced
Things to Improve: Projects
• Labs could have used a moderated/scripted discussion component
– Just “jumping in” to the code proved difficult
– No time was devoted to Hadoop itself in lecture
– Clustering lab should be split in two
• Design projects could have used more time
Future Course Ideas Overview
• Systems course
• Web application design
• Integration in other applications courses
• Misc. content ideas
• Making your own data sets
Systems Course
• Focused on parallel & distributed systems
• Hadoop included in comparison to other cluster techniques
• Emphasis on performance, profiling, and management
Topic Map
MapReduce
Distributed FS
n-phase commit
Communication networks and topologies
IPC mechanisms
sockets
Definitions of failure and reliability
consensus
MPI
Condor
Introductory Material
• Networking basics
• Multithreading
Distributed Reliability
• Reliability metrics
• Methods of failure
• Techniques to combat failure
– Journaling, n-phase commit
• Techniques to achieve consensus
– Leader election, voting
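Voting and leader election can be introduced with very small examples before any real protocol is discussed. The sketch below is an assumption-laden simplification: `majority_value` requires a strict majority, and `elect_leader` simply picks the highest-numbered live node, which loosely resembles a bully-style election but omits all messaging and failure detection.

```python
from collections import Counter


def majority_value(votes):
    """Return the value backed by a strict majority of voters, else None."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) // 2 else None


def elect_leader(node_ids, alive):
    """Pick the highest id among nodes that are currently alive, else None."""
    candidates = [node for node in node_ids if node in alive]
    return max(candidates) if candidates else None
```

A tie, such as two votes for each of two values, yields `None`, which is a useful prompt for discussing why many systems use an odd number of replicas.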
Parallel Processing
• How to parallelize algorithms
• Parallelization in one machine vs. across several machines
– Techniques applicable to one vs. other
– Cache coherency
– Memory distribution
Parallelization Frameworks
• Multithreading on one machine
• RPC, MPI, PVM
• Higher-level scheduling
– Condor vs. Hadoop
• Tradeoffs in design
• Tradeoffs in design
Algorithm Design Comparison
• Matrix multiplication, sorting, searching, PageRank, etc.
– … For a standard distributed system
– … For Hadoop
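To make the Hadoop side of such a comparison concrete, here is one way an iterative algorithm like PageRank can be phrased as repeated map and reduce steps. This is a single-process Python sketch, not Hadoop code. The damping factor of 0.85 is the conventional value and is an assumption here, as are the function names; the sketch also assumes every page has at least one outgoing link.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional damping factor; an assumption, not from the slides


def map_page(rank, out_links):
    """Map: split one page's rank evenly among the pages it links to."""
    share = rank / len(out_links)  # assumes out_links is non-empty
    return [(target, share) for target in out_links]


def reduce_page(contributions, num_pages):
    """Reduce: combine the incoming shares into a page's new rank."""
    return (1 - DAMPING) / num_pages + DAMPING * sum(contributions)


def pagerank_iteration(graph, ranks):
    """One map/shuffle/reduce pass over the whole link graph."""
    incoming = defaultdict(list)
    for page, out_links in graph.items():
        for target, share in map_page(ranks[page], out_links):
            incoming[target].append(share)
    return {page: reduce_page(incoming[page], len(graph)) for page in graph}


def pagerank(graph, iterations=20):
    """Each loop iteration would correspond to one MapReduce job."""
    ranks = {page: 1 / len(graph) for page in graph}
    for _ in range(iterations):
        ranks = pagerank_iteration(graph, ranks)
    return ranks
```

The loop in `pagerank` is the part that a single MapReduce job cannot express: each iteration is its own job, with the output ranks of one job fed as input to the next.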
Distributed Storage
• NFS, AFS, GFS
• Database clustering techniques
– Distributed SQL databases
– HBase
– Distributed memory caches, object stores
Lab Focus
• Implementing parallel and distributed algorithms
• Experiment with different frameworks
• Perform measurements
– Bandwidth consumption
– Latency & performance
• Code analysis
Final Thoughts
• Lots of low-level programming involved
• Appropriate mostly for last-year students
• Hadoop community would find scholarly benchmarks useful
– wiki.apache.org/hadoop/ProjectSuggestions
– “JIRA” bug/feature request database
Web Application Design
Basic Web Development Topics
LAMP Stack / Ruby on Rails
3-tier apps, e-commerce requirements
XML
HTTP
HTML
Large-Scale Web Server Technology
Nutch/Lucene, Hadoop
Amazon Web Services
SOAP/XMLHttpRequest
Distributed System Basics
RPC Systems
Next Steps
• RPC
– Internal RPC; message queues and distributed back-ends
– Thrift, ProtocolBuffers
– SOAP and XMLHttpRequest
Scaling Really Big
• Nutch/Lucene
• Hadoop
• Amazon Web Services
Data Aggregation and Analysis
• How to crawl and parse web pages
• Generate link graphs
• Perform analyses (e.g., PageRank)
• Semantic analysis
Web Site Tuning
• Web page layout optimization
– Speed
– Accessibility
– Ease-of-use
• Server log analysis
– User-targeted site features
• Service replication
– Consistency, latency issues
Security and the Web
• Data sanitization
• SQL injection attacks
• DOS attacks
• Data collection methods & ethics
– User data privacy
Projects
• Code labs in Python, PHP, Ruby
• Simple database design
• Building a small search engine with Nutch/Lucene
• Design scalable architecture and run on Amazon EC2
• Web site design project
– Security/penetration analysis of other teams’ sites
Final Course Thoughts
• Web-based services are increasingly relevant
– Exciting new opportunity for students
– Example course in action: www.cs.washington.edu/education/courses/cse454/07au/
Using Hadoop in Other Courses
• Hadoop is a natural component for many existing courses
– Artificial intelligence
– Web search
– Data mining / information retrieval
– Databases (HBase)
– Networking
– Computational biology? Graphics?
Low Level Module
• “MapReduce in a week:” code.google.com/edu/content/parallel.html
• 3-lecture series on distributed processing and Hadoop; enough to get students started
• … more discussion of online resources next
AI/Data Mining Ideas
• Use Nutch to perform a web crawl and classify pages using Bayesian analysis
• Hadoop makes processing easy
– Data sanitization
– Classifier engine (Use WEKA right in Hadoop)
– HDFS for document storage/retrieval/search
AI/Data Mining Ideas
• Extract semantically valuable data from web pages
– E.g., match names to phone numbers,
– News articles to locations
• Hadoop allows students to explore a much broader scale than previously possible
Graphics Examples
• Re-encode a render pipeline as a set of MapReduce tasks
• Use feature detection + clustering on a corpus of images to find images with similar shapes/features
Student-Generated Ideas
• Data processing with Yahoo Pig
• Distributed SQL databases
• Distributed systems “ground-up” projects:
– Sockets, then RPC, then Hadoop
• Other concepts: Bittorrent, DHTs, P2P
• Other frameworks: e.g., BOINC projects
Making Datasets
• Your department is full of data!
– Graphics data
– Sensor data from RFID, Ubicomp, robotics…
– Measurements from networking lab
– Ask around: Someone has a few dozen gigs of log files to donate
– (What happens if you leave Ethereal in promiscuous mode for a week straight?)
Making Datasets
• Other departments are full of data!
– Biology
– Chemistry
– Physics (campus particle accelerator?)
Making Datasets
• The web is full of data!
– Use Nutch to crawl web sites
– Wikis are especially good (hmm..)
Conclusions
• Hadoop isn’t a full course in itself
– But it combines well with a lot of other ideas
• Can be used for at least half a course
• … Or as little as a week or two
• Look around you – Hadoop can be applied to more areas than you might think
Open Source Tools for Teaching
Overview
• Slides
• Lab Materials
• Readings
• Video Lectures
• Datasets
http://code.google.com/edu
Slides
• Multiple short course outlines available:
• “MapReduce in a week”
• “Introduction to Problem Solving on Large Scale Clusters”
• “MapReduce Mini Lecture Series”
Labs
• Lab designs from UW course available
– “Introduction to MapReduce”
– “A Simple Inverted Index”
– “PageRank on the Wikipedia Corpus”
– “Clustering the Netflix Movie Data”
Readings
• Google has several papers available
– “Introduction to Distributed Systems”
– “MapReduce: Simplified Data Processing on Large Clusters”
– “The Google File System”
– “Bigtable: A Distributed Storage System for Structured Data”
http://research.google.com/pubs/papers.html
Lecture Videos
MapReduce Mini-series
Datasets: Wikipedia
• Wikipedia supports free “bulk download” of data
– Current site snapshot (big)
– Entire revision history (massive)
• Eliminates need for Nutch crawls
• Good for indexing, search labs
http://download.wikimedia.org
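For an indexing lab over a corpus like this, the core data structure is an inverted index: a mapping from each word to the documents that contain it. The following is a minimal in-memory sketch in Python. It assumes documents arrive as a dictionary of id to plain text; parsing the actual Wikipedia dump format is out of scope, and the function names are hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in the document."""
    return [(word.lower(), doc_id) for word in text.split()]


def build_index(documents):
    """Group the emitted pairs by word and reduce each group to a sorted list."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word, emitted_id in map_document(doc_id, text):
            postings[word].add(emitted_id)  # a set removes duplicate doc ids
    return {word: sorted(doc_ids) for word, doc_ids in postings.items()}
```

A search for a single term is then a dictionary lookup, and a multi-term query is an intersection of the returned lists.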
Datasets: Netflix
• Netflix’s web site provides recommendations
• Theory: Other people watched movie X, then Y. You watched X, you might like Y.
• Open question: Can you provide more useful recommendations than their current system?
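The "watched X, then Y" theory can be prototyped as simple co-occurrence counting. The sketch below is a toy baseline in Python and makes no claim about how Netflix's own system works. It ignores viewing order and ratings, and the names `co_occurrence` and `recommend` are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def co_occurrence(histories):
    """Count, for each movie X, how often each other movie Y shares a history."""
    counts = defaultdict(lambda: defaultdict(int))
    for movies in histories:
        for x, y in permutations(set(movies), 2):
            counts[x][y] += 1
    return counts


def recommend(counts, watched, top_n=3):
    """Score unseen movies by how often they co-occur with watched ones."""
    scores = defaultdict(int)
    for x in watched:
        for y, n in counts.get(x, {}).items():
            if y not in watched:
                scores[y] += n
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]
```

The pair counting in `co_occurrence` maps naturally onto MapReduce: the mapper emits movie pairs per user history and the reducer sums them.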
Datasets: Netflix
• The Netflix Prize: $1,000,000 if you can find a better algorithm, based on their criteria
• They provide you with a large dataset of existing rental associations to work with
www.netflixprize.com
Conclusions
• Lots of starter materials available on the web
– Good for reference
– Get teaching assistants up to speed
• Readings, sample worksheets and other resources are open content & ready to use
Aaron Kimball