Xiaoxiao Hadoop


  • 8/12/2019 Xiaoxiao Had Oop

    1/31

    Hadoop and its Real-world

    Applications

    Xiaoxiao Shi, Guan Wang

Experience: worked at Yahoo! in summer 2010 on developing Hadoop-based machine learning models.


    Contents

    Motivation of Hadoop

    History of Hadoop

The current applications of Hadoop

    Programming examples

    Research with Hadoop

    Conclusions


    Motivation of Hadoop

How do you scale up applications?

    Run jobs processing 100s of terabytes of data

    Takes 11 days to read on 1 computer

    Need lots of cheap computers

    Fixes the speed problem (15 minutes on 1000 computers), but brings reliability problems

    In large clusters, computers fail every day

    Cluster size is not fixed

    Need common infrastructure

    Must be efficient and reliable
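The timing figures on this slide can be sanity-checked with a bit of arithmetic. A quick sketch, assuming a single disk reads sequentially at roughly 100 MB/s (an assumed rate, not stated on the slide):

```java
// Back-of-envelope check of the numbers above: 100 TB read at an
// assumed ~100 MB/s per disk, on 1 machine vs. on 1000 machines.
public class ScaleEstimate {
    // days to read `tb` terabytes on one machine at `mbPerSec` MB/s
    static double daysOnOneMachine(double tb, double mbPerSec) {
        return tb * 1e6 / mbPerSec / 86400.0;
    }

    // minutes to read the same data spread evenly over `machines` machines
    static double minutesOnCluster(double tb, double mbPerSec, int machines) {
        return tb * 1e6 / mbPerSec / machines / 60.0;
    }

    public static void main(String[] args) {
        System.out.printf("1 machine:     %.1f days%n", daysOnOneMachine(100, 100));
        System.out.printf("1000 machines: %.1f minutes%n", minutesOnCluster(100, 100, 1000));
    }
}
```

At ~100 MB/s this gives about 11.6 days on one machine and under 17 minutes on 1000 machines, matching the slide's rounded figures.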


    Motivation of Hadoop

    Open Source Apache Project

    Hadoop Core includes:

    Distributed File System - distributes data

    Map/Reduce - distributes application

    Written in Java

    Runs on

Linux, Mac OS/X, Windows, and Solaris

    Commodity hardware


    Fun Fact of Hadoop

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."

    ---- Doug Cutting, Hadoop project creator

    http://hadoop.apache.org/

    History of Hadoop

    Apache Nutch

    Doug Cutting

    Map-reduce

    2004

    It is an important technique!

    Extended

    The great journey begins


    History of Hadoop

Yahoo! became the primary contributor in 2006


    History of Hadoop

Yahoo! deployed large scale science clusters in 2007.

    Tons of Yahoo! Research papers emerge: WWW, CIKM, SIGIR, VLDB

    Yahoo! began running major production jobs in Q1 2008.

    Nowadays


    Nowadays

When you visit Yahoo!, you are interacting with data processed with Hadoop!


    Nowadays

Ads Optimization

    Content Optimization

    Search Index

    Content Feed Processing

    When you visit Yahoo!, you are interacting with data processed with Hadoop!


    Nowadays

Ads Optimization

    Content Optimization

    Search Index

    Content Feed Processing

    Machine Learning (e.g. spam filters)

    When you visit Yahoo!, you are interacting with data processed with Hadoop!


    Nowadays

Yahoo! has ~20,000 machines running Hadoop

    The largest clusters are currently 2000 nodes

    Several petabytes of user data (compressed, unreplicated)

    Yahoo! runs hundreds of thousands of jobs every month


    Nowadays

Who uses Hadoop?

    Amazon/A9

    AOL

    Facebook

Fox Interactive Media

    Google

    IBM

    New York Times

    PowerSet (now Microsoft)

Quantcast

    Rackspace/Mailtrust

    Veoh

    Yahoo!

    More at http://wiki.apache.org/hadoop/PoweredBy


    Nowadays (job market on Nov 15th)

Software Developer Intern - IBM - Somers, NY (+3 locations) - Agile development - Big data / Hadoop / data analytics a plus

    Software Developer - IBM - San Jose, CA (+4 locations) - include Hadoop-powered distributed parallel data processing system, big data analytics ... multiple technologies, including Hadoop


    It is important

    Details


    Nowadays

Hadoop Core

    Distributed File System

    MapReduce Framework

    Pig (initiated by Yahoo!) - parallel programming language and runtime

    HBase (initiated by Powerset) - table storage for semi-structured data

    Zookeeper (initiated by Yahoo!) - coordinating distributed systems

    Hive (initiated by Facebook) - SQL-like query language and metastore


    HDFS

Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.

    Hadoop Distributed File System Goals:

    Store large data sets

    Cope with hardware failure

    Emphasize streaming data access
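The block model described above can be illustrated without Hadoop at all. A minimal plain-Java sketch (the 300 MB file length, 128 MB block size, and 3x replication factor here are illustrative values, 128 MB and 3x being common defaults):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the HDFS block model: a file is stored as a sequence
// of fixed-size blocks, all the same size except possibly the last,
// and every block is replicated for fault tolerance.
public class BlockLayout {
    // lengths (in bytes) of the blocks a file of fileLen bytes splits into
    static List<Long> splitIntoBlocks(long fileLen, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            blocks.add(Math.min(blockSize, fileLen - off));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // a 300 MB file with a 128 MB block size -> blocks of 128, 128, 44 MB
        List<Long> blocks = splitIntoBlocks(300 * mb, 128 * mb);
        int replication = 3; // copies of each block kept on different datanodes
        System.out.println("blocks: " + blocks.size());
        System.out.println("stored block copies: " + blocks.size() * replication);
    }
}
```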


    Typical Hadoop Structure

Commodity hardware: Linux PCs with 4 local disks

    Typically in a 2-level architecture

    40 nodes/rack

    Uplink from rack is 8 gigabit

    Rack-internal is 1 gigabit all-to-all
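One consequence of this topology is worth spelling out: 40 nodes each on a 1-gigabit in-rack link can together demand far more than the 8-gigabit uplink provides, so cross-rack bandwidth is oversubscribed 5:1. This is one reason the framework tries to keep work close to the data. The arithmetic:

```java
// Oversubscription of the rack uplink in the 2-level topology above:
// 40 nodes x 1 Gb/s of in-rack demand vs. an 8 Gb/s uplink.
public class RackBandwidth {
    static double oversubscription(int nodesPerRack, double gbitPerNode, double uplinkGbit) {
        return nodesPerRack * gbitPerNode / uplinkGbit;
    }

    public static void main(String[] args) {
        System.out.printf("oversubscription: %.0f:1%n", oversubscription(40, 1.0, 8.0));
    }
}
```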


    Hadoop structure

Single namespace for the entire cluster

    Managed by a single namenode

    Files are single-writer and append-only

    Optimized for streaming reads of large files

    Files are broken into large blocks

    Typically 128 MB

    Replicated to several datanodes for reliability

    Client talks to both namenode and datanodes

    Data is not sent through the namenode

    Throughput of the file system scales nearly linearly with the number of nodes

    Access from Java, C, or the command line


    Hadoop Structure

Java and C++ APIs: in Java use Objects, while in C++ bytes

    Each task can process data sets larger than RAM

    Automatic re-execution on failure

    In a large cluster, some nodes are always slow or flaky

    Framework re-executes failed tasks

    Locality optimizations

    Map-Reduce queries HDFS for locations of input data

    Map tasks are scheduled close to the inputs when possible


    Example of Hadoop Programming

    Word Count:

I like parallel computing. I also took courses on parallel computing

    Parallel: 2

    Computing: 2

    I: 2

    Like: 1
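Before looking at the distributed version, the counts above can be reproduced in ordinary Java (lower-casing the text and splitting on non-letters, a slight simplification of the tokenization the MapReduce code uses):

```java
import java.util.HashMap;
import java.util.Map;

// Word count for the example sentence, done locally in plain Java.
public class LocalWordCount {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c =
            count("I like parallel computing. I also took courses on parallel computing");
        System.out.println("parallel: " + c.get("parallel"));   // 2
        System.out.println("computing: " + c.get("computing")); // 2
        System.out.println("i: " + c.get("i"));                 // 2
        System.out.println("like: " + c.get("like"));           // 1
    }
}
```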


    Example of Hadoop Programming

    Intuition: design

    Assume each node will process a paragraph

    Map: What is the key?

    What is the value?

    Reduce: What to collect?

    What to reduce?


Word Count Example

    public class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable ONE = new IntWritable(1);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> out,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          out.collect(new Text(itr.nextToken()), ONE);
        }
      }
    }


Word Count Example

    public class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> out,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
      }
    }


    Word Count Example

public static void main(String[] args) throws Exception {

      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setMapperClass(MapClass.class);
      conf.setCombinerClass(ReduceClass.class);
      conf.setReducerClass(ReduceClass.class);

      FileInputFormat.setInputPaths(conf, args[0]);
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      conf.setOutputKeyClass(Text.class);           // output keys are words (strings)
      conf.setOutputValueClass(IntWritable.class);  // output values are counts

      JobClient.runJob(conf);
    }


    Hadoop in Yahoo!


                      Before Hadoop    After Hadoop
    Time              26 days          20 minutes
    Language          C++              Python
    Development Time  2-3 weeks        2-3 days

Database for Search Assist is built using Hadoop.

    3 years of log-data

    20-steps of map-reduce


Related research of Hadoop

    Conference tutorials:

    KDD Tutorial: Modeling with Hadoop, KDD 2011 (top conference in data mining)

    Strata Tutorial: How to Develop Big Data Applications for Hadoop

    OSCON Tutorial: Introduction to Hadoop

    Papers:

    Scalable distributed inference of dynamic user interests for behavioral targeting. KDD 2011: 114-122

    Yucheng Low, Deepak Agarwal, Alexander J. Smola: Multiple domain user personalization. KDD 2011: 123-131

    Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, Zhaohui Zheng: Collaborative competitive filtering: learning recommender using context of user choice. SIGIR 2011: 295-304

    Srinivas Vadrevu, Choon Hui Teo, Suju Rajan, Kunal Punera, Byron Dom, Alexander J. Smola, Yi Chang, Zhaohui Zheng: Scalable clustering of news search results. WSDM 2011: 675-684

    Shuang-Hong Yang, Bo Long, Alexander J. Smola, Narayanan Sadagopan, Zhaohui Zheng, Hongyuan Zha: Like like alike: joint friendship and interest propagation in social networks. WWW 2011: 537-546

    Amr Ahmed, Alexander J. Smola: WWW 2011 invited tutorial overview: latent variable models on the internet. WWW (Companion Volume) 2011: 281-282

    Daniel Hsu, Nikos Karampatziakis, John Langford, Alexander J. Smola: Parallel Online Learning. CoRR abs/1103.4204 (2011)

    Neethu Mohandas, Sabu M. Thampi: Improving Hadoop Performance in Handling Small Files. ACC 2011: 187-194

    Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong: Performance Analysis of Hadoop for Query Processing. AINA Workshops 2011: 507-513

    All just this year, 2011!


    For more information:

    http://hadoop.apache.org/

    http://developer.yahoo.com/hadoop/

    Who uses Hadoop?:

    http://wiki.apache.org/hadoop/PoweredBy
