Xiaoxiao Hadoop


  • 8/12/2019 Xiaoxiao Had Oop

    1/31

    Hadoop and its Real-world

    Applications

    Xiaoxiao Shi, Guan Wang

Experience: worked at Yahoo! in summer 2010 on developing Hadoop-based machine learning models.


    Contents

    Motivation of Hadoop

    History of Hadoop

The current applications of Hadoop

    Programming examples

    Research with Hadoop

    Conclusions


    Motivation of Hadoop

How do you scale up applications?

    Run jobs processing 100s of terabytes of data

    Takes 11 days to read on 1 computer

    Need lots of cheap computers

    Fixes the speed problem (15 minutes on 1000 computers), but brings reliability problems

    In large clusters, computers fail every day

    Cluster size is not fixed

    Need common infrastructure

    Must be efficient and reliable
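The timing figures on this slide can be sanity-checked with a bit of arithmetic. A quick sketch, assuming a single disk reads sequentially at roughly 100 MB/s (an assumed rate, not stated on the slide):

```java
// Back-of-envelope check of the numbers above: 100 TB read at an
// assumed ~100 MB/s per disk, on 1 machine vs. on 1000 machines.
public class ScaleEstimate {
    // days to read `tb` terabytes on one machine at `mbPerSec` MB/s
    static double daysOnOneMachine(double tb, double mbPerSec) {
        return tb * 1e6 / mbPerSec / 86400.0;
    }

    // minutes to read the same data spread evenly over `machines` machines
    static double minutesOnCluster(double tb, double mbPerSec, int machines) {
        return tb * 1e6 / mbPerSec / machines / 60.0;
    }

    public static void main(String[] args) {
        System.out.printf("1 machine:     %.1f days%n", daysOnOneMachine(100, 100));
        System.out.printf("1000 machines: %.1f minutes%n", minutesOnCluster(100, 100, 1000));
    }
}
```

At ~100 MB/s this gives about 11.6 days on one machine and under 17 minutes on 1000 machines, matching the slide's rounded figures.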


    Motivation of Hadoop

    Open Source Apache Project

    Hadoop Core includes:

    Distributed File System - distributes data

    Map/Reduce - distributes application

    Written in Java

    Runs on

Linux, Mac OS/X, Windows, and Solaris

    Commodity hardware


    Fun Fact of Hadoop

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."

    ---- Doug Cutting, Hadoop project creator

    http://hadoop.apache.org/

    History of Hadoop

    Apache Nutch

    Doug Cutting

    Map-reduce

    2004

    It is an important technique!

    Extended

    The great journey begins


    History of Hadoop

Yahoo! became the primary contributor in 2006


    History of Hadoop

Yahoo! deployed large scale science clusters in 2007.

    Tons of Yahoo! Research papers emerge: WWW, CIKM, SIGIR, VLDB

    Yahoo! began running major production jobs in Q1 2008.

    Nowadays


    Nowadays

When you visit Yahoo!, you are interacting with data processed with Hadoop!


    Nowadays

Ads Optimization

    Content Optimization

    Search Index

    Content Feed Processing

    When you visit Yahoo!, you are interacting with data processed with Hadoop!


    Nowadays

Ads Optimization

    Content Optimization

    Search Index

    Content Feed Processing

    Machine Learning (e.g. spam filters)

    When you visit Yahoo!, you are interacting with data processed with Hadoop!


    Nowadays

Yahoo! has ~20,000 machines running Hadoop

    The largest clusters are currently 2000 nodes

    Several petabytes of user data (compressed, unreplicated)

    Yahoo! runs hundreds of thousands of jobs every month


    Nowadays

Who uses Hadoop?

    Amazon/A9

    AOL

    Facebook

Fox Interactive Media

    Google

    IBM

    New York Times

    PowerSet (now Microsoft)

Quantcast

    Rackspace/Mailtrust

    Veoh

    Yahoo!

    More at http://wiki.apache.org/hadoop/PoweredBy


    Nowadays (job market on Nov 15th)

Software Developer Intern - IBM - Somers, NY (+3 locations) - Agile development - Big data / Hadoop / data analytics a plus

    Software Developer - IBM - San Jose, CA (+4 locations) - include Hadoop-powered distributed parallel data processing system, big data analytics ... multiple technologies, including Hadoop


    It is important

    Details


    Nowadays

Hadoop Core

    Distributed File System

    MapReduce Framework

    Pig (initiated by Yahoo!) - parallel programming language and runtime

    HBase (initiated by Powerset) - table storage for semi-structured data

    Zookeeper (initiated by Yahoo!) - coordinating distributed systems

    Hive (initiated by Facebook) - SQL-like query language and metastore


    HDFS

Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.

    Hadoop Distributed File System Goals:

    Store large data sets

    Cope with hardware failure

    Emphasize streaming data access
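The block model described above can be illustrated without Hadoop at all. A minimal plain-Java sketch (the 300 MB file length, 128 MB block size, and 3x replication factor here are illustrative values, 128 MB and 3x being common defaults):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the HDFS block model: a file is stored as a sequence
// of fixed-size blocks, all the same size except possibly the last,
// and every block is replicated for fault tolerance.
public class BlockLayout {
    // lengths (in bytes) of the blocks a file of fileLen bytes splits into
    static List<Long> splitIntoBlocks(long fileLen, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            blocks.add(Math.min(blockSize, fileLen - off));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // a 300 MB file with a 128 MB block size -> blocks of 128, 128, 44 MB
        List<Long> blocks = splitIntoBlocks(300 * mb, 128 * mb);
        int replication = 3; // copies of each block kept on different datanodes
        System.out.println("blocks: " + blocks.size());
        System.out.println("stored block copies: " + blocks.size() * replication);
    }
}
```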


    Typical Hadoop Structure

Commodity hardware: Linux PCs with 4 local disks

    Typically in a 2-level architecture

    40 nodes/rack

    Uplink from rack is 8 gigabit

    Rack-internal is 1 gigabit all-to-all
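One consequence of this topology is worth spelling out: 40 nodes each on a 1-gigabit in-rack link can together demand far more than the 8-gigabit uplink provides, so cross-rack bandwidth is oversubscribed 5:1. This is one reason the framework tries to keep work close to the data. The arithmetic:

```java
// Oversubscription of the rack uplink in the 2-level topology above:
// 40 nodes x 1 Gb/s of in-rack demand vs. an 8 Gb/s uplink.
public class RackBandwidth {
    static double oversubscription(int nodesPerRack, double gbitPerNode, double uplinkGbit) {
        return nodesPerRack * gbitPerNode / uplinkGbit;
    }

    public static void main(String[] args) {
        System.out.printf("oversubscription: %.0f:1%n", oversubscription(40, 1.0, 8.0));
    }
}
```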


    Hadoop structure

Single namespace for the entire cluster

    Managed by a single namenode

    Files are single-writer and append-only

    Optimized for streaming reads of large files

    Files are broken into large blocks

    Typically 128 MB

    Replicated to several datanodes for reliability

    Client talks to both namenode and datanodes

    Data is not sent through the namenode

    Throughput of the file system scales nearly linearly with the number of nodes

    Access from Java, C, or the command line


    Hadoop Structure

Java and C++ APIs: in Java use Objects, while in C++ bytes

    Each task can process data sets larger than RAM

    Automatic re-execution on failure

    In a large cluster, some nodes are always slow or flaky

    Framework re-executes failed tasks

    Locality optimizations

    Map-Reduce queries HDFS for locations of input data

    Map tasks are scheduled close to the inputs when possible


    Example of Hadoop Programming

    Word Count:

I like parallel computing. I also took courses on parallel computing

    Parallel: 2

    Computing: 2

    I: 2

    Like: 1
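Before looking at the distributed version, the counts above can be reproduced in ordinary Java (lower-casing the text and splitting on non-letters, a slight simplification of the tokenization the MapReduce code uses):

```java
import java.util.HashMap;
import java.util.Map;

// Word count for the example sentence, done locally in plain Java.
public class LocalWordCount {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c =
            count("I like parallel computing. I also took courses on parallel computing");
        System.out.println("parallel: " + c.get("parallel"));   // 2
        System.out.println("computing: " + c.get("computing")); // 2
        System.out.println("i: " + c.get("i"));                 // 2
        System.out.println("like: " + c.get("like"));           // 1
    }
}
```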


    Example of Hadoop Programming

    Intuition: design

    Assume each node will process a paragraph

    Map: What is the key?

    What is the value?

    Reduce: What to collect?

    What to reduce?


Word Count Example

    public class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable ONE = new IntWritable(1);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> out,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          out.collect(new Text(itr.nextToken()), ONE);
        }
      }
    }


Word Count Example

    public class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> out,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
      }
    }


    Word Count Example

public static void main(String[] args) throws Exception {

      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setMapperClass(MapClass.class);
      conf.setCombinerClass(ReduceClass.class);
      conf.setReducerClass(ReduceClass.class);

      FileInputFormat.setInputPaths(conf, args[0]);
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      conf.setOutputKeyClass(Text.class);           // output keys are words (strings)
      conf.setOutputValueClass(IntWritable.class);  // output values are counts

      JobClient.runJob(conf);
    }


    Hadoop in Yahoo!


                      Before Hadoop    After Hadoop
    Time              26 days          20 minutes
    Language          C++              Python
    Development Time  2-3 weeks        2-3 days

Database for Search Assist is built using Hadoop.

    3 years of log-data

    20-steps of map-reduce


Related research of Hadoop

    Conference tutorials:

    KDD Tutorial: Modeling with Hadoop, KDD 2011 (top conference in data mining)

    Strata Tutorial: How to Develop Big Data Applications for Hadoop

    OSCON Tutorial: Introduction to Hadoop

    Papers:

    Scalable distributed inference of dynamic user interests for behavioral targeting. KDD 2011: 114-122

    Yucheng Low, Deepak Agarwal, Alexander J. Smola: Multiple domain user personalization. KDD 2011: 123-131

    Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, Zhaohui Zheng: Collaborative competitive filtering: learning recommender using context of user choice. SIGIR 2011: 295-304

    Srinivas Vadrevu, Choon Hui Teo, Suju Rajan, Kunal Punera, Byron Dom, Alexander J. Smola, Yi Chang, Zhaohui Zheng: Scalable clustering of news search results. WSDM 2011: 675-684

    Shuang-Hong Yang, Bo Long, Alexander J. Smola, Narayanan Sadagopan, Zhaohui Zheng, Hongyuan Zha: Like like alike: joint friendship and interest propagation in social networks. WWW 2011: 537-546

    Amr Ahmed, Alexander J. Smola: WWW 2011 invited tutorial overview: latent variable models on the internet. WWW (Companion Volume) 2011: 281-282

    Daniel Hsu, Nikos Karampatziakis, John Langford, Alexander J. Smola: Parallel Online Learning. CoRR abs/1103.4204 (2011)

    Neethu Mohandas, Sabu M. Thampi: Improving Hadoop Performance in Handling Small Files. ACC 2011: 187-194

    Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong: Performance Analysis of Hadoop for Query Processing. AINA Workshops 2011: 507-513

    All just this year, 2011!


    For more information:

    http://hadoop.apache.org/

    http://developer.yahoo.com/hadoop/

    Who uses Hadoop?:

    http://wiki.apache.org/hadoop/PoweredBy
