Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.

Essentials of Pig: Mastering Hadoop MapReduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky | [email protected]fideas.com

SDEC2011 Essentials of Pig

  • 1. Essentials of Pig: Mastering Hadoop MapReduce for Data Analysis
    Shashank Tiwari
    blog: shanky.org | twitter: @tshanky | [email protected]
  • 2. Session Agenda
    - What is Pig and why should you use it?
    - Installing & Setting up Pig
    - Pig's Components
    - Using Pig with Hadoop MapReduce
    - Summary & Conclusion
  • 3. What is Pig?
    - A higher-level abstraction for Hadoop MapReduce
    - An infrastructure for data analysis using a scripting language named Pig Latin
  • 4. Why should you use Pig?
    Hadoop MapReduce:
    - Requires you to be a programmer
    - Forces you to design all your algorithms in terms of the map and reduce primitives
  • 5. Installing & Setting Up Pig: Required Software
    - Java 1.6.x
    - Hadoop 0.20.x
    - Ant 1.7+ (for builds)
    - JUnit 4.5 (for tests)
    - Cygwin (on Windows)
  • 6. Download
    - Source: http://pig.apache.org/
    - Version: 0.8.1 (current stable)
  • 7. Install & Configure
    - Extract: tar zxvf pig-0.8.1.tar.gz
    - Move & create symbolic link: ln -s pig-0.8.1 pig
    - Edit bin/pig: export PIG_CLASSPATH=$HADOOP_HOME/conf
  • 8. Verify Installation (remember to start Hadoop first)
    - bin/pig -help (command options)
    - bin/pig (run the grunt shell)
  • 9. Running Pig
    Run modes:
    - Local mode: single machine
    - MapReduce mode: needs a Hadoop cluster (with HDFS)
    Run via:
    - grunt shell
    - Pig scripts
    - embedded programs
  • 10. Pig IDE
    - PigPen, an Eclipse-based IDE
    - Graphical data flow definition
    - Can show an example data flow
  • 11. Pig Components
    - Pig Latin
    - Pig Engine: an execution engine on top of Hadoop; includes default optimal configurations
  • 12. A client for your cluster
    - Pig itself does not run on the Hadoop cluster
    - It runs on a client machine and connects to the cluster to submit jobs
  • 13. Pig Latin
    - Data flow language (not declarative like SQL)
    - Increases productivity (fewer lines do more)
    - Includes standard operations like join, filter, group, sort
    - User code and existing binaries can be included
    - Supports nested data types
    - Does not require metadata
  • 14. Pig Latin Example
    - The examples leverage the tutorial that comes with the distribution
    - Check the tutorial folder in the distribution
  • 15. Start Grunt Shell
      cd $PIG_HOME
      bin/pig -x local
  • 16. Aggregate Data
      grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp, query);
    When loading, alternate delimiters can be used, and de-serializers like PigJsonLoader can be leveraged.
      grunt> grouped = GROUP log BY user;
      grunt> counted = FOREACH grouped GENERATE group, COUNT(log);
      grunt> STORE counted INTO 'output';
  • 17. Group Data
      grunt> grouped = GROUP log BY user;
    - In Pig, a group operation generates (key, collection) pairs, where the collection is itself a collection of tuples
    - Every tuple in the collection carries the same key value as the (key, collection) pair it belongs to
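The (key, collection) shape that GROUP produces can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not how Pig executes it; the sample rows and the `group_by` helper are hypothetical and stand in for the (user, timestamp, query) tuples of the tutorial log.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for (user, timestamp, query) tuples.
log = [
    ("u1", "970916105432", "yahoo chat"),
    ("u2", "970916001949", "weather"),
    ("u1", "970916105500", "maps"),
]

def group_by(rows, key_index):
    """Model of GROUP rows BY field: map each key to the bag of full tuples sharing it."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return dict(bags)

# Model of: grouped = GROUP log BY user;
grouped = group_by(log, 0)

# Model of: counted = FOREACH grouped GENERATE group, COUNT(log);
counted = {user: len(bag) for user, bag in grouped.items()}
```

Each bag keeps the complete original tuples, so the grouping field appears both as the key and inside every tuple of the bag.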
  • 18. Filter Data
      grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      grunt> grouped = GROUP log BY user;
      grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      grunt> filtered = FILTER counted BY cnt > 75;
      grunt> STORE filtered INTO 'output1';
  • 19. Order Data
      grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      grunt> grouped = GROUP log BY user;
      grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      grunt> filtered = FILTER counted BY cnt > 50;
      grunt> sorted = ORDER filtered BY cnt;
      grunt> STORE sorted INTO 'output2';
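As a rough illustration of the FILTER and ORDER steps, the following Python sketch applies the same two operations to a small list of (user, cnt) rows. The rows and the helper names are made up for the example; ORDER is modelled as ascending, which is Pig's default when no direction is given.

```python
# Hypothetical (user, cnt) rows, as produced by the GROUP / COUNT step.
counted = [("u1", 120), ("u2", 40), ("u3", 76), ("u4", 51)]

def filter_by_count(rows, threshold):
    """Model of: FILTER counted BY cnt > threshold."""
    return [(user, cnt) for user, cnt in rows if cnt > threshold]

def order_by_count(rows):
    """Model of: ORDER filtered BY cnt (ascending)."""
    return sorted(rows, key=lambda row: row[1])

filtered = filter_by_count(counted, 50)
sorted_rows = order_by_count(filtered)
```

Here `u2` is dropped by the filter and the remaining rows come out ordered by increasing count.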
  • 20. Join Data Example
    - Words appearing in "Adventures of Huckleberry Finn" by Mark Twain: http://www.gutenberg.org/ebooks/76
    - Words appearing in "The Adventures of Sherlock Holmes" by Sir Arthur Conan Doyle: http://www.gutenberg.org/ebooks/1661
  • 21. Loading & Counting Huckleberry Finn Data
      grunt> A = load 'pg76.txt';
      grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      grunt> C = filter B by word matches '\\w+';
      grunt> D = group C by word;
      grunt> E = foreach D generate COUNT(C), group;
      grunt> store E into 'huckleberry_finn_freq';
  • 22. Loading & Counting Sherlock Holmes Data
      grunt> A = load 'pg1661.txt';
      grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      grunt> C = filter B by word matches '\\w+';
      grunt> D = group C by word;
      grunt> E = foreach D generate COUNT(C), group;
      grunt> store E into 'sherlock_holmes_freq';
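The word-count pipeline used for both books can be approximated in Python as follows. This is a simplified model: tokens are split on whitespace only (Pig's TOKENIZE also treats a few punctuation characters as delimiters), and the `word_freq` helper and sample lines are hypothetical.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")

def word_freq(lines):
    """Model of the load / TOKENIZE / filter / group / COUNT pipeline."""
    # TOKENIZE + flatten: one word per output row (whitespace split only here).
    words = [token for line in lines for token in line.split()]
    # 'matches' must match the whole string, hence fullmatch rather than search.
    words = [word for word in words if WORD.fullmatch(word)]
    # group by word, then COUNT each bag.
    return Counter(words)

freq = word_freq(["the cat the", "cat's hat"])
```

Tokens containing anything other than word characters, such as `cat's`, are discarded by the filter step rather than cleaned up.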
  • 23. Join Data
      grunt> hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      grunt> sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
      grunt> inboth = JOIN hf BY word, sh BY word;
      grunt> STORE inboth INTO 'output3';
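To see what the JOIN emits, here is a small Python model of an inner join on the word field. The sample frequencies and the `inner_join` helper are invented for illustration; the point is that each output row is the left tuple followed by the matching right tuple, and words found in only one input disappear.

```python
from collections import defaultdict

# Hypothetical (freq, word) rows for the two books.
hf = [(10, "river"), (3, "raft"), (7, "the")]
sh = [(5, "the"), (2, "river"), (1, "detective")]

def inner_join(left, right, left_key, right_key):
    """Model of: JOIN left BY key, right BY key (inner join)."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Concatenate every left tuple with each right tuple sharing its key.
    return [l + r for l in left for r in index.get(l[left_key], [])]

inboth = inner_join(hf, sh, 1, 1)
```

The joined rows have four fields, (freq, word, freq, word), one pair from each input.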
  • 24. Set Difference (A - B: in A but not in B; here A = sh, B = hf)
      hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
      grouped = COGROUP hf BY word, sh BY word;
      not_in_hf = FILTER grouped BY COUNT(hf) == 0;
      out = FOREACH not_in_hf GENERATE FLATTEN(sh);
      STORE out INTO 'output4';
  • 25. Cogroup Data
    - Extends the idea of grouping to multiple collections
    - Instead of a (key, collection) pair, it now emits a key and a set of tuples from each of the multiple collections
    - With two sources of input it would be (key, collection1, collection2), where tuples from the first source will be in collection1 and tuples from the second source will be in collection2
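The (key, collection1, collection2) output of COGROUP, and the set-difference script built on it, can be sketched in Python like this. The `cogroup` helper and sample rows are hypothetical, and keys are sorted only to make the result deterministic; Pig makes no such ordering promise.

```python
# Hypothetical (freq, word) rows for the two books.
hf = [(10, "river"), (3, "raft"), (7, "the")]
sh = [(5, "the"), (2, "river"), (1, "detective")]

def cogroup(left, right, key_index):
    """Model of COGROUP: one (key, left_bag, right_bag) triple per distinct key."""
    keys = {row[key_index] for row in left} | {row[key_index] for row in right}
    return [
        (
            key,
            [row for row in left if row[key_index] == key],
            [row for row in right if row[key_index] == key],
        )
        for key in sorted(keys)
    ]

grouped = cogroup(hf, sh, 1)

# Model of: FILTER grouped BY COUNT(hf) == 0
not_in_hf = [triple for triple in grouped if len(triple[1]) == 0]

# Model of: FOREACH not_in_hf GENERATE FLATTEN(sh)
out = [row for _, _, sh_bag in not_in_hf for row in sh_bag]
```

A key that is missing from one input still appears in the result, paired with an empty bag, which is what makes the set difference expressible as a filter.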
  • 26. Data Types Supported
    - int, long, double, chararray, bytearray
    - map, tuple (ordered), bag (unordered)
  • 27. Data Type Declaration
    Explicit data type declaration:
      hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);
    Type inference (freq is cast to int because it is multiplied by an int):
      hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      weighted = FOREACH hf GENERATE freq * 100;
  • 29. Custom Extensions
    - User-defined functions can be called from Pig scripts
    - Nested operations can be carried out:
        FOREACH grouped {
          sorted = ORDER hf BY counted;
          GENERATE group, CustomFunction(sorted);
        }
    - Flow can be split:
        SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
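One detail of the SPLIT statement above is easy to miss: a row that satisfies neither condition goes to neither output. The following Python sketch models that behaviour; the `split_by_sign` helper and the sample rows are hypothetical.

```python
# Hypothetical single-field rows; $0 is the first field of each tuple.
A = [(-2,), (0,), (5,), (-1,)]

def split_by_sign(rows):
    """Model of: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;"""
    negative = [row for row in rows if row[0] < 0]
    positive = [row for row in rows if row[0] > 0]
    return negative, positive

negative, positive = split_by_sign(A)
```

The row `(0,)` matches neither condition and so is absent from both outputs; a third branch with `$0 == 0` would be needed to keep it.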
  • 30. Questions?
    blog: shanky.org | twitter: @tshanky | [email protected]