© Hortonworks Inc. 2011 Pig programming is more fun: New features in Pig Daniel Dai (@daijy) Thejas Nair (@thejasn) Page 1

Pig programming is more fun: New features in Pig


DESCRIPTION

Over the last year, we have added many new language features to Pig, and Pig programming is now much easier than before. With Pig macros, we can write functions for Pig and modularize Pig programs. Pig embedding allows us to embed Pig statements in Python and make use of Python's rich language features, such as loops and branches. Java is no longer the only choice for writing Pig UDFs: we can also write UDFs in Python, JavaScript, and Ruby. Nested foreach and cross give us ways to manipulate data that were not possible before. We have also added a lot of syntactic sugar to simplify Pig syntax, for example direct syntax support for maps, tuples, and bags, and project-range expressions in foreach. We have also revived support for the illustrate command to ease debugging. In this paper, I will give an overview of all these features and illustrate how to use them to program more efficiently in Pig. I will also give concrete examples to demonstrate how the Pig language has evolved over time with these improvements.
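As a small illustration of the Python UDFs mentioned in the description, the sketch below mirrors the `concat` and `square` functions shown later in the slides. The fallback decorators are an assumption added here so the file can also be run with a plain Python interpreter; they are stand-ins, not Pig's own implementation, and are only defined when Pig has not already supplied `outputSchema`.

```python
# Stand-in decorators (an assumption for running outside Pig). They only
# record the schema string on the function and are skipped if the names
# already exist in the namespace.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

    def outputSchemaFunction(name):
        def wrap(func):
            func.outputSchemaFunction = name
            return func
        return wrap


@outputSchema("word:chararray")
def concat(word):
    # Return the input string repeated twice.
    return word + word


@outputSchemaFunction("squareSchema")
def square(num):
    # Propagate nulls: Pig passes a null field as None.
    if num is None:
        return None
    return num * num


def squareSchema(input_schema):
    # The output schema of square is the same as its input schema.
    return input_schema
```

Inside Pig, the same file would be registered as in the slides, with `register 'util.py' using jython as util;`, and called as `util.square(i)`.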

Citation preview

1. Pig programming is more fun: New features in Pig
Daniel Dai (@daijy), Thejas Nair (@thejasn)
Hortonworks Inc. 2011

2. What is Apache Pig?
  • Pig Latin, a high-level data processing language.
  • An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Architecting the Future of Big Data Page 2 Hortonworks Inc. 2011

3. Pig-latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
    USERS = load 'users' as (uid, age);
    USERS_20s = filter USERS by age >= 20 and age [...]

17. [...] average

18. Pig relation-as-scalar
Step 1 is like
    .. = load ..
    .. = group ..
    al_rel = foreach .. AVG(ltime) as avg_ltime;
Step 2 looks like
    page_views = load 'pviews.txt' as (url, ltime, ..);
    slow_views = filter page_views by ltime > avg_ltime

19. Pig relation-as-scalar
Getting the result of step 1 (avg_ltime) into step 2:
  • Join the result of step 1 with the page_views relation, or
  • Write the result into a file, then use a UDF to read it from the file
The Pig scalar feature now simplifies this:
    slow_views = filter page_views by ltime > al_rel.avg_ltime;
A runtime exception is raised if al_rel has more than one record.

20. UDF in Scripting Language
  • Benefit
    Use legacy code
    Use libraries in the scripting language
    Leverage Hadoop for non-Java programmers
  • Currently supported languages
    Python (0.8)
    JavaScript (0.8)
    Ruby (0.10)
  • Extensible interface
    Minimum effort to support another language

21. Writing a Python UDF
Write a Python UDF:
    @outputSchema("word:chararray")
    def concat(word):
        return word + word

    @outputSchemaFunction("squareSchema")
    def square(num):
        if num is None:
            return None
        return ((num)*(num))

    def squareSchema(input):
        return input
Invoke Python functions when needed:
    register 'util.py' using jython as util;
    B = foreach A generate util.square(i);
Type conversion:
  • Python simple type: Pig simple type
  • Python Array: Pig Bag
  • Python Dict: Pig Map
  • Python Tuple: Pig Tuple

22. Use NLTK in Pig
Example:
    register 'nltk_util.py' using jython as nltk;
    B = foreach A generate nltk.tokenize(sentence);
nltk_util.py:
    import nltk
    porter = nltk.PorterStemmer()

    @outputSchema("words:{(word:chararray)}")
    def tokenize(sentence):
        tokens = nltk.word_tokenize(sentence)
        words = [porter.stem(t) for t in tokens]
        return words
[Diagram: "Pig eats everything", then Tokenize, then Stemming, giving (Pig) (eat) (everything)]

23. Comparison with Pig Streaming
  • Syntax
    Pig Streaming: B = stream A through `perl sample.pl`;
    Scripting UDF: B = foreach A generate myfunc.concat(a0, a1), a2;
  • Input/Output
    Pig Streaming: stdin/stdout; entire relation
    Scripting UDF: function parameter/return value; particular fields
  • Type conversion
    Pig Streaming: need to parse input and convert types
    Scripting UDF: type conversion is automatic
  • Modularize
    Pig Streaming: every streaming operator needs a separate script
    Scripting UDF: organize the functions into modules

24. Writing a Script Engine
Writing a bridge UDF:
    class JythonFunction extends EvalFunc<Object> {
        public Object exec(Tuple tuple) {
            // Convert Pig input into Python
            PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
            // Invoke Python UDF
            PyObject result = f.__call__(params);
            // Convert result to Pig
            return JythonUtils.pythonToPig(result);
        }
        public Schema outputSchema(Schema input) {
            PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
            return Utils.getSchemaFromString(outputSchemaDef.toString());
        }
    }

25. Writing a Script Engine
Register scripting UDF:
    register 'util.py' using jython as util;
What happens in Pig:
    class JythonScriptEngine extends ScriptEngine {
        public void registerFunctions(String path, String namespace, PigContext pigContext) {
            // myudf.py             name      registered as
            // def square(num):     square    JythonFunction(square)
            // def concat(word):    concat    JythonFunction(concat)
            // def count(bag):      count     JythonFunction(count)
        }
    }

26. Algebraic UDF in JRuby
    class SUM < AlgebraicPigUdf
      output_schema Schema.long
      # Initial Function
      def initial num
        num
      end
      # Intermediate Function
      def intermed num
        num.flatten.inject(:+)
      end
      # Final Function
      def final num
        intermed(num)
      end
    end

27. Pig Embedding
  • Embed Pig inside a scripting language
    Python
    JavaScript
  • Algorithms which cannot be completed using one Pig script
    Iterative algorithms: PageRank, Kmeans, Neural Network, Apriori, etc.
    Parallel independent execution: Ensemble
    Divide and Conquer
    Branching

28. Pig Embedding
    from org.apache.pig.scripting import Pig

    input = ":INPATH:/singlefile/studenttab10k"
    # Compile Pig script
    P = Pig.compile("""A = load '$in' as (name, age, gpa);
                       store A into 'output';""")
    # Bind variables
    Q = P.bind({'in': input})
    # Launch Pig script
    stats = Q.runSingle()
    # Iterate result
    result = stats.result('A')
    for t in result.iterator():
        print t

29. Convergence Example
    P = Pig.compile("""DEFINE myudf MyUDF('$param');
                       A = load 'input';
                       B = foreach A generate MyUDF(*);
                       store B into 'output';""")
    while True:
        # Bind to new parameter
        Q = P.bind({'param': new_parameter})
        results = Q.runSingle()
        iter = results.result("result").iterator()
        # Convergence check
        if converged:
            break
        # Change parameter
        new_parameter = xxxxxx

30. Pig Embedding
Running an embedded Pig script:
    pig sample.py
What happens within Pig?
    while True:
        Q = P.bind()
        results = Q.runSingle()
[Diagram labels: sample.py, Python Script, Jython, While Loop, Pig Script, Pig, converge?, End]

31. Nested Operator
  • Nested operator: an operator inside foreach
        B = group A by name;
        C = foreach B {
            C0 = limit A 10;
            generate flatten(C0);
        }
  • Nested operators supported prior to Pig 0.10
    DISTINCT, FILTER, LIMIT, and ORDER BY
  • New operators added in 0.10
    CROSS, FOREACH

32. Nested Cross/ForEach
    C = CoGroup A, B;
    D = ForEach C {
        X = Cross A, B;
        Y = ForEach X generate CONCAT(f1, f2);
        Generate Y;
    }
[Diagram:
    A = (i0, a), (i0, b)
    B = (i0, 0), (i0, 1)
    CoGroup A, B:    C = (i0, {(a), (b)}, {(0), (1)})
    Cross A, B:      (a, 0), (a, 1), (b, 0), (b, 1)
    ForEach CONCAT:  (i0, {(a0), (a1), (b0), (b1)})]

33. HCatalog Integration
[Diagram: HCatalog shared by Pig, Map Reduce, and Hive]
  • HCatLoader/HCatStorage
    Load/Store from HCatalog from Pig
  • HCatalog DDL Integration (Pig 0.11)
        sql create table student(name string, age int, gpa double);

34. Misc Loaders
  • HBaseStorage: Pig builtin
  • AvroStorage: Piggybank
  • CassandraStorage: in Cassandra code base
  • MongoStorage: in Mongo DB code base
  • JsonLoader/JsonStorage: Pig builtin

35. Talend Enterprise Data Integration
Talend Open Studio for Big Data
  • Feature-rich Job Designer
  • Rich palette of pre-built templates
  • Supports HDFS, Pig, Hive, HBase, HCatalog
  • Apache-licensed, bundled with HDP
Key benefits
  • Graphical development
  • Robust and scalable execution
  • Broadest connectivity to support all systems: 450+ components
  • Real-time debugging

36. Questions
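
The three-stage contract behind the algebraic SUM UDF on slide 26 (initial, intermediate, final) can be mimicked in plain Python to show why the split is safe: partial sums computed on separate chunks combine to the same total as a single pass. This is only a sketch of the idea. The `run_algebraic` driver and its `chunk_size` parameter are hypothetical stand-ins for the map, combine, and reduce phases and are not part of Pig's API.

```python
def initial(num):
    # Initial function: called per input value, passes it through unchanged.
    return num


def intermed(nums):
    # Intermediate function: fold a batch of values into one partial sum.
    return sum(nums)


def final(nums):
    # Final function: fold the partial sums with the same logic.
    return intermed(nums)


def run_algebraic(values, chunk_size):
    """Hypothetical driver that imitates map, combine, and reduce phases."""
    partials = []
    for start in range(0, len(values), chunk_size):
        chunk = [initial(v) for v in values[start:start + chunk_size]]
        partials.append(intermed(chunk))
    return final(partials)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    # The total should not depend on how the input is chunked.
    print(run_algebraic(data, 2) == run_algebraic(data, 5))
```

Because addition is associative, the result does not depend on `chunk_size`, which is the property that lets an engine apply the intermediate step in a combiner.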