Pig and Hive. Csaba Toth, Central California .NET User Group Meeting. Date: April 17th, 2014. Location: Bitwise Industries, Fresno

Hive and Pig for .NET User Group


Introduction to Pig and Hive


Page 1: Hive and Pig for .NET User Group

Pig and Hive

Csaba Toth

Central California .NET User Group Meeting

Date: April 17th, 2014

Location: Bitwise Industries, Fresno

Page 2: Hive and Pig for .NET User Group

Agenda

• Short recap of Hadoop and MapReduce
• Pig and Hive
• Recommendation engine
• Demos 1: Exercises with on-premise Hadoop emulator
• Demos 2: Azure HDInsight

Page 3: Hive and Pig for .NET User Group

Hadoop

• Hadoop is an open-source software framework that supports data-intensive distributed applications.

• Has two main pieces:
– Storing large amounts of data: HDFS, the Hadoop Distributed File System
– Processing large amounts of data: an implementation of the MapReduce programming model

Page 4: Hive and Pig for .NET User Group

Hadoop

• All of this is done in a cost-effective way: Hadoop manages a cluster of commodity hardware.

• The cluster is composed of a single master node and multiple worker nodes

• It is written in Java and runs on JVMs

Page 5: Hive and Pig for .NET User Group

HDFS visually

[Diagram: a name node holding the metadata store, and three data nodes (Node 1, Node 2, Node 3), each storing a copy of Block A and Block B.]

Page 6: Hive and Pig for .NET User Group

[Diagram: the name node hosts the Jobtracker, which handles job / task management; heartbeat signals and communication flow between it and three data nodes. Each data node runs a Tasktracker with its own map and reduce tasks (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3).]

Page 7: Hive and Pig for .NET User Group

MapReduce

• Hadoop leverages the functional programming model of map/reduce.

• It moves away from shared resources and the related synchronization and contention issues

• This makes it inherently scalable and well suited to processing large data sets with distributed computing on clusters of nodes.

• The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel.

• Hadoop leverages a distributed file system to store the data on various nodes.

Page 8: Hive and Pig for .NET User Group

MapReduce

• It is about two functions: map and reduce
1. Map step:
– Processes key/value pairs and generates a set of intermediate key/value pairs from them
2. Shuffle step:
– Groups all intermediate values associated with the same intermediate key into one set
3. Reduce step:
– Processes the intermediate values associated with the same intermediate key and produces a set of values based on the groups (usually some kind of aggregate)
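
The three steps can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hadoop code: the function names (`map_step`, `shuffle_step`, `reduce_step`, `word_count`) and the sample sentences are made up for this example, and on a real cluster the map and reduce calls would be distributed across worker nodes.

```python
from collections import defaultdict


def map_step(line):
    """Map: emit an intermediate (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_step(pairs):
    """Shuffle: group all intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_step(key, values):
    """Reduce: aggregate the grouped values for one key (here: a sum)."""
    return key, sum(values)


def word_count(lines):
    # On a cluster the map calls would run in parallel on different nodes;
    # here they simply run one after another in a single process.
    intermediate = [pair for line in lines for pair in map_step(line)]
    grouped = shuffle_step(intermediate)
    return dict(reduce_step(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Made-up input lines, for illustration only.
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

Word count is the usual first example because the reduce step is a plain sum; other aggregates (maximum, average, list of values) fit the same three-step shape.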

Page 9: Hive and Pig for .NET User Group

Word count

http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png

Page 10: Hive and Pig for .NET User Group

Map, Shuffle, and Reduce

https://mm-tom.s3.amazonaws.com/blog/MapReduce.png

Page 11: Hive and Pig for .NET User Group

Hadoop Ecosystem and MapReduce

• Writing MapReduce jobs in Java, C#, or other languages is useful, but it is the hard way
• Several domain-specific higher-level languages and frameworks exist
– They allow complicated tasks to be phrased far more simply and concisely than in Java
– These languages translate everything into MapReduce jobs under the hood, transparently
• We’ll see two examples: Hive and Pig

Page 12: Hive and Pig for .NET User Group

Hive and Pig

• Hive (http://hive.apache.org/)
– Provides an SQL-like approach
– Works best if the input data at least conforms to some schema, so it can be consumed (SQL requires tabular data: tables with columns)
– Good for someone coming from an SQL background
• Pig (http://pig.apache.org/)
– The syntax is closer to a programming language
– It defines a series of transformations, projecting one schema into another
– Also works best if the data is not totally free-form and has some kind of schema
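
To give a feel for the SQL-like approach, here is a small sketch that queries word-count style data with an ordinary SQL statement. It uses Python's built-in `sqlite3` module purely as a stand-in, since it needs no cluster; it is not Hive, and the table name `wordcount`, its columns, and the sample rows are assumptions made up for this illustration.

```python
import sqlite3

# Made-up rows in the shape of a word-count result: (word, occurrences).
SAMPLE_ROWS = [("war", 2), ("the", 5), ("peace", 1), ("and", 4)]


def top_words(rows, limit):
    """Load the rows into an in-memory table and return the most frequent words."""
    conn = sqlite3.connect(":memory:")
    try:
        # The data must fit a schema (named, typed columns) before SQL can query it.
        conn.execute("CREATE TABLE wordcount (word TEXT, occurrences INTEGER)")
        conn.executemany("INSERT INTO wordcount VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT word, occurrences FROM wordcount "
            "ORDER BY occurrences DESC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(top_words(SAMPLE_ROWS, 2))
```

The point of the sketch is the shape of the query: once the data has columns, sorting, filtering, and aggregating read like everyday SQL rather than hand-written map and reduce functions.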

Page 13: Hive and Pig for .NET User Group

Hadoop Ecosystem / Architecture

[Diagram: layers of the Hadoop ecosystem, in the order shown on the slide]

• Log Data, RDBMS
• Data Integration Layer: Flume, Sqoop
• Storage Layer (HDFS)
• Computing Layer (MapReduce)
• Advanced Query Engine (Hive, Pig)
• Data Mining (Pegasus, Mahout)
• Index, Searches (Lucene)
• DB drivers (Hive driver)
• Presentation Layer: Web Browser (JS)

Page 14: Hive and Pig for .NET User Group

Simple recommendation engine

• Sites such as Amazon.com and Netflix.com use complex algorithms

• But the underlying concepts are simple: finding correlations in the data

• Pearson coefficient, Excel CORREL function
• Demo
• C# naïve implementation
• Recommendation using Pig


Page 16: Hive and Pig for .NET User Group

Pearson coefficient

Pearson product-moment correlation coefficient value   Comments
-1                                                     Perfectly correlated data, but as one rises the other decreases
0                                                      Uncorrelated data
+1                                                     Perfectly correlated data

• DEMO

Page 17: Hive and Pig for .NET User Group

Ratings data

Name     The Lord of the Rings   The Chronicles of Narnia
Jack     2                       3
Mark     4                       4.5
Albert   4                       3.5
John     5                       5

• Pearson Correlation Coefficient: 0.8705715
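
The value above can be reproduced with a short function. This is a minimal sketch of the textbook Pearson product-moment formula (covariance divided by the product of the standard deviations), written in Python rather than the C# used in the demo; the function name `pearson` and the variable names are chosen for this example only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Ratings from the table above, in the order Jack, Mark, Albert, John.
lord_of_the_rings = [2, 4, 4, 5]
narnia = [3, 4.5, 3.5, 5]

if __name__ == "__main__":
    print(round(pearson(lord_of_the_rings, narnia), 7))  # 0.8705715
```

Excel's CORREL function computes the same quantity, so it is a handy way to cross-check small data sets like this one.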

Page 18: Hive and Pig for .NET User Group

Ratings data

Name of movie critic   Name of movie        Rating
Lisa Rose              Lady in the Water    2.5
Lisa Rose              Snakes on a Plane    3.5
Lisa Rose              Just My Luck         3
Lisa Rose              Superman Returns     3.5
Lisa Rose              You Me and Dupree    2.5
Lisa Rose              The Night Listener   3
Gene Seymour           Lady in the Water    3
Gene Seymour           Snakes on a Plane    3.5
Gene Seymour           Just My Luck         1.5
Gene Seymour           Superman Returns     5
Gene Seymour           The Night Listener   3
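
With data in this long (critic, movie, rating) form, a naive similarity score between two critics can be computed by correlating their ratings over the movies both have rated. The sketch below is one possible way to do that in Python; it is not the C# or Pig code from the demos, and the helper names `ratings_by_critic` and `similarity` are invented for this example. Returning 0.0 when there is too little overlap is a simplifying assumption.

```python
from math import sqrt

# The rows of the ratings table above: (critic, movie, rating).
RATINGS = [
    ("Lisa Rose", "Lady in the Water", 2.5),
    ("Lisa Rose", "Snakes on a Plane", 3.5),
    ("Lisa Rose", "Just My Luck", 3.0),
    ("Lisa Rose", "Superman Returns", 3.5),
    ("Lisa Rose", "You Me and Dupree", 2.5),
    ("Lisa Rose", "The Night Listener", 3.0),
    ("Gene Seymour", "Lady in the Water", 3.0),
    ("Gene Seymour", "Snakes on a Plane", 3.5),
    ("Gene Seymour", "Just My Luck", 1.5),
    ("Gene Seymour", "Superman Returns", 5.0),
    ("Gene Seymour", "The Night Listener", 3.0),
]


def ratings_by_critic(rows):
    """Turn the flat rows into {critic: {movie: rating}}."""
    table = {}
    for critic, movie, rating in rows:
        table.setdefault(critic, {})[movie] = rating
    return table


def similarity(table, critic_a, critic_b):
    """Pearson correlation over the movies rated by both critics."""
    shared = sorted(set(table[critic_a]) & set(table[critic_b]))
    if len(shared) < 2:
        return 0.0  # assumption: too little overlap counts as "no similarity"
    xs = [table[critic_a][movie] for movie in shared]
    ys = [table[critic_b][movie] for movie in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a critic with constant ratings carries no signal
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    table = ratings_by_critic(RATINGS)
    print(similarity(table, "Lisa Rose", "Gene Seymour"))
```

Only the five movies that both critics rated enter the calculation; "You Me and Dupree" is skipped because Gene Seymour has no rating for it in this table.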

Page 19: Hive and Pig for .NET User Group

Simple recommendation

• DEMO
– C# implementation
– Simple naïve algorithm
– Non-parallel, not Hadoop

Page 20: Hive and Pig for .NET User Group

Pig

• A data-flow language
• Express the processing as a series of transformations
• Steps are translated into MapReduce jobs
• We can look at it like LINQ
• We’ll learn it by example
– Pig’s command line shell: grunt
– Pig’s language: Pig Latin

Page 21: Hive and Pig for .NET User Group

Pig

• Load and store:
– Can load/store data from/to HDFS
• Relations
– The transformations are performed on ‘relations’. That is what Pig calls its collections; do not confuse this with traditional relational DB terminology!
– Think of it as a table with rows and columns of data
– When grouped, relations can contain associative key-values

Page 22: Hive and Pig for .NET User Group

Pig

• Joins
– Can accomplish joins in a conceptually intuitive manner (using a common key)
• Filter
– Can apply filters to data. A predicate should be specified
• Projection
– Can project from an existing collection, that is, form a new collection the way an SQL SELECT does. That is the “GENERATE” clause of Pig’s FOREACH statement

Page 23: Hive and Pig for .NET User Group

Pig

• Grouping
– Can group data by one or more keys. Once grouped, you can maintain the hierarchical structure in the relation throughout the transformations. Projections can be made, or sometimes you can flatten some of the hierarchy.
• Dump
– The DUMP statement outputs the contents of a relation to the console. Useful when experimenting in the Pig shell
• Extensible
– UDFs: User Defined Functions
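
The operations listed on the Pig slides (filter, grouping, projection, dump) can be mimicked in plain Python to show how a data flow is built up step by step. This is only an analogy, not Pig Latin and not how Pig executes anything: the helper names `filter_relation`, `group_relation`, and `average_rating_per_movie`, the 2.0 threshold, and the handful of sample rows (taken from the ratings table earlier) are assumptions made for this sketch.

```python
from collections import defaultdict

# A small "relation": tuples of (critic, movie, rating).
RATINGS = [
    ("Lisa Rose", "Lady in the Water", 2.5),
    ("Lisa Rose", "Superman Returns", 3.5),
    ("Gene Seymour", "Lady in the Water", 3.0),
    ("Gene Seymour", "Superman Returns", 5.0),
    ("Gene Seymour", "Just My Luck", 1.5),
]


def filter_relation(relation, predicate):
    """Analog of FILTER: keep only the tuples for which the predicate holds."""
    return [row for row in relation if predicate(row)]


def group_relation(relation, key):
    """Analog of GROUP: map each key to the collection of tuples that share it."""
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return dict(groups)


def average_rating_per_movie(relation):
    """A three-step data flow: filter, group, then project one value per group."""
    rated = filter_relation(relation, lambda row: row[2] >= 2.0)
    by_movie = group_relation(rated, key=lambda row: row[1])
    # Analog of a projection: generate (movie, average rating) from each group.
    return {
        movie: sum(row[2] for row in rows) / len(rows)
        for movie, rows in by_movie.items()
    }


if __name__ == "__main__":
    # Analog of DUMP: print the final relation to the console.
    for movie, average in average_rating_per_movie(RATINGS).items():
        print(movie, average)
```

Each step takes a collection and returns a new one, which is the same mental model as chaining LINQ operators; the difference in Pig is that the steps are compiled into MapReduce jobs.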

Page 24: Hive and Pig for .NET User Group

Simple recommendation

• DEMO
– Pig implementation
– Local pseudo cluster (HDInsight on-premise)

Page 25: Hive and Pig for .NET User Group

Hive DEMO

• Analyzing last time’s wordcount (warpeace) results

• Analyzing the recommendation engine results

Page 26: Hive and Pig for .NET User Group

HDInsight

• Microsoft’s Hadoop PaaS solution in the cloud
• Hortonworks Hadoop implementation
• New Azure portal is coming, in preview: http://azure.microsoft.com/en-us/services/preview/
• Build conference: 40 new functions in Azure
• Latest addition: ISS, Azure Intelligent Systems Service (https://connect.microsoft.com/site1132/)
• Analytics Platform System (APS): evolutionary combination of SQL Server PDW and Hadoop


Page 28: Hive and Pig for .NET User Group

Thanks for your attention!

Page 29: Hive and Pig for .NET User Group

Hadoop vs RDBMS

                        Hadoop / MapReduce                      RDBMS
Size of data            Petabytes                               Gigabytes
Integrity of data       Low                                     High (referential, typed)
Data schema             Dynamic                                 Static
Access method           Batch                                   Interactive and batch
Scaling                 Linear                                  Nonlinear (worse than linear)
Data structure          Unstructured                            Structured
Normalization of data   Not required                            Required
Query response time     Has latency (due to batch processing)   Can be near immediate