
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!


DESCRIPTION

The genesis of Hadoop was in analyzing massive amounts of data with a MapReduce framework. SQL-on-Hadoop followed shortly after that, paving the way to the whole schema-on-read notion. Discovering graph relationships in your data is the next logical step. Apache Giraph (modeled on Google’s Pregel) lets you apply the power of the BSP approach to unstructured data. In this talk we will focus on practical advice on how to get up and running with Apache Giraph, start analyzing simple data sets with built-in algorithms, and finally implement your own graph processing applications using the APIs provided by the project. We will then dive into how Giraph integrates with the Hadoop ecosystem (Hive, HBase, Accumulo, etc.) and also provide a whirlwind tour of the Giraph architecture.

Citation preview

Page 1: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Page 2: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Who’s this guy?

Page 3: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Roman Shaposhnik

• ASF junkie
• VP of Apache Incubator, former VP of Apache Bigtop
• Hadoop/Sqoop/Giraph committer
• Contributor across the Hadoop ecosystem

• Used to be root@Cloudera
• Used to be a PHB at Yahoo!
• Used to be a UNIX hacker at Sun Microsystems

Page 4: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Giraph in Action (MEAP)

http://manning.com/martella/

Page 5: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

What’s this all about?

Page 6: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Agenda

• A brief history of Hadoop-based bigdata management
• Extracting graph relationships from unstructured data
• A case for iterative and explorative workloads
• Bulk Synchronous Parallel to the rescue!
• Apache Giraph: a Hadoop-based BSP graph analysis framework
• Giraph application development
• Demos! Code! Lots of it!

Page 7: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

On day one Doug created HDFS/MR

Page 8: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Google papers

• GFS (file system)
  • distributed
  • replicated
  • non-POSIX

• MapReduce (computational framework)
  • distributed
  • batch-oriented (long jobs; final results)
  • data-gravity aware
  • designed for “embarrassingly parallel” algorithms

Page 9: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

One size doesn’t fit all

• Key-value approach (a minimal word-count sketch follows below)
  • map is how we get the keys
  • shuffle is how we sort the keys
  • reduce is how we get to see all the values for a key

• Pipeline approach
  • Intermediate results in a pipeline need to be flushed to HDFS
  • A very particular “API” for working with your data
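To make that key-value flow concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API (the class names TokenMapper and SumReducer are illustrative, not from the talk): map emits the keys, the shuffle groups and sorts them, and reduce sees every value for a key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map: emit (word, 1) -- this is "how we get the keys"
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String w : line.toString().split("\\s+")) {
        ctx.write(new Text(w), ONE);
      }
    }
  }

  // reduce: after the shuffle has grouped values by key, sum the counts for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(word, new IntWritable(sum));
    }
  }
}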

Page 10: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

It’s not about the size of your data; it’s about what you do with it!

Page 11: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Graph relationships

• Entities in your data: tuples
  • customer data
  • product data
  • interaction data

• Connections between entities: graphs
  • the social network of my customers
  • clustering of customers vs. products

Page 12: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Challenges

• Data is dynamic
  • No way of doing “schema on write”

• Combinatorial explosion of datasets
  • Relationships grow exponentially

• Algorithms become
  • explorative
  • iterative

Page 13: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Graph databases

• Plenty available
  • Neo4J, Titan, etc.

• Benefits
  • Tightly integrated systems with few moving parts
  • High performance on known data sets

• Shortcomings
  • Don’t integrate with HDFS
  • Combine storage and computational layers
  • A sea of APIs

Page 14: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Enter Apache Giraph

Page 15: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Key insights

• Keep state in memory for as long as needed
• Leverage HDFS as a repository for unstructured data
• Allow for maximum parallelism (shared nothing)
• Allow for arbitrary communications
• Leverage the BSP approach

Page 16: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Bulk Synchronous Parallel

[Diagram: BSP execution over time: rounds of local processing followed by communications, separated by global barriers #1, #2, and #3]

Page 17: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

BSP applied to graphs

[Diagram: a small example graph whose vertices are the Twitter handles @rhatr, @TheASF, and @c0sin]

Think like a vertex:
• I know my local state
• I know my neighbours
• I can send messages to vertices
• I can declare that I am done
• I can mutate graph topology
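As a rough sketch of how those abilities map onto the Giraph API (the type choices and the message-summing logic here are made up for illustration, not part of the talk):

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ThinkLikeAVertex extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    double sum = 0;
    for (DoubleWritable m : messages) {          // messages sent to me in the last superstep
      sum += m.get();
    }
    vertex.setValue(new DoubleWritable(sum));    // I know (and can update) my local state
    for (Edge<LongWritable, FloatWritable> e : vertex.getEdges()) {  // I know my neighbours
      sendMessage(e.getTargetVertexId(), vertex.getValue());         // I can send messages
    }
    // topology mutation would go through vertex.addEdge(...) or addEdgeRequest(...)
    vertex.voteToHalt();                         // I can declare that I am done
  }
}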

Page 18: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Bulk Synchronous Processing

[Diagram: supersteps over time: in each superstep vertices do individual processing & send messages, messages are delivered at the barrier, and supersteps #1, #2, ... continue until all vertices are “done”]

Page 19: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Giraph “Hello World”

// imports added for completeness (not shown in the transcript)
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class GiraphHelloWorld extends
    BasicComputation<IntWritable, IntWritable, NullWritable, NullWritable> {

  public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    // say hello from this vertex, then print the id of each of its neighbours
    System.out.println("Hello world from the: " + vertex.getId() + " : ");
    for (Edge<IntWritable, NullWritable> e : vertex.getEdges()) {
      System.out.println("  " + e.getTargetVertexId());
    }
    System.out.println("");
  }
}

Page 20: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Mighty four of Giraph API

BasicComputation<IntWritable,   // VertexID    -- vertex ref
                 IntWritable,   // VertexData  -- a vertex datum
                 NullWritable,  // EdgeData    -- an edge label datum
                 NullWritable>  // MessageData -- message payload
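For comparison, a PageRank-style computation would typically bind those same four slots differently; the fragment below is illustrative only (the class name is not from the talk, and it is declared abstract just to keep the sketch short):

import org.apache.giraph.graph.BasicComputation;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// long vertex ids, double ranks, float edge weights, double-valued messages
public abstract class SimplePageRank extends
    BasicComputation<LongWritable,   // VertexID
                     DoubleWritable, // VertexData
                     FloatWritable,  // EdgeData
                     DoubleWritable> // MessageData
{
  // compute() omitted
}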

Page 21: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

On circles and arrows

• You don’t even need a graph to begin with!
  • Well, ok, you need at least one node
• Dynamic extraction of relationships
  • EdgeInputFormat
  • VertexInputFormat
• Full integration with the Hadoop ecosystem
  • HBase/Accumulo, Gora, Hive/HCatalog
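As a hedged sketch of how that wiring looks in code (the two format classes below are ones I believe ship in org.apache.giraph.io.formats, but treat them as placeholders for whatever formats your data actually needs):

GiraphConfiguration conf = new GiraphConfiguration();
conf.setComputationClass(GiraphHelloWorld.class);
// build the graph on the fly from an edge list instead of a pre-materialized graph
conf.setEdgeInputFormatClass(IntNullTextEdgeInputFormat.class);
// write each vertex id and its value back out as text
conf.setVertexOutputFormatClass(IdWithValueTextOutputFormat.class);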

Page 22: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Anatomy of Giraph run

[Diagram: a Giraph job running as a Hadoop job: an InputFormat reads the graph (vertices 1, 2, 3 labelled @rhatr, @TheASF, @c0sin and their adjacency lists) from HDFS, mappers/reducers process it, and an OutputFormat writes results back to HDFS]

Page 23: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Anatomy of Giraph run: mappers or YARN containers

[Diagram: the same run with the graph (vertices 1, 2, 3: @rhatr, @TheASF, @c0sin) partitioned across workers running inside mappers or YARN containers, fed by an InputFormat and drained by an OutputFormat]

Page 24: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

A vertex view

[Diagram: a vertex’s view: a VertexID and VertexData, EdgeData attached to each outgoing edge, and incoming MessageData1 and MessageData2]

Page 25: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Turning Twitter into Facebook

[Diagram: a directed Twitter follower graph over @rhatr, @TheASF, @c0sin turned into an undirected, Facebook-style friendship graph over rhatr, TheASF, c0sin]

Page 26: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Ping thy neighbours

public void compute(Vertex<Text, DoubleWritable, DoubleWritable> vertex,
    Iterable<Text> ms) {
  if (getSuperstep() == 0) {
    // superstep 0: tell everyone I follow who I am
    sendMessageToAllEdges(vertex, vertex.getId());
  } else {
    // later supersteps: add a reverse edge for every follower I did not already know about
    for (Text m : ms) {
      if (vertex.getEdgeValue(m) == null) {
        vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE));
      }
    }
  }
  vertex.voteToHalt();
}
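The slide never shows where SYNTHETIC_EDGE comes from; given the DoubleWritable edge type above, a reasonable guess (purely an assumption, not from the deck) is a constant along these lines:

// assumption: a marker edge value for edges synthesized from incoming messages
private static final DoubleWritable SYNTHETIC_EDGE = new DoubleWritable(0.0d);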

Page 27: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Demo time!

Page 28: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

But I don’t have a cluster!

• Hadoop in pseudo-distributed mode
  • All Hadoop services on the same host (different JVMs)
• Hadoop-as-a-Service
  • Amazon’s EMR, etc.
• Hadoop in local mode

Page 29: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Prerequisites

• Apache Hadoop 1.2.1
• Apache Giraph 1.1.0-SNAPSHOT
• Apache Maven 3.x
• JDK 7+

Page 30: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Setting things up

$ curl hadoop.tar.gz | tar xzvf -
$ git clone git://git.apache.org/giraph.git ; cd giraph
$ mvn -Phadoop_1 package
$ tar xzvf *dist*/*.tar.gz

$ export HADOOP_HOME=/Users/shapor/dist/hadoop-1.2.1
$ export GIRAPH_HOME=/Users/shapor/dist/
$ export HADOOP_CONF_DIR=$GIRAPH_HOME/conf
$ PATH=$HADOOP_HOME/bin:$GIRAPH_HOME/bin:$PATH

Page 31: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Setting project up (maven)

<dependencies>
  <dependency>
    <groupId>org.apache.giraph</groupId>
    <artifactId>giraph-core</artifactId>
    <version>1.1.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>

Page 32: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Running it

$ mvn package
$ giraph target/*.jar giraph.GiraphHelloWorld \
    -vip src/main/resources/1 \
    -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat \
    -w 1 \
    -ca giraph.SplitMasterWorker=false,giraph.logLevel=error

Page 33: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Testing it

public void testNumberOfVertices() throws Exception {
  GiraphConfiguration conf = new GiraphConfiguration();
  conf.setComputationClass(GiraphHelloWorld.class);
  conf.setVertexInputFormatClass(
      TextDoubleDoubleAdjacencyListVertexInputFormat.class);
  // run the whole computation in-process, no cluster required
  Iterable<String> results = InternalVertexRunner.run(conf, graphSeed);
}
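The snippet stops short of seeding the graph or asserting anything. A hedged completion might look like the following; the adjacency-list layout is my assumption and should be checked against the javadoc of TextDoubleDoubleAdjacencyListVertexInputFormat:

// assumption: tab-separated adjacency lines "id<TAB>value<TAB>neighbour<TAB>edgeValue..."
String[] graphSeed = new String[] {
    "seattle\t1.0\tla\t4.0",
    "la\t1.0\tseattle\t4.0"
};
Iterable<String> results = InternalVertexRunner.run(conf, graphSeed);
org.junit.Assert.assertNotNull(results);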

Page 34: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Simplified view: mappers or YARN containers

[Diagram: simplified view of the same run: an InputFormat feeding workers (vertices 1, 2, 3: @rhatr, @TheASF, @c0sin) hosted in mappers or YARN containers, with an OutputFormat writing the results]

Page 35: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Master and master compute

[Diagram: graph partitions (vertices 1, 2, 3: @rhatr, @TheASF, @c0sin) spread across workers, an active master running compute(), a standby master, and ZooKeeper coordinating them]

Page 36: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Master compute

• Runs before slave compute()
• Has a global view
• A place for aggregator manipulation
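A hedged sketch of what such a master compute can look like; the aggregator name, the halting condition, and the class name are all illustrative, not from the talk:

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

public class MyMasterCompute extends DefaultMasterCompute {
  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator("sum", LongSumAggregator.class);  // aggregator manipulation
  }

  @Override
  public void compute() {
    // runs before the workers' compute() in every superstep, with a global view
    LongWritable total = getAggregatedValue("sum");
    if (getSuperstep() > 1 && total.get() == 0) {
      haltComputation();  // nothing changed globally, so stop the whole job
    }
  }
}

The class would then be registered with something like conf.setMasterComputeClass(MyMasterCompute.class) alongside the computation class.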

Page 37: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Aggregators

• “Shared variables”
• Each vertex can push values to an aggregator
• Master compute has full control
  (a worker-side sketch follows the diagram below)

[Diagram: vertices 1, 2, 3 (@rhatr, @TheASF, @c0sin) pushing values to “min:” and “sum:” aggregators held by the active master]
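On the worker side, vertices push into those aggregators from compute(); a minimal sketch, assuming LongWritable vertex ids and with the aggregator names and pushed values made up for illustration:

// inside a Computation's compute(...) body
aggregate("sum", new LongWritable(vertex.getNumEdges()));   // each vertex contributes a value
aggregate("min", new LongWritable(vertex.getId().get()));
// the value the master aggregated in the previous superstep is readable here too
LongWritable globalSum = getAggregatedValue("sum");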

Page 38: Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Questions?