First NL-HUG: Large-scale data processing at SARA with Apache Hadoop


Large-scale data processing
[at SARA]
[with Apache Hadoop]

Evert Lammerts
February 9, 2012, Netherlands Hadoop User Group

Who's who?

Who has worked on scale? E.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?

>= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?

In this talk

Why large-scale data processing?

An introduction to scale @ SARA

An introduction to Hadoop & MapReduce

Hadoop @ SARA

Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA

(Jimmy Lin, University of Maryland / Twitter, 2011)

(IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)

s/knowledge/data/g*

HTTP logs, Click data, Query logs, CRM data, Financial data, Social networks, Archives, Crawls, and many more

You already have your data

(*Jimmy Lin, University of Maryland / Twitter, 2011)

Data-processing as a commodity

Cheap Clusters

Simple programming models

Easy-to-learn scripting

Anybody with the know-how can generate insights!

Note: the know-how = Data Science

DevOps

Programming algorithms

Domain knowledge

Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA

SARA
the national center for scientific computing

Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization

Large-scale data != new

Different types of computing

Parallelism

Data parallelism

Task parallelism

Architectures

SIMD: Single Instruction Multiple Data

MIMD: Multiple Instruction Multiple Data

MISD: Multiple Instruction Single Data

SISD: Single Instruction Single Data (Von Neumann)

Parallelism: Amdahl's law
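Amdahl's law caps the speedup of a program by its serial fraction, no matter how many processors you add. A minimal illustrative sketch (the function name and the 95%/100-processor numbers are my own example, not from the slides):

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
# parallelizable fraction of the work and n the number of processors.
def amdahl_speedup(parallel_fraction, n_processors):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even a program that is 95% parallel tops out below 17x on 100 processors,
# because the remaining 5% serial part dominates.
print(round(amdahl_speedup(0.95, 100), 1))  # 16.8
```

This is why data parallelism is attractive: when each record can be processed independently, the serial fraction stays tiny and speedup scales nearly linearly with nodes.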

Data parallelism

Compute @ SARA

What's different about Hadoop?

No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism

Separating the what from the how

(NYT, 14/06/2006)

Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA

A bit of history

2002: Nutch*

2004: MR/GFS**

2006: Hadoop

* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
   http://labs.google.com/papers/gfs.html

http://wiki.apache.org/hadoop/PoweredBy

2010 - 2012: A Hype in Production

Core principles

Scale out, not up

Move processing to the data

Process data sequentially, avoid random reads

Seamless scalability

(Jimmy Lin, University of Maryland / Twitter, 2011)

A typical data-parallel problem in abstraction

Iterate over a large number of records

Extract something of interest

Create an ordering in intermediate results

Aggregate intermediate results

Generate output

MapReduce: functional abstraction of step 2 & step 4

(Jimmy Lin, University of Maryland / Twitter, 2011)

MapReduce

Programmer specifies two functions:

map(k, v) *

reduce(k', v') *

All values associated with a single key are sent to the same reducer

The framework handles the rest

The rest?

Scheduling, data distribution, ordering, synchronization, error handling...
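The two functions above can be sketched as plain Python generators, with a tiny in-process driver standing in for the framework's shuffle-and-sort. This is a word-count sketch of the programming model, not the actual Hadoop API (`map_fn`, `reduce_fn`, and `run` are my own names):

```python
from collections import defaultdict
from itertools import chain

# map(k, v): for each input record, emit intermediate (k', v') pairs.
# Here: for every word in a line, emit (word, 1).
def map_fn(_, line):
    for word in line.split():
        yield (word.lower(), 1)

# reduce(k', [v']): aggregate all values that share one key.
# Here: sum the counts for one word.
def reduce_fn(word, counts):
    yield (word, sum(counts))

# Stand-in for "the rest": group intermediate pairs by key (the shuffle),
# sort keys, and feed each group to the reducer.
def run(records):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(k, v) for k, v in records):
        groups[key].append(value)
    return dict(chain.from_iterable(
        reduce_fn(k, vs) for k, vs in sorted(groups.items())))

print(run([(0, "the quick brown fox"), (1, "the lazy dog")]))
```

On a real cluster the driver above is what the framework provides: it distributes the map tasks across nodes, shuffles all pairs with the same key to one reducer, and handles failures along the way.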

An overview of a Hadoop cluster

The ecosystem

Hbase, Hive, Pig, HCatalog, Giraph, Elephantbird, and many others...

Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA

Timeline

2009: Piloting Hadoop on Cloud
2010: Test cluster available for scientists
  6 machines * 4 cores / 24 TB storage / 16 GB RAM
  Just me!
2011: Funding granted for production service
2012: Production cluster available (~March)
  72 machines * 8 cores / 8 TB storage / 64 GB RAM
  Integration with Kerberos for secure multi-tenancy
  3 devops, team of consultants

Architecture

Components

Hadoop, Hive, Pig, Hbase, HCatalog - others?

What are scientists doing?

Information Retrieval

Natural Language Processing

Machine Learning

Econometrics

Bioinformatics

Computational Ecology / Ecoinformatics

Machine learning: Infrawatch, Hollandse Brug

Structural health monitoring

145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data

(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
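The multiplication on the slide works out to hundreds of billions of sensor readings per year. A back-of-the-envelope check (samples only; the slides don't state bytes per sample, so no storage figure is claimed):

```python
# Hollandse Brug monitoring: 145 sensors sampled at 100 Hz, year-round.
sensors, hz = 145, 100
seconds_per_year = 60 * 60 * 24 * 365
samples_per_year = sensors * hz * seconds_per_year
print(f"{samples_per_year:.2e} samples/year")  # 4.57e+11
```

Roughly 4.6 * 10^11 readings a year from a single bridge: well past what fits comfortably on one machine, which is the point of the slide.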

And others: NLP & IR

e.g. ClueWeb: a ~13.4 TB webcrawl

e.g. Twitter gardenhose data

e.g. Wikipedia dumps

e.g. del.icio.us & flickr tags

Finding named entities: [person company place] names

Creating inverted indexes

Piloting real-time search

Personalization

Semantic web

Interest from industry

We're opening shop. Come and pilot.

Final thoughts

The tide rises, data is not getting less, let's ride that wave!

Hadoop is the first to provide commodity computing

Hadoop is not the only one

Hadoop is probably not the best

Hadoop has momentum

And how many infrastructures do we need?

MapReduce fits surprisingly well as a programming model for data-parallelism

The data center is your computer

Where is the data scientist? Much to learn & teach!

Any questions?

[email protected]
@eevrt
@sara_nl
