Large-scale data processing
[at SARA]
[with Apache Hadoop]
Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group
Who's who?
Who has worked on scale? e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?
>= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
In this talk
Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA
Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA
(Jimmy Lin, University of Maryland / Twitter, 2011)
(IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
s/knowledge/data/g*
HTTP logs, Click data, Query logs, CRM data, Financial data, Social networks, Archives, Crawls, and many more
You already have your data
(*Jimmy Lin, University of Maryland / Twitter, 2011)
Data-processing as a commodity
Cheap Clusters
Simple programming models
Easy-to-learn scripting
Anybody with the know-how can generate insights!
Note: the know-how = Data Science
DevOps
Programming algorithms
Domain knowledge
Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA
SARA
the national center for scientific computing
Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
Large-scale data != new
Different types of computing
Parallelism
Data parallelism
Task parallelism
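The two flavors can be sketched in a few lines of Python. This is an illustrative toy, not from the talk: the records and the thread pool sizes are made up.

```python
from multiprocessing.dummy import Pool  # thread-based pool from the stdlib

# Hypothetical toy data.
records = [0, 1, 2, 3, 4, 5, 6, 7]

# Data parallelism: the SAME operation applied to different pieces of the data.
with Pool(4) as pool:
    squares = pool.map(lambda x: x * x, records)

# Task parallelism: DIFFERENT operations, here run concurrently over the same data.
with Pool(2) as pool:
    total = pool.apply_async(sum, (records,))
    largest = pool.apply_async(max, (records,))
    total, largest = total.get(), largest.get()

print(squares, total, largest)
```

Hadoop's MapReduce is squarely in the first camp: one function, many partitions of the input.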
Architectures
SIMD: Single Instruction Multiple Data
MIMD: Multiple Instruction Multiple Data
MISD: Multiple Instruction Single Data
SISD: Single Instruction Single Data (Von Neumann)
Parallelism: Amdahl's law
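Amdahl's law bounds the speedup of any program by its serial fraction: speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of workers. A minimal sketch (the 95% figure is an illustrative assumption, not a number from the slides):

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when only a fraction of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# Even with 95% parallel code and 1000 workers, speedup stays below 20x:
print(round(amdahl_speedup(0.95, 1000), 1))  # 19.6
```

This is why "seamless scalability" only holds for problems whose serial fraction is negligible, i.e. data-parallel ones.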
Data parallelism
Compute @ SARA
What's different about Hadoop?
No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism
Separating the what from the how
(NYT, 14/06/2006)
Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA
A bit of history
2002: Nutch*
2004: MR/GFS**
2006: Hadoop
* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
   http://labs.google.com/papers/gfs.html
http://wiki.apache.org/hadoop/PoweredBy
2010 - 2012: A Hype in Production
Core principles
Scale out, not up
Move processing to the data
Process data sequentially, avoid random reads
Seamless scalability
(Jimmy Lin, University of Maryland / Twitter, 2011)
A typical data-parallel problem in abstraction
Iterate over a large number of records
Extract something of interest
Create an ordering in intermediate results
Aggregate intermediate results
Generate output
MapReduce: functional abstraction of step 2 & step 4
(Jimmy Lin, University of Maryland / Twitter, 2011)
MapReduce
Programmer specifies two functions:
map(k, v) → (k', v')*
reduce(k', v'*) → (k'', v'')*
All values associated with a single key are sent to the same reducer
The framework handles the rest
The rest?
Scheduling, data distribution, ordering, synchronization, error handling...
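The contract above can be mimicked in plain Python to show what the framework does between map and reduce. This is a single-process sketch of the word-count pattern, not Hadoop's actual API; `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names.

```python
from collections import defaultdict

# map(k, v): emit intermediate (k', v') pairs -- here, (word, 1) per word.
def map_fn(key, line):
    return [(word, 1) for word in line.split()]

# reduce(k', [v']): aggregate all values that share a key.
def reduce_fn(word, counts):
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    # "Map" phase over all input records.
    for key, value in records:
        for k, v in map_fn(key, value):
            # Shuffle: group values by intermediate key (Hadoop handles this).
            groups[k].append(v)
    # "Reduce" phase: one call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce(enumerate(["to be or not to be"]), map_fn, reduce_fn)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a real cluster the same two functions run on many machines at once, with the grouping step performed as a distributed sort.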
An overview of a Hadoop cluster
The ecosystem
HBase, Hive, Pig, HCatalog, Giraph, Elephantbird, and many others...
Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA
Timeline
2009: Piloting Hadoop on Cloud
2010: Test cluster available for scientists
      6 machines * 4 cores / 24 TB storage / 16 GB RAM
      Just me!
2011: Funding granted for production service
2012: Production cluster available (~March)
      72 machines * 8 cores / 8 TB storage / 64 GB RAM
      Integration with Kerberos for secure multi-tenancy
      3 devops, team of consultants
Architecture
Components
Hadoop, Hive, Pig, HBase, HCatalog - others?
What are scientists doing?
Information Retrieval
Natural Language Processing
Machine Learning
Econometrics
Bioinformatics
Computational Ecology / Ecoinformatics
Machine learning: Infrawatch, Hollandse Brug
Structural health monitoring
145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data
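Multiplying the slide's factors out makes "large" concrete:

```python
sensors, sample_rate_hz = 145, 100
seconds_per_year = 60 * 60 * 24 * 365
readings_per_year = sensors * sample_rate_hz * seconds_per_year
print(readings_per_year)  # 457272000000, i.e. ~457 billion readings a year
```

At even a few bytes per reading, a single year of this one bridge is terabytes of raw sensor data.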
(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
And others: NLP & IR
e.g. ClueWeb: a ~13.4 TB webcrawl
e.g. Twitter gardenhose data
e.g. Wikipedia dumps
e.g. del.icio.us & flickr tags
Finding named entities: [person company place] names
Creating inverted indexes
Piloting real-time search
Personalization
Semantic web
Interest from industry
We're opening shop. Come and pilot.
Final thoughts
The tide rises, data is not getting less, let's ride that wave!
Hadoop is the first to provide commodity computing
Hadoop is not the only
Hadoop is probably not the best
Hadoop has momentum
And how many infrastructures do we need?
MapReduce fits surprisingly well as a programming model for data-parallelism
The data center is your computer
Where is the data scientist? Much to learn & teach!
Any questions?
[email protected]
@eevrt
@sara_nl