Handling Large Datasets at Google: Current Systems and Future Directions

Jeff Dean
Google Fellow
http://labs.google.com/people/jeff



Outline
• Hardware infrastructure
• Distributed systems infrastructure:
  – Scheduling system
  – GFS
  – BigTable
  – MapReduce
• Challenges and future directions
  – Infrastructure that spans all datacenters
  – More automation

Sample Problem Domains
• Offline batch jobs
  – Large datasets (PBs), bulk reads/writes (MB chunks)
  – Short outages acceptable
  – Web indexing, log processing, satellite imagery, etc.
• Online applications
  – Smaller datasets (TBs), small reads/writes (KBs)
  – Outages immediately visible to users; low latency vital
  – Web search, Orkut, GMail, Google Docs, etc.
• Many areas: IR, machine learning, image/video processing, NLP, machine translation, ...

Typical New Engineer
• Never seen a petabyte of data
• Never used a thousand machines
• Never really experienced machine failure

Our software has to make them successful.

Google’s Hardware Philosophy
Truckloads of low-cost machines
• Workloads are large and easily parallelized
• Care about perf/$, not absolute machine perf
• Even reliable hardware fails at our scale
• Many datacenters, all around the world
  – Intra-DC bandwidth >> inter-DC bandwidth
  – Speed of light has remained fixed over the last 10 yrs :)

Effects of Hardware Philosophy
• Software must tolerate failure
• Application’s particular machine should not matter
• No special machines – just 2 or 3 flavors

[Photo: Google’s hardware, 1999]

Current Design
• In-house rack design
• PC-class motherboards
• Low-end storage and networking hardware
• Linux
• + in-house software

The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~thousands of hard drive failures

Plus: slow disks, bad memory, misconfigured machines, flaky machines, etc.

Typical Cluster
• Every machine runs Linux plus a GFS chunkserver and a scheduler slave
• User apps (user app 1, user app 2) and BigTable servers are scheduled onto those same machines
• A few cluster-wide services coordinate everything: Chubby lock service, GFS master, cluster scheduling master, BigTable master

[Diagram, built up over several slides: Machines 1 … N, each running Linux with a GFS chunkserver and a scheduler slave; user apps and BigTable servers co-located on machines 1 and 2; the cluster-wide masters shown above the machines]

File Storage: GFS
• Master: manages file metadata
• Chunkserver: manages 64MB file chunks
• Clients talk to the master to open and find files
• Clients talk directly to chunkservers for data

[Diagram: clients contact the GFS master for metadata, then read chunks (C0, C1, C2, C3, C5, …) replicated across Chunkservers 1 … N]
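The control/data split above can be sketched in a few lines. This is a hypothetical illustration, not Google's actual API: the master serves only metadata (which chunkservers hold which 64 MB chunk), and the client then fetches bytes directly from a replica. All names here are invented for the example.

```python
# Toy sketch of the GFS read path: metadata from the master,
# data directly from a chunkserver. Names are illustrative.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS chunks are 64 MB

class Master:
    """Holds file metadata only; never touches file data."""
    def __init__(self):
        # filename -> list of (chunk_handle, [chunkserver ids]), one per chunk
        self.files = {}

    def lookup(self, filename, offset):
        """Return the chunk handle and replica locations covering `offset`."""
        return self.files[filename][offset // CHUNK_SIZE]

class Chunkserver:
    """Stores chunk data and serves reads directly to clients."""
    def __init__(self):
        self.chunks = {}  # chunk_handle -> bytes

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]

def gfs_read(master, chunkservers, filename, offset, length):
    # Step 1: ask the master where the data lives (metadata only).
    handle, locations = master.lookup(filename, offset)
    # Step 2: fetch the bytes straight from one replica.
    server = chunkservers[locations[0]]
    return server.read(handle, offset % CHUNK_SIZE, length)
```

A read spanning a chunk boundary would loop over multiple lookups; that detail is omitted for brevity.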

GFS Usage
• 200+ GFS clusters
• Managed by an internal service team
• Largest clusters:
  – 5000+ machines
  – 5+ PB of disk usage
  – 10000+ clients

Data Storage: BigTable
What is it, really?
• 10-ft view: row & column abstraction for storing data
• Reality: distributed, persistent, multi-level sorted map

BigTable Data Model
• Multi-dimensional sparse sorted map
  (row, column, timestamp) => value

[Diagram, built up over several slides: row “www.cnn.com”, column “contents:”, holding timestamped versions t3, t11, t17 of the value “<html>…”]
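The (row, column, timestamp) => value model above can be captured in a toy sketch (not Google's implementation): each cell keeps versioned values sorted by timestamp, and a read returns the newest version at or before the requested time.

```python
# Toy versioned cell map: put() stores a timestamped value,
# get() returns the newest version <= the requested timestamp.
import bisect

class SparseSortedMap:
    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp
        self.cells = {}

    def put(self, row, column, timestamp, value):
        bisect.insort(self.cells.setdefault((row, column), []),
                      (timestamp, value))

    def get(self, row, column, timestamp=float("inf")):
        """Newest value with version <= timestamp, or None if absent."""
        best = None
        for ts, value in self.cells.get((row, column), []):
            if ts <= timestamp:
                best = value
        return best
```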

Tablets
• Large tables are broken into tablets at row boundaries; each tablet serves a contiguous range of rows

[Diagram, built up over several slides: rows “aaa.com” … “cnn.com” (columns “contents:” = “<html>…”, “language:” = EN) … “cnn.com/sports.html” … “website.com” … “yahoo.com/kids.html” … “zuppa.com/menu.html”, partitioned into tablets; one tablet ends at “yahoo.com/kids.html” and the next begins with the following row]

Bigtable System Structure

Bigtable cell:
• Bigtable master: performs metadata ops + load balancing
• Bigtable tablet servers (many per cell): serve data
• Bigtable client + client library: Open() a table, then issue reads/writes directly to tablet servers; metadata ops go to the master

Built on shared infrastructure:
• Lock service (Chubby): holds metadata, handles master election
• GFS: holds tablet data, logs
• Cluster scheduling system: handles failover, monitoring
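The client-side routing on this slide can be sketched as follows. This is a hypothetical illustration: the client library locates the tablet server owning a row's contiguous range and sends reads/writes straight to it, leaving the master out of the data path (master metadata ops are not shown).

```python
# Toy client routing: each tablet owns rows up to and including
# last_row; reads and writes go directly to the owning server.

class TabletServer:
    def __init__(self):
        self.rows = {}

    def write(self, row, value):
        self.rows[row] = value

    def read(self, row):
        return self.rows.get(row)

class BigtableClient:
    def __init__(self, tablets):
        # tablets: sorted list of (last_row, server)
        self.tablets = tablets

    def _locate(self, row):
        for last_row, server in self.tablets:
            if row <= last_row:
                return server
        raise KeyError(row)

    def write(self, row, value):
        self._locate(row).write(row, value)

    def read(self, row):
        return self._locate(row).read(row)
```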

Some BigTable Features
• Single-row transactions: easy to do read/modify/write operations
• Locality groups: segregate columns into different files
• In-memory columns: random access to small items
• Suite of compression techniques: per locality group
• Bloom filters: avoid seeks for non-existent data
• Replication: eventual-consistency replication across datacenters, between multiple BigTable serving setups (master/slave & multi-master)
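The Bloom-filter bullet above is worth a small illustration: a Bloom filter answers membership queries with no false negatives, so a "definitely absent" answer lets a server skip a disk seek entirely. This is a generic minimal sketch; the sizes and hash construction are illustrative, not BigTable's.

```python
# Minimal Bloom filter: k hash positions per key, set on add(),
# all-checked on might_contain(). No false negatives; rare false
# positives depending on size and load.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive k positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```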

BigTable Usage
• 500+ BigTable cells
• Largest cells manage 6000TB+ of data, 3000+ machines
• Busiest cells sustain 500,000+ ops/second 24 hours/day, and peak much higher

Data Processing: MapReduce
• Google’s batch processing tool of choice
• Users write two functions:
  – Map: produces (key, value) pairs from input
  – Reduce: merges (key, value) pairs from Map
• Library handles data transfer and failures
• Used everywhere: Earth, News, Analytics, Search Quality, Indexing, …

Example: Document Indexing
• Input: set of documents D1, …, DN
• Map
  – Parse document D into terms T1, …, TN
  – Produces (key, value) pairs: (T1, D), …, (TN, D)
• Reduce
  – Receives list of (key, value) pairs for term T: (T, D1), …, (T, DN)
  – Emits single (key, value) pair: (T, (D1, …, DN))
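The indexing example above can be expressed as two small functions plus a toy driver. This is a sketch of the programming model, not Google's MapReduce library: a real run would shard map tasks across machines, with the library doing the shuffle.

```python
# Toy map/reduce for inverted indexing: map emits (term, doc_id),
# the driver groups map output by key (the "shuffle"), and reduce
# collapses each term's postings into one sorted list.
from collections import defaultdict

def map_fn(doc_id, text):
    for term in set(text.split()):
        yield term, doc_id

def reduce_fn(term, doc_ids):
    return term, sorted(doc_ids)

def run_mapreduce(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in map_fn(doc_id, text):
            groups[term].append(value)  # shuffle: group values by key
    return dict(reduce_fn(t, ids) for t, ids in sorted(groups.items()))
```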

MapReduce Execution

[Diagram: the MapReduce master coordinates map tasks 1-3, which read input from GFS and emit keyed values (k1:v, k2:v, k3:v, k4:v); a shuffle-and-sort phase groups values by key; reduce task 1 receives k1:v,v,v and k3:v, reduce task 2 receives k2:v and k4:v, and both write output back to GFS]

MapReduce Tricks / Features
• Data locality
• Multiple I/O data types
• Data compression
• Pipelined shuffle stage
• Fast sorter
• Backup copies of tasks
• # tasks >> # machines
• Task re-execution on failure
• Local or cluster execution
• Distributed counters

MapReduce Programs in Google’s Source Tree

[Chart: count of MapReduce programs, Jan-03 through Sep-07; y-axis 0 to 12,500, rising steadily over the period]

New MapReduce Programs Per Month

[Chart: new programs per month, Jan-03 through Sep-07; y-axis 0 to 700; a second build of the slide annotates the seasonal spikes as the “Summer intern effect”]

MapReduce in Google
Easy to use. Library hides complexity.

                           Mar ‘05    Mar ‘06    Sep ‘07
Number of jobs             72K        171K       2,217K
Average time (seconds)     934        874        395
Machine years used         981        2,002      11,081
Input data read (TB)       12,571     52,254     403,152
Intermediate data (TB)     2,756      6,743      34,774
Output data written (TB)   941        2,970      14,018
Average worker machines    232        268        394

Current Work
Scheduling system + GFS + BigTable + MapReduce work well within single clusters

Many separate instances in different data centers
– Tools on top deal with cross-cluster issues
– Each tool solves a relatively narrow problem
– Many tools => lots of complexity

Can next-generation infrastructure do more?

Next Generation Infrastructure
Truly global systems to span all our datacenters
• Global namespace with many replicas of data worldwide
• Support both consistent and inconsistent operations
• Continued operation even with datacenter partitions
• Users specify high-level desires:
  “99%ile latency for accessing this data should be <50ms”
  “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
• Increased utilization through automation
• Automatic migration, growing and shrinking of services
• Lower end-user latency
• Provide high-level programming model for data-intensive interactive services

Questions?

Further info:
• The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. SOSP ‘03.
• Web Search for a Planet: The Google Cluster Architecture. Luiz André Barroso, Jeffrey Dean, Urs Hölzle. IEEE Micro, 2003.
• Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. OSDI ‘06.
• MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI ‘04.
• Failure Trends in a Large Disk Drive Population. Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso. FAST ‘07.

http://labs.google.com/papers
http://labs.google.com/people/jeff or [email protected]