Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Page 1: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation
Page 2: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Windows Scalability: Technology, Terminology, Trends

Jim Gray
Distinguished Engineer, Research
Microsoft Corporation

Page 3: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Outline

Progress: an overview
Scale-Up technology trends
  Cpus, Memory, Disks, Networking
Scale-Out terminology:
  clones, racks/packs, farms, geoplex

Page 4: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Progress

Other speakers in this track will tell you
  Windows is #1
  How they did it,
  and how you too can do it.
Stepping back
  Huge progress in last 5 years.
  10x to 100x improvements
Now Windows has competitive high-end hardware
  32xSMP, 64bit addressing, 30GBps bus bandwidth, …
Software has evolved (32x smp, 256GB ram, 10 TB DB)
In next 5 years, expect 10x to 100x improvements

Page 5: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

The Recurring Theme

Windows improved 50% to 500%
Q: WHY?
A: Measure, Analyze, Improve
[Diagram: cycle of Measure, Analyze, Improve]

Self Tuning
Tradeoffs
  Buy memory locality & bandwidth with cpu (compress, pack, cluster)
  Trade memory for IO (caches)
Speedups
  Introduce fast path for common case
  Repack for smaller I-Cache footprint
Scalability
  remove / improve locks
  Cool hotspots cache / disk
  Examine spins and timeouts
  Affinity/Locality to improve caching

Page 6: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Scaleable Systems

Scale UP: grow by adding components to a single system.
Scale OUT: grow by adding more systems.
[Diagram labels: Scale UP, Scale OUT]

Page 7: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Outline

Progress: an overview
ScaleUp Nodes: technology trends
  Cpus, Memory, Disks, Networking
ScaleOut terminology:
  clones, racks/packs, farms, geoplex

Page 8: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

What’s REALLY New – Windows Scale Up

64 bit & TB size main memory
SMP on chip: everything’s smp
32… 256 SMP: locality/affinity matters
TB size disks
High-speed LANs
iSCSI and NAS competition

Page 9: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

64 bit – Why bother?

1966 Moore’s law:
  4x more RAM every 3 years.
  1 bit of addressing every 18 months
36 years later: 2 × 36/3 = 24 more bits
  Not exactly right, but…
  32 bits not enough for servers
  32 bits gives no headroom for clients
So, time is running out (has run out)
Good news:
  Itanium™ and Hammer™ are maturing
  And so is the base software (OS, drivers, DB, Web,...)
  Windows & SQL @ 256GB today!

Page 10: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Who needs 64-bit addressing? You! Need 64-bit addressing!

640K ought to be enough for anybody.
  Bill Gates, 1981
But that was 21 years ago == 2 × 21/3 = 14 bits ago.
20 bits + 14 bits = 34 bits so..
16GB ought to be enough for anybody
  Jim Gray, 2002
34 bits > 31 bits so… 34 bits == 64 bits
YOU need 64 bit addressing!
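
The arithmetic on this slide and the previous one can be checked with a short sketch. The function names are illustrative and not from the talk; the only inputs are the slide's own rule of one address bit every 18 months and the 20-bit starting point for 1981.

```python
def extra_address_bits(years: float, months_per_bit: float = 18.0) -> float:
    """Address bits added over `years`, at one bit every `months_per_bit` months."""
    return years * 12.0 / months_per_bit


def addressable_gb(bits: int) -> int:
    """Size of a `bits`-bit address space in GB (2**30 bytes)."""
    return 2 ** bits // 2 ** 30


bits_1981 = 20  # slide's starting point: 640K fits in a 20-bit address space
bits_2002 = bits_1981 + round(extra_address_bits(21))

print(extra_address_bits(36))                # bits added since 1966
print(bits_2002, addressable_gb(bits_2002))  # bits needed in 2002, and the GB that implies
```

Under these assumptions the 2002 requirement already exceeds what a 31-bit user address space can hold, which is the slide's point.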

Page 11: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

64 bit – why bother?

Memory intensive calculations:
  You can trade memory for IO and processing
Example: Data Analysis & Clustering at JHU
  in memory CPU time is ~ N log N, N ~ 100M
  Disk M chunks → time ~ M^2
  must run many times
Now running on
  HP Itanium
  Windows.Net Server 2003
  SQL Server

[Figure: CPU time (hrs, log scale from 1 to 100,000) vs. number of galaxies in millions (0 to 100), one curve per memory size of 1, 4, 32, and 256 GB, with reference levels at day, week, month, year, and decade. Graph courtesy of Alex Szalay & Adrian Pope of Johns Hopkins University.]

Page 12: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Amdahl’s balanced System Laws

1 mips needs 4 MB ram and needs 20 IO/s
At 1 billion instructions per second
  need 4 GB/cpu
  need 50 disks/cpu!
64 cpus … 3,000 disks

[Diagram: 1 bips cpu, 4 GB RAM, 50 disks, 10,000 IOps, 7.5 TB]
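
A minimal sketch of the sizing rule on this slide, using only the two ratios it states (4 MB of RAM per mips, 50 disks per 1-bips cpu). The function name and the decimal GB convention are assumptions made for illustration.

```python
MB_PER_MIPS = 4            # from the slide: 1 mips needs 4 MB of RAM
DISKS_PER_BIPS_CPU = 50    # from the slide: a 1-bips cpu needs 50 disks


def balanced_system(cpus: int, mips_per_cpu: int = 1000) -> dict:
    """RAM (decimal GB) and disk count for a system balanced by the slide's ratios."""
    total_mips = cpus * mips_per_cpu
    ram_gb = total_mips * MB_PER_MIPS / 1000
    disks = total_mips * DISKS_PER_BIPS_CPU // 1000
    return {"ram_gb": ram_gb, "disks": disks}


print(balanced_system(1))
print(balanced_system(64))
```

For 64 cpus this gives 3,200 disks, which the slide rounds to 3,000.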

Page 13: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

The 5 Minute Rule – Trade RAM for Disk Arms

If data re-referenced every 5 minutes
  It is cheaper to cache it in ram than to get it from disk
A disk access/second ~ 50$ or
  ~ 50MB for 1 second or
  ~ 50KB for 1,000 seconds.
Each app has a memory “knee”
  Up to the knee, more memory helps a lot.
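
The break-even behind the rule can be sketched as follows. The slide gives the price of a disk access per second (~50$) and equates it with 50 MB of RAM, which implies roughly 1$/MB; that RAM price and the function name are assumptions drawn from that equivalence, not figures stated separately in the talk.

```python
def break_even_seconds(disk_dollars_per_access_per_sec: float,
                       ram_dollars_per_mb: float,
                       page_kb: float) -> float:
    """Re-reference interval at which holding a page in RAM costs the same
    as the share of a disk arm needed to re-read it that often."""
    pages_per_mb = 1000.0 / page_kb
    return pages_per_mb * disk_dollars_per_access_per_sec / ram_dollars_per_mb


# Slide's figures: 50$ per access/second, RAM at an implied 1$/MB.
print(break_even_seconds(50, 1, page_kb=50))      # 50 KB page
print(break_even_seconds(50, 1, page_kb=50_000))  # 50 MB "page"
```

Pages re-referenced more often than the break-even interval are cheaper to keep in RAM; smaller pages have longer break-even intervals.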

Page 14: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

64 bit Reduces IO, saves disks

Large memory reduces IO
64-bit simplifies code
Processors can be faster (wider word)
Ram is cheap (4 GB ~ 1k$ to 20k$)
Can trade ram for disk IO
Better response time.
Example
  tpcC
  4x1Ghz Itanium2 vs 4x1.6Ghz IA32
  40 extra GB → 60% extra throughput

[Figure: "Three TPC Benchmarks: GBs help a LOT! even if cpu clock is slower". Bar chart of Transactions Per Second (0 to 100,000) for 4x1.6Ghz IA32 + 8GB, 4x1.6Ghz IA32 + 32GB, and 4x1Ghz Itanium 2 (IA64) + 48GB.]

Page 15: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

AMD Hammer™ Coming Soon

AMD Hammer™ is 64bit capable
  2003: millions of Hammer™ CPUs will ship
  2004: most AMD CPUs will be 64bit
4GB ram is less than 1,000$ today
  less than 500$ in 2004
Desktops (Hammer™) and servers (Opteron™).
You do the math,…
Who will demand 64bit capable software?

Page 16: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

A 1TB Main Memory

Amdahl’s law: 1mips/MB, now 1:5
  so ~20 x 10 Ghz cpus need 1TB ram
1TB ram ~ 250k$ … 2m$ today
  ~ 25k$ … 200k$ in 5 years
128 million pages
  Takes a LONG time to fill
  Takes a LONG time to refill
Needs new algorithms
Needs parallel processing
Which leads us to…
  The memory hierarchy
  smp
  numa
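
The numbers on this slide can be reproduced with a rough sketch. Three things here are assumptions rather than statements from the slide: one instruction per clock (so a 10 Ghz cpu is 10,000 mips), an 8 KB page size, and a 1 GB/s fill rate (the measured I/O figure quoted later in the talk).

```python
def ram_mb_needed(cpus: int, ghz_per_cpu: float, mb_per_mips: float = 5.0) -> float:
    """RAM in MB for a balanced system at the slide's 1 mips : 5 MB ratio."""
    mips = cpus * ghz_per_cpu * 1000   # assumes ~1 instruction per clock
    return mips * mb_per_mips


def page_count(memory_bytes: int, page_bytes: int = 8 * 1024) -> int:
    """Number of pages in a memory of the given size (8 KB pages assumed)."""
    return memory_bytes // page_bytes


def fill_seconds(memory_bytes: int, bytes_per_sec: float) -> float:
    """Time to load the whole memory at a sustained transfer rate."""
    return memory_bytes / bytes_per_sec


ONE_TB = 2 ** 40
print(ram_mb_needed(20, 10))               # MB of RAM for 20 x 10 Ghz cpus
print(page_count(ONE_TB))                  # pages in 1 TB
print(fill_seconds(ONE_TB, 1e9) / 60)      # minutes to fill at 1 GB/s
```

Under these assumptions, filling the memory takes on the order of a quarter of an hour even at full sequential bandwidth, which is why the slide calls for new algorithms and parallel processing.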

Page 17: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Hyper-Threading: SMP on chip

If cpu is always waiting for memory
  Predict memory requests and prefetch
  done
If cpu still always waiting for memory
  Multi-program it (multiple hardware threads per cpu)
  Hyper Threading: Everything is SMP
  2 now more later
  Also multiple cpus/chip
If your program is single threaded
  You waste ½ the cpu and memory bandwidth
  Eventually waste 80%
App builders need to plan for threads.

Page 18: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

The Memory Hierarchy

Locality REALLY matters
CPU 2 Ghz, RAM at 5 Mhz
  RAM is no longer random access.
Organizing the code gives 3x (or more)
Organizing the data gives 3x (or more)

Level        latency (clocks)   size
Registers    1                  1 KB
L1           2                  32 KB
L2           10                 256 KB
L3           30                 4 MB
Near RAM     100                16 GB
Far RAM      300                64 GB
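
To relate the table's clock counts to the "RAM at 5 Mhz" remark, the sketch below converts latencies to nanoseconds and to an effective dependent-access rate at the slide's 2 Ghz clock. The helper names are illustrative, and the conversion assumes each access waits for the previous one to complete.

```python
LATENCY_CLOCKS = {
    "Registers": 1, "L1": 2, "L2": 10, "L3": 30, "Near RAM": 100, "Far RAM": 300,
}


def latency_ns(clocks: int, cpu_ghz: float = 2.0) -> float:
    """Latency in nanoseconds for a given number of cpu clocks."""
    return clocks / cpu_ghz


def random_access_mhz(clocks: int, cpu_ghz: float = 2.0) -> float:
    """Dependent random accesses per microsecond, i.e. an effective rate in Mhz."""
    return 1000.0 / latency_ns(clocks, cpu_ghz)


for level, clocks in LATENCY_CLOCKS.items():
    print(f"{level:10s} {latency_ns(clocks):7.1f} ns {random_access_mhz(clocks):8.1f} Mhz")
```

Far RAM comes out in the single-digit Mhz range, the same order of magnitude as the slide's 5 Mhz figure.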

Page 19: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

[Diagram: the memory hierarchy, with labels: registers, Arithmetic Logical Unit, L1 cache (Icache, Dcache), L2 cache, off-chip RAM, the bus, remote cache, remote RAM, other Cpus, disk, network]

Page 20: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Scaleup Systems
Non-Uniform Memory Architecture (NUMA)

Coherent but… remote memory is even slower
All cells see a common memory
Slow local main memory, slower remote main memory
Scaleup by adding cells
Planning for 64 cpu, 1TB ram
Interconnect, Service Processor, Partition management are vendor specific
Several vendors doing this
  Itanium and Hammer

[Diagram: four cells, each with four CPUs, four memory banks, I/O, and a chipset, joined by a system interconnect (crossbar/switch), with a partition manager, a config DB, and two service processors]

Page 21: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Changed Ratios Matter

If everything changes by 2x, then nothing changes.
So, it is the different rates that matter.

Improving FAST
  CPU speed
  Memory & disk size
  Network Bandwidth
Slowly changing
  Speed of light
  People costs
  Memory bandwidth
  WAN prices

Page 22: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

What’s REALLY New

64 bit & TB size main memory
SMP on chip: everything’s smp
32… 256 SMP: locality/affinity matters
TB size disks
High-speed LANs
iSCSI and NAS competition

We are here

Page 23: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Disks are becoming tapes

Capacity:
  150 GB now, 300 GB this year, 1 TB by 2007
Bandwidth:
  40 MBps now
  150 MBps by 2007
Read time
  2 hours sequential, 2 days random now
  4 hours sequential, 12 days random by 2007

[Diagram: disk now: 150 GB, 150 IO/s, 40 MBps; disk in 2007: 1 TB, 200 IO/s, 150 MBps]
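
A back-of-envelope sketch of why full-disk read times diverge. It uses the capacity, bandwidth, and IO/s figures from this slide; the 8 KB random transfer size and decimal units are assumptions, and the raw ratios come out somewhat lower than the slide's rounded read times, which may include overheads the slide does not state.

```python
def sequential_hours(capacity_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Hours to read the whole disk in one sequential pass."""
    return capacity_bytes / bandwidth_bytes_per_sec / 3600


def random_days(capacity_bytes: float, ios_per_sec: float,
                io_bytes: int = 8 * 1024) -> float:
    """Days to read the whole disk with random IOs of `io_bytes` each."""
    return capacity_bytes / io_bytes / ios_per_sec / 86400


now = {"capacity": 150e9, "bandwidth": 40e6, "iops": 150}
in_2007 = {"capacity": 1e12, "bandwidth": 150e6, "iops": 200}

for label, d in (("now", now), ("2007", in_2007)):
    seq = sequential_hours(d["capacity"], d["bandwidth"])
    rnd = random_days(d["capacity"], d["iops"])
    print(f"{label}: {seq:.1f} h sequential, {rnd:.1f} days random")
```

Whatever the exact transfer size, capacity grows faster than either bandwidth or IO/s, so both scans get longer and the random scan falls further behind the sequential one.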

Page 24: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Disks are becoming tapes: Consequences

Use most disk capacity for archiving
  Copy on Write (COW) file system in Windows.NET Server 2003
RAID10 saves arms, costs space (OK!).
Backup to disk
Pretend it is a 100GB disk + 1 TB disk
  Keep hot 10% of data on fastest part of disk.
  Keep cold 90% on colder part of disk
Organize computations to read/write disks sequentially in large blocks.

Page 25: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Networking: Great hardware & Software

WANs @ 5GBps (1 λ = 40 Gbps)
Gbps Ethernet common (~100 MBps)
  Offload gives ~2 hz/Byte
  Will improve with RDMA & zero-copy
  10 Gbps mainstream by 2004
Faster I/O
  1 GB/s today (measured)
  10 GB/s under development
  SATA (serial ATA) 150MBps/device

Page 26: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Wiring is going serial and getting FAST!

Gbps Ethernet and SATA built into chips
Raid Controllers: inexpensive and fast.
1U storage bricks @ 2-10 TB
SAN or NAS (iSCSI or CIFS/DAFS)

[Diagram: storage brick with Enet at 100MBps/link and 8xSATA at 150MBps/link]

Page 27: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

NAS – SAN Horse Race

Storage Hardware 1k$/TB/y
Storage Management 10k$...300k$/TB/y
So as with Server Consolidation: Storage Consolidation
Two styles:
  NAS (Network Attached Storage): File Server
  SAN (System Area Network): Disk Server
Windows supports both models.
We believe NAS is more manageable.
Windows is a great NAS server

Page 28: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

What’s REALLY New – Windows Scale Up

64 bit & TB size main memory
SMP on chip: everything’s smp
32… 256 SMP: locality/affinity matters
TB size disks
High-speed LANs
iSCSI and NAS competition

Page 29: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Take Aways / Call to Action

Threads: Plan for SMPs (threads)
  32 cpu and (far) beyond….
Locality: Use affinity, cache, disk, …
64bit: Plan for VERY large memory
Sequential IO and Disk-as-tape
  Plan for huge disks (with spare space)
Low-overhead networking:
  LAN Converging on Ethernet, SATA, …?
Windows.Net Server 2003 and successors will manage petabyte stores.

Page 30: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Outline

Progress: an overview
ScaleUp Nodes: technology trends
  Cpus, Memory, Disks, Networking
ScaleOut terminology:
  clones, racks/packs, farms, geoplex

Page 31: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Scaleable Systems

ScaleUP: grow by adding components to a single system.
ScaleOut: grow by adding more systems.
[Diagram labels: Scale UP, Scale OUT]

Page 32: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

ScaleUP and Scale OUT

Everyone does both.
Choices
  Size of a brick
  Clones or partitions
  Size of a pack
  Whose software?
scaleup and scaleout both have a large software component

Cost per slice:
  1M$/slice: IBM S390? Sun E 10,000?
  100 K$/slice: Wintel 8x++
  10 K$/slice: Wintel 4x
  1 K$/slice: Wintel 1x

Page 33: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Clones: Availability+Scalability

Some applications are
  Read-mostly
  Low consistency requirements
  Modest storage requirement (less than 1TB)
Examples:
  HTML web servers (IP sprayer/sieve + replication)
  LDAP servers (replication via gossip)
Replicate app at all nodes (clones)
Load Balance:
  Spray & Sieve: requests across nodes.
  Route: requests across nodes.
Grow: adding clones
Fault tolerance: stop sending to that clone.
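
The clone pattern above (spray requests, grow by adding clones, tolerate faults by no longer sending to a failed clone) can be sketched in a few lines. The `Sprayer` class and its method names are hypothetical and use simple round-robin; they do not represent any particular product's load balancer.

```python
class Sprayer:
    """Round-robin request sprayer over interchangeable clones."""

    def __init__(self, clones):
        self.clones = list(clones)
        self.down = set()
        self._next = 0

    def add_clone(self, clone):
        """Grow: any new clone can serve any request."""
        self.clones.append(clone)

    def mark_down(self, clone):
        """Fault tolerance: stop sending to that clone."""
        self.down.add(clone)

    def mark_up(self, clone):
        self.down.discard(clone)

    def route(self):
        """Return the next clone that is up, skipping failed ones."""
        for _ in range(len(self.clones)):
            clone = self.clones[self._next % len(self.clones)]
            self._next += 1
            if clone not in self.down:
                return clone
        raise RuntimeError("no clone is up")


sprayer = Sprayer(["web1", "web2", "web3"])
print([sprayer.route() for _ in range(4)])
sprayer.mark_down("web2")
print([sprayer.route() for _ in range(4)])
```

Because clones hold no state the client depends on, the sprayer needs no knowledge of the request itself; that is what distinguishes this from partition routing.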

Page 34: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Two Clone Geometries

Shared-Nothing: exact replicas
Shared-Disk (state stored in server)
[Diagram: Shared Nothing Clones; Shared Disk Clones]

If clones have any state: make it disposable.
Manage clones by reboot, failing that replace.
One person can manage thousands of clones.

Page 35: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Clone Requirements

Automatic replication (if they have any state)
  Applications (and system software)
  Data
Automatic request routing
  Spray or sieve
Management:
  Who is up?
  Update management & propagation
  Application monitoring.
Clones are very easy to manage:
  Rule of thumb: 100’s of clones per admin.

Page 36: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Partitions for Scalability

Clones are not appropriate for some apps.
  Stateful apps do not replicate well
  high update rates do not replicate well
Examples
  Email
  Databases
  Read/write file server…
  Cache managers
  chat
Partition state among servers
Partitioning:
  must be transparent to client.
  split & merge partitions online
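
A minimal sketch of partitioning that stays transparent to the client while partitions are split and merged. The `RangePartitionMap` class is hypothetical and uses key-range partitioning; it tracks only which server owns which key range and leaves out the data movement a real split or merge would need.

```python
import bisect


class RangePartitionMap:
    """Maps keys to servers by key range; clients only ever call server_for()."""

    def __init__(self, first_server):
        self.bounds = []               # sorted split keys
        self.servers = [first_server]  # always len(bounds) + 1 entries

    def server_for(self, key):
        return self.servers[bisect.bisect_right(self.bounds, key)]

    def split(self, split_key, new_server):
        """Keys >= split_key within the affected range move to new_server."""
        if split_key in self.bounds:
            raise ValueError("split key already exists")
        i = bisect.bisect_right(self.bounds, split_key)
        self.bounds.insert(i, split_key)
        self.servers.insert(i + 1, new_server)

    def merge(self, split_key):
        """Remove a split point; the lower partition absorbs the upper one."""
        i = self.bounds.index(split_key)
        del self.bounds[i]
        del self.servers[i + 1]


mailboxes = RangePartitionMap("mail1")
mailboxes.split("m", "mail2")
print(mailboxes.server_for("alice"), mailboxes.server_for("zoe"))
```

The client-facing call is the same before and after a split or merge, which is the sense in which partitioning is transparent.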

Page 37: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Packs for Availability

Each partition may fail (independent of others)
Partitions migrate to new node via fail-over
  Fail-over in seconds
Pack: the nodes supporting a partition
  VMS Cluster, Tandem, SP2 HACMP,..
  IBM Sysplex™
  WinNT MSCS (wolfpack)
Partitions typically grow in packs.
  ActiveActive: all nodes provide service
  ActivePassive: hot standby is idle
Cluster-In-A-Box now commodity
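
The active-passive case can be sketched as a pack that tracks which node currently owns a partition and fails over to a surviving standby. The `Pack` class is a hypothetical illustration of the idea only; it is not the MSCS API or any vendor's fail-over protocol.

```python
class Pack:
    """The nodes supporting one partition: one active owner, the rest standbys."""

    def __init__(self, partition, nodes):
        if not nodes:
            raise ValueError("a pack needs at least one node")
        self.partition = partition
        self.nodes = list(nodes)
        self.failed = set()
        self.owner = self.nodes[0]

    def fail(self, node):
        """Record a node failure; fail over if it owned the partition."""
        self.failed.add(node)
        if node == self.owner:
            survivors = [n for n in self.nodes if n not in self.failed]
            self.owner = survivors[0] if survivors else None

    def repair(self, node):
        """A repaired node rejoins as a standby, or as owner if none is left."""
        self.failed.discard(node)
        if self.owner is None and node in self.nodes:
            self.owner = node


pack = Pack("mailboxes a-m", ["node1", "node2"])
pack.fail("node1")
print(pack.owner)
```

The partition is unavailable only when every node in its pack is down, which is why packs add availability on top of the scalability that partitioning gives.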

Page 38: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Partitions and Packs

[Diagram: Partitions give Scalability; Packed Partitions give Scalability + Availability]

Page 39: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Parts+Packs Requirements

Automatic partitioning (in dbms, mail, files,…)
  Location transparent
  Partition split/merge
  Grow without limits (100x10TB)
  Application-centric request routing
Simple fail-over model
  Partition migration is transparent
  MSCS-like model for services
Management:
  Automatic partition management (split/merge)
  Who is up?
  Application monitoring.

Page 40: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

GeoPlex: Farm Pairs

Two farms (or more)
State (your mailbox, bank account) stored at both farms
Changes from one sent to other
When one farm fails other provides service
Masks
  Hardware/Software faults
  Operations tasks (reorganize, upgrade move)
  Environmental faults (power fail, earthquake, fire)
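
A toy sketch of the farm-pair idea: state is stored at both farms, each change is sent to the other farm, and a surviving farm serves requests when one fails. The `Farm` and `GeoPlex` classes are hypothetical; the sketch applies changes synchronously and does not model catch-up for a farm that returns after an outage.

```python
class Farm:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.up = True


class GeoPlex:
    """Two (or more) farms that each hold a full copy of the state."""

    def __init__(self, farms):
        self.farms = list(farms)

    def _serving_farm(self):
        for farm in self.farms:
            if farm.up:
                return farm
        raise RuntimeError("all farms are down")

    def write(self, key, value):
        """Apply the change at the serving farm and send it to the others."""
        serving = self._serving_farm()
        for farm in self.farms:
            if farm.up:
                farm.state[key] = value
        return serving.name

    def read(self, key):
        return self._serving_farm().state.get(key)


east, west = Farm("east"), Farm("west")
geoplex = GeoPlex([east, west])
geoplex.write("mailbox:jim", ["msg1"])
east.up = False                       # site failure at one farm
print(geoplex.read("mailbox:jim"))    # served by the other farm
```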

Page 41: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Fail-Over & Load Balancing

Routes request to right farm
  Farm can be clone or partition
At farm, routes request to right service
At service routes request to
  Any clone
  Correct partition.
Routes around failures.

Page 42: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Availability (diagram axis marked 99 and 999)

well-managed nodes
  Masks some hardware failures
well-managed packs & clones
  Masks hardware failures, Operations tasks (e.g. software upgrades)
  Masks some software failures
well-managed GeoPlex
  Masks site failures (power, network, fire, move,…)
  Masks some operations failures

Page 43: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Cluster Scale Out Scenarios

The FARM: Clones and Packs of Partitions

[Diagram: Web Clients; Load Balance; Cloned Front Ends (firewall, sprayer, web server); Cloned Packed file servers (Web File Store A, Web File Store B, replication); Packed Partitions: Database Transparency (SQL Database, SQL Partition 1, SQL Partition 2, SQL Partition 3, SQL Temp State)]

Page 44: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Some Examples:

TerraServer:
  6 IIS clone front-ends (wlbs)
  3-partition 4-pack backend: 3 active 1 passive
  Partition by theme and geography (longitude)
  1/3 sys admin
Hotmail:
  1,000 IIS clone HTTP login
  3,400 IIS clone HTTP front door
  + 1,000 clones for ad rotator, in/out bound…
  115 partition backend (partition by mailbox)
  Cisco local director for load balancing
  50 sys admin
Google: (Inktomi is similar but smaller)
  700 clone spider
  300 clone indexer
  5-node geoplex (full replica)
  1,000 clones/farm do search
  100 clones/farm for http
  10 sys admin

Page 45: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Summary

Terminology for scaleability
Farms of servers:
  Clones: identical
    Scaleability + availability
  Partitions:
    Scaleability
  Packs
    Partition availability via fail-over
GeoPlex for disaster tolerance.

Architectural Blueprint for Large eSites
  http://msdn.microsoft.com/library/en-us/dndna/html/dnablueprint.asp
Scalability Terminology: Farms, Clones, Partitions, and Packs
  ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc

[Diagram: terminology tree with nodes Farm, Clone (Shared Nothing, Shared Disk), Partition, Pack (Shared Nothing, Active-Active, Active-Passive), GeoPlex]

Page 46: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

Call to Action

Plan for 64 bit addressing everywhere
  it is in your future.
Use threads
  SMP is in your future
  Carefully: avoid locks, use locality/affinity
Think of disks as tape:
  Sequential vs random
  Online archive
Windows now has ScaleUp and ScaleOut
  Think in terms of Geoplexes and Farms

Page 47: Windows Scalability: Technology Terminology, Trends Jim Gray Distinguished Engineer Research Microsoft Corporation

© 2002 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.