Copyright 2012, 2015, 2018 & 2019 – Noah Mendelsohn
Scalability, Performance & Caching
Noah Mendelsohn Tufts University Email: [email protected] Web: http://www.cs.tufts.edu/~noah
COMP 150-IDS: Internet Scale Distributed Systems (Fall 2019)
© 2010 Noah Mendelsohn 2
Goals
Explore some general principles of performance, scalability and caching
Explore key issues relating to performance and scalability of the Web
Performance Concepts and Terminology
Performance, Scalability, Availability, Reliability
Performance – Get a lot done quickly – Preferably at low cost
Scalability – Low barriers to growth – A scalable system isn’t necessarily fast… – …but it can grow without slowing down
Availability – Always there when you need it
Reliability – Never does the wrong thing, never loses or corrupts data
Throughput vs. Response Time vs. Latency
Throughput: the aggregate rate at which a system does work
Response time: the time to get a response to a request
Latency: time spent waiting (e.g. for disk or network)
We can improve throughput by: – Minimizing work done per request – Doing enough work at once to keep all hardware resources busy… – …and when some work is delayed (latency), finding other work to do – Using parallelism, often to work on multiple requests independently
We can improve response time by: – Minimizing total work and delay (latency) on critical path to a response – Applying parallel resources to an individual response…including streaming – Precomputing response values
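One concrete way to precompute response values is memoization; a minimal sketch using Python's functools.lru_cache (the render_page function and its simulated cost are made up for illustration):

```python
import functools
import time

@functools.lru_cache(maxsize=1024)
def render_page(page_id):
    # Simulate expensive work on the critical path of a response.
    time.sleep(0.01)
    return f"<html>page {page_id}</html>"

# The first request pays the full cost; repeats come from the cache.
t0 = time.perf_counter(); render_page(7); cold = time.perf_counter() - t0
t0 = time.perf_counter(); render_page(7); warm = time.perf_counter() - t0
```

The same idea, applied across machines rather than within one process, is what the caching sections later in this deck are about.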
Know How Fast Things Are
Typical “speeds ’n’ feeds”
CPU (e.g. Intel Core i7): – A few billion instructions/second per core – Memory: 20GB/sec (≈20 bytes/instruction executed)
Long distance network – Latency (ping time): 10-100ms – Bandwidth: 5 – 100 Mb/sec
Local area network (Gbit Ethernet) – Latency: 50-100 usec (note: microseconds) – Bandwidth: 1 Gb/sec (~125 Mbytes/sec)
Hard disk – Rotational delay: 5ms – Seek time: 5 – 10 ms – Bandwidth from magnetic media: 1 Gbit/sec
SSD – Setup time: 100usec – Bandwidth: 2Gbit/sec (typical) note: SSD wins big on latency, some on bandwidth
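A quick back-of-envelope check on these numbers: converting each wait into "instructions a core could have executed meanwhile" (the figures below simply restate the slide's estimates):

```python
# Rough figures from the slide, expressed in seconds.
instruction = 1 / 3e9   # ~3 billion instructions/sec per core
lan_rtt     = 75e-6     # Gbit Ethernet latency, 50-100 usec
disk_access = 7.5e-3    # seek (5-10 ms) plus rotational delay
wan_rtt     = 50e-3     # long-distance ping, 10-100 ms

for name, t in [("LAN", lan_rtt), ("disk", disk_access), ("WAN", wan_rtt)]:
    # Instructions a core could have executed during the wait.
    print(f"{name:>4} wait ~= {t / instruction:,.0f} instructions")
```

Even a LAN round trip costs hundreds of thousands of instruction times, and a disk access costs millions, which is why the latency-hiding techniques below matter.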
Making Systems Faster: Single Thread Speed
Sharing the CPU

[Diagram: multiple programs (e.g. PowerPoint, a browser) running at once in main memory, sharing the CPU under the operating system]
What affects speed of a single program?

[Diagram: browser code running over the operating system in main memory]

How well is the code written? In what language? Compiler optimization?
What affects speed of a single program?

[Diagram adds system library code]

How efficient are the system libraries (including malloc, sqrt)? How efficient is the OS (including file I/O and the networking stack)?
What affects speed of a single program?

[Diagram adds a GPU]

How powerful is the GPU? How much memory does it have? A GPU is intended to speed graphics operations, with a CPU-like core optimized for parallel work and data streaming. How well do the application and its associated libraries use the GPU?

Note that GPUs are also useful for general parallel computation
What affects speed of a single program?

[Diagram adds the network connection and storage devices]

Network connection performance? Speed/capacity of storage devices?
Making Systems Faster: Hiding Latency
Hard disks are slow

[Diagram: disk platter and sectors]
Handling disk data the slow way
The Slow Way:
• Read a block
• Compute on the block
• Read another block
• Compute on the other block
• Rinse and repeat

The computer waits milliseconds while reading the disk: millions of instruction times!
Faster way: overlap to hide latency
The Faster Way:
• Read a block
• Start reading another block
• Compute on the 1st block
• Start reading a 3rd block
• Compute on the 2nd block
• Rinse and repeat

Buffering: we’re reading ahead…computing while reading!
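The faster way can be sketched with a reader thread that stays a block or two ahead of the computation; the device and compute delays here are simulated with sleeps, and the function names are illustrative:

```python
import queue
import threading
import time

def read_blocks(n):
    # Producer: simulates slow device reads, staying ahead of the consumer.
    buf = queue.Queue(maxsize=2)   # small buffer: classic double buffering
    def reader():
        for i in range(n):
            time.sleep(0.005)      # pretend this is disk latency
            buf.put(i)
        buf.put(None)              # end-of-stream marker
    threading.Thread(target=reader, daemon=True).start()
    return buf

def process(n):
    # Consumer: computes on each block while the next read is in flight.
    buf, total = read_blocks(n), 0
    while (block := buf.get()) is not None:
        time.sleep(0.005)          # simulated computation, overlapped with I/O
        total += block
    return total
```

With the read and the compute overlapped, the elapsed time approaches the larger of the two costs rather than their sum.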
Making Systems Faster: Bottlenecks and Parallelism
Parallelism and pipelining
[Example task: adjust the contrast and sharpness of an image]
Parallelism
Multiple computers each take a piece of the image
Pipelining
[Pipeline stages: compute brightness range → adjust brightness → sharpen]
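These three stages can be sketched as chained functions (the stage bodies are stand-ins; a real pipeline would run each stage on its own core, with images streaming between them so all stages stay busy):

```python
def compute_range(pixels):
    # Stage 1: find the brightness range of the image.
    return pixels, min(pixels), max(pixels)

def adjust_brightness(stage):
    # Stage 2: stretch pixel values to the full 0-255 range.
    pixels, lo, hi = stage
    span = (hi - lo) or 1
    return [255 * (p - lo) // span for p in pixels]

def sharpen(pixels):
    # Stage 3: placeholder; a real sharpen applies a convolution kernel.
    return pixels

def pipeline(pixels):
    return sharpen(adjust_brightness(compute_range(pixels)))
```

Note the contrast with the previous slide: parallelism splits one image across machines, while pipelining sends successive images through successive stages.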
Amdahl’s claim: parallel processing won’t scale
1967: Major controversy …will parallel computers work?
“Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach in terms of application to real problems and their attendant irregularities.” -- Gene Amdahl*
* Gene Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS Spring Joint Computer Conference, 1967. http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
Amdahl: why no parallel scaling?
“The first characteristic of interest is the fraction of the computational load which is associated with data management housekeeping. This fraction […might eventually be reduced to 20%...]. The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate.” -- Gene Amdahl (ibid.)
In short: even if the part you’re optimizing went to zero time, the speedup would be only 5x.
Speedup = 1 / (rs + rp/n), where rs and rp are the sequential and parallel fractions of the computation
As rp/n → 0, Speedup → 1/rs
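Plugging Amdahl's own 20% housekeeping fraction into the formula shows the ceiling:

```python
def speedup(rs, n):
    # Amdahl's law: rs = sequential fraction, (1 - rs) = parallel fraction,
    # n = number of processors.
    return 1 / (rs + (1 - rs) / n)

# With 20% sequential overhead, no amount of parallelism beats 5x.
print(speedup(0.20, 10))      # ~3.57
print(speedup(0.20, 1000))    # ~4.98
```

Even a thousand processors get you to barely 5x, which is exactly the "upper limit of five to seven times" in the quote above.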
So…why does parallelism work after all?
Amdahl missed that as we got more parallelism, we would work on bigger problems!
• Simulations with more data points • Indexing all the pages on the World Wide Web • Serving search queries from all users of the Web • Running word processors “in the cloud” for millions of users • Etc.
Web Performance and Scaling
Web Performance & Scalability Goals
Overall Web Goals: – Extraordinary scalability, with good performance – Therefore…very high aggregate throughput (think of all the accesses being made this second) – Economical to deploy (modest cost/user) – Be a good citizen on the Internet
Web servers: – Decent performance, high throughput and scalability
Web clients (browsers): – Low latency (quick response for users) – Reasonable burden on PC – Minimize memory and CPU on small devices
What we’ve already studied about Web scalability…
The Web builds on scalable, high-performance Internet infrastructure: – IP – DNS – TCP
Decentralized administration & deployment – The only thing resembling a global, central Web server is the DNS root – URI generation is decentralized too
Stateless protocols – Relatively easy to add servers
Web server scaling
[Diagram: browser → web-server application logic → data store of reservation records]
Stateless HTTP protocol helps scalability
[Diagram: many browsers, each talking to any of several web-server application-logic instances, all sharing one data store]
Caching
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
Why does caching work at all?
Locality: – In many computer systems, a small fraction of data gets most of the accesses – In other systems, a slowly changing set of data is accessed repeatedly
History: use of memory by typical programs – Denning’s Working Set Theory* – Early demonstration of locality in program access to memory – Justified paged virtual memory with LRU replacement algorithm – Also indirectly explains why CPU caches work
But…not all data-intensive programs follow the theory: – Video processing! – Many simulations – Hennessy and Patterson: running vector (think MMX/SIMD) data through the CPU cache was a big mistake in IBM mainframe vector implementations
* Peter J. Denning, The Working Set Model for Program Behavior, 1968 http://denninginstitute.com/pjd/PUBS/WSModel_1968.pdf
Also 2008 overview on locality from Denning: http://denninginstitute.com/pjd/PUBS/ENC/locality08.pdf
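The LRU replacement policy that working-set theory justified can be modeled in a few lines (a toy model built on an ordered dict, not a real pager):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                    # miss
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used
```

Locality is what makes this work: if the working set fits in `capacity`, nearly every access is a hit.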
Why is caching hard?
Things change
Telling everyone when things change adds overhead
So, we’re tempted to cheat… …caches out of sync with reality
CPU Caching – Simple System
[Diagram: a read request from the CPU misses in the cache, goes to memory, and the data comes back and is cached]
CPU Caching – Simple System
[Diagram: a repeated read request is satisfied directly from the cache]
Life is Good No Traffic to Slow Memory
CPU Caching – Store Through Writing
[Diagram: a write request updates both the cache and memory]
Everything is up-to-date… …but every write waits for slow memory!
CPU Caching – Store In Writing
[Diagram: a write request updates only the cache]
The write is fast, but memory is out of date!
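The two write policies on these slides can be contrasted in a toy model (class and method names are illustrative, not any real hardware interface):

```python
class Memory:
    def __init__(self):
        self.cells = {}

class WriteThroughCache:
    # Store-through: every write also goes to memory. Slow, but never stale.
    def __init__(self, mem):
        self.mem, self.lines = mem, {}
    def write(self, addr, value):
        self.lines[addr] = value
        self.mem.cells[addr] = value     # waits for slow memory

class WriteBackCache:
    # Store-in: writes stay in the cache until flushed. Fast, but memory
    # is out of date until flush() runs.
    def __init__(self, mem):
        self.mem, self.lines, self.dirty = mem, {}, set()
    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)             # memory not updated yet!
    def flush(self):
        for addr in self.dirty:
            self.mem.cells[addr] = self.lines[addr]
        self.dirty.clear()
```

The dirty set is exactly what must be flushed before anything else (a disk, another core) is allowed to read memory directly, which is the problem the next slides illustrate.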
CPU Caching – Store In Writing
[Diagram: the cache holds newer data than memory]
If we try to write data from memory to disk, the wrong data will go out!
Cache invalidation is hard!
We can start to see why cache invalidation is hard!
Multi-core CPU caching
[Diagram: two CPU cores, each with its own cache, connected to shared memory; a coherence protocol links the caches as one core issues a write request]
Multi-core CPU caching
[Diagram: one core writes a line; when the other core issues a read request, the coherence protocol supplies the up-to-date data]
Multi-core CPU caching
[Diagram: a disk read request arrives while the caches hold modified data]

A read from disk must flush all caches
Consistency vs. performance
Caching involves difficult tradeoffs
Coherence is the enemy of performance! – This proves true over and over in lots of systems – There’s a ton of research on weak-consistency models…
Weak consistency: let things get out of sync sometimes – Programming: compilers and libraries can hide or even exploit weak consistency
Yet another example of leaky abstractions!
What about Web Caching?
Note: update rate on Web is mostly low – makes things easier!
Browsers have caches
[Diagram: a browser (e.g. Firefox), which usually includes a cache, talking to a web server (e.g. Apache)]

The browser cache prevents repeated requests for the same representations
Browsers have caches
The browser cache prevents repeated requests for the same representations…even different pages share images, stylesheets, etc.
Web Reservation System
[Diagram: a browser or phone app (an iPhone or Android reservation application) talks over HTTP, optionally through a proxy cache (e.g. Squid), to a web server running flight-reservation logic, which reaches a data store of reservation records via RPC? ODBC? something proprietary?]

Many commercial applications work this way
HTTP Caches Help Web to Scale
[Diagram: many browsers → a tier of web proxy caches → several web-server application-logic instances → a shared data store]
Web Caching Details
HTTP Headers for Caching

Cache-Control: max-age=<seconds> – Server indicates how long the response is good
Heuristics – If there are no explicit times, a cache can guess
Caches check with the server when content has expired – Send an ordinary GET w/validator headers – Validators: modified time (1 sec resolution); ETag (opaque code) – Server returns “304 Not Modified” or new content
Cache-Control can also override default caching rules – E.g. client forces a check for a fresh copy – E.g. client explicitly allows stale data (for performance, availability)
Caches inform downstream clients/proxies of response age
PUT/POST/DELETE clear caches, but… – No guarantee that updates go through the same proxies as all reads! – Don’t mark as cacheable things you expect to update through parallel proxies!
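The validator exchange can be simulated in a few lines: the cache replays the stored ETag in an If-None-Match header, and the origin answers 304 Not Modified if it still matches (a sketch of the protocol logic only, not a real HTTP stack):

```python
def origin(etag_on_server, request_headers):
    # Returns (status, body) the way a server answers a conditional GET.
    if request_headers.get("If-None-Match") == etag_on_server:
        return 304, None              # Not Modified: reuse the cached copy
    return 200, "fresh content"       # content changed: send a new copy

# An expired cache entry is revalidated cheaply while the ETag matches:
status, body = origin('"v1"', {"If-None-Match": '"v1"'})
assert status == 304
# Once the resource changes on the server, the same request gets new content:
status, body = origin('"v2"', {"If-None-Match": '"v1"'})
```

A 304 carries no body, so revalidation costs one small round trip rather than a full transfer.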
Summary
We have studied some key principles and techniques relating to performance and scalability – Hardware performance – Single program issues (code quality, compiler, etc.) – Hiding latency – Parallelism and Amdahl’s law – Buffering and caching – Stateless protocols, etc.
The Web is a highly scalable system with mostly OK performance
Web approaches to scalability – Built on scalable Internet infrastructure – Few single points of control (the DNS root changes slowly and is served in parallel) – Administrative scalability: no central Web site registry
Web performance and scalability – Very high parallelism (browsers, servers all run in parallel) – Stateless protocols support scale out – Caching