Digital Design & Computer Arch.
Lecture 21a: Memory Organization and Memory Technology
Prof. Onur Mutlu
ETH Zürich, Spring 2020, 14 May 2020
Readings for This Lecture and Next: Memory Hierarchy and Caches
- Required: H&H Chapters 8.1-8.3; Refresh: P&P Chapter 3.5
- Recommended: an early cache paper by Maurice Wilkes
  - Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
2
A Computing System
- Three key components: Computation, Communication, Storage/memory
3
Burks, Goldstein, von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument," 1946.
Image source: https://lbsitbytes2010.wordpress.com/2013/03/29/john-von-neumann-roll-no-15/
What is a Computer?
- We will cover all three components
4
[Figure: the three components — Memory (program and data), I/O, and Processing (control/sequencing + datapath)]
Memory (Programmer’s View)
5
Abstraction: Virtual vs. Physical Memory
- Programmer sees virtual memory
  - Can assume the memory is "infinite"
- Reality: Physical memory size is much smaller than what the programmer assumes
- The system (system software + hardware, cooperatively) maps virtual memory addresses to physical memory
  - The system automatically manages the physical memory space, transparently to the programmer
+ Programmer does not need to know the physical size of memory, nor manage it → A small physical memory can appear as a huge one to the programmer → Life is easier for the programmer
-- More complex system software and architecture

A classic example of the programmer/(micro)architect tradeoff
(Physical) Memory System
- You need a larger level of storage to manage a small amount of physical memory automatically → Physical memory has a backing store: disk
- We will first start with the physical memory system
- For now, ignore the virtual→physical indirection
- We will get back to it later, if time permits…
7
Idealism
- Instruction Supply: zero latency access, infinite capacity, zero cost, perfect control flow
- Pipeline (instruction execution): no pipeline stalls, perfect data flow (reg/memory dependencies), zero-cycle interconnect (operand communication), enough functional units, zero latency compute
- Data Supply: zero latency access, infinite capacity, infinite bandwidth, zero cost
Quick Overview of Memory Arrays
How Can We Store Data?
- Flip-Flops (or Latches)
  - Very fast, parallel access
  - Very expensive (one bit costs tens of transistors)
- Static RAM (we will describe it in a moment)
  - Relatively fast, only one data word at a time
  - Expensive (one bit costs 6 transistors)
- Dynamic RAM (we will describe it a bit later)
  - Slower, one data word at a time; reading destroys content (refresh), needs a special manufacturing process
  - Cheap (one bit costs only one transistor plus one capacitor)
- Other storage technology (flash memory, hard disk, tape)
  - Much slower, access takes a long time, non-volatile
  - Very cheap (no transistors directly involved)
Array Organization of Memories
- Goal: Efficiently store large amounts of data
  - A memory array (stores data)
  - Address selection logic (selects one row of the array)
  - Readout circuitry (reads data out)
- An M-bit value can be read or written at each unique N-bit address
  - All values can be accessed, but only M bits at a time
  - Access restriction allows a more compact organization

[Figure: N-bit Address → Array → M-bit Data]
Memory Arrays
- Two-dimensional array of bit cells
  - Each bit cell stores one bit
- An array with N address bits and M data bits:
  - 2^N rows and M columns
  - Depth: number of rows (number of words)
  - Width: number of columns (size of word)
  - Array size: depth × width = 2^N × M

[Figure: N-bit Address → Array → M-bit Data]
Memory Array Example
- 2^2 × 3-bit array
- Number of words: 4
- Word size: 3 bits
- For example, the 3-bit word stored at address 10 is 100

[Figure: depth-4 × width-3 array — address 11 stores 010, 10 stores 100, 01 stores 110, 00 stores 011; 2-bit Address in, 3-bit Data out]
Larger and Wider Memory Array Example

[Figure: 1024-word × 32-bit array — 10-bit Address in, 32-bit Data out]
Memory Array Organization (I)
- Storage nodes in one column connected to one bitline
- Address decoder activates only ONE wordline
- Content of one line of storage available at output

[Figure: a 2:4 decoder turns the 2-bit Address into wordline3..wordline0; each row of stored bits drives bitline2, bitline1, bitline0 → Data2, Data1, Data0]
Memory Array Organization (II)
- Storage nodes in one column connected to one bitline
- Address decoder activates only ONE wordline
- Content of one line of storage available at output

[Figure: the same array with Address = 10 — wordline2 is the active wordline, and its stored row 1 0 0 appears on Data2 Data1 Data0]
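The decoder-plus-array behavior described above can be sketched as a toy Python model (illustrative only; the dict contents mirror the 4×3 example array from the earlier slide):

```python
# A toy model of the 4x3 array: a 2:4 decoder activates exactly one
# wordline, and the selected row's stored bits drive the bitlines.
ARRAY = {0b11: (0, 1, 0),   # contents match the slide: address 11 -> 010
         0b10: (1, 0, 0),   # address 10 -> 100
         0b01: (1, 1, 0),
         0b00: (0, 1, 1)}

def decode(address, n=2):
    """2:4 decoder: a one-hot list of wordlines, exactly one active."""
    return [int(row == address) for row in range(2 ** n)]

def read(address):
    wordlines = decode(address)
    assert sum(wordlines) == 1          # only ONE wordline is active
    row = wordlines.index(1)
    return ARRAY[row]                   # bits appear on Data2, Data1, Data0

print(read(0b10))  # (1, 0, 0) -- wordline2 is the active wordline
```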
How is Access Controlled?
- Access transistors configured as switches connect the bit storage to the bitline
- Access is controlled by the wordline

[Figure: a DRAM cell (stored bit on a capacitor behind one access transistor, wordline and bitline) next to an SRAM cell (stored bit behind two access transistors, wordline and a bitline/_bitline pair)]
Building Larger Memories
- Requires larger memory arrays
- Large → slow
- How do we make the memory large without making it very slow?
- Idea: Divide the memory into smaller arrays and interconnect the arrays to input/output buses
  - Large memories are hierarchical array structures
  - DRAM: Channel → Rank → Bank → Subarrays → Mats
18
General Principle: Interleaving (Banking)
- Problem: a single monolithic large memory array takes long to access and does not enable multiple accesses in parallel
- Goal: Reduce the latency of memory array access and enable multiple accesses in parallel
- Idea: Divide a large array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)
  - Each bank is smaller than the entire memory storage
  - Accesses to different banks can be overlapped
- A key issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)
19
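One common answer to the mapping question is low-order interleaving, sketched below (a minimal illustration; the modulo mapping is a standard scheme, not something this slide specifies):

```python
# A sketch of low-order ("bank = address mod #banks") interleaving, one
# common way to map data to banks so that consecutive blocks land in
# different banks and their accesses can be overlapped.
NUM_BANKS = 4

def bank_of(block_address):
    return block_address % NUM_BANKS      # low-order interleaving

def bank_row(block_address):
    return block_address // NUM_BANKS     # position within the bank

# Consecutive block addresses hit different banks, so a streaming access
# pattern keeps all four banks busy in parallel.
print([bank_of(a) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```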
Memory Technology: DRAM and SRAM
Memory Technology: DRAM
- Dynamic random access memory
- Capacitor charge state indicates stored value
  - Whether the capacitor is charged or discharged indicates storage of 1 or 0
  - 1 capacitor
  - 1 access transistor
- Capacitor leaks through the RC path
  - DRAM cell loses charge over time
  - DRAM cell needs to be refreshed

[Figure: DRAM cell — row enable gates the access transistor between the capacitor and the _bitline]
Memory Technology: SRAM
- Static random access memory
- Two cross-coupled inverters store a single bit
  - Feedback path enables the stored value to persist in the "cell"
  - 4 transistors for storage
  - 2 transistors for access

[Figure: SRAM cell — row select gates two access transistors between the cross-coupled inverters and the bitline/_bitline pair]
Memory Bank Organization and Operation
- Read access sequence:
  1. Decode row address & drive word-lines
  2. Selected bits drive bit-lines
     - Entire row read
  3. Amplify row data
  4. Decode column address & select subset of row
     - Send to output
  5. Precharge bit-lines
     - For next access
23
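The five-step read sequence can be mimicked with a toy model (sizes and contents are made up for illustration):

```python
# A toy walk-through of the 5-step read sequence: decode the row, read
# the ENTIRE row into the sense amps, then column-decode a subset of it.
ROW_SIZE = 8          # words per row (illustrative)
bank = [[row * ROW_SIZE + col for col in range(ROW_SIZE)] for row in range(4)]

def read_word(row_addr, col_addr):
    # 1-3. decode row, drive wordline, amplify: whole row lands in sense amps
    sense_amps = list(bank[row_addr])
    # 4. decode column address and select a subset of the row to output
    word = sense_amps[col_addr]
    # 5. precharge bitlines for the next access (a no-op in this model)
    return word

print(read_word(2, 5))  # 2*8 + 5 = 21
```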
SRAM (Static Random Access Memory)

[Figure: 2^n-row × 2^m-column bit-cell array (n ≫ m to minimize overall latency); the n+m address bits split into n row-select bits and m column bits; sense amps and a mux over 2^m differential bitline pairs produce 1 data bit]

Read Sequence:
1. address decode
2. drive row select
3. selected bit-cells drive bitlines (entire row is read together)
4. differential sensing and column select (data is ready)
5. precharge all bitlines (for next read or write)

- Access latency dominated by steps 2 and 3
- Cycling time dominated by steps 2, 3 and 5
  - step 2 proportional to 2^m
  - steps 3 and 5 proportional to 2^n
DRAM (Dynamic Random Access Memory)

[Figure: 2^n-row × 2^m-column bit-cell array (n ≫ m to minimize overall latency); RAS strobes the n row-address bits and CAS the m column-address bits; sense amps and a mux over 2^m bitlines produce 1 data bit. A DRAM die comprises multiple such arrays.]

- Bits stored as charge on node capacitance (non-restorative)
  - bit cell loses charge when read
  - bit cell loses charge over time

Read Sequence:
1-3. same as SRAM
4. a "flip-flopping" sense amp amplifies and regenerates the bitline; the data bit is mux'ed out
5. precharge all bitlines

- Destructive reads
- Charge loss over time
- Refresh: A DRAM controller must periodically read each row within the allowed refresh time (10s of ms) so that charge is restored
DRAM vs. SRAM
- DRAM
  - Slower access (capacitor)
  - Higher density (1T-1C cell)
  - Lower cost
  - Requires refresh (power, performance, circuitry)
  - Manufacturing requires putting capacitor and logic together
- SRAM
  - Faster access (no capacitor)
  - Lower density (6T cell)
  - Higher cost
  - No need for refresh
  - Manufacturing compatible with logic process (no capacitor)
26
We did not cover the following slides in lecture. These are for your preparation for the next lecture.
The Memory Hierarchy
Memory in a Modern System

[Figure: die photo of a four-core chip — CORE 0-3, each with a private L2 CACHE (0-3), a SHARED L3 CACHE, the DRAM INTERFACE, DRAM BANKS, and the DRAM MEMORY CONTROLLER]
Ideal Memory
- Zero access time (latency)
- Infinite capacity
- Zero cost
- Infinite bandwidth (to support multiple accesses in parallel)
31
The Problem
- Ideal memory's requirements oppose each other
- Bigger is slower
  - Bigger → Takes longer to determine the location
- Faster is more expensive
  - Memory technology: SRAM vs. DRAM vs. Disk vs. Tape
- Higher bandwidth is more expensive
  - Need more banks, more ports, higher frequency, or faster technology
32
The Problem
- Bigger is slower
  - SRAM, 512 Bytes, sub-nanosec
  - SRAM, KByte-MByte, ~nanosec
  - DRAM, Gigabyte, ~50 nanosec
  - Hard Disk, Terabyte, ~10 millisec
- Faster is more expensive (dollars and chip area)
  - SRAM, < $10 per Megabyte
  - DRAM, < $1 per Megabyte
  - Hard Disk, < $1 per Gigabyte
  - These sample values (circa ~2011) scale with time
- Other technologies have their place as well
  - Flash memory (mature); PC-RAM, MRAM, RRAM (not mature yet)
33
Why Memory Hierarchy?
- We want both fast and large
- But we cannot achieve both with a single level of memory
- Idea: Have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)
34
The Memory Hierarchy

[Figure: hierarchy pyramid — fast & small at the top ("move what you use here"), big but slow at the bottom ("back up everything here"); levels get faster per byte toward the top and cheaper per byte toward the bottom. With good locality of reference, memory appears as fast as the top level and as large as the bottom level.]
Memory Hierarchy
- Fundamental tradeoff
  - Fast memory: small
  - Large memory: slow
- Idea: Memory hierarchy
- Tradeoffs in latency, cost, size, bandwidth

[Figure: CPU with Register File → Cache → Main Memory (DRAM) → Hard Disk]
Locality
- One's recent past is a very good predictor of his/her near future.
- Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon
  - since you are here today, there is a good chance you will be here again and again, regularly
- Spatial Locality: If you did something, it is very likely you will do something similar/related (in space)
  - every time I find you in this room, you are probably sitting close to the same people
37
Memory Locality
- A "typical" program has a lot of locality in memory references
  - typical programs are composed of "loops"
- Temporal: A program tends to reference the same memory location many times, all within a small window of time
- Spatial: A program tends to reference a cluster of memory locations at a time
  - most notable examples: 1. instruction memory references, 2. array/data structure references
38
Caching Basics: Exploit Temporal Locality
- Idea: Store recently accessed data in automatically managed fast memory (called cache)
- Anticipation: the data will be accessed again soon
- Temporal locality principle
  - Recently accessed data will be accessed again in the near future
  - This is what Maurice Wilkes had in mind:
    - Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
    - "The use is discussed of a fast core memory of, say, 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory."
39
Caching Basics: Exploit Spatial Locality
- Idea: Store addresses adjacent to the recently accessed one in automatically managed fast memory
  - Logically divide memory into equal-size blocks
  - Fetch to cache the accessed block in its entirety
- Anticipation: nearby data will be accessed soon
- Spatial locality principle
  - Nearby data in memory will be accessed in the near future
    - E.g., sequential instruction access, array traversal
  - This is what the IBM 360/85 implemented
    - 16 Kbyte cache with 64-byte blocks
    - Liptay, "Structural aspects of the System/360 Model 85 II: the cache," IBM Systems Journal, 1968.
40
The Bookshelf Analogy
- Book in your hand, desk, bookshelf, boxes at home, boxes in storage
- Recently-used books tend to stay on the desk
  - Comp Arch books, books for classes you are currently taking
  - Until the desk gets full
- Adjacent books in the shelf are needed around the same time
  - If I have organized/categorized my books well in the shelf
41
Caching in a Pipelined Design
- The cache needs to be tightly integrated into the pipeline
  - Ideally, access in 1 cycle so that load-dependent operations do not stall
- High-frequency pipeline → Cannot make the cache large
  - But, we want a large cache AND a pipelined design
- Idea: Cache hierarchy

[Figure: CPU with Register File → Level 1 Cache → Level 2 Cache → Main Memory (DRAM)]
A Note on Manual vs. Automatic Management
- Manual: Programmer manages data movement across levels
  -- too painful for programmers on substantial programs
  - "core" vs. "drum" memory in the 50's
  - still done in some embedded processors (on-chip scratch-pad SRAM in lieu of a cache) and GPUs (called "shared memory")
- Automatic: Hardware manages data movement across levels, transparently to the programmer
  ++ programmer's life is easier
  - the average programmer doesn't need to know about it
  - You don't need to know how big the cache is and how it works to write a "correct" program! (What if you want a "fast" program?)
43
Automatic Management in Memory Hierarchy
- Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
- "By a slave memory I mean one which automatically accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again."
44
Historical Aside: Other Cache Papers
- Fotheringham, "Dynamic Storage Allocation in the Atlas Computer, Including an Automatic Use of a Backing Store," CACM 1961.
  - http://dl.acm.org/citation.cfm?id=366800
- Bloom, Cohen, Porter, "Considerations in the Design of a Computer with High Logic-to-Memory Speed Ratio," AIEE Gigacycle Computing Systems Winter Meeting, Jan. 1962.
45
A Modern Memory Hierarchy

[Figure: Register File (32 words, sub-nsec) — manual/compiler register spilling; L1 cache (~32 KB, ~nsec), L2 cache (512 KB-1 MB, many nsec), L3 cache, … — automatic HW cache management; Main memory (DRAM, GB, ~100 nsec) and Swap Disk (100 GB, ~10 msec) — automatic demand paging; together forming the Memory Abstraction]
Hierarchical Latency Analysis
- A given memory hierarchy level i has a technology-intrinsic access time of t_i. The perceived access time T_i is longer than t_i
- Except for the outermost hierarchy level, when looking for a given address there is
  - a chance (hit-rate h_i) you "hit", and the access time is t_i
  - a chance (miss-rate m_i) you "miss", and the access time is t_i + T_{i+1}
  - h_i + m_i = 1
- Thus:
  T_i = h_i·t_i + m_i·(t_i + T_{i+1})
  T_i = t_i + m_i·T_{i+1}
- h_i and m_i are defined to be the hit-rate and miss-rate of only the references that missed at level i-1
47
Hierarchy Design Considerations
- Recursive latency equation: T_i = t_i + m_i·T_{i+1}
- The goal: achieve the desired T_1 within the allowed cost
- T_i ≈ t_i is desirable
- Keep m_i low
  - increasing capacity C_i lowers m_i, but beware of increasing t_i
  - lower m_i by smarter management (replacement: anticipate what you don't need; prefetching: anticipate what you will need)
- Keep T_{i+1} low
  - faster lower hierarchy levels, but beware of increasing cost
  - introduce intermediate hierarchy levels as a compromise
48
Intel Pentium 4 Example
- 90nm P4, 3.6 GHz
- L1 D-cache
  - C_1 = 16 KB
  - t_1 = 4 cyc int / 9 cyc fp
- L2 D-cache
  - C_2 = 1024 KB
  - t_2 = 18 cyc int / 18 cyc fp
- Main memory
  - t_3 = ~50 ns or 180 cyc
- Notice
  - best-case latency is not 1
  - worst-case access latencies run to 500+ cycles
- if m_1 = 0.1, m_2 = 0.1 → T_1 = 7.6, T_2 = 36
- if m_1 = 0.01, m_2 = 0.01 → T_1 = 4.2, T_2 = 19.8
- if m_1 = 0.05, m_2 = 0.01 → T_1 = 5.00, T_2 = 19.8
- if m_1 = 0.01, m_2 = 0.50 → T_1 = 5.08, T_2 = 108
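These numbers follow directly from the recursive equation T_i = t_i + m_i·T_{i+1}; a short script to reproduce them (integer latencies only):

```python
# Evaluate the recursive latency equation T_i = t_i + m_i * T_{i+1},
# plugged with the Pentium 4 integer latencies: t1=4, t2=18, t3=180 cycles.
def perceived_latencies(t, m):
    """t: intrinsic access times, innermost level first; m: per-level miss rates."""
    T = [0.0] * len(t)
    T[-1] = t[-1]                      # the outermost level always "hits"
    for i in range(len(t) - 2, -1, -1):
        T[i] = t[i] + m[i] * T[i + 1]  # T_i = t_i + m_i * T_{i+1}
    return T

t = [4, 18, 180]
for m1, m2 in [(0.1, 0.1), (0.01, 0.01), (0.05, 0.01), (0.01, 0.5)]:
    T1, T2, _ = perceived_latencies(t, [m1, m2])
    print(f"m1={m1}, m2={m2}: T1={T1:.2f}, T2={T2:.1f}")
# e.g., m1=0.1, m2=0.1 gives T1=7.60, T2=36.0 -- matching the slide
```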
Cache Basics and Operation
Cache
- Generically, any structure that "memoizes" frequently used results to avoid repeating the long-latency operations required to reproduce the results from scratch, e.g., a web cache
- Most commonly in the on-die context: an automatically-managed memory hierarchy based on SRAM
  - memoize in SRAM the most frequently accessed DRAM memory locations to avoid repeatedly paying the DRAM access latency
51
Caching Basics
- Block (line): unit of storage in the cache
  - Memory is logically divided into cache blocks that map to locations in the cache
- On a reference:
  - HIT: If in cache, use cached data instead of accessing memory
  - MISS: If not in cache, bring the block into the cache
    - May have to kick something else out to do it
- Some important cache design decisions
  - Placement: where and how to place/find a block in the cache?
  - Replacement: what data to remove to make room in the cache?
  - Granularity of management: large or small blocks? Subblocks?
  - Write policy: what do we do about writes?
  - Instructions/data: do we treat them separately?
52
Cache Abstraction and Metrics

[Figure: Address goes to a Tag Store (is the address in the cache? + bookkeeping) and a Data Store (stores memory blocks); the tag store produces Hit/miss?, the data store produces Data]

- Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
- Average memory access time (AMAT) = (hit-rate × hit-latency) + (miss-rate × miss-latency)
- Aside: Can reducing AMAT reduce performance?
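A quick numeric illustration of the AMAT formula (the hit rate and latencies below are hypothetical):

```python
# AMAT = hit_rate * hit_latency + miss_rate * miss_latency, exactly as on
# the slide; miss_latency here is the full latency of a missing access.
def amat(hit_rate, hit_latency, miss_latency):
    return hit_rate * hit_latency + (1 - hit_rate) * miss_latency

# Hypothetical numbers: 95% hit rate, 2-cycle hits, 102-cycle misses.
print(amat(0.95, 2, 102))  # 0.95*2 + 0.05*102 = about 7.0 cycles on average
```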
A Basic Hardware Cache Design
- We will start with a basic hardware cache design
- Then, we will examine a multitude of ideas to make it better
54
Blocks and Addressing the Cache
- Memory is logically divided into fixed-size blocks
- Each block maps to a location in the cache, determined by the index bits in the address
  - used to index into the tag and data stores
- Cache access:
  1) index into the tag and data stores with the index bits in the address
  2) check the valid bit in the tag store
  3) compare the tag bits in the address with the stored tag in the tag store
- If a block is in the cache (cache hit), the stored tag should be valid and match the tag of the block

[Figure: 8-bit address split into tag (2 bits), index (3 bits), byte in block (3 bits)]
Direct-Mapped Cache: Placement and Access
- Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks
- Assume cache: 64 bytes, 8 blocks
  - Direct-mapped: A block can go to only one location
  - Addresses with the same index contend for the same location
    - Cause conflict misses

[Figure: the address's 3 index bits select one (V, tag) entry of the tag store and one block of the data store; =? compares the stored tag with the address's 2 tag bits to produce Hit?, and a MUX selects the byte in block (3 bits) to produce Data]
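The tag/index/offset split for this 256-byte memory with an 8-block, 8-byte-block direct-mapped cache can be checked in a few lines (byte addresses 0 and 64 correspond to block addresses 0 and 8, a conflicting pair):

```python
# 256-byte memory, 8-byte blocks, 8-block direct-mapped cache
# -> 3 offset bits, 3 index bits, 2 tag bits in an 8-bit address.
OFFSET_BITS, INDEX_BITS = 3, 3

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Byte addresses 0 and 64 differ only in the tag, so they map to the same
# index (set 0) and conflict in a direct-mapped cache.
print(split_address(0))   # (0, 0, 0)
print(split_address(64))  # (1, 0, 0)
```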
Direct-Mapped Caches
- Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
  - One index → one entry
- Can lead to a 0% hit rate if more than one block accessed in an interleaved manner maps to the same index
  - Assume addresses A and B have the same index bits but different tag bits
  - A, B, A, B, A, B, A, B, … → conflict in the cache index
  - All accesses are conflict misses
57
Set Associativity
- Addresses 0 and 8 always conflict in a direct-mapped cache
- Instead of having one column of 8 blocks, have 2 columns of 4 blocks

[Figure: Address split into tag (3 bits), index (2 bits), byte in block (3 bits); the index selects a SET — two (V, tag) entries in the tag store and two blocks in the data store; two =? comparators feed the hit Logic, and MUXes select the hitting block and then the byte in block]

Key idea: Associative memory within the set
+ Accommodates conflicts better (fewer conflict misses)
-- More complex, slower access, larger tag store
Higher Associativity
- 4-way
  + Likelihood of conflict misses even lower
  -- More tag comparators and a wider data mux; larger tags

[Figure: 4-way tag store and data store — four =? comparators feed the hit Logic; MUXes select the hitting block and the byte in block]
Full Associativity
- Fully associative cache
  - A block can be placed in any cache location

[Figure: 8-way tag store and data store — eight =? comparators feed the hit Logic; MUXes select the hitting block and the byte in block]
Associativity (and Tradeoffs)
- Degree of associativity: How many blocks can map to the same index (or set)?
- Higher associativity
  ++ Higher hit rate
  -- Slower cache access time (hit latency and data access latency)
  -- More expensive hardware (more comparators)
- Diminishing returns from higher associativity

[Figure: hit rate vs. associativity — the curve rises steeply at low associativity, then flattens]
Issues in Set-Associative Caches
- Think of each block in a set as having a "priority"
  - Indicating how important it is to keep the block in the cache
- Key issue: How do you determine/adjust block priorities?
- There are three key decisions in a set:
  - Insertion, promotion, eviction (replacement)
- Insertion: What happens to priorities on a cache fill?
  - Where to insert the incoming block, whether or not to insert the block
- Promotion: What happens to priorities on a cache hit?
  - Whether and how to change block priority
- Eviction/replacement: What happens to priorities on a cache miss?
  - Which block to evict and how to adjust priorities
62
Eviction/Replacement Policy
- Which block in the set to replace on a cache miss?
  - Any invalid block first
  - If all are valid, consult the replacement policy
    - Random
    - FIFO
    - Least recently used (how to implement?)
    - Not most recently used
    - Least frequently used?
    - Least costly to re-fetch?
      - Why would memory accesses have different cost?
    - Hybrid replacement policies
    - Optimal replacement policy?
63
Implementing LRU
- Idea: Evict the least recently accessed block
- Problem: Need to keep track of the access ordering of blocks
- Question: 2-way set-associative cache:
  - What do you need to implement LRU perfectly?
- Question: 4-way set-associative cache:
  - What do you need to implement LRU perfectly?
  - How many different orderings are possible for the 4 blocks in the set?
  - How many bits are needed to encode the LRU order of a block?
  - What is the logic needed to determine the LRU victim?
64
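As a software sketch (not the hardware encoding the questions above ask about), perfect LRU for one set can be kept as a recency-ordered list; note that a 4-way set has 4! = 24 possible orderings, so a hardware encoding needs at least ceil(log2(24)) = 5 bits per set:

```python
# Perfect LRU for one set, modeled as a recency-ordered list of tags.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.order = []  # index 0 = MRU, last = LRU

    def access(self, tag):
        hit = tag in self.order
        if hit:
            self.order.remove(tag)      # promotion: move to MRU on a hit
        elif len(self.order) == self.ways:
            self.order.pop()            # eviction: drop the LRU block
        self.order.insert(0, tag)       # insertion at the MRU position
        return hit

s = LRUSet(4)
for tag in ["A", "B", "C", "D", "A"]:
    s.access(tag)
s.access("E")      # set is full; the LRU block "B" is evicted
print(s.order)     # ['E', 'A', 'D', 'C']
```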
Approximations of LRU
- Most modern processors do not implement "true LRU" (also called "perfect LRU") in highly-associative caches
- Why?
  - True LRU is complex
  - LRU is an approximation to predict locality anyway (i.e., not the best possible cache management policy)
- Examples:
  - Not MRU (not most recently used)
  - Hierarchical LRU: divide the N-way set into M "groups", track the MRU group and the MRU way in each group
  - Victim-NextVictim Replacement: only keep track of the victim and the next victim
Hierarchical LRU (not MRU)
- Divide a set into multiple groups
- Keep track of only the MRU group
- Keep track of only the MRU block in each group
- On replacement, select the victim as:
  - A not-MRU block in one of the not-MRU groups (randomly pick one of such blocks/groups)
66
Hierarchical LRU (not MRU): Questions
- 16-way cache
- 2 8-way groups
- What is an access pattern that performs worse than true LRU?
- What is an access pattern that performs better than true LRU?
67
Victim/Next-Victim Policy
- Only 2 blocks' status is tracked in each set:
  - victim (V), next victim (NV)
  - all other blocks denoted as (O) - Ordinary block
- On a cache miss
  - Replace V
  - Demote NV to V
  - Randomly pick an O block as NV
- On a cache hit to V
  - Demote NV to V
  - Randomly pick an O block as NV
  - Turn V into O
- On a cache hit to NV
  - Randomly pick an O block as NV
  - Turn NV into O
- On a cache hit to O
  - Do nothing
69
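A sketch of the Victim/Next-Victim rules for one (full) set; the block names, the initial V/NV choice, and the random tie-breaking are illustrative:

```python
import random

# One set under the Victim/Next-Victim policy: only V and NV are tracked,
# every other resident block is Ordinary (O). Assumes the set starts full.
class VNVSet:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.v, self.nv = blocks[0], blocks[1]  # arbitrary initial V and NV

    def _pick_new_nv(self):
        others = [b for b in self.blocks if b not in (self.v, self.nv)]
        self.nv = random.choice(others)         # randomly pick an O block

    def access(self, tag):
        if tag not in self.blocks:      # miss: replace V, demote NV to V
            self.blocks[self.blocks.index(self.v)] = tag
            self.v = self.nv
            self._pick_new_nv()
        elif tag == self.v:             # hit to V: NV becomes V, V becomes O
            self.v = self.nv
            self._pick_new_nv()
        elif tag == self.nv:            # hit to NV: NV becomes O, pick new NV
            self._pick_new_nv()
        # hit to an O block: do nothing

s = VNVSet(["A", "B", "C", "D"])
s.access("E")                  # miss: "A" (V) is replaced, "B" becomes V
print(s.blocks, s.v, s.nv)
```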
Victim/Next-Victim Example
70
Cache Replacement Policy: LRU or Random
- LRU vs. Random: Which one is better?
  - Example: 4-way cache, cyclic references to A, B, C, D, E
    - 0% hit rate with the LRU policy
- Set thrashing: When the "program working set" in a set is larger than the set associativity
  - Random replacement policy is better when thrashing occurs
- In practice:
  - Depends on the workload
  - Average hit rates of LRU and Random are similar
- Best of both worlds: Hybrid of LRU and Random
  - How to choose between the two? Set sampling
    - See Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
71
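The set-thrashing example is easy to verify: cyclic references to five blocks in a 4-way set never hit under LRU, because LRU always evicts exactly the block that will be referenced soonest:

```python
# Count LRU hits in one set for a given reference trace.
def lru_hits(trace, ways):
    order, hits = [], 0     # order: index 0 = MRU, last = LRU
    for tag in trace:
        if tag in order:
            hits += 1
            order.remove(tag)
        elif len(order) == ways:
            order.pop()     # evict the LRU block
        order.insert(0, tag)
    return hits

trace = ["A", "B", "C", "D", "E"] * 20  # cyclic working set of 5 > 4 ways
print(lru_hits(trace, 4))  # 0 -> every access is a miss (set thrashing)
```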
What Is the Optimal Replacement Policy?
- Belady's OPT
  - Replace the block that is going to be referenced furthest in the future by the program
  - Belady, "A study of replacement algorithms for a virtual-storage computer," IBM Systems Journal, 1966.
  - How do we implement this? Simulate?
- Is this optimal for minimizing miss rate?
- Is this optimal for minimizing execution time?
  - No. Cache miss latency/cost varies from block to block!
  - Two reasons: remote vs. local caches, and miss overlapping
  - Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
72
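Belady's OPT can indeed be evaluated only by simulation, since it needs the future reference stream; a minimal sketch for one set:

```python
# Belady's OPT for one set: on a miss with a full set, evict the resident
# block whose next reference is furthest in the future (or never again).
def opt_hits(trace, ways):
    resident, hits = set(), 0
    for i, tag in enumerate(trace):
        if tag in resident:
            hits += 1
            continue
        if len(resident) == ways:
            future = trace[i + 1:]
            def next_use(b):
                return future.index(b) if b in future else float("inf")
            resident.remove(max(resident, key=next_use))
        resident.add(tag)
    return hits

trace = ["A", "B", "C", "D", "E"] * 4   # the thrashing trace from the slide
print(opt_hits(trace, 4))  # 12 of 20 references hit (LRU gets 0 here)
```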
Reading
- Key observation: Some misses are more costly than others, as their latency is exposed as stall time. Reducing miss rate is not always good for performance. Cache replacement should take into account the MLP of misses.
- Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt, "A Case for MLP-Aware Cache Replacement," Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), pages 167-177, Boston, MA, June 2006.
73