Topic 3: Cache memory (Tema 3: Memoria cache)
Eduard Ayguadé and Josep Llosa

Source: studies.ac.upc.edu/ETSETB/SEGPAR/slides/tema3.pdf

These slides have been prepared using material from “Estructura de Computadores II” by professors A. Fernandez, J. Llosa & F. Sanchez and material available at the companion web site for “Computer Organization & Design. The Hardware/Software Interface. Copyright 1998 Morgan Kaufmann Publishers.”

Summary of memory technology

DRAM is slow but cheap and dense:
- Good choice for presenting the user with a BIG memory system
- Uses one transistor per cell; must be refreshed
- 60-120 ns, $5-$10 per Mbyte in 1997

SRAM is fast but expensive and not very dense:
- Good choice for providing the user with FAST access time
- Uses 4 to 6 transistors per cell; holds state as long as power is supplied
- 5-25 ns, $100-$250 per Mbyte in 1997

What about magnetic disks?
- 10-20 million ns, $0.1-$0.2 per Mbyte in 1997

Processor - DRAM gap

[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000) for CPU and DRAM. µProc: 60%/yr. DRAM: 7%/yr. The processor - memory performance gap grows 50% per year.]

Is this gap so important?
- A lot of work can be done inside the processor during a memory access

Example:
- DRAM latency of 80 ns
- Processor clock cycle of 1 ns
- 80 clks to access data x 6 instructions per clk → 480 instructions
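The arithmetic of the example can be written as a small helper. This is only a sketch: the function name `lost_issue_slots` and its parameters are hypothetical, and it assumes the latency is a whole number of clock cycles.

```c
/* Issue slots that go unused while one memory access is outstanding.
 * All names are illustrative; the slide uses 80 ns, 1 ns and 6. */
unsigned long lost_issue_slots(unsigned long latency_ns,
                               unsigned long cycle_ns,
                               unsigned long instr_per_cycle)
{
    unsigned long stall_cycles = latency_ns / cycle_ns; /* 80 clks in the example */
    return stall_cycles * instr_per_cycle;
}
```

Calling `lost_issue_slots(80, 1, 6)` reproduces the 480 instructions of the slide.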

Motivation for memory hierarchy

Goal:
- Present the user with large amounts of memory using the cheapest technology.
- Provide access at the speed offered by the fastest technology.

[Figure: levels in the memory hierarchy (CPU, Level 1, Level 2, ..., Level n). The distance from the CPU in access time increases with the level number, and so does the size of the memory at each level.]

Locality

A principle that makes having a memory hierarchy a good idea.

If an item is referenced:
- it will tend to be referenced again soon (temporal locality)
- nearby items will tend to be referenced soon (spatial locality)

Examples:
- 90/10 rule: 90% of the dynamic references to code are located in 10% of the memory address space (loops, procedures, …)
- Access to consecutive elements in data structures (vectors, arrays, lists, …)

Temporal locality suggests:
- “Once you reference a memory location, bring its contents to a level closer to the processor”

Spatial locality suggests:
- “Once you reference a memory location, also bring the contents of nearby memory locations”

Block or line:
- A number of consecutive words in memory (e.g. 32 bytes, equivalent to 4 words x 8 bytes)
- Unit of information that can either be present or not present in a level of the memory hierarchy
- Unit of information that is transferred between two levels in the hierarchy

Block or line

Finding a piece of data in memory ...

[Figure: the address issued by the processor (@) is split into three fields, from most to least significant: line in memory (11 bits), word inside line (2 bits), byte inside word (3 bits).]

Example:
- 1 word contains 8 bytes = 64 bits
- 1 line contains 4 words
- memory contains 2048 lines
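As an illustration of the address split in this example, the following sketch extracts the three fields with shifts and masks. The type `addr_fields` and the function `split_address` are hypothetical names; the field widths (3, 2 and 11 bits) come from the example above.

```c
#include <stdint.h>

/* Field widths taken from the example: 8-byte words, 4-word lines, 2048 lines. */
enum { BYTE_BITS = 3, WORD_BITS = 2, LINE_BITS = 11 };

typedef struct {
    unsigned line;  /* line in memory   (11 bits) */
    unsigned word;  /* word inside line (2 bits)  */
    unsigned byte;  /* byte inside word (3 bits)  */
} addr_fields;

addr_fields split_address(uint32_t addr)
{
    addr_fields f;
    f.byte = addr & ((1u << BYTE_BITS) - 1);
    f.word = (addr >> BYTE_BITS) & ((1u << WORD_BITS) - 1);
    f.line = (addr >> (BYTE_BITS + WORD_BITS)) & ((1u << LINE_BITS) - 1);
    return f;
}
```

For instance, address 204Eh falls in line 258, word 1, byte 6.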

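Block or line

Memory hierarchy is inclusive.

[Figure: processor, cache and main memory, each memory drawn as an array of lines (0, 1, …, N-1 and 0, 1, …, M-1). Words move between processor and cache; lines move between cache and main memory. Block x exists in both levels; block y exists only in the upper level.]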

Line size trade-off

In general, larger line sizes take advantage of spatial locality, BUT:
- Larger line size means larger miss penalty: it takes longer to fill up the line
- If line size is too big relative to cache size, miss rate will go up: too few cache lines
- Pollution: with big lines you bring in data with less probability of being used (spatial locality)

Memory hierarchy terminology

Hit: data appears in some block in a level of the hierarchy
- Hit rate (h): the fraction of memory accesses found in that level
- Hit time (th): time to access that level: memory access time + time to determine hit/miss

Miss: data needs to be retrieved from a block in the next level of the hierarchy
- Miss rate: m = 1 - h
- Miss penalty (tm): time to replace a block in that level with a block in the next level

In general, Average Access Time = th x h + tm x m
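A minimal sketch of the average access time formula exactly as stated above (th x h + tm x m, with m = 1 - h). The function name `average_access_time` is hypothetical, and the time unit is whatever th and tm are given in.

```c
/* Average access time as defined on the slide: th*h + tm*m, with m = 1 - h. */
double average_access_time(double th, double tm, double h)
{
    double m = 1.0 - h;   /* miss rate */
    return th * h + tm * m;
}
```

With a hit rate of 1 the result is just th; as the hit rate drops, the result moves towards tm.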

Instruction and data caches

Instructions and data behave in a different way.

[Figure: separate L1 instruction (I) and data (D) caches backed by a combined L2.]
- Harvard architecture that allows simultaneous access to both I and D
- Combined L2, usually L2 >> L1

Impact on performance

Suppose a processor executes at:
- Clock rate = 1 GHz (1 ns per cycle)
- CPI = 1.3
- 50% arith/logic, 30% ld/st, 20% control

Suppose a 1% instruction miss rate with a 50-cycle miss penalty:
- CPI = ideal CPI + stalls per instruction = 1.3 + 0.01 x 50 = 1.8 cycles
- 28% of the time (0.5 / 1.8) is spent waiting for instructions!

Suppose a 10% data miss rate with a 50-cycle miss penalty:
- CPI = ideal CPI + average stalls per instruction = 1.3 + (0.30 x 0.10 x 50) = 2.8 cycles
- 53% of the time the processor is stalled waiting for memory!
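The CPI computation above can be sketched as two small functions. The names `cpi_with_stalls` and `stall_fraction` are hypothetical; `refs_per_instr` is assumed to be 1.0 for instruction fetches and 0.30 for the load/store mix of the example.

```c
/* CPI including memory stalls:
 * ideal CPI + references per instruction x miss rate x miss penalty. */
double cpi_with_stalls(double ideal_cpi, double refs_per_instr,
                       double miss_rate, double miss_penalty)
{
    return ideal_cpi + refs_per_instr * miss_rate * miss_penalty;
}

/* Fraction of the execution time spent stalled. */
double stall_fraction(double ideal_cpi, double cpi)
{
    return (cpi - ideal_cpi) / cpi;
}
```

For the data-miss case on the slide (ideal CPI 1.3, 30% loads/stores, 10% miss rate, 50-cycle penalty) this gives 2.8 cycles, with about 53% of the time stalled.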

More on memory hierarchy terminology

Mapping policy:
- Where can a line be found?
- Direct, set associative, fully associative

Replacement policy:
- When a level is full, which line is eliminated in order to have an empty slot for the new line? (temporal locality: the new one has higher probability of being referenced in the near future)
- Random, FIFO, LRU

Write policy:
- Which levels in the hierarchy are updated on a write?
- On hit: write through, copy back. On miss: write allocate, write no allocate

Mapping policy

Where can a line be found in cache memory?

[Figure: a 32-line main memory (line numbers 00000 to 11111) mapped onto three 8-line caches: a direct cache (lines 000 to 111), a 2-way set associative cache (sets 00 to 11, ways 0 and 1) and a fully associative cache (lines 000 to 111).]


How can one differentiate between the possible candidates for a cache line?

Direct cache

[Figure: 8-line direct cache (lines 000 to 111), each line holding a tag and data, next to the 32-line main memory (lines 00000 to 11111).]

In cache line number 010 one can find:
- tag = 00 → line 00010 of main memory
- tag = 01 → line 01010 of main memory
- tag = 10 → line 10010 of main memory
- tag = 11 → line 11010 of main memory



Fully associative

[Figure: 8-line fully associative cache (lines 000 to 111), each line holding a tag and data, next to the 32-line main memory (lines 00000 to 11111).]

In cache line number 010 one can find:
- tag = 00000 → line 00000 of main memory
- tag = 00001 → line 00001 of main memory
- …
- tag = 11111 → line 11111 of main memory



Set associative

[Figure: 2-way set associative cache with 4 sets (set number 00 to 11, way 0 or 1), each line holding a tag and data, next to the 32-line main memory (lines 00000 to 11111).]

In cache line number 010 (set 01, way 0) one can find:
- tag = 000 → line 00001 of main memory
- tag = 001 → line 00101 of main memory
- …
- tag = 111 → line 11101 of main memory
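The three mapping policies illustrated above differ only in how the main-memory line number is split into an index and a tag. The sketch below uses the geometry of the figures (32 memory lines, 8 cache lines, 2 ways); all function names are hypothetical.

```c
/* Geometry of the figures: 5-bit memory line numbers, 8 cache lines. */
enum { CACHE_LINES = 8, WAYS = 2, SETS = CACHE_LINES / WAYS };

/* Direct mapping: the low 3 bits select the cache line, the rest is the tag. */
unsigned direct_index(unsigned mem_line) { return mem_line % CACHE_LINES; }
unsigned direct_tag(unsigned mem_line)   { return mem_line / CACHE_LINES; }

/* 2-way set associative: the low 2 bits select the set, the rest is the tag. */
unsigned set_index(unsigned mem_line) { return mem_line % SETS; }
unsigned set_tag(unsigned mem_line)   { return mem_line / SETS; }

/* Fully associative: a line may go anywhere, so the whole line number is the tag. */
unsigned full_tag(unsigned mem_line) { return mem_line; }
```

For example, memory line 10010 maps to direct cache line 010 with tag 10, and memory line 00101 maps to set 01 with tag 001.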

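Implementing the mapping policy

[Figure: direct mapping, line size = 1 word x 4 bytes. The address is split into tag and # line.]

[Figure: direct mapping with four-word lines (32 bits per word). The address is split into tag and # line; the selected word is delivered as Data.]

[Figure: fully associative. The address provides the tag.]

[Figure: 4-way set associative. The address is split into tag and # set.]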

Sources of cache misses

Compulsory (cold start or process migration, first reference): first access to a block
- “Cold” fact of life: not a whole lot you can do about it

Conflict (collision): multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity

Capacity: cache cannot contain all blocks accessed by the program
- Solution: increase cache size

Invalidation: another process (e.g., I/O) updates memory

[Figure: mapping policies behaviour.]

Reducing misses

Assist cache (AC):
- Bring a line into the cache only if it shows locality, thus avoiding the replacement of other lines with more locality

Victim cache (VC):
- Give a second opportunity to those lines that are replaced

[Figures: cache, main memory and AC; cache, main memory and VC.]

Evaluation: access time (th)

[Figure: access time th (ns, 0 to 80) versus cache size (1 Kbyte to 8 Mbytes) for 1-way, 2-way, 4-way and 8-way associativity; 0.8µ technology.]

[Figure: access time th (ns, 0 to 70) versus cache size (1 Kbyte to 8 Mbytes) for line sizes of 16, 32, 64, 128 and 256 bytes; direct mapped, 0.8µ technology.]

Evaluation: miss rate

SPEC95, reference data set; 10^12 instructions simulated; nr = 0.3735 references to memory per instruction.

[Figure: miss rate (0.00 to 0.40) versus cache size (1 Kbyte to 8 Mbytes) for 1-way, 2-way, 4-way and 8-way associativity; data cache only.]

[Figure: miss rate (0.00 to 0.50) versus cache size (1 Kbyte to 8 Mbytes) for line sizes of 16, 32, 64, 128 and 256 bytes; direct mapped, data cache only.]

Evaluation: average memory access

Conditions for both plots: 0.8µ technology, DRAM access time 250 ns, data cache only, write through / write allocate, 8-byte bus between MC and MP.

[Figure: average memory access time (ns, 0 to 120) versus cache size (1 Kbyte to 8 Mbytes) for 1-way, 2-way, 4-way and 8-way associativity.]

[Figure: average memory access time (ns, 0 to 250) versus cache size (1 Kbyte to 8 Mbytes) for line sizes of 16, 32, 64, 128 and 256 bytes.]

Cache line replacement policy

Random replacement:
- Hardware randomly selects a cache line out of the set and replaces it
- Can be implemented with a single counter

First-In First-Out (FIFO):
- The oldest line in the set is going to be replaced
- Can be implemented with a counter per set, assuming that lines in the set are filled/replaced in order

Least Recently Used (LRU):
- Hardware keeps track of the access history
- Replace the entry that has not been used for the longest time
- For a two-way set associative cache one needs one bit for LRU replacement

Exercise: how to implement it?
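One possible answer to the exercise, given only as a sketch: in a two-way set, a single bit per set can name the way that was used least recently, and every access flips it to point at the other way. The type `set2` and the function `set2_access` are hypothetical, and a real design would keep this state in hardware rather than in a struct.

```c
#include <stdbool.h>

/* One set of a 2-way set associative cache.
 * lru names the way that was used least recently (the next victim). */
typedef struct {
    bool     valid[2];
    unsigned tag[2];
    unsigned lru;      /* the single LRU bit: 0 or 1 */
} set2;

/* Access a tag in the set; returns true on a hit, false on a miss.
 * On a miss the line held in the LRU way is replaced. */
bool set2_access(set2 *s, unsigned tag)
{
    for (unsigned w = 0; w < 2; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = 1u - w;   /* the other way is now least recently used */
            return true;
        }
    }
    unsigned victim = s->lru;
    s->valid[victim] = true;
    s->tag[victim] = tag;
    s->lru = 1u - victim;
    return false;
}
```

Starting from a zero-initialized set, the access sequence 10, 20, 10, 30 replaces tag 20 (not 10) when 30 arrives, because 10 was touched more recently.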

Cache write policy

Cache read is much easier to handle than cache write:
- Instruction cache is much easier to design than data cache

Cache write policy:
- It decides how we keep the data in the cache and in memory consistent

Two options on a hit:
- Copy-back: write to cache only. Write the cache block to memory when that cache block is being replaced on a cache miss.
  - Need a “dirty bit” for each cache block
  - Greatly reduces the memory bandwidth requirement
- Write-through: write to cache and memory at the same time.
  - Replacements do not need to write lines back to memory
  - So … writes are done with a large access time!

Write buffer for write-through policy

A write buffer is needed between the cache and main memory:
- Processor: writes data into the cache and the write buffer
- Cache controller: writes the contents of the buffer to memory

Write buffer is just a FIFO:
- Typical number of entries: 4

[Figure: processor, cache, write buffer (word entries) and main memory.]

Write buffer for copy-back policy

Reducing miss penalty: in this case the buffer holds those replaced “dirty” lines that need to be updated in main memory.

On a miss, we now need to check whether the line is in the write buffer.

[Figure: cache, write buffer (tags and lines) and main memory. The address (@) issued by the processor is compared (=) with every tag in the buffer; a match signals a hit in the write buffer and a mux selects the matching line.]

Cache write policy (cont.)

What happens on a write miss? Do we read in the block?
- Yes: write-allocate
- No: write-no-allocate

Usually:
- Write-allocate goes together with copy-back
- Write-no-allocate goes together with write-through

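Moving lines around …

Size of the bus between cache and main memory:

[Figure: timing of a four-word line transfer (memory access, transfer + cache load) for three organizations:
- Single module memory / one-word wide bus (b0 b1 b2 b3)
- Multi module memory / one-line wide bus (b0-b3)
- Multi module memory / one-word wide bus (b0 b1 b2 b3)]

Interleaved memory

[Figure: eight memory modules with the line numbers distributed among them (# line, count % 8):
- module 0: lines 0, 8, 16
- module 1: lines 1, 9, 17
- module 2: lines 2, 10, 18
- module 3: lines 3, 11
- module 4: lines 4, 12
- module 5: lines 5, 13
- module 6: lines 6, 14
- module 7: lines 7, 15
Timing diagram: access and transfer of b0 b1 b2 b3 b4 b5 b6 b7.]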
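The distribution shown in the interleaved-memory figure (lines 0, 8, 16 in module 0; lines 1, 9, 17 in module 1; and so on) corresponds to taking the line number modulo 8. A minimal sketch, with hypothetical function names:

```c
/* Number of memory modules in the figure. */
enum { MODULES = 8 };

/* Module that holds a given line. */
unsigned module_of(unsigned line) { return line % MODULES; }

/* Position of the line inside its module (0 for the first line stored there). */
unsigned slot_in_module(unsigned line) { return line / MODULES; }
```

Consecutive lines therefore fall in different modules, which is what lets their accesses proceed in separate modules.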

Semiconductor memories

Types of semiconductor memory:
- Static memory (SRAM, Static RAM). Each memory cell is equivalent to one flip-flop (7-8 transistors). Compared with DRAM, it is fast, has high power consumption, low capacity and is expensive.
  → Cache memory
- Dynamic memory (DRAM, Dynamic RAM). Each cell behaves like a capacitor (1-1.x transistors). Compared with SRAM, it is slow, has low power consumption, high capacity and is cheap. It has the refresh problem.
  → Main memory


6-transistor SRAM cell

- The information is stored in 2 cross-coupled inverters.
- When the word line is activated, the stored datum is read through the bit lines.
- Both the datum and its complement are obtained.

[Figure: 6-transistor SRAM cell with wordline, bitline and bitline’.]

1-transistor DRAM cell

- The information is stored in the capacitor Cs.
- When the word line is activated, the stored datum is read through the bit line.
- The capacitor discharges little by little, so it must be recharged regularly (refresh).

[Figure: 1-transistor DRAM cell with wordline and bitline.]

Internal structure of a DRAM

[Figure: memory array of 1024 x 1024 x 1 bit (2^10 rows by 2^10 columns of cells, each at the crossing of a wordline and a bitline). The 20-bit address (@) is split into ROW (10 bits) and COLUMN (10 bits). @ROW feeds the row decoder, the bitlines feed the sense amplifiers, and @COLUMN feeds the column decoder, which is connected to the R/W buffer (DATA).]
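As a sketch of the addressing in this figure, the 20-bit cell address can be split into a 10-bit row and a 10-bit column. It assumes, following the ROW COLUMN order in the figure, that the row occupies the upper bits; the function names are hypothetical.

```c
#include <stdint.h>

enum { ROW_BITS = 10, COL_BITS = 10 };  /* 2^10 rows x 2^10 columns */

/* Upper 10 bits of the 20-bit cell address: selects the wordline. */
unsigned dram_row(uint32_t addr)
{
    return (addr >> COL_BITS) & ((1u << ROW_BITS) - 1);
}

/* Lower 10 bits: selects one bitline once the row has been read. */
unsigned dram_col(uint32_t addr)
{
    return addr & ((1u << COL_BITS) - 1);
}
```

Two addresses that differ only in their lower 10 bits share a row, which is the property the page-mode DRAM variants rely on.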

A read operation:
1) Decode @ROW; the wordline signal is activated.
2) All the cells of the row are accessed; the data of the whole row are sent to the sense amplifiers and the voltage is restored (the datum is held in a capacitor that discharges little by little).
3) Decode @COLUMN; one bitline is selected and the datum is sent out through the R/W buffer.
4) The read is destructive: the cell (and the whole row) must be rewritten to recover the original value (this is equivalent to precharging the row for the next memory access).

A write operation (steps 1 and 2 are as in a read):
3) Practically the same; the only difference is that the cell is rewritten with the datum that comes in through the R/W buffer.
4) The cell must be rewritten with the new value (and the rest of the row with its original value).

[Figures: the DRAM array with row decoder, sense amplifiers, column decoder and R/W buffer, one drawing per step.]


Simplified timing diagram of a read operation

Three fundamental values:
- Access time (tRAC): maximum delay from when the row address is supplied until the datum is obtained → memory latency.
- Cycle time (tRC): minimum time interval between two consecutive memory accesses → bandwidth.
- Column access time (tCAC): maximum delay from when the column address is supplied until the datum is obtained.

[Timing diagram: address bus carrying @ROW, @COL, @ROW, @COL; data bus D carrying DATA; intervals tRC, tRAC and tCAC marked; steps 1 to 4 of the read operation.]


Simplified timing diagram of a read operation

Typical values:
- Access time (tRAC): 50 ns.
- Cycle time (tRC): 70 ns.
- Column access time (tCAC): 20 ns.

Possible improvements? → Access to blocks of information.

Since the data of the row are already in the sense amplifiers, it is only necessary to send @COL+1, @COL+2, …

[Timing diagram: address bus carrying @ROW, @COL, @ROW, @COL; data bus D carrying DATA; intervals tRC, tRAC and tCAC marked; steps 1 to 4 of the read operation.]

FPM DRAM (Fast Page Mode DRAM)

Fundamental idea: once the row has been accessed, several columns can be accessed simply by changing the @COL.
→ This exploits spatial locality

Typical values:
- 1st access (tRAC): 50 ns.
- 2nd access (tPC): 35 ns.

[Timing diagram: @ROW, @COL, @COL on the address bus; two DATA items on D; tRAC, tCAC (twice) and tPC marked.]

Types of DRAM

Fast Page Mode DRAM (FPM DRAM)

Extended Data Out DRAM (EDO DRAM)

Burst EDO DRAM (BEDO DRAM)

Synchronous DRAM (SDRAM)

Double Data Rate SDRAM (DDR SDRAM)

DDR2 SDRAM

Evaluation of FPM DRAM

Reading a 32-byte block (assuming main memory is 8-way interleaved):
- 66 MHz motherboard (cycle time: 15 ns).
- 4 memory accesses.
- Timing: 5-3-3-3 (includes precharge).
- Bandwidth: 152 Mbytes/s.

Problem of FPM DRAM: one has to wait until the datum has been read before sending the new @COL.

[Timing diagram: @ROW, @COL, @COL on the address bus; two DATA items on D; tRAC, tCAC (twice) and tPC marked.]
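The bandwidth figure follows from the timing: 5+3+3+3 = 14 cycles of 15 ns to read 32 bytes. The sketch below reproduces that arithmetic. The function name is hypothetical, and it assumes 1 Mbyte = 10^6 bytes, which is the convention that matches the 152 Mbytes/s on the slide.

```c
/* Bandwidth in Mbytes/s (assuming 1 Mbyte = 10^6 bytes) for reading `bytes`
 * with a burst whose per-access cycle counts are timing[0..n-1]. */
double burst_bandwidth_mbytes(const int *timing, int n,
                              double cycle_ns, double bytes)
{
    int cycles = 0;
    for (int i = 0; i < n; i++)
        cycles += timing[i];
    double total_ns = cycles * cycle_ns;
    return bytes / total_ns * 1000.0;  /* bytes per ns -> Mbytes per second */
}
```

With the timing {5, 3, 3, 3}, a 15 ns cycle and 32 bytes, the result is about 152.4 Mbytes/s; other burst timings can be evaluated the same way.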

EDO DRAM (Extended Data Out DRAM)

Fundamental idea: a register is added at the data output
→ the access to the data can be overlapped
→ with the sending of the new @COL

[Timing diagram: @ROW, @COL, @COL on the address bus; two DATA items on D; tRAC and tPC marked.]

Internal structure of an EDO DRAM

[Figure: array of 2^10 rows by 2^10 columns of cells with row decoder; 20-bit address (@) split into ROW (10 bits) and COLUMN (10 bits); R/W buffer (DATA) with an added register (REG).]

Evaluation of EDO DRAM

Reading a 32-byte block:
- 66 MHz motherboard (cycle time: 15 ns).
- 4 memory accesses.
- Timing: 5-2-2-2 (includes precharge).
- Bandwidth: 193 Mbytes/s.

Possible improvement of EDO DRAM: the accesses are always to consecutive positions.

[Timing diagram: @ROW, @COL, @COL on the address bus; two DATA items on D; tRAC and tPC marked.]

BEDO DRAM (Burst EDO DRAM)

Fundamental idea: add a counter to generate the new @COL.

[Timing diagram: @ROW and a single @COL on the address bus; two DATA items on D; tRAC and tPC marked.]

Internal structure of a BEDO DRAM

[Figure: array of 2^10 rows by 2^10 columns of cells with row decoder; 20-bit address (@) split into ROW (10 bits) and COLUMN (10 bits); R/W buffer (DATA) with register (REG); a 10-bit counter (CONT) on the column address.]

Evaluation of BEDO DRAM

Reading a 32-byte block:
- 66 MHz motherboard (cycle time: 15 ns).
- 4 memory accesses.
- Timing: 5-1-1-1 (includes precharge).
- Bandwidth: 266 Mbytes/s.

Problem: it is an asynchronous memory. Asynchronous memories are hard to improve because of noise problems. It is very difficult for them to support frequencies above 66 MHz.

[Timing diagram: @ROW and a single @COL on the address bus; two DATA items on D; tRAC and tPC marked.]

Architectural solution

Problem: asynchronous memories cannot be improved.

Architectural solution:
- Pipeline the internal operation of the memories
- Make them work SYNCHRONOUSLY
→ The noise problems disappear
→ The operating frequency can be increased

SDRAM (Synchronous DRAM)

- Pipelined operation.
- Synchronous.
- It can run at a much higher frequency than an asynchronous DRAM.
- Auto-increment of the @COL.
- Programmable via commands.
- It has multiple banks (which allows the precharge to be hidden).
- The internal operation is very similar to that of an asynchronous DRAM.

DDR SDRAM (Double Data Rate SDRAM)

- It is an SDRAM that sends data at twice the rate of a conventional SDRAM.
- The bandwidth is doubled by modifying only the circuitry in charge of data input/output.
- The memory latency is practically the same.

[Timing diagram: CLOCK against the DATA transfers of SDRAM and of DDR SDRAM.]

DDR2 SDRAM

It is a DDR improved in the following aspects:
- Reduced page size (number of columns of the array):
  - Reduces the access time
  - Reduces the power consumption
- Increased number of banks:
  - Increases the chances of overlapping, with a good memory controller, the precharge and the row access of independent banks.
- Increased frequency

Types of DRAM

Examples of commercial DRAM memories

Year   Memory type   Capacity    Motherboard freq.   Bandwidth (32-byte read)   Latency (1st datum / rest)
2000   EDO DRAM      64 Mbits    66 MHz              203.4 Mbytes/s             50 / 20 ns
2002   SDRAM         128 Mbits   167 MHz             565.1 Mbytes/s             36 / 6 ns
2003   DDR SDRAM     256 Mbits   200 MHz             813.8 Mbytes/s             30 / 2.5 ns

Interleaved memory

8-way interleaved main memory of 256 Mbytes, with 32-byte lines.

[Figure: the logical address space (bytes 0 to 2^28-1, 256 MB) is spread over eight modules M0 ... M7 of 32 MB each. Each module is 1 byte wide, so one row across the modules holds 8 bytes: bytes 0 1 2 … 7, then 8 9 10 … 15, 16 17 18 … 23, 24 25 26 … 31, …, up to 2^28-8 2^28-7 2^28-6 … 2^28-1. LINE 0 is the first 32 bytes.]

DIMM with 8 DRAM memory chips

[Figure: the eight modules M0 ... M7 are the chips of the DIMM; each one supplies 8 bits of the 64-bit bus.]

DRAM chip with 4 banks

[Figure: module M0 (32 MB) is split into four banks B0 ... B3 of 8 MB each. In M0, bank 0 holds bytes 0, 8, 16, …; bank 1 holds 2^26+0, 2^26+8, 2^26+16, …; bank 2 holds 2^27+0, 2^27+8, 2^27+16, …; bank 3 holds 2^27+2^26+0, 2^27+2^26+8, 2^27+2^26+16, …]

Each bank contains 4 bytes of each 32-byte line.

Structure of an 8 MB bank

[Figure: a bank is an array of 8K rows of 1 KB each; every position holds 1 byte (8 cells of 1 bit). In M0 the first positions hold bytes 0, 8, 16, 24, …]

Distribution of the 4 bytes of each line inside a DRAM module

[Figure: the first row of the bank holds line 0, line 1, line 2, …, line 255; the next row holds line 256, line 257, line 258, …, line 511; and so on up to line 2^20-1.]

Where is byte 000204Eh?

Lines of 32 bytes.

Address (28 bits): 0000 0000 0000 0010 0000 0100 1110
- line 258, byte 14
- bank 0, row 1, column 9, module 6

[Figure: module 6 (of M0 ... M7) with its four 8 MB banks (bank 0 to bank 3, starting at addresses 0, 2^26, 2^27 and 2^27+2^26). Inside bank 0, an array of 8K rows of 1 KB, row 1 (line 256 … line 511) and column 9 select the byte.]

Byte 000204Eh!
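The decoding of byte 000204Eh can be sketched with shifts and masks. The field layout used here (3 module bits, 10 column bits, 13 row bits, 2 bank bits, from least to most significant) is inferred from the figures (8 modules, 1 KB rows, 8K rows per bank, 4 banks) and should be read as an assumption; the names `location` and `locate` are hypothetical.

```c
#include <stdint.h>

/* Where a byte of the 256-Mbyte, 8-way interleaved memory is stored. */
typedef struct {
    unsigned module, column, row, bank;  /* physical position       */
    unsigned line, byte_in_line;         /* logical 32-byte line    */
} location;

location locate(uint32_t addr)   /* addr is a 28-bit byte address */
{
    location l;
    l.module = addr & 0x7;            /* bits 2..0   */
    l.column = (addr >> 3) & 0x3FF;   /* bits 12..3  */
    l.row    = (addr >> 13) & 0x1FFF; /* bits 25..13 */
    l.bank   = (addr >> 26) & 0x3;    /* bits 27..26 */
    l.line         = addr / 32;
    l.byte_in_line = addr % 32;
    return l;
}
```

For address 204Eh this yields line 258, byte 14, module 6, column 9, row 1 and bank 0, in agreement with the slide.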

Data transfer between main memory (MP) and cache memory (MC) is done in whole lines:
- The access address has to be changed in order to access the first byte of the line
- All the modules are accessed simultaneously

Address of byte 000204Eh (line 258, byte 14): 0000 0000 0000 0010 0000 0100 1110
Address used for the line transfer:            0000 0000 0000 0010 0000 0100 0XXX
- bank 0, row 1, column 8, all the modules

Interleaved memory

We want to build 1 GByte of RAM using DIMM modules of 256 Mbytes.
- The address has 30 bits
- We need 4 DIMM modules
- Each module is addressed with 28 bits
- All the modules are physically connected to the same data bus (tri-state)
- Only one of the modules can operate at any given time
- The two most significant bits of the address determine which module operates (CS)
- The remaining 28 bits determine how the module is accessed

Example: access to address 2000204Eh
- Address (30 bits): 10 0000 0000 0000 0010 0000 0100 1110
- The two most significant bits are 1 0 → DIMM 2
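A minimal sketch of the DIMM selection described above: the two most significant bits of the 30-bit address pick the DIMM and the remaining 28 bits address a byte inside it. The function names are hypothetical.

```c
#include <stdint.h>

/* 1 GByte built from 4 DIMMs of 256 Mbytes: 30-bit address, 28 bits per DIMM. */

/* DIMM number (drives the chip select): address bits 29..28. */
unsigned dimm_select(uint32_t addr) { return (addr >> 28) & 0x3; }

/* Address inside the selected DIMM: bits 27..0. */
uint32_t dimm_offset(uint32_t addr) { return addr & 0x0FFFFFFFu; }
```

For address 2000204Eh this selects DIMM 2 and leaves the in-module address 000204Eh.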

“Horizontal” interleaving

“Vertical” interleaving



Example: reading a 32-byte line (LINE 0)

256-Mbyte main memory, 32-byte lines. How is line 0 physically distributed over the memory chips?

[Figure: modules M0, M1, M2, …, M7. M0 holds bytes 0, 8, 16, 24 of line 0; M1 holds 1, 9, 17, 25; M2 holds 2, 10, 18, 26; …; M7 holds 7, 15, 23, 31. The next line follows: 32, 40, … in M0; 33, 41, … in M1; 34, 42, … in M2; 39, 47, … in M7.]

How is line 0 read from MP?
- Send / decode the @row (row 0)
- Read the row and send it to the sense amplifiers
- Send / decode the @col (col 0)
- Read the 1st datum of the line in all the modules
- Send the first 8 bytes (0, 1, 2, …, 7) to the cache memory
- Increment the @col (col = 1)
- Read the 2nd datum of the line in all the modules
- Send the next 8 bytes (8, 9, 10, …, 15) to the cache memory
- . . .

Reading one 32-byte line; MP is organized in DIMMs 8 bytes wide.

Simplified timing diagram: row latency (4 cycles), column latency (4 cycles), transfer rate (8 bytes per cycle).

[Timing diagram: @ROW, @COL, then four data transfers.]

The output / transfer speed of the data depends on the type of memory and on the motherboard (buses).

Writing one 32-byte line; MP is organized in DIMMs 8 bytes wide.

Simplified timing diagram: row latency (4 cycles), column latency (4 cycles), transfer rate (8 bytes per cycle).

[Timing diagram: @ROW, @COL, then four data transfers.]

The timing diagram is identical!

Example of the timing of a line transfer between main memory and cache memory:
- Line size: 8 bytes.
- Row latency: 3 cycles.
- Column latency: 2 cycles.
- Main memory organized in “DIMMs” 2 bytes wide
- Transfer rate between MP and MC: 2 bytes per cycle
- The cache takes 1 cycle to detect a miss (a transfer of 1 line from MP is required)
- The cache takes 1 cycle to write the received line and send the datum to the processor (hit)


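[Timing diagram, 11 cycles in total:
- CPU request: access on miss (M) at the start, access on hit (H) at the end
- MP read: row latency, column latency, then the line is read from MP as bytes 0,1 then 2,3 then 4,5 then 6,7]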
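The 11 cycles of this example can be accounted for as 1 (miss detection) + 3 (row latency) + 2 (column latency) + 4 (transfers of 2 bytes) + 1 (cache write and hit). The sketch below encodes that sum; the struct and function names are hypothetical, and it assumes the phases do not overlap.

```c
/* Parameters of a line transfer from main memory to cache, in cycles and bytes. */
typedef struct {
    unsigned line_bytes;      /* 8 in the example */
    unsigned bytes_per_cycle; /* 2 */
    unsigned row_latency;     /* 3 cycles */
    unsigned col_latency;     /* 2 cycles */
    unsigned miss_detect;     /* 1 cycle  */
    unsigned fill_and_hit;    /* 1 cycle  */
} miss_params;

/* Total cycles from the CPU request to the datum, with no overlap between phases. */
unsigned miss_cycles(const miss_params *p)
{
    unsigned transfers = p->line_bytes / p->bytes_per_cycle;
    return p->miss_detect + p->row_latency + p->col_latency
         + transfers + p->fill_and_hit;
}
```

With the values of the example (8, 2, 3, 2, 1, 1) the function returns 11.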

Reducing miss penalty

Load into cache as words are transferred:

[Timing diagram: @ CPU, write to MP, read from MP, load into MC; the miss penalty tm is marked.]

Write-buffer for copy-back:

[Timing diagram: @ CPU, write to MP, read from MP, load into MC; the miss penalty tm is marked.]

Early restart:

[Timing diagram: @ CPU, write to MP, read from MP (b0 b1 b2 b3), load into MC; the miss penalty tm is marked.]

Reorder data:

[Timing diagram: @ CPU, write to MP, read from MP (b2 b3 b0 b1), load into MC; the miss penalty tm is marked.]
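The reordered transfer in the last diagram (b2 b3 b0 b1) starts at the requested word and wraps around the line. A minimal sketch of that ordering, with a hypothetical function name and assuming a simple wrap-around order:

```c
/* Fill out[] with the order in which the words of a line are transferred
 * when the transfer starts at the requested word and wraps around. */
void wraparound_order(unsigned requested, unsigned words_per_line, unsigned *out)
{
    for (unsigned i = 0; i < words_per_line; i++)
        out[i] = (requested + i) % words_per_line;
}
```

For a four-word line with word 2 requested, the order is 2, 3, 0, 1.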

Lab session

Code 1: Matrix product, different ijk forms

    #define N 256

    int main(void) {
        int i, j, k;
        float A[N][N], B[N][N], C[N][N];

        /* initialization of A, B and C */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                A[i][j] = 1.2;
                B[i][j] = 3.4;
                C[i][j] = 0.0;
            }

        /* Computation of C (ijk form) */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
        return 0;
    }
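The listing above is the ijk form. The other two forms named in the lab differ only in the order of the three loops; the sketch below shows how they might look. The function names and the use of functions at all are assumptions, since the slides only show the ijk version.

```c
#define N 256

/* ikj form: the inner loop walks row i of C and row k of B. */
void matmul_ikj(float C[N][N], float A[N][N], float B[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}

/* jki form: the inner loop walks column j of C and column k of A. */
void matmul_jki(float C[N][N], float A[N][N], float B[N][N])
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
```

Both functions accumulate into C, so C must be zeroed beforehand, as in the initialization loop of the original program.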


Code 1: Matrix product, different ijk forms

                                    jki form        ikj form        ijk form
Cycles                              1,846,706,514   1,082,413,938   1,392,048,480
Instructions executed               959,067,394     959,064,702     959,064,862
Floating point operations           33,555,463      33,555,463      33,555,463
Memory accesses (read and write)    655,617,326     655,613,221     655,615,970
Instruction cache misses            2,537           2,384           2,571
Data cache misses                   33,757,489      2,321,923       17,086,764
Time (seconds)                      9.234           5.412           6.971
MIPS                                103.86          177.21          137.58
MFLOPS                              3.63            6.20            4.81
Data cache miss ratio               5.149%          0.354%          2.606%
Instruction cache miss ratio        0.000%          0.000%          0.000%

Note: execution on a Pentium MMX at 200 MHz
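The derived rows of this table (time, MIPS, MFLOPS, miss ratio) follow from the raw counters and the 200 MHz clock. The sketch below shows the relations; the function names are hypothetical, and it assumes the miss ratio is computed over all memory accesses, which is what matches the reported percentages.

```c
/* Execution time in seconds from a cycle count and a clock frequency in Hz. */
double exec_time_s(double cycles, double clock_hz) { return cycles / clock_hz; }

/* Millions of instructions per second. */
double mips(double instructions, double seconds) { return instructions / seconds / 1e6; }

/* Millions of floating point operations per second. */
double mflops(double fp_ops, double seconds) { return fp_ops / seconds / 1e6; }

/* Miss ratio as a percentage of the memory accesses. */
double miss_ratio_pct(double misses, double accesses) { return 100.0 * misses / accesses; }
```

For the ikj column, 1,082,413,938 cycles at 200 MHz give about 5.412 s, and 2,321,923 misses over 655,613,221 accesses give about 0.354%.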


Code 1: Matrix product using a scalar temporary variable

    #define N 256

    int main(void) {
        int i, j, k;
        float t, A[N][N], B[N][N], C[N][N];

        /* initialization of A, B and C */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                A[i][j] = 1.2;
                B[i][j] = 3.4;
                C[i][j] = 0.0;
            }

        /* Computation of C, accumulating each element in the scalar t */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                t = C[i][j];
                for (k = 0; k < N; k++)
                    t = t + A[i][k] * B[k][j];
                C[i][j] = t;
            }
        return 0;
    }


Code 1: Matrix product using a scalar temporary variable

                                    jki form        ikj form        ijk form
Cycles                              1,531,322,928   762,446,346     874,348,935
Instructions executed               590,492,384     590,490,215     389,753,758
Floating point operations           33,554,463      33,554,463      33,554,463
Memory accesses (read and write)    320,333,194     320,329,494     186,374,856
Instruction cache misses            2,723           2,290           2,366
Data cache misses                   33,821,126      2,317,923       17,192,127
Time (seconds)                      7.657           3.812           4.372
MIPS                                77.118          154.903         89.148
MFLOPS                              4.382           8.802           7.675
Data cache miss ratio               10.558%         0.724%          9.224%
Instruction cache miss ratio        0.000%          0.000%          0.000%

Note: execution on a Pentium MMX at 200 MHz


Questions:
- Why is ikj the best form?
- In the second version, using a scalar temporary variable:
  - Why does the miss ratio grow if the number of misses is smaller?
  - Why is the execution time better if the miss ratio is higher?
  - Why is the MIPS metric worse if the execution time is better?


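Code 2:

    #include <stdio.h>

    #define ITER 1000000
    #define N (128*1024)
    #define mask (N-1)

    /* misses() is not defined in these slides; it is assumed to be supplied
       by the lab environment and to return the miss count since the last call */
    int misses(void);

    int main(void) {
        int i, j, stride;
        float v[N];   /* float = 4 bytes */

        for (stride = 1; stride < 16; stride++) {
            for (i = 0; i < N; i++)
                v[i] = 1.3;
            misses();
            i = 0;
            for (j = 0; j < ITER; j++) {
                v[i] = v[i] + 11.2;
                i = (i + stride) & mask;
            }
            printf("%d: %d\n", stride, misses());
        }
        return 0;
    }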


Code 2: execution on a Pentium MMX at 200 MHz

[Plot: number of misses (0 to 1,200,000) versus stride (1 to 15).]

What can you guess from this plot?


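Code 3:

    #include <stdio.h>

    #define ITER 1000000
    #define N (16*1024)

    /* misses() is not defined in these slides; it is assumed to be supplied
       by the lab environment and to return the miss count since the last call */
    int misses(void);

    int main(void) {
        int i, j, k, stride;
        float v[N];   /* float = 4 bytes */

        stride = 8;
        for (j = 256; j <= N; j = j + 256) {
            for (i = 0; i < N; i++)
                v[i] = 1.3;
            misses();
            k = 0;
            for (i = 0; i < ITER; i++) {
                v[k] = v[k] + 1.1;
                k = k + stride;
                if (k >= j) k = 0;
            }
            printf("%d: %d\n", j*4, misses());
        }
        return 0;
    }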


Code 3: execution on a Pentium MMX at 200 MHz

[Plot: number of misses (0 to 1,200,000) versus j x 4 (K), from 0 to 32.]

What can you guess from this plot?
