A 256 Kbits L-TAGE branch predictor
André Seznec, Caps Team, IRISA/INRIA/HIPEAC

Page 1

A 256 Kbits L-TAGE branch predictor

André Seznec, IRISA/INRIA/HIPEAC (Caps Team)

Page 2

Directly derived from:

"A case for (partially) tagged branch predictors", A. Seznec and P. Michaud, JILP, Feb. 2006

Plus two tricks:
• Loop predictor
• Kernel/user histories

Page 3

TAGE: TAgged GEometric history length predictors

The genesis

Page 4

Back around 2003

2bcgskew was state of the art, but it was lagging behind neural-inspired predictors on a few benchmarks.

Goal: get the best of both behaviors while maintaining:
• a reasonable implementation cost:
  • use only global history
  • a medium number of tables
• in-time response

Page 5

The basis: a multiple-length global history predictor

[Figure: tables T0 to T4, indexed with global histories of lengths L(0) to L(4); how to combine their predictions is left open ("?")]

Page 6

GEometric History Length predictor

$L(0) = 0$ and $L(i) = \alpha^{i-1} L(1)$ for $i \geq 1$

The set of history lengths forms a geometric series

What matters: L(i) - L(i-1) increases drastically.
• Most of the storage goes to short histories!
• Example: {0, 2, 4, 8, 16, 32, 64, 128}
• Correlation on very long histories is still captured.
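A minimal Python sketch of how such a series can be generated from a minimum and maximum history length. The function name `geometric_lengths` and the round-to-nearest step are assumptions for illustration, not the author's code.

```python
def geometric_lengths(l_min: int, l_max: int, n_tagged: int) -> list[int]:
    """Return [L(0), L(1), ..., L(n_tagged)] with L(0) = 0 and
    L(i) = alpha**(i - 1) * L(1), where L(1) = l_min and L(n_tagged) = l_max."""
    if n_tagged < 2:
        raise ValueError("need at least two history lengths")
    # Common ratio chosen so that the last length lands on l_max.
    alpha = (l_max / l_min) ** (1.0 / (n_tagged - 1))
    lengths = [0]  # L(0) = 0: the first component uses no history
    for i in range(1, n_tagged + 1):
        # Round to the nearest whole number of history bits (assumption).
        lengths.append(int(l_min * alpha ** (i - 1) + 0.5))
    return lengths


print(geometric_lengths(2, 128, 7))  # [0, 2, 4, 8, 16, 32, 64, 128]
```

With L(1) = 2 and a ratio of 2 this reproduces the series on the slide; most of the gaps between consecutive lengths sit at the long end, which is why most of the storage ends up serving short histories.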

Page 7

Combining multiple predictions?

• Classical solution: a meta-predictor
  • “wastes” storage, and how do you choose among 5 or 10 predictions?
• Neural-inspired predictors (Jimenez and Lin, 2001): use an adder tree instead of a meta-predictor
• Partial matching (Chen et al., 1996; Michaud, 2005): use tagged tables and the longest matching history

Page 8

CBP-1 (2004): OGEHL

[Figure: tables T0 to T4, indexed with global histories of lengths L(0) to L(4), feeding an adder (∑)]

• Final computation through a sum
• Prediction = sign of the sum
• 12 components: 3.670 misp/KI
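The slide gives only the principle (sum the counters, predict with the sign). A toy Python sketch of that principle follows; the hash, counter width, table size and threshold-based training rule are all hypothetical, not the CBP-1 OGEHL parameters.

```python
class OgehlSketch:
    """Toy OGEHL-style predictor: one table of signed counters per history
    length; the prediction is the sign of the sum of the selected counters."""

    def __init__(self, lengths, log_entries=10, ctr_bits=4, threshold=8):
        self.lengths = lengths
        self.log_entries = log_entries
        self.mask = (1 << log_entries) - 1
        self.ctr_max = (1 << (ctr_bits - 1)) - 1   # e.g. +7 for 4-bit counters
        self.ctr_min = -(1 << (ctr_bits - 1))      # e.g. -8
        self.threshold = threshold                 # hypothetical training threshold
        self.tables = [[0] * (1 << log_entries) for _ in lengths]

    def _index(self, pc, history, length):
        # Hypothetical hash: XOR the PC with the `length` most recent history
        # bits, folded down to the index width.
        h = history & ((1 << length) - 1)
        idx = pc
        while h:
            idx ^= h & self.mask
            h >>= self.log_entries
        return idx & self.mask

    def predict(self, pc, history):
        idxs = [self._index(pc, history, n) for n in self.lengths]
        total = sum(t[i] for t, i in zip(self.tables, idxs))
        return total >= 0, total, idxs  # taken when the sum is non-negative

    def update(self, pc, history, taken):
        pred, total, idxs = self.predict(pc, history)
        # Perceptron-style rule (assumption): train on a misprediction or
        # when the sum is small in magnitude.
        if pred != taken or abs(total) <= self.threshold:
            for t, i in zip(self.tables, idxs):
                t[i] = min(t[i] + 1, self.ctr_max) if taken else max(t[i] - 1, self.ctr_min)
```

Every table contributes to every prediction, which is what makes the adder tree a replacement for a meta-predictor.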

Page 9

TAGE: geometric history lengths + PPM-like matching + an optimized update policy

[Figure: a tagless base predictor indexed with pc, plus tagged components indexed and tagged with hashes of pc and the global histories h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds ctr, u and tag, and tag comparisons (=?) select the prediction]
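A Python sketch of the tagged lookup shown in the figure. The XOR-folding of the history and both hash functions are assumptions for illustration; the slide only says that index and tag are hashes of pc and h[0:Li].

```python
from dataclasses import dataclass, field
from typing import List, Optional


def fold(history: int, length: int, width: int) -> int:
    """XOR-fold the `length` most recent history bits (bit 0 = newest) into `width` bits."""
    h = history & ((1 << length) - 1)
    folded = 0
    while h:
        folded ^= h & ((1 << width) - 1)
        h >>= width
    return folded


def tage_index(pc: int, history: int, length: int, log_entries: int) -> int:
    # Hypothetical index hash: PC bits XOR the folded history.
    mask = (1 << log_entries) - 1
    return (pc ^ (pc >> log_entries) ^ fold(history, length, log_entries)) & mask


def tage_tag(pc: int, history: int, length: int, tag_bits: int) -> int:
    # Hypothetical tag hash: fold at two widths so the tag is not a
    # re-reading of the index bits.
    mask = (1 << tag_bits) - 1
    return (pc ^ fold(history, length, tag_bits) ^ (fold(history, length, tag_bits - 1) << 1)) & mask


@dataclass
class TaggedTable:
    length: int        # history length L(i) used by this component
    log_entries: int   # log2 of the number of entries
    tag_bits: int      # partial tag width
    tags: List[Optional[int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.tags:
            self.tags = [None] * (1 << self.log_entries)


def matching_components(tables: List[TaggedTable], pc: int, history: int) -> List[int]:
    """Indices of the tagged tables whose tag matches, longest history first
    (tables are assumed to be ordered by increasing history length)."""
    hits = []
    for i, t in enumerate(tables):
        idx = tage_index(pc, history, t.length, t.log_entries)
        if t.tags[idx] == tage_tag(pc, history, t.length, t.tag_bits):
            hits.append(i)
    return hits[::-1]
```

The first element of the returned list, if any, is the provider; when the list is empty the tagless base predictor provides the prediction.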

Page 10

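[Figure: tag comparisons (=?) on the tagged components, each a hit or a miss; the longest hitting component provides Pred and the next hitting one provides Altpred]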

Page 11

Prediction computation

General case: the longest matching component provides the prediction.

Special case: newly allocated entries (weak Ctr) cause many mispredictions.
• On many applications, Altpred is more accurate than Pred on such entries.
• This property is monitored dynamically through a single 4-bit counter.
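A minimal Python sketch of this selection, assuming a 3-bit counter stored as 0..7 (taken when ≥ 4), "weak" meaning one of the two middle states, and the 4-bit monitor (here called `use_alt_on_na`, a hypothetical name) favoring Altpred when it is at least 8. These encodings are assumptions; the sketch detects new entries by the weak counter only, as the slide does.

```python
from typing import Optional

USE_ALT_MAX = 15  # the single 4-bit monitoring counter saturates at 15


def ctr_taken(ctr: int) -> bool:
    return ctr >= 4


def is_weak(ctr: int) -> bool:
    # The two middle states of a 3-bit counter count as "weak" (assumption).
    return ctr in (3, 4)


def final_prediction(provider_ctr: Optional[int], altpred: bool, use_alt_on_na: int) -> bool:
    if provider_ctr is None:          # no tagged component matched:
        return altpred                # Altpred then comes from the base predictor
    if is_weak(provider_ctr) and use_alt_on_na >= 8:
        return altpred                # likely a new entry, and Altpred has been better
    return ctr_taken(provider_ctr)    # general case: longest match wins


def update_use_alt_on_na(provider_ctr: Optional[int], altpred: bool,
                         taken: bool, use_alt_on_na: int) -> int:
    """Track whether Altpred beats Pred when the provider entry is weak."""
    if provider_ctr is None or not is_weak(provider_ctr):
        return use_alt_on_na
    if ctr_taken(provider_ctr) == altpred:
        return use_alt_on_na          # both agree: nothing to learn
    if altpred == taken:
        return min(use_alt_on_na + 1, USE_ALT_MAX)
    return max(use_alt_on_na - 1, 0)
```

A single shared counter is enough because the property ("new entries are unreliable on this application") is global rather than per branch.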

Page 12

TAGE update policy

General principle:

Minimize the footprint of the prediction.

Update only the longest-history matching component, and allocate at most one entry on a misprediction.
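A Python sketch of this footprint-minimizing rule: only the provider's counter changes, and a misprediction merely yields the set of longer-history components where at most one entry may be allocated. The function names and the 2-bit width of the base counter are assumptions.

```python
from typing import List, Optional, Tuple


def saturate(ctr: int, taken: bool, bits: int) -> int:
    """Saturating up/down counter on `bits` bits (values 0 .. 2**bits - 1)."""
    top = (1 << bits) - 1
    return min(ctr + 1, top) if taken else max(ctr - 1, 0)


def tage_update(base_ctr: int, ctrs: List[Optional[int]], provider: Optional[int],
                taken: bool, mispredicted: bool) -> Tuple[int, List[Optional[int]], range]:
    """ctrs[i]: counter read from tagged component i (None = tag miss), components
    ordered by increasing history length; provider: longest matching component.
    Returns (new base counter, new tagged counters, allocation candidates)."""
    new_ctrs = list(ctrs)
    if provider is None:
        base_ctr = saturate(base_ctr, taken, bits=2)  # base counter width: assumption
    else:
        new_ctrs[provider] = saturate(new_ctrs[provider], taken, bits=3)
    # Other matching components are left untouched. Only a misprediction
    # opens a (single) allocation, on components longer than the provider.
    start = 0 if provider is None else provider + 1
    candidates = range(start, len(ctrs)) if mispredicted else range(0)
    return base_ctr, new_ctrs, candidates
```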

Page 13

A tagged table entry: Tag | Ctr | U

• Ctr: 3-bit prediction counter
• U: 2-bit useful counter (was the entry recently useful?)
• Tag: partial tag
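The entry layout can be sketched in Python as a packed word. The field order (tag, then Ctr, then U), the 0..7 counter encoding and the helper names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaggedEntry:
    tag: int  # partial tag; its width depends on the table
    ctr: int  # 3-bit prediction counter, 0..7
    u: int    # 2-bit useful counter, 0..3

    def pack(self, tag_bits: int) -> int:
        if not (0 <= self.ctr < 8 and 0 <= self.u < 4 and 0 <= self.tag < (1 << tag_bits)):
            raise ValueError("field out of range")
        return (self.tag << 5) | (self.ctr << 2) | self.u

    @staticmethod
    def unpack(word: int, tag_bits: int) -> "TaggedEntry":
        return TaggedEntry(tag=(word >> 5) & ((1 << tag_bits) - 1),
                           ctr=(word >> 2) & 0b111,
                           u=word & 0b11)


def entry_bits(tag_bits: int) -> int:
    # Tag + 3-bit Ctr + 2-bit U.
    return tag_bits + 3 + 2
```

With the 7-bit tags on T1 and 15-bit tags on T12 quoted for the CBP-2 design, an entry takes 12 to 20 bits.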

Page 14

Updating the U counter

If Altpred ≠ Pred:
• Pred = actual outcome: U = U + 1
• Pred ≠ actual outcome: U = U - 1

Graceful aging: periodic shift of all U counters
• implemented through the reset of a single bit
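A Python sketch of both rules. The aging period is not given on the slide, so `age_all` is left for the caller to invoke; it models the shift directly rather than the single-bit-reset hardware trick.

```python
U_MAX = 3  # 2-bit useful counter


def update_u(u: int, pred: bool, altpred: bool, outcome: bool) -> int:
    if pred == altpred:
        return u                 # the entry made no difference this time
    if pred == outcome:
        return min(u + 1, U_MAX) # the entry was useful
    return max(u - 1, 0)         # the entry was harmful


def age_all(us: list[int]) -> list[int]:
    # Graceful aging: shift every U counter right by one position.
    return [u >> 1 for u in us]
```

Aging lets entries that stopped being useful drift back to U = 0 so that they become candidates for replacement.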

Page 15

Allocating a new entry on a misprediction

Find a single “useless” entry (U = 0) in a component with a longer history:
• favor the shortest possible history, to minimize footprint
• but not too strongly, to avoid ping-pong phenomena

Initialize Ctr as weak and U as zero.
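A Python sketch of the allocation choice. The 1/2 probability of skipping the shortest candidate, the "weak in the outcome direction" initial counter and the function names are all assumptions; the slide only says to favor short histories "but not too much".

```python
import random
from typing import List, Optional


def choose_allocation(us: List[int], provider: Optional[int],
                      rng: random.Random) -> Optional[int]:
    """us[i]: U of the entry component i would use (components ordered by
    increasing history length). Returns the chosen component, or None."""
    start = 0 if provider is None else provider + 1
    useless = [i for i in range(start, len(us)) if us[i] == 0]
    if not useless:
        return None
    # Usually take the shortest useless candidate, but not always
    # (hypothetical 1/2 skip probability), per the ping-pong remark.
    if len(useless) > 1 and rng.random() < 0.5:
        return useless[1]
    return useless[0]


def new_entry(outcome: bool) -> dict:
    # Weak counter leaning toward the actual outcome (assumption), U = 0.
    return {"ctr": 4 if outcome else 3, "u": 0}
```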

Page 16

Improve the global history

• Address + conditional branch history: path confusion on short histories
• Address + path: direct hashing leads to path confusion

1. Represent all branches in the branch history.
2. Also use a path history (1 bit per branch, limited to 16 bits).
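A Python sketch of the two histories. Recording unconditional branches as taken, using bit 0 of the PC as the path bit, and the 640-bit cap on the global history are assumptions.

```python
class Histories:
    """Global history over all branches plus a 16-bit path history."""

    PATH_BITS = 16

    def __init__(self, max_len: int = 640):
        self.ghist_mask = (1 << max_len) - 1  # keep only the longest history used
        self.ghist = 0                        # bit 0 = most recent branch
        self.phist = 0

    def record(self, pc: int, taken: bool, conditional: bool = True) -> None:
        # Every branch enters the history; unconditional ones count as taken.
        bit = 1 if (taken or not conditional) else 0
        self.ghist = ((self.ghist << 1) | bit) & self.ghist_mask
        # One address bit per branch, truncated to 16 bits.
        self.phist = ((self.phist << 1) | (pc & 1)) & ((1 << self.PATH_BITS) - 1)
```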

Page 17

Design tradeoff for CBP-2 (1)

13 components: best accuracy on the distributed traces
• 8 components are not far behind!

History lengths: Min = 4, Max = 640
• any Min in [2, 6] and any Max in [300, 2000] would also work
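As a worked check, assume the 13 components are the tagless base predictor plus tagged tables T1 to T12 (matching the T1 to T12 tag widths on the next slide), with L(1) = Min and L(12) = Max in the geometric formula $L(i) = \alpha^{i-1} L(1)$:

$$\alpha^{11} = \frac{L(12)}{L(1)} = \frac{640}{4} = 160 \quad\Longrightarrow\quad \alpha = 160^{1/11} \approx 1.586$$

Rounding $4\alpha^{i-1}$ to the nearest integer gives roughly 4, 6, 10, 16, 25, 40, 64, 101, 160, 254, 403, 640; the exact lengths used in the submission may differ.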

Page 18

Design tradeoff for CBP-2 (2)

Tag width tradeoff: (destructive) false matches are better tolerated on shorter histories
• from 7 bits on T1 to 15 bits on T12

Tuning the number of table entries:
• fewer entries for very long histories
• fewer entries for very short histories

Page 19

Adding a loop predictor

The loop predictor captures the number of iterations of a loop.
Once it has seen the same number of iterations 4 times in a row, the loop predictor provides the prediction.

Advantages:
• very reliable
• small storage budget: 256 entries of 52 bits (13 Kbits)

Complexity? Managing speculative iteration counts might be difficult on deep pipelines.
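A Python sketch of a single loop-predictor entry. The tag, the 52-bit field layout and speculative state are omitted; the field names, the 3-bit confidence cap and the "taken while iterating, not taken on exit" convention are assumptions.

```python
class LoopEntry:
    """One loop-predictor entry (tag and speculative state omitted)."""

    CONF_THRESHOLD = 4  # same trip count seen 4 times in a row
    CONF_MAX = 7        # hypothetical 3-bit confidence counter

    def __init__(self):
        self.past_iter = -1  # trip count of the previous loop execution
        self.cur_iter = 0    # taken outcomes seen in the current execution
        self.confidence = 0  # consecutive executions with the same trip count

    def is_confident(self) -> bool:
        return self.confidence >= self.CONF_THRESHOLD

    def predict(self) -> bool:
        # Loop-closing branch: taken while iterating, not taken on exit.
        return self.cur_iter < self.past_iter

    def update(self, taken: bool) -> None:
        if taken:
            self.cur_iter += 1
            return
        # Loop exit: compare this trip count with the previous one.
        if self.cur_iter == self.past_iter:
            self.confidence = min(self.confidence + 1, self.CONF_MAX)
        else:
            self.past_iter = self.cur_iter
            self.confidence = 1
        self.cur_iter = 0
```

Only when `is_confident()` holds would the loop prediction override TAGE; the difficulty noted on the slide is that `cur_iter` must be repaired after mispredictions on a deep, speculative pipeline.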

Page 20

Using a kernel history and a user history

Traces mix user and kernel activity: kernel activity after an exception
• pollutes the global history

Solution: use two separate global histories
• the user history is updated only in user mode
• the kernel history is updated in both modes
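A Python sketch of the two histories. Which history feeds the predictor in each mode is not stated on the slide; using the history that matches the current mode is an assumption, as is the 640-bit cap.

```python
class DualHistory:
    def __init__(self, max_len: int = 640):
        self.mask = (1 << max_len) - 1
        self.user = 0
        self.kernel = 0

    def record(self, taken: bool, kernel_mode: bool) -> None:
        bit = int(taken)
        # The kernel history is updated in both modes...
        self.kernel = ((self.kernel << 1) | bit) & self.mask
        # ...while the user history ignores kernel-mode branches.
        if not kernel_mode:
            self.user = ((self.user << 1) | bit) & self.mask

    def history_for(self, kernel_mode: bool) -> int:
        # Assumption: predict with the history matching the current mode.
        return self.kernel if kernel_mode else self.user
```

After an exception, the user history resumes exactly where it stopped, so kernel branches no longer pollute it.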

Page 21

L-TAGE submission accuracy (distributed traces)

3.314 misp/KI

Page 22

Reducing L-TAGE complexity

TAGE predictor included in L-TAGE (241.5 Kbits), on its own: 3.368 misp/KI

The loop predictor is beneficial only on gzip: it might not be worth the extra complexity.

Page 23

Using fewer tables

8-component 256 Kbits TAGE predictor: 3.446 misp/KI

Page 24

TAGE prediction computation time?

3 successive steps:
• index computation
• table read
• partial match + multiplexor

This does not fit in a single cycle, but it can be ahead pipelined!

Page 25

Ahead pipelining a global history branch predictor (principle)

Initiate the branch prediction X+1 cycles in advance so that it is available in time, using the information available at that point:
• the X-block-ahead instruction address
• the X-block-ahead history

To ensure accuracy: use intermediate path information.

Page 26

Practice

Ahead pipelined TAGE: 4 parallel prediction computations

[Figure: successive blocks A, B and C, with labels Ha, b and c]
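One reading of "4 parallel prediction computations" is one candidate per outcome pattern of two not-yet-resolved intermediate branches (2^2 = 4), with the intermediate path information selecting among them late. The Python sketch below follows that reading; the reading itself, the function names and the `predict` callback are assumptions.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Path = Tuple[bool, ...]


def ahead_predictions(predict: Callable[[int, int, Path], bool],
                      ahead_pc: int, ahead_hist: int,
                      n_intermediate: int) -> Dict[Path, bool]:
    """Early, slow step: from the ahead address and history, compute one
    prediction per possible outcome pattern of the intermediate branches."""
    return {path: predict(ahead_pc, ahead_hist, path)
            for path in product((False, True), repeat=n_intermediate)}


def select(candidates: Dict[Path, bool], actual_path: Sequence[bool]) -> bool:
    # Late, fast step: the resolved intermediate path picks one candidate.
    return candidates[tuple(actual_path)]
```

The slow index computation and table reads overlap with the intermediate blocks; only the final selection remains on the critical path.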

Page 27

3-branch-ahead pipelined, 8-component 256 Kbits TAGE: 3.552 misp/KI

Page 28

A final case for the Geometric History Length predictors:
• deliver state-of-the-art accuracy
• use only global information, with very long histories: 300+ bits!
• can be ahead pipelined
• many effective design points: OGEHL or TAGE, number of tables, history lengths

Page 29

The End