Jason Law Byeong Kil Lee TM5400/5600 TM5500/5800 TM6000


Page 1: Jason Law Byeong Kil Lee

Jason Law

Byeong Kil Lee

TM5400/5600  TM5500/5800

TM6000

Page 2: Jason Law Byeong Kil Lee

Outline

• Crusoe technology
• Crusoe processors / architecture
• Code morphing software
• Crusoe hardware support for code morphing
• LongRun power management
• Performance comparison
• Conclusion

Page 3: Jason Law Byeong Kil Lee

Crusoe Technology

• Crusoe processor = software + hardware
• Low power, x86 compatibility, PC performance

VLIW hardware
• 128-bit Very Long Instruction Word processor
• Simple and fast
• Fewer transistors

Code Morphing software
• Dynamically translates x86 instructions into VLIW instructions
• Provides x86 compatibility
• Optimization and scheduling by software


Page 4: Jason Law Byeong Kil Lee

Crusoe VLIW

Page 5: Jason Law Byeong Kil Lee

Crusoe Processors

• L1 cache: 128 KB
• Memory: DDR-SDRAM (100 to 133 MHz), SDRAM (66 to 133 MHz)

Page 6: Jason Law Byeong Kil Lee

Features

• Lighter
• Longer battery life
• Cooler
• x86 compatibility (Windows / Linux)
• Upgradeable (by software)
• Lower cost
• MMX support (no support for SSE / 3DNow!)
• Target: ultra-light mobile notebooks, internet appliances, high-density servers, embedded devices
• Products: Sony, Fujitsu, NEC, RLX Technologies, ...

Page 7: Jason Law Byeong Kil Lee

Crusoe Architecture: TM5800

Page 8: Jason Law Byeong Kil Lee

Cont.

• VLIW CPU: executes up to 4 operations in each cycle
– Molecule: one long instruction word (128 bits)
– Atom: one operation within a molecule
– All atoms within a molecule are executed in parallel, in order

• Functional units: 2 ALUs, 1 FP unit, 1 load/store unit, 1 branch unit

• In-order pipeline: 7 stages for integer, 10 stages for FP

• 64 integer registers, 32 FP registers

Page 9: Jason Law Byeong Kil Lee

Crusoe vs. x86

• Blue is silicon (hardware); yellow is software
• Crusoe's blue part is smaller
• All of that hardware was moved off the die and into software

Page 10: Jason Law Byeong Kil Lee

Code Morphing Software: a dynamic translation system that resides in ROM and is the first program to start executing at boot

• Drawing the H/W and S/W line
– Software: decodes x86 instructions and generates parallel molecules
– Hardware: executes them on a simple, high-speed VLIW engine

• Decoding and scheduling
– Translation cache: CMS translates instructions once and saves the resulting translation for re-use
– The next time the same code runs, translation is skipped
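The translate-once, reuse-afterwards idea can be sketched as a simple lookup table. This is only a toy model: the class name `TranslationCache`, the function `fake_translate`, and the use of an x86 address as the cache key are assumptions made for illustration, not details of the real CMS.

```python
from typing import Callable, Dict, List


class TranslationCache:
    """Toy model: translate a block of x86 code once, then reuse the result."""

    def __init__(self, translate: Callable[[int], List[str]]) -> None:
        self._translate = translate            # hypothetical translator: address -> molecules
        self._cache: Dict[int, List[str]] = {}
        self.translations = 0                  # how many times the translator actually ran

    def lookup(self, x86_address: int) -> List[str]:
        # Translate only on the first encounter; later calls hit the cache.
        if x86_address not in self._cache:
            self._cache[x86_address] = self._translate(x86_address)
            self.translations += 1
        return self._cache[x86_address]


def fake_translate(x86_address: int) -> List[str]:
    # Placeholder standing in for the real decoder, optimizer, and scheduler.
    return [f"molecule@{x86_address:#x}"]


cache = TranslationCache(fake_translate)
for _ in range(3):
    cache.lookup(0x1000)
print(cache.translations)  # 1
```

The block is executed three times but translated only once, which is the saving the slide describes.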

Page 11: Jason Law Byeong Kil Lee

Caching

• Translation cache
– Resides in a separate memory space
– The size can be set at boot time, or the OS can make the size adjustable

• Crusoe's CMS monitors actual execution
– Keeps track of which blocks of code execute most often and optimizes them accordingly
– Keeps track of which branches are most often taken and annotates the code accordingly
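Branch monitoring of this kind can be pictured as a pair of counters per branch. The sketch below is a minimal illustration; the names `record_branch` and `taken_ratio` and the dictionary layout are hypothetical and do not describe how CMS stores its profile data.

```python
from collections import defaultdict

# branch address -> [taken count, not-taken count]
branch_counts = defaultdict(lambda: [0, 0])


def record_branch(address: int, taken: bool) -> None:
    """Update the profile each time the branch at `address` executes."""
    branch_counts[address][0 if taken else 1] += 1


def taken_ratio(address: int) -> float:
    """Fraction of executions in which the branch was taken."""
    taken, not_taken = branch_counts[address]
    total = taken + not_taken
    return taken / total if total else 0.0


for i in range(10):
    record_branch(0x3000, taken=(i != 0))   # taken 9 times out of 10
print(taken_ratio(0x3000))  # 0.9
```

A ratio close to 0 or 1 marks a highly biased branch, which is the kind of annotation a translator could use when laying out code.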


Page 12: Jason Law Byeong Kil Lee

Filtering & Prediction

• Filtering: a wide choice of execution modes for x86 code
– Interpretation (no translation overhead)
– Translation
– Highly optimized translation (takes longest to generate, but runs fastest once translated)

• Prediction
– Highly biased branch: translate for the frequently taken path
– Otherwise: execute both paths and select the result later
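One way to picture filtering is as a choice of execution mode driven by how often a block has run. The thresholds `TRANSLATE_AFTER` and `OPTIMIZE_AFTER` below are invented for illustration; the slides do not give the real values or the real decision rule.

```python
from collections import Counter

# Hypothetical thresholds; the slides do not state the real ones.
TRANSLATE_AFTER = 10
OPTIMIZE_AFTER = 100

execution_counts = Counter()


def choose_mode(block_address: int) -> str:
    """Record one execution of a block and pick an execution mode for it."""
    execution_counts[block_address] += 1
    count = execution_counts[block_address]
    if count >= OPTIMIZE_AFTER:
        return "optimized translation"   # most expensive to generate
    if count >= TRANSLATE_AFTER:
        return "translation"
    return "interpretation"              # no translation overhead


modes = [choose_mode(0x2000) for _ in range(100)]
print(modes[0], "|", modes[9], "|", modes[99])
# interpretation | translation | optimized translation
```

Rarely executed code never pays the translation cost, while hot code earns progressively more optimization effort.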


Page 13: Jason Law Byeong Kil Lee

Translation Process

• 1st pass (frontend)
– Translates the x86 instructions into a simple sequence of atoms (temporary registers are used)

• 2nd pass (optimizer)
– Applies well-known compiler optimizations: common subexpression elimination, loop-invariant removal, dead code elimination

• 3rd pass (scheduler)
– Reorders the optimized atoms and groups them into individual molecules
– Scheduling in software allows more effective scheduling algorithms and a larger window of instructions
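The third pass can be illustrated with a small greedy scheduler. This is a sketch under strong assumptions, not Transmeta's algorithm: every atom is assumed to write a fresh temporary register (so only read-after-write dependences matter), functional-unit limits are ignored, and the `Atom` type and `schedule` function are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class Atom(NamedTuple):
    dest: str                  # register written (assumed to be a fresh temporary)
    sources: Tuple[str, ...]   # registers read


MAX_ATOMS_PER_MOLECULE = 4     # a molecule holds up to 4 atoms


def schedule(atoms: List[Atom]) -> List[List[Atom]]:
    """Greedily place each atom in the earliest molecule that respects its inputs."""
    molecules: List[List[Atom]] = []
    produced_in: Dict[str, int] = {}   # register -> index of the molecule that writes it
    for atom in atoms:
        # An atom must come at least one molecule after every producer of its sources.
        earliest = 0
        for src in atom.sources:
            if src in produced_in:
                earliest = max(earliest, produced_in[src] + 1)
        # Skip molecules that are already full.
        index = earliest
        while index < len(molecules) and len(molecules[index]) >= MAX_ATOMS_PER_MOLECULE:
            index += 1
        while len(molecules) <= index:
            molecules.append([])
        molecules[index].append(atom)
        produced_in[atom.dest] = index
    return molecules


atoms = [
    Atom("t1", ("eax",)),
    Atom("t2", ("t1",)),
    Atom("t3", ("ebx",)),
    Atom("t4", ("t2", "t3")),
]
for i, molecule in enumerate(schedule(atoms)):
    print(i, [a.dest for a in molecule])
# 0 ['t1', 't3']
# 1 ['t2']
# 2 ['t4']
```

Note that `t3` is hoisted ahead of `t2` into the first molecule: the reordering happens in software, with the whole atom list visible at once.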


Page 14: Jason Law Byeong Kil Lee

Advantages of the Code Morphing Software

Traditional x86 processors
• Translate each x86 instruction every time it is encountered
• Full of complex, power-hungry transistors

Crusoe processor with Code Morphing software
• Translates instructions once, saving the resultant translation in a cache for re-use
• Much of the processor functionality is implemented in software
– Fewer logic transistors, less power
– Uses effective optimization and scheduling algorithms
– Uses a larger window of instructions
– …

Page 15: Jason Law Byeong Kil Lee

Crusoe Hardware Support for Code Morphing

Crusoe hardware has been designed specifically with dynamic translation in mind.

• Crusoe's solution for exceptions
– All registers holding x86 state are shadowed (two copies of each register: a working copy and a shadow copy)
– Normal atoms update only the working copy of a register
– i) No exception encountered: a "commit" operation copies all working registers into the shadow registers
– ii) Exception occurs: a "rollback" operation copies the shadow register values back into the working registers
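The working/shadow register scheme can be modeled in a few lines. The class `ShadowedRegisters` and its method names are illustrative only; the real mechanism is implemented in hardware.

```python
class ShadowedRegisters:
    """Toy model of registers with a working copy and a shadow copy."""

    def __init__(self, names):
        self.working = {name: 0 for name in names}
        self.shadow = dict(self.working)

    def write(self, name, value):
        # Normal atoms update only the working copy.
        self.working[name] = value

    def commit(self):
        # No exception: working values become the new safe state.
        self.shadow = dict(self.working)

    def rollback(self):
        # Exception: discard speculative work and restore the safe state.
        self.working = dict(self.shadow)


regs = ShadowedRegisters(["eax", "ebx"])
regs.write("eax", 1)
regs.commit()
regs.write("eax", 99)   # speculative update
regs.rollback()         # pretend an exception occurred
print(regs.working["eax"])  # 1
```

After the rollback the working copy holds the last committed value, so execution can restart from a consistent x86 state.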

Page 16: Jason Law Byeong Kil Lee

Cont.

• Store operations are held in a "gated store buffer"
– Store data is released to the memory system only at the time of a commit
– On a rollback, stores not yet committed are dropped from the store buffer

• Safe reordering of loads ahead of stores (alias hardware)
– The load becomes a "load-and-protect" (records the address and size of the loaded data)
– The store becomes a "store-under-alias-mask" (checks for protected regions)
– If the store overwrites previously loaded data, the hardware raises an exception and the runtime system can take corrective action
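The gated store buffer and the alias check can be sketched together. This is a simplified model with several assumptions: memory is a dictionary of addresses, loads do not see buffered stores, protected regions are cleared on commit and rollback, and the names `SpeculativeMemory` and `AliasException` are hypothetical.

```python
class AliasException(Exception):
    """Raised when a store hits a region protected by an earlier load."""


class SpeculativeMemory:
    def __init__(self):
        self.memory = {}         # committed state: address -> value
        self.store_buffer = []   # pending (address, value) pairs, not yet visible
        self.protected = []      # (start, size) regions recorded by load-and-protect

    def load_and_protect(self, address, size=1):
        # Remember the loaded region so later stores can be checked against it.
        self.protected.append((address, size))
        return [self.memory.get(address + i, 0) for i in range(size)]

    def store_under_alias_mask(self, address, value):
        for start, size in self.protected:
            if start <= address < start + size:
                raise AliasException(f"store to protected address {address:#x}")
        self.store_buffer.append((address, value))

    def commit(self):
        # Release buffered stores to memory only at commit time.
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()
        self.protected.clear()   # assumption: protection ends with the commit

    def rollback(self):
        # Uncommitted stores are simply dropped.
        self.store_buffer.clear()
        self.protected.clear()


mem = SpeculativeMemory()
mem.store_under_alias_mask(0x10, 7)
mem.rollback()
after_rollback = mem.memory.get(0x10)    # the store never reached memory

mem.store_under_alias_mask(0x10, 7)
mem.commit()
after_commit = mem.memory[0x10]

mem.load_and_protect(0x20, size=4)
try:
    mem.store_under_alias_mask(0x22, 1)  # overlaps the protected region
    alias_detected = False
except AliasException:
    alias_detected = True

print(after_rollback, after_commit, alias_detected)  # None 7 True
```

In the real system the exception would hand control to the runtime, which could then re-execute the code without the unsafe reordering.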

Page 17: Jason Law Byeong Kil Lee

Sample Translation Code

[Figure: x86 instructions and the translated VLIW molecules]

• The translated code uses 2 integer ALU atoms in a molecule

Page 18: Jason Law Byeong Kil Lee

LongRun Power Management

• Crusoe was designed for good performance at very low power

• Power: $P = \frac{1}{2} C V^2 F$ (C = capacitance, V = voltage, F = frequency)

• Reduce transistor count to decrease capacitance

• Scale voltage and frequency dynamically to give just enough performance for the current workload

Page 19: Jason Law Byeong Kil Lee

Dynamic Power Management

• Frequency changes in steps of 33 MHz

• Voltage changes in steps of 25 mV

• Supports up to 200 frequency/voltage changes per second

• Can give cubic reductions in power consumption
– Reduce $V^2$ and $F$ together
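The cubic claim follows from the dynamic power formula when voltage is lowered in proportion to frequency. The short calculation below assumes that idealized proportional scaling (real parts have a minimum operating voltage), and the function names are chosen for illustration.

```python
def dynamic_power(capacitance: float, voltage: float, frequency: float) -> float:
    """Dynamic power P = 1/2 * C * V^2 * F."""
    return 0.5 * capacitance * voltage ** 2 * frequency


def relative_power(scale: float) -> float:
    """Power relative to full speed when V and F are both multiplied by `scale`."""
    full = dynamic_power(1.0, 1.0, 1.0)
    scaled = dynamic_power(1.0, scale, scale)   # assumption: V scales with F
    return scaled / full


print(relative_power(0.5))  # 0.125
```

Halving both voltage and frequency cuts power to one eighth (0.5 cubed), whereas halving frequency alone would only halve it.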


Page 20: Jason Law Byeong Kil Lee

Conventional Power Profile


Page 21: Jason Law Byeong Kil Lee

LongRun Power Profile


Page 22: Jason Law Byeong Kil Lee

ACPI Standard

• ACPI: Advanced Configuration and Power Interface
– Joint standard of Microsoft, Intel, and Toshiba

• System-level technique to reduce power

• Allows three low-power states that can be alternated
– AutoHALT: processor executes the HLT instruction; processor stops its internal clock
– QuickStart: Southbridge gives the processor the STPCLK signal; processor maintains cache coherency
– Deep Sleep: Southbridge disables the processor CLK input; Southbridge maintains cache coherency


Page 23: Jason Law Byeong Kil Lee

ACPI vs. LongRun


Page 24: Jason Law Byeong Kil Lee

Intel SpeedStep

• Statically lowers voltage/frequency settings at startup

• Two operating points:
– AC power: full performance
– Battery (DC) power: slightly lower performance

• Coarse granularity misses opportunities for power savings


Page 25: Jason Law Byeong Kil Lee

How LongRun Compares


Page 26: Jason Law Byeong Kil Lee

Performance

The 700 MHz TM5400 was quoted as having comparable performance to a 500-550 MHz Pentium III. Transmeta didn't offer any conventional benchmarks. Rather, it compared the power utilized on a mobile Pentium III to the power utilized on a Crusoe when completing various tasks.

It appears that Transmeta would like to dictate to the mobile industry that power is what it's all about, not speed. That is Transmeta's strong suit, but some normal benchmarks would have been nice. Why not show them? If Crusoe did well in those benchmarks, do you think Transmeta wouldn't show them? I'm convinced that the Crusoe is not performing as well as mobile AMD or Intel chips. For the markets it's aimed at, that's not too big a deal, but I'd like to know.

- From an article by Rob Hughes, Jan 20, 2000

Page 27: Jason Law Byeong Kil Lee

[Figure: Relative Performance While Mobile (on Batteries), TM5800 vs. Pentium III ULV, 2001; vertical axis from 0 to 1.0]

Page 28: Jason Law Byeong Kil Lee

[Figure: CPUmark99 v1.1 Comparison, CPU + Core Logic power; vertical axis in watts, from 0 to 8.0]

Page 29: Jason Law Byeong Kil Lee

[Figure: Business Graphics Winmark v1.1 Comparison, CPU + Core Logic power; vertical axis in watts, from 0 to 8.0]

Page 30: Jason Law Byeong Kil Lee

Conclusion

• Combination of hardware and software

• Using software
– To decompose complex instructions into simple atoms
– To schedule and optimize the atoms for parallel execution
– Saves millions of logic transistors
– Cuts power consumption (60~70%)
– Enables aggressive code optimization techniques

• LongRun power management
– Cuts power consumption by a factor of 2 to 10