Daniele Di Donato, Product Manager, October 2019
Tech Symposia 2019
Mali-G57: Premium GPU Performance for Mainstream Devices
© 2019 Arm Limited
Arm Mali GPUs: The World’s #1 Shipping Graphics Processor
• Over 1 Bn Mali GPUs shipped in 2018
• Total Mali licenses: 183
Mali GPUs are in:
• Smartphones: ~50%
• Smart TVs: ~80%
• Mobile VR: ~50%
Use cases: AI, VR, AR, Gaming
Arm Mali Graphics Processor Roadmap
• High Performance: Mali-G71, Mali-G72, Mali-G76, Mali-G77
• Mainstream: Mali-T830, Mali-G51, Mali-G52, Mali-G57
• Ultra-Efficient: Mali-450, Mali-470, Mali-G31
Complex and Challenging GPU-Powered Use Cases
• On-device machine learning
• Augmented reality and virtual reality
• High-fidelity mobile gaming
Arm Mali Graphics Processor Generations
• Midgard (Mali-T600, Mali-T700 and Mali-T800 GPU series): unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan
• Bifrost (Mali-G71, Mali-G51, Mali-G72, Mali-G52, Mali-G31, Mali-G76): unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL
• Valhall (Mali-G77, Mali-G57): superscalar engine, simplified scalar ISA, dynamic instruction scheduling
First Valhall GPU for Mainstream Market Delivers Outstanding Device Performance
• 1.3x better performance*
*Compared with Mali-G52 3EE running complex content on the same process node under similar conditions
Leap in Gaming Performance and Efficiency
Efficiently supporting growing graphics and ML complexity
• 30% more performance density*
• 30% better energy efficiency*
• 60% improvement for machine learning*
*Compared to Mali-G52 3EE on the same process node under similar conditions
Improved High-Fidelity and Casual Gaming Performance
• Mali-G57 delivers more performance per square millimetre
• Up to 2x more compute capability compared with Mali-G52 2EE
• The quad texture mapper has a large impact on some texture-heavy games
Relative game performance, Mali-G57 fps/mm² vs Mali-G52 3EE fps/mm² (ISO process and frequency):
Complex Game 1: 1.4x
Complex Game 2: 1.2x
Complex Game 3: 1.25x
Casual Content: 1.25x
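Performance density is frame rate divided by GPU area (fps/mm²), so with both GPUs at the same process and frequency the per-title gain is simply the ratio of the two densities. The sketch below is illustrative only: the function names are hypothetical, and summarising the four charted ratios with a geometric mean is our assumption, not the method Arm used for its headline figure. It lands at about 1.27x, roughly in line with the quoted 30% gain.

```python
from math import prod


def perf_density(fps: float, area_mm2: float) -> float:
    """Frames per second delivered per square millimetre of GPU area."""
    return fps / area_mm2


def relative_perf_density(fps_new: float, area_new: float,
                          fps_base: float, area_base: float) -> float:
    """Ratio of two performance densities measured at the same process and clock."""
    return perf_density(fps_new, area_new) / perf_density(fps_base, area_base)


def geomean(ratios: list) -> float:
    """Geometric mean, a common way to summarise several speed-up ratios."""
    return prod(ratios) ** (1.0 / len(ratios))


# Per-title ratios from the chart: Complex Game 1-3 and Casual Content.
game_ratios = [1.4, 1.2, 1.25, 1.25]
print(f"{geomean(game_ratios):.2f}x")  # 1.27x
```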
Delivers Even Longer Gameplay
• Mali-G57 boosts energy efficiency across all workloads
• Delivers longer battery life for mainstream products
• Average 1.3x improvement in energy efficiency across a wide range of content
Relative energy efficiency, Mali-G57 vs Mali-G52 3EE (ISO process and frequency):
Complex content 1: 1.24x
Complex content 2: 1.29x
Complex content 3: 1.20x
Complex content 4: 1.39x
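Energy efficiency can be expressed as frames per joule (frame rate divided by power). At a fixed frame rate, an efficiency gain of E means each frame costs 1/E of the baseline energy, which is where the longer battery life comes from. The sketch below uses hypothetical function names and a plain arithmetic mean of the four charted ratios; the slides do not say how the "average 1.3x" was computed.

```python
from statistics import mean


def frames_per_joule(fps: float, power_w: float) -> float:
    """Energy efficiency as frames rendered per joule (fps divided by watts)."""
    return fps / power_w


def relative_efficiency(fps_new: float, power_new: float,
                        fps_base: float, power_base: float) -> float:
    """How many more frames per joule the new GPU delivers than the baseline."""
    return frames_per_joule(fps_new, power_new) / frames_per_joule(fps_base, power_base)


# Per-workload ratios from the chart: Complex content 1-4.
content_ratios = [1.24, 1.29, 1.20, 1.39]
avg = mean(content_ratios)
print(f"average efficiency gain: {avg:.2f}x")           # 1.28x
print(f"energy per frame vs baseline: {1 / avg:.2f}")   # 0.78
```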
Enhanced On-Device Intelligence
• Mali-G57 significantly improves machine learning inference performance
• 1.6x average improvement across multiple neural networks (ISO process and frequency)
[Chart: relative performance improvement in ML networks, Mali-G52 3EE vs Mali-G57]
Excellent Improvements, Configurability and Flexibility
• Introduction of the Valhall architecture to mainstream markets
• Single execution engine per shader core with more FMA units
• Quad texture mapper
• Optimized load and store cache for ML networks
• Configurable from 1 to 6 shader cores
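Because the shader-core count is chosen by the licensee, peak arithmetic throughput scales linearly with it. The sketch below assumes the 32 FMAs per core quoted for Mali-G57 in this deck and counts each FMA as two floating-point operations. It leaves the clock frequency as a free parameter because no frequency is given here. The function names are hypothetical, and the result is a theoretical peak, not a measured figure.

```python
FMA_PER_CORE = 32   # Mali-G57 figure quoted in this deck
FLOPS_PER_FMA = 2   # a fused multiply-add counts as one multiply plus one add
MIN_CORES, MAX_CORES = 1, 6


def fma_per_clock(cores: int) -> int:
    """Peak FMAs issued per clock across the whole GPU."""
    if not MIN_CORES <= cores <= MAX_CORES:
        raise ValueError("Mali-G57 is configurable from 1 to 6 shader cores")
    return FMA_PER_CORE * cores


def peak_gflops(cores: int, freq_ghz: float) -> float:
    """Theoretical peak GFLOPS; freq_ghz is a free (hypothetical) parameter."""
    return fma_per_clock(cores) * FLOPS_PER_FMA * freq_ghz


for n in range(MIN_CORES, MAX_CORES + 1):
    print(f"{n} core(s): {fma_per_clock(n)} FMA/clock, "
          f"{fma_per_clock(n) * FLOPS_PER_FMA} FLOP/clock")
```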
Mali™-G57 block diagram
• Inter-core task management
• Shader cores 1 to 6, each with control/scheduler, messaging, and two clusters of register file plus datapath
• Advanced Tiling Unit
• Memory management unit
• L2 cache slices
• AMBA® 4 ACE interfaces
Valhall Architecture Goals
• The new Mali architecture following Bifrost
• Second-generation Arm scalar GPU architecture for high-performance, high-efficiency GPUs
• 16-wide warp-based execution model
• New simplified and compiler-friendly instruction set
• Aligned to new APIs
Execution engines per shader core:
• Mali-G51 execution engine: 4 wide per engine (one 4-lane datapath), 3 engines per core
• Mali-G52 execution engine: 8 wide per engine (two 4-lane datapaths), 2 or 3 engines per core
• Mali-G57 execution engine: 16-wide warp per cluster, 2 clusters per engine (each with its own register file and 16-lane datapath), 1 engine per core
Each engine contains warp control, a scheduler and an instruction cache (icache), a messaging unit, register files and datapath control.
Valhall Fundamentals
• Warp-based execution model: 16 threads executed in lockstep in a warp
• New instruction set: operational equivalence to Bifrost; regular, unconstrained instruction encoding
• Dynamic scheduling of instructions: done by hardware; no more clauses, tuples and fixed issuing
• Dependency system
• New features: AFBC 1.3, support for FP16 render targets, hardware-allocated vertex shader outputs
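The slides say Valhall replaces Bifrost's compiler-built clauses and tuples with hardware instruction scheduling backed by a dependency system, but they do not describe that mechanism. As a generic illustration only, the sketch below models a simple register scoreboard: each instruction issues as soon as its source registers are ready, so the hardware rather than the compiler absorbs the wait. The register names, the latencies and the single-issue, in-order policy are all assumptions, not Valhall's documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dst: str
    srcs: tuple[str, ...]
    latency: int  # cycles until dst can be read (hypothetical values)


def schedule(program: list[Instr]) -> list[int]:
    """Issue cycle of each instruction: in order, at most one per cycle,
    stalling until every source register has been produced."""
    ready_at: dict[str, int] = {}  # register -> first cycle its value is readable
    cycle = 0
    issued = []
    for ins in program:
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        cycle = max(cycle, operands_ready)  # hardware waits here, no compiler padding
        issued.append(cycle)
        ready_at[ins.dst] = cycle + ins.latency
        cycle += 1
    return issued


program = [
    Instr("r0", ("a", "b"), latency=4),    # long-latency op, e.g. a load
    Instr("r1", ("c",), latency=1),        # independent, issues next cycle
    Instr("r2", ("r0", "r1"), latency=1),  # depends on r0, so it stalls
]
print(schedule(program))  # [0, 1, 4]
```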
[Figure: warp threads T0 to T15 over time; cycle 1 executes T0.x to T15.x, cycle 2 executes T0.y to T15.y, cycle 3 executes T0.z to T15.z]
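The figure shows how a scalar ISA handles a three-component vector operation. Instead of processing one thread's x, y and z together, the warp issues one scalar instruction per component, and all 16 threads execute it in lockstep. The sketch below mimics that ordering for a hypothetical vec3 addition. The function name is illustrative, and the sequential Python loop over lanes stands in for lanes that the hardware runs in parallel.

```python
WARP_SIZE = 16        # threads per warp on Valhall
COMPONENTS = "xyz"


def warp_vec3_add(a, b):
    """Add two vec3 values per thread, one scalar component per cycle.

    a, b: sequences of WARP_SIZE (x, y, z) tuples, one per warp thread.
    Returns the per-thread results and a log of what each cycle executed.
    """
    if len(a) != WARP_SIZE or len(b) != WARP_SIZE:
        raise ValueError("expected one vec3 per warp thread")
    out = [[0.0, 0.0, 0.0] for _ in range(WARP_SIZE)]
    log = []
    for c, comp in enumerate(COMPONENTS):
        # One scalar instruction: every lane applies it to its own thread's data.
        # (A Python loop here; the hardware lanes run in lockstep.)
        for lane in range(WARP_SIZE):
            out[lane][c] = a[lane][c] + b[lane][c]
        log.append(f"cycle {c + 1}: T0.{comp} .. T{WARP_SIZE - 1}.{comp}")
    return [tuple(v) for v in out], log


a = [(float(t), 1.0, 2.0) for t in range(WARP_SIZE)]
b = [(1.0, 1.0, 1.0)] * WARP_SIZE
result, log = warp_vec3_add(a, b)
print(result[15])      # (16.0, 2.0, 3.0)
print("\n".join(log))
```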
Efficient Shader Core with Increased Compute Capabilities
New gaming content is becoming more complex
• 32 FMAs per core
• 2.6x the FMAs of Mali-G51
• 2x the FMAs of Mali-G52 2EE
• 1.3x the FMAs of Mali-G52 3EE
[Chart: FMAs per core for Mali-G51, Mali-G52 2EE, Mali-G52 3EE and Mali-G57]
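The per-core FMA counts follow from the execution-engine layouts given in this deck: lanes per datapath × datapaths per engine × engines per core. The sketch below assumes that each lane completes one FMA per clock, which the slides do not state explicitly. Under that assumption it reproduces the charted ratios: 32/12 ≈ 2.67 (quoted as 2.6x), 32/16 = 2 and 32/24 ≈ 1.33. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderCore:
    name: str
    lanes_per_datapath: int
    datapaths_per_engine: int
    engines_per_core: int

    @property
    def fma_per_core(self) -> int:
        # Assumes each lane completes one FMA per clock.
        return self.lanes_per_datapath * self.datapaths_per_engine * self.engines_per_core


CORES = [
    ShaderCore("Mali-G51", 4, 1, 3),
    ShaderCore("Mali-G52 2EE", 4, 2, 2),
    ShaderCore("Mali-G52 3EE", 4, 2, 3),
    ShaderCore("Mali-G57", 16, 2, 1),
]

g57 = CORES[-1].fma_per_core
for core in CORES:
    print(f"{core.name}: {core.fma_per_core} FMA/core, "
          f"G57 ratio {g57 / core.fma_per_core:.2f}x")
```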
Quad Texture Mapper Doubles Throughput
New gaming content is becoming more complex
• 4 bilinear texels per cycle
• 2x Mali-G52 throughput
[Chart: bilinear texels/clock for Mali-G51, Mali-G52 and Mali-G57]
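Peak texturing rate is texels per clock × shader cores × clock frequency. The Mali-G52 value of 2 texels per clock is implied by "4 texels/cycle, 2x Mali-G52 throughput". The sketch below assumes that the per-clock figure applies to each shader core, and it uses a hypothetical 4-core configuration at a hypothetical 1 GHz. With equal core counts and clocks, both cancel out, leaving the 2x per-core ratio.

```python
G52_TEXELS_PER_CLOCK = 2  # implied by "4 texels/cycle, 2x Mali-G52"
G57_TEXELS_PER_CLOCK = 4


def peak_texel_rate(texels_per_clock: int, cores: int, freq_hz: float) -> float:
    """Theoretical peak bilinear texels per second, assuming the per-clock
    figure applies to each shader core."""
    return texels_per_clock * cores * freq_hz


# Hypothetical 4-core GPUs at the same hypothetical 1 GHz clock:
# the clock and core count cancel, leaving the per-core ratio.
cores, freq = 4, 1e9
ratio = (peak_texel_rate(G57_TEXELS_PER_CLOCK, cores, freq)
         / peak_texel_rate(G52_TEXELS_PER_CLOCK, cores, freq))
print(ratio)  # 2.0
```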
Improved Load and Store Cache
• Throughput improvements: the internal datapath is cacheline wide
• Latency improvements: the number of pipeline stages is reduced by half
[Chart: relative NN performance, Mali-G52 3EE vs Mali-G57]
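Both load/store improvements can be reasoned about with simple arithmetic. A cacheline-wide internal datapath moves a whole line in one transfer instead of several narrower beats, and halving the pipeline depth removes cycles from every access. The sketch below uses a hypothetical 64-byte line, a hypothetical 16-byte "narrow" datapath and hypothetical stage counts of 8 and 4. None of these numbers come from Arm, and the latency model is deliberately simplistic.

```python
import math

LINE_BYTES = 64  # hypothetical cache-line size; not stated in the slides


def beats_per_line(line_bytes: int, datapath_bytes: int) -> int:
    """Transfers needed to move one cache line through the internal datapath."""
    return math.ceil(line_bytes / datapath_bytes)


def pipeline_latency(stages: int, beats: int) -> int:
    """Cycles from request to last data beat in a simple pipelined model."""
    return stages + beats - 1


narrow = beats_per_line(LINE_BYTES, 16)        # hypothetical narrower datapath
wide = beats_per_line(LINE_BYTES, LINE_BYTES)  # cacheline-wide datapath
print(narrow, wide)  # 4 1
# Hypothetical 8-stage pipeline vs one with half the stages.
print(pipeline_latency(8, narrow), pipeline_latency(4, wide))  # 11 4
```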
Bringing Premium Device Experiences Mainstream
• High-end graphics performance at increased efficiency
• First mainstream GPU with the new Valhall architecture
• Outstanding Mali GPU performance improvement to enable premium use cases on mainstream devices
• 30% more energy efficient*
• 30% more performance density*
• 60% machine learning improvement*
*Compared to Mali-G52 3EE on the same process node under similar conditions
Thank You · Danke · Merci · 谢谢 · ありがとう · Gracias · Kiitos · 감사합니다 · धन्यवाद · شكرًا · תודה
© 2019 Arm Limited
The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks