© 2019 Arm Limited
An exciting road ahead
100 billion chips shipped from 1991 to 2017 (26 years); another 100 billion projected from 2017 to 2021 (just 4 years).
The Next 100 Billion in 4 Years
Another 100 billion chips are projected to ship from 2017 to 2021.
Forecast mix of Arm chips (Classic, Cortex-A, R, M):
• Embedded and Automotive: 40%
• Infrastructure: 15%
• Mobile and Consumer Electronics: 45%
Distributing Intelligence from Edge to Cloud
• On-device learning for enhanced user privacy
• Compute performance to deliver a high-fidelity world
• Real-time inference for autonomous systems
• Security and privacy for your data
• 4K, HDR and 5G for more human-like interfaces
Today’s compute model
[Diagram: central data centers create and distribute media content over the 4G network; billions of people consume it.]
Data consumption is driving future designs
[Diagram: trillions of devices generate massive amounts of data over 5G; edge nodes filter and react, making local decisions, while cloud data centers analyze and store the critical data.]
Transforming infrastructure
[Diagram: Cloud Data Center ← Edge Cloud (gateways, uCPE) ← 5G Edge Access]
• Edge access (5G network): value-added microservices; data privacy; real-time decisions made at the source
• Edge cloud (5G network): content cache; cloud application deployment to meet latency targets; network analytics and management
• Core cloud: analytics and storage of critical data
Motivating the Edge
• Latency
• Bandwidth constraints
• Security and privacy
High Performance, Secure IP and Architectures
Diverse Solutions and Ecosystem
Scalable from Hyperscale to the Edge
The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices
Each generation, faster performance
~30% faster performance and new features per generation:
• Cosmos platform, 16nm (A72, A75) — today
• Ares platform, 7nm (N1, E1)
• Zeus platform, 7+nm
• Poseidon platform, 5nm
Scalable system solutions from cloud to edge
[Diagram: the same platform IP scales from cloud data centers, across the 5G network, down to edge nodes. Approximate end points of the range:
• Arm CPUs: 128 big cores plus 256 data-plane cores at the high end, down to 4 cores
• Bandwidth: 1 TB/s down to 20 GB/s
• System cache: 128 MB down to 0 MB
• HBM stacks: 8 down to 0
• I/O: 8-channel DDR, PCIe, 100GbE and CCIX down to 1-channel DDR, 10G and radio]
Addressing Moore’s Law
• Can we use the additional transistors to unlock more CPU performance?
• By processing more instructions or more data in parallel per cycle.
• New microarchitecture can extract more Instruction-Level Parallelism (ILP).
• But there are limits to this hardware magic.
• Could new architecture allow us to express greater parallelism in our code?
• Tackling the unbending nature of Amdahl’s Law requires far more parallelisation.
• But without needing to rewrite the world’s software.
Scalable Vector Extension (SVE), recap
A vector extension to the Armv8-A architecture with some major new features
• Gather-load and scatter-store: loads a single register from several non-contiguous memory locations.
• Per-lane predication: operations work on individual lanes under control of a predicate register.
• Predicate-driven loop control and management: eliminates scalar loop heads and tails by processing partial vectors.
• Vector partitioning and software-managed speculation: first-faulting load instructions allow memory accesses to cross into invalid pages.
• Extended floating-point horizontal reductions: in-order and tree-based reductions trade off performance and repeatability.
[Diagram: per-lane predication (1,2,3,4 plus 5,5,5,5 under predicate 1,0,1,0 gives 6,2,8,4); in-order vs tree-based reduction of 1+2+3+4 via partial sums 3 and 7; and predicate-driven control of the loop for (i = 0; i < n; ++i) using INDEX and CMPLT to form the predicate.]
Arm Instruction Emulator (ArmIE)
Develop your user-space applications for future hardware today
• Start porting and tuning for future architectures early: reduce time to market and save development and debug time with Arm support.
• Run 64-bit user-space Linux code that uses new hardware features on current Arm hardware: SVE support is available now, tested with the Arm Architecture Verification Suite (AVS).
• Runs at close to native speed, with commercial support: integrates with DynamoRIO, allowing arbitrary instrumentation extensions; emulates only the unsupported instructions; integrates with other commercial Arm tools, including the compiler and profiler; maintained and supported by Arm for a wide range of Arm-based SoCs.
Applying ArmIE methodology to workloads
• Compile application with SVE-capable compiler and run it through ArmIE:
$ armie -msve-vector-bits=512 -i libinscount_emulated.so -- ./sve_app
• (Optional) Use Region-of-Interest (RoI) markers in the code to delimit regions of interest
• Select between several SVE-ready instrumentation clients
• Successfully applied to various popular mini-apps and benchmarks
Arm Community Blog: “Emulating SVE on existing Armv8-A hardware using DynamoRIO and ArmIE”, Miguel Tairum -- http://bit.ly/2wN4P6M
Arm Community Blog: “Parallelizing HPCG's main kernels”, Daniel Ruiz -- http://bit.ly/2ZtstSb
Ookami (Stony Brook, USA)
"Three perspectives on message passing" - Robert Harrison, Director of the Institute of Advanced Computational Science (IACS) and Brookhaven Computational Science Center (CSC) -- https://www.youtube.com/watch?v=WkepRUw0ri0
Scalable Vector Extension v2 (SVE2)
Scalable Data-Level Parallelism (DLP) for more applications
Built on the SVE foundation:
• Scalable vectors with a hardware choice from 128 to 2048 bits.
• Vector-length agnostic programming for “write once, run anywhere”.
• Predication and gather/scatter allow more code to be vectorized.
• Tackles some obstacles to compiler auto-vectorization.

Scaling single-thread performance to exploit long vectors:
• SVE2 adds NEON™-style fixed-point DSP/multimedia operations plus other new features.
• Performance parity and beyond with classic NEON DSP/media SIMD.
• Tackles further obstacles to compiler auto-vectorization.

Enables vectorization of a wider range of applications than SVE:
• Multiple use cases in Client, Edge, Server and HPC: DSP, codecs/filters, computer vision, photography, game physics, AR/VR, networking, baseband, database, cryptography, genomics, web serving.
• Improves the competitiveness of Arm-based CPUs vs proprietary solutions.
• Reduces software development time and effort.
Built on SVE
Improved scalability
Vectorization ofmore workloads
Announced by Nigel Stephens, Arm Fellow, at Linaro Connect BKK, April 2019
SVE2 enhancements
▪ NEON-style “DSP” instructions
• Traditional NEON fixed-point, widening, narrowing & pairwise ops
• Fixed-point complex dot product, etc. (LTE)
• Interleaved add w/ carry (wide multiply, BigNums)
• Multi-register table lookup (LTE, CV, shuffle)
• Enhanced vector extract (FIR, FFT)
▪ Cross-lane match detect / count
• In-memory histograms (CV, HPC, sorting)
• In-register histograms (CV, G/S pointer de-alias)
• Multi-character search (parsers, packet inspection)
▪ Non-temporal Gather / Scatter
• Explicit cache segregation (CV, HPC, sorting)
▪ Bitwise operations
• PMULL32→64, EORBT, EORTB (CRC, ECC, etc.)
• BCAX, BSL, EOR3, XAR (ternary logic + rotate)
▪ Bit shuffle
• BDEP, BEXT, BGRP (LTE, compression, genomics)
▪ Cryptography
• AES, SM4, SHA3, PMULL64→128
▪ Miscellaneous vectorisation
• WHILEGE/GT/HI/HS (down-counting loops)
• WHILEWR/RW (contiguous pointer de-alias)
• FLOGB (other vector trig)
▪ Requires only ID register changes to the SVE Linux kernel support
Transactional Memory Extension (TME)
Scalable Thread-Level Parallelism (TLP) for multi-threaded applications
Hardware Transactional Memory (HTM) for the Arm architecture:
• Improved competitiveness with other architectures that support HTM.
• Strong isolation between threads.
• Failure atomicity.

Scaling multi-thread performance to exploit many-core designs:
• Database.
• Network dataplane.
• Dynamic web serving.

Simplifies software design for massively multi-threaded code:
• Supports Transactional Lock Elision (TLE) for existing locking code.
• Low-level concurrent access to shared data is easier to write and debug.
Improved scalability
Hardware Transactional Memory
Simpler software design
Announced by Nigel Stephens, Arm Fellow, at Linaro Connect BKK, April 2019
Seriously committed to High Performance
• Arm is committed to enabling innovation across the whole compute continuum (including HPC!).
• New ways to scale performance and exploit additional transistors as Moore's Law slows.
• Extracting more parallelism from existing software:
– SVE2: improved auto-vectorization, with support for DSP/media hand-coded SIMD. Scalable vectorization for increased fine-grain Data-Level Parallelism (DLP): more work done per instruction.
– TME: easier lock-free programming for lightly-contended shared data structures. Scalable concurrency to increase coarse-grain Thread-Level Parallelism (TLP): more work done per thread.
– SVE2 and TME may combine for even greater performance scaling, tackling Amdahl's law on multiple fronts with a mix of DLP and TLP in multi-threaded applications.
These new technologies are not yet part of any announced product roadmap.
What we are up to in Research / SLSS
• Collaborative HPC activities with the newly established Centre of Excellence (Filippo Spiga)
– Workload evaluation continues with gem5/ArmIE while waiting for real SVE hardware
– Exploring other computational areas (e.g. genomics)
• Applied Analytics and Irregular Applications (Doug Joseph)
– Funded projects around unsupervised and semi-supervised learning
– Graph analytics
• High Performance Networking and Direct I/O Compute (Pavel Shamis)
– Co-design to tighten the coupling between off-chip interconnect and compute elements
• Reliability for large-scale Arm-based systems (Reiley Jeyapaul)
– Involvement in MontBlanc2020
• Edge Computing, Edge-to-Cloud and Smarter Cities (Eric Van Hensbergen)
• Enabling scalable Deep Learning training on Arm using SVE (Filippo Spiga)
– Enablement across the full stack, from libraries to automatic generation of optimized operators
– Is DL/AI a killer app for SVE?