THE PROGRAMMER’S GUIDE TO A
UNIVERSE OF POSSIBILITY
Phil Rogers
AMD
Corporate Fellow
Heterogeneous System Architecture
2 | The Programmer’s Guide to a Universe of Possibility | June 12, 2012
Most parallel code runs on CPUs designed for scalar workloads
WHAT DID WE HEAR FROM TOM MALLOY THIS MORNING?
CHANGING THE THINKING
Typically platform builders create innovative new
hardware and offer an API for software to access it
That tired thinking has only ever had niche success!
HETEROGENEOUS SYSTEM ARCHITECTURE ROADMAP
HETEROGENEOUS SYSTEM ARCHITECTURE Brings All the Processors in a System into Unified Coherent Memory
POWER EFFICIENT
EASY TO PROGRAM
FUTURE LOOKING
ESTABLISHED TECHNOLOGY FOUNDATION
OPEN STANDARD
INDUSTRY SUPPORT
THE HSA OPPORTUNITY ON MODERN APPLICATIONS
[Figure: Developer Return (differentiation in performance, reduced power, features, time to market) plotted against Developer Investment (effort, time, new skills).]
PROBLEM: Historically, developers program CPUs
Good user experiences
~4M apps, ~10+M* CPU coders
PROBLEM: GPU/HW blocks hard to program; not all workloads accelerate
Significant niche value
~200 apps, ~100K GPU coders
SOLUTION: HSA + Libraries = productivity & performance with low power
Wide range of differentiated experiences
Few 100Ks HSA apps, few M HSA coders
*IDC
APPLICATION AREAS WITH ABUNDANT PARALLEL WORKLOADS
Biometric Recognition
Secure, fast, accurate: face, voice, fingerprints
Beyond HD Experiences
Streaming media, new codecs, 3D, transcode, audio
Augmented Reality
Superimpose graphics, audio, and other digital information as a virtual overlay
AV Content Management
Searching, indexing, and tagging of video and audio; multimedia data mining
Natural UI & Gestures
Touch, gesture, and voice
Content Everywhere
Content from any source to any display seamlessly
HAAR Face Detection
CORNERSTONE TECHNOLOGY
FOR COMPUTER VISION
LOOKING FOR FACES IN ALL THE RIGHT PLACES
Quick HD Calculations
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600
Search squares = 1900 x 1060 = ~2 Million
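The counts above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the search square slides one pixel at a time, which is what the 1900 x 1060 figure implies; the function name is illustrative only.

```python
def count_search_squares(width, height, window=21):
    """Number of positions a window x window square can occupy in a frame,
    assuming the square moves one pixel at a time (an assumption)."""
    if width < window or height < window:
        return 0
    return (width - window + 1) * (height - window + 1)


pixels = 1920 * 1080
squares = count_search_squares(1920, 1080)
print(pixels)   # 2073600
print(squares)  # 2014000, i.e. about 2 million
```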
LOOKING FOR DIFFERENT SIZE FACES – BY SCALING THE VIDEO FRAME
More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
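A rough way to check these totals is to sum over every scaled copy of the frame. The sketch below assumes each level is 70% of the previous one in both dimensions, that sizes are truncated to whole pixels, and that scaling stops once a 21 x 21 square no longer fits; the slide does not state these details, so the result only approximates its figures.

```python
def pyramid_totals(width, height, scale_percent=70, window=21):
    """Sum pixels and search-square positions over all scaled frames.

    Integer arithmetic (w * 70 // 100) keeps the truncation deterministic.
    The stopping rule and the rounding are assumptions, not from the slide.
    """
    total_pixels = 0
    total_squares = 0
    w, h = width, height
    while w >= window and h >= window:
        total_pixels += w * h
        total_squares += (w - window + 1) * (h - window + 1)
        w = w * scale_percent // 100
        h = h * scale_percent // 100
    return total_pixels, total_squares


total_pixels, total_squares = pyramid_totals(1920, 1080)
print(total_pixels / 1e6, total_squares / 1e6)
```

Under these assumptions the totals come out a little above 4 million pixels and a little under 3.9 million search squares, in line with the rounded numbers on the slide.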
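HAAR CASCADE STAGES
[Figure: flow diagram of two consecutive cascade stages, Stage N and Stage N+1, each evaluating several features (labeled k, l, m, p, q, r). After each stage: face still possible? Yes: continue to the next stage. No: REJECT FRAME.]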
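The early-out structure of the cascade can be sketched as follows. The stage layout, thresholds, and feature functions here are hypothetical stand-ins; real Haar features are sums over rectangular pixel regions, which this toy example does not attempt to reproduce.

```python
def run_cascade(window, stages):
    """Evaluate stages in order and stop at the first stage that fails.

    stages is a list of (features, threshold) pairs; each feature is a
    function of the window. Returns (is_face, stages_evaluated).
    """
    for depth, (features, threshold) in enumerate(stages, start=1):
        score = sum(feature(window) for feature in features)
        if score < threshold:
            return False, depth  # early out: a face is no longer possible
    return True, len(stages)


# Hypothetical two-stage cascade over made-up window properties.
toy_stages = [
    ([lambda w: w["brightness"]], 0.5),
    ([lambda w: w["contrast"], lambda w: w["symmetry"]], 1.2),
]
dark_patch = {"brightness": 0.2, "contrast": 0.9, "symmetry": 0.9}
face_like = {"brightness": 0.8, "contrast": 0.7, "symmetry": 0.6}
print(run_cascade(dark_patch, toy_stages))  # (False, 1)
print(run_cascade(face_like, toy_stages))   # (True, 2)
```

Most search squares contain no face, so most of them leave the loop after only a stage or two.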
22 CASCADE STAGES, EARLY OUT BETWEEN EACH
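[Figure: chain of 22 cascade stages (STAGE 1, STAGE 2, … STAGE 21, STAGE 22) with an early exit to NO FACE after each stage and FACE CONFIRMED after the last stage.]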
Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs
Calculation Rate
30 frames/sec = 1.4 TCalcs/second
60 frames/sec = 2.8 TCalcs/second
…and this only gets front-facing faces
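The arithmetic behind these figures is a straight multiplication, shown here as a small sketch. The function name and parameter names are illustrative; the input values are the ones given on the slide.

```python
def calcs_per_frame(search_squares, features_per_square, calcs_per_feature):
    """Total calculations needed for one frame."""
    return search_squares * features_per_square * calcs_per_feature


per_frame = calcs_per_frame(3_800_000, 124, 100)
print(per_frame / 1e9)  # 47.12 (GCalcs per frame)

for fps in (30, 60):
    # about 1.41 TCalcs/second at 30 fps and 2.83 at 60 fps
    print(fps, round(per_frame * fps / 1e12, 2))
```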
CASCADE DEPTH ANALYSIS
[Figure: cascade depth reached (scale 0 to 25), shown in bins 0-5, 5-10, 10-15, 15-20, 20-25.]
UNBALANCING DUE TO EXITS IN EARLIER CASCADE STAGES
[Figure legend: Live, Dead]
When running on the GPU, we run each search rectangle on a separate work item
Early out algorithms, like HAAR, exhibit divergence between work items
Some work items exit early
Their neighbors continue
SIMD packing suffers as a result
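The cost of divergence can be estimated with a simple model. The sketch assumes work items are packed into fixed-width SIMD groups in order, and that a group keeps issuing instructions until its deepest work item finishes; the group width of 64 and the function name are assumptions for illustration.

```python
def simd_utilization(exit_depths, simd_width=64):
    """Fraction of issued SIMD lane-slots that do useful work.

    exit_depths[i] is the number of cascade stages work item i runs
    before it exits. A group runs for as long as its deepest lane.
    """
    useful = 0
    issued = 0
    for start in range(0, len(exit_depths), simd_width):
        group = exit_depths[start:start + simd_width]
        useful += sum(group)
        issued += simd_width * max(group)
    return useful / issued


print(simd_utilization([1] * 64))                    # 1.0, no divergence
print(round(simd_utilization([1] * 63 + [22]), 2))   # about 0.06
```

In the second case a single work item that runs all 22 stages keeps 63 exited neighbors idle, which is the packing loss the slide describes.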
PROCESSING TIME/STAGE
[Figure: chart of processing time (ms, axis 0 to 100) per cascade stage (1 through 8, then 9-22), GPU vs. CPU, on “Trinity” A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).]
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
PERFORMANCE CPU-VS-GPU
[Figure: chart of images/sec (axis 0 to 12) against the number of cascade stages run on the GPU (0 through 8, and 22); series: CPU, HSA, GPU; on “Trinity” A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).]
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
HAAR SOLUTION – RUN DIFFERENT CASCADES ON GPU AND CPU
By seamlessly sharing data between CPU and GPU,
HSA allows the right processor to handle its appropriate workload
+2.5x INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME
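The split can be sketched as a two-phase filter: the first few stages run over every search square (the data-parallel phase suited to the GPU), and only the survivors continue through the remaining stages (the divergent phase suited to the CPU). Both phases run as ordinary Python here; the split point, names, and toy stages are assumptions for illustration, not the implementation measured above.

```python
def passes(window, stage_subset):
    """True if the window clears every stage given; all() stops at the
    first failing stage, which gives the early out."""
    return all(
        sum(feature(window) for feature in features) >= threshold
        for features, threshold in stage_subset
    )


def hybrid_detect(windows, stages, gpu_stage_count):
    """Run the first gpu_stage_count stages on all windows, the rest on survivors."""
    # Phase 1: stand-in for the GPU pass over every search square.
    survivors = [w for w in windows if passes(w, stages[:gpu_stage_count])]
    # Phase 2: stand-in for the CPU pass over the few remaining candidates.
    faces = [w for w in survivors if passes(w, stages[gpu_stage_count:])]
    return survivors, faces


# Toy data: each "window" is a number and each stage a rising threshold.
toy_stages = [([lambda w: w], 1), ([lambda w: w], 5), ([lambda w: w], 9)]
survivors, faces = hybrid_detect(list(range(12)), toy_stages, gpu_stage_count=1)
print(len(survivors), faces)  # 11 [9, 10, 11]
```

Because both processors share memory under HSA, the survivor list would not need to be copied between the two phases.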
ACCELERATING MEMCACHED
CLOUD SERVER WORKLOAD
DATACENTER WORKLOAD
Generally used for short-term storage and caching, handling requests
that would otherwise require database or file system accesses
Used by Facebook, YouTube, Twitter, Wikipedia, Flickr, and others
Effectively a large distributed hash table
Responds to store and get requests received over the network
Conceptually:
store(key, object)
object = get(key)
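The store/get model can be sketched in a few lines. This is a toy, in-process illustration of a hash table spread over several nodes, not memcached's protocol or client API; the class names and the simple modulo placement are assumptions, and production clients often use consistent hashing instead.

```python
import hashlib


class CacheNode:
    """One server's in-memory table."""

    def __init__(self):
        self._table = {}

    def store(self, key, obj):
        self._table[key] = obj

    def get(self, key):
        return self._table.get(key)  # None on a miss


class DistributedCache:
    """Routes each key to one node by hashing the key."""

    def __init__(self, node_count):
        self.nodes = [CacheNode() for _ in range(node_count)]

    def _node_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.nodes)
        return self.nodes[index]

    def store(self, key, obj):
        self._node_for(key).store(key, obj)

    def get(self, key):
        return self._node_for(key).get(key)


cache = DistributedCache(node_count=4)
cache.store("user:42", {"name": "Ada"})
print(cache.get("user:42"))  # {'name': 'Ada'}
print(cache.get("user:99"))  # None
```

The key lookup inside each node is the step the following slide offloads to the GPU.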
OFFLOADING MEMCACHED KEY LOOKUP TO THE GPU
[Figure: two charts, “Key Look Up Performance” (axis 0 to 4) and “Execution Breakdown” (0 to 100%, split into Data Transfer and Execution), comparing Multithreaded CPU, Radeon HD 5870, “Trinity” A10-5800K, and Zacate E-350.]
T. H. Hetherington, T. G. Rogers, L. Hsu, M. O’Connor, and T. M. Aamodt, “Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems,”
Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2012), April 2012.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6189209
ACCELERATING JAVA
GOING BEYOND NATIVE LANGUAGES
JAVA ENABLEMENT BY APARAPI
Developer creates Java™ source
Source compiled to class files (bytecode) using standard compiler (javac)
Classes packaged and deployed using established Java™ tool chain
Aparapi = Runtime capable of converting Java™ bytecode to OpenCL™
For execution on any OpenCL™ 1.1+ capable device
OR execute via a thread pool if OpenCL™ is not available
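The fallback described in the last two lines is a dispatch decision: use an accelerator backend when one is present, otherwise run the same per-item function on a thread pool. Aparapi itself is a Java library; the Python sketch below only mirrors that decision, and every name in it is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def execute(kernel, global_size, accelerator=None, max_workers=4):
    """Run kernel(global_id) for every id in range(global_size).

    accelerator is a hypothetical callable standing in for an OpenCL
    backend; when it is None the work falls back to a thread pool.
    Returns the name of the path taken.
    """
    if accelerator is not None:
        accelerator(kernel, global_size)
        return "accelerator"
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the results so an exception in any kernel surfaces here.
        list(pool.map(kernel, range(global_size)))
    return "thread_pool"


values = [1, 2, 3, 4, 5, 6, 7, 8]
squares = [0] * len(values)


def square_kernel(global_id):
    # Each work item writes only its own slot, so no locking is needed.
    squares[global_id] = values[global_id] ** 2


backend = execute(square_kernel, len(values))
print(backend, squares)  # thread_pool [1, 4, 9, 16, 25, 36, 49, 64]
```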
JAVA AND APARAPI HSA ENABLEMENT ROADMAP
[Figure: successive software stack diagrams, each topped by Application. Component labels that survive: Aparapi, JVM, HSA-Enabled JVM, OpenCL™, LLVM Optimizer, IR, HSAIL, HSA Runtime, HSA Finalizer, CPU ISA, GPU ISA, CPU, GPU, HSA CPU, HSA GPU.]
HSA SOFTWARE STACKS
INTRODUCING HSA BOLT – PARALLEL PRIMITIVES LIBRARY FOR HSA
Easily leverage the inherent power efficiency of GPU computing
Common routines such as scan, sort, reduce, transform
More advanced routines like heterogeneous pipelines
Bolt library works with OpenCL or C++ AMP
Enjoy the unique advantages of the HSA platform
Move the computation not the data
Finally a single source code base for the CPU and GPU!
Developers can focus on core algorithms
See Ben Sander’s session tomorrow for a deep dive on HSA Bolt!
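For readers unfamiliar with the four routines named above, the sketch below shows what each one computes, using Python's standard library on a small list. It illustrates the semantics only; it is not Bolt's API and runs serially on the CPU.

```python
from functools import reduce
from itertools import accumulate
import operator

data = [3, 1, 4, 1, 5]

# transform: apply a function to every element independently
transformed = list(map(lambda x: x * 2, data))
# reduce: combine all elements into one value
total = reduce(operator.add, data, 0)
# scan: running (inclusive prefix) sums
prefix_sums = list(accumulate(data))
# sort: order the elements
ordered = sorted(data)

print(transformed)  # [6, 2, 8, 2, 10]
print(total)        # 14
print(prefix_sums)  # [3, 4, 8, 9, 14]
print(ordered)      # [1, 1, 3, 4, 5]
```

Each of these has a well-known data-parallel formulation, which is why they are common building blocks for a parallel primitives library.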
HSA SOLUTION STACK
[Figure: layered block diagram. Row labels: Application SW, Drivers, Differentiated HW. Blocks: Application; Domain Specific Libs (Bolt, OpenCV™, … many others); OpenCL™ Runtime; DirectX Runtime; Other Runtime; HSA Runtime; HSAIL; HSA Finalizer; GPU ISA; Legacy Drivers; HSA Software; Knl Driver; Ctl; CPU(s); GPU(s); Other Accelerators.]
AMD’S OPEN SOURCE COMMITMENT TO HSA
Component Name                 AMD Specific   Rationale
HSA Bolt Library               No             Enable understanding and debug
OpenCL HSAIL Code Generator    No             Enable research
LLVM Contributions             No             Industry and academic collaboration
HSA Assembler                  No             Enable understanding and debug
HSA Runtime                    No             Standardize on a single runtime
HSA Finalizer                  Yes            Enable research and debug
HSA Kernel Driver              Yes            For inclusion in Linux distros
We will open source our Linux execution and compilation stack
Jump start the ecosystem
Allow a single shared implementation where appropriate
Enable university research in all areas
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV “Hessian” Kernel)
[Figure: chart of lines of code (LOC, axis 0 to 350), broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back, with Performance (axis 0 to 35.00) overlaid, for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt.]
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
THE HSA FUTURE
Highly productive programmers
+ Scalable performance
+ Power efficiency
= AMAZING USER EXPERIENCES
ANNOUNCING…
THE HSA FOUNDATION PHILIP ROGERS, PRESIDENT
THE HSA FOUNDATION: ACTIVITIES
Nonprofit, open standardization body for HSA platforms that will own the
development and evangelization of the architecture going forward
Make heterogeneous programming easy and a first-class pervasive
complement to CPU computing
Continue to increase the power efficiency of HSA, keeping it the platform of
choice from smartphones to the cloud
Bring to market strong development solutions (tools, libraries, OS runtimes)
to drive innovative advanced content and applications
Foster growth of heterogeneous computing talent through HSA developer
training and academic programs to drive both learning and innovation
© Copyright 2012 HSA Foundation. All Rights Reserved. 37
AMD’S CONTRIBUTION TO DATE
HSA draft specifications
HSA Programmer Reference Manual
HSA Hardware System Architecture Specification
HSA Software System Architecture Specification
Open source execution stack and compiler technology
HSA Bolt library – standard template library
Initial funding for incorporation
FOUNDATION CATEGORIES OF MEMBERS
Founder
Promoter
Supporter
Contributor
Academic
Associate
HSA FOUNDATION INITIAL FOUNDERS
represented here by ,
ARM Fellow and VP of Technology, Media Processing
BRINGING VISUAL COMPUTING TO LIFE
JEM DAVIES
ARM Fellow, VP of Technology
Media Processing Division, ARM
ARM COMMITTED TO HETEROGENEOUS COMPUTING
HSA FOUNDATION INITIAL FOUNDERS
represented here by ,
ARM Fellow and VP of Technology, Media Processing
represented here by ,
President, Imagination Technologies USA
represented here by ,
President, MediaTek USA, Inc.
represented here by ,
Director, Linux Development Center
represented here by ,
CVP, Heterogeneous Applications and Developer Solutions
THE HSA FOUNDATION
www.hsafoundation.com
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, the HSA logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL™ is
a trademark of Apple Inc. which is licensed to the Khronos Group. All other names used in this presentation are for
informational purposes only and may be trademarks of their respective owners.
© 2012 Advanced Micro Devices, Inc.