COPYRIGHT © 2015 ARM
Measuring the Whole System: Holistic Profiling of CPU and GPU for Optimal Vision Applications on ARM Platforms
Tim Hartley
The Evolution of Mobile GPU Compute
Timeline: 2007, 2009, 2010, 2012, 2013, 2014, 2015
APIs:
OpenGL® ES 1.1: fixed pipeline
OpenGL ES 2.0: programmable pipeline
OpenCL™ Full Profile / RenderScript: portable heterogeneous parallel computation
OpenGL ES 3.1 compute shaders: GPU compute within the OpenGL ES API
GPUs: ARM® Mali™-55, Mali-200, Mali-300, Mali-400 MP, Mali-450 MP, Mali-T600 Series, Mali-T700 & T800 Series
Measuring the Whole System
Computer vision will, for some time, succeed in using every drop of processing power we give it
  Techniques in computer vision are still evolving rapidly
  New, complex, sustained low-power use cases
Building computer vision applications is an ever more complex process
  The availability of more processors and processor types makes this even more so
Capturing and analyzing accurate and effective measurements from platforms plays a vital role in achieving optimal performance
Modern Computer Vision Applications
[Diagram: a vision application running across four CPU cores (each with NEON), a GPU and a DSP]
Inside an ARM Mali Midgard Core
SIMD: several components per operation
  128-bit registers
VLIW: several operations per instruction word
  Some operations are “free”
Built-in function library
  Accelerated in hardware
$T = \max(A_0, A_1, LS, Tex)$
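The formula on this slide can be read as T = max(A0, A1, LS, Tex): the arithmetic, load/store and texture pipes work in parallel, so the time a core spends on a kernel is roughly that of its busiest pipe. A minimal Python sketch of that reading follows; the function names and the cycle counts are illustrative assumptions, not measurements from the talk.

```python
def tripipe_cycles(a0, a1, ls, tex):
    """Estimate core cycles as the cycle count of the busiest pipe.

    a0, a1: cycles on the two arithmetic pipes
    ls:     cycles on the load/store pipe
    tex:    cycles on the texture pipe
    """
    return max(a0, a1, ls, tex)


def limiting_pipe(a0, a1, ls, tex):
    """Return the name of the pipe that determines the estimate."""
    pipes = {"A0": a0, "A1": a1, "LS": ls, "Tex": tex}
    return max(pipes, key=pipes.get)


# Hypothetical per-pipe cycle counts for one kernel.
cycles = dict(a0=1200, a1=1100, ls=3400, tex=0)
print(tripipe_cycles(**cycles), limiting_pipe(**cycles))
```

With these made-up numbers the load/store pipe sets the total, so trimming arithmetic work would not shorten the kernel.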
Hardware Counters
Counters per core
  Active cycles
  Pipe activity
  L1 cache
Counters for the GPU
  Active cycles
  L2 caches
  MMU
Accessed through DS-5 Streamline
  Timeline of all hardware counters, and more
  Explore the execution of the full application
  Zoom in on details
DS-5 Streamline
Identify hotspots and system bottlenecks at a glance
Select from CPU/GPU counters, OS level and custom data sources
Accumulate counters, measure time and find instant hotspots
Select one or more tasks to isolate their contribution
Combined task switching trace and sample-based profile
Example: Complex Computer Vision Application
Lane and Car Detection
Streamline
Streamline: OpenCL Timeline
Optimisation Overview
[Flowchart, summarised as nested questions]
Limited by kernel execution time or mem management?
  Memory management:
    Ensure you are not copying memory unnecessarily.
  Kernel execution: limited by arith ops or mem ops?
    Arithmetic:
      Check: Limited to 64 threads? Large no. of register bank conflicts? Large no. of instruction cache misses?
      Remedies: Reduce register pressure. Simplify or shorten kernels. Vectorise the kernel if possible. Decrease the arith work if possible.
    Mem ops:
      Check: High number of instruction re-issues? Limited to 64 threads? Large no. of instruction cache misses?
      Remedies: Reduce register pressure. Simplify or shorten kernels. Vectorise the LS operations if possible. Decrease mem accesses if possible. Improve memory access pattern to improve cache efficiency.
After each change: limited by same factors? Depending on the answer, reiterate or stop (done optimising).
Deriving Meaning from Hardware Counters
Counters on their own usually don’t mean a huge amount
Combining counters is more useful
  Comparing values to determine limiting pipes
  Calculating more meaningful values from multiple values
New graph traces can be added from these counters
  …and become an integral part of the timeline
Custom Charts: Bringing Counters Together
100 * $MaliLoadStorePipeLSInstructions / $MaliLoadStorePipeLSInstructionIssues
100 * $MaliLoadStorePipeLSInstructionIssues / $MaliCoreCyclesTripipeCycles
100 * $MaliArithmeticPipeAInstructions / $MaliCoreCyclesTripipeCycles
100 * $MaliCoreCyclesTripipeCycles / $MaliJobManagerCyclesGPUCycles
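To see what these four expressions produce, the sketch below evaluates them in Python over a dictionary of raw counter values. The counter names are the ones on the slide; the chart labels, the helper names and the sample values are assumptions made for illustration, and nothing here is Streamline's own API.

```python
def percent(numerator, denominator):
    """Return 100 * numerator / denominator, or 0.0 for a zero denominator."""
    return 100.0 * numerator / denominator if denominator else 0.0


def derived_charts(c):
    """Evaluate the four custom-chart expressions from raw counter values.

    The dictionary keys of the result are illustrative labels only.
    """
    return {
        "ls_instructions_per_issue_pct": percent(
            c["MaliLoadStorePipeLSInstructions"],
            c["MaliLoadStorePipeLSInstructionIssues"]),
        "ls_issues_per_tripipe_cycle_pct": percent(
            c["MaliLoadStorePipeLSInstructionIssues"],
            c["MaliCoreCyclesTripipeCycles"]),
        "arith_instructions_per_tripipe_cycle_pct": percent(
            c["MaliArithmeticPipeAInstructions"],
            c["MaliCoreCyclesTripipeCycles"]),
        "tripipe_cycles_per_gpu_cycle_pct": percent(
            c["MaliCoreCyclesTripipeCycles"],
            c["MaliJobManagerCyclesGPUCycles"]),
    }


# Hypothetical counter values for one sampling interval.
sample = {
    "MaliLoadStorePipeLSInstructions": 800,
    "MaliLoadStorePipeLSInstructionIssues": 1000,
    "MaliArithmeticPipeAInstructions": 500,
    "MaliCoreCyclesTripipeCycles": 2000,
    "MaliJobManagerCyclesGPUCycles": 2500,
}
for name, value in derived_charts(sample).items():
    print(f"{name}: {value:.1f}%")
```

In this made-up sample the load/store ratio per tripipe cycle is twice the arithmetic one, which would point at the load/store pipe as the one to look at first.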
ALU Bound kernel
One load
One store
“n” ALU operations
__kernel void kernel_alu_bound(global float *arr, uint n)
{
    float value = arr[get_global_id(0)];   // one load
    for (uint i = 0; i < n; i++) {
        value += sin(value);               // n arithmetic operations
    }
    arr[get_global_id(0)] = value;         // one store
}
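When experimenting with a kernel like this, a host-side reference helps confirm that the GPU output is still correct after each optimisation step. The Python sketch below mirrors the kernel's loop; the function names and the tolerance are assumptions, and because Python uses double precision while the kernel uses float, results should be compared with a tolerance rather than for equality.

```python
import math


def alu_bound_reference(arr, n):
    """Host-side model of kernel_alu_bound; returns a new list."""
    out = []
    for value in arr:
        for _ in range(n):
            value += math.sin(value)   # same update as the kernel loop
        out.append(value)
    return out


def close_enough(gpu, ref, tol=1e-3):
    """Compare GPU results with the reference using a relative tolerance."""
    return len(gpu) == len(ref) and all(
        abs(g - r) <= tol * max(1.0, abs(r)) for g, r in zip(gpu, ref))


# Hypothetical input buffer; on a real run, `gpu` would be read back
# from the device after the kernel has executed.
data = [0.0, 0.5, 1.0, 2.0]
reference = alu_bound_reference(data, 8)
print(reference)
```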
L/S Bound kernel
One load
One store
No ALU operation
__kernel void kernel_memcpy(global float *a, global float *b)
{
    // Each work-item copies its own float4; offsets are in units of 4 floats
    float4 v = vload4(get_global_id(0), a);
    vstore4(v, get_global_id(0), b);
}
Cache misses
One byte read every 64 bytes
One byte written every 64 bytes
Really bad cache utilisation!
__kernel void kernel_cache_misses(global uchar *a, global uchar *b)
{
    // Neighbouring work-items touch addresses 64 bytes apart
    b[64 * get_global_id(0)] = a[64 * get_global_id(0)];
}
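How bad "really bad" is can be estimated by counting which cache lines a strided access pattern touches. The sketch below does this in Python under a stated assumption: a 64-byte cache line (the slide does not give the line size) and buffers aligned to a line boundary. The function name and parameters are illustrative.

```python
def line_utilisation(n_items, stride, access=1, line=64):
    """Fraction of fetched cache-line bytes that the work-items use.

    n_items: number of work-items
    stride:  distance in bytes between neighbouring work-items' accesses
    access:  bytes read by each work-item
    line:    assumed cache line size in bytes
    """
    touched = set()
    for gid in range(n_items):
        start = gid * stride
        for addr in range(start, start + access):
            touched.add(addr // line)      # index of the line holding addr
    return (n_items * access) / (len(touched) * line)


# Stride of 64 bytes, as in kernel_cache_misses, versus a contiguous copy.
print(line_utilisation(1024, stride=64))
print(line_utilisation(1024, stride=1))
```

Under that assumption the strided kernel uses one byte of every line it pulls in, while a contiguous byte copy uses all of them.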
What does good whole-system optimisation look like?
Conclusions
Tomorrow at the EVA Summit, 4pm:
“Understanding the Role of Integrated GPUs in Vision Applications”, Roberto Mijat
Computer vision applications need careful optimisation
Understanding your system as a whole is a vital first step
Understanding each individual processor core type is the next
Use tools to measure hardware counters across the entire platform
Whole-system views of the relative performance of heterogeneous architectures are invaluable
  They let you decide where there is capacity to move workloads
  and how to target optimisations by exposing the limiting component within individual cores
Ideally, use these tools throughout the development process, not just at the end
The Mali Ecosystem is making GPU Compute a reality today
ARM enables developers with platforms, drivers, tools and support
Industry leaders take advantage of ARM Mali GPU capabilities to innovate and deliver
Be one of them!
Ecosystem Resources
www.malideveloper.com
Download guides, papers, tools (including DS-5 Streamline), etc.
http://community.arm.com/welcome
Community forums, blogs and more
Graphics and GPU Compute developer support
http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/
A range of video and written tutorials for GPU Compute
http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/
ARM Mali-T600 series GPU user-space binary drivers available for download
Linaro BSP now available with Mali-T600 series GPU support