Measuring the Whole System
Holistic Profiling of CPU and GPU for Optimal Vision Applications on ARM Platforms

Tim Hartley

COPYRIGHT © 2015 ARM

2016-12-13

The Evolution of Mobile GPU Compute

Timeline: 2007, 2009, 2010, 2012, 2013, 2014, 2015

APIs:
  OpenGL® ES 1.1: fixed pipeline
  OpenGL ES 2.0: programmable pipeline
  OpenCL™ Full Profile / RenderScript: portable heterogeneous parallel computation
  OpenGL ES 3.1 Compute Shaders: GPU compute within the OpenGL ES API

GPUs:
  ARM® Mali™-55 GPU
  Mali-200
  Mali-300
  Mali-400 MP
  Mali-450 MP
  Mali-T600 Series
  Mali-T700 & T800 Series

Measuring the Whole System

Computer vision will, for some time, succeed in using every drop of processing power we give it
  And techniques in computer vision are still evolving rapidly
  New, complex, sustained low-power use cases

Building computer vision applications is an ever more complex process
  The availability of more processors and processor types makes this even more so

Capturing and analyzing accurate and effective measurements from platforms plays a vital role in achieving optimal performance

Modern Computer Vision Applications

[Diagram: a vision application spanning a DSP, a GPU, and four CPU cores with NEON]

Inside an ARM Mali Midgard Core

SIMD: several components per operation
  128-bit registers

VLIW: several operations per instruction word
  Some operations are “free”

Built-in function library
  Accelerated in hardware

$T = \max(A_0, A_1, LS, Tex)$
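
The slide's relation T = max(A0, A1, LS, Tex) can be read as saying that, because the pipes of a core run in parallel, the busiest pipe sets the overall time. The following is a minimal sketch of that reading, assuming A0 and A1 denote two arithmetic pipes, LS the load/store pipe and Tex the texture pipe. The function name and all cycle counts are hypothetical and are not measured data.

```python
def limiting_pipe(cycles):
    """Return (name, cycles) for the busiest pipe.

    `cycles` maps a pipe name to the number of cycles it was busy.
    The pipes are assumed to run in parallel, so the busiest one
    bounds the total: T = max(A0, A1, LS, Tex).
    """
    name = max(cycles, key=cycles.get)
    return name, cycles[name]


# Hypothetical per-pipe cycle counts for one kernel (illustrative only).
sample = {"A0": 1200, "A1": 1100, "LS": 2500, "Tex": 300}
pipe, t = limiting_pipe(sample)
print(f"T = {t} cycles, limited by the {pipe} pipe")
```

With these made-up numbers the load/store pipe is the limiting one, so reducing arithmetic work alone would not be expected to shorten the kernel.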

Hardware Counters

Counters per core
  Active cycles
  Pipe activity
  L1 cache

Counters for the GPU
  Active cycles
  L2 caches
  MMU

Accessed through DS-5 Streamline
  Timeline of all hardware counters, and more
  Explore the execution of the full application
  Zoom in on details

DS-5 Streamline

Identify hotspots and system bottlenecks at a glance
  Select from CPU/GPU counters
  OS level and custom data sources
  Accumulate counters, measure time and find instant hotspots
  Select one or more tasks to isolate their contribution
  Combined task switching trace and sample-based profile

Example: Complex Computer Vision Application

Lane and Car Detection

Streamline

Streamline: OpenCL Timeline

Optimisation Overview

[Flowchart; Yes/No edge labels omitted]

Limited by kernel execution time or mem management?
  memory:
    Ensure you are not copying memory unnecessarily
  kernel:
    Limited by Arith ops or Mem ops?
      Arithmetic:
        Checks:
          Limited to 64 threads?
          Large no. of register bank conflicts?
          Large no. of instruction cache misses?
        Remedies:
          Reduce register pressure.
          Simplify or shorten kernels
          Vectorise the kernel if possible.
          Decrease the arith work if possible.
      Mem ops:
        Checks:
          High number of instruction re-issues?
          Limited to 64 threads?
          Large no. of instruction cache misses?
        Remedies:
          Reduce register pressure.
          Simplify or shorten kernels
          Vectorise the LS operations if possible.
          Decrease mem accesses if possible.
          Improve memory access pattern to improve cache efficiency

Limited by same factors?
  Reiterate
  Done optimising

Deriving Meaning from Hardware Counters

Counters on their own usually don’t mean a huge amount

Combining counters is more useful
  Comparing values to determine limiting pipes
  Calculating more meaningful values from multiple values

New graph traces can be added from these counters
  …and become an integral part of the timeline

Custom Charts: Bringing Counters Together

100 * $MaliLoadStorePipeLSInstructions / $MaliLoadStorePipeLSInstructionIssues

100 * $MaliLoadStorePipeLSInstructionIssues / $MaliCoreCyclesTripipeCycles

100 * $MaliArithmeticPipeAInstructions / $MaliCoreCyclesTripipeCycles

100 * $MaliCoreCyclesTripipeCycles / $MaliJobManagerCyclesGPUCycles
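
To make the four expressions concrete, the sketch below evaluates them in Python over one set of counter values. It assumes the counters were accumulated over the same time window. The dictionary keys for the results are hypothetical labels chosen here for illustration, and the sample counter values are invented, not taken from a real capture.

```python
def derived_charts(c):
    """Evaluate the four custom-chart expressions as percentages.

    `c` maps counter names (without the leading `$`) to values
    accumulated over the same time window.
    """
    return {
        # LS instructions relative to LS issues.
        "ls_instructions_per_issue_pct":
            100 * c["MaliLoadStorePipeLSInstructions"]
            / c["MaliLoadStorePipeLSInstructionIssues"],
        # LS issues relative to tripipe cycles.
        "ls_pipe_load_pct":
            100 * c["MaliLoadStorePipeLSInstructionIssues"]
            / c["MaliCoreCyclesTripipeCycles"],
        # Arithmetic instructions relative to tripipe cycles.
        "arith_pipe_load_pct":
            100 * c["MaliArithmeticPipeAInstructions"]
            / c["MaliCoreCyclesTripipeCycles"],
        # Tripipe cycles relative to GPU cycles.
        "tripipe_active_pct":
            100 * c["MaliCoreCyclesTripipeCycles"]
            / c["MaliJobManagerCyclesGPUCycles"],
    }


# Invented counter values, for illustration only.
counters = {
    "MaliLoadStorePipeLSInstructions": 300_000,
    "MaliLoadStorePipeLSInstructionIssues": 600_000,
    "MaliArithmeticPipeAInstructions": 400_000,
    "MaliCoreCyclesTripipeCycles": 800_000,
    "MaliJobManagerCyclesGPUCycles": 1_000_000,
}
for label, value in derived_charts(counters).items():
    print(f"{label}: {value:.1f}%")
```

Comparing the load/store and arithmetic percentages side by side is one way to see which pipe is closer to being the limiting one.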

ALU Bound kernel

One load
One store
“n” ALU operations

    __kernel void kernel_alu_bound(global float* arr, uint n)
    {
        float value = arr[get_global_id(0)];   /* one load */
        for (uint i = 0; i < n; i++) {
            value += sin(value);               /* "n" ALU operations */
        }
        arr[get_global_id(0)] = value;         /* one store */
    }
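
When experimenting with a kernel like this, a host-side reference can help check device results. The following is a minimal sketch in Python, not part of the original deck. The function names are hypothetical, and Python computes in double precision, so results will differ slightly from the kernel's single-precision floats.

```python
import math


def alu_bound_reference(arr, n):
    """Host-side reference for kernel_alu_bound.

    For each element: one read, n sin-and-add steps, one write.
    """
    out = []
    for value in arr:
        for _ in range(n):
            value += math.sin(value)
        out.append(value)
    return out


def iterations_per_byte(n, bytes_per_float=4):
    """Loop iterations per byte of memory traffic for one work-item.

    Each work-item moves one float in and one float out.
    """
    return n / (2 * bytes_per_float)


print(alu_bound_reference([0.5, 1.0, 2.0], 3))
print(iterations_per_byte(64))
```

Raising `n` increases arithmetic work while memory traffic stays fixed, which is what pushes the kernel towards being limited by the arithmetic pipes.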

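L/S Bound kernel

One load
One store
No ALU operation

    __kernel void kernel_memcpy(global float *a, global float *b)
    {
        /* Each work-item copies the float4 at its own global ID. */
        float4 v = vload4(get_global_id(0), a);   /* one load */
        vstore4(v, get_global_id(0), b);          /* one store */
    }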
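
The indexing of `vload4` and `vstore4` can be illustrated on the host. In OpenCL C, `vload4(offset, p)` reads the four elements starting at `p + 4 * offset`, and `vstore4` writes to the same position. The Python sketch below emulates that, assuming each work-item loads and stores the float4 at its own global ID so that the kernel behaves as a copy. The helper names mirror the OpenCL built-ins but are plain Python stand-ins.

```python
def vload4(offset, p):
    """Emulate OpenCL vload4: read 4 elements starting at p[4 * offset]."""
    return tuple(p[4 * offset: 4 * offset + 4])


def vstore4(v, offset, p):
    """Emulate OpenCL vstore4: write 4 elements starting at p[4 * offset]."""
    p[4 * offset: 4 * offset + 4] = v


def memcpy_reference(a, global_size):
    """Run the copy for work-items 0..global_size-1 and return the result."""
    b = [0.0] * len(a)
    for gid in range(global_size):
        vstore4(vload4(gid, a), gid, b)
    return b


src = [float(i) for i in range(16)]
# Each work-item moves 4 floats, so 16 floats need 4 work-items.
print(memcpy_reference(src, len(src) // 4) == src)
```

Under this assumption the global work size is the number of floats divided by four.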

Cache misses

One byte read every 64 bytes
One byte written every 64 bytes
Really bad cache utilisation!

    __kernel void kernel_cache_misses(global uchar *a, global uchar *b)
    {
        /* Each work-item touches one byte at a 64-byte stride. */
        b[64 * get_global_id(0)] = a[64 * get_global_id(0)];
    }
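
A rough way to quantify "really bad cache utilisation" is the fraction of each fetched cache line that the kernel actually uses. The sketch below is a simplified model, assuming a 64-byte cache line (an assumption, not stated on the slide), aligned accesses, and an element no larger than the stride or the line. The function name is hypothetical.

```python
def cache_line_utilisation(element_bytes, stride_bytes, line_bytes=64):
    """Fraction of fetched bytes that a strided access pattern uses.

    Simplified model: if the stride is at least one cache line, every
    access fetches a whole line and uses `element_bytes` of it; if the
    stride is smaller, `element_bytes` out of every `stride_bytes` are
    used. The 64-byte default line size is an assumption.
    """
    return element_bytes / min(stride_bytes, line_bytes)


strided = cache_line_utilisation(element_bytes=1, stride_bytes=64)
dense = cache_line_utilisation(element_bytes=1, stride_bytes=1)
print(f"stride 64: {100 * strided:.2f}% of each line used")
print(f"stride 1:  {100 * dense:.2f}% of each line used")
```

Under this model the strided kernel uses 1 byte in every 64 fetched, while a dense byte copy would use every byte.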

What does good whole-system optimisation look like?

Conclusions

Tomorrow at the EVA Summit, 4pm:

“Understanding the Role of Integrated GPUs in Vision Applications”, Roberto Mijat

Computer vision applications need careful optimisation

Understanding your system as a whole is a vital first step

Understanding each individual processor core type is the next

Use tools to measure hardware counters across the entire platform

Whole-system views of the relative performance of heterogeneous architectures are invaluable

Allows you to decide where there is capacity to move workloads

And how to target optimisations by exposing the limiting component within individual cores

Ideally, use these tools throughout the development process, not just at the end

The Mali Ecosystem is making GPU Compute a reality today

ARM enables developers with platforms, drivers, tools and support

Industry leaders take advantage of ARM Mali GPU capabilities to innovate and deliver

Be one of them!

Ecosystem Resources

www.malideveloper.com

Download guides, papers, tools (including DS-5 Streamline), etc.

http://community.arm.com/welcome

Community forums, blogs and more

[email protected]

Graphics and GPU Compute developer support

http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/

A range of video and written tutorials for GPU Compute

http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/

ARM Mali-T600 series GPU user-space binary drivers available for download

Linaro BSP now available with Mali-T600 series GPU support
