Application Performance Analysis of the Cortex-A9 MPCore

Bryan Lawrence

This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (FP7) for research and technological development


Agenda

Motivation

Experimentation platforms

Performance exploration of different application classes

Performance evaluation of multiple concurrent applications

Summary and conclusion


Phone ++ Upcoming Use Cases

Mobile Internet Browsing

Video conferencing

Gaming on the Go

Multi-player over 3G / 4G network

3D Navigation


Mobile Phone Applications

Compute Intensive


Tablet Applications

Compute Intensive


Achieving Scalable Performance

Clock frequency of the processor is not the only metric of performance

Scalable, energy-efficient performance is required across devices, from mobile phones and tablets to large enterprise computing

Can multicore processors provide a solution?


Hardware Platforms

Versatile Express

ARM-NEC Cortex™-A9 processor test-chip, ~400MHz

Cortex-A9 x 4

4x NEON™/FPU

32KB individual I & D L1 caches

512KB L2 cache

1GB RAM (32b DDR2)

Early Partner Silicon

Cortex-A9 x 2 @ 1GHz

1GB RAM


Video Decode / Encode

Hardware encoders/decoders are common in consumer devices

Video/audio codec standards evolve rapidly

Many codecs are used too infrequently to justify dedicated hardware

Consumer applications involve other video processing

Different from encode / decode (e.g. video editing)

Simultaneous encode / decode is required for video conferencing


H.264 Decode / Encode

FFmpeg used for decode

x264 library used with FFmpeg for video encode

CIF & VGA resolutions

Commonly used in video conferencing

Movie trailers used

Order of computation is higher than for video conferencing streams

Compression factor of 100 - 200


H.264 Decode / Encode

Results for single core operation

Normalized logarithmic scales used

Encode is more compute intensive than decode (at least ~2-3 times)

Writing out decoded streams to secondary storage media is limited by media bandwidth


H.264 Decode / Encode

Concurrent video decode + encode

Important use case for video conferencing

Excellent scalability is observed up to all 4 cores

Encoding is at least 2-3 times more compute intensive than decoding

Ideally, more resources should be dedicated to encode
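
One way to read "more resources should be dedicated to encode" is a proportional split of cores by relative compute cost. The sketch below is illustrative only: `split_cores` is a hypothetical helper, not part of FFmpeg or x264, and in practice the OS scheduler and each codec's own threading decide where work runs.

```python
def split_cores(total_cores, encode_cost, decode_cost=1.0):
    """Split cores between encode and decode in proportion to relative cost.

    Both tasks always keep at least one core, so total_cores must be >= 2.
    Returns (encode_cores, decode_cores).
    """
    if total_cores < 2:
        raise ValueError("need at least two cores for concurrent encode + decode")
    share = encode_cost / (encode_cost + decode_cost)
    encode_cores = round(total_cores * share)
    # Clamp so that neither task is starved of a core.
    encode_cores = max(1, min(total_cores - 1, encode_cores))
    return encode_cores, total_cores - encode_cores


# Encode assumed 3x as expensive as decode, on a quad-core platform.
print(split_cores(4, 3.0))
```

With encode costed at 3x decode on four cores, the split gives three cores to encode and one to decode.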


On2/Google VP8

libvpx library (from the WebM project) used for decoding VP8

libvpx uses multi-threading and actively takes advantage of the parallelism available in the VP8 codec

Comparative results obtained on the Versatile Express and 1GHz dual-core platforms


On2/Google VP8

Shows good scalability with the number of cores

Scalability is relatively independent of the number of partitions in the video frame

Saturation is observed when the no. of threads > no. of cores

Designers can query the platform to fetch the no. of cores and determine the available parallelism

Charts: 1GHz dual-core, Versatile Express
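
Querying the platform for the number of cores can be sketched as follows. This is a minimal illustration using Python's standard `os.cpu_count()`; the helper name `decoder_thread_count` is an assumption made here and is not a libvpx API.

```python
import os


def decoder_thread_count(requested):
    """Cap a requested thread count at the number of cores the OS reports."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, cores))


# Asking for 8 threads yields at most one thread per reported core.
print(decoder_thread_count(8))
```

Capping at the core count matches the saturation seen once threads outnumber cores. Note that `os.cpu_count()` reports the CPUs in the system, which may be more than the process is allowed to use.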


Compilation - ffmpeg

Code compilation has inherent parallelism in terms of modules

Most build systems allow this parallelism to be exploited

E.g. make -j 4

Compilation of FFmpeg and Linux Kernel shown here

Charts: 1GHz dual-core, Versatile Express
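
The `make -j` job count can be derived from the core count rather than hard-coded. A small sketch, assuming a source tree with a Makefile; `parallel_make_command` is a hypothetical helper name.

```python
import os


def parallel_make_command(jobs=None):
    """Build a `make` command line that runs one job per reported core."""
    if jobs is None:
        jobs = os.cpu_count() or 1  # fall back to a serial build
    return ["make", "-j", str(jobs)]


# To run the build from a source tree:
#   import subprocess
#   subprocess.run(parallel_make_command(), check=True)
print(parallel_make_command(4))
```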


Compilation – Linux Kernel

Charts: 1GHz dual-core, Versatile Express

Almost linear speed-up is observed with the no. of cores in both cases

Effectively doubles (quadruples) the utilized memory bandwidth for 2 cores (4 cores)


Browsers

Browser benchmark using a collection of web pages similar to the mix found in common browsing

Speed-up of 1.54 times observed between single and dual core execution

The ‘webcore’ fraction of the pie grows for multicore execution

Charts: normalized performance (1.54x), execution time decomposition
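
As a rough interpretation of the 1.54x figure, Amdahl's law can be inverted to estimate how much of the browser workload ran in parallel. Amdahl's law is not used in the slides; it is an assumed model here, and it ignores cache and memory effects.

```python
def amdahl_parallel_fraction(speedup, cores):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / n) to solve for p."""
    if cores < 2 or not (1.0 <= speedup <= cores):
        raise ValueError("expects cores >= 2 and 1 <= speedup <= cores")
    return (1.0 - 1.0 / speedup) * cores / (cores - 1)


# Browser benchmark: 1.54x on 2 cores.
print(f"{amdahl_parallel_fraction(1.54, 2):.2f}")
```

Under this model, roughly 70% of the single-core execution time would be parallelizable.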


Multiple Concurrent Applications

Multitasking is becoming mainstream in mobile devices today

Common combinations include

Browser + Audio playback

E.g. Internet Radio

Browser + background download

Independent applications can benefit immensely from parallelization


Browser + Pandora Internet Radio

Speed-up factor of 1.9

Super-linear speed-up can sometimes be observed due to reduced cache pollution from conflicting applications

The speed-up can be traded for energy by slowing the cores down (depends on the fabrication process technology used)

Charts: normalized performance (1.9x), execution time decomposition
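
Trading speed-up for a lower clock can be illustrated with simple arithmetic. The sketch assumes performance scales linearly with clock frequency, which is a simplification (memory-bound work scales less), and uses 1000 MHz only as an illustrative base taken from the 1GHz dual-core platform. It makes no claim about the resulting energy saving, which depends on the process technology.

```python
def iso_performance_frequency(base_mhz, speedup):
    """Clock at which the multicore run would match single-core throughput,
    assuming performance is proportional to frequency (a simplification)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return base_mhz / speedup


# Browser + Pandora: 1.9x on two cores, illustrative 1000 MHz base clock.
print(f"{iso_performance_frequency(1000.0, 1.9):.0f} MHz")
```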


Browser + Internet File Download

Speed-up factor of 1.64x

Common use case involves downloading an app from an application store or marketplace while browsing the internet

Email synchronization in the background also forms a similar use case

Charts: normalized performance (1.64x), execution time decomposition


Cortex-A9 MP Benefits – Performance

Normalized performance    1 Core    2 Core
Browser (single app)      1         1.54


Cortex-A9 MP Benefits – Richer Experience

Normalized performance    1 Core    2 Core
Browser (single app)      1         1.54
Browser + Pandora         0.78      1.50
Browser + Download        0.73      1.20


Cortex-A9 MP Benefits – Richer Experience

Normalized performance    1 Core    2 Core
Browser (single app)      1         1.54
Browser + Pandora         0.78      1.50
Browser + Download        0.73      1.20

Speed-up from 1 to 2 cores: Browser + Pandora 1.9x, Browser + Download 1.64x
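
The quoted speed-up factors follow directly from the normalized scores on this slide. The snippet below recomputes them as the ratio of the 2-core to the 1-core value; the dictionary layout and function name are illustrative.

```python
# Normalized performance from the slide: (1 core, 2 cores).
results = {
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
    "Browser + Download": (0.73, 1.20),
}


def speedups(table):
    """Return the 2-core / 1-core ratio for each workload."""
    return {name: two / one for name, (one, two) in table.items()}


for name, factor in speedups(results).items():
    print(f"{name}: {factor:.2f}x")
```

The ratios come out at about 1.54, 1.92 and 1.64, which the slides round to 1.54x, 1.9x and 1.64x.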


Summary and Conclusion

This presentation demonstrates the scalability of the ARM Cortex-A9 MPCore™ processor across various classes of applications, on currently available software

Better power/performance can be achieved using an efficient low-power ARM multicore processor than with a single processor at a much higher frequency

Next-generation software will make more intensive use of threads, and scalability will improve further


Thank You

Please visit www.arm.com for ARM related technical details

For any queries contact < [email protected] >