1
Application Performance Analysis
of the Cortex-A9 MPCore
Bryan Lawrence
This project at ARM is funded
in part by ICT-eMuCo, a
European project supported
under the Seventh Framework
Programme (7FP) for research
and technological development
2
Agenda
Motivation
Experimentation platforms
Performance exploration of different application classes
Performance evaluation of multiple concurrent applications
Summary and conclusion
3
Phone ++ Upcoming Use Cases
Mobile Internet Browsing
Video conferencing
Gaming on the Go
Multi-player over 3G / 4G
Network
3D Navigation
4
Mobile Phone Applications
Compute
Intensive
5
Tablet Applications
Compute
Intensive
6
Achieving Scalable Performance
Clock frequency of the processor is not the only metric of
performance
Scalable, energy-efficient performance is required from mobile
devices (phones, tablets) to large enterprise computing
Can multicore processors provide a potential solution?
7
Hardware Platforms
Versatile Express
ARM-NEC Cortex™-A9 processor
test-chip ~400MHz
Cortex-A9 x 4
4x NEON™/FPU
32KB individual I&D L1 caches
512KB L2 cache
1GB RAM (32b DDR2)
Early Partner Silicon
Cortex-A9 x 2 @ 1GHz
1GB RAM
8
Video Decode / Encode
Hardware encoders/decoders are common in consumer devices
Video/audio codec standards evolve rapidly
Many codecs are used too infrequently to justify h/w
Consumer applications involve other video processing
Different from encode / decode (E.g. video editing)
Simultaneous encode / decode required for video
conferencing
9
H.264 Decode / Encode
FFmpeg used for decode
x264 library used with FFmpeg for video encode
CIF & VGA resolutions
Commonly used in video conf.
Movie trailers used
Order of magnitude more computation than video conf. streams
Compression factor of 100 - 200
10
H.264 Decode / Encode
Results for single core operation
Normalized logarithmic scales used
Encode is more compute intensive than decode (at least ~2-3 times)
Writing out decoded streams
to secondary storage media
limited by media bandwidth
11
H.264 Decode / Encode
Concurrent video decode + encode
Important use case for video conferencing
Excellent scalability is observed up to all 4 cores
Encoding is at least
2-3 times more compute
intensive than decode
Ideally more resources
should be dedicated to
encode
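One way to act on the point above is to split the available cores between the encoder and decoder in proportion to their relative cost. The sketch below is illustrative only: the function name `split_cores` and the default cost ratio of 3 are assumptions (the ratio is taken from the "2-3 times" figure on this slide), and it is not the allocation used in the experiments.

```python
def split_cores(total_cores, encode_to_decode_cost=3.0):
    """Split cores between encode and decode in proportion to their cost.

    encode_to_decode_cost is an assumed ratio (encode cost / decode cost).
    Returns (encode_cores, decode_cores); each task gets at least one core.
    """
    if total_cores < 2:
        raise ValueError("need at least two cores to dedicate one to each task")
    # Fraction of the total work that belongs to the encoder.
    encode_share = encode_to_decode_cost / (encode_to_decode_cost + 1.0)
    encode_cores = round(total_cores * encode_share)
    # Clamp so that both encoder and decoder keep at least one core.
    encode_cores = max(1, min(total_cores - 1, encode_cores))
    return encode_cores, total_cores - encode_cores


if __name__ == "__main__":
    print(split_cores(4))        # quad-core platform, encode assumed 3x decode
    print(split_cores(2, 2.0))   # dual-core platform, encode assumed 2x decode
```

On a quad-core platform with a 3:1 cost ratio this gives three cores to encode and one to decode. In practice the OS scheduler and the codecs' own threading would also influence the outcome.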
12
On2/Google VP8
Libvpx library used for decoding VP8 (from WebM project)
Libvpx uses multi-threading and actively takes advantage of
parallelizability available in the VP8 codec.
Comparative results obtained on Versatile Express and 1GHz
dual core platforms
13
On2/Google VP8
Shows good scalability with the
number of cores.
Scalability is relatively independent
of the number of partitions in the
video frame
Saturation is observed for no. of
threads > no. of cores
Designers can query the platform
to fetch the no. of cores and
determine the available parallelism
1GHz dual-core
Versatile Express
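Since the results saturate once the number of threads exceeds the number of cores, an application can cap its thread count at the core count reported by the platform. A minimal sketch in Python follows; `pick_thread_count` is a hypothetical helper and not part of libvpx, which has its own thread-count setting.

```python
import os


def pick_thread_count(requested, available_cores=None):
    """Cap a requested worker-thread count at the number of cores."""
    if available_cores is None:
        # os.cpu_count() can return None when the count is undetermined,
        # so fall back to a single core in that case.
        available_cores = os.cpu_count() or 1
    return max(1, min(requested, available_cores))


if __name__ == "__main__":
    # Ask for 8 decoder threads; the result depends on the host machine.
    print(pick_thread_count(8))
```

On the quad-core Versatile Express a request for 8 threads would be capped at 4, and on the dual-core platform at 2.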
14
Compilation - ffmpeg
Code compilation has inherent
parallelism in terms of modules
Most build systems allow this
parallelism to be exploited
E.g. make -j 4
Compilation of FFmpeg and
Linux Kernel shown here
1GHz dual-core
Versatile Express
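The job count passed to make can be derived from the core count instead of being hard-coded. The sketch below only builds the command line. `make_command` is a hypothetical helper, and matching jobs to cores one-to-one is an assumption, not a tuned setting from these experiments.

```python
import os


def make_command(target=None, jobs=None):
    """Build a 'make -j N' command line, defaulting N to the core count."""
    if jobs is None:
        # Fall back to one job if the core count cannot be determined.
        jobs = os.cpu_count() or 1
    cmd = ["make", "-j", str(jobs)]
    if target:
        cmd.append(target)
    return cmd


if __name__ == "__main__":
    print(" ".join(make_command(jobs=4)))  # make -j 4
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the build directory.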
15
Compilation – Linux Kernel
1GHz dual-core
Versatile Express
Almost linear speed-up is observed
with no. of cores for both cases
Effectively doubles (quadruples)
the utilized memory bandwidth
for 2 cores (4 cores)
16
Browsers
Browser benchmark using collection of web-pages
similar to the mix found in common browsing
Speed-up of 1.54 times observed between single and
dual core execution
The ‘webcore’ fraction of the pie grows for multicore
execution
[Charts: normalized performance (1.54x on 2 cores); execution time decomposition]
17
Multiple Concurrent Applications
Multitasking is becoming mainstream
in mobile devices today
Common combinations include
Browser + Audio playback
E.g. Internet Radio
Browser + background download
Independent applications can
benefit immensely from
parallelization
18
Browser + Pandora Internet Radio
Speed-up factor of 1.9x
Super-linear speed-up can
sometimes be observed
due to reduced cache
pollution from conflicting
applications
The speed-up can be
traded for energy by
slowing the cores down
(depends on the
fabrication process
technology used)
[Charts: normalized performance (1.9x on 2 cores); execution time decomposition]
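The trade of speed-up for energy described on this slide can be illustrated with a first-order model. The sketch assumes that dynamic power per core scales as C·V²·f, that the measured speed-up still holds at the lower clock, and that static (leakage) power is ignored. The `exponent` parameter is an assumption that stands in for the process dependence: 3 if voltage can be lowered in proportion to frequency, 1 if voltage stays fixed. None of these numbers come from the measurements in this presentation.

```python
def relative_dynamic_power(cores, speedup, exponent=3.0):
    """Dynamic power of `cores` slowed-down cores relative to one full-speed core.

    The cores are slowed by 1/speedup so that overall performance matches
    a single core at full frequency. Per-core power is modelled as
    (frequency scale) ** exponent, which is an idealised assumption.
    """
    if cores < 1 or speedup <= 0:
        raise ValueError("cores must be >= 1 and speedup must be positive")
    freq_scale = 1.0 / speedup
    return cores * freq_scale ** exponent


if __name__ == "__main__":
    # Two cores, 1.9x speed-up, voltage assumed to scale with frequency.
    print(round(relative_dynamic_power(2, 1.9), 2))
    # Same case with voltage held fixed (frequency scaling only).
    print(round(relative_dynamic_power(2, 1.9, exponent=1.0), 2))
```

Under the voltage-scaling assumption the two slowed cores use roughly 0.29 of the single-core dynamic power. With fixed voltage the figure is about 1.05, so there is no saving, which is why the benefit depends on the process technology.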
19
Browser + Internet File Download
Speed-up factor of 1.64x
Common use case
involves downloading an
App from an application
store or market-place
while browsing the
internet
Email synchronization in
the background also forms
a similar use case
[Charts: normalized performance (1.64x on 2 cores); execution time decomposition]
20
Cortex-A9 MP Benefits – Performance
Normalized performance (1 core / 2 cores):
Browser (single app): 1 / 1.54
21
Cortex-A9 MP Benefits – Richer Experience
Normalized performance (1 core / 2 cores):
Browser (single app): 1 / 1.54
Browser + Pandora: 0.78 / 1.50
Browser + Download: 0.73 / 1.20
22
Cortex-A9 MP Benefits – Richer Experience
Normalized performance (1 core / 2 cores, speed-up):
Browser (single app): 1 / 1.54 (1.54x)
Browser + Pandora: 0.78 / 1.50 (1.9x)
Browser + Download: 0.73 / 1.20 (1.64x)
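The speed-up factors quoted on this slide follow directly from the normalized performance values: each is the 2-core figure divided by the 1-core figure for the same workload. A short check, using only the numbers from the chart (the dictionary layout and function name are illustrative):

```python
# Normalized performance from the chart: (1 core, 2 cores).
normalized = {
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
    "Browser + Download": (0.73, 1.20),
}


def speedup(one_core, two_core):
    """Speed-up of dual-core over single-core execution."""
    return two_core / one_core


if __name__ == "__main__":
    for name, (one, two) in normalized.items():
        print(f"{name}: {speedup(one, two):.2f}x")
```

This prints 1.54x, 1.92x and 1.64x, matching the 1.54x, 1.9x and 1.64x annotations.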
23
Summary and Conclusion
This presentation demonstrates the scalability of the ARM
Cortex-A9 MPCore™ processor across various classes of
applications, on currently available software
Better power/performance can be achieved using an efficient
low-power ARM multicore processor than with a single
processor at a much higher frequency
Next-generation software will make more intensive use of
threads, and scalability will improve further
24
Thank You
Please visit www.arm.com for ARM related technical details
For any queries contact < [email protected] >