Leo Borges Intel Software Conference 2014 Brazil May 2014
Methods and practices to analyze the performance of your application with Intel® VTune™ Amplifier XE
Leo Borges
Intel Software Conference 2014 Brazil
May 2014
Copyright © 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Agenda
• Intel® VTune Amplifier XE Intro
• Microarchitecture Review
• The Top-Down Characterization details
• Intel® VTune™ Amplifier XE Implementation
• Demo
*Sources for current presentation:
  - http://software.intel.com/en-us/articles/advanced-profiling-with-intel-vtune-amplifier-xe-part-1-find-the-bottleneck
Two Ways to Collect Data - Intel® VTune™ Amplifier XE
| Software Collector | Hardware Collector |
|---|---|
| Hotspots, Concurrency, Locks & Waits | Lightweight Hotspots, Advanced Analysis |
| Uses OS interrupts | Uses the on-chip Performance Monitoring Unit (PMU) |
| Collects from a single process tree | Collects system-wide or from a single process tree |
| ~10 ms default resolution | ~1 ms default resolution (finer granularity, finds small functions) |
| Collects on both Intel® and compatible processors | Requires a genuine Intel® processor for collection |
| Call stacks show calling sequence | New! Optionally collect call stacks |
| Works in virtual environments | Works in virtual environments only when supported by the VM (e.g., vSphere* 5.1) |
| No driver required | Requires a driver |

No special recompiles - C, C++, C#, Fortran, Java, Assembly
Microarchitecture basics
Fetch → Decode → Execute → Retire
• Classic 4-stage pipeline depicted here.
• Memory not shown.
• Pipeline on current processors capable of speculative and out-of-order execution.
Intuitive approach to EBS
• Use a small list of metrics to monitor level of optimization
• Example 1: Cycles per instruction (CPI)
• Example 2: Instruction retirement ratio
  - m instructions issued, n retired
  - Retirement ratio = n/m
  - % executed but not retired = (1 - n/m) * 100
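The two example metrics above can be sketched in a few lines of Python. This is a minimal illustration, not VTune's implementation: the function names and the counter values are hypothetical, and in practice the inputs would come from sampled hardware events.

```python
def cpi(cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction: lower generally means better pipeline use."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


def retirement_ratio(issued: int, retired: int) -> float:
    """n/m from the slide: the fraction of issued instructions that retire."""
    if issued <= 0 or not 0 <= retired <= issued:
        raise ValueError("need issued > 0 and 0 <= retired <= issued")
    return retired / issued


def percent_not_retired(issued: int, retired: int) -> float:
    """(1 - n/m) * 100: work that was executed but never retired."""
    return (1.0 - retirement_ratio(issued, retired)) * 100.0


# Hypothetical counter values, for illustration only.
print("CPI:", cpi(cycles=2_000_000, instructions_retired=1_600_000))
print("% not retired:", percent_not_retired(issued=1_000_000, retired=900_000))
```

A high CPI together with a high percentage of non-retired work would suggest that cycles are being spent on speculation that is later discarded.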
Microarchitecture Review
Fetch → Decode → Execute → Memory → Commit
The traditional 5-stage pipeline. Pipeline on current processors capable of out-of-order execution.
Microarchitecture Review
Fetch → Decode → Execute → Memory → Commit
Front-End
The front-end fetches instructions IN ORDER, decodes them into u-ops (micro-operations), and sends the u-ops to the back-end.
Back-End
The back-end receives u-ops, executes them OUT OF ORDER, accesses memory as needed, and commits results to memory IN ORDER.
Allocation
Allocation is the point where u-ops transfer from the front-end to the back-end. The front-end can allocate 4 u-ops per cycle.
Retirement
Retirement is the point where u-ops leave the back-end. The back-end can retire 4 u-ops per cycle.
And a New Term: the Pipeline Slot
4 potential allocations per cycle
4 potential retirements per cycle
In reality, there are many queues, buffers, and pieces of logic throughout the pipeline to allow up to 4 allocations and 4 retirements per cycle.
The “Pipeline Slot” is an abstraction representing all the resources needed to move one u-op through the pipeline.
There are 4 Pipeline Slots available every cycle.
[Figure: four pipeline slots per cycle, labeled S1 to S4]
Pipeline slots are filled with u-ops that travel from allocation to retirement over multiple cycles.
[Figure: slots S1 to S4 repeated across five consecutive cycles]
Cycles Per Instruction (CPI), a standard measure, has some special kinks
For multi-core processors, CPI can get as low as 0.25 cycles per instruction on current Intel processors.
Normally, a CPI below ~1.0 is targeted for better performance.
Some would suggest targeting a CPI of around 0.75 to 0.50.
But is this correct for every architecture?
• Threads on each Intel® Xeon Phi™ core share a clock
  - If all 4 HW threads are active, each gets ¼ of the total cycles
• Multi-stage instruction decode requires two threads to utilize the whole core; one thread only gets half
• With two ops per cycle (U-V-pipe dual issue), the best CPI per core is shown below
• To get thread CPI, multiply by the active threads

| Threads per Core | Best CPI per Core | Best CPI per Thread |
|---|---|---|
| 1 | 1.0 | 1 x 1.0 = 1.0 |
| 2 | 0.5 | 2 x 0.5 = 1.0 |
| 3 | 0.5 | 3 x 0.5 = 1.5 |
| 4 | 0.5 | 4 x 0.5 = 2.0 |
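The per-thread arithmetic in the table can be reproduced with a short sketch. The function names are hypothetical, and the best per-core values (1.0 for one thread, 0.5 for two or more) are taken directly from the table rather than derived from hardware details.

```python
def best_core_cpi(active_threads: int) -> float:
    """Best per-core CPI from the table: one thread can only use half the core."""
    if not 1 <= active_threads <= 4:
        raise ValueError("a core runs 1 to 4 hardware threads")
    return 1.0 if active_threads == 1 else 0.5


def thread_cpi(core_cpi: float, active_threads: int) -> float:
    """Per-thread CPI: threads share the core clock, so multiply by thread count."""
    return core_cpi * active_threads


for threads in range(1, 5):
    core = best_core_cpi(threads)
    print(threads, core, thread_cpi(core, threads))
```

Read this way, a per-thread CPI of 2.0 with four active threads is the best case on this core, not a sign of trouble, which is why a single CPI target does not transfer across architectures.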
The Top-Down Characterization
What is it?
The Top-Down Characterization is:
• A new way to organize and use processor events to identify the real hardware bottlenecks in systems/applications
• Based on PMU events specifically designed for this task
• Integrated into Intel® VTune Amplifier XE for Core
• Available on Intel® Microarchitecture code named Sandy Bridge and newer
The Top-Down Characterization
Each pipeline slot on each cycle is classified into 1 of 4 categories.
For each slot on each cycle:
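The flowchart that accompanied this slide did not survive extraction. As a hedged sketch, the decision procedure commonly described in Intel's public top-down tuning material can be written as follows; the category names, the enum, and the three boolean inputs are assumptions for illustration, not VTune API.

```python
from enum import Enum


class SlotCategory(Enum):
    RETIRING = "Retiring"
    BAD_SPECULATION = "Bad Speculation"
    BACK_END_BOUND = "Back-End Bound"
    FRONT_END_BOUND = "Front-End Bound"


def classify_slot(uop_allocated: bool, uop_retires: bool,
                  back_end_stalled: bool) -> SlotCategory:
    """Classify one pipeline slot in one cycle (assumed decision order)."""
    if uop_allocated:
        # A u-op entered the back-end: it is either useful or wasted work.
        if uop_retires:
            return SlotCategory.RETIRING
        return SlotCategory.BAD_SPECULATION
    # No u-op was allocated: find out which side could not keep up.
    if back_end_stalled:
        return SlotCategory.BACK_END_BOUND
    return SlotCategory.FRONT_END_BOUND


print(classify_slot(uop_allocated=False, uop_retires=False,
                    back_end_stalled=True).value)
```

Real hardware does not tag slots one by one; the categories are estimated from aggregate PMU event counts, so this function only mirrors the reasoning.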
The Top-Down Characterization
• The four category fractions sum to 1.0
• Unit is “Percentage of total Pipeline Slots”
• This is the core of the new Top-Down characterization
• Each category is further broken down depending on available events
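A small sketch of why the fractions sum to 1.0: with 4 pipeline slots per cycle, every slot falls into exactly one category, so dividing each category's slot count by the total gives fractions of a whole. The slot counts and the function name below are hypothetical.

```python
SLOTS_PER_CYCLE = 4  # from the deck: 4 pipeline slots are available every cycle


def slot_fractions(cycles: int, slots_by_category: dict[str, int]) -> dict[str, float]:
    """Turn per-category slot counts into fractions of all pipeline slots."""
    total_slots = SLOTS_PER_CYCLE * cycles
    if total_slots <= 0:
        raise ValueError("cycles must be positive")
    if sum(slots_by_category.values()) != total_slots:
        raise ValueError("every slot must belong to exactly one category")
    return {name: count / total_slots for name, count in slots_by_category.items()}


# Hypothetical counts for a 1000-cycle window (4000 slots in total).
fractions = slot_fractions(1000, {
    "Retiring": 1600,
    "Bad Speculation": 400,
    "Front-End": 800,
    "Back-End": 1200,
})
for name, value in fractions.items():
    print(f"{name}: {value:.0%} of pipeline slots")
```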
Issues breakdown
• Front-End
  - Latency: ITLB Overhead, ICache Misses, DSB Switches, Branch Resteers
  - Bandwidth: DSB, MITE
• Back-End
  - Memory Bound: L1, L2, L3, DRAM (Local or Remote), Store Bound
  - Core Bound: DIV Active, Port Utilization (0 .. 3 ports)
• Retiring: General, Microcode Sequencer
• Bad Speculation: Branch Mispredict, Machine Clears
Examples of Metrics (Xeon Phi™)
Problem Area: L1 Cache Usage
• Significantly affects data access latency and therefore application performance
• Tuning Suggestions:
  - Software prefetching
  - Tile/block data access for cache size
  - Use streaming stores
  - If using 4K access stride, may be experiencing conflict misses
  - Examine compiler prefetching (compiler-generated L1 prefetches should not miss)

| Metric | Formula | Investigate if |
|---|---|---|
| L1 Misses | DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1 | |
| L1 Hit Rate | (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE | < 95% |
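As a minimal sketch, the two L1 formulas can be evaluated from raw event counts like this. The parameter names mirror the event names in the table; the helper names and the sample counts are hypothetical.

```python
def l1_misses(data_read_miss_or_write_miss: int,
              l1_data_hit_inflight_pf1: int) -> int:
    """L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1."""
    return data_read_miss_or_write_miss + l1_data_hit_inflight_pf1


def l1_hit_rate(data_read_or_write: int, misses: int) -> float:
    """L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE."""
    if data_read_or_write <= 0:
        raise ValueError("data_read_or_write must be positive")
    return (data_read_or_write - misses) / data_read_or_write


def should_investigate_l1(hit_rate: float, threshold: float = 0.95) -> bool:
    """The table flags hit rates below 95%."""
    return hit_rate < threshold


# Hypothetical event counts, for illustration only.
misses = l1_misses(data_read_miss_or_write_miss=60_000,
                   l1_data_hit_inflight_pf1=20_000)
rate = l1_hit_rate(data_read_or_write=1_000_000, misses=misses)
print(f"L1 hit rate: {rate:.1%}, investigate: {should_investigate_l1(rate)}")
```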
Problem Area: Data Access Latency
• Significantly affects application performance
• Tuning Suggestions:
  - Software prefetching
  - Tile/block data access for cache size
  - Use streaming stores
  - Check cache locality: turn off prefetching and use CACHE_FILL events; reduce sharing if needed/possible
  - If using 64K access stride, may be experiencing conflict misses

| Metric | Formula | Investigate if |
|---|---|---|
| Estimated Latency Impact | (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS | > 145 |
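The Estimated Latency Impact formula is simple enough to check by hand; the sketch below evaluates it on hypothetical counts. The function name and the sample values are illustrative assumptions, and only the formula and the 145 threshold come from the table.

```python
def estimated_latency_impact(cpu_clk_unhalted: int, exec_stage_cycles: int,
                             data_read_or_write: int,
                             data_read_or_write_miss: int) -> float:
    """Approximate cycles of stall attributed to each L1 data miss."""
    if data_read_or_write_miss <= 0:
        raise ValueError("data_read_or_write_miss must be positive")
    stalled = cpu_clk_unhalted - exec_stage_cycles - data_read_or_write
    return stalled / data_read_or_write_miss


# Hypothetical event counts, for illustration only.
impact = estimated_latency_impact(
    cpu_clk_unhalted=10_000_000,
    exec_stage_cycles=4_000_000,
    data_read_or_write=2_000_000,
    data_read_or_write_miss=20_000,
)
print(f"Estimated latency impact: {impact:.0f} cycles per miss")
print("Investigate:", impact > 145)
```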
Problem Area: TLB Usage
• Also affects data access latency and therefore application performance
• Tuning Suggestions:
  - Improve cache usage & data access latency
  - If L1 TLB miss/L2 TLB miss is high, try using large pages
  - For loops with multiple streams, try splitting into multiple loops
  - If data access stride is a large power of 2, consider padding between arrays by one 4 KB page

| Metric | Formula | Investigate if |
|---|---|---|
| L1 TLB miss ratio | DATA_PAGE_WALK / DATA_READ_OR_WRITE | > 1% |
| L2 TLB miss ratio | LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE | > 0.1% |
| L1 TLB misses per L2 TLB miss | DATA_PAGE_WALK / LONG_DATA_PAGE_WALK | > 100x |
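A hedged sketch of the three TLB metrics and their thresholds follows. The dictionary keys and function names are hypothetical; the formulas and limits are the ones in the table, and the counts are made up for illustration.

```python
def tlb_metrics(data_page_walk: int, long_data_page_walk: int,
                data_read_or_write: int) -> dict[str, float]:
    """Compute the three TLB ratios from raw event counts."""
    if data_read_or_write <= 0 or long_data_page_walk <= 0:
        raise ValueError("event counts used as divisors must be positive")
    return {
        "l1_tlb_miss_ratio": data_page_walk / data_read_or_write,
        "l2_tlb_miss_ratio": long_data_page_walk / data_read_or_write,
        "l1_misses_per_l2_miss": data_page_walk / long_data_page_walk,
    }


def tlb_flags(metrics: dict[str, float]) -> dict[str, bool]:
    """Apply the 'investigate if' thresholds: > 1%, > 0.1%, > 100x."""
    return {
        "l1_tlb_miss_ratio": metrics["l1_tlb_miss_ratio"] > 0.01,
        "l2_tlb_miss_ratio": metrics["l2_tlb_miss_ratio"] > 0.001,
        "l1_misses_per_l2_miss": metrics["l1_misses_per_l2_miss"] > 100,
    }


# Hypothetical event counts, for illustration only.
metrics = tlb_metrics(data_page_walk=30_000, long_data_page_walk=200,
                      data_read_or_write=2_000_000)
flags = tlb_flags(metrics)
for name in metrics:
    print(f"{name}: {metrics[name]:.4g} (investigate: {flags[name]})")
```

With these sample counts the L1 TLB ratio and the L1-per-L2 ratio are flagged while the L2 TLB ratio is not, which is the pattern where the tuning suggestion about large pages applies.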
Problem Area: VPU Usage
• Indicates whether an application is vectorized successfully and efficiently
• Tuning Suggestions:
  - Use the compiler vectorization report!
  - For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma simd (if safe!)
  - Align data and tell the compiler!
  - Re-structure code if possible: array notations, AOS -> SOA

| Metric | Formula | Investigate if |
|---|---|---|
| Vectorization Intensity | VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED | < 8 (DP), < 16 (SP) |
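Vectorization intensity is the average number of active vector elements per VPU instruction. The sketch below computes it and applies the two thresholds from the table; the function names, the precision labels, and the sample counts are assumptions for illustration.

```python
# Thresholds from the table: flag values below 8 for double precision
# and below 16 for single precision.
THRESHOLDS = {"DP": 8, "SP": 16}


def vectorization_intensity(vpu_elements_active: int,
                            vpu_instructions_executed: int) -> float:
    """VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED."""
    if vpu_instructions_executed <= 0:
        raise ValueError("vpu_instructions_executed must be positive")
    return vpu_elements_active / vpu_instructions_executed


def should_investigate_vpu(intensity: float, precision: str) -> bool:
    """True when fewer lanes are active on average than the threshold."""
    return intensity < THRESHOLDS[precision]


# Hypothetical event counts, for illustration only.
intensity = vectorization_intensity(vpu_elements_active=12_000_000,
                                    vpu_instructions_executed=1_000_000)
print("Intensity:", intensity)
print("Investigate as DP code:", should_investigate_vpu(intensity, "DP"))
print("Investigate as SP code:", should_investigate_vpu(intensity, "SP"))
```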
Problem Area: Memory Bandwidth
• Can increase data latency in the system or become a performance bottleneck
• Tuning Suggestions:
  - Improve locality in caches
  - Use streaming stores
  - Improve software prefetching

| Metric | Formula | Investigate if |
|---|---|---|
| Memory Bandwidth | (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE + UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) * 64 / time | < 80 GB/sec (practical peak 140 GB/sec, with 8 memory controllers) |
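The bandwidth formula multiplies the number of channel reads and writes by 64 bytes and divides by elapsed time. The sketch below assumes the four counts have already been summed over all memory controllers, that time is in seconds, and that 1 GB means 10^9 bytes; those assumptions, the function name, and the sample counts are not from the slides.

```python
BYTES_PER_ACCESS = 64  # multiplier used in the table's formula


def memory_bandwidth_gb_per_sec(ch0_read: int, ch0_write: int,
                                ch1_read: int, ch1_write: int,
                                seconds: float) -> float:
    """(sum of UNC_F_CH{0,1}_NORMAL_{READ,WRITE}) * 64 / time, in GB/sec."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    accesses = ch0_read + ch0_write + ch1_read + ch1_write
    # Assumption: 1 GB = 1e9 bytes.
    return accesses * BYTES_PER_ACCESS / seconds / 1e9


# Hypothetical counts, assumed already summed over all memory controllers.
bandwidth = memory_bandwidth_gb_per_sec(
    ch0_read=500_000_000, ch0_write=250_000_000,
    ch1_read=500_000_000, ch1_write=250_000_000,
    seconds=2.0,
)
print(f"Memory bandwidth: {bandwidth:.1f} GB/sec")
print("Below the 80 GB/sec mark:", bandwidth < 80)
```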
VTune™ Amplifier XE
DEMO
Running the General Exploration Collector
1. Click the “New Analysis” button
2. Select “General Exploration” for your CPU architecture
3. Click “Start” to begin profiling
General Exploration Summary
VTune™ Amplifier XE visualizes performance
VTune™ Amplifier XE visualizes performance
Toolbar: Instructions, Navigator, New Project, Open Project, Properties, New Result, Open Result, Compare
VTune™ Amplifier XE visualizes performance
Project Navigator
VTune™ Amplifier XE visualizes performance
Result Display Tabs
VTune™ Amplifier XE visualizes performance
Result Analysis Type
VTune™ Amplifier XE visualizes performance
Result Viewpoint
VTune™ Amplifier XE visualizes performance
Viewpoint Alternates
VTune™ Amplifier XE visualizes performance
Result Components
VTune™ Amplifier XE visualizes performance
Grid Pane
VTune™ Amplifier XE visualizes performance
Grid Pane
Grouping pull-down
VTune™ Amplifier XE visualizes performance
Stack Pane
VTune™ Amplifier XE visualizes performance
Timeline
VTune™ Amplifier XE visualizes performance
Filter/Options Bar
VTune™ Amplifier XE visualizes performance
Source View / Per-line localization
VTune™ Amplifier XE visualizes performance
Source View / View / Hot spot navigation controls
VTune™ Amplifier XE visualizes performance
Assembly View / View / Hot spot navigation controls
VTune™ Amplifier XE visualizes performance
Assembly View / Assembly groupings
For event collection, the coprocessor is treated as a special HW architecture
Project properties provide the means to invoke data collection by target type
Launch Application serves many uses, from host/offload to native execution
Search directories have been reorganized to speed up symbol resolution during finalization
Notable coprocessor library paths:
/opt/mpss/3.2/sysroots/k1om-mpss-Linux/boot
/opt/mpss/3.2/sysroots/k1om-mpss-Linux/lib64
/opt/intel/composerxe/lib/mic
/opt/intel/composerxe/tbb/lib/mic
/opt/intel/composerxe/mkl/lib/mic
/opt/intel/mpi-rt/4.1.3/mic
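Before entering these paths as search directories, it can help to check which of them actually exist on the host, since version numbers such as 3.2 or 4.1.3 vary between installations. A minimal sketch, using only the paths listed on this slide; the helper name `existing_search_dirs` is hypothetical and not part of any Intel tool.

```python
import os

# Paths copied from the slide; versioned components (3.2, 4.1.3) may
# differ on a given installation.
COPROCESSOR_LIB_PATHS = [
    "/opt/mpss/3.2/sysroots/k1om-mpss-Linux/boot",
    "/opt/mpss/3.2/sysroots/k1om-mpss-Linux/lib64",
    "/opt/intel/composerxe/lib/mic",
    "/opt/intel/composerxe/tbb/lib/mic",
    "/opt/intel/composerxe/mkl/lib/mic",
    "/opt/intel/mpi-rt/4.1.3/mic",
]


def existing_search_dirs(paths, is_dir=os.path.isdir):
    """Return the paths that exist as directories, in order, without duplicates."""
    seen = set()
    found = []
    for path in paths:
        if path not in seen and is_dir(path):
            seen.add(path)
            found.append(path)
    return found


if __name__ == "__main__":
    for directory in existing_search_dirs(COPROCESSOR_LIB_PATHS):
        print(directory)
```

The `is_dir` parameter exists only so the check can be replaced in tests; by default the real filesystem is consulted.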
General Exploration runs a set of events to drive top-down analysis
For more information on Intel® Xeon Phi™ and VTune™ Amplifier XE
Optimization on the coprocessor: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization
http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
Coprocessor Performance Monitoring Unit: http://software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf
For general information: http://software.intel.com/mic-developer
Grid is Based on Top-Down
Use the Hover Text to Understand Metrics*
*Suggestions welcome: Submit issues if the text isn’t helpful
Event collections on the coprocessor can generate large volumes of data
Example: dgemm on 60+ cores
Tip: Use cpu-mask to reduce the data set while maintaining the same accuracy.
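To illustrate the tip, the sketch below builds a compact mask string for a subset of cores. It rests on several assumptions that are not stated on the slide: that the collector accepts a comma-separated list of logical CPU ids and ranges (for example `1-4,9-12`), that each coprocessor core exposes 4 hardware threads, and that logical CPUs are numbered contiguously per core starting at 1. Both helper names are hypothetical; confirm the real numbering on the target system before using the result.

```python
def format_cpu_mask(cpus):
    """Collapse logical CPU ids into a compact string such as '1-4,9'."""
    ids = sorted(set(cpus))
    if not ids:
        raise ValueError("at least one CPU id is required")
    if ids[0] < 0:
        raise ValueError("CPU ids must be non-negative")

    parts = []
    start = prev = ids[0]
    for cpu in ids[1:]:
        if cpu == prev + 1:
            # Still inside the current contiguous run.
            prev = cpu
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = cpu
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def cpus_for_cores(cores, threads_per_core=4, first_cpu=1):
    """Logical CPU ids for the given cores, assuming contiguous numbering."""
    return [first_cpu + core * threads_per_core + thread
            for core in cores
            for thread in range(threads_per_core)]


if __name__ == "__main__":
    # Sample every eighth core of a hypothetical 60-core coprocessor.
    sampled_cores = range(0, 60, 8)
    print(format_cpu_mask(cpus_for_cores(sampled_cores)))
```

Sampling a regular subset of cores keeps the result small when the workload is evenly distributed, as with dgemm; for an unbalanced workload the sampled cores may not be representative.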
Resources
Top-Down Characterization White Paper
http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues
Tuning Guides
http://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers