PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers
K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi
Kyushu University, ISIT, IST
Background
• “Peta” is tremendous!
  – Compared with “Giga or Tera” scale machines
How are you, Mr. Tera?
I am fine! How about you, Mr. Peta?
• To develop a “Peta-Scale” supercomputer, we must:
  – Explore the design space of both the computation nodes and the interconnection network!
  – Verify the effective performance that can be achieved!
• So, we need a performance evaluation environment for peta-scale supercomputers!
Our Goal!
• Problem…
  – Simulations are 3 orders of magnitude slower than real machines!
  – “Peta-scale” is 3 orders of magnitude larger than “Tera-scale” (i.e. currently available machines)!
  – How can we bridge the gap?
• Develop an efficient performance evaluation environment: PSI-SIM
  – Separate compute-node simulation from network simulation!
  – Abstract the target application program to accelerate simulation!
Performance-Evaluation Flow of PSI-SIM
[Figure: flow diagram with the blocks Parallelized Application (e.g. Peta-scale), BSIM-Parser, Skeleton Code, BSIM-Logger, Comm. Profile (w/o Latency), NSIM, Comm. Profile (w/ Latency), and ANA; further labels: DB for Processors, Interconnect Arch., Interconnect Configuration, Target machine, Performance Info., Visualization, Hints for Optimization]
Step 1: Generate a skeleton code
Step 2: Execute on an existing machine
Step 3: Simulate the interconnection network
Step 4: Visualize and analyze the results
What is the Skeleton Code?
Original code
foo( ) {
    Inst. Block A
    for (i=0; i<n; i++) {
        Inst. Block B
        if (hoge) {
            Inst. Block C
        } else {
            Inst. Block D
        }
        Inst. Block E
    }
    MPI_Comm.
    Inst. Block F
    for (j=0; j<n; j++)
        for (k=0; k<n; k++)
            Func( );
}
Skeleton code
foo( ) {
    BSIM_ADD_TIME(10ms)
    MPI_Comm.
    BSIM_ADD_TIME(1ms)
    BSIM_ADD_TIME(15s)
}
• Computation blocks are replaced by “Estimated” execution times!
• Other modifications (e.g. reducing required memory size)
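The skeleton idea can be sketched in plain C. This is only an illustration under assumptions: `bsim_add_time()` is a hypothetical stand-in that takes seconds as a `double` (the slide does not give the real `BSIM_ADD_TIME` signature or units), and `fake_comm()` replaces the MPI call so the sketch compiles without MPI.

```c
#include <assert.h>

/* Hypothetical per-process virtual clock, in seconds (not the real BSIM API). */
static double virtual_clock = 0.0;

/* Stand-in for BSIM_ADD_TIME: advance the virtual clock instead of computing. */
static void bsim_add_time(double seconds) {
    assert(seconds >= 0.0);  /* virtual time never runs backward */
    virtual_clock += seconds;
}

/* Stand-in for the MPI call that the skeleton keeps; here it only counts calls. */
static int comm_calls = 0;
static void fake_comm(void) {
    comm_calls++;
}

/* Skeleton of foo(): computation blocks are replaced by estimated times. */
static void foo_skeleton(void) {
    bsim_add_time(0.010);  /* 10 ms estimate, as on the slide */
    fake_comm();           /* the communication call stays in the skeleton */
    bsim_add_time(0.001);  /* 1 ms estimate */
    bsim_add_time(15.0);   /* 15 s estimate */
}
```

Calling `foo_skeleton()` returns almost immediately on the host, while the virtual clock advances by the sum of the estimates (about 15.011 s).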
Generating Communication Profile
• BSIM-Logger
  – Executes the skeleton code on an existing machine
  – Emulates the behavior of the target machine
  – Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why Fast?
  – Abstracted computation blocks are NOT executed (only virtual timers are updated)
  – Real communications are masked, but accurate logs are still generated
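As a rough illustration of the logging idea, the C sketch below records each masked send as an event stamped with the sender's virtual time. The record layout and all names (`CommEvent`, `Logger`, `logger_send`) are assumptions made for this example, not BSIM-Logger's actual format or API.

```c
#include <assert.h>
#include <stddef.h>

/* One record of the communication profile (hypothetical layout). */
typedef struct {
    double time_s;  /* virtual time at which the send was issued */
    int    src;     /* sending rank */
    int    dst;     /* receiving rank */
    size_t bytes;   /* message size */
} CommEvent;

#define MAX_EVENTS 1024

typedef struct {
    double    clock_s;             /* virtual clock of this process */
    int       rank;
    CommEvent events[MAX_EVENTS];
    size_t    n_events;
} Logger;

static void logger_init(Logger *lg, int rank) {
    lg->clock_s = 0.0;
    lg->rank = rank;
    lg->n_events = 0;
}

/* Abstracted computation block: nothing is executed, only the timer advances. */
static void logger_add_time(Logger *lg, double seconds) {
    assert(seconds >= 0.0);
    lg->clock_s += seconds;
}

/* Masked send: no data moves; the event is logged at the current virtual time.
 * Under the zero-latency assumption the clock does not advance here. */
static void logger_send(Logger *lg, int dst, size_t bytes) {
    assert(lg->n_events < MAX_EVENTS);
    CommEvent ev = { lg->clock_s, lg->rank, dst, bytes };
    lg->events[lg->n_events++] = ev;
}
```

Because the send costs no virtual time, the resulting profile contains only computation-driven timestamps; network delay is left for a later simulation step to add.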
How Fast? How Accurate?
[Charts: for ERI (Electron Repulsion Integral) and NAS Parallel FT, bar charts compare the Original and Skeleton codes on two metrics: time for logging (s) and predicted execution time (s)]
Fast, Flexible Interconnection Network Simulator
• NSIM
  – Takes the communication profile and a network configuration file as input
  – Generates a communication profile with estimated interconnect latency
• Why Fast? Why Flexible?
  – Parallelized implementation
  – Supports a number of parameters
    • Topology, specifications of routers/switches, buffer size, and so on
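The slides do not describe NSIM's internal model, so the following C sketch is purely illustrative of how a configuration could turn one logged message into a latency estimate. It assumes a 2D mesh with dimension-order routing and no contention; the `NetConfig` fields and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical network configuration (stand-in for NSIM's configuration file). */
typedef struct {
    int    width;          /* nodes per row of a 2D mesh */
    int    height;         /* number of rows */
    double hop_latency_s;  /* delay per router hop, in seconds */
    double bandwidth_Bps;  /* link bandwidth, in bytes per second */
} NetConfig;

/* Hops between two nodes on a 2D mesh with dimension-order routing. */
static int mesh_hops(const NetConfig *cfg, int src, int dst) {
    assert(src >= 0 && src < cfg->width * cfg->height);
    assert(dst >= 0 && dst < cfg->width * cfg->height);
    int dx = abs(src % cfg->width - dst % cfg->width);
    int dy = abs(src / cfg->width - dst / cfg->width);
    return dx + dy;
}

/* Contention-free estimate: routing delay plus serialization time. */
static double estimate_latency(const NetConfig *cfg, int src, int dst, double bytes) {
    return mesh_hops(cfg, src, dst) * cfg->hop_latency_s
         + bytes / cfg->bandwidth_Bps;
}
```

A real interconnect simulator would also have to model buffer occupancy and contention between messages, which is presumably why buffer size appears among the parameters; this sketch omits both.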
Performance of BSIM + NSIM
• Performance prediction for HPL execution on a 16-node PC cluster
• <120 s (problem size = 5,000) @ 8 CPUs
• About 9,000 MPI-Comm./s @ 8 CPUs
[Chart: execution time (s), Measured vs. Predicted; Error = 5.3%; annotation: “Not skeleton execution”]
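The slide does not define “Error”. Assuming it is the relative difference between predicted and measured execution time, it would be computed as below, and the 94.7% accuracy quoted in the backup slides would be its complement. The backup slides also report signed errors, so the sign convention there may differ from this sketch.

```latex
\[
\text{Error} = \frac{\lvert T_{\text{predicted}} - T_{\text{measured}} \rvert}{T_{\text{measured}}} \times 100\%,
\qquad
\text{Accuracy} = 100\% - \text{Error} = 100\% - 5.3\% = 94.7\%.
\]
```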
ANA GroupWork Viewer
• Group Work
  – Indicates load balance
• Performance Indicator
  – Execution time after load-balance optimization
• Communication Indicator
  – Amount of communication per second
Conclusions
• PSI-SIM
  – Performance evaluation environment for supercomputers
  – BSIM + NSIM + ANA
• Ongoing work: performance prediction for
  – A “Tera-Scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
  – A “Peta-Scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine
Backup Slides
Peta-scale Performance Prediction
• Assumption
  – HPL problem size: 3 million
  – # of nodes: 4K (PSI-SIMD)
  – BSIM: uses 32 CPUs (3 GHz Xeon)
  – NSIM: 10,000 MPI-Comm./s @ 8 CPUs
• How long will it take?
  – BSIM: about 300 h (<2 weeks)
  – NSIM: about ?? (still being estimated…)
Predicted Execution Time (FT)
Error: -11.6%
Error: -11.3%
Target machine: rscc; Used machine: rscc
Communication Profile Time (FT)
86% reduction
19% reduction
Target machine: rscc; Used machine: rscc
Predicted Execution Time (ERI)
Error: -0.2%
Error: 1.5%
Error: -0.6%
Target machine: rscc; Used machine: rscc
Communication Profile Generation Time (ERI)
91% reduction
96% reduction
97% reduction
Target machine: rscc; Used machine: rscc
Prediction Performance for Execution Time
[Chart label: communication latency]
Larger evaluated application ⇒ higher prediction accuracy
Prediction accuracy: 94.7%
Simulation Time (problem size fixed at 2000)
More processes in the evaluated application ⇒ higher parallel-processing efficiency
Recent Results (Speedup)
[Chart: time in minutes for 16 processes, 256 processes, and 1,024 processes]
Performance of NSIM
Accuracy : 94.7%
7.92,8.36,8.04
114 s
Target machine: PSI-hexa; Used machine: PSI-hexa