Optimization of a parallel 3d-FFT with non-blocking collective operations
Torsten Hoefler, Gilles Zérah
Chair of Computer Architecture, Technical University of Chemnitz
Département de Physique Théorique et Appliquée, Commissariat à l’Énergie Atomique/DAM
3rd International ABINIT Developer Workshop, Liège, Belgium
29th January 2007
Torsten Hoefler, Gilles Zérah Overlap in 3d-FFTs
1 Introduction
  Short introduction to non-blocking collectives
2 Three dimensional FFTs
  Traditional parallel 3d-FFT
  Parallel 3d-FFT with maximum overlap
  Parallel cache optimized 3d-FFT with partial overlap
3 Implementation in ABINIT
  Avoidance of the transformation of zeroes
  Autotuning of parameters
  Preliminary Performance Results
Non-blocking Collective Operations
Advantages - Overlap
  leverage hardware parallelism (e.g. InfiniBand™)
  overlap similar to non-blocking point-to-point
Usage?
  extension to MPI-2
  “mixture” between non-blocking ptp and collectives
  uses MPI_Requests and MPI_Test/MPI_Wait
  MPI_Ibcast(buf1, p, MPI_INT, 0, comm, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
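The overlap argument can be made concrete with a very small cost model. The sketch below is an idealization, not a measurement: it assumes communication can proceed fully in the background, and the function names and the `overhead` term are hypothetical, introduced only for illustration.

```c
#include <assert.h>

/* Idealized cost model (an assumption for illustration only):
 * a blocking collective serializes computation and communication,
 * a non-blocking one can hide the shorter phase behind the longer one. */

/* Time with a blocking collective: compute first, then communicate. */
double blocking_time(double compute, double comm) {
    assert(compute >= 0.0 && comm >= 0.0);
    return compute + comm;
}

/* Time with a perfectly overlapped non-blocking collective.
 * `overhead` models the CPU cost of starting and completing the operation. */
double overlapped_time(double compute, double comm, double overhead) {
    assert(compute >= 0.0 && comm >= 0.0 && overhead >= 0.0);
    double longer = compute > comm ? compute : comm;
    return overhead + longer;
}
```

In this model the gain is largest when computation and communication take about equally long, and it shrinks as the start/completion overhead grows.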
Availability
  Prototype LibNBC: requires ANSI-C and MPI-2
  LibNBC download and documentation: http://www.unixer.de/NBC
Documentation
  T. Hoefler, J. Squyres, W. Rehm, and A. Lumsdaine: A Case for Non-Blocking Collective Operations. In Frontiers of High Performance Computing and Networking, pages 155-164, Springer Berlin, ISBN: 978-3-540-49860-5, Dec. 2006
  T. Hoefler, J. Squyres, G. Bosilca, G. Fagg, A. Lumsdaine, and W. Rehm: Non-Blocking Collective Operations for MPI-2. Open Systems Lab, Indiana University. Presented in Bloomington, IN, USA, Aug. 2006
Performance Benefits?
[Figure: MPI_Alltoall latency on the “tantale” cluster@CEA. x-axis: data size (0 to 60000); y-axis: time in microseconds (0 to 12000); curves: MPI latency, NBC latency (max. overlap).]
Domain Decomposition
Discretized 3D Domain (FFT-Box)
[Figure: FFT box with axes x, y, z.]
Domain Decomposition
Distributed 3d Domain
[Figure: FFT box with axes x, y, z, divided into parts labelled 0, 1, 2.]
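A slab distribution like the one in the figure can be sketched as a small index computation. This is a minimal illustration assuming the box is cut into contiguous slabs along one axis and that leftover planes go to the lowest ranks; `slab_t` and `slab_for_rank` are hypothetical names, not ABINIT routines.

```c
#include <assert.h>

/* Contiguous slab owned by one process along the distributed axis. */
typedef struct {
    int first;  /* index of the first plane owned by this rank */
    int count;  /* number of planes owned by this rank */
} slab_t;

/* Split n planes over p ranks as evenly as possible; the first (n % p)
 * ranks receive one extra plane. Hypothetical helper for illustration. */
slab_t slab_for_rank(int n, int p, int rank) {
    assert(n >= 0 && p > 0 && rank >= 0 && rank < p);
    int base = n / p;
    int extra = n % p;
    slab_t s;
    s.count = base + (rank < extra ? 1 : 0);
    s.first = rank * base + (rank < extra ? rank : extra);
    return s;
}
```

For the 3x3x3 example grid and three processes, each rank owns exactly one plane; for 128 planes on 16 processes, each rank owns 8.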
1D Transformation
1D Transformation in y Direction
[Figure: FFT box with axes x, y, z.]
1D Transformation
1D Transformation in z Direction
[Figure: FFT box with axes x, y, z.]
1D Transformation
1D Transformation in x Direction
[Figure: FFT box with axes x, y, z.]
Non-blocking 3D-FFT
Derivation from “normal” implementation
  distribution identical to “normal” 3D-FFT
  first FFT in y direction and local data transpose
Design Goals to Minimize Communication Overhead
  start communication as early as possible
  achieve maximum overlap time
Solution
  start NBC_Ialltoall as soon as first xz-plane is ready
  calculate next xz-plane
  start next communication accordingly ...
  collect multiple xz-planes (A2A data size)
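The order of operations in this scheme can be sketched without any MPI calls by recording events in a trace. The code below is only a model of the control flow under stated assumptions: planes are grouped into tiles of `tile` planes, each full tile (or the final partial one) triggers one all-to-all, and the x transform of a tile follows the wait for its all-to-all. All type and function names are hypothetical; a real implementation would call NBC_Ialltoall and NBC_Wait where the trace records EV_START_A2A and EV_WAIT_A2A.

```c
#include <assert.h>

#define MAX_EVENTS 256

typedef enum { EV_TRANSFORM_Z, EV_START_A2A, EV_WAIT_A2A, EV_TRANSFORM_X } event_kind;

typedef struct { event_kind kind; int id; } event;

typedef struct { event ev[MAX_EVENTS]; int n; } trace;

/* Append one event to the trace. */
static void record(trace *t, event_kind kind, int id) {
    assert(t->n < MAX_EVENTS);
    t->ev[t->n].kind = kind;
    t->ev[t->n].id = id;
    t->n++;
}

/* Records the event order of the pipelined scheme and returns the
 * number of all-to-all operations that were started. */
int pipelined_fft_schedule(int nplanes, int tile, trace *t) {
    assert(nplanes > 0 && tile > 0);
    int started = 0;
    t->n = 0;
    /* Phase 1: z transforms; hand each completed tile to the network. */
    for (int plane = 0; plane < nplanes; plane++) {
        record(t, EV_TRANSFORM_Z, plane);
        int tile_full = ((plane + 1) % tile == 0);
        int last_plane = (plane == nplanes - 1);
        if (tile_full || last_plane) {
            record(t, EV_START_A2A, started);
            started++;
        }
    }
    /* Phase 2: complete the all-to-alls in start order, then transform in x. */
    for (int a2a = 0; a2a < started; a2a++) {
        record(t, EV_WAIT_A2A, a2a);
        record(t, EV_TRANSFORM_X, a2a);
    }
    return started;
}
```

With three planes and a tile of one plane, the trace reads: transform z, start, transform z, start, transform z, start, then wait and transform x three times.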
Transformation in z Direction
Data already transformed in y direction
[Figure: 3x3x3 grid with axes x, y, z; 1 block = 1 double value.]
Transformation in z Direction
Transform first xz plane in z direction in parallel
[Figure: 3x3x3 grid with axes x, y, z; the pattern marks data that was transformed in y and z direction.]
Transformation in z Direction
start NBC_Ialltoall of first xz plane and transform second plane
[Figure: 3x3x3 grid with axes x, y, z; cyan color marks data that is communicated in the background.]
Transformation in z Direction
start NBC_Ialltoall of second xz plane and transform third plane
[Figure: 3x3x3 grid with axes x, y, z; data of two planes is not accessible due to communication.]
Transformation in z Direction
start communication of the third plane and ...
[Figure: 3x3x3 grid with axes x, y, z.]
we need the first xz plane to go on ...
Transformation in x Direction
... so NBC_Wait for the first NBC_Ialltoall!
[Figure: 3x3x3 grid with axes x, y, z.]
and transform first plane (pattern means xyz transformed)
Transformation in x Direction
Wait and transform second xz plane
[Figure: 3x3x3 grid with axes x, y, z.]
first plane’s data could be accessed
Transformation in x Direction
wait and transform last xz plane
[Figure: 3x3x3 grid with axes x, y, z.]
done! → 1 complete 1D-FFT overlaps a communication
Parameter and Problems
Tile factor
  # of z-planes to gather before NBC_Ialltoall is started
  very performance critical!
  not easily predictable
Window size and MPI_Test interval
  Window size = number of outstanding communications
  not very performance critical → fine-tuning
  MPI_Test progresses internal state of MPI
  unnecessary in threaded NBC implementation (future)
Problems?
  NOT cache friendly :-(
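The window size parameter can be illustrated with a small simulation of how many operations are in flight at once. This is a sketch under one assumption not stated on the slide: when the window is full, the oldest outstanding operation is completed before the next one is started. The names `window_stats` and `simulate_window` are hypothetical, and the periodic MPI_Test calls that drive progress in a real run are not modelled.

```c
#include <assert.h>

typedef struct {
    int max_in_flight;  /* largest number of simultaneously outstanding operations */
    int early_waits;    /* waits forced because the window was full */
} window_stats;

/* Simulate starting total_ops communications with at most `window`
 * of them outstanding at any time. */
window_stats simulate_window(int total_ops, int window) {
    assert(total_ops >= 0 && window > 0);
    window_stats s = {0, 0};
    int in_flight = 0;
    for (int op = 0; op < total_ops; op++) {
        if (in_flight == window) {
            /* Window full: complete the oldest operation first. */
            in_flight--;
            s.early_waits++;
        }
        in_flight++;  /* start operation `op` */
        if (in_flight > s.max_in_flight) {
            s.max_in_flight = in_flight;
        }
    }
    /* The operations still in flight here are completed afterwards. */
    return s;
}
```

A larger window removes forced waits but keeps more buffers tied up in communication, which fits the slide's description of the parameter as a fine-tuning knob.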
3D-FFT Benchmark Results (small input)
[Figure, left panel: speedup vs. nodes (0 to 35); curves: ideal, NBC, MPI.]
[Figure, right panel: communication overhead (%) vs. nodes (0 to 35); curves: NBC, MPI.]
“tantale”@CEA: 128 2 GHz Quad Opteron 844 nodes
Interconnect: InfiniBand™
System size 128x128x128 (1 CPU ≈ 0.75 s)
Cache optimal implementation
cache optimality by yz transforming plane by plane (in cache)!
[Figure: FFT box with axes x, y, z.]
→ we need all yz-planes before we can start x transform :-(
Applying Non-blocking collectives
Pipelined communication
  retain plane-by-plane transform
  simple pipelined scheme
  start A2A of plane as soon as it is transformed
  wait for all before x transform
  A2A overlapped with computation of remaining planes
  last A2A blocks (immediate wait :-( )
Issues
  less overlap potential
  plane coalescing to adjust datasize
  new parameter: “pipeline depth” (# of A2As)
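The effect of plane coalescing on the number and size of messages is simple arithmetic, sketched below. The helper names are hypothetical, and "pipeline depth" is read here as the number of all-to-all operations per transform, which is an interpretation of the slide rather than a definition taken from the code.

```c
#include <assert.h>

/* Number of all-to-all operations when `coalesce` planes share one message. */
int a2a_count(int nplanes, int coalesce) {
    assert(nplanes >= 0 && coalesce > 0);
    return (nplanes + coalesce - 1) / coalesce;  /* ceiling division */
}

/* Planes carried by message `msg`; the last message may be shorter. */
int planes_in_message(int nplanes, int coalesce, int msg) {
    assert(msg >= 0 && msg < a2a_count(nplanes, coalesce));
    int first = msg * coalesce;
    int remaining = nplanes - first;
    return remaining < coalesce ? remaining : coalesce;
}

/* Messages that can overlap with computation: all but the last one,
 * which is waited for immediately. */
int overlappable_messages(int nplanes, int coalesce) {
    int n = a2a_count(nplanes, coalesce);
    return n > 0 ? n - 1 : 0;
}
```

With 8 planes per process, coalescing 1, 2 or 4 planes gives 8, 4 or 2 all-to-alls, of which 7, 3 or 1 can overlap with the remaining plane transforms. Larger messages therefore come at the cost of less overlap.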
3D-FFT Benchmark Results (small input)
System
  “tantale”@CEA
  2 GHz Quad Opteron
  InfiniBand™
Parameters
  128x128x128
  16 CPUs, 4 nodes
  1 CPU ≈ 28 s
  8 planes/proc
  16kb/plane
[Figure: time (s) vs. coalesced planes (0 = original impl.; values 0, 1, 2, 4); panels: FORW, BACK.]
Avoidance of the transformation of zeroes
ABINIT Implementation
  changed routines back, forw, back_wf and forw_wf
  some minor changes to others (input params ...)
The routines back_wf and forw_wf
  avoid transformation of zeroes
  less computation and less communication
  changed communication (boxcut=2):
    forw_wf: nz/p planes, each has nx/2 · ny/(2 · p) doubles
    back_wf: nz/(2 · p) planes, each has nx/2 · ny/p doubles
New Parameters
  all routines have different # planes → three parameters
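The two plane-count formulas for boxcut=2 can be evaluated directly. The sketch below only restates the formulas from the slide; the struct and function names are hypothetical, and the divisibility assertions are an added simplification so that the integer divisions are exact.

```c
#include <assert.h>

typedef struct {
    long planes;             /* planes communicated per process */
    long doubles_per_plane;  /* doubles in each plane */
} plane_layout;

/* forw_wf: nz/p planes, each with nx/2 * ny/(2*p) doubles. */
plane_layout forw_wf_layout(long nx, long ny, long nz, long p) {
    assert(p > 0 && nz % p == 0 && nx % 2 == 0 && ny % (2 * p) == 0);
    plane_layout l = { nz / p, (nx / 2) * (ny / (2 * p)) };
    return l;
}

/* back_wf: nz/(2*p) planes, each with nx/2 * ny/p doubles. */
plane_layout back_wf_layout(long nx, long ny, long nz, long p) {
    assert(p > 0 && nz % (2 * p) == 0 && nx % 2 == 0 && ny % p == 0);
    plane_layout l = { nz / (2 * p), (nx / 2) * (ny / p) };
    return l;
}
```

For a 128x128x128 box on 16 processes this gives 8 planes of 256 doubles for forw_wf and 4 planes of 512 doubles for back_wf. The total volume is the same, but the number of planes differs, which is why each routine gets its own planes parameter.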
Autotuning of parameters
Three new input parameters
  fftplanes_fourdp, fftplanes_forw_wf, and fftplanes_back_wf
  default = 0 → standard MPI implementation
  performance critical
  complicated to determine by hand
Autotuning
  automatically determine them at runtime
  each planes parameter is benchmarked (after warmup round)
  fastest is chosen automatically
  relatively accurate but problems with jitter
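The autotuning loop described above can be sketched as follows. This is not the ABINIT implementation: the callback type, the function names and the idea of taking the minimum over several repeats to dampen jitter are assumptions made for illustration, and `fake_bench` returns made-up timings purely so the sketch can be exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical timing callback: runs one transform with the given
 * planes setting and returns the elapsed time. */
typedef double (*bench_fn)(int planes, void *ctx);

/* Benchmark every candidate after one warm-up round and return the
 * fastest setting. Taking the minimum over `repeats` runs is an
 * assumption of this sketch to reduce the influence of jitter. */
int autotune_planes(const int *candidates, size_t n, int repeats,
                    bench_fn bench, void *ctx) {
    assert(candidates != NULL && n > 0 && repeats > 0 && bench != NULL);
    int best = candidates[0];
    double best_time = -1.0;
    for (size_t i = 0; i < n; i++) {
        (void)bench(candidates[i], ctx);  /* warm-up round, result discarded */
        double fastest = bench(candidates[i], ctx);
        for (int r = 1; r < repeats; r++) {
            double t = bench(candidates[i], ctx);
            if (t < fastest) {
                fastest = t;
            }
        }
        if (best_time < 0.0 || fastest < best_time) {
            best_time = fastest;
            best = candidates[i];
        }
    }
    return best;
}

/* Stand-in timing function with invented numbers, for demonstration only. */
double fake_bench(int planes, void *ctx) {
    (void)ctx;
    switch (planes) {
    case 0:  return 9.0;  /* 0 = standard MPI implementation */
    case 1:  return 7.0;
    case 2:  return 5.0;
    default: return 6.0;
    }
}
```

Calling `autotune_planes` with the candidates 0, 1, 2 and 4 and `fake_bench` selects 2, the setting with the smallest invented time.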
Microbenchmarks
[Figure: time (s) vs. coalesced planes (0 = original impl.; values 0, 1, 2, 4); panels: FORW, BACK, FORW_WF, BACK_WF.]
ABINIT - Si, 60 bands, 128^3 FFT
[Figure: time (ms) for FOURDP, FORW_WF and BACK_WF with settings 0 and −1; two panels: npband x nfft = 4x8 and npband x nfft = 2x16.]
TOTAL: 41.1 s → 37.1 s; 39.5 s → 23.6 s
Conclusions & Future Work
Conclusions
  applying NBC requires some effort
  NBC can improve parallel efficiency
  cache usage vs. overlap potential
Future Work
  tune FFT further (reduce serial overhead)
  better automatic parameter assessment (?)
  parallel model for 3d-FFT
  use NBC for parallel orthogonalization
  apply NBC at higher level (LOBPCG?)
Discussion
ABINIT patch (soon): http://www.unixer.de/research/abinit/
Thanks to the CEA/DAM for support of this work and you for your attention!