A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping

Maria Athanasaki, Aristidis Sotiropoulos, Georgios Tsoukalas, Nectarios Koziris

National Technical University of Athens, Dept. of Electrical and Computer Engineering

Computing Systems Laboratory

Overview

• Advanced Architectures
• Tiling for parallelization
• Non-overlapping vs. Overlapping scheme
• Vertical vs. hyperplane grouping
• Application on clusters of SMP

TCP/IP over FastEthernet

Use of the popular Socket Interface: create a socket descriptor sd, then read/write from/to the descriptor

• write (send)
• read (receive)

Example: Send

write(sd, buffer, length);

1) system call (CPU): switch from user to kernel mode
2) CPU copies data from user to kernel space
3) CPU adds protocol headers (IP, ETH)
4) CPU programs DMA eng.
5) DMA copies data to NIC
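
The send path above starts with an ordinary write() on a socket descriptor. A minimal sketch of that call in C follows. It assumes a local AF_UNIX socket pair as a stand-in for the TCP connection over FastEthernet, so no IP or ETH headers are involved here; the helper names send_all and demo_roundtrip are hypothetical and not part of the original slides.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write() may accept fewer bytes than requested,
 * so loop until the whole buffer has been handed to the kernel. */
ssize_t send_all(int sd, const char *buffer, size_t length)
{
    size_t done = 0;
    while (done < length) {
        /* Each call is a system call; the kernel copies the data out of
         * user space before any protocol processing or DMA happens. */
        ssize_t n = write(sd, buffer + done, length - done);
        if (n < 0)
            return -1;
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Hypothetical demo: push a short message through a local socket pair
 * (assumption: stands in for a TCP connection) and read it back.
 * Returns the number of bytes received, or -1 on error. */
ssize_t demo_roundtrip(char *out, size_t out_len)
{
    int sv[2];
    const char msg[] = "tile boundary data";
    ssize_t got = -1;

    assert(out != NULL && out_len >= sizeof msg);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (send_all(sv[0], msg, sizeof msg) == (ssize_t)sizeof msg)
        got = read(sv[1], out, out_len);
    close(sv[0]);
    close(sv[1]);
    return got;
}
```

Calling demo_roundtrip with a 64-byte buffer should return the message length including its terminating NUL. The point of the sketch is that every byte passes through the CPU at least once (the user-to-kernel copy) before the hardware sees it.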

SCI

What about the Scalable Coherent Interface?

• Point-to-point, DSM approach

SCI DSM scheme

[Figure: an exported memory segment, the matching imported memory segment, and a "write 100" into the segment]

[Figure: process VM area and Physical Memory]

Contiguous data in process VM are not contiguous in Physical Memory

SCI Zero Copy Scheme

[Figure: the process VM area is mapped to pinned down memory in Physical Memory]

SCICreateSegment, SCIMapLocalSegment: mapping between Virtual and contiguous Physical Memory

SCI Zero Copy Scheme

Data transfers

• Programmed I/O mode: CPU handles data transferring, “lost” CPU cycles
• DMA mode: CPU programs the NIC’s buffers, is not blocked during transfer, and performs useful tasks

SCI DMA approach

• No copying by CPU
• Data already contiguous in PM
• DMA engine copies data to network
• No packetization: done in hardware
• But, init only by kernel

We need VIA

Nested For-Loops

for (i1=l1; i1<=u1; i1++)
  for (i2=l2; i2<=u2; i2++)
    …
      for (in=ln; in<=un; in++)
        Loop Body

Dependence Vectors

for (i1=0; i1<=7; i1++)
  for (i2=0; i2<=7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
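
In the loop above, each iteration reads the values produced one step back along each dimension, so its dependence vectors are (1,0) and (0,1). The following C sketch makes the loop runnable. It assumes one extra boundary row and column initialized to 1 so that iteration (0,0) has something to read; that boundary handling and the names D and run_loop are assumptions of this sketch, not part of the slides.

```c
#include <assert.h>

#define N 8  /* iteration space 0..7 in each dimension, as on the slide */

/* Assumption of this sketch: A carries one extra boundary row and column,
 * and iteration (i1, i2) is stored at A[i1 + 1][i2 + 1]. */
static long A[N + 1][N + 1];

/* Dependence vectors of the loop body: iteration (i1, i2) needs the
 * results of iterations (i1 - 1, i2) and (i1, i2 - 1). */
const int D[2][2] = { {1, 0}, {0, 1} };

long run_loop(void)
{
    int i1, i2;

    for (i1 = 0; i1 <= N; i1++)
        A[i1][0] = 1;                 /* boundary column */
    for (i2 = 0; i2 <= N; i2++)
        A[0][i2] = 1;                 /* boundary row */

    for (i1 = 0; i1 <= N - 1; i1++)
        for (i2 = 0; i2 <= N - 1; i2++)
            A[i1 + 1][i2 + 1] = A[i1][i2 + 1] + A[i1 + 1][i2];

    return A[N][N];                   /* value of the last iteration (7,7) */
}
```

Because both dependence vectors have only non-negative components, any execution order that never runs an iteration before its lower-left neighbours is legal, which is what tiling relies on.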

Tiling

[Figure: tiled iteration space with tiles assigned to Processor 0 and Processor 1]
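
As a rough illustration of tiling, the sketch below groups the iterations of the 8x8 dependence example into rectangular tiles and runs the tiles one after another. The tile sizes T1 and T2, the boundary of ones, and all function names are assumptions made for this sketch; the slides do not fix them, and the assignment of tiles to processors is left out.

```c
#include <assert.h>

#define N  8   /* 8x8 iteration space, as in the dependence example */
#define T1 4   /* hypothetical tile size along i1 */
#define T2 4   /* hypothetical tile size along i2 */

typedef long grid_t[N + 1][N + 1];  /* row 0 and column 0 hold boundary values */

static int min_int(int a, int b) { return a < b ? a : b; }

static void init(grid_t g)
{
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            g[i][j] = 1;
}

/* Original order: one iteration after another. */
static void run_untiled(grid_t g)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            g[i][j] = g[i - 1][j] + g[i][j - 1];
}

/* Tiled order: the two outer loops walk over tiles, the two inner loops
 * walk over the iterations inside one tile. Returns the number of tiles. */
static int run_tiled(grid_t g)
{
    int tiles = 0;
    for (int t1 = 1; t1 <= N; t1 += T1)
        for (int t2 = 1; t2 <= N; t2 += T2) {
            tiles++;
            for (int i = t1; i <= min_int(t1 + T1 - 1, N); i++)
                for (int j = t2; j <= min_int(t2 + T2 - 1, N); j++)
                    g[i][j] = g[i - 1][j] + g[i][j - 1];
        }
    return tiles;
}

/* Returns 1 if the tiled run produced the same grid as the untiled run. */
int tiled_matches_untiled(void)
{
    static grid_t a, b;
    init(a);
    init(b);
    run_untiled(a);
    if (run_tiled(b) != (N / T1) * (N / T2))
        return 0;
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}
```

In a parallel run, each tile becomes the unit of work and only the values on tile boundaries need to be communicated between processors.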

Non-Overlapping Scheme

[Figure: execution timeline of Processor 0, Processor 1 and Processor 2]

Non-Overlapping vs. Overlapping Scheme

Overlapping Scheme

[Figure: execution timeline of Processor 0, Processor 1 and Processor 2]

Generalization to SMPs

Vertical vs. Hyperplane grouping

Example

[Figure: Tile Space and Group Space, with groups of tiles assigned to SMP node0 and SMP node1]

Scheduling vector Π=(1,1)
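
A linear schedule assigns each group the time step given by the dot product of the scheduling vector with the group's coordinates, so with Π=(1,1) the group at (g1, g2) runs at step g1 + g2. The C sketch below shows this for a rectangular two-dimensional group space whose first group sits at (0,0); the function names and the rectangular shape are assumptions of the sketch.

```c
#include <assert.h>

/* Time step at which group (g1, g2) runs under a linear schedule with
 * scheduling vector (pi1, pi2): the dot product of the two. */
int time_step(int pi1, int pi2, int g1, int g2)
{
    return pi1 * g1 + pi2 * g2;
}

/* Number of time steps needed for a rectangular n1 x n2 group space
 * (assumption: coordinates start at 0 in both dimensions). */
int total_steps(int pi1, int pi2, int n1, int n2)
{
    assert(n1 > 0 && n2 > 0 && pi1 >= 0 && pi2 >= 0);
    return time_step(pi1, pi2, n1 - 1, n2 - 1) + 1;
}
```

For example, time_step(1, 1, 2, 3) places group (2,3) at step 5, and all groups on the same anti-diagonal share a step and can run concurrently.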

Non-overlapping vs. Overlapping scheme

Compared with the non-overlapping scheme, the overlapping scheme has:

• Almost half the duration per execution step
• Slightly more steps

Non-overlapping scheme: 9 computation + 8 communication steps

Overlapping scheme: 12 steps
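
The step counts above can be turned into a back-of-the-envelope time estimate. The sketch below assumes a simplified cost model that is not stated on the slides: every computation step costs t_comp, every communication step costs t_comm, and an overlapped step lasts as long as the slower of the two. Function names are hypothetical.

```c
#include <assert.h>

/* Non-overlapping scheme: computation and communication steps alternate,
 * so their durations add up. */
double nonoverlap_time(int comp_steps, int comm_steps,
                       double t_comp, double t_comm)
{
    assert(comp_steps >= 0 && comm_steps >= 0);
    return comp_steps * t_comp + comm_steps * t_comm;
}

/* Overlapping scheme (assumed model): in every step computation and
 * communication proceed together, so a step lasts as long as the slower one. */
double overlap_time(int steps, double t_comp, double t_comm)
{
    double step = t_comp > t_comm ? t_comp : t_comm;
    assert(steps >= 0);
    return steps * step;
}
```

With equal unit costs, the example on the slide gives 9 + 8 = 17 time units for the non-overlapping scheme against 12 for the overlapping one under this model. The gain shrinks as computation and communication costs become more unbalanced.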

Vertical vs. Hyperplane Grouping

• Slower pipeline filling
• Faster execution because of lack of intratile synchronization
• Preferable for Tile Spaces where the mapping direction is comparatively large

Experimental Platform

• Linux SMP (Symmetric Multi-Processors) Cluster
• 8 nodes
• 128MB RAM
• 2 Pentium III 800MHz
• SCI ring (SCI Dolphin’s PCI-SCI D330 cards)

Initial Code

for (i=1; i<=X; i++)
  for (j=1; j<=Y; j++)
    for (k=1; k<=Z; k++) {
      A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
    }
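
To show what a tile height means for this code, the sketch below splits the k dimension into tiles of V iterations and checks that the result does not depend on V. It is only an illustration under several assumptions: func is replaced by a hypothetical sum (the slides leave it unspecified), the sizes X, Y, Z are small placeholders, the index-0 planes act as a boundary of ones, and the distribution of the i and j dimensions across processors is omitted.

```c
#include <assert.h>

#define X 4    /* placeholder sizes, much smaller than the */
#define Y 4    /* iteration spaces used in the experiments  */
#define Z 16

/* Hypothetical stand-in for func(); unsigned arithmetic avoids overflow issues. */
static unsigned long func(unsigned long a, unsigned long b, unsigned long c)
{
    return a + b + c;
}

static unsigned long A[X + 1][Y + 1][Z + 1];

/* Fill everything with ones; the planes with index 0 act as the boundary. */
static void init(void)
{
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                A[i][j][k] = 1;
}

/* Number of tiles along k for tile height V (rounded up). */
int tiles_along_k(int V)
{
    assert(V >= 1);
    return (Z + V - 1) / V;
}

/* Run the loop nest with k split into tiles of height V.
 * Returns the value computed by the last iteration. */
unsigned long run_with_tile_height(int V)
{
    assert(V >= 1);
    init();
    for (int k0 = 1; k0 <= Z; k0 += V) {           /* one tile per pass */
        int k1 = (k0 + V - 1 < Z) ? k0 + V - 1 : Z;
        for (int i = 1; i <= X; i++)
            for (int j = 1; j <= Y; j++)
                for (int k = k0; k <= k1; k++)
                    A[i][j][k] = func(A[i - 1][j][k],
                                      A[i][j - 1][k],
                                      A[i][j][k - 1]);
    }
    return A[X][Y][Z];
}
```

A small V gives many small tiles and more frequent boundary exchanges; a large V gives fewer, larger tiles.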

Experimental results

[Figure: two plots against Tile Height (0 to 35000), one for Iteration Space 16x16x1024K and one for Iteration Space 48x48x512K. Series: Non-overlapping scheme – vertical grouping; Overlapping scheme – vertical grouping; Non-overlapping scheme – hyperplane grouping; Overlapping scheme – hyperplane grouping]

Grouping matrix

$m_1 \cdots m_{i-1}\, m_{i+1} \cdots m_n$ = number of CPUs within an SMP node

Example

[Figure: Tile Space and Group Space, with groups of tiles assigned to SMP node0 and SMP node1]

111GGG HPH

Scheduling vector Π=(1,1)
